Thursday, February 2, 2017

Running Caffe AlexNet training with nvidia-docker on a Minsky server

As mentioned in the previous post, nvidia-docker makes it easy to run deep learning frameworks in a variety of environments without application conflicts between users. This time, we will build Caffe into a docker image on a Minsky server, a ppc64le system equipped with NVIDIA P100 GPUs, and use it to run AlexNet training.

First, create a dockerfile to build a docker image containing CUDA, cuDNN, and IBM's PowerAI toolkit, which includes Caffe. Note that when the destination of a COPY instruction is a directory, the path must end with a trailing "/".


root@minsky:/data/mydocker# vi dockerfile.caffe
FROM bsyu/p2p:ppc64le-xenial

# RUN executes a shell command
# You can chain multiple commands together with &&
# A \ is used to split long lines to help with readability
# This particular instruction installs the source files
# for deviceQuery by installing the CUDA samples via apt

RUN apt-get update && apt-get install -y cuda

RUN mkdir /tmp/temp
COPY libcudnn5* /tmp/temp/
COPY cuda-repo-* /tmp/temp/
COPY mldl-repo-local_1-3ibm5_ppc64el.deb /tmp/temp/

RUN dpkg -i /tmp/temp/cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el.deb && \
    dpkg -i /tmp/temp/libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb && \
    dpkg -i /tmp/temp/libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb && \
    dpkg -i /tmp/temp/mldl-repo-local_1-3ibm5_ppc64el.deb && \
    rm -rf /tmp/temp && \
    apt-get update && apt-get install -y caffe-nv libnccl1 && \
    rm -rf /var/lib/apt/lists/*

# set the working directory
WORKDIR /opt/DL/caffe-nv/bin
ENV LD_LIBRARY_PATH="/opt/DL/nccl/lib:/opt/DL/openblas/lib:/usr/local/cuda-8.0/lib64:/usr/lib:/usr/local/lib"

# CMD defines the default command to be run in the container
# CMD is overridden by supplying a command + arguments to
# `docker run`, e.g. `nvcc --version` or `bash`
CMD ./caffe


Now start the build with this dockerfile. Naturally, the relevant .deb files must have been copied into the current host directory beforehand.
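
Before kicking off the build, it is worth confirming that every package the COPY instructions reference actually exists in the build context, since a missing file fails the build midway. A minimal sketch (the `check_build_context` helper and the exact file patterns are illustrative; match them to your downloaded packages):

```shell
#!/bin/sh
# Verify that every package referenced by the dockerfile's COPY
# instructions is present in the build context directory.
check_build_context() {
    dir="$1"; shift
    missing=0
    for pattern in "$@"; do
        # Expand the glob so version wildcards like libcudnn5* work;
        # an unmatched glob stays literal and fails the -e test.
        set -- "$dir"/$pattern
        if [ ! -e "$1" ]; then
            echo "missing: $pattern"
            missing=1
        fi
    done
    return $missing
}

# Check the current directory for the files the COPY lines expect:
check_build_context . 'libcudnn5*' 'cuda-repo-*' 'mldl-repo-local_1-3ibm5_ppc64el.deb' \
    && echo "build context OK" || echo "build context incomplete"
```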

root@minsky:/data/mydocker# docker build -t bsyu/caffe:ppc64le-xenial -f dockerfile.caffe .
Sending build context to Docker daemon 1.664 GB
Step 1 : FROM bsyu/p2p:ppc64le-xenial
 ---> 2fe1b4ac3b03
Step 2 : RUN apt-get update && apt-get install -y cuda
 ---> Using cache
 ---> ae24f9bb0f23
Step 3 : RUN mkdir /tmp/temp
 ---> Using cache
 ---> 5340f9d1b49c
Step 4 : COPY libcudnn5* /tmp/temp/
 ---> Using cache
 ---> 4d1ff5eed9f0
Step 5 : COPY mldl-repo-local_1-3ibm5_ppc64el.deb /tmp/temp/
 ---> Using cache
 ---> c3e3840d33e5
Step 6 : RUN dpkg -i /tmp/temp/libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb &&     dpkg -i /tmp/temp/libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb &&     dpkg -i /tmp/temp/mldl-repo-local_1-3ibm5_ppc64el.deb &&     rm -rf /tmp/temp &&     apt-get update && apt-get install -y  caffe-nv libnccl1 &&     rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 1868adb5dc10
Step 7 : WORKDIR /opt/DL/caffe-nv/bin
 ---> Running in 875c714591e5
 ---> ec7c68de4d7e
Step 8 : ENV LD_LIBRARY_PATH "/opt/DL/nccl/lib:/opt/DL/openblas/lib:/usr/local/cuda-8.0/lib64:/usr/lib:/usr/local/lib"
 ---> Running in e78eaede0f62
 ---> 7450b81fde8d
Removing intermediate container e78eaede0f62
Step 9 : CMD ./caffe
 ---> Running in a95e655fee4f
 ---> be9b92d51239


Now check the docker image. Since all sorts of things were thrown in, the image is a bit over 4 GB. Ideally, you should leave out anything you do not need.

root@minsky:/data/mydocker# docker images
REPOSITORY                     TAG                  IMAGE ID            CREATED             SIZE
bsyu/caffe                     ppc64le-xenial       2705abb3bbc5        13 seconds ago      4.227 GB
bsyu/p2p                       ppc64le-xenial       2fe1b4ac3b03        17 hours ago        2.775 GB
registry                       latest               781e109ba95f        44 hours ago        612.6 MB
127.0.0.1/ubuntu-xenial        gevent               4ce0e6ba8a69        44 hours ago        282.5 MB
localhost:5000/ubuntu-xenial   gevent               4ce0e6ba8a69        44 hours ago        282.5 MB
ubuntu/xenial                  gevent               4ce0e6ba8a69        44 hours ago        282.5 MB
bsyu/cuda8-cudnn5-devel        cudnn5-devel         d8d0da2fbdf2        46 hours ago        1.895 GB
bsyu/cuda                      8.0-devel            dc3faec17c11        46 hours ago        1.726 GB
bsyu/ppc64le                   cuda8.0-devel        dc3faec17c11        46 hours ago        1.726 GB
cuda                           8.0                  dc3faec17c11        46 hours ago        1.726 GB
cuda                           8.0-devel            dc3faec17c11        46 hours ago        1.726 GB
cuda                           devel                dc3faec17c11        46 hours ago        1.726 GB
cuda                           latest               dc3faec17c11        46 hours ago        1.726 GB
cuda8-cudnn5-runtime           latest               8a3b0a60e741        46 hours ago        942.2 MB
cuda                           8.0-cudnn5-runtime   8a3b0a60e741        46 hours ago        942.2 MB
cuda                           cudnn-runtime        8a3b0a60e741        46 hours ago        942.2 MB
cuda8-runtime                  latest               8e9763b6296f        46 hours ago        844.9 MB
cuda                           8.0-runtime          8e9763b6296f        46 hours ago        844.9 MB
cuda                           runtime              8e9763b6296f        46 hours ago        844.9 MB
ubuntu                         16.04                09621ebd4cfd        6 days ago          234.3 MB
ubuntu                         latest               09621ebd4cfd        6 days ago          234.3 MB
ubuntu                         xenial               09621ebd4cfd        6 days ago          234.3 MB
nvidia-docker                  deb                  332eaa8c9f9d        6 days ago          430.1 MB
nvidia-docker                  build                8cbc22512d15        6 days ago          1.012 GB
ppc64le/ubuntu                 14.04                c040fcd69c12        3 months ago        227.8 MB
ppc64le/ubuntu                 latest               1967d889e07f        3 months ago        168 MB
ppc64le/golang                 1.6.3                6a579d02d32f        5 months ago        704.7 MB
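
To spot trimming candidates programmatically, the listing above can be filtered by size. A small sketch (the `large_images` helper is hypothetical and assumes the classic `docker images` table layout, with the numeric size and its unit as the last two fields of each row):

```shell
#!/bin/sh
# Report images of 1 GB or more from `docker images` table output.
# In the classic table, the last two fields of each row are the
# numeric size and its unit (MB or GB).
large_images() {
    awk 'NR > 1 && $NF == "GB" && $(NF-1)+0 >= 1 {
        printf "%s:%s  %s %s\n", $1, $2, $(NF-1), $NF
    }'
}

# On the host: docker images | large_images
```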


Now run the docker image with nvidia-docker. You can check the Caffe version.

root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/caffe:ppc64le-xenial ./caffe --version
caffe version 0.15.13

root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/caffe:ppc64le-xenial
caffe: command line brew
usage: caffe <command> <args>

commands:
  train           train or finetune a model
  test            score a model
  device_query    show GPU diagnostic information
  time            benchmark model execution time

  Flags from tools/caffe.cpp:
    -gpu (Optional; run in GPU mode on given device IDs separated by ','.Use
      '-gpu all' to run on all available GPUs. The effective training batch
      size is multiplied by the number of devices.) type: string default: ""
    -iterations (The number of iterations to run.) type: int32 default: 50
    -model (The model definition protocol buffer text file.) type: string
      default: ""
    -sighup_effect (Optional; action to take when a SIGHUP signal is received:
      snapshot, stop or none.) type: string default: "snapshot"
    -sigint_effect (Optional; action to take when a SIGINT signal is received:
      snapshot, stop or none.) type: string default: "stop"
    -snapshot (Optional; the snapshot solver state to resume training.)
      type: string default: ""
    -solver (The solver definition protocol buffer text file.) type: string
      default: ""
    -weights (Optional; the pretrained weights to initialize finetuning,
      separated by ','. Cannot be set simultaneously with snapshot.)
      type: string default: ""
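
The training run later in this post combines the `train` command with the `-gpu` and `--solver` flags listed above. As a sketch, a tiny wrapper that assembles that command line (the `caffe_train_cmd` name is mine; the /nvme mount and image tag follow this post, so adjust them to your setup):

```shell
#!/bin/sh
# Assemble the nvidia-docker command line used to launch Caffe training.
# $1: comma-separated GPU IDs, $2: solver path inside the container.
caffe_train_cmd() {
    echo "nvidia-docker run --rm -v /nvme:/nvme bsyu/caffe:ppc64le-xenial" \
         "./caffe train -gpu $1 --solver=$2"
}

# Train on all four P100s with the solver kept on the /nvme volume:
caffe_train_cmd 0,1,2,3 /nvme/solver.prototxt
```

The function only prints the command, so you can inspect it before running it on the host.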


If you run docker without extra options, the filesystem inside the container looks like this. Notice that /nvme, a filesystem on the host server, does not even have a mount point.

root@minsky:/data/mydocker# nvidia-docker run --rm -ti bsyu/caffe:ppc64le-xenial bash

root@8f2141cfade6:/opt/DL/caffe-nv/bin# df -h
Filesystem      Size  Used Avail Use% Mounted on
none            845G  184G  619G  23% /
tmpfs           256G     0  256G   0% /dev
tmpfs           256G     0  256G   0% /sys/fs/cgroup
/dev/sda2       845G  184G  619G  23% /etc/hosts
shm              64M     0   64M   0% /dev/shm

root@8f2141cfade6:/opt/DL/caffe-nv/bin# cd /nvme
bash: cd: /nvme: No such file or directory


However, if you run with the -v (--volume) option as follows, the host server's filesystem becomes usable inside the container as well.

root@minsky:/data/mydocker# nvidia-docker run --rm -ti -v /nvme:/nvme bsyu/caffe:ppc64le-xenial bash

root@ee2866a65362:/opt/DL/caffe-nv/bin# df -h
Filesystem      Size  Used Avail Use% Mounted on
none            845G  184G  619G  23% /
tmpfs           256G     0  256G   0% /dev
tmpfs           256G     0  256G   0% /sys/fs/cgroup
/dev/nvme0n1    2.9T  290G  2.5T  11% /nvme
/dev/sda2       845G  184G  619G  23% /etc/hosts
shm              64M     0   64M   0% /dev/shm


root@ee2866a65362:/opt/DL/caffe-nv/bin# ls /nvme
caffe_alexnet_train_iter_102000.caffemodel   caffe_alexnet_train_iter_50000.caffemodel   data
caffe_alexnet_train_iter_102000.solverstate  caffe_alexnet_train_iter_50000.solverstate  ilsvrc12_train_lmdb
caffe_alexnet_train_iter_208.caffemodel      caffe_alexnet_train_iter_51000.caffemodel   ilsvrc12_val_lmdb
caffe_alexnet_train_iter_208.solverstate     caffe_alexnet_train_iter_51000.solverstate  imagenet_mean.binaryproto
caffe_alexnet_train_iter_28.caffemodel       caffe_alexnet_train_iter_56250.caffemodel   kkk
caffe_alexnet_train_iter_28.solverstate      caffe_alexnet_train_iter_56250.solverstate  lost+found
caffe_alexnet_train_iter_37500.caffemodel    caffe_alexnet_train_iter_6713.caffemodel    solver.prototxt
caffe_alexnet_train_iter_37500.solverstate   caffe_alexnet_train_iter_6713.solverstate   train_val.prototxt



Now run AlexNet training using the caffe docker image. You can see that it runs without a problem.


root@minsky:/data/mydocker# nvidia-docker run --rm -v /nvme:/nvme bsyu/caffe:ppc64le-xenial ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
I0202 02:27:22.200032     1 caffe.cpp:197] Using GPUs 0, 1, 2, 3
I0202 02:27:22.201119     1 caffe.cpp:202] GPU 0: Tesla P100-SXM2-16GB
I0202 02:27:22.201659     1 caffe.cpp:202] GPU 1: Tesla P100-SXM2-16GB
I0202 02:27:22.202191     1 caffe.cpp:202] GPU 2: Tesla P100-SXM2-16GB
I0202 02:27:22.202721     1 caffe.cpp:202] GPU 3: Tesla P100-SXM2-16GB
I0202 02:27:23.986641     1 solver.cpp:48] Initializing solver from parameters:
...
I0202 02:27:28.246285     1 parallel.cpp:334] Starting Optimization
I0202 02:27:28.246449     1 solver.cpp:304] Solving AlexNet
I0202 02:27:28.246492     1 solver.cpp:305] Learning Rate Policy: step
I0202 02:27:28.303807     1 solver.cpp:362] Iteration 0, Testing net (#0)
I0202 02:27:44.866096     1 solver.cpp:429]     Test net output #0: accuracy = 0.000890625
I0202 02:27:44.866148     1 solver.cpp:429]     Test net output #1: loss = 6.91031 (* 1 = 6.91031 loss)
I0202 02:27:45.356459     1 solver.cpp:242] Iteration 0 (0 iter/s, 17.1098s/200 iter), loss = 6.91465
I0202 02:27:45.356503     1 solver.cpp:261]     Train net output #0: loss = 6.91465 (* 1 = 6.91465 loss)
I0202 02:27:45.356540     1 sgd_solver.cpp:106] Iteration 0, lr = 0.01
...


While training runs inside the docker container, monitor GPU usage on the host server with the nvidia-smi command. You can see that the application occupying the GPUs is reported under the name caffe.

Thu Feb  2 11:37:52 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107                Driver Version: 361.107                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 0002:01:00.0     Off |                    0 |
| N/A   74C    P0   242W / 300W |   9329MiB / 16280MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 0003:01:00.0     Off |                    0 |
| N/A   69C    P0   256W / 300W |   8337MiB / 16280MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 0006:01:00.0     Off |                    0 |
| N/A   75C    P0   244W / 300W |   8337MiB / 16280MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 0007:01:00.0     Off |                    0 |
| N/A   67C    P0   222W / 300W |   8337MiB / 16280MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    121885    C   ./caffe                                       9317MiB |
|    1    121885    C   ./caffe                                       8325MiB |
|    2    121885    C   ./caffe                                       8325MiB |
|    3    121885    C   ./caffe                                       8325MiB |
+-----------------------------------------------------------------------------+
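
For scripted monitoring, the utilization column can be scraped from this table. A sketch (the `gpu_utilization` helper is illustrative and assumes the table layout above, where utilization appears as "NN%" just before "Default"):

```shell
#!/bin/sh
# Pull per-GPU utilization percentages out of `nvidia-smi` table
# output, where utilization appears as "NN%" just before "Default".
gpu_utilization() {
    grep -o '[0-9][0-9]*% *Default' | grep -o '^[0-9][0-9]*'
}

# On the host: nvidia-smi | gpu_utilization
# A machine-readable alternative that avoids scraping the table:
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
```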


Tracing that pid shows that the parent of the caffe process is docker-containerd-shim; in other words, the process is managed by docker.

root@minsky:/data/mydocker# ps -ef | grep 121885
root     121885 121867 99 11:27 ?        01:30:11 ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
root     121992 116723  0 11:39 pts/0    00:00:00 grep --color=auto 121885

root@minsky:/data/mydocker# ps -ef | grep 121867
root     121867 106109  0 11:27 ?        00:00:00 docker-containerd-shim 61b16f54712439496aec5d04cca0906425a1106a6dda935f47e228e498ddb94c /var/run/docker/libcontainerd/61b16f54712439496aec5d04cca0906425a1106a6dda935f47e228e498ddb94c docker-runc
root     121885 121867 99 11:27 ?        01:34:04 ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
root     121996 116723  0 11:39 pts/0    00:00:00 grep --color=auto 121867
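
The two ps lookups above can be generalized into a helper that walks a process's ancestry through /proc, which is handy for confirming which runtime owns a containerized workload. A sketch (Linux-only; the `ancestry` name is mine, and the PPID extraction assumes the command name in /proc/<pid>/stat contains no spaces):

```shell
#!/bin/sh
# Print "<pid> <command>" for a process and each of its ancestors,
# following the PPID chain in /proc until PID 0.
ancestry() {
    pid="$1"
    while [ "$pid" -gt 0 ] && [ -r "/proc/$pid/stat" ]; do
        # In /proc/<pid>/stat, field 2 is "(comm)" and field 4 is the
        # PPID -- valid as long as the command name has no spaces.
        comm=$(awk '{print $2}' "/proc/$pid/stat")
        ppid=$(awk '{print $4}' "/proc/$pid/stat")
        echo "$pid $comm"
        pid="$ppid"
    done
}

# On the host from this post, `ancestry 121885` would list ./caffe,
# then docker-containerd-shim, then the docker daemon. Demo on self:
ancestry $$
```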


This docker image has been pushed to https://hub.docker.com/r/bsyu/ , so anyone with a Minsky server is free to pull and use it.
