Thursday, February 2, 2017

Running Caffe AlexNet training with nvidia-docker on a Minsky server

As mentioned in the previous post, nvidia-docker makes it easy to run deep learning frameworks in a variety of environments without application conflicts between users. This time, we will build Caffe into a docker image on a Minsky server, a ppc64le system equipped with NVIDIA P100 GPUs, and use it to run AlexNet training.

First, create a dockerfile to build a docker image containing CUDA, cuDNN, and IBM's PowerAI toolkit, which includes Caffe. Note that when the destination of a COPY instruction is a directory, the path must end with a trailing "/".


root@minsky:/data/mydocker# vi dockerfile.caffe
FROM bsyu/p2p:ppc64le-xenial

# RUN executes a shell command
# You can chain multiple commands together with &&
# A \ is used to split long lines to help with readability
# This particular instruction installs the source files
# for deviceQuery by installing the CUDA samples via apt

RUN apt-get update && apt-get install -y cuda

RUN mkdir /tmp/temp
COPY libcudnn5* /tmp/temp/
COPY cuda-repo-* /tmp/temp/
COPY mldl-repo-local_1-3ibm5_ppc64el.deb /tmp/temp/

RUN dpkg -i /tmp/temp/cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el.deb && \
    dpkg -i /tmp/temp/libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb && \
    dpkg -i /tmp/temp/libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb && \
    dpkg -i /tmp/temp/mldl-repo-local_1-3ibm5_ppc64el.deb && \
    rm -rf /tmp/temp && \
    apt-get update && apt-get install -y caffe-nv libnccl1 && \
    rm -rf /var/lib/apt/lists/*

# set the working directory
WORKDIR /opt/DL/caffe-nv/bin
ENV LD_LIBRARY_PATH="/opt/DL/nccl/lib:/opt/DL/openblas/lib:/usr/local/cuda-8.0/lib64:/usr/lib:/usr/local/lib"

# CMD defines the default command to be run in the container
# CMD is overridden by supplying a command + arguments to
# `docker run`, e.g. `nvcc --version` or `bash`
CMD ./caffe


Now start the build with this dockerfile. Naturally, the relevant .deb files must have been copied into the current host directory beforehand.
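
Before kicking off the build, it is worth confirming that every package the COPY instructions reference actually exists in the build context, since a missing file fails the build midway. A minimal sketch (the `check_build_context` helper and the exact file patterns are illustrative; match them to your downloaded packages):

```shell
#!/bin/sh
# Verify that every package referenced by the dockerfile's COPY
# instructions is present in the build context directory.
check_build_context() {
    dir="$1"; shift
    missing=0
    for pattern in "$@"; do
        # Expand the glob so version wildcards like libcudnn5* work;
        # an unmatched glob stays literal and fails the -e test.
        set -- "$dir"/$pattern
        if [ ! -e "$1" ]; then
            echo "missing: $pattern"
            missing=1
        fi
    done
    return $missing
}

# Check the current directory for the files the COPY lines expect:
check_build_context . 'libcudnn5*' 'cuda-repo-*' 'mldl-repo-local_1-3ibm5_ppc64el.deb' \
    && echo "build context OK" || echo "build context incomplete"
```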

root@minsky:/data/mydocker# docker build -t bsyu/caffe:ppc64le-xenial -f dockerfile.caffe .
Sending build context to Docker daemon 1.664 GB
Step 1 : FROM bsyu/p2p:ppc64le-xenial
 ---> 2fe1b4ac3b03
Step 2 : RUN apt-get update && apt-get install -y cuda
 ---> Using cache
 ---> ae24f9bb0f23
Step 3 : RUN mkdir /tmp/temp
 ---> Using cache
 ---> 5340f9d1b49c
Step 4 : COPY libcudnn5* /tmp/temp/
 ---> Using cache
 ---> 4d1ff5eed9f0
Step 5 : COPY mldl-repo-local_1-3ibm5_ppc64el.deb /tmp/temp/
 ---> Using cache
 ---> c3e3840d33e5
Step 6 : RUN dpkg -i /tmp/temp/libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb &&     dpkg -i /tmp/temp/libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb &&     dpkg -i /tmp/temp/mldl-repo-local_1-3ibm5_ppc64el.deb &&     rm -rf /tmp/temp &&     apt-get update && apt-get install -y  caffe-nv libnccl1 &&     rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 1868adb5dc10
Step 7 : WORKDIR /opt/DL/caffe-nv/bin
 ---> Running in 875c714591e5
 ---> ec7c68de4d7e
Step 8 : ENV LD_LIBRARY_PATH "/opt/DL/nccl/lib:/opt/DL/openblas/lib:/usr/local/cuda-8.0/lib64:/usr/lib:/usr/local/lib"
 ---> Running in e78eaede0f62
 ---> 7450b81fde8d
Removing intermediate container e78eaede0f62
Step 9 : CMD ./caffe
 ---> Running in a95e655fee4f
 ---> be9b92d51239


Now check the docker image. Since all sorts of things were thrown in, the image is a bit over 4 GB. Ideally, you should leave out anything you do not need.

root@minsky:/data/mydocker# docker images
REPOSITORY                     TAG                  IMAGE ID            CREATED             SIZE
bsyu/caffe                     ppc64le-xenial       2705abb3bbc5        13 seconds ago      4.227 GB
bsyu/p2p                       ppc64le-xenial       2fe1b4ac3b03        17 hours ago        2.775 GB
registry                       latest               781e109ba95f        44 hours ago        612.6 MB
127.0.0.1/ubuntu-xenial        gevent               4ce0e6ba8a69        44 hours ago        282.5 MB
localhost:5000/ubuntu-xenial   gevent               4ce0e6ba8a69        44 hours ago        282.5 MB
ubuntu/xenial                  gevent               4ce0e6ba8a69        44 hours ago        282.5 MB
bsyu/cuda8-cudnn5-devel        cudnn5-devel         d8d0da2fbdf2        46 hours ago        1.895 GB
bsyu/cuda                      8.0-devel            dc3faec17c11        46 hours ago        1.726 GB
bsyu/ppc64le                   cuda8.0-devel        dc3faec17c11        46 hours ago        1.726 GB
cuda                           8.0                  dc3faec17c11        46 hours ago        1.726 GB
cuda                           8.0-devel            dc3faec17c11        46 hours ago        1.726 GB
cuda                           devel                dc3faec17c11        46 hours ago        1.726 GB
cuda                           latest               dc3faec17c11        46 hours ago        1.726 GB
cuda8-cudnn5-runtime           latest               8a3b0a60e741        46 hours ago        942.2 MB
cuda                           8.0-cudnn5-runtime   8a3b0a60e741        46 hours ago        942.2 MB
cuda                           cudnn-runtime        8a3b0a60e741        46 hours ago        942.2 MB
cuda8-runtime                  latest               8e9763b6296f        46 hours ago        844.9 MB
cuda                           8.0-runtime          8e9763b6296f        46 hours ago        844.9 MB
cuda                           runtime              8e9763b6296f        46 hours ago        844.9 MB
ubuntu                         16.04                09621ebd4cfd        6 days ago          234.3 MB
ubuntu                         latest               09621ebd4cfd        6 days ago          234.3 MB
ubuntu                         xenial               09621ebd4cfd        6 days ago          234.3 MB
nvidia-docker                  deb                  332eaa8c9f9d        6 days ago          430.1 MB
nvidia-docker                  build                8cbc22512d15        6 days ago          1.012 GB
ppc64le/ubuntu                 14.04                c040fcd69c12        3 months ago        227.8 MB
ppc64le/ubuntu                 latest               1967d889e07f        3 months ago        168 MB
ppc64le/golang                 1.6.3                6a579d02d32f        5 months ago        704.7 MB
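
To spot trimming candidates programmatically, the listing above can be filtered by size. A small sketch (the `large_images` helper is hypothetical and assumes the classic `docker images` table layout, with the numeric size and its unit as the last two fields of each row):

```shell
#!/bin/sh
# Report images of 1 GB or more from `docker images` table output.
# In the classic table, the last two fields of each row are the
# numeric size and its unit (MB or GB).
large_images() {
    awk 'NR > 1 && $NF == "GB" && $(NF-1)+0 >= 1 {
        printf "%s:%s  %s %s\n", $1, $2, $(NF-1), $NF
    }'
}

# On the host: docker images | large_images
```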


Now run the docker image with nvidia-docker. You can check the Caffe version.

root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/caffe:ppc64le-xenial ./caffe --version
caffe version 0.15.13

root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/caffe:ppc64le-xenial
caffe: command line brew
usage: caffe <command> <args>

commands:
  train           train or finetune a model
  test            score a model
  device_query    show GPU diagnostic information
  time            benchmark model execution time

  Flags from tools/caffe.cpp:
    -gpu (Optional; run in GPU mode on given device IDs separated by ','.Use
      '-gpu all' to run on all available GPUs. The effective training batch
      size is multiplied by the number of devices.) type: string default: ""
    -iterations (The number of iterations to run.) type: int32 default: 50
    -model (The model definition protocol buffer text file.) type: string
      default: ""
    -sighup_effect (Optional; action to take when a SIGHUP signal is received:
      snapshot, stop or none.) type: string default: "snapshot"
    -sigint_effect (Optional; action to take when a SIGINT signal is received:
      snapshot, stop or none.) type: string default: "stop"
    -snapshot (Optional; the snapshot solver state to resume training.)
      type: string default: ""
    -solver (The solver definition protocol buffer text file.) type: string
      default: ""
    -weights (Optional; the pretrained weights to initialize finetuning,
      separated by ','. Cannot be set simultaneously with snapshot.)
      type: string default: ""
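
The training run later in this post combines the `train` command with the `-gpu` and `--solver` flags listed above. As a sketch, a tiny wrapper that assembles that command line (the `caffe_train_cmd` name is mine; the /nvme mount and image tag follow this post, so adjust them to your setup):

```shell
#!/bin/sh
# Assemble the nvidia-docker command line used to launch Caffe training.
# $1: comma-separated GPU IDs, $2: solver path inside the container.
caffe_train_cmd() {
    echo "nvidia-docker run --rm -v /nvme:/nvme bsyu/caffe:ppc64le-xenial" \
         "./caffe train -gpu $1 --solver=$2"
}

# Train on all four P100s with the solver kept on the /nvme volume:
caffe_train_cmd 0,1,2,3 /nvme/solver.prototxt
```

The function only prints the command, so you can inspect it before running it on the host.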


If you run docker without extra options, the filesystem inside the container looks like this. Notice that /nvme, a filesystem on the host server, does not even have a mount point.

root@minsky:/data/mydocker# nvidia-docker run --rm -ti bsyu/caffe:ppc64le-xenial bash

root@8f2141cfade6:/opt/DL/caffe-nv/bin# df -h
Filesystem      Size  Used Avail Use% Mounted on
none            845G  184G  619G  23% /
tmpfs           256G     0  256G   0% /dev
tmpfs           256G     0  256G   0% /sys/fs/cgroup
/dev/sda2       845G  184G  619G  23% /etc/hosts
shm              64M     0   64M   0% /dev/shm

root@8f2141cfade6:/opt/DL/caffe-nv/bin# cd /nvme
bash: cd: /nvme: No such file or directory


However, if you run with the -v (--volume) option as follows, the host server's filesystem becomes usable inside the container as well.

root@minsky:/data/mydocker# nvidia-docker run --rm -ti -v /nvme:/nvme bsyu/caffe:ppc64le-xenial bash

root@ee2866a65362:/opt/DL/caffe-nv/bin# df -h
Filesystem      Size  Used Avail Use% Mounted on
none            845G  184G  619G  23% /
tmpfs           256G     0  256G   0% /dev
tmpfs           256G     0  256G   0% /sys/fs/cgroup
/dev/nvme0n1    2.9T  290G  2.5T  11% /nvme
/dev/sda2       845G  184G  619G  23% /etc/hosts
shm              64M     0   64M   0% /dev/shm


root@ee2866a65362:/opt/DL/caffe-nv/bin# ls /nvme
caffe_alexnet_train_iter_102000.caffemodel   caffe_alexnet_train_iter_50000.caffemodel   data
caffe_alexnet_train_iter_102000.solverstate  caffe_alexnet_train_iter_50000.solverstate  ilsvrc12_train_lmdb
caffe_alexnet_train_iter_208.caffemodel      caffe_alexnet_train_iter_51000.caffemodel   ilsvrc12_val_lmdb
caffe_alexnet_train_iter_208.solverstate     caffe_alexnet_train_iter_51000.solverstate  imagenet_mean.binaryproto
caffe_alexnet_train_iter_28.caffemodel       caffe_alexnet_train_iter_56250.caffemodel   kkk
caffe_alexnet_train_iter_28.solverstate      caffe_alexnet_train_iter_56250.solverstate  lost+found
caffe_alexnet_train_iter_37500.caffemodel    caffe_alexnet_train_iter_6713.caffemodel    solver.prototxt
caffe_alexnet_train_iter_37500.solverstate   caffe_alexnet_train_iter_6713.solverstate   train_val.prototxt



Now run AlexNet training using the caffe docker image. You can see that it runs without a problem.


root@minsky:/data/mydocker# nvidia-docker run --rm -v /nvme:/nvme bsyu/caffe:ppc64le-xenial ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
I0202 02:27:22.200032     1 caffe.cpp:197] Using GPUs 0, 1, 2, 3
I0202 02:27:22.201119     1 caffe.cpp:202] GPU 0: Tesla P100-SXM2-16GB
I0202 02:27:22.201659     1 caffe.cpp:202] GPU 1: Tesla P100-SXM2-16GB
I0202 02:27:22.202191     1 caffe.cpp:202] GPU 2: Tesla P100-SXM2-16GB
I0202 02:27:22.202721     1 caffe.cpp:202] GPU 3: Tesla P100-SXM2-16GB
I0202 02:27:23.986641     1 solver.cpp:48] Initializing solver from parameters:
...
I0202 02:27:28.246285     1 parallel.cpp:334] Starting Optimization
I0202 02:27:28.246449     1 solver.cpp:304] Solving AlexNet
I0202 02:27:28.246492     1 solver.cpp:305] Learning Rate Policy: step
I0202 02:27:28.303807     1 solver.cpp:362] Iteration 0, Testing net (#0)
I0202 02:27:44.866096     1 solver.cpp:429]     Test net output #0: accuracy = 0.000890625
I0202 02:27:44.866148     1 solver.cpp:429]     Test net output #1: loss = 6.91031 (* 1 = 6.91031 loss)
I0202 02:27:45.356459     1 solver.cpp:242] Iteration 0 (0 iter/s, 17.1098s/200 iter), loss = 6.91465
I0202 02:27:45.356503     1 solver.cpp:261]     Train net output #0: loss = 6.91465 (* 1 = 6.91465 loss)
I0202 02:27:45.356540     1 sgd_solver.cpp:106] Iteration 0, lr = 0.01
...


While training runs inside the docker container, monitor GPU usage on the host server with the nvidia-smi command. You can see that the application occupying the GPUs is reported under the name caffe.

Thu Feb  2 11:37:52 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107                Driver Version: 361.107                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 0002:01:00.0     Off |                    0 |
| N/A   74C    P0   242W / 300W |   9329MiB / 16280MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 0003:01:00.0     Off |                    0 |
| N/A   69C    P0   256W / 300W |   8337MiB / 16280MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 0006:01:00.0     Off |                    0 |
| N/A   75C    P0   244W / 300W |   8337MiB / 16280MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 0007:01:00.0     Off |                    0 |
| N/A   67C    P0   222W / 300W |   8337MiB / 16280MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    121885    C   ./caffe                                       9317MiB |
|    1    121885    C   ./caffe                                       8325MiB |
|    2    121885    C   ./caffe                                       8325MiB |
|    3    121885    C   ./caffe                                       8325MiB |
+-----------------------------------------------------------------------------+
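
For scripted monitoring, the utilization column can be scraped from this table. A sketch (the `gpu_utilization` helper is illustrative and assumes the table layout above, where utilization appears as "NN%" just before "Default"):

```shell
#!/bin/sh
# Pull per-GPU utilization percentages out of `nvidia-smi` table
# output, where utilization appears as "NN%" just before "Default".
gpu_utilization() {
    grep -o '[0-9][0-9]*% *Default' | grep -o '^[0-9][0-9]*'
}

# On the host: nvidia-smi | gpu_utilization
# A machine-readable alternative that avoids scraping the table:
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
```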


Tracing that pid shows that the parent of the caffe process is docker-containerd-shim; in other words, the process is managed by docker.

root@minsky:/data/mydocker# ps -ef | grep 121885
root     121885 121867 99 11:27 ?        01:30:11 ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
root     121992 116723  0 11:39 pts/0    00:00:00 grep --color=auto 121885

root@minsky:/data/mydocker# ps -ef | grep 121867
root     121867 106109  0 11:27 ?        00:00:00 docker-containerd-shim 61b16f54712439496aec5d04cca0906425a1106a6dda935f47e228e498ddb94c /var/run/docker/libcontainerd/61b16f54712439496aec5d04cca0906425a1106a6dda935f47e228e498ddb94c docker-runc
root     121885 121867 99 11:27 ?        01:34:04 ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
root     121996 116723  0 11:39 pts/0    00:00:00 grep --color=auto 121867
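
The two ps lookups above can be generalized into a helper that walks a process's ancestry through /proc, which is handy for confirming which runtime owns a containerized workload. A sketch (Linux-only; the `ancestry` name is mine, and the PPID extraction assumes the command name in /proc/<pid>/stat contains no spaces):

```shell
#!/bin/sh
# Print "<pid> <command>" for a process and each of its ancestors,
# following the PPID chain in /proc until PID 0.
ancestry() {
    pid="$1"
    while [ "$pid" -gt 0 ] && [ -r "/proc/$pid/stat" ]; do
        # In /proc/<pid>/stat, field 2 is "(comm)" and field 4 is the
        # PPID -- valid as long as the command name has no spaces.
        comm=$(awk '{print $2}' "/proc/$pid/stat")
        ppid=$(awk '{print $4}' "/proc/$pid/stat")
        echo "$pid $comm"
        pid="$ppid"
    done
}

# On the host from this post, `ancestry 121885` would list ./caffe,
# then docker-containerd-shim, then the docker daemon. Demo on self:
ancestry $$
```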


This docker image has been pushed to https://hub.docker.com/r/bsyu/ , so anyone with a Minsky server is free to pull and use it.
