First, create a dockerfile as shown below to build a docker image that contains CUDA, CUDNN, and IBM's PowerAI toolkit (which includes Caffe). Note that a directory used as the destination of a COPY instruction must end with a trailing "/".
root@minsky:/data/mydocker# vi dockerfile.caffe
FROM bsyu/p2p:ppc64le-xenial
# RUN executes a shell command
# You can chain multiple commands together with &&
# A \ is used to split long lines to help with readability
# This particular instruction installs the source files
# for deviceQuery by installing the CUDA samples via apt
RUN apt-get update && apt-get install -y cuda
RUN mkdir /tmp/temp
COPY libcudnn5* /tmp/temp/
COPY cuda-repo-* /tmp/temp/
COPY mldl-repo-local_1-3ibm5_ppc64el.deb /tmp/temp/
RUN dpkg -i /tmp/temp/cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el.deb && \
dpkg -i /tmp/temp/libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb && \
dpkg -i /tmp/temp/libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb && \
dpkg -i /tmp/temp/mldl-repo-local_1-3ibm5_ppc64el.deb && \
rm -rf /tmp/temp && \
apt-get update && apt-get install -y caffe-nv libnccl1 && \
rm -rf /var/lib/apt/lists/*
# set the working directory
WORKDIR /opt/DL/caffe-nv/bin
ENV LD_LIBRARY_PATH="/opt/DL/nccl/lib:/opt/DL/openblas/lib:/usr/local/cuda-8.0/lib64:/usr/lib:/usr/local/lib"
# CMD defines the default command to be run in the container
# CMD is overridden by supplying a command + arguments to
# `docker run`, e.g. `nvcc --version` or `bash`
CMD ./caffe
Now start the build with this dockerfile. Of course, the relevant .deb files must be copied into the current host directory beforehand.
root@minsky:/data/mydocker# docker build -t bsyu/caffe:ppc64le-xenial -f dockerfile.caffe .
Sending build context to Docker daemon 1.664 GB
Step 1 : FROM bsyu/p2p:ppc64le-xenial
---> 2fe1b4ac3b03
Step 2 : RUN apt-get update && apt-get install -y cuda
---> Using cache
---> ae24f9bb0f23
Step 3 : RUN mkdir /tmp/temp
---> Using cache
---> 5340f9d1b49c
Step 4 : COPY libcudnn5* /tmp/temp/
---> Using cache
---> 4d1ff5eed9f0
Step 5 : COPY mldl-repo-local_1-3ibm5_ppc64el.deb /tmp/temp/
---> Using cache
---> c3e3840d33e5
Step 6 : RUN dpkg -i /tmp/temp/libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb && dpkg -i /tmp/temp/libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb && dpkg -i /tmp/temp/mldl-repo-local_1-3ibm5_ppc64el.deb && rm -rf /tmp/temp && apt-get update && apt-get install -y caffe-nv libnccl1 && rm -rf /var/lib/apt/lists/*
---> Using cache
---> 1868adb5dc10
Step 7 : WORKDIR /opt/DL/caffe-nv/bin
---> Running in 875c714591e5
---> ec7c68de4d7e
Step 8 : ENV LD_LIBRARY_PATH "/opt/DL/nccl/lib:/opt/DL/openblas/lib:/usr/local/cuda-8.0/lib64:/usr/lib:/usr/local/lib"
---> Running in e78eaede0f62
---> 7450b81fde8d
Removing intermediate container e78eaede0f62
Step 9 : CMD ./caffe
---> Running in a95e655fee4f
---> be9b92d51239
Now check the docker image. Since everything was thrown in indiscriminately, the image size is a bit over 4GB. You should, of course, leave out anything you don't actually need.
root@minsky:/data/mydocker# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
bsyu/caffe ppc64le-xenial 2705abb3bbc5 13 seconds ago 4.227 GB
bsyu/p2p ppc64le-xenial 2fe1b4ac3b03 17 hours ago 2.775 GB
registry latest 781e109ba95f 44 hours ago 612.6 MB
127.0.0.1/ubuntu-xenial gevent 4ce0e6ba8a69 44 hours ago 282.5 MB
localhost:5000/ubuntu-xenial gevent 4ce0e6ba8a69 44 hours ago 282.5 MB
ubuntu/xenial gevent 4ce0e6ba8a69 44 hours ago 282.5 MB
bsyu/cuda8-cudnn5-devel cudnn5-devel d8d0da2fbdf2 46 hours ago 1.895 GB
bsyu/cuda 8.0-devel dc3faec17c11 46 hours ago 1.726 GB
bsyu/ppc64le cuda8.0-devel dc3faec17c11 46 hours ago 1.726 GB
cuda 8.0 dc3faec17c11 46 hours ago 1.726 GB
cuda 8.0-devel dc3faec17c11 46 hours ago 1.726 GB
cuda devel dc3faec17c11 46 hours ago 1.726 GB
cuda latest dc3faec17c11 46 hours ago 1.726 GB
cuda8-cudnn5-runtime latest 8a3b0a60e741 46 hours ago 942.2 MB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 46 hours ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 46 hours ago 942.2 MB
cuda8-runtime latest 8e9763b6296f 46 hours ago 844.9 MB
cuda 8.0-runtime 8e9763b6296f 46 hours ago 844.9 MB
cuda runtime 8e9763b6296f 46 hours ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 6 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 6 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 6 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 6 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 6 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 3 months ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
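The advice about trimming the image also applies to the host's image store. As a generic housekeeping sketch (not specific to this build), docker's dangling-image filter shows where leftover build layers are consuming disk:

```shell
# Rebuilds leave behind untagged ("dangling") layers that still occupy disk.
# List them first:
docker images -f dangling=true

# Then remove them; the guard makes this a no-op if the list is empty.
docker rmi $(docker images -f dangling=true -q) 2>/dev/null || true
```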
Run the docker image with nvidia-docker. You can check the Caffe version.
root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/caffe:ppc64le-xenial ./caffe --version
caffe version 0.15.13
root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/caffe:ppc64le-xenial
caffe: command line brew
usage: caffe <command> <args>
commands:
train train or finetune a model
test score a model
device_query show GPU diagnostic information
time benchmark model execution time
Flags from tools/caffe.cpp:
-gpu (Optional; run in GPU mode on given device IDs separated by ','.Use
'-gpu all' to run on all available GPUs. The effective training batch
size is multiplied by the number of devices.) type: string default: ""
-iterations (The number of iterations to run.) type: int32 default: 50
-model (The model definition protocol buffer text file.) type: string
default: ""
-sighup_effect (Optional; action to take when a SIGHUP signal is received:
snapshot, stop or none.) type: string default: "snapshot"
-sigint_effect (Optional; action to take when a SIGINT signal is received:
snapshot, stop or none.) type: string default: "stop"
-snapshot (Optional; the snapshot solver state to resume training.)
type: string default: ""
-solver (The solver definition protocol buffer text file.) type: string
default: ""
-weights (Optional; the pretrained weights to initialize finetuning,
separated by ','. Cannot be set simultaneously with snapshot.)
type: string default: ""
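Besides train, the time subcommand listed in this help output can benchmark a network's forward/backward passes without a full training run. A sketch, assuming the train_val.prototxt that appears on the /nvme volume later in this post:

```shell
# Benchmark the network definition for 50 iterations on GPU 0
# (train_val.prototxt is assumed to exist under the mounted /nvme volume):
nvidia-docker run --rm -v /nvme:/nvme bsyu/caffe:ppc64le-xenial \
    ./caffe time -model /nvme/train_val.prototxt -gpu 0 -iterations 50
```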
If you run docker plainly, the filesystem inside the container looks like this. You can see that the host server's /nvme filesystem does not even have a mount point.
root@minsky:/data/mydocker# nvidia-docker run --rm -ti bsyu/caffe:ppc64le-xenial bash
root@8f2141cfade6:/opt/DL/caffe-nv/bin# df -h
Filesystem Size Used Avail Use% Mounted on
none 845G 184G 619G 23% /
tmpfs 256G 0 256G 0% /dev
tmpfs 256G 0 256G 0% /sys/fs/cgroup
/dev/sda2 845G 184G 619G 23% /etc/hosts
shm 64M 0 64M 0% /dev/shm
root@8f2141cfade6:/opt/DL/caffe-nv/bin# cd /nvme
bash: cd: /nvme: No such file or directory
However, if you run it with the -v (--volume) option as follows, the host server's filesystem can also be used.
root@minsky:/data/mydocker# nvidia-docker run --rm -ti -v /nvme:/nvme bsyu/caffe:ppc64le-xenial bash
root@ee2866a65362:/opt/DL/caffe-nv/bin# df -h
Filesystem Size Used Avail Use% Mounted on
none 845G 184G 619G 23% /
tmpfs 256G 0 256G 0% /dev
tmpfs 256G 0 256G 0% /sys/fs/cgroup
/dev/nvme0n1 2.9T 290G 2.5T 11% /nvme
/dev/sda2 845G 184G 619G 23% /etc/hosts
shm 64M 0 64M 0% /dev/shm
root@ee2866a65362:/opt/DL/caffe-nv/bin# ls /nvme
caffe_alexnet_train_iter_102000.caffemodel caffe_alexnet_train_iter_50000.caffemodel data
caffe_alexnet_train_iter_102000.solverstate caffe_alexnet_train_iter_50000.solverstate ilsvrc12_train_lmdb
caffe_alexnet_train_iter_208.caffemodel caffe_alexnet_train_iter_51000.caffemodel ilsvrc12_val_lmdb
caffe_alexnet_train_iter_208.solverstate caffe_alexnet_train_iter_51000.solverstate imagenet_mean.binaryproto
caffe_alexnet_train_iter_28.caffemodel caffe_alexnet_train_iter_56250.caffemodel kkk
caffe_alexnet_train_iter_28.solverstate caffe_alexnet_train_iter_56250.solverstate lost+found
caffe_alexnet_train_iter_37500.caffemodel caffe_alexnet_train_iter_6713.caffemodel solver.prototxt
caffe_alexnet_train_iter_37500.solverstate caffe_alexnet_train_iter_6713.solverstate train_val.prototxt
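If a container only needs to read the datasets, the same -v syntax accepts a :ro suffix; a minimal sketch:

```shell
# Mount /nvme read-only so the container can read the LMDB data and
# prototxt files but cannot modify or delete them:
nvidia-docker run --rm -ti -v /nvme:/nvme:ro bsyu/caffe:ppc64le-xenial bash
```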
Now run AlexNet training using the caffe docker image. You can see that it works very well.
root@minsky:/data/mydocker# nvidia-docker run --rm -v /nvme:/nvme bsyu/caffe:ppc64le-xenial ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
I0202 02:27:22.200032 1 caffe.cpp:197] Using GPUs 0, 1, 2, 3
I0202 02:27:22.201119 1 caffe.cpp:202] GPU 0: Tesla P100-SXM2-16GB
I0202 02:27:22.201659 1 caffe.cpp:202] GPU 1: Tesla P100-SXM2-16GB
I0202 02:27:22.202191 1 caffe.cpp:202] GPU 2: Tesla P100-SXM2-16GB
I0202 02:27:22.202721 1 caffe.cpp:202] GPU 3: Tesla P100-SXM2-16GB
I0202 02:27:23.986641 1 solver.cpp:48] Initializing solver from parameters:
...
I0202 02:27:28.246285 1 parallel.cpp:334] Starting Optimization
I0202 02:27:28.246449 1 solver.cpp:304] Solving AlexNet
I0202 02:27:28.246492 1 solver.cpp:305] Learning Rate Policy: step
I0202 02:27:28.303807 1 solver.cpp:362] Iteration 0, Testing net (#0)
I0202 02:27:44.866096 1 solver.cpp:429] Test net output #0: accuracy = 0.000890625
I0202 02:27:44.866148 1 solver.cpp:429] Test net output #1: loss = 6.91031 (* 1 = 6.91031 loss)
I0202 02:27:45.356459 1 solver.cpp:242] Iteration 0 (0 iter/s, 17.1098s/200 iter), loss = 6.91465
I0202 02:27:45.356503 1 solver.cpp:261] Train net output #0: loss = 6.91465 (* 1 = 6.91465 loss)
I0202 02:27:45.356540 1 sgd_solver.cpp:106] Iteration 0, lr = 0.01
...
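Because the snapshots land on the mounted /nvme volume, an interrupted run can be resumed with the -snapshot flag shown in the help output above; a sketch using one of the solverstate files listed earlier:

```shell
# Resume AlexNet training from a saved solver state instead of starting over:
nvidia-docker run --rm -v /nvme:/nvme bsyu/caffe:ppc64le-xenial \
    ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt \
    --snapshot=/nvme/caffe_alexnet_train_iter_102000.solverstate
```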
While training runs inside the docker container like this, monitor GPU utilization on the host server with the nvidia-smi command. You can see that the application using the GPUs shows up as caffe.
Thu Feb 2 11:37:52 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107 Driver Version: 361.107 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... On | 0002:01:00.0 Off | 0 |
| N/A 74C P0 242W / 300W | 9329MiB / 16280MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... On | 0003:01:00.0 Off | 0 |
| N/A 69C P0 256W / 300W | 8337MiB / 16280MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-SXM2... On | 0006:01:00.0 Off | 0 |
| N/A 75C P0 244W / 300W | 8337MiB / 16280MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-SXM2... On | 0007:01:00.0 Off | 0 |
| N/A 67C P0 222W / 300W | 8337MiB / 16280MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 121885 C ./caffe 9317MiB |
| 1 121885 C ./caffe 8325MiB |
| 2 121885 C ./caffe 8325MiB |
| 3 121885 C ./caffe 8325MiB |
+-----------------------------------------------------------------------------+
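For continuous monitoring, nvidia-smi's query mode is handier than re-running the full table; this polls every 2 seconds and emits CSV, which is easy to log or parse:

```shell
# Log per-GPU utilization and memory every 2 seconds (Ctrl-C to stop):
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
           --format=csv -l 2
```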
Tracing it by pid, you can see that the parent of that caffe process is docker-containerd-shim.
root@minsky:/data/mydocker# ps -ef | grep 121885
root 121885 121867 99 11:27 ? 01:30:11 ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
root 121992 116723 0 11:39 pts/0 00:00:00 grep --color=auto 121885
root@minsky:/data/mydocker# ps -ef | grep 121867
root 121867 106109 0 11:27 ? 00:00:00 docker-containerd-shim 61b16f54712439496aec5d04cca0906425a1106a6dda935f47e228e498ddb94c /var/run/docker/libcontainerd/61b16f54712439496aec5d04cca0906425a1106a6dda935f47e228e498ddb94c docker-runc
root 121885 121867 99 11:27 ? 01:34:04 ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
root 121996 116723 0 11:39 pts/0 00:00:00 grep --color=auto 121867
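The same container-to-host-PID mapping can also be read directly with docker top, without grepping ps output; a sketch (the container ID comes from docker ps):

```shell
# List running containers, then show one container's processes as seen by
# the host; the PID column matches what ps -ef and nvidia-smi report:
docker ps
docker top $(docker ps -q | head -1)
```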
This docker image has been pushed to https://hub.docker.com/r/bsyu/ , so anyone with a Minsky server is free to pull it and use it.