Deep Learning for HW Engineers: P100


Thursday, February 2, 2017

Running Caffe AlexNet Training with nvidia-docker on a Minsky Server

As mentioned in the previous post, nvidia-docker makes it easy to use deep learning frameworks in a variety of environments without application conflicts between users. This time we will build Caffe into a docker image on a Minsky server, a ppc64le system equipped with NVIDIA P100 GPUs, and run AlexNet training with it.

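First, create a dockerfile as shown below to build a docker image that contains CUDA, cuDNN, and IBM's PowerAI toolkit, which includes Caffe. Note that the destination directory of a COPY instruction must end with "/". Docker requires this whenever the source is a wildcard that can match more than one file, as it is here.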

root@minsky:/data/mydocker# vi dockerfile.caffe
FROM bsyu/p2p:ppc64le-xenial

# RUN executes a shell command
# You can chain multiple commands together with &&
# A \ is used to split long lines to help with readability
# This particular instruction installs the CUDA toolkit via apt

RUN apt-get update && apt-get install -y cuda

RUN mkdir /tmp/temp
COPY libcudnn5* /tmp/temp/
COPY cuda-repo-* /tmp/temp/
COPY mldl-repo-local_1-3ibm5_ppc64el.deb /tmp/temp/

RUN dpkg -i /tmp/temp/cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el.deb && \
dpkg -i /tmp/temp/libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb && \
dpkg -i /tmp/temp/libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb && \
dpkg -i /tmp/temp/mldl-repo-local_1-3ibm5_ppc64el.deb && \
rm -rf /tmp/temp && \
apt-get update && apt-get install -y caffe-nv libnccl1 && \
rm -rf /var/lib/apt/lists/*

# set the working directory
WORKDIR /opt/DL/caffe-nv/bin
ENV LD_LIBRARY_PATH="/opt/DL/nccl/lib:/opt/DL/openblas/lib:/usr/local/cuda-8.0/lib64:/usr/lib:/usr/local/lib"

# CMD defines the default command to be run in the container
# CMD is overridden by supplying a command + arguments to
# `docker run`, e.g. `nvcc --version` or `bash`
CMD ./caffe
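
The COPY instructions in this dockerfile only work if the matching deb files are already in the build context directory. A minimal sketch of a pre-flight check, assuming a hypothetical helper named check_build_context (it is not part of docker), could look like this:

```shell
# check_build_context DIR PATTERN...
# Hypothetical helper: succeeds only if every glob pattern matches at
# least one existing file under DIR (the docker build context).
check_build_context() {
    local dir="$1"; shift
    local pattern f found missing=0
    for pattern in "$@"; do
        found=0
        # $pattern is left unquoted on purpose so the shell expands the glob;
        # when nothing matches, the literal pattern fails the -e test.
        for f in "$dir"/$pattern; do
            if [ -e "$f" ]; then
                found=1
                break
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "missing from build context: $pattern" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For this dockerfile it might be called as `check_build_context . 'libcudnn5*' 'cuda-repo-*' 'mldl-repo-local_1-3ibm5_ppc64el.deb'` before starting the build.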

Now start the build with this dockerfile. The related deb files must already be copied into the current directory on the host.

root@minsky:/data/mydocker# docker build -t bsyu/caffe:ppc64le-xenial -f dockerfile.caffe .
Sending build context to Docker daemon 1.664 GB
Step 1 : FROM bsyu/p2p:ppc64le-xenial
---> 2fe1b4ac3b03
Step 2 : RUN apt-get update && apt-get install -y cuda
---> Using cache
---> ae24f9bb0f23
Step 3 : RUN mkdir /tmp/temp
---> Using cache
---> 5340f9d1b49c
Step 4 : COPY libcudnn5* /tmp/temp/
---> Using cache
---> 4d1ff5eed9f0
Step 5 : COPY mldl-repo-local_1-3ibm5_ppc64el.deb /tmp/temp/
---> Using cache
---> c3e3840d33e5
Step 6 : RUN dpkg -i /tmp/temp/libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb && dpkg -i /tmp/temp/libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb && dpkg -i /tmp/temp/mldl-repo-local_1-3ibm5_ppc64el.deb && rm -rf /tmp/temp && apt-get update && apt-get install -y caffe-nv libnccl1 && rm -rf /var/lib/apt/lists/*
---> Using cache
---> 1868adb5dc10
Step 7 : WORKDIR /opt/DL/caffe-nv/bin
---> Running in 875c714591e5
---> ec7c68de4d7e
Step 8 : ENV LD_LIBRARY_PATH "/opt/DL/nccl/lib:/opt/DL/openblas/lib:/opt/DL/nccl/lib:/usr/local/cuda-8.0/lib6:/usr/lib:/usr/local/lib"
---> Running in e78eaede0f62
---> 7450b81fde8d
Removing intermediate container e78eaede0f62
Step 9 : CMD ./caffe
---> Running in a95e655fee4f
---> be9b92d51239

Now check the docker image. Because I threw in everything I could think of, the image is a little over 4 GB. It is better to leave out whatever you do not need.

root@minsky:/data/mydocker# docker images
REPOSITORY                    TAG                 IMAGE ID       CREATED          SIZE
bsyu/caffe                    ppc64le-xenial      2705abb3bbc5   13 seconds ago   4.227 GB
bsyu/p2p                      ppc64le-xenial      2fe1b4ac3b03   17 hours ago     2.775 GB
registry                      latest              781e109ba95f   44 hours ago     612.6 MB
127.0.0.1/ubuntu-xenial       gevent              4ce0e6ba8a69   44 hours ago     282.5 MB
localhost:5000/ubuntu-xenial  gevent              4ce0e6ba8a69   44 hours ago     282.5 MB
ubuntu/xenial                 gevent              4ce0e6ba8a69   44 hours ago     282.5 MB
bsyu/cuda8-cudnn5-devel       cudnn5-devel        d8d0da2fbdf2   46 hours ago     1.895 GB
bsyu/cuda                     8.0-devel           dc3faec17c11   46 hours ago     1.726 GB
bsyu/ppc64le                  cuda8.0-devel       dc3faec17c11   46 hours ago     1.726 GB
cuda                          8.0                 dc3faec17c11   46 hours ago     1.726 GB
cuda                          8.0-devel           dc3faec17c11   46 hours ago     1.726 GB
cuda                          devel               dc3faec17c11   46 hours ago     1.726 GB
cuda                          latest              dc3faec17c11   46 hours ago     1.726 GB
cuda8-cudnn5-runtime          latest              8a3b0a60e741   46 hours ago     942.2 MB
cuda                          8.0-cudnn5-runtime  8a3b0a60e741   46 hours ago     942.2 MB
cuda                          cudnn-runtime       8a3b0a60e741   46 hours ago     942.2 MB
cuda8-runtime                 latest              8e9763b6296f   46 hours ago     844.9 MB
cuda                          8.0-runtime         8e9763b6296f   46 hours ago     844.9 MB
cuda                          runtime             8e9763b6296f   46 hours ago     844.9 MB
ubuntu                        16.04               09621ebd4cfd   6 days ago       234.3 MB
ubuntu                        latest              09621ebd4cfd   6 days ago       234.3 MB
ubuntu                        xenial              09621ebd4cfd   6 days ago       234.3 MB
nvidia-docker                 deb                 332eaa8c9f9d   6 days ago       430.1 MB
nvidia-docker                 build               8cbc22512d15   6 days ago       1.012 GB
ppc64le/ubuntu                14.04               c040fcd69c12   3 months ago     227.8 MB
ppc64le/ubuntu                latest              1967d889e07f   3 months ago     168 MB
ppc64le/golang                1.6.3               6a579d02d32f   5 months ago     704.7 MB

Run the docker image with nvidia-docker. You can check the Caffe version.

root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/caffe:ppc64le-xenial ./caffe --version
caffe version 0.15.13

root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/caffe:ppc64le-xenial
caffe: command line brew
usage: caffe <command> <args>

commands:
train train or finetune a model
test score a model
device_query show GPU diagnostic information
time benchmark model execution time

Flags from tools/caffe.cpp:
-gpu (Optional; run in GPU mode on given device IDs separated by ','.Use
'-gpu all' to run on all available GPUs. The effective training batch
size is multiplied by the number of devices.) type: string default: ""
-iterations (The number of iterations to run.) type: int32 default: 50
-model (The model definition protocol buffer text file.) type: string
default: ""
-sighup_effect (Optional; action to take when a SIGHUP signal is received:
snapshot, stop or none.) type: string default: "snapshot"
-sigint_effect (Optional; action to take when a SIGINT signal is received:
snapshot, stop or none.) type: string default: "stop"
-snapshot (Optional; the snapshot solver state to resume training.)
type: string default: ""
-solver (The solver definition protocol buffer text file.) type: string
default: ""
-weights (Optional; the pretrained weights to initialize finetuning,
separated by ','. Cannot be set simultaneously with snapshot.)
type: string default: ""

If you simply run the container, its filesystem looks like the following. Note that the host server's /nvme filesystem is not visible; not even a mount point for it exists.

root@minsky:/data/mydocker# nvidia-docker run --rm -ti bsyu/caffe:ppc64le-xenial bash

root@8f2141cfade6:/opt/DL/caffe-nv/bin# df -h
Filesystem      Size  Used Avail Use% Mounted on
none            845G  184G  619G  23% /
tmpfs           256G     0  256G   0% /dev
tmpfs           256G     0  256G   0% /sys/fs/cgroup
/dev/sda2       845G  184G  619G  23% /etc/hosts
shm              64M     0   64M   0% /dev/shm

root@8f2141cfade6:/opt/DL/caffe-nv/bin# cd /nvme
bash: cd: /nvme: No such file or directory

However, if you run it with the -v (--volume) option as follows, the container can also use the host server's filesystem.

root@minsky:/data/mydocker# nvidia-docker run --rm -ti -v /nvme:/nvme bsyu/caffe:ppc64le-xenial bash

root@ee2866a65362:/opt/DL/caffe-nv/bin# df -h
Filesystem      Size  Used Avail Use% Mounted on
none            845G  184G  619G  23% /
tmpfs           256G     0  256G   0% /dev
tmpfs           256G     0  256G   0% /sys/fs/cgroup
/dev/nvme0n1    2.9T  290G  2.5T  11% /nvme
/dev/sda2       845G  184G  619G  23% /etc/hosts
shm              64M     0   64M   0% /dev/shm

root@ee2866a65362:/opt/DL/caffe-nv/bin# ls /nvme
caffe_alexnet_train_iter_102000.caffemodel caffe_alexnet_train_iter_50000.caffemodel data
caffe_alexnet_train_iter_102000.solverstate caffe_alexnet_train_iter_50000.solverstate ilsvrc12_train_lmdb
caffe_alexnet_train_iter_208.caffemodel caffe_alexnet_train_iter_51000.caffemodel ilsvrc12_val_lmdb
caffe_alexnet_train_iter_208.solverstate caffe_alexnet_train_iter_51000.solverstate imagenet_mean.binaryproto
caffe_alexnet_train_iter_28.caffemodel caffe_alexnet_train_iter_56250.caffemodel kkk
caffe_alexnet_train_iter_28.solverstate caffe_alexnet_train_iter_56250.solverstate lost+found
caffe_alexnet_train_iter_37500.caffemodel caffe_alexnet_train_iter_6713.caffemodel solver.prototxt
caffe_alexnet_train_iter_37500.solverstate caffe_alexnet_train_iter_6713.solverstate train_val.prototxt
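
When several host directories need to be mounted, it can help to assemble the command line first and review it. The following is a minimal sketch, assuming a hypothetical helper named build_run_cmd; it only prints the nvidia-docker command and refuses host paths that do not exist, since a mistyped path would otherwise show up as an empty directory inside the container.

```shell
# build_run_cmd IMAGE HOST_DIR...
# Hypothetical helper: prints (does not execute) an nvidia-docker command
# that mounts each host directory at the same path inside the container.
build_run_cmd() {
    local image="$1"; shift
    local cmd="nvidia-docker run --rm -ti"
    local dir
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "host directory not found: $dir" >&2
            return 1
        fi
        cmd="$cmd -v $dir:$dir"
    done
    echo "$cmd $image bash"
}
```

Called as `build_run_cmd bsyu/caffe:ppc64le-xenial /nvme`, it prints the same command that was typed by hand above.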

Now run AlexNet training using the caffe docker image. As the log shows, all four P100 GPUs are detected and the solver starts without problems.

root@minsky:/data/mydocker# nvidia-docker run --rm -v /nvme:/nvme bsyu/caffe:ppc64le-xenial ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
I0202 02:27:22.200032 1 caffe.cpp:197] Using GPUs 0, 1, 2, 3
I0202 02:27:22.201119 1 caffe.cpp:202] GPU 0: Tesla P100-SXM2-16GB
I0202 02:27:22.201659 1 caffe.cpp:202] GPU 1: Tesla P100-SXM2-16GB
I0202 02:27:22.202191 1 caffe.cpp:202] GPU 2: Tesla P100-SXM2-16GB
I0202 02:27:22.202721 1 caffe.cpp:202] GPU 3: Tesla P100-SXM2-16GB
I0202 02:27:23.986641 1 solver.cpp:48] Initializing solver from parameters:
...
I0202 02:27:28.246285 1 parallel.cpp:334] Starting Optimization
I0202 02:27:28.246449 1 solver.cpp:304] Solving AlexNet
I0202 02:27:28.246492 1 solver.cpp:305] Learning Rate Policy: step
I0202 02:27:28.303807 1 solver.cpp:362] Iteration 0, Testing net (#0)
I0202 02:27:44.866096 1 solver.cpp:429] Test net output #0: accuracy = 0.000890625
I0202 02:27:44.866148 1 solver.cpp:429] Test net output #1: loss = 6.91031 (* 1 = 6.91031 loss)
I0202 02:27:45.356459 1 solver.cpp:242] Iteration 0 (0 iter/s, 17.1098s/200 iter), loss = 6.91465
I0202 02:27:45.356503 1 solver.cpp:261] Train net output #0: loss = 6.91465 (* 1 = 6.91465 loss)
I0202 02:27:45.356540 1 sgd_solver.cpp:106] Iteration 0, lr = 0.01
...

While training runs in the docker container, monitor GPU usage on the host server with the nvidia-smi command. The application using the GPUs is listed as caffe.

Thu Feb  2 11:37:52 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107                Driver Version: 361.107                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 0002:01:00.0     Off |                    0 |
| N/A   74C    P0   242W / 300W |   9329MiB / 16280MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 0003:01:00.0     Off |                    0 |
| N/A   69C    P0   256W / 300W |   8337MiB / 16280MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 0006:01:00.0     Off |                    0 |
| N/A   75C    P0   244W / 300W |   8337MiB / 16280MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 0007:01:00.0     Off |                    0 |
| N/A   67C    P0   222W / 300W |   8337MiB / 16280MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    121885    C   ./caffe                                       9317MiB |
|    1    121885    C   ./caffe                                       8325MiB |
|    2    121885    C   ./caffe                                       8325MiB |
|    3    121885    C   ./caffe                                       8325MiB |
+-----------------------------------------------------------------------------+

Tracing that pid shows that the parent of the caffe process is docker-containerd-shim.

root@minsky:/data/mydocker# ps -ef | grep 121885
root 121885 121867 99 11:27 ? 01:30:11 ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
root 121992 116723 0 11:39 pts/0 00:00:00 grep --color=auto 121885

root@minsky:/data/mydocker# ps -ef | grep 121867
root 121867 106109 0 11:27 ? 00:00:00 docker-containerd-shim 61b16f54712439496aec5d04cca0906425a1106a6dda935f47e228e498ddb94c /var/run/docker/libcontainerd/61b16f54712439496aec5d04cca0906425a1106a6dda935f47e228e498ddb94c docker-runc
root 121885 121867 99 11:27 ? 01:34:04 ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
root 121996 116723 0 11:39 pts/0 00:00:00 grep --color=auto 121867
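
The same parent lookup can be repeated automatically instead of running ps and grep once per level. This is a minimal sketch, assuming a Linux host with /proc mounted; print_ancestors is a hypothetical helper name, and the process names it prints come from /proc/<pid>/status, where the kernel truncates them to 15 characters.

```shell
# print_ancestors PID
# Hypothetical helper: prints "PID NAME" for the given process and then
# for each of its ancestors, reading the Name: and PPid: fields of
# /proc/<pid>/status. Stops at PID 0 or at an unreadable entry.
print_ancestors() {
    local pid="$1"
    while [ -n "$pid" ] && [ "$pid" -gt 0 ] && [ -r "/proc/$pid/status" ]; do
        awk -v pid="$pid" '/^Name:/ {name = $2} END {print pid, name}' "/proc/$pid/status"
        pid=$(awk '/^PPid:/ {print $2}' "/proc/$pid/status")
    done
}
```

On the host above, `print_ancestors 121885` would be expected to start with the caffe process and continue with its parent, 121867.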

This docker image has been pushed to https://hub.docker.com/r/bsyu/ , so anyone with a Minsky server is free to pull and use it.

Thursday, December 15, 2016

Installing CUDA and the PowerAI Toolkit on a Minsky Server without Internet Access

CUDA and the PowerAI toolkit are both provided as local debian packages, so in principle they can be installed without internet access. However, installing them first requires a number of additional packages on the Ubuntu OS, and those are not included in the Ubuntu 16.04.1 LTS base OS image; they live in the APT repositories on the internet. So in the end internet access is needed after all.

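The simplest, if somewhat brute-force, way around this is to copy the Ubuntu APT repository from the internet onto a local disk. This is worth setting up once for any server installed where there is no internet connection, not only for CUDA and PowerAI, because you will occasionally need a few additional OS packages anyway.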

0. Prerequisites and overview of the procedure

- One Linux on POWER machine with Ubuntu 16.04.1 LTS installed (no GPU required) and internet access, plus a USB external disk of 100 GB or more attached to it. This machine is called the laptop below.
- First, on the laptop, download the deb files from the Ubuntu 16.04.1 LTS APT repository on the internet using the apt-get command.
- Then copy the files to the USB external disk, carry it into the datacenter, and upload the files to the Minsky server.
- After the upload, build an APT repository on the Minsky server's local disk with the dpkg-scanpackages command.

1. Download all debian packages from the official Ubuntu 16.04.1 LTS APT repository to the laptop or USB disk

$ apt-cache dumpavail | grep -oP "(?<=Package: ).*" >> packagelist

$ head packagelist
a11y-profile-manager
a11y-profile-manager-doc
a11y-profile-manager-indicator
account-plugin-facebook
account-plugin-flickr
account-plugin-google
accounts-qml-module-doc
acct
acl
acpid

$ cat packagelist | wc -l
52681

--> More than 52,000 debian packages in total must be downloaded, an estimated 70 GB or so
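
The size estimate can be cross-checked before downloading by summing the Size: fields of the package records. This is a minimal sketch; sum_package_sizes is a hypothetical helper name, and it assumes that each record in the `apt-cache dumpavail` output carries a `Size:` line in bytes. If the output lists records for more than one architecture, the total will overstate the ppc64el download.

```shell
# sum_package_sizes
# Hypothetical helper: reads package records on stdin and prints the sum
# of all "Size:" fields in GB (1 GB = 10^9 bytes). "Installed-Size:"
# lines are not counted because the match is anchored at line start.
sum_package_sizes() {
    awk '/^Size: / {bytes += $2} END {printf "%.1f GB\n", bytes / 1e9}'
}

# Example (not run here): apt-cache dumpavail | sum_package_sizes
```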

$ for i in `cat packagelist`
> do
> apt-get download ${i}:ppc64el
> done
...
Fetched 6,490 B in 0s (42.8 kB/s)
Get:1 http://ports.ubuntu.com/ubuntu-ports xenial/main ppc64el libgrail6 ppc64el 3.1.0+16.04.20160125-0ubuntu1 [49.9 kB]
Fetched 49.9 kB in 0s (167 kB/s)
Get:1 http://ports.ubuntu.com/ubuntu-ports xenial/main ppc64el libgraphite2-3 ppc64el 1.3.6-1ubuntu1 [62.9 kB]
Fetched 62.9 kB in 0s (211 kB/s)
Get:1 http://ports.ubuntu.com/ubuntu-ports xenial/main ppc64el libgraphite2-dev ppc64el 1.3.6-1ubuntu1 [14.6 kB]
Fetched 14.6 kB in 0s (65.4 kB/s)
Get:1 http://ports.ubuntu.com/ubuntu-ports xenial/main ppc64el libgraphite2-doc all 1.3.6-1ubuntu1 [567 kB]
Fetched 567 kB in 1s (386 kB/s)
...
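
With tens of thousands of packages, some downloads may fail along the way, and the plain loop gives no record of which ones. The following is a minimal sketch of a loop that logs failures for a later retry. The names download_list and DOWNLOADER are assumptions made for this illustration; DOWNLOADER defaults to `apt-get download` and exists only so the loop can be dry-run with a stub.

```shell
# download_list LIST_FILE FAILED_FILE
# Hypothetical helper: tries to download every package named in LIST_FILE
# (one name per line) and writes the names that failed to FAILED_FILE.
download_list() {
    local list="$1" failed="$2" pkg
    : > "$failed"                      # start with an empty failure log
    while IFS= read -r pkg; do
        [ -n "$pkg" ] || continue      # skip blank lines
        # stdin is redirected so the downloader cannot consume the list
        if ! ${DOWNLOADER:-apt-get download} "$pkg" < /dev/null; then
            echo "$pkg" >> "$failed"
        fi
    done < "$list"
}
```

The list entries are passed through unchanged, so an architecture suffix such as `:ppc64el` would have to be part of each name in the list file.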

2. Upload the downloaded files to the Minsky server using the USB disk

Connect the USB disk to a front USB port of the Minsky, then detect and mount it as follows.

$ sudo fdisk -l
--> Confirm that the USB disk has been recognized, for example as /dev/sdc1

$ sudo mount /dev/sdc1 /mnt

$ cp -r /mnt/* ~/files/apt
--> Copies everything to the ~/files/apt directory

$ sudo umount /mnt

3. Configure the local APT repository on the Minsky server as follows

(At this point all deb files have been copied to the ~/files/apt directory)

u0017496@sys-84793:~/files/apt$ sudo dpkg-scanpackages . /dev/null > Packages

u0017496@sys-84793:~/files/apt$ sudo gzip -9c Packages > Packages.gz

--> This scans the debian package files and creates the Packages.gz file. APT recognizes the directory as a repository only if this file exists.

u0017496@sys-84793:~/files/apt$ ls -l Pa*
-rw-rw-r-- 1 u0017496 u0017496 5407830 Dec 14 22:07 Packages.gz
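
A quick sanity check on the generated index is to compare the number of deb files with the number of package entries. This is a minimal sketch, assuming the uncompressed Packages file is still present in the repository directory; check_index is a hypothetical helper name. The counts can legitimately differ, for example when dpkg-scanpackages skips older duplicate versions, so a mismatch is a prompt to look closer and not proof of an error.

```shell
# check_index REPO_DIR
# Hypothetical helper: compares the number of *.deb files under REPO_DIR
# with the number of "Package:" stanzas in REPO_DIR/Packages and returns
# success only when the two counts are equal.
check_index() {
    local dir="$1" debs entries
    debs=$(find "$dir" -name '*.deb' | wc -l)
    # grep -c prints 0 but exits non-zero when nothing matches
    entries=$(grep -c '^Package: ' "$dir/Packages" || true)
    echo "deb files: $debs, index entries: $entries"
    [ "$debs" -eq "$entries" ]
}
```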

u0017496@sys-84793:~/files/apt$ sudo cp /etc/apt/sources.list /etc/apt/sources.list.org

u0017496@sys-84793:~/files/apt$ sudo vi /etc/apt/sources.list
deb file:///home/u0017496/files/apt ./
--> The original /etc/apt/sources.list contains the addresses of the APT repositories on the internet. Insert this one line at the very top. The path after file:// must of course be the directory that holds the deb files and Packages.gz.
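
The same edit can be made without opening an editor. This is a minimal sketch, assuming a hypothetical helper named prepend_line; it writes through a temporary file and then overwrites the target in place, so the target keeps its owner and permissions. For /etc/apt/sources.list it would have to run in a root shell, and trying it on a copy first is the safer route.

```shell
# prepend_line LINE FILE
# Hypothetical helper: inserts LINE as the new first line of FILE.
prepend_line() {
    local line="$1" file="$2" tmp rc
    tmp=$(mktemp) || return 1
    # build the new content in a temp file, then copy it over the target
    { printf '%s\n' "$line"; cat "$file"; } > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Example (not run here):
#   prepend_line 'deb file:///home/u0017496/files/apt ./' /etc/apt/sources.list
```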

Then create the Release file with the apt-ftparchive command as follows.

u0017496@sys-84793:~/files/apt$ sudo apt-ftparchive release . > Release

Now these files must be signed with gpg. Otherwise an error such as "The repository 'file:/home/apt ./ Release' is not signed" occurs and the APT repository is not enabled.

u0017496@sys-84793:~/files/apt$ sudo gpg --gen-key
gpg (GnuPG) 1.4.20; Copyright (C) 2015 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Please select what kind of key you want:
(1) RSA and RSA (default)
(2) DSA and Elgamal
(3) DSA (sign only)
(4) RSA (sign only)
Your selection? 1
RSA keys may be between 1024 and 4096 bits long.
What keysize do you want? (2048)
Requested keysize is 2048 bits
Please specify how long the key should be valid.
0 = key does not expire
<n> = key expires in n days
<n>w = key expires in n weeks
<n>m = key expires in n months
<n>y = key expires in n years
Key is valid for? (0)
Key does not expire at all
Is this correct? (y/N) y

You need a user ID to identify your key; the software constructs the user ID
from the Real Name, Comment and Email Address in this form:
"Heinrich Heine (Der Dichter) <heinrichh@duesseldorf.de>"

Real name: XXXX
Email address: XXXX@gmail.com
Comment:
You selected this USER-ID:
"XXX <XXXX@gmail.com>"

Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? O
You need a Passphrase to protect your secret key.

gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0 valid: 1 signed: 0 trust: 0-, 0q, 0n, 0m, 0f, 1u
pub 2048R/7A19D61F 2017-03-13
Key fingerprint = XXXXXXXXXXXXXXXXX7A19D61F
uid XXX <XXX@gmail.com>
sub 2048R/XXXXX 2017-03-13

u0017496@sys-84793:~/files/apt$ sudo gpg --list-keys
/root/.gnupg/pubring.gpg
------------------------
pub 2048R/7A19D61F 2017-03-13
uid XXX <XXX@gmail.com>
sub 2048R/XXXXX 2017-03-13

u0017496@sys-84793:~/files/apt$ sudo gpg --output /home/pubkey-export-file --armor --export 7A19D61F
u0017496@sys-84793:~/files/apt$ sudo apt-key add /home/pubkey-export-file
OK

u0017496@sys-84793:~/files/apt$ sudo gpg --clearsign -o InRelease Release
u0017496@sys-84793:~/files/apt$ sudo gpg -abs -o Release.gpg Release

Everything is ready now. Run apt-get update.

u0017496@sys-84793:~/files/apt$ sudo apt-get update
Get:1 file:/home/apt ./ InRelease
Ign:1 file:/home/apt ./ InRelease
...
Get:8 file:/opt/DL/repo Release.gpg
Ign:8 file:/opt/DL/repo Release.gpg
Reading package lists... Done
W: file:///home/apt/./Release.gpg: Signature by key XXXXXXXXXXXXXX uses weak digest algorithm (SHA1)

This makes apt-get re-read the APT repositories listed in /etc/apt/sources.list. A warning about a weak digest algorithm may appear as shown above; it can be ignored.

The local APT repository is now configured. As a test, try installing and removing telnet.

u0017496@sys-84799:~$ sudo apt-get install telnet

4. Upload the CUDA and PowerAI (power-mldl) packages to the Minsky server via USB disk or similar

u0017496@sys-84799:~/files$ ls -l
total 2197160
-rw-rw-r-- 1 u0017496 u0017496 1318461544 Dec 7 19:58 cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el-deb
-rw-r--r-- 1 u0017496 u0017496 32904888 Nov 14 09:57 libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb
-rw-r--r-- 1 u0017496 u0017496 5189230 Nov 14 09:58 libcudnn5-doc_5.1.5-1+cuda8.0_ppc64el.deb
-rw-r--r-- 1 u0017496 u0017496 39605204 Nov 14 09:58 libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb
-rw-rw-r-- 1 u0017496 u0017496 196132114 Oct 28 14:53 mldl-repo-local_1-3ibm2_ppc64el.deb

--> Five files are needed in total: one for CUDA, three for cuDNN, and one for PowerAI (mldl). They can be downloaded from the following NVIDIA and IBM pages.

https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el-deb
https://download.boulder.ibm.com/ibmdl/pub/software/server/mldl/mldl-repo-local_1-3ibm2_ppc64el.deb
https://developer.nvidia.com/cudnn

5. Install these debian packages with the dpkg command as shown below

u0017496@sys-84799:~/files$ sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el-deb
u0017496@sys-84799:~/files$ sudo dpkg -i mldl-repo-local_1-3ibm2_ppc64el.deb

--> These commands do not actually install CUDA or PowerAI; they create local APT repositories for CUDA and PowerAI on the Minsky server.

u0017496@sys-84799:~/files$ sudo apt-get update

--> This reads in the newly created local APT repositories for CUDA and PowerAI. Installation can now begin.

6. Install CUDA and PowerAI

u0017496@sys-84799:~/files$ sudo apt-get install cuda

u0017496@sys-84799:~/files$ sudo dpkg -i libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb
u0017496@sys-84799:~/files$ sudo dpkg -i libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb
u0017496@sys-84799:~/files$ sudo dpkg -i libcudnn5-doc_5.1.5-1+cuda8.0_ppc64el.deb

--> Install CUDA first, then the cuDNN debian packages. These must be installed before PowerAI.

u0017496@sys-84799:~/files$ sudo apt-get install power-mldl

--> The various Ubuntu OS prerequisites are pulled in from the local APT repository configured earlier as CUDA and PowerAI are installed.

** Alternative way: Downloading the entire Ubuntu APT repository as described above takes a lot of time and space. It is also possible to download only the OS prerequisites that CUDA and PowerAI need.

First, install the CUDA and PowerAI repositories on an internet-connected Linux on POWER server (Ubuntu 16.04.1 LTS) as shown below.

u0017496@sys-84799:~/files$ sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el-deb
u0017496@sys-84799:~/files$ sudo dpkg -i mldl-repo-local_1-3ibm2_ppc64el.deb

Next, use the apt-cache command to write the list of prerequisite packages needed to install the cuda and power-mldl packages to a file.

u0017496@sys-84793:~/files$ sudo apt-get update

u0017496@sys-84793:~/files$ apt-cache depends --recurse -i cuda > prereqs
u0017496@sys-84793:~/files$ apt-cache depends --recurse -i power-mldl >> prereqs
u0017496@sys-84793:~/files$ cat prereqs | sed -e 's/.*Depends://' -e 's/[<>]//g' -e 's/\s//g' | sort -u > prereqs.txt
u0017496@sys-84793:~/files$ wc -l prereqs.txt
544 prereqs.txt
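
In `apt-cache depends` output, names in angle brackets denote virtual packages, which `apt-get download` cannot fetch by that name. The sed pipeline above strips the brackets and keeps those names in the list. The following is a minimal sketch of a variant that drops the bracketed entries instead and keeps everything else; list_prereqs is a hypothetical helper name, and the result has not been checked against a real cuda or power-mldl dependency tree.

```shell
# list_prereqs
# Hypothetical helper: reads `apt-cache depends --recurse -i` output on
# stdin and prints a sorted, de-duplicated list of package names, leaving
# out every line that names a <virtual> package.
list_prereqs() {
    grep -v '<' |
        sed -e 's/.*Depends://' -e 's/[[:space:]]//g' |
        grep -v '^$' |
        LC_ALL=C sort -u
}

# Example (not run here):
#   apt-cache depends --recurse -i cuda | list_prereqs > prereqs.txt
```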

Download these packages from the APT repositories on the internet with the apt-get download command as follows.

u0017496@sys-84793:~/files$ mkdir os-prereqs
u0017496@sys-84793:~/files$ cd os-prereqs
u0017496@sys-84793:~/files/os-prereqs$ cp ../prereqs.txt .
u0017496@sys-84793:~/files/os-prereqs$ for i in `cat prereqs.txt`
> do
> apt-get download $i
> done

Upload the files obtained here to the Minsky server that has no internet access, then proceed from step 3 above.

* Note that this method has not been fully tested, so as long as time and disk space allow, downloading the entire APT repository is recommended.