Deep Learning for HW Engineers: DDL

Wednesday, January 24, 2018

An unusual error in caffe DDL over InfiniBand, and how to fix it

DDL (Distributed Deep Learning), one of the headline features of caffe-ibm, connects the GPUs in multiple GPU servers through OpenMPI so that a single large model can be trained across them. Naturally, it is heavily affected by the latency and bandwidth of the network linking the servers. Here I configured IP over InfiniBand on two servers with the hostnames minsky1 and minsky2, registered the IP names ib1 and ib2 for those interfaces in /etc/hosts, and ran DDL over the resulting separate high-speed private network.

In this case there are two 4-GPU servers attached to a single network, so the rank file that describes the topology looks like this.

$ cat 4x2x1.rf

rank 0=ib1         slot=0:0-3
rank 2=ib1         slot=0:4-7
rank 4=ib1         slot=1:0-3
rank 6=ib1         slot=1:4-7

rank 1=ib2         slot=0:0-3
rank 3=ib2         slot=0:4-7
rank 5=ib2         slot=1:0-3
rank 7=ib2         slot=1:4-7

But when I actually ran it, I got the following error messages.

$ mpirun -x PATH -x LD_LIBRARY_PATH -n 8 -rf 4x2x1.rf caffe train --solver=solver.prototxt -gpu 0 -ddl "-mode b:4x2x1 -dev_sync 1"
--------------------------------------------------------------------------
Failed to create a completion queue (CQ):

Hostname: minsky1
Requested CQE: 16384
Error: Cannot allocate memory

Check the CQE attribute.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: minsky1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Failed to create a completion queue (CQ):

Hostname: minsky2
Requested CQE: 16384
Error: Cannot allocate memory

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: minsky2
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots. Please review your rank-slot
assignments and your host allocation to ensure a proper match. Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

Host: minsky1
--------------------------------------------------------------------------

There are actually two separate errors here. One is "Cannot allocate memory" and the other is "either not allocated or oversubscribed its slots", and each has a different cause.

1) "Cannot allocate memory" error

This one is caused by the limits settings. As the ulimit output below shows, max locked memory defaults to 64 KB.

$ ulimit -a

core file size (blocks, -c) 0

data seg size (kbytes, -d) unlimited

scheduling priority (-e) 0

file size (blocks, -f) unlimited

pending signals (-i) 15880

max locked memory (kbytes, -l) 64

max memory size (kbytes, -m) unlimited

open files (-n) 1024

pipe size (512 bytes, -p) 8

POSIX message queues (bytes, -q) 819200

real-time priority (-r) 0

stack size (kbytes, -s) 8192

cpu time (seconds, -t) unlimited

max user processes (-u) 15880

virtual memory (kbytes, -v) unlimited

file locks (-x) unlimited

To lift this limit, add memlock entries for the user at the very end of limits.conf as shown below, and then be sure to log in again. That resolves the error.

$ sudo vi /etc/security/limits.conf

...

user1 soft memlock -1

user1 hard memlock -1
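
After logging in again, the new limit can also be checked from Python. The following is only a minimal sketch using the standard `resource` module on Linux; the helper name `describe_memlock` is made up for this illustration and is not part of caffe-ibm or Open MPI.

```python
import resource


def describe_memlock(soft, hard):
    """Render a (soft, hard) RLIMIT_MEMLOCK pair roughly the way `ulimit -l` does."""
    def fmt(value):
        # getrlimit() reports bytes, while `ulimit -l` reports kbytes.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} kbytes"
    return f"soft={fmt(soft)} hard={fmt(hard)}"


if __name__ == "__main__":
    # Query the limits of the current login session.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print(describe_memlock(soft, hard))
```

If the output still shows 64 kbytes, the session was most likely started before limits.conf was edited.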

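2) "either not allocated or oversubscribed its slots" error

I really did not see this one coming. After some googling, it turns out, unexpectedly, that MPI commands such as mpirun seem to be sensitive to the hostname rather than the IP name. The hostnames of these GPU servers, which are also the IP names of their ethernet interfaces, are minsky1 (10.1.1.1) and minsky2 (10.1.1.2), while the IP names of the InfiniBand interfaces are ib1 (9.1.1.1) and ib2 (9.1.1.2). Apparently it does not work when the hostname and the IP name written in the rank file differ.

In that case, changing the hostname of each server to the name of its IB interface (ib1 on the first server, ib2 on the second) fixes the problem.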

$ hostnamectl set-hostname ib1

$ hostnamectl set-hostname ib2
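
The mismatch can be spotted before launching mpirun. The sketch below assumes, as inferred above, that the host names in the rank file have to match each node's hostname. `rankfile_hosts` and `local_host_listed` are hypothetical helper names, not part of Open MPI.

```python
import re
import socket


def rankfile_hosts(text):
    """Return the sorted, de-duplicated host names from 'rank N=host slot=...' lines."""
    pattern = re.compile(r"^rank\s+\d+=(\S+)\s+slot=", re.MULTILINE)
    return sorted(set(pattern.findall(text)))


def local_host_listed(text, hostname=None):
    """True if this node's hostname (or the one given) appears in the rank file."""
    hostname = hostname or socket.gethostname()
    return hostname in rankfile_hosts(text)


if __name__ == "__main__":
    sample = "rank 0=ib1         slot=0:0-3\nrank 1=ib2         slot=0:0-3\n"
    print(rankfile_hosts(sample))
    print(local_host_listed(sample, "minsky1"))  # hostname before the rename
```

Run on each server against the real rank file, a False result would point to the situation described in 2).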

Friday, October 27, 2017

Alexnet training with caffe DDL

In my September post (http://hwengineer.blogspot.kr/2017/09/ibm-powerai-40-caffe-distributed-deep.html) I briefly explained DDL (Distributed Deep Learning), the MPI-based distributed processing feature included in PowerAI. This time I will use it to train caffe alexnet on the 1.28-million-image ILSVRC2012 dataset.

I could only borrow a single Minsky server for a short while, so although DDL can normally tie several Minsky servers together to train one model, here I will run caffe DDL on just one server. Wait, one server? Does MPI distributed processing mean anything on a single machine?

Yes, it is not meaningless. For multi-GPU training, the difference between regular caffe and caffe DDL is multi-thread versus multi-process. Put more simply, in regular caffe a single caffe process uses multiple GPUs through P2P. In caffe DDL, by contrast, a separate caffe process is started for each GPU, and those processes communicate with each other over MPI to use the GPUs together.

The figure below illustrates this.

Indeed, if you watch with nvidia-smi, regular caffe shows the same caffe PID on every GPU, whereas with caffe DDL the caffe PID using each GPU is different.

Regular caffe:

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 30681 C caffe 15497MiB |
| 1 30681 C caffe 14589MiB |
| 2 30681 C caffe 14589MiB |
| 3 30681 C caffe 14589MiB |
+-----------------------------------------------------------------------------+

caffe DDL:

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 31227 C caffe 15741MiB |
| 1 31228 C caffe 14837MiB |
| 2 31229 C caffe 14837MiB |
| 3 31230 C caffe 14837MiB |
+-----------------------------------------------------------------------------+

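Now that you have a rough idea of the difference, let's train alexnet once with regular caffe and once with caffe DDL. Both tests use the same Minsky server, i.e. a single 4-GPU system. Performance is measured as the time taken to train on the 1.28 million images for 2 epochs, i.e. two full passes.

Regular caffe: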

test@ubuntu:/nvme$ caffe train --solver=models/bvlc_alexnet/solver.prototxt -gpu all

caffe DDL:

test@ubuntu:/nvme$ mpirun -x PATH -x LD_LIBRARY_PATH -n 4 -rf 4x1x1.rf caffe train --solver=models/bvlc_alexnet/solver.prototxt -gpu 0 -ddl "-mode n:4x1x1 -dev_sync 1"

To repeat what I briefly explained last time:

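- mpirun is a parallel-environment command that runs the same command on multiple server nodes with the same environment variables (the -x option).
- The file named 4x1x1.rf is the rank file. It contains the topology of the parallel server environment.
- -n 4 is the total number of MPI clients; simply put, it is the number of GPUs to use for training.
- You may wonder why -gpu 0 specifies a single GPU rather than four. In an MPI environment each GPU becomes one learner, so no matter how many GPUs are physically installed in a server, every process is given -gpu 0, i.e. one GPU.
- In "-mode n:4x1x1", n means use NCCL; specifying b instead asks DDL to use the enhanced NCCL whenever possible. 4x1x1 means that one server with 4 GPUs sits in a single rack.
- For dev_sync, 0 means do not sync between GPUs, 1 means sync when communication starts, and 2 means sync both at the start and at the end.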

The rank file 4x1x1.rf used here contains the following.

rank 0=minsky slot=0:0-3
rank 1=minsky slot=0:4-7
rank 2=minsky slot=1:0-3
rank 3=minsky slot=1:4-7

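To process 1.28 million images x 2 epochs, the product of the batch_size and max_iter values set in solver.prototxt and train_val.prototxt has to be 1.28 million x 2 = 2.56 million images. Training speed changes quite a bit with batch_size, so here I test by increasing it from 256 to 512, 768, and so on.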
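
As a quick check of that arithmetic, the sketch below derives max_iter for each batch_size from the rounded figure of 1.28 million images per epoch used in this post. The function name `max_iter_for` is only illustrative.

```python
IMAGES_PER_EPOCH = 1_280_000  # rounded dataset size used in this post
EPOCHS = 2


def max_iter_for(batch_size, images=IMAGES_PER_EPOCH, epochs=EPOCHS):
    """Smallest max_iter such that batch_size * max_iter covers images * epochs."""
    total = images * epochs
    return (total + batch_size - 1) // batch_size  # integer ceiling division


if __name__ == "__main__":
    for batch_size in (256, 512, 768, 1024):
        print(batch_size, max_iter_for(batch_size))
```

768 does not divide 2.56 million evenly, so the result is rounded up for that case.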

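As the results table shows, regular caffe was faster at small batch_size values, but as batch_size grows caffe DDL gets progressively faster and eventually overtakes it. There seems to be some correlation between batch_size and the performance of MPI-based DDL, but I have not yet figured out the reason. I asked the Lab, and the answer I got was that it should be unrelated to batch_size.

What happens if we raise batch_size beyond 1024?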

...
F1025 17:48:43.059572 30265 syncedmem.cpp:651] Check failed: error == cudaSuccess (2 vs. 0) out of memory
F1025 17:48:43.071281 30285 syncedmem.cpp:651] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x3fffb645ce0c google::LogMessage::Fail()
@ 0x3fffb69649cc caffe::Solver<>::Step()
@ 0x3fffb645f284 google::LogMessage::SendToLog()
...

Whether it is regular caffe or caffe DDL, a batch_size of just 1100 is enough to make it die with an out-of-memory (OOM) error like this. So unfortunately larger batch sizes cannot be tested this way.

If the OOM error is caused by an overly large batch_size, there is a way around that too: LMS (large model support), which is also included in caffe-ibm. With the -lms option as shown below, both caffe and caffe DDL run comfortably up to about batch_size=1200. -lms 800000 means that memory chunks of 800000 KB or larger should be kept on the CPU rather than the GPU. (See http://hwengineer.blogspot.kr/2017/09/inference-gpu-sizing-ibm-caffe-large.html)

Regular caffe with LMS:

test@ubuntu:/nvme$ caffe train -lms 800000 --solver=models/bvlc_alexnet/solver.prototxt -gpu all

caffe DDL with LMS:

test@ubuntu:/nvme$ mpirun -x PATH -x LD_LIBRARY_PATH -n 4 -rf 4x1x1.rf caffe train -lms 800000 --solver=models/bvlc_alexnet/solver.prototxt -gpu 0 -ddl "-mode n:4x1x1 -dev_sync 1"

The results are below. Clearly, the larger the batch_size, the better caffe DDL performs relative to regular caffe.

For the curious, I have pasted the beginning and part of the end of the messages printed by a caffe DDL run below.

--------------------------------------------------------------------------

[[31653,1],2]: A high-performance Open MPI point-to-point messaging module

was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)

Host: minsky

Another transport will be used instead, although this may result in

lower performance.

--------------------------------------------------------------------------

ubuntu: n0(0) n1(0) n2(0) n3(0)

I1025 18:59:45.681555 31227 caffe.cpp:151] [MPI:0 ] spreading GPUs per MPI rank

I1025 18:59:45.681725 31227 caffe.cpp:153] [MPI:0 ] use gpu[0]

I1025 18:59:45.681541 31228 caffe.cpp:151] [MPI:1 ] spreading GPUs per MPI rank

I1025 18:59:45.681725 31228 caffe.cpp:153] [MPI:1 ] use gpu[1]

I1025 18:59:45.681541 31229 caffe.cpp:151] [MPI:2 ] spreading GPUs per MPI rank

I1025 18:59:45.681726 31229 caffe.cpp:153] [MPI:2 ] use gpu[2]

I1025 18:59:45.681735 31229 caffe.cpp:283] Using GPUs 2

I1025 18:59:45.681541 31230 caffe.cpp:151] [MPI:3 ] spreading GPUs per MPI rank

I1025 18:59:45.681726 31230 caffe.cpp:153] [MPI:3 ] use gpu[3]

I1025 18:59:45.681735 31230 caffe.cpp:283] Using GPUs 3

I1025 18:59:45.681735 31227 caffe.cpp:283] Using GPUs 0

I1025 18:59:45.681733 31228 caffe.cpp:283] Using GPUs 1

I1025 18:59:45.683846 31228 caffe.cpp:288] GPU 1: Tesla P100-SXM2-16GB

I1025 18:59:45.683897 31227 caffe.cpp:288] GPU 0: Tesla P100-SXM2-16GB

I1025 18:59:45.683955 31230 caffe.cpp:288] GPU 3: Tesla P100-SXM2-16GB

I1025 18:59:45.684010 31229 caffe.cpp:288] GPU 2: Tesla P100-SXM2-16GB

I1025 18:59:46.056959 31227 caffe.cpp:302] [MPI:0 ] name = minsky root = 1

I1025 18:59:46.067212 31228 caffe.cpp:302] [MPI:1 ] name = minsky root = 1

I1025 18:59:46.070734 31230 caffe.cpp:302] [MPI:3 ] name = minsky root = 1

I1025 18:59:46.071211 31229 caffe.cpp:302] [MPI:2 ] name = minsky root = 1

I1025 18:59:46.073958 31227 solver.cpp:44] Initializing solver from parameters:

test_iter: 1000

test_interval: 1000

base_lr: 0.01

display: 500

max_iter: 2500

lr_policy: "step"

gamma: 0.1

...(omitted)...

I1025 19:24:53.536928 31227 solver.cpp:414] Test net output #0: accuracy = 0.20032

I1025 19:24:53.536965 31227 solver.cpp:414] Test net output #1: loss = 4.09802 (* 1 = 4.09802 loss)

I1025 19:24:54.180562 31227 solver.cpp:223] Iteration 2000 (1.27922 iter/s, 390.864s/500 iters), loss = 4.18248

I1025 19:24:54.180598 31227 solver.cpp:242] Train net output #0: loss = 4.18248 (* 1 = 4.18248 loss)

I1025 19:24:54.180613 31227 sgd_solver.cpp:121] Iteration 2000, lr = 0.01

I1025 19:26:57.349701 31256 data_layer.cpp:86] Restarting data prefetching from start.

I1025 19:30:28.547333 31256 data_layer.cpp:86] Restarting data prefetching from start.

I1025 19:30:29.081480 31227 solver.cpp:466] Snapshotting to binary proto file models/bvlc_alexnet/caffe_alexnet_train_iter_2500.caffemodel

I1025 19:30:29.283386 31228 solver.cpp:315] Iteration 2500, loss = 3.91634

I1025 19:30:29.283444 31228 solver.cpp:320] Optimization Done.

I1025 19:30:29.283535 31230 solver.cpp:315] Iteration 2500, loss = 3.9612

I1025 19:30:29.283582 31230 solver.cpp:320] Optimization Done.

I1025 19:30:29.285512 31228 caffe.cpp:357] Optimization Done.

I1025 19:30:29.285521 31228 caffe.cpp:359] [MPI:1 ] MPI_Finalize

I1025 19:30:29.285697 31230 caffe.cpp:357] Optimization Done.

I1025 19:30:29.285706 31230 caffe.cpp:359] [MPI:3 ] MPI_Finalize

I1025 19:30:29.286912 31229 solver.cpp:315] Iteration 2500, loss = 3.90313

I1025 19:30:29.286952 31229 solver.cpp:320] Optimization Done.

I1025 19:30:29.290489 31229 caffe.cpp:357] Optimization Done.

I1025 19:30:29.290498 31229 caffe.cpp:359] [MPI:2 ] MPI_Finalize

I1025 19:30:29.973234 31227 sgd_solver.cpp:356] Snapshotting solver state to binary proto file models/bvlc_alexnet/caffe_alexnet_train_iter_2500.solverstate

I1025 19:30:30.727695 31227 solver.cpp:315] Iteration 2500, loss = 3.89638

I1025 19:30:30.727744 31227 solver.cpp:320] Optimization Done.

I1025 19:30:30.729465 31227 caffe.cpp:357] Optimization Done.

I1025 19:30:30.729475 31227 caffe.cpp:359] [MPI:0 ] MPI_Finalize

Friday, September 15, 2017

Parallel processing for caffe and tensorflow using DDL in PowerAI 4.0

Let's look at how to actually use the DDL (Distributed Deep Learning) included in PowerAI 4.0.

First, the DDL option is built into the IBM version of caffe (caffe-ibm), so there is no separate debian package to install. caffe-ibm uses OpenMPI internally, so the related libraries do have to be installed, but they are installed automatically along with caffe-ibm and you do not need to worry about them.

nimbix@JARVICENAE-0A0A1835:/data/mnist$ dpkg -l | grep openmpi
ii libopenmpi2-cuda:ppc64el 2.0.1-4ibm1 ppc64el high performance message passing library -- shared library
ii openmpi-bin-cuda 2.0.1-4ibm1 ppc64el high performance message passing library -- binaries
ii openmpi-common-cuda 2.0.1-4ibm1 all high performance message passing library -- common files
ii openmpi-doc-cuda 2.0.1-4ibm1 all high performance message passing library -- man pages

Once CUDA-aware OpenMPI is installed as shown above, the MPI utility mpirun becomes available. mpirun is a chain of several soft links to the orterun command, which is ultimately provided by openmpi-bin-cuda as shown below.

nimbix@JARVICENAE-0A0A1835:/data/mnist$ dpkg -S /usr/bin/orterun
openmpi-bin-cuda: /usr/bin/orterun

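Using DDL with the IBM version of caffe is simple once you know how. You only need to know the following four things.

1) Pass the -ddl option when running the caffe command
2) The train/validation dataset must exist at the same location (directory) on every server (a parallel file system or NFS is convenient)
3) ssh-keygen and ssh-copy-id must have been set up so that every server can be reached over ssh without a password
4) To pass environment variables and the like to the other server nodes, it is convenient to use the mpirun command

The rest are all easy, but item 1) may feel a little difficult. Leaving out the complicated parts, it comes down to this.

Using the DDL option does not mean that caffe can understand your GPU servers and network environment by itself and optimize for them automatically. You have to tell caffe about that environment, i.e. the topology, yourself. That is what the mode of the -ddl option is for. A simple example follows.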

$ mpirun -x PATH -x LD_LIBRARY_PATH -n 12 -rf 4x3.rf caffe train -solver /data/mnist/lenet_solver.prototxt -gpu 0 -bvlc -ddl "-mode n:4x3x1 -dev_sync 1"

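- mpirun is a parallel-environment command that runs the same command on multiple server nodes with the same environment variables (the -x option).
- The file named 4x3.rf is the rank file. It contains the topology of the parallel server environment. How to create it is covered below.
- -n 12 is the total number of MPI clients; simply put, it is the number of GPUs to use for training.
- You may wonder why -gpu 0 specifies a single GPU rather than twelve. In an MPI environment each GPU becomes one learner, so no matter how many GPUs are physically installed in a server, every process is given -gpu 0, i.e. one GPU.
- In "-mode n:4x3x1", n means use NCCL (NVIDIA Collective Communications Library, pronounced "Nickel"). 4x3x1 means that three servers with 4 GPUs each sit in a single rack. Which rack a server is in does not really matter in itself, but in parallel supercomputer environments the servers within one rack are usually connected by a faster, lower-latency network, which is why the rack dimension is included. If you had 5 racks each holding 6 servers with 4 GPUs, that would be written as 4x6x5.
- For dev_sync, 0 means do not sync between GPUs, 1 means sync when communication starts, and 2 means sync both at the start and at the end.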
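
The relation between -n and the mode dimensions can be checked with a few lines of Python. This is only an illustrative sketch: `parse_ddl_mode` and `learners` are made-up helper names, and the parsing covers only the `method:AxBxC` form seen in this post, not the full DDL option syntax.

```python
import math


def parse_ddl_mode(mode):
    """Split a -mode value such as 'n:4x3x1' into (method, dims)."""
    method, _, topology = mode.partition(":")
    dims = tuple(int(d) for d in topology.split("x"))
    return method, dims


def learners(dims):
    """Total number of learners (GPUs), i.e. the value expected for mpirun -n."""
    return math.prod(dims)


if __name__ == "__main__":
    method, dims = parse_ddl_mode("n:4x3x1")
    print(method, dims, learners(dims))
```

For the 4x6x5 example above, the same calculation gives 120 learners.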

Wait, how do you specify where the data is? The solver file given above, lenet_solver.prototxt, names the neural network, and the prototxt file of that neural network in turn specifies the data location, like this.

$ vi lenet_solver.prototxt
#net: "examples/mnist/lenet_train_test.prototxt"
net: "/data/mnist/lenet_train_test.prototxt"

$ vi lenet_train_test.prototxt
...
# source: "examples/mnist/mnist_train_lmdb"
source: "/data/mnist/mnist_train_lmdb"
...
# source: "examples/mnist/mnist_test_lmdb"
source: "/data/mnist/mnist_test_lmdb"

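How do the learners running on each GPU across the server nodes divide up the data? Preferably, place pre-partitioned, appropriately distributed data on each server node in advance. When N learners read the data, each one reads sequentially from a list of shuffled file names; if the data has not been physically partitioned and distributed per node beforehand, finishing one training pass simply has the same effect as training on the whole dataset N times (N epochs). The data store should be a parallel file system such as IBM Spectrum Scale (formerly GPFS) so that several nodes can access it at the same time, or, if that is not available, something like NFS even though performance will be lower.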
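
One simple way to picture such partitioning is a strided split of the file list by rank. The sketch below only illustrates the idea and is not how caffe-ibm itself distributes data; `shard_for_rank` and the file names are hypothetical.

```python
def shard_for_rank(files, rank, size):
    """Give learner `rank` every size-th file, so the shards are disjoint."""
    if not 0 <= rank < size:
        raise ValueError("rank must be in [0, size)")
    return files[rank::size]


if __name__ == "__main__":
    files = [f"img_{i:04d}.jpg" for i in range(10)]
    for rank in range(4):
        print(rank, shard_for_rank(files, rank, 4))
```

With disjoint shards like these, one pass by all N learners covers the dataset once rather than N times.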

Now let's see how to create that rf file, i.e. the rank file. You can write it by hand with something like the vi editor, but it is easier to generate it with rank_gen.py, which ships with PowerAI, as follows.

$ python /opt/DL/ddl/bin/rank_gen.py 4x2x3 sys-89074,sys-89075,sys-89076,sys-89077,sys-89078,sys-89079 > 4x2x3.rf

The comma-separated names above are the server names. Since 4x2x3 means a total of six servers with 4 GPUs each, you must list exactly six server names. The resulting 4x2x3.rf file looks like the listing below. By default rank_gen.py assumes a Minsky server with two 10-core POWER8 chips, so it emits two sockets with 10 cores each. Each rank, i.e. each GPU, therefore gets five cores (0-4 or 5-9); if your servers have 8-core POWER8 chips instead, you have to edit the ranges by hand, changing 0-4 to 0-3 and 5-9 to 4-7.

u0017649@sys-89075:~$ cat 4x2x3.rf
#2017-09-14 04:45:51 by rank_gen
#dims = 4x2x3
#host = sys-89074,sys-89075,sys-89076,sys-89077,sys-89078,sys-89079
#dimX = 4
#dimY = 2
#dimZ = 3
#sockets = 2
#cores = 10

rank 0=sys-89074 slot=0:0-4
rank 6=sys-89074 slot=0:5-9
rank 12=sys-89074 slot=1:0-4
rank 18=sys-89074 slot=1:5-9

rank 3=sys-89075 slot=0:0-4
rank 9=sys-89075 slot=0:5-9
rank 15=sys-89075 slot=1:0-4
rank 21=sys-89075 slot=1:5-9

rank 1=sys-89076 slot=0:0-4
rank 7=sys-89076 slot=0:5-9
rank 13=sys-89076 slot=1:0-4
rank 19=sys-89076 slot=1:5-9

rank 4=sys-89077 slot=0:0-4
rank 10=sys-89077 slot=0:5-9
rank 16=sys-89077 slot=1:0-4
rank 22=sys-89077 slot=1:5-9

rank 2=sys-89078 slot=0:0-4
rank 8=sys-89078 slot=0:5-9
rank 14=sys-89078 slot=1:0-4
rank 20=sys-89078 slot=1:5-9

rank 5=sys-89079 slot=0:0-4
rank 11=sys-89079 slot=0:5-9
rank 17=sys-89079 slot=1:0-4
rank 23=sys-89079 slot=1:5-9
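
The rank numbering in this file follows a regular pattern. The sketch below is not the actual rank_gen.py; it is a reconstruction inferred only from the rank files shown in this post, and the function name `make_rankfile` and its `sockets`/`cores` parameters are assumptions made for illustration.

```python
def make_rankfile(dims, hosts, sockets=2, cores=10):
    """Build rank-file lines for dims 'XxYxZ' (GPUs x servers per rack x racks)."""
    dim_x, dim_y, dim_z = (int(d) for d in dims.split("x"))
    if len(hosts) != dim_y * dim_z:
        raise ValueError("need exactly dimY * dimZ host names")
    ranks_per_socket = dim_x // sockets
    cores_per_rank = cores // ranks_per_socket
    blocks = []
    for index, host in enumerate(hosts):
        # Hosts are listed rack by rack: index = z * dimY + y.
        z, y = divmod(index, dim_y)
        lines = []
        for gpu in range(dim_x):
            # Inferred pattern: consecutive ranks cycle over hosts, then GPUs.
            rank = gpu * dim_y * dim_z + y * dim_z + z
            socket, slot = divmod(gpu, ranks_per_socket)
            first = slot * cores_per_rank
            last = first + cores_per_rank - 1
            lines.append(f"rank {rank}={host} slot={socket}:{first}-{last}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    names = ["sys-89074", "sys-89075", "sys-89076",
             "sys-89077", "sys-89078", "sys-89079"]
    print(make_rankfile("4x2x3", names))
```

Passing cores=8 produces the 0-3 and 4-7 ranges needed for servers with 8-core chips.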

Caffe is that easy, but tensorflow is a bit harder. PowerAI 4.0 includes a separate debian package called ddl-tensorflow, but it is not strictly required for tensorflow DDL; rather, it provides Google Slim-based scripts and example files that make tensorflow DDL easier to use. tensorflow itself has to be installed separately, which you can do by installing the tensorflow package provided with PowerAI using apt-get install.

$ sudo apt-get install ddl-tensorflow tensorflow

$ dpkg -L ddl-tensorflow
/.
/opt
/opt/DL
/opt/DL/ddl-tensorflow
/opt/DL/ddl-tensorflow/examples
/opt/DL/ddl-tensorflow/examples/mnist
/opt/DL/ddl-tensorflow/examples/mnist/ddl_mnist.py
/opt/DL/ddl-tensorflow/examples/mnist/README.md
/opt/DL/ddl-tensorflow/examples/slim
/opt/DL/ddl-tensorflow/examples/slim/BUILD
/opt/DL/ddl-tensorflow/examples/slim/WORKSPACE
/opt/DL/ddl-tensorflow/examples/slim/scripts
/opt/DL/ddl-tensorflow/examples/slim/scripts/finetune_inception_resnet_v2_on_flowers.sh
/opt/DL/ddl-tensorflow/examples/slim/scripts/train_lenet_on_mnist.sh
/opt/DL/ddl-tensorflow/examples/slim/scripts/finetune_resnet_v1_50_on_flowers.sh
/opt/DL/ddl-tensorflow/examples/slim/scripts/finetune_inception_v3_on_flowers.sh
/opt/DL/ddl-tensorflow/examples/slim/scripts/train_cifarnet_on_cifar10.sh
/opt/DL/ddl-tensorflow/examples/slim/scripts/finetune_inception_v1_on_flowers.sh
/opt/DL/ddl-tensorflow/examples/slim/train-alexnet.sh
/opt/DL/ddl-tensorflow/examples/slim/deployment
/opt/DL/ddl-tensorflow/examples/slim/deployment/__init__.py
...

To use ddl-tensorflow, run ddl-tensorflow-activate with the source command as shown below so that environment variables such as PYTHONPATH are set.

$ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate

You can now use the ddl-tensorflow-install-samples command to install the samples into a directory of your choice.

nimbix@JARVICENAE-0A0A1835:~$ ddl-tensorflow-install-samples /data
Write into existing directory /data? (yN)
y
Copying examples/ into /data...
Success

The simplest one included is MNIST, which recognizes handwritten digits.

nimbix@JARVICENAE-0A0A1835:~$ cd /data/examples/mnist

nimbix@JARVICENAE-0A0A1835:/data/examples/mnist$ ls
ddl_mnist.py README.md

As you can see, tensorflow is less a command than a library called from python, so to do multi-node parallel processing you ultimately have to write a python script in the style of ddl_mnist.py above.

Assuming an environment with two 4-GPU servers (sys-89074 and sys-89075), first create the rank file as follows.

nimbix@JARVICENAE-0A0A1835:/data/examples/mnist$ python /opt/DL/ddl/bin/rank_gen.py 4x2x1 sys-89074,sys-89075 > 4x2.rf

nimbix@JARVICENAE-0A0A1835:/data/examples/mnist$ cat 4x2.rf
#2017-09-15 03:19:14 by rank_gen
#dims = 4x2x1
#host = sys-89074,sys-89075
#dimX = 4
#dimY = 2
#dimZ = 1
#sockets = 2
#cores = 10

rank 0=sys-89074 slot=0:0-4
rank 2=sys-89074 slot=0:5-9
rank 4=sys-89074 slot=1:0-4
rank 6=sys-89074 slot=1:5-9

rank 1=sys-89075 slot=0:0-4
rank 3=sys-89075 slot=0:5-9
rank 5=sys-89075 slot=1:0-4
rank 7=sys-89075 slot=1:5-9

Now run it as follows.

nimbix@JARVICENAE-0A0A1835:/data/examples/mnist$ mpirun -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -n 8 -rf 4x2.rf python ddl_mnist.py

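(In fact, mnist uses such a small dataset that parallelizing it is pointless. Perhaps for that reason, ddl_mnist.py cannot use the 4x2 layout from my example at all; as you can see further below, it is written with -mode r:2, so it can only be parallelized across 2 GPUs.)

In the end, the question is how to write a python script that runs tensorflow in parallel, and since I am not a developer I cannot be of much help there. (Honestly, to me the white is text, the black is blank space, and the blinking thing is the cursor; that is about all I can make out.)

Instead, although it is rather long, I am posting the contents of the ddl_mnist.py included in PowerAI below, unchanged.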

nimbix@JARVICENAE-0A0A1835:/data/examples/mnist$ vi ddl_mnist.py

import tensorflow as tf
import numpy as np

############################################################################
# IBM PowerAI Distributed Deep Learning (DDL) setup
############################################################################

# Disable GPU memory preallocation
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

############################################################################
# DDL Initialize BEGIN
############################################################################
# Load DDL operator
ddl = tf.load_op_library('/opt/DL/ddl-tensorflow/lib/ddl_MDR.so')

# DDL initializes MPI on CPU
# ddl.init takes two inputs
# 1) the number of GPUs to utilize on each host in training.
# this number is not the number of GPUs to use for each leaner. It simply tells DDL that there are X GPUs in each host to be used for training
# 2) DDL options (refer to README for details)
with tf.Session(config=config) as sess:
    with tf.device('/cpu:0'):
        rank, size, gpuid = sess.run(ddl.init(2, mode = '-mode r:2 -dump_iter 100'))

# MPI info and assigned GPU
print [rank, size, gpuid]
############################################################################
# DDL Initialize END
############################################################################

# Perform all TensorFlow computation within gpuid
with tf.device('/gpu:%d' %gpuid):
    ##############################################################################
    # Import MNIST data

    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

    # Parameters
    learning_rate = 0.001
    training_iters = 200000
    batch_size = 100
    display_step = 1

    # Network Parameters
    n_input = 784 # MNIST data input (img shape: 28*28)
    n_classes = 10 # MNIST total classes (0-9 digits)
    dropout = 0.75 # Dropout, probability to keep units

    # tf Graph input
    x = tf.placeholder(tf.float32, [None, n_input])
    y = tf.placeholder(tf.float32, [None, n_classes])
    keep_prob = tf.placeholder(tf.float32) #dropout (keep probability)

    # Create some wrappers for simplicity
    def conv2d(x, W, b, strides=1):
        # Conv2D wrapper, with bias and relu activation
        x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
        x = tf.nn.bias_add(x, b)
        return tf.nn.relu(x)

    def maxpool2d(x, k=2):
        # MaxPool2D wrapper
        return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1],
                              padding='SAME')

    # Create model
    def conv_net(x, weights, biases, dropout):
        # Reshape input picture
        x = tf.reshape(x, shape=[-1, 28, 28, 1])

        # Convolution Layer
        conv1 = conv2d(x, weights['wc1'], biases['bc1'])
        # Max Pooling (down-sampling)
        conv1 = maxpool2d(conv1, k=2)

        # Convolution Layer
        conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
        # Max Pooling (down-sampling)
        conv2 = maxpool2d(conv2, k=2)

        # Fully connected layer
        # Reshape conv2 output to fit fully connected layer input
        fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
        fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
        fc1 = tf.nn.relu(fc1)
        # Apply Dropout
        fc1 = tf.nn.dropout(fc1, dropout)

        # Output, class prediction
        out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
        return out

    # Store layers weight & bias
    weights = {
        ############################################################################
        # DDL BROADCAST BEGIN
        ############################################################################
        # This step ensures that all learners start with the same initial parameters

        # 5x5 conv, 1 input, 32 outputs
        'wc1': tf.Variable(ddl.bcast(tf.random_normal([5, 5, 1, 32]))),
        # 5x5 conv, 32 inputs, 64 outputs
        'wc2': tf.Variable(ddl.bcast(tf.random_normal([5, 5, 32, 64]))),
        # fully connected, 7*7*64 inputs, 1024 outputs
        'wd1': tf.Variable(ddl.bcast(tf.random_normal([7*7*64, 1024]))),
        # 1024 inputs, 10 outputs (class prediction)
        'out': tf.Variable(ddl.bcast(tf.random_normal([1024, n_classes])))
        ############################################################################
        # DDL BROADCAST END
        ############################################################################
    }

    biases = {
        'bc1': tf.Variable(ddl.bcast(tf.random_normal([32]))),
        'bc2': tf.Variable(ddl.bcast(tf.random_normal([64]))),
        'bd1': tf.Variable(ddl.bcast(tf.random_normal([1024]))),
        'out': tf.Variable(ddl.bcast(tf.random_normal([n_classes])))
    }

    # Construct model
    pred = conv_net(x, weights, biases, keep_prob)

    # Define loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

    ############################################################################
    # DDL ALLREDUCE BEGIN
    ############################################################################

    # Collect the gradients and the corresponding parameters w.r.t the given cost
    grads_and_vars = optimizer.compute_gradients(cost)

    # Separate out the tuple
    grads, vars = zip(*grads_and_vars)

    # This step takes the average of the gradients on all the learners
    grads_and_vars_ddl = zip(ddl.all_reduce_n(grads, op='avg'), vars)

    # Update the parameters with the averaged gradient
    objective = optimizer.apply_gradients(grads_and_vars_ddl)

    ############################################################################
    # DDL ALLREDUCE END
    ############################################################################

    # Evaluate model
    correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    ##############################################################################

def split(a, n):
    k, m = divmod(len(a), n)
    return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in xrange(n))

# Launch the graph
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    step = 1
    # Keep training until reach max iterations
    while step * batch_size < training_iters:

        # Each learner will read batch_size*size samples and
        # use only the portion correspoding to the current learner (or rank)

        batch_x, batch_y = mnist.train.next_batch(batch_size*size)

        batch_x = np.split(batch_x,size)[rank]
        batch_y = np.split(batch_y,size)[rank]

        # Run optimization op (backprop)
        sess.run(objective, feed_dict={x: batch_x, y: batch_y,
                                       keep_prob: dropout})
        if step % display_step == 0:
            # Calculate batch loss and accuracy
            loss, acc = sess.run([cost, accuracy], feed_dict={x: batch_x,
                                                              y: batch_y,
                                                              keep_prob: 1.})
            print("MPI "+str(rank)+"] Iter " + str(step*batch_size) + ", Minibatch Loss= " + \
                  "{:.6f}".format(loss) + ", Training Accuracy= " + \
                  "{:.5f}".format(acc))
        step += 1

    print("MPI "+str(rank)+"] Optimization Finished!")

    # Calculate accuracy for 256 mnist test images
    print("MPI "+str(rank)+"] Testing Accuracy:", \
        sess.run(accuracy, feed_dict={x: mnist.test.images[:256],
                                      y: mnist.test.labels[:256],
                                      keep_prob: 1.}))
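
The two DDL-specific steps in the script, slicing the global batch per rank and averaging the gradients across learners, can be illustrated without TensorFlow or MPI. The sketch below is a plain-Python toy with a one-parameter model; every name in it is made up for illustration and none of it is part of the DDL API. It shows that averaging the per-learner gradients of equal-sized slices gives the same result as computing the gradient on the whole batch.

```python
def mean(values):
    return sum(values) / len(values)


def local_gradient(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the samples held by one learner."""
    return mean([2 * (w * x - y) * x for x, y in zip(xs, ys)])


def split_even(seq, size):
    """Equal contiguous chunks, the same idea as np.split(batch, size)."""
    n = len(seq) // size
    return [seq[i * n:(i + 1) * n] for i in range(size)]


# Toy global batch, shared by two simulated learners.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
size = 2

# Each "rank" computes a gradient on its own slice only.
per_rank = [local_gradient(w, cx, cy)
            for cx, cy in zip(split_even(xs, size), split_even(ys, size))]

# An averaging all-reduce hands every rank the mean of those gradients.
averaged = mean(per_rank)
full_batch = local_gradient(w, xs, ys)

if __name__ == "__main__":
    print(per_rank, averaged, full_batch)
```

This is why each learner in ddl_mnist.py can work on only its own portion of the batch while all learners still apply the same update.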