Deep Learning for HW Engineers: Caffe

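Wednesday, August 22, 2018

LMS test with caffe-ibm in PowerAI 5.2 (cifar10)

First, prepare the cifar10 dataset. Running get_cifar10.sh, which normally ships with caffe, downloads it in one go. Note that the script fetches the binary version (cifar-10-binary.tar.gz); the lmdb files referenced in the prototxt are produced by examples/cifar10/create_cifar10.sh, as shown in the March 13, 2018 post further down this page.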

[bsyu@p57a22 caffe-ibm]$ pwd
/opt/DL/caffe-ibm

[bsyu@p57a22 caffe-ibm]$ ./data/cifar10/get_cifar10.sh
Downloading...
--2018-08-21 04:14:32-- http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
Resolving www.cs.toronto.edu (www.cs.toronto.edu)... 128.100.3.30
Connecting to www.cs.toronto.edu (www.cs.toronto.edu)|128.100.3.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170052171 (162M) [application/x-gzip]
Saving to: ‘cifar-10-binary.tar.gz’

100%[============================================================================>] 170,052,171 37.4MB/s in 4.6s

2018-08-21 04:14:37 (35.4 MB/s) - ‘cifar-10-binary.tar.gz’ saved [170052171/170052171]

Unzipping...
Done.

To see the benefit of LMS (Large Model Support), the images should be large, but cifar10 consists of 60,000 32x32 color images, which are very small. Instead, we will raise batch_size, the number of images the GPU processes at once, until GPU memory is full.

[bsyu@p57a22 caffe-ibm]$ vi examples/cifar10/cifar10_full_train_test.prototxt
...
data_param {
source: "examples/cifar10/cifar10_train_lmdb"
batch_size: 22000 # originally 100
backend: LMDB
}

[bsyu@p57a22 caffe-ibm]$ which caffe
/opt/DL/caffe-ibm/bin/caffe

Now run it. With batch_size: 22000, training still completes without LMS.

[bsyu@p57a22 caffe-ibm]$ caffe train -gpu 2 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
I0821 03:51:25.516746 52459 solver.cpp:497] Snapshotting to HDF5 file examples/cifar10/cifar10_full_iter_200.caffemodel.h5
I0821 03:51:28.164131 52459 sgd_solver.cpp:373] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_full_iter_200.solverstate.h5
I0821 03:51:28.753823 52459 solver.cpp:336] Iteration 200, loss = 1.71708
I0821 03:51:28.753847 52459 solver.cpp:341] Optimization Done.
I0821 03:51:28.753859 52459 caffe.cpp:421] Optimization Done.

At this point nvidia-smi shows that GPU memory is almost completely full.

Now raise batch_size to 24000 and run the same training again. This time it fails with an out-of-memory error.

[bsyu@p57a22 caffe-ibm]$ caffe train -gpu 2 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
F0821 04:26:41.953693 60726 syncedmem.cpp:569] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x3fffa30acf8c (unknown)
@ 0x3fffa30afa0c (unknown)
@ 0x3fffa30ac9b4 (unknown)
@ 0x3fffa30b0634 (unknown)
@ 0x3fffaac7c154 caffe::SyncedMemory::get_gpu_ptr()
@ 0x3fffaac77650 caffe::SyncedMemory::mutable_gpu_data()
@ 0x3fffaaa8aa28 caffe::Blob<>::mutable_gpu_diff()
@ 0x3fffaaced94c caffe::PoolingLayer<>::Backward_gpu()
@ 0x3fffaac207a4 caffe::Net<>::BackwardFromTo()
@ 0x3fffaac20ad8 caffe::Net<>::Backward()
@ 0x3fffaac5aabc caffe::Solver<>::Step()
@ 0x3fffaac5b328 caffe::Solver<>::Solve()
@ 0x1000e9e4 train()
@ 0x1000a848 main
@ 0x3fff88155100 generic_start_main.isra.0
@ 0x3fff881552f4 __libc_start_main
@ (nil) (unknown)

Keep batch_size: 24000, but add the -lms option when training. This time LMS-related messages appear and training completes without problems.

[bsyu@p57a22 caffe-ibm]$ caffe train -gpu 2 -lms --solver=examples/cifar10/cifar10_full_solver.prototxt
…
I0821 04:30:40.643229 75342 syncedmem.cpp:349] [LMS:2] allocate: size=786432008 [count=80 demanded=159% allocated=35% available=19%]
I0821 04:30:42.045287 75342 syncedmem.cpp:349] [LMS:2] allocate: size=786432008 [count=80 demanded=159% allocated=39% available=36%]
I0821 04:30:44.427603 75342 syncedmem.cpp:349] [LMS:2] allocate: size=3145728008 [count=80 demanded=159% allocated=53% available=61%]
I0821 04:30:44.605427 75342 solver.cpp:244] Iteration 0 (0 iter/s, 0s/200 iters), loss = 2.30264
I0821 04:30:44.605461 75342 solver.cpp:263] Train net output #0: loss = 2.30264 (* 1 = 2.30264 loss)
I0821 04:30:44.605484 75342 sgd_solver.cpp:128] Iteration 0, lr = 0.001
I0821 04:30:44.636401 75346 data_layer.cpp:86] Restarting data prefetching from start.
…
I0821 04:42:50.161826 75342 solver.cpp:497] Snapshotting to HDF5 file examples/cifar10/cifar10_full_iter_200.caffemodel.h5
I0821 04:42:50.164942 75342 sgd_solver.cpp:373] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_full_iter_200.solverstate.h5
I0821 04:42:50.908159 75342 solver.cpp:336] Iteration 200, loss = 1.72997
I0821 04:42:50.908186 75342 solver.cpp:341] Optimization Done.
I0821 04:42:50.908195 75342 caffe.cpp:421] Optimization Done.

And thanks to LMS, GPU memory usage drops sharply.
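
The allocate sizes in the LMS log lines above can be reproduced with simple arithmetic, which also shows why a large batch_size fills GPU memory even with tiny images. The sketch below assumes float32 blobs with 32 feature maps at 32x32 and at 16x16 resolution; those shapes are an assumption chosen to match the log, not something the log states, and the 8 bytes by which the logged sizes exceed the computed ones are simply observed.

```python
# Bytes needed for one float32 blob of shape (batch, channels, height, width).
BYTES_PER_FLOAT = 4


def blob_bytes(batch, channels, height, width):
    return batch * channels * height * width * BYTES_PER_FLOAT


batch = 24000  # batch_size used in the LMS run above

# Assumed shapes (not taken from the log): 32 feature maps at 32x32 and at 16x16.
full_res = blob_bytes(batch, 32, 32, 32)
half_res = blob_bytes(batch, 32, 16, 16)

print(full_res)  # compare with size=3145728008 in the log
print(half_res)  # compare with size=786432008 in the log
```

A single activation blob at this batch size already takes about 3 GB, so a handful of such blobs plus their gradients is enough to exhaust a GPU.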

However, using LMS does slow training down somewhat. Even with NVLink instead of PCIe, host DRAM is naturally slower than GPU memory. LMS can still be tuned a little, mainly with the following two additional options.

-lms_size_threshold <size in KB> : default 1000.
Memory chunks smaller than this value are not swapped out/in by LMS and stay resident in GPU memory.

-lms_exclude <size in MB> : default 0.
The GPU memory size minus this value is the soft limit on the GPU memory allocated as the LMS cache. The smaller this value, the more GPU memory is used and the better the performance.

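For example, the options below gave the best LMS performance when using the GoogLeNet model on a high-resolution image dataset with crop size 2240x2240 and batch size 5. The ideal tuning values, however, differ with the network structure, the data size, batch_size, and the GPU memory size.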

$ caffe train -solver=solver.prototxt -gpu all -lms -lms_size_threshold 1000 -lms_exclude 1400
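
To make the two options concrete, here is a minimal sketch of the rules described above. It is only an illustration of the description in this post, not the caffe-ibm implementation; the function names are made up, and the 16 GB GPU size is an assumption.

```python
def lms_soft_limit_mb(gpu_memory_mb, lms_exclude_mb=0):
    """Soft limit (in MB) on GPU memory used as the LMS cache."""
    return gpu_memory_mb - lms_exclude_mb


def stays_resident(chunk_kb, lms_size_threshold_kb=1000):
    """True if a chunk is below the threshold and therefore never swapped."""
    return chunk_kb < lms_size_threshold_kb


GPU_MEMORY_MB = 16 * 1024  # assumption: a GPU with 16 GB of memory

# Values from the example command above: -lms_size_threshold 1000 -lms_exclude 1400
print(lms_soft_limit_mb(GPU_MEMORY_MB, 1400))
print(stays_resident(512), stays_resident(4096))
```

With these numbers roughly 15 GB remains available as LMS cache, and only chunks of about 1 MB or more are candidates for swapping.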

How does this work out on cifar10? First, run non-LMS caffe training with the largest batch_size that works without LMS. To process about the same number of images as the LMS run (batch_size 24000 * max_iter 200 = 4,800,000), the non-LMS run can use batch_size 22000 * max_iter 219 (= 4,818,000).
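
The max_iter value of 219 comes from rounding up the ratio of total images to the smaller batch size. A small sketch of that arithmetic (the function name is only illustrative):

```python
def matching_max_iter(ref_batch, ref_iters, new_batch):
    """Smallest max_iter at new_batch that processes at least as many images."""
    total_images = ref_batch * ref_iters
    return (total_images + new_batch - 1) // new_batch  # integer ceiling


lms_images = 24000 * 200
non_lms_iters = matching_max_iter(24000, 200, 22000)
print(lms_images, non_lms_iters, 22000 * non_lms_iters)
```

The non-LMS run therefore processes slightly more images (about 0.4% more), which is close enough for a timing comparison.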

[bsyu@p57a22 caffe-ibm]$ time caffe train -gpu 2 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
real 11m46.926s

Next, process almost the same number of images, batch_size 24000 * max_iter 200, with LMS and no extra tuning options. It is indeed a little slower.

[bsyu@p57a22 caffe-ibm]$ time caffe train -gpu 2 -lms --solver=examples/cifar10/cifar10_full_solver.prototxt
…
real 12m12.219s

This time, add -lms_size_threshold and -lms_exclude to the same training. Performance is now almost as fast as non-LMS.

[bsyu@p57a22 caffe-ibm]$ time caffe train -gpu 2 -lms -lms_size_threshold 1000 -lms_exclude 1400 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
real 11m47.405s

And even though the batch_size is larger, the GPU memory usage that LMS had reduced is back up to nearly the non-LMS level.

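Thursday, April 12, 2018

LMDB files built from the reduced ILSVRC2012_img_train_t3.tar

Following the previous post, this one describes how to format the reduced ILSVRC2012_img_train_t3.tar as lmdb for caffe testing.

For training, the raw images already extracted to ~/ilsvrc2012/raw-data/train in the previous post can be used as they are. The validation dataset, however, has to be thinned out separately.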

[user1@ac922 raw-data]$ mkdir val && cd val

[user1@ac922 val]$ tar -xf ~/files/ILSVRC2012_img_val.tar

[user1@ac922 val]$ ls *.JPEG | wc -l
50000

First, download the various label files with get_ilsvrc_aux.sh, which is included with caffe, as shown below.

[user1@ac922 ~]$ cd ~/caffe/data/ilsvrc12/

[user1@ac922 ilsvrc12]$ ./get_ilsvrc_aux.sh

Edit the downloaded val.txt, train.txt, and so on to match ILSVRC2012_img_train_t3.tar, as follows.

[user1@ac922 ilsvrc12]$ head val.txt
ILSVRC2012_val_00000001.JPEG 65
ILSVRC2012_val_00000002.JPEG 970
ILSVRC2012_val_00000003.JPEG 230
ILSVRC2012_val_00000004.JPEG 809
ILSVRC2012_val_00000005.JPEG 516
ILSVRC2012_val_00000006.JPEG 57
ILSVRC2012_val_00000007.JPEG 334
ILSVRC2012_val_00000008.JPEG 415
ILSVRC2012_val_00000009.JPEG 674
ILSVRC2012_val_00000010.JPEG 332

[user1@ac922 ilsvrc12]$ wc -l val.txt
50000 val.txt

[user1@ac922 ilsvrc12]$ head train.txt
n01440764/n01440764_10026.JPEG 0
n01440764/n01440764_10027.JPEG 0
n01440764/n01440764_10029.JPEG 0
n01440764/n01440764_10040.JPEG 0
n01440764/n01440764_10042.JPEG 0
n01440764/n01440764_10043.JPEG 0
n01440764/n01440764_10048.JPEG 0
n01440764/n01440764_10066.JPEG 0
n01440764/n01440764_10074.JPEG 0
n01440764/n01440764_1009.JPEG 0

[user1@ac922 ilsvrc12]$ wc -l train.txt
1281167 train.txt

[user1@ac922 ilsvrc12]$ cat train.txt | cut -d"/" -f 1 | sort -u | wc -l
1000

[user1@ac922 ilsvrc12]$ cp train.txt train.txt.org

[user1@ac922 ilsvrc12]$ cp val.txt val.txt.org

[user1@ac922 ilsvrc12]$ for i in `ls -R ~/ilsvrc2012/raw-data/train/*`
> do
> grep ${i} train.txt >> train.txt.new
> done

This extracts into train.txt.new only the 20,580 entries that make up the reduced dog-photo list.

[user1@ac922 ilsvrc12]$ wc -l train.txt.new
20580 train.txt.new

[user1@ac922 ilsvrc12]$ cp train.txt.new train.txt
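
The grep loop above works but scans train.txt once per file. As an alternative sketch, the same selection can be done in one pass by matching the synset directory name exactly. The function name and the second sample entry are hypothetical; in practice the lines would come from train.txt and the synset names from the directory listing of ~/ilsvrc2012/raw-data/train.

```python
def filter_train_list(lines, kept_synsets):
    """Keep 'synset/file.JPEG label' lines whose synset directory is in kept_synsets."""
    kept = set(kept_synsets)
    return [line for line in lines if line.split("/", 1)[0] in kept]


sample = [
    "n01440764/n01440764_10026.JPEG 0",
    "n02085620/n02085620_10074.JPEG 151",  # hypothetical file name
]
print(filter_train_list(sample, ["n02085620"]))
```

Matching on the directory name rather than on a substring also avoids accidental hits when one name happens to be contained in another.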

Do the same for the other label files such as synsets.txt and synset_words.txt.

[user1@ac922 ilsvrc12]$ wc -l synsets.txt
1000 synsets.txt

[user1@ac922 ilsvrc12]$ head synsets.txt
n01440764
n01443537
n01484850
n01491361
n01494475
n01496331
n01498041
n01514668
n01514859
n01518878

[user1@ac922 ilsvrc12]$ cp synsets.txt synsets.txt.org

[user1@ac922 ilsvrc12]$ for i in `ls ~/ilsvrc2012/raw-data/train`
> do
> grep ${i} synsets.txt >> synsets.txt.new
> done

[user1@ac922 ilsvrc12]$ wc -l synsets.txt.new
120 synsets.txt.new

[user1@ac922 ilsvrc12]$ cp synsets.txt.new synsets.txt

[user1@ac922 ilsvrc12]$ head synset_words.txt
n01440764 tench, Tinca tinca
n01443537 goldfish, Carassius auratus
n01484850 great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias
n01491361 tiger shark, Galeocerdo cuvieri
n01494475 hammerhead, hammerhead shark
n01496331 electric ray, crampfish, numbfish, torpedo
n01498041 stingray
n01514668 cock
n01514859 hen
n01518878 ostrich, Struthio camelus

[user1@ac922 ilsvrc12]$ cp synset_words.txt synset_words.txt.org

[user1@ac922 ilsvrc12]$ wc -l synset_words.txt
1000 synset_words.txt

[user1@ac922 ilsvrc12]$ for i in `ls ~/ilsvrc2012/raw-data/train`
> do
> grep ${i} synset_words.txt >> synset_words.txt.new
> done

[user1@ac922 ilsvrc12]$ wc -l synset_words.txt.new
120 synset_words.txt.new

[user1@ac922 ilsvrc12]$ head synset_words.txt.new
n02085620 Chihuahua
n02085782 Japanese spaniel
n02085936 Maltese dog, Maltese terrier, Maltese
n02086079 Pekinese, Pekingese, Peke
n02086240 Shih-Tzu
n02086646 Blenheim spaniel
n02086910 papillon
n02087046 toy terrier
n02087394 Rhodesian ridgeback
n02088094 Afghan hound, Afghan

[user1@ac922 ilsvrc12]$ cp synset_words.txt.new synset_words.txt

Now, from the 50,000 validation JPEG files in ~/ilsvrc2012/raw-data/val, pick out only the files whose categories match the reduced training dataset, as follows.

[user1@ac922 ilsvrc12]$ cat train.txt | awk '{print $2}' | sort -u | wc -l
120

[user1@ac922 ilsvrc12]$ cat train.txt | awk '{print $2}' | sort -u > train.id

[user1@ac922 ilsvrc12]$ cat train.id | head
151
152
153
154
155
156
157
158
159
160

[user1@ac922 ilsvrc12]$ cat train.id | tail
262
263
264
265
266
267
268
273
274
275

[user1@ac922 ilsvrc12]$ sed 's/ /@/' val.txt > val.imsi

[user1@ac922 ilsvrc12]$ for i in `cat val.imsi`
> do
> j=`echo ${i} | cut -d'@' -f2`
> if [[ $j -gt 150 && $j -lt 276 ]]
> then
> echo ${i} >> val.txt.new
> fi
> done

[user1@ac922 ilsvrc12]$ sed 's/@/ /' val.txt.new > val.txt

[user1@ac922 ilsvrc12]$ wc -l val.txt
6250 val.txt

[user1@ac922 ilsvrc12]$ for i in `ls ~/ilsvrc2012/raw-data/val`
> do
> grep ${i} val.txt > /dev/null
> if [[ $? -ne 0 ]]
> then
> rm ~/ilsvrc2012/raw-data/val/${i}
> fi
> done

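The range test above keeps every label id from 151 to 275 (125 ids, 50 images each), including a few ids such as 269 to 272 that are not in train.id, so 6,250 images remain rather than 120 * 50 = 6,000.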

[user1@ac922 ilsvrc12]$ ls ~/ilsvrc2012/raw-data/val | wc -l
6250
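
An exact membership test against train.id avoids keeping label ids that merely fall inside the numeric range. The following is a minimal sketch of that idea; the function name and the third sample entry are hypothetical, and in practice the inputs would be the lines of val.txt.org and train.id.

```python
def filter_val_list(val_lines, train_ids):
    """Keep 'file.JPEG label' lines whose label is one of train_ids."""
    wanted = {int(i) for i in train_ids}
    kept = []
    for line in val_lines:
        label = int(line.split()[-1])
        if label in wanted:
            kept.append(line)
    return kept


val_sample = [
    "ILSVRC2012_val_00000001.JPEG 65",
    "ILSVRC2012_val_00000003.JPEG 230",
    "ILSVRC2012_val_90000001.JPEG 270",  # hypothetical entry; 270 is not in train.id
]
print(filter_val_list(val_sample, ["151", "230", "275"]))
```

Only the entry with label 230 survives here, whereas the range test 150 < id < 276 would also have kept the entry with label 270.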

Now run create_imagenet.sh as below to build the lmdb files. Because of a problem with our AC922, the work was moved to a POWER8 Minsky server.

To suit inception v3, the lmdb is created with RESIZE_HEIGHT and RESIZE_WIDTH changed from the default 256 x 256 to 320 x 320.

minsky@minsky:/opt/DL/caffe-nv$ vi ./examples/imagenet/create_imagenet.sh
...
RESIZE=true
if $RESIZE; then
RESIZE_HEIGHT=320
RESIZE_WIDTH=320

minsky@minsky:/opt/DL/caffe-nv$ ./examples/imagenet/create_imagenet.sh
Creating train lmdb...
I0412 01:31:48.584789 44556 convert_imageset.cpp:83] Shuffling data
I0412 01:31:49.089481 44556 convert_imageset.cpp:86] A total of 20580 images.
I0412 01:31:49.089962 44556 db_lmdb.cpp:35] Opened lmdb examples/imagenet/ilsvrc12_train_lmdb
...
Creating val lmdb...
...
I0412 01:33:47.749966 44599 convert_imageset.cpp:150] Processed 6250 files.
Done.

A new imagenet_mean.binaryproto then has to be generated for the newly created lmdb files. The provided script handles this as well.

minsky@minsky:/opt/DL/caffe-nv$ ./examples/imagenet/make_imagenet_mean.sh
Done.

minsky@minsky:/opt/DL/caffe-nv/data/ilsvrc12$ ls -l imagenet_mean.binaryproto
-rw-rw-r-- 1 minsky minsky 1228814 Apr 13 08:51 imagenet_mean.binaryproto
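
The file size is a quick sanity check that the mean really was computed at 320 x 320. The sketch below assumes the mean image is stored as 3 channels of 32-bit floats; the 786446-byte figure is the size of the stock 256 x 256 imagenet_mean.binaryproto listed in the December 12, 2017 post further down this page. The few remaining bytes are presumably serialization overhead, which is an interpretation rather than something the listing states.

```python
def mean_payload_bytes(height, width, channels=3, bytes_per_value=4):
    """Raw size of a mean image stored as float32 values (assumed layout)."""
    return channels * height * width * bytes_per_value


# (image side, file size reported by ls -l on this page)
for side, file_size in ((320, 1228814), (256, 786446)):
    payload = mean_payload_bytes(side, side)
    print(side, payload, file_size - payload)
```

Both files differ from the raw payload by the same small amount, which is consistent with the new file holding a 320 x 320 mean.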

The resulting ilsvrc12_lmdb files, and the entire data/ilsvrc12 directory containing the label files and imagenet_mean.binaryproto, were packed under the file names below and uploaded to Google Drive at the URLs below.

minsky@minsky:/opt/DL/caffe-nv/examples/imagenet$ ls -l *.tgz
-rw-rw-r-- 1 minsky minsky 6806226325 Apr 13 08:48 ilsvrc12_lmdb_small_320.tgz

minsky@minsky:/opt/DL/caffe-nv/data$ tar -zcf ilsvrc12_320.tgz ilsvrc12

ilsvrc12_lmdb_small_320.tgz :
https://drive.google.com/open?id=12LqkuqCChOjK9zz1ZNGJa_YVXglgPUuB

ilsvrc12_320.tgz :
https://drive.google.com/open?id=1cugEHL2zm5UuvOy0MGSFJAPFAHg-KBEP

The URL below is a bundle of scripts for running inception v1 with caffe.

https://drive.google.com/open?id=17M9CcZFyHIKBx3Hv4EXAWXlG7bmKd6HO

Wednesday, March 14, 2018

Image Classification demo using PyCaffe on ppc64le

This starts from a state where caffe is already installed and anaconda2 is installed as well.

[user1@gpusvr ~]$ which python
~/anaconda2/bin/python

First, go to the CAFFE_ROOT directory that holds the caffe source code and run make pycaffe.

[user1@gpusvr caffe]$ make pycaffe
CXX/LD -o python/caffe/_caffe.so python/caffe/_caffe.cpp
touch python/caffe/proto/__init__.py
PROTOC (python) src/caffe/proto/caffe.proto

The pycaffe built this way ends up under the python/caffe directory, as follows.

[user1@gpusvr caffe]$ ls -l python/caffe
total 2016
-rw-rw-r--. 1 user1 user1 21363 Mar 13 10:50 _caffe.cpp
-rwxrwxr-x. 1 user1 user1 1897912 Mar 14 15:17 _caffe.so
-rw-rw-r--. 1 user1 user1 3546 Mar 13 10:50 classifier.py
-rw-rw-r--. 1 user1 user1 3278 Mar 14 15:53 classifier.pyc
-rw-rw-r--. 1 user1 user1 6721 Mar 13 10:50 coord_map.py
-rw-rw-r--. 1 user1 user1 8549 Mar 13 10:50 detector.py
-rw-rw-r--. 1 user1 user1 7335 Mar 14 15:53 detector.pyc
-rw-rw-r--. 1 user1 user1 11174 Mar 13 10:50 draw.py
drwxrwxr-x. 2 user1 user1 34 Mar 13 10:50 imagenet
-rw-rw-r--. 1 user1 user1 552 Mar 13 10:50 __init__.py
-rw-rw-r--. 1 user1 user1 1206 Mar 14 15:16 __init__.pyc
-rw-rw-r--. 1 user1 user1 13079 Mar 13 10:50 io.py
-rw-rw-r--. 1 user1 user1 13576 Mar 14 15:28 io.pyc
-rw-rw-r--. 1 user1 user1 8277 Mar 13 10:50 net_spec.py
-rw-rw-r--. 1 user1 user1 10008 Mar 14 15:53 net_spec.pyc
drwxrwxr-x. 2 user1 user1 86 Mar 14 15:53 proto
-rw-rw-r--. 1 user1 user1 11615 Mar 13 10:50 pycaffe.py
-rw-rw-r--. 1 user1 user1 12179 Mar 14 15:16 pycaffe.pyc
drwxrwxr-x. 2 user1 user1 256 Mar 13 10:50 test

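Copy it under ~/anaconda2/lib/python2.7/site-packages, which is on the Python module search path, as follows. Adding the current python/caffe directory itself to PYTHONPATH might seem like an alternative, but when I tried it I kept getting a no module named caffe error. The likely reason is that PYTHONPATH has to contain the parent directory (python), not the package directory (python/caffe) itself.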

[user1@gpusvr caffe]$ cp -r python/caffe /home/user1/anaconda2/lib/python2.7/site-packages

To avoid errors such as libcaffe.so file not found, set LD_LIBRARY_PATH properly as follows.

[user1@gpusvr caffe]$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/caffe/.build_release/lib:/usr/local/lib:/usr/local/lib64:/usr/lib:/usr/lib64

Now install jupyter notebook, configure it, and start it.

[user1@gpusvr ~]$ conda install jupyter

[user1@gpusvr ~]$ jupyter notebook --generate-config
Writing default config to: /home/user1/.jupyter/jupyter_notebook_config.py

[user1@gpusvr ~]$ vi /home/user1/.jupyter/jupyter_notebook_config.py
...
#c.NotebookApp.ip = 'localhost'
c.NotebookApp.ip = '*'

[user1@gpusvr ~]$ jupyter notebook &
...
http://localhost:8888/?token=9b9c684173d7883ae7431c02d3e8dfe831cf790322737d16

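Now, in a web browser on another PC or laptop, enter http://<GPU server address>:8888/?token=9b9c684173d7883ae7431c02d3e8dfe831cf790322737d16 in the address bar and connect.

Once connected, you will see the home directory from which jupyter was started. Click through to the caffe/examples directory and open the file 00-classification.ipynb.

Click the play (right-pointing triangle) button in the Jupyter menu bar to step through the sections until you reach the section "6. Try your own image". There, paste the URL of a suitable jpg file found on the internet to the right of my_image_url = in the code. I used a photo of the battleship USS Missouri.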

# download an image
my_image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/USS_Missouri_HNL.jpg/220px-USS_Missouri_HNL.jpg" # paste your URL here

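Clicking the play button in the Jupyter menu bar gives the result below: aircraft carrier with 49.4% probability, space shuttle with 11.8%, and warplane with 9.3%. Apparently the view of the turrets and bridge seen from the deck is not recognized as a battleship.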

[(0.49434182,
'n02687172 aircraft carrier, carrier, flattop, attack aircraft carrier'),
(0.11762773, 'n04266014 space shuttle'),
(0.092569701, 'n04552348 warplane, military plane'),
(0.081569709, 'n04008634 projectile, missile'),
(0.078422301,
'n04389033 tank, army tank, armored combat vehicle, armoured combat vehicle')]
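
The list above is simply the classes sorted by predicted probability, with the top five kept. A minimal pure-Python sketch of that ranking follows, using three of the values from the output; the function name top_k is illustrative and this is not the notebook's own code.

```python
def top_k(probabilities, labels, k=5):
    """Return the k most probable (probability, label) pairs, best first."""
    ranked = sorted(zip(probabilities, labels), reverse=True)
    return ranked[:k]


probs = [0.092569701, 0.49434182, 0.11762773]
names = ["warplane, military plane", "aircraft carrier", "space shuttle"]
for p, name in top_k(probs, names, k=3):
    print("{:.1f}% {}".format(100 * p, name))
```

Formatting to one decimal place gives the rounded percentages quoted in the text.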

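Even with a photo that shows the whole battleship, the result is still aircraft carrier, with 99.9% probability. This model, trained on the ImageNet 2012 training dataset, probably has no separate labels that distinguish battleships from aircraft carriers.

A web version of this demo is available at the following URL, so you can try it any time.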

http://demo.caffe.berkeleyvision.org/

Tuesday, March 13, 2018

Building bvlc caffe on AC922 and running cifar10

This assumes an environment with Anaconda2 installed (for Pycaffe). First, install the packages below.

[user1@ac922 ~]$ sudo yum install git gcc gcc-c++ python-devel python-enum34 numpy cmake automake snappy.ppc64le boost-python.ppc64le libgfortran4.ppc64le gtk+.ppc64le gtk+-devel.ppc64le gtk2.ppc64le gtk3.ppc64le gstreamer.ppc64le gstreamer-tools.ppc64le libdc1394.ppc64le libdc1394-tools.ppc64le

1. Install HDF5.

[user1@ac922 ~]$ wget https://support.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.10.1.tar
[user1@ac922 ~]$ tar -xf hdf5-1.10.1.tar
[user1@ac922 ~]$ cd hdf5-1.10.1
[user1@ac922 hdf5-1.10.1]$ ./configure --prefix=/usr/local --enable-fortran --enable-cxx --build=powerpc64le-linux-gnu
[user1@ac922 hdf5-1.10.1]$ make && sudo make install

2. Install boost.

[user1@ac922 ~]$ wget https://dl.bintray.com/boostorg/release/1.66.0/source/boost_1_66_0.tar.gz
[user1@ac922 ~]$ tar -zxf boost_1_66_0.tar.gz
[user1@ac922 ~]$ cd boost_1_66_0
[user1@ac922 boost_1_66_0]$ ./bootstrap.sh --prefix=/usr/local
[user1@ac922 boost_1_66_0]$ ./b2
[user1@ac922 boost_1_66_0]$ sudo ./b2 install

3. Install GFLAGS.

[user1@ac922 ~]$ wget https://github.com/schuhschuh/gflags/archive/master.zip
[user1@ac922 ~]$ unzip master.zip && cd gflags-master
[user1@ac922 gflags-master]$ mkdir build && cd build
[user1@ac922 build]$ cmake .. -DBUILD_SHARED_LIBS=ON -DBUILD_STATIC_LIBS=ON -DBUILD_gflags_LIB=ON
[user1@ac922 build]$ make && sudo make install

4. Install GLOG.

[user1@ac922 ~]$ wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/google-glog/glog-0.3.3.tar.gz
[user1@ac922 ~]$ tar zxvf glog-0.3.3.tar.gz
[user1@ac922 ~]$ cd glog-0.3.3
[user1@ac922 glog-0.3.3]$ ./configure --build=powerpc64le-redhat-linux-gnu
[user1@ac922 glog-0.3.3]$ make && sudo make install

5. Install LMDB.

[user1@ac922 ~]$ git clone https://github.com/LMDB/lmdb
[user1@ac922 ~]$ cd lmdb/libraries/liblmdb
[user1@ac922 liblmdb]$ make && sudo make install

6. Install LEVELDB.

[user1@ac922 files]$ wget https://rpmfind.net/linux/epel/7/ppc64le/Packages/l/leveldb-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ wget https://www.rpmfind.net/linux/epel/7/ppc64le/Packages/l/leveldb-devel-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ sudo rpm -Uvh leveldb-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ sudo rpm -Uvh leveldb-devel-1.12.0-11.el7.ppc64le.rpm

7. Install OpenBLAS.

[user1@ac922 ~]$ git clone https://github.com/xianyi/OpenBLAS.git
[user1@ac922 ~]$ cd OpenBLAS
[user1@ac922 OpenBLAS]$ git checkout power8
[user1@ac922 OpenBLAS]$ make TARGET=POWER8 LDFLAGS="-fopenmp"
[user1@ac922 OpenBLAS]$ sudo make TARGET=POWER8 LDFLAGS="-fopenmp" install
...
Copying the static library to /opt/OpenBLAS/lib
Copying the shared library to /opt/OpenBLAS/lib
Generating OpenBLASConfig.cmake in /opt/OpenBLAS/lib/cmake/openblas
Generating OpenBLASConfigVersion.cmake in /opt/OpenBLAS/lib/cmake/openblas
Install OK!
make[1]: Leaving directory `/home/user1/OpenBLAS'

8. Install OpenCV.

[user1@ac922 ~]$ git clone --recursive https://github.com/opencv/opencv.git
[user1@ac922 ~]$ git clone --recursive https://github.com/opencv/opencv_contrib.git
[user1@ac922 ~]$ cd opencv
[user1@ac922 opencv]$ git checkout tags/3.4.1
[user1@ac922 opencv]$ mkdir build && cd build
[user1@ac922 build]$ which protoc
~/anaconda2/bin/protoc
[user1@ac922 build]$ export PROTOBUF_PROTOC_EXECUTABLE="$HOME/anaconda2/bin/protoc"
[user1@ac922 build]$ cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local -DOPENCV_EXTRA_MODULES_PATH=$HOME/opencv_contrib/modules -D WITH_EIGEN=OFF -DBUILD_LIBPROTOBUF_FROM_SOURCES=ON ..
[user1@ac922 build]$ make && sudo make install
...
-- Installing: /usr/local/bin/opencv_visualisation
-- Set runtime path of "/usr/local/bin/opencv_visualisation" to "/usr/local/lib64:/usr/local/cuda/lib64"
-- Installing: /usr/local/bin/opencv_interactive-calibration
-- Set runtime path of "/usr/local/bin/opencv_interactive-calibration" to "/usr/local/lib64:/usr/local/cuda/lib64"
-- Installing: /usr/local/bin/opencv_version
-- Set runtime path of "/usr/local/bin/opencv_version" to "/usr/local/lib64:/usr/local/cuda/lib64"

9. Build NCCL.

[user1@ac922 ~]$ git clone https://github.com/NVIDIA/nccl
[user1@ac922 ~]$ cd nccl
[user1@ac922 nccl]$ make
[user1@ac922 nccl]$ sudo make install

10. Now caffe itself can finally be built.

[user1@ac922 ~]$ git clone https://github.com/BVLC/caffe.git
[user1@ac922 ~]$ cd caffe
[user1@ac922 caffe]$ cp Makefile.config.example Makefile.config
[user1@ac922 caffe]$ vi Makefile.config
...
# USE_CUDNN := 1
USE_CUDNN := 1
...
# OPENCV_VERSION := 3
OPENCV_VERSION := 3
...
#CUDA_ARCH := -gencode arch=compute_20,code=sm_20 \
-gencode arch=compute_20,code=sm_21 \
-gencode arch=compute_30,code=sm_30 \
-gencode arch=compute_35,code=sm_35 \
-gencode arch=compute_50,code=sm_50 \
-gencode arch=compute_52,code=sm_52 \
-gencode arch=compute_60,code=sm_60 \
-gencode arch=compute_61,code=sm_61 \
-gencode arch=compute_61,code=compute_61
CUDA_ARCH := -gencode arch=compute_60,code=sm_60 \
-gencode arch=compute_61,code=sm_61 \
-gencode arch=compute_61,code=compute_61
...
# BLAS := atlas
BLAS := open
...
#PYTHON_INCLUDE := /usr/include/python2.7 \
/usr/lib/python2.7/dist-packages/numpy/core/include
ANACONDA_HOME := $(HOME)/anaconda2
PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
$(ANACONDA_HOME)/include/python2.7 \
$(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include
...
# PYTHON_LIB := /usr/lib
PYTHON_LIB := $(ANACONDA_HOME)/lib
...
# WITH_PYTHON_LAYER := 1
WITH_PYTHON_LAYER := 1
...
# USE_NCCL := 1
USE_NCCL := 1

LINKFLAGS := -Wl,-rpath,$(HOME)/anaconda2/lib # added to prevent "~/anaconda2/lib/libpng16.so.16 undefined reference to `inflateValidate@ZLIB_1.2.9" error

At this point, create the soft links below to avoid errors such as cannot find -lsnappy.

[user1@ac922 caffe]$ sudo ln -s /usr/lib64/libsnappy.so.1 /usr/lib64/libsnappy.so
[user1@ac922 caffe]$ sudo ln -s /usr/lib64/libboost_python.so.1.53.0 /usr/lib64/libboost_python.so

[user1@ac922 caffe]$ make all
...
CXX/LD -o .build_release/examples/cpp_classification/classification.bin
CXX examples/mnist/convert_mnist_data.cpp
CXX/LD -o .build_release/examples/mnist/convert_mnist_data.bin
CXX examples/siamese/convert_mnist_siamese_data.cpp
CXX/LD -o .build_release/examples/siamese/convert_mnist_siamese_data.bin

[user1@ac922 caffe]$ sudo mkdir /opt/caffe
[user1@ac922 caffe]$ sudo cp -r build/* /opt/caffe

11. Now run cifar10.

[user1@ac922 caffe]$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib64:/usr/lib:/usr/lib64

[user1@ac922 caffe]$ export CAFFE_HOME=/home/user1/caffe/build/tools

[user1@ac922 caffe]$ cd data/cifar10/

[user1@ac922 cifar10]$ ./get_cifar10.sh
Downloading...
--2018-03-13 14:13:14-- http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
Resolving www.cs.toronto.edu (www.cs.toronto.edu)... 128.100.3.30
Connecting to www.cs.toronto.edu (www.cs.toronto.edu)|128.100.3.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170052171 (162M) [application/x-gzip]
Saving to: ‘cifar-10-binary.tar.gz’

100%[==================================================================>] 170,052,171 11.0MB/s in 15s

2018-03-13 14:13:29 (11.0 MB/s) - ‘cifar-10-binary.tar.gz’ saved [170052171/170052171]

Unzipping...
Done.

[user1@ac922 cifar10]$ ls -la
total 180084
drwxrwxr-x. 2 user1 user1 213 Mar 13 14:13 .
drwxrwxr-x. 5 user1 user1 50 Mar 13 10:50 ..
-rw-r--r--. 1 user1 user1 61 Jun 5 2009 batches.meta.txt
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 data_batch_1.bin
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 data_batch_2.bin
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 data_batch_3.bin
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 data_batch_4.bin
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 data_batch_5.bin
-rwxrwxr-x. 1 user1 user1 506 Mar 13 10:50 get_cifar10.sh
-rw-r--r--. 1 user1 user1 88 Jun 5 2009 readme.html
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 test_batch.bin

[user1@ac922 caffe]$ ./examples/cifar10/create_cifar10.sh

[user1@ac922 caffe]$ ls -l examples/cifar10/*_lmdb
examples/cifar10/cifar10_test_lmdb:
total 35656
-rw-rw-r--. 1 user1 user1 36503552 Mar 13 14:14 data.mdb
-rw-rw-r--. 1 user1 user1 8192 Mar 13 14:14 lock.mdb

examples/cifar10/cifar10_train_lmdb:
total 177992
-rw-rw-r--. 1 user1 user1 182255616 Mar 13 14:14 data.mdb
-rw-rw-r--. 1 user1 user1 8192 Mar 13 14:14 lock.mdb

[user1@ac922 caffe]$ vi ./examples/cifar10/train_full.sh
#!/usr/bin/env sh
set -e

TOOLS=./build/tools

$TOOLS/caffe train \
--solver=examples/cifar10/cifar10_full_solver.prototxt $@

# reduce learning rate by factor of 10
$TOOLS/caffe train \
--solver=examples/cifar10/cifar10_full_solver_lr1.prototxt \
--snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate.h5 $@
# --snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate $@

# reduce learning rate by factor of 10
$TOOLS/caffe train \
--solver=examples/cifar10/cifar10_full_solver_lr2.prototxt \
--snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate.h5 $@
# --snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate $@

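# The solver writes its state in HDF5 format (see the "Snapshotting solver state to HDF5 file" log lines below), so the file is named cifar10_full_iter_60000.solverstate.h5 rather than cifar10_full_iter_60000.solverstate; the file names above were changed accordingly.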

[user1@ac922 caffe]$ time ./examples/cifar10/train_full.sh
I0313 14:15:55.463438 114263 caffe.cpp:204] Using GPUs 0
I0313 14:15:55.529319 114263 caffe.cpp:209] GPU 0: Tesla V100-SXM2-16GB
...
I0313 14:34:16.333791 126407 solver.cpp:239] Iteration 69800 (177.976 iter/s, 1.12375s/200 iters), loss = 0.332006
I0313 14:34:16.333875 126407 solver.cpp:258] Train net output #0: loss = 0.332006 (* 1 = 0.332006 loss)
I0313 14:34:16.333892 126407 sgd_solver.cpp:112] Iteration 69800, lr = 1e-05
I0313 14:34:17.436130 126413 data_layer.cpp:73] Restarting data prefetching from start.
I0313 14:34:17.453459 126407 solver.cpp:478] Snapshotting to HDF5 file examples/cifar10/cifar10_full_iter_70000.caffemodel.h5
I0313 14:34:17.458664 126407 sgd_solver.cpp:290] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_full_iter_70000.solverstate.h5
I0313 14:34:17.461360 126407 solver.cpp:331] Iteration 70000, loss = 0.294117
I0313 14:34:17.461383 126407 solver.cpp:351] Iteration 70000, Testing net (#0)
I0313 14:34:17.610864 126424 data_layer.cpp:73] Restarting data prefetching from start.
I0313 14:34:17.612763 126407 solver.cpp:418] Test net output #0: accuracy = 0.8169
I0313 14:34:17.612794 126407 solver.cpp:418] Test net output #1: loss = 0.533315 (* 1 = 0.533315 loss)
I0313 14:34:17.612810 126407 solver.cpp:336] Optimization Done.
I0313 14:34:17.612821 126407 caffe.cpp:250] Optimization Done.

real 6m51.615s
user 7m30.483s
sys 1m5.158s
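
The iter/s figure in the log can be cross-checked and converted into images per second. The sketch below uses the numbers from the "Iteration 69800" log line and assumes the stock training batch_size of 100 for cifar10_full, which is not shown in this post.

```python
def iters_per_second(iters, seconds):
    return iters / seconds


def images_per_second(iter_rate, batch_size):
    return iter_rate * batch_size


rate = iters_per_second(200, 1.12375)  # "1.12375s/200 iters" from the log
print(round(rate, 3))                  # the log reports 177.976 iter/s
print(round(images_per_second(rate, 100)))  # assumes batch_size 100
```

Under that assumption a single V100 is processing on the order of 17,800 training images per second in this run.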

Friday, January 19, 2018

Building caffe from source code in the AC922 Redhat environment

PowerAI is not yet officially fully supported in the AC922 Redhat environment; full support is planned for 2Q 2018. Building from source code, however, is always an option.

The previous post showed how to build Tensorflow 1.4.1 on AC922; this time caffe (bvlc-caffe, of course) is built. Boran Lee (이보란) of IBM did the work this time. For the full text, follow the link below to the blog run by Boran Lee.

http://shareithw.blogspot.kr/2018/01/bvlc-caffe-rhel-power9-ppc64le.html

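Monday, January 15, 2018

Running caffe and tensorflow on AC922 with an Ubuntu-based container image

Good news. According to the original announcement, Ubuntu would be officially supported on AC922 only in 2Q 2018, and caffe, tensorflow, and the like were to be supported only then as well.

Testing today on AC922 with CUDA 9.1 + Redhat 7.4 + nvidia-docker v1, the tensorflow 1.3 and caffe 1.0 that were previously built on ubuntu 16.04 both run fine.

Performance is also about 1.7 to 1.8 times that of the existing Minsky with P100. Given that the official TFLOPS figure of the V100 is about 1.5 times that of the P100, this seems to be thanks to POWER9 and NVLink 2.0.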

[root@ac922 ~]# nvidia-docker run -ti --rm -v /nvme:/nvme bsyu/tf1.3-ppc64le:v0.1 bash

root@2be8a3ffc5fd:/nvme/models/tutorials/image/cifar10# which python
/opt/anaconda3/bin/python

root@2be8a3ffc5fd:/nvme/models/tutorials/image/cifar10# time python cifar10_multi_gpu_train.py --batch_size 512

...

2018-01-15 13:10:45.808267: step 8220, loss = 0.75 (27044.5 examples/sec; 0.019 sec/batch)
2018-01-15 13:10:46.585492: step 8230, loss = 0.60 (27071.8 examples/sec; 0.019 sec/batch)
2018-01-15 13:10:47.353298: step 8240, loss = 0.59 (26355.8 examples/sec; 0.019 sec/batch)
2018-01-15 13:10:48.130751: step 8250, loss = 0.67 (26219.3 examples/sec; 0.020 sec/batch)
2018-01-15 13:10:48.909557: step 8260, loss = 0.61 (27011.5 examples/sec; 0.019 sec/batch)
2018-01-15 13:10:49.681004: step 8270, loss = 0.65 (27131.1 examples/sec; 0.019 sec/batch)

To get this performance, however, the GPU clocks have to be boosted as before; for the V100, set it up as follows.

[root@ac922 ~]# cat /etc/rc.local
#!/bin/bash
# THIS FILE IS ADDED FOR COMPATIBILITY PURPOSES
# It is highly advisable to create own systemd services or udev rules
# to run scripts during boot instead of using this file.
# In contrast to previous versions due to parallel execution during boot
# this script will NOT be run after all other services.
# Please note that you must run 'chmod +x /etc/rc.d/rc.local' to ensure
# that this script will be executed during boot.
touch /var/lock/subsys/local
/usr/bin/nvidia-smi -pm ENABLED
/usr/bin/nvidia-smi -ac 877,1530
/usr/sbin/ppc64_cpu --smt=off
sleep 30
/usr/bin/cpupower frequency-set --governor performance
/usr/bin/nvidia-docker-plugin &

Also, when a docker container is used like this, initializing cuda takes quite a long time. It is not yet clear whether this is because an ubuntu-based container runs on top of Redhat, because nvidia-docker v1 is used, or because of a CUDA 9 issue.

And running nvidia-smi as is produces the error failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED. That problem can be resolved as shown below. The following steps were put together by Boran Lee (이보란) of IBM.

* See the nvidia installation guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-metas

# vi /etc/modprobe.d/blacklist-nouveau.conf

blacklist nouveau
options nouveau modeset=0

# sudo dracut --force

# vi /usr/lib/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target

# sudo systemctl enable nvidia-persistenced

$ vi /lib/udev/rules.d/40-redhat.rules (comment out the line below with #)

#SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"

# /usr/bin/nvidia-persistenced --verbose

# sudo yum install freeglut-devel libX11-devel libXi-devel libXmu-devel make mesa-libGLU-devel

Error downloading packages:
libXext-devel-1.3.3-3.el7.ppc64le: [Errno 256] No more mirrors to try.
libXdamage-devel-1.1.4-4.1.el7.ppc64le: [Errno 256] No more mirrors to try. (rest omitted)
==> The install step fails with download errors like the ones above, but I simply ignored them and moved on.

After that, once PYTHONPATH, PATH, and LD_LIBRARY_PATH are set, running either cifar10 or mnist with tensorflow 1.4 produces the following error.

[root@ac922 cifar10]# python cifar10_train.py
Traceback (most recent call last):
File "cifar10_train.py", line 42, in <module>
import tensorflow as tf
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
from tensorflow.python import *
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 30, in <module>
import traceback
File "/opt/anaconda3/lib/python3.6/traceback.py", line 5, in <module>
import linecache
File "/opt/anaconda3/lib/python3.6/linecache.py", line 11, in <module>
import tokenize
File "/opt/anaconda3/lib/python3.6/tokenize.py", line 33, in <module>
import re
File "/opt/anaconda3/lib/python3.6/re.py", line 142, in <module>
class RegexFlag(enum.IntFlag):
AttributeError: module 'enum' has no attribute 'IntFlag'

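This error occurs because enum34 is installed. enum34 is a backport of the enum module for Python versions older than 3.4; on newer versions it is incompatible with the standard library enum (it shadows it and lacks IntFlag, which was added in Python 3.6), so it has to be removed. To remove it, first point PYTHONPATH at the Python 2.7 path.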

# export PYTHONPATH=/opt/DL/tensorflow/lib/python2.7/site-packages

# pip uninstall enum34

# export PYTHONPATH=<reset to the python3.6/site-packages path>

# export PYTHONPATH=/opt/anaconda3/lib/python3.6/site-packages

After that, running cifar10 or mnist no longer raises the enum error, and nvidia-smi correctly shows the GPU memory status instead of an unknown error.
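
To confirm which enum module Python actually picks up, a quick check like the following can help. This is only a minimal sketch: the function name `enum_status` is my own, and it simply reports where `enum` was loaded from and whether it has `IntFlag`.

```python
import enum
import sys


def enum_status():
    """Report which enum module is loaded and whether it provides IntFlag."""
    return {
        "python": "%d.%d" % sys.version_info[:2],
        # The stdlib enum lives under lib/pythonX.Y/enum.py; enum34 lives in site-packages.
        "enum_file": getattr(enum, "__file__", "<builtin>"),
        "has_intflag": hasattr(enum, "IntFlag"),
    }


if __name__ == "__main__":
    status = enum_status()
    print(status)
    if not status["has_intflag"]:
        print("enum looks shadowed (possibly by enum34); consider: pip uninstall enum34")
```

If `has_intflag` is False on Python 3.6, the `enum_file` path shows which site-packages directory the stray enum34 was loaded from.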

Tuesday, December 12, 2017

Training AlexNet on the ILSVRC2012 dataset with Caffe

First, to use the caffe-nv bundled with PowerAI as the working environment, source the following script, which sets PATH and the other environment variables.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ source /opt/DL/caffe-nv/bin/caffe-activate

Confirm that caffe now resolves to caffe-nv.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ which caffe
/opt/DL/caffe-nv/bin/caffe

Copy the examples and data directories under PowerAI's caffe-nv over to the GPFS file system.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cp -r /opt/DL/caffe-nv/examples examples
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cp -r /opt/DL/caffe-nv/data data

There, run get_ilsvrc_aux.sh as shown below to download the label files and other auxiliary files needed to build the ilsvrc2012 dataset.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cd data/ilsvrc12
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/data/ilsvrc12$ ./get_ilsvrc_aux.sh

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/data/ilsvrc12$ ls -ltr
total 37888
-rw-r----- 1 b7p286za IBM1 3200000 Feb 25 2014 test.txt
-rw-r----- 1 b7p286za IBM1 10000 Feb 25 2014 synsets.txt
-rw-r----- 1 b7p286za IBM1 786446 Feb 25 2014 imagenet_mean.binaryproto
-rw-r----- 1 b7p286za IBM1 1644500 Feb 25 2014 val.txt
-rw-r----- 1 b7p286za IBM1 43829433 Feb 25 2014 train.txt
-rw-r----- 1 b7p286za IBM1 31675 Apr 8 2014 synset_words.txt
-rw-r----- 1 b7p286za IBM1 3787 Jun 8 2014 det_synset_words.txt
-rw-r----- 1 b7p286za IBM1 14931117 Jul 11 2014 imagenet.bet.pickle
-rwxr-x--- 1 b7p286za IBM1 585 Dec 12 02:12 get_ilsvrc_aux.sh

Now download the ImageNet data, i.e. ILSVRC2012. For the training dataset, you can reuse the raw-data from the tensorflow resnet training in the earlier post. There, however, the validation dataset was also spread across per-label directories, whereas this alexnet setup needs all of it unpacked into a single directory named val. So unpack only val again, as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/data/ilsvrc12$ cd ../..

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ mkdir raw-data/val && cd raw-data/val

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/val$ tar -xf ../../ILSVRC2012_img_val.tar

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/val$ cd ../..

Next, the JPEG files under raw-data/train and raw-data/val must be converted to LMDB format. Edit the create_imagenet.sh script as follows and use it.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi ./examples/imagenet/create_imagenet.sh
...
export CAFFE_BIN=/opt/DL/caffe-nv/bin (added)
...
TRAIN_DATA_ROOT=/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/train/
VAL_DATA_ROOT=/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/val/
...
#RESIZE=false
RESIZE=true

Once the edits are done, run it as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ time ./examples/imagenet/create_imagenet.sh

This step also converts more than 200GB of data to LMDB format, so it takes around 6 to 7 hours depending on the storage. Once the script finishes, the converted datasets appear in examples/imagenet/ilsvrc12_train_lmdb and examples/imagenet/ilsvrc12_val_lmdb.

Next, run make_imagenet_mean.sh to compute the mean of the entire ImageNet data from the generated LMDB. Here too, define CAFFE_BIN at the very top of the script as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi ./examples/imagenet/make_imagenet_mean.sh
source /opt/DL/caffe-nv/bin/caffe-activate
export CAFFE_BIN=/opt/DL/caffe-nv/bin
...

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ time ./examples/imagenet/make_imagenet_mean.sh

Next, edit solver.prototxt. First copy the bvlc_alexnet directory under /opt/DL/caffe-nv/models to the GPFS file system.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cp -r /opt/DL/caffe-nv/models/bvlc_alexnet .

Then adjust the directory names, max_iter, and so on in solver.prototxt as shown below.
Since batch_size will later be set to 2048, a max_iter of 1250 completes roughly 1250 x 2048 / 1280000 = 2 epochs of training.
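
The arithmetic can be checked with a few lines of Python. This is just a sketch: `epochs_trained` is a hypothetical helper, 1,280,000 is the rounded training-set size used above, and the `num_gpus` multiplier is my assumption for Caffe builds that treat batch_size as a per-GPU value (leave it at 1 otherwise).

```python
def epochs_trained(max_iter, batch_size, num_train_images, num_gpus=1):
    """Approximate number of passes over the training set.

    num_gpus is an assumption: set it above 1 only if your Caffe build
    applies batch_size to each GPU separately.
    """
    images_seen = max_iter * batch_size * num_gpus
    return images_seen / float(num_train_images)


if __name__ == "__main__":
    # max_iter and batch_size as set in the prototxt files of this post.
    print(epochs_trained(1250, 2048, 1280000))
```

With the values from this post the function returns 2.0, so reaching 20 epochs at this batch size would take a max_iter of about 12500.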

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi bvlc_alexnet/solver.prototxt
#net: "models/bvlc_alexnet/train_val.prototxt"
net: "bvlc_alexnet/train_val.prototxt"
...
#display: 20
display: 500
#max_iter: 100000
max_iter: 1250
...
#snapshot_prefix: "models/bvlc_alexnet/caffe_alexnet_train"
snapshot_prefix: "bvlc_alexnet/caffe_alexnet_train"

Next, edit bvlc_alexnet/train_val.prototxt as needed to raise or lower the batch_size of the train data, and update the various paths accordingly.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi bvlc_alexnet/train_val.prototxt
...
source: "examples/imagenet/ilsvrc12_train_lmdb"
# batch_size: 1024
batch_size: 2048
...

Now create a train_alexnet.sh like the following and run it.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi ./examples/imagenet/train_alexnet.sh
source /opt/DL/caffe-nv/bin/caffe-activate
export CAFFE_BIN=/opt/DL/caffe-nv/bin
set -e
$CAFFE_BIN/caffe train -gpu all --solver=bvlc_alexnet/solver.prototxt

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ nohup time ./examples/imagenet/train_alexnet.sh &

The resulting log is in nohup.out. As shown, these 1250 iterations (1250 x 2048 / 1280000, about 2 epochs) take only about 12 minutes.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ grep iter nohup.out
test_iter: 1000
max_iter: 1250
I1212 14:15:42.151552 82131 solver.cpp:242] Iteration 0 (0 iter/s, 24.031s/500 iter), loss = 6.91103
I1212 14:22:10.507652 82131 solver.cpp:242] Iteration 500 (1.28749 iter/s, 388.352s/500 iter), loss = 6.37464
I1212 14:26:19.506183 82131 solver.cpp:242] Iteration 1000 (2.00806 iter/s, 248.996s/500 iter), loss = 5.34417
I1212 14:27:50.453514 82131 solver.cpp:479] Snapshotting to binary proto file bvlc_alexnet/caffe_alexnet_train_iter_1250.caffemodel
I1212 14:27:51.540899 82131 sgd_solver.cpp:273] Snapshotting solver state to binary proto file bvlc_alexnet/caffe_alexnet_train_iter_1250.solverstate
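
To pull the iteration numbers and loss values out of nohup.out without grepping by eye, something like the following works on log lines in the format shown above. This is a rough sketch; the regular expression and the function name `parse_caffe_log` are my own and only target the "Iteration N ... loss = X" lines.

```python
import re
from datetime import datetime

# Matches glog-style lines such as:
# I1212 14:22:10.507652 82131 solver.cpp:242] Iteration 500 (...), loss = 6.37464
ITER_RE = re.compile(
    r"^I\d{4} (\d{2}:\d{2}:\d{2})\.\d+ .*Iteration (\d+) .*loss = ([\d.]+)"
)


def parse_caffe_log(lines):
    """Return a list of (time, iteration, loss) tuples from Caffe log lines."""
    rows = []
    for line in lines:
        m = ITER_RE.match(line)
        if m:
            stamp = datetime.strptime(m.group(1), "%H:%M:%S")
            rows.append((stamp, int(m.group(2)), float(m.group(3))))
    return rows


if __name__ == "__main__":
    with open("nohup.out") as f:
        for stamp, iteration, loss in parse_caffe_log(f):
            print(stamp.time(), iteration, loss)
```

Lines without an "Iteration ... loss" pair, such as the snapshot messages, are simply skipped.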

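Friday, September 15, 2017

Parallel processing with caffe and tensorflow using DDL in PowerAI 4.0

Let's look at how to actually use the DDL (Distributed Deep Learning) included in PowerAI 4.0.

For caffe, the DDL option is built into the IBM version of caffe (caffe-ibm), so there is no separate debian package to install. caffe-ibm uses OpenMPI internally, so the related libraries do have to be installed, but they are pulled in automatically when caffe-ibm is installed, so you don't need to worry about them.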

nimbix@JARVICENAE-0A0A1835:/data/mnist$ dpkg -l | grep openmpi
ii libopenmpi2-cuda:ppc64el 2.0.1-4ibm1 ppc64el high performance message passing library -- shared library
ii openmpi-bin-cuda 2.0.1-4ibm1 ppc64el high performance message passing library -- binaries
ii openmpi-common-cuda 2.0.1-4ibm1 all high performance message passing library -- common files
ii openmpi-doc-cuda 2.0.1-4ibm1 all high performance message passing library -- man pages

Once the CUDA-aware OpenMPI is installed as shown above, the MPI utility mpirun becomes available. mpirun is a chain of soft links that ends at the orterun command, which is ultimately provided by openmpi-bin-cuda, as shown below.

nimbix@JARVICENAE-0A0A1835:/data/mnist$ dpkg -S /usr/bin/orterun
openmpi-bin-cuda: /usr/bin/orterun

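Using DDL with the IBM version of caffe turns out to be simple. You only need to know the following four things.

1) Pass the -ddl option when running the caffe command
2) The train/validation datasets must exist at the same location (directory) on every server (a parallel file system or NFS is convenient)
3) Every server must allow passwordless ssh, set up with ssh-keygen and ssh-copy-id
4) To pass environment variables to the other server nodes, it is convenient to use the mpirun command

Everything else is easy, but item 1) may feel a little difficult. Leaving out the complicated parts, it boils down to this.

Using the DDL option does not mean caffe can work out your GPU servers and network environment by itself and optimize for them automatically. You have to describe that environment, i.e. the topology, to caffe yourself. That is what the mode of the -ddl option is for. A simple example: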

$ mpirun -x PATH -x LD_LIBRARY_PATH -n 12 -rf 4x3.rf caffe train -solver /data/mnist/lenet_solver.prototxt -gpu 0 -bvlc -ddl "-mode n:4x3x1 -dev_sync 1"

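- mpirun is the parallel-environment command that runs the same command on multiple server nodes with the same environment variables (the -x option).
- The file named 4x3.rf is a rank file. It holds the topology of the parallel server environment. How to create it is covered below.
- -n 12 is the total number of MPI clients; put simply, the number of GPUs you want to use for training.
- As for -gpu 0, you may wonder why a single GPU (gpu 0) is specified rather than 12. In an MPI environment, each GPU becomes one learner. So no matter how many GPUs are physically installed in one server, every learner is given -gpu 0, i.e. one GPU.
- In "-mode n:4x3x1", n means use NCCL (NVIDIA Collective Communications Library, pronounced "nickel"). 4x3x1 means 3 servers with 4 GPUs each, all in one rack. Which rack a server sits in is not important in itself, but in parallel supercomputer environments the servers within one rack are usually connected by a faster, lower-latency network, which is why the rack dimension is spelled out. If you had 5 racks, each holding 6 servers with 4 GPUs, that would be written 4x6x5.
- For dev_sync, 0 means do not sync between GPUs, 1 means sync when communication starts, and 2 means sync both at the start and at the end.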
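
The relationship between the mode string and -n can be sketched in a few lines of Python. These helpers (`parse_ddl_mode`, `learner_count`) are hypothetical names of mine and not part of DDL; they only illustrate that the product of the dimensions has to equal the number of MPI clients.

```python
def parse_ddl_mode(mode):
    """Split a DDL mode such as 'n:4x3x1' into (algorithm, dims)."""
    algo, _, topo = mode.partition(":")
    dims = tuple(int(d) for d in topo.split("x"))  # raises ValueError on junk
    if not algo or any(d < 1 for d in dims):
        raise ValueError("unexpected mode string: %r" % mode)
    return algo, dims


def learner_count(mode):
    """Number of learners (GPUs) implied by the topology in the mode string."""
    _, dims = parse_ddl_mode(mode)
    total = 1
    for d in dims:
        total *= d
    return total


if __name__ == "__main__":
    mode = "n:4x3x1"
    # The value given to mpirun -n should equal this number.
    print(mode, "->", learner_count(mode), "learners")
```

For the command above, n:4x3x1 gives 12 learners, which is why -n 12 is passed to mpirun.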

Wait, how do you tell it where the data is? The solver file specified above, lenet_solver.prototxt, names the neural network, and that network's prototxt file in turn specifies the data location. Like this.

$ vi lenet_solver.prototxt
#net: "examples/mnist/lenet_train_test.prototxt"
net: "/data/mnist/lenet_train_test.prototxt"

$ vi lenet_train_test.prototxt
...
# source: "examples/mnist/mnist_train_lmdb"
source: "/data/mnist/mnist_train_lmdb"
...
# source: "examples/mnist/mnist_test_lmdb"
source: "/data/mnist/mnist_test_lmdb"

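How do the learners running on each GPU across the server nodes divide up the data? Ideally, the data is partitioned in advance and suitably distributed to each server node. When N learners read the data, each one reads sequentially from its own shuffled list of file names; if the data has not been physically partitioned and distributed per node beforehand, one training run has the same effect as training on the entire dataset N times (N epochs). The storage for this data should be a parallel file system such as IBM Spectrum Scale (formerly GPFS) so that multiple nodes can access it at the same time, or, if that is not available, something like NFS even though performance will be lower.

Now let's see how to create the rf file, i.e. the rank file. You can write it by hand in an editor such as vi, but it is easier to generate it with the rank_gen.py shipped with PowerAI, as follows.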
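
As an illustration of what "partitioning in advance" could mean, here is a minimal sketch that deals a file list out to learners by rank. It is a generic technique and not how caffe-ibm itself distributes data; `shard_for_rank` and the file names are made up for the example.

```python
def shard_for_rank(items, rank, size):
    """Return the items that learner `rank` (0-based) out of `size` should read."""
    if not 0 <= rank < size:
        raise ValueError("rank must be in [0, size)")
    # Strided slicing gives disjoint shards that together cover every item.
    return items[rank::size]


if __name__ == "__main__":
    files = ["img_%03d.jpg" % i for i in range(10)]  # hypothetical file names
    for rank in range(4):
        print(rank, shard_for_rank(files, rank, 4))
```

Each learner then sees roughly 1/N of the data per pass, so one run across N learners covers the dataset once rather than N times.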

$ python /opt/DL/ddl/bin/rank_gen.py 4x2x3 sys-89074,sys-89075,sys-89076,sys-89077,sys-89078,sys-89079 > 4x2x3.rf

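The comma-separated names above are the server names. 4x2x3 means 6 servers (2 x 3) with 4 GPUs each, so exactly 6 server names must be given. The resulting 4x2x3.rf file is shown below. By default rank_gen.py assumes a Minsky server with two 10-core POWER8 chips, so the file describes 2 slots (sockets) of 10 cores each. Each rank, i.e. each GPU, is therefore assigned 5 cores of a slot (0-4 or 5-9). If your server has 8-core POWER8 chips instead, you have to edit the file by hand, changing 0-4 to 0-3 (and, by the same logic, 5-9 to 4-7).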

u0017649@sys-89075:~$ cat 4x2x3.rf
#2017-09-14 04:45:51 by rank_gen
#dims = 4x2x3
#host = sys-89074,sys-89075,sys-89076,sys-89077,sys-89078,sys-89079
#dimX = 4
#dimY = 2
#dimZ = 3
#sockets = 2
#cores = 10

rank 0=sys-89074 slot=0:0-4
rank 6=sys-89074 slot=0:5-9
rank 12=sys-89074 slot=1:0-4
rank 18=sys-89074 slot=1:5-9

rank 3=sys-89075 slot=0:0-4
rank 9=sys-89075 slot=0:5-9
rank 15=sys-89075 slot=1:0-4
rank 21=sys-89075 slot=1:5-9

rank 1=sys-89076 slot=0:0-4
rank 7=sys-89076 slot=0:5-9
rank 13=sys-89076 slot=1:0-4
rank 19=sys-89076 slot=1:5-9

rank 4=sys-89077 slot=0:0-4
rank 10=sys-89077 slot=0:5-9
rank 16=sys-89077 slot=1:0-4
rank 22=sys-89077 slot=1:5-9

rank 2=sys-89078 slot=0:0-4
rank 8=sys-89078 slot=0:5-9
rank 14=sys-89078 slot=1:0-4
rank 20=sys-89078 slot=1:5-9

rank 5=sys-89079 slot=0:0-4
rank 11=sys-89079 slot=0:5-9
rank 17=sys-89079 slot=1:0-4
rank 23=sys-89079 slot=1:5-9
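
For reference, the rank numbering and slot layout in the listing above can be reproduced with a short script. This is a sketch inferred only from the sample outputs in this post, not from the source of rank_gen.py, so the real script may behave differently in other cases; `make_rank_lines` is a hypothetical name.

```python
def make_rank_lines(dims, hosts, sockets=2, cores=10):
    """Build rank-file lines for an XxYxZ layout (inferred from sample output)."""
    x_dim, y_dim, z_dim = dims
    if len(hosts) != y_dim * z_dim:
        raise ValueError("need exactly Y*Z host names")
    if x_dim % sockets or cores % (x_dim // sockets):
        raise ValueError("this sketch only handles evenly divisible layouts")
    gpus_per_socket = x_dim // sockets
    cores_per_rank = cores // gpus_per_socket
    lines = []
    for i, host in enumerate(hosts):
        # Hosts are listed rack by rack: Y hosts of rack 0, then rack 1, ...
        z, y = divmod(i, y_dim)
        for x in range(x_dim):
            rank = x * y_dim * z_dim + y * z_dim + z
            socket, idx = divmod(x, gpus_per_socket)
            first = idx * cores_per_rank
            last = first + cores_per_rank - 1
            lines.append("rank %d=%s slot=%d:%d-%d" % (rank, host, socket, first, last))
    return lines


if __name__ == "__main__":
    names = ["sys-8907%d" % n for n in range(4, 10)]
    print("\n".join(make_rank_lines((4, 2, 3), names)))
```

Passing cores=8 yields the 0-3 and 4-7 core ranges for 8-core chips without any hand editing.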

Caffe is that easy, but tensorflow is somewhat harder. PowerAI 4.0 includes a separate debian package called ddl-tensorflow, but it is not strictly required for tensorflow DDL; rather, it provides scripts based on Google Slim and example files that make tensorflow DDL easier to use. tensorflow itself has to be installed separately, which you can of course do with apt-get install using the tensorflow provided by PowerAI.

$ sudo apt-get install ddl-tensorflow tensorflow

$ dpkg -L ddl-tensorflow
/.
/opt
/opt/DL
/opt/DL/ddl-tensorflow
/opt/DL/ddl-tensorflow/examples
/opt/DL/ddl-tensorflow/examples/mnist
/opt/DL/ddl-tensorflow/examples/mnist/ddl_mnist.py
/opt/DL/ddl-tensorflow/examples/mnist/README.md
/opt/DL/ddl-tensorflow/examples/slim
/opt/DL/ddl-tensorflow/examples/slim/BUILD
/opt/DL/ddl-tensorflow/examples/slim/WORKSPACE
/opt/DL/ddl-tensorflow/examples/slim/scripts
/opt/DL/ddl-tensorflow/examples/slim/scripts/finetune_inception_resnet_v2_on_flowers.sh
/opt/DL/ddl-tensorflow/examples/slim/scripts/train_lenet_on_mnist.sh
/opt/DL/ddl-tensorflow/examples/slim/scripts/finetune_resnet_v1_50_on_flowers.sh
/opt/DL/ddl-tensorflow/examples/slim/scripts/finetune_inception_v3_on_flowers.sh
/opt/DL/ddl-tensorflow/examples/slim/scripts/train_cifarnet_on_cifar10.sh
/opt/DL/ddl-tensorflow/examples/slim/scripts/finetune_inception_v1_on_flowers.sh
/opt/DL/ddl-tensorflow/examples/slim/train-alexnet.sh
/opt/DL/ddl-tensorflow/examples/slim/deployment
/opt/DL/ddl-tensorflow/examples/slim/deployment/__init__.py
...

To use ddl-tensorflow, source ddl-tensorflow-activate as shown below so that PYTHONPATH and the other environment variables are set.

$ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate

You can now install the samples into a directory of your choice with the ddl-tensorflow-install-samples command.

nimbix@JARVICENAE-0A0A1835:~$ ddl-tensorflow-install-samples /data
Write into existing directory /data? (yN)
y
Copying examples/ into /data...
Success

The simplest one included is MNIST, which recognizes handwritten digits.

nimbix@JARVICENAE-0A0A1835:~$ cd /data/examples/mnist

nimbix@JARVICENAE-0A0A1835:/data/examples/mnist$ ls
ddl_mnist.py README.md

As this shows, tensorflow is less a command than a library called from python, so multi-node parallel processing ultimately requires writing a python script along the lines of ddl_mnist.py above.

For now, assume an environment with two 4-GPU servers (sys-89074 and sys-89075) and create the rank file first, as below.

nimbix@JARVICENAE-0A0A1835:/data/examples/mnist$ python /opt/DL/ddl/bin/rank_gen.py 4x2x1 sys-89074,sys-89075 > 4x2.rf

nimbix@JARVICENAE-0A0A1835:/data/examples/mnist$ cat 4x2.rf
#2017-09-15 03:19:14 by rank_gen
#dims = 4x2x1
#host = sys-89074,sys-89075
#dimX = 4
#dimY = 2
#dimZ = 1
#sockets = 2
#cores = 10

rank 0=sys-89074 slot=0:0-4
rank 2=sys-89074 slot=0:5-9
rank 4=sys-89074 slot=1:0-4
rank 6=sys-89074 slot=1:5-9

rank 1=sys-89075 slot=0:0-4
rank 3=sys-89075 slot=0:5-9
rank 5=sys-89075 slot=1:0-4
rank 7=sys-89075 slot=1:5-9

Now run it as follows.

nimbix@JARVICENAE-0A0A1835:/data/examples/mnist$ mpirun -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -n 8 -rf 4x2.rf python ddl_mnist.py

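(In fact, mnist uses such a small dataset that parallelizing it is pointless. Perhaps for that reason, ddl_mnist.py cannot run with the 4x2 layout from my example above; as you will see further down, it is hard-coded to -mode r:2, so it can only be parallelized across 2 GPUs.)

The real question is how to write the python script so that tensorflow runs in parallel, and since I am not a developer I cannot be of much help there. (Honestly, to me the white parts are text, the black parts are blank space, and the blinking thing is the cursor; that is about all I can make out.)

Instead, although it is somewhat long, I am posting the contents of the ddl_mnist.py included in PowerAI below.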

nimbix@JARVICENAE-0A0A1835:/data/examples/mnist$ vi ddl_mnist.py

import tensorflow as tf
import numpy as np

############################################################################
# IBM PowerAI Distributed Deep Learning (DDL) setup
############################################################################

# Disable GPU memory preallocation
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

############################################################################
# DDL Initialize BEGIN
############################################################################
# Load DDL operator
ddl = tf.load_op_library('/opt/DL/ddl-tensorflow/lib/ddl_MDR.so')

# DDL initializes MPI on CPU
# ddl.init takes two inputs
# 1) the number of GPUs to utilize on each host in training.
#    this number is not the number of GPUs to use for each learner. It simply tells DDL that there are X GPUs in each host to be used for training
# 2) DDL options (refer to README for details)
with tf.Session(config=config) as sess:
    with tf.device('/cpu:0'):
        rank, size, gpuid = sess.run(ddl.init(2, mode = '-mode r:2 -dump_iter 100'))

# MPI info and assigned GPU
print [rank, size, gpuid]
############################################################################
# DDL Initialize END
############################################################################

# Perform all TensorFlow computation within gpuid
with tf.device('/gpu:%d' %gpuid):
    ##############################################################################
    # Import MNIST data

    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

    # Parameters
    learning_rate = 0.001
    training_iters = 200000
    batch_size = 100
    display_step = 1

    # Network Parameters
    n_input = 784 # MNIST data input (img shape: 28*28)
    n_classes = 10 # MNIST total classes (0-9 digits)
    dropout = 0.75 # Dropout, probability to keep units

    # tf Graph input
    x = tf.placeholder(tf.float32, [None, n_input])
    y = tf.placeholder(tf.float32, [None, n_classes])
    keep_prob = tf.placeholder(tf.float32) #dropout (keep probability)

    # Create some wrappers for simplicity
    def conv2d(x, W, b, strides=1):
        # Conv2D wrapper, with bias and relu activation
        x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
        x = tf.nn.bias_add(x, b)
        return tf.nn.relu(x)

    def maxpool2d(x, k=2):
        # MaxPool2D wrapper
        return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1],
                              padding='SAME')

    # Create model
    def conv_net(x, weights, biases, dropout):
        # Reshape input picture
        x = tf.reshape(x, shape=[-1, 28, 28, 1])

        # Convolution Layer
        conv1 = conv2d(x, weights['wc1'], biases['bc1'])
        # Max Pooling (down-sampling)
        conv1 = maxpool2d(conv1, k=2)

        # Convolution Layer
        conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
        # Max Pooling (down-sampling)
        conv2 = maxpool2d(conv2, k=2)

        # Fully connected layer
        # Reshape conv2 output to fit fully connected layer input
        fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
        fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
        fc1 = tf.nn.relu(fc1)
        # Apply Dropout
        fc1 = tf.nn.dropout(fc1, dropout)

        # Output, class prediction
        out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
        return out

    # Store layers weight & bias
    weights = {
        ############################################################################
        # DDL BROADCAST BEGIN
        ############################################################################
        # This step ensures that all learners start with the same initial parameters

        # 5x5 conv, 1 input, 32 outputs
        'wc1': tf.Variable(ddl.bcast(tf.random_normal([5, 5, 1, 32]))),
        # 5x5 conv, 32 inputs, 64 outputs
        'wc2': tf.Variable(ddl.bcast(tf.random_normal([5, 5, 32, 64]))),
        # fully connected, 7*7*64 inputs, 1024 outputs
        'wd1': tf.Variable(ddl.bcast(tf.random_normal([7*7*64, 1024]))),
        # 1024 inputs, 10 outputs (class prediction)
        'out': tf.Variable(ddl.bcast(tf.random_normal([1024, n_classes])))
        ############################################################################
        # DDL BROADCAST END
        ############################################################################
    }

    biases = {
        'bc1': tf.Variable(ddl.bcast(tf.random_normal([32]))),
        'bc2': tf.Variable(ddl.bcast(tf.random_normal([64]))),
        'bd1': tf.Variable(ddl.bcast(tf.random_normal([1024]))),
        'out': tf.Variable(ddl.bcast(tf.random_normal([n_classes])))
    }

    # Construct model
    pred = conv_net(x, weights, biases, keep_prob)

    # Define loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

    ############################################################################
    # DDL ALLREDUCE BEGIN
    ############################################################################

    # Collect the gradients and the corresponding parameters w.r.t the given cost
    grads_and_vars = optimizer.compute_gradients(cost)

    # Separate out the tuple
    grads, vars = zip(*grads_and_vars)

    # This step takes the average of the gradients on all the learners
    grads_and_vars_ddl = zip(ddl.all_reduce_n(grads, op='avg'), vars)

    # Update the parameters with the averaged gradient
    objective = optimizer.apply_gradients(grads_and_vars_ddl)

    ############################################################################
    # DDL ALLREDUCE END
    ############################################################################

    # Evaluate model
    correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    ##############################################################################

def split(a, n):
    k, m = divmod(len(a), n)
    return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in xrange(n))

# Launch the graph
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    step = 1
    # Keep training until reach max iterations
    while step * batch_size < training_iters:

        # Each learner will read batch_size*size samples and
        # use only the portion corresponding to the current learner (or rank)

        batch_x, batch_y = mnist.train.next_batch(batch_size*size)

        batch_x = np.split(batch_x,size)[rank]
        batch_y = np.split(batch_y,size)[rank]

        # Run optimization op (backprop)
        sess.run(objective, feed_dict={x: batch_x, y: batch_y,
                                       keep_prob: dropout})
        if step % display_step == 0:
            # Calculate batch loss and accuracy
            loss, acc = sess.run([cost, accuracy], feed_dict={x: batch_x,
                                                              y: batch_y,
                                                              keep_prob: 1.})
            print("MPI "+str(rank)+"] Iter " + str(step*batch_size) + ", Minibatch Loss= " + \
                  "{:.6f}".format(loss) + ", Training Accuracy= " + \
                  "{:.5f}".format(acc))
        step += 1

    print("MPI "+str(rank)+"] Optimization Finished!")

    # Calculate accuracy for 256 mnist test images
    print("MPI "+str(rank)+"] Testing Accuracy:", \
          sess.run(accuracy, feed_dict={x: mnist.test.images[:256],
                                        y: mnist.test.labels[:256],
                                        keep_prob: 1.}))
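
For those who, like me, find the script hard to read: the key DDL step is the all-reduce with op='avg', which per the comment in the script takes the average of the gradients on all the learners. The sketch below imitates only that idea in plain Python inside a single process; it does not call DDL or MPI, and `all_reduce_avg` is a made-up name.

```python
def all_reduce_avg(per_learner_grads):
    """Average same-length gradient lists element-wise across learners."""
    n = len(per_learner_grads)
    if n == 0:
        raise ValueError("need at least one learner")
    # zip(*...) groups the k-th gradient of every learner together.
    return [sum(vals) / float(n) for vals in zip(*per_learner_grads)]


if __name__ == "__main__":
    # Two hypothetical learners, each holding gradients for two parameters.
    learner0 = [1.0, 2.0]
    learner1 = [3.0, 6.0]
    print(all_reduce_avg([learner0, learner1]))
```

After this step every learner applies the same averaged gradient, which keeps the model copies identical given that they started from the same broadcast weights.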

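Wednesday, September 13, 2017

Caffe Distributed Deep Learning (DDL) in IBM PowerAI 4.0

Tensorflow has long supported distributed tensorflow, which made distributed training possible using multiple GPUs installed across multiple servers. The IBM PowerAI toolkit also includes ddl-tensorflow.

Caffe, unlike tensorflow, has no officially supported distributed mode, so training was possible only on a single server. Of course, individual companies and research labs did develop and use their own in-house versions of distributed caffe that were never released as open source.

The IBM version of caffe included in the recently released IBM PowerAI 4.0 supports a Distributed Deep Learning (DDL) option. Built on OpenMPI, it lets caffe process one large model in a distributed fashion across multiple servers.

Concretely, it is provided as a -ddl option added to the caffe command. The details are described at the link below.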

https://public.dhe.ibm.com/software/server/POWER/Linux/mldl/ubuntu/README.html

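The problem is that the option parameters described there are not explained well enough. For example, when you write -ddl "-mode b:4x3", it is hard to tell from that link alone what b is, what 4 is, and what 3 is.

To explain -ddl "-mode b:4x3": b means use the enhanced NCCL library, and 4x3 means use 3 servers with 4 GPUs each.

Likewise, -ddl "-mode r:2x8" means use only the RING configuration, with 8 servers of 2 GPUs each.

When I asked IBM headquarters why this is not explained, the answer was: "It is not on the Internet, but everything is explained in the /opt/DL/ddl/doc/README.md file that is created on the Minsky server when PowerAI 4.0 is installed."

So I am posting the contents of that file here so that more people can read it easily.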

# Overview

IBM PowerAI Distributed Deep Learning (or DDL) is a MPI-based
communication library, which is specifically optimized for Deep Learning
training. An application integrated with DDL becomes a MPI-application,
which will allow the use of the `mpirun` command to invoke the job in
parallel across a cluster of systems. DDL understands multi-tier network
environment and uses different libraries (e.g. NCCL) and algorithms to
get the best performance in multi-node, multi-GPU environments.

IBM PowerAI Distributed Deep Learning considers each GPU in a cluster as
an individual "learner". The overall set of learners is described to
IBM PowerAI Distributed Deep Learning in terms of 3 dimensions (X-Y-Z)
that correspond to a multi-tier network hierarchy. The recommended
mapping is:

- X for within-host (e.g. number of GPUs per host for multi-GPU hosts)
- Y for between nearby-hosts (e.g. number of hosts in a single rack)
- Z for between distant-hosts (e.g. number of racks)

For example, 256 learners can be configured as 4x8x8 or 4x16x4 and so on.

**Example: 2 racks of 8 S822LC for HPC systems with 4 GPUs each**

In this setup, there are 64 learners (4 GPUs in each of 8 hosts in each
of 2 racks) and a simple description would be 4x8x2.

If this configuration includes a truly hierarchical network setup--for
example a high-speed, low-latency network within each rack, and a
slower, higher-latency network between the racks--then 4x8x2 might be
the optimal description.

But if the network configuration is not actually hierarchical--if all
the hosts are connected to a "flat" network regardless of the physical
racking--then a 4x4x4 description may perform better than 4x8x2. Some
experimentation may be needed to find the optimal description.

# Required Libraries

Pre-requisite packages required for IBM PowerAI Distributed Deep
Learning are provided with PowerAI:

1. OpenMPI with CUDA Support
2. NVIDIA NCCL

# Integration with Caffe and TensorFlow

IBM PowerAI Distributed Deep Learning has been integrated with the
PowerAI IBM Caffe and TensorFlow packages. `mpirun` must be used to
launch training using the IBM PowerAI Distributed Deep Learning
integration. General information about `mpirun` is available on the
OpenMPI website
[https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php](https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php).

1. Caffe

IBM PowerAI Distributed Deep Learning is directly integrated into
Caffe, and can be exercised by adding the following to the command line.

-ddl "DDL_OPTIONS HERE"
2. TensorFlow

DDL is indirectly integrated into TensorFlow in the form of custom
operator. The custom operator is provided as a shared library, which is
loaded and invoked in the python training script.

The PowerAI ddl-tensorflow package provides an example training
setup based on the TensorFlow-Slim model library from the TensorFlow
models repository. Those can be found on your system in:

/opt/DL/ddl-tensorflow/examples/

More details on IBM PowerAI Distributed Deep Learning integration
into TensorFlow, can be found in

/opt/DL/ddl-tensorflow/doc/README.md

# Using IBM PowerAI Distributed Deep Learning

IBM PowerAI Distributed Deep Learning takes advantage of the network
topology to perform communication quickly. Network topology is described
to IBM PowerAI Distributed Deep Learning in two ways, through an MPI
rank file and via DDL options.

## MPI rank file

A rank file is a standard file that maps MPI clients (IBM PowerAI
Distributed Deep Learning learners) to specific hosts, sockets, and
cores. To get the best performance from IBM PowerAI Distributed Deep
Learning, it is crucial to generate an optimally mapped rank file. To
help with this effort, a script (`rank_gen.py`) is provided to
automatically generate rank files that are appropriate for most S822LC
systems. The script takes two inputs: the network decomposition and a
list of comma-separated hostnames.

**How to use rank file generator script**

$ python rank_gen.py XxYxZ host_list > rank_file

Here, `XxYxZ` specifies the topology of the GPU and multi-tier network
hierarchy.

For example, for 64 learners (e.g. 16 hosts each with 4 GPUs), any of
4x16x1, 4x8x2, or 4x4x4 might be reasonable choices, depending on the
network topology. All 3 dimensions must be specified (use 1 to fill any
spaces).
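
As a small illustration, the candidate `XxYxZ` descriptions for a given cluster can be enumerated with a helper like the one below. This is only a sketch with a hypothetical function name (`candidate_topologies`); it assumes X is fixed to the number of GPUs per host and Y times Z equals the number of hosts, and it says nothing about which candidate performs best.

```python
def candidate_topologies(gpus_per_host, num_hosts):
    """List XxYxZ strings with X = gpus_per_host and Y * Z = num_hosts."""
    out = []
    for z in range(1, num_hosts + 1):
        if num_hosts % z == 0:
            out.append("%dx%dx%d" % (gpus_per_host, num_hosts // z, z))
    return out


if __name__ == "__main__":
    # 16 hosts with 4 GPUs each, i.e. 64 learners.
    for topo in candidate_topologies(4, 16):
        print(topo)
```

For 16 hosts with 4 GPUs each, the list starts with 4x16x1, 4x8x2 and 4x4x4, the three choices mentioned above.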

`host_list` is a comma separated list of host names (e.g.
host1,host2,host3,...). It must contain `Y` times `Z` hostnames,
ordered "by Z". For example, a 4x2x2 configuration with 2 racks of 2
hosts each might have a host list of: `r1h1,r1h2,r2h1,r2h2`. The
hostnames provided in the rankfile should match the system hostnames.

It is possible in a distributed environment to have more than one
interface for each host. In such a scenario, OpenMPI by default, uses
any and all interfaces that are "up" to communicate with a host. To
avoid problems in such cases you can tell OpenMPI to use given
interface. E.g.:

$ mpirun --mca btl_tcp_if_include ib0 ...

$ mpirun --mca btl_tcp_if_exclude lo,enp1s0f2 ...

More details available on OpenMPI FAQ page:
[https://www.open-mpi.org/faq/?category=tcp#tcp-selection](https://www.open-mpi.org/faq/?category=tcp#tcp-selection)

**Parameters for optimal rankfile**

An optimal rank file will depend on the number of sockets or nodes in
the system and the number of cores per socket/node. The `numactl` and
`ppc64_cpu` commands can help determine this information.

1. Number of sockets and thread slots for each socket.

`numactl -H` shows the number of sockets ("nodes") in a system,
and also lists the CPUs (thread slots) for each. For example:

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 0 size: 261788 MB
node 0 free: 6042 MB
node 1 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 1 size: 261334 MB
node 1 free: 158805 MB
node distances:
node 0 1
0: 10 40
1: 40 10

Here the system has two sockets with 80 thread slots each.

2. Mapping between physical cores and CPUs/thread slots.

$ ppc64_cpu --info
Core 0: 0* 1* 2* 3* 4* 5* 6* 7*
Core 1: 8* 9* 10* 11* 12* 13* 14* 15*
Core 2: 16* 17* 18* 19* 20* 21* 22* 23*
Core 3: 24* 25* 26* 27* 28* 29* 30* 31*
Core 4: 32* 33* 34* 35* 36* 37* 38* 39*
Core 5: 40* 41* 42* 43* 44* 45* 46* 47*
Core 6: 48* 49* 50* 51* 52* 53* 54* 55*
Core 7: 56* 57* 58* 59* 60* 61* 62* 63*
Core 8: 64* 65* 66* 67* 68* 69* 70* 71*
Core 9: 72* 73* 74* 75* 76* 77* 78* 79*
Core 10: 80* 81* 82* 83* 84* 85* 86* 87*
Core 11: 88* 89* 90* 91* 92* 93* 94* 95*
Core 12: 96* 97* 98* 99* 100* 101* 102* 103*
Core 13: 104* 105* 106* 107* 108* 109* 110* 111*
Core 14: 112* 113* 114* 115* 116* 117* 118* 119*
Core 15: 120* 121* 122* 123* 124* 125* 126* 127*
Core 16: 128* 129* 130* 131* 132* 133* 134* 135*
Core 17: 136* 137* 138* 139* 140* 141* 142* 143*
Core 18: 144* 145* 146* 147* 148* 149* 150* 151*
Core 19: 152* 153* 154* 155* 156* 157* 158* 159*

Here the system has 20 physical cores with 8 thread slots/CPUs each. The
thread slot numbers match with the numbers in the `numactl` output. The
asterisks indicate which thread slots are enabled.

The rankfile only cares about cores (not CPUs/thread slots), and the
core numbering is relative to the node/socket (which is named
"slot" in the rankfile). So in rank file terms, this system has socket 0
cores 0-9 and socket 1 cores 0-9.
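
The socket and core counts can also be derived from the `numactl -H` text with a short script. This is a hedged sketch: `sockets_and_cores` is a hypothetical helper, and the default of 8 threads per core is an assumption taken from the `ppc64_cpu --info` output above (SMT8).

```python
import re


def sockets_and_cores(numactl_text, threads_per_core=8):
    """Return {node_id: core_count} parsed from `numactl -H` output."""
    result = {}
    for line in numactl_text.splitlines():
        m = re.match(r"\s*node (\d+) cpus:\s*(.*)", line)
        if m:
            cpu_ids = m.group(2).split()
            # Each physical core exposes threads_per_core thread slots.
            result[int(m.group(1))] = len(cpu_ids) // threads_per_core
    return result


if __name__ == "__main__":
    import subprocess
    text = subprocess.check_output(["numactl", "-H"]).decode()
    print(sockets_and_cores(text))
```

On the system shown above this would report 10 cores for each of the two nodes, matching "socket 0 cores 0-9 and socket 1 cores 0-9".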

**Note:** If the number of cores specified in the rankfile exceeds the
actual number of cores, `mpirun` will fail with a non-obvious message.
For example, on a machine with 8-cores per socket:

$ cat 2x10core.rf
rank 0=host1 slot=0:0-9
rank 1=host1 slot=1:0-9

$ mpirun -n 2 -rf 2x10core.rf /bin/true
[host1:46256] [[20503,0],0] ORTE_ERROR_LOG: Not found in file rmaps_rank_file.c at line 320
[host1:46256] [[20503,0],0] ORTE_ERROR_LOG: Not found in file base/rmaps_base_map_job.c at line 351
$

Versus the working:

$ cat 2x8core.rf
rank 0=host1 slot=0:0-7
rank 1=host1 slot=1:0-7

$ mpirun -n 2 -rf 2x8core.rf /bin/true
$
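
Because the `mpirun` failure message is not obvious, it may be worth checking a rank file before launching. The following is a minimal sketch with hypothetical names (`RANK_RE`, `out_of_range_ranks`); it only checks that the last core in each `slot=S:A-B` entry is below the number of cores per socket.

```python
import re

# Matches lines such as: rank 0=host1 slot=0:0-9
RANK_RE = re.compile(r"rank (\d+)=(\S+) slot=(\d+):(\d+)-(\d+)")


def out_of_range_ranks(rankfile_text, cores_per_socket):
    """Return the ranks whose core range exceeds cores_per_socket."""
    bad = []
    for line in rankfile_text.splitlines():
        m = RANK_RE.match(line.strip())
        if m and int(m.group(5)) >= cores_per_socket:
            bad.append(int(m.group(1)))
    return bad


if __name__ == "__main__":
    text = "rank 0=host1 slot=0:0-9\nrank 1=host1 slot=1:0-9\n"
    print(out_of_range_ranks(text, cores_per_socket=8))
```

An empty list means every rank fits; comment lines and blank lines in the rank file are ignored.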

The `-report-bindings` flag may be useful for diagnosing problems:

$ mpirun -report-bindings ......

## DDL options

There are a number of runtime options for the DDL engine. The options are:

`-mode`: This optionally indicates the algorithm and topology. The topology
should match to the rank assignment (e.g. via rankfile) to get the best
performance. If a mode is not provided, it will work as a single ring
configuration (e.g., r:N). Therefore, the total number of MPI clients
(specified as -n N to mpirun) must match with the number of learners in the
topology (specified as -mode in DDL): otherwise, it will show an error like
`invalid dim size=A usr_size=B dim[0]=...`

b:4x2 => use enhanced NCCL whenever possible (otherwise use ring) for 4x2 configuration

n:4x2 => use NCCL whenever possible (otherwise use ring) for 4x2 configuration

r:4x4 => use only RING for 4x4 configuration

m:4x6 => use only MPI reduce_scatter and all_gatherV for 4x6 configuration (currently disabled)

c:4x8 => use only RCS for 4x8 configuration

p:4x16x4 => first activate ex"p"loration mode to get the best algorithms for each dimension of 4x16x4

`-dump_iter <N>`: This optionally makes DDL dump network performance on
every N iterations

`-dev_sync <0, 1, or 2>`: This optionally calls cudaDeviceSynchronize
to minimize jitter. The default is 0 (off). With 1, it
invokes sync once at the beginning of communication. With 2, it invokes
sync twice, at the beginning and at the end of communication

`-rebind_iter <N>`: This optionally monitors variation every N
iterations, and performs a rebind if a learner has been slow for the last 3
checks. Too small a value incurs variation-check overhead, while too
large a value lets training suffer from variation for a long time

`-dbg_level <0,1,2>`: 0 for no, 1 for mild, and 2 for detailed debug
messages

When `-dump_iter` is given, DDL periodically prints output like the
following, which shows which learner has the maximum jitter and the
end-to-end DDL elapsed time. It also shows, per dimension, a runtime
breakdown along with the algorithm selected for that dimension.

![Alt text](ddl_dump.png?raw=true "DDL dump")
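
For the `-mode` option, the number of MPI clients must equal the number of learners in the topology, so it can help to compute that number from the mode string before launching. The following is a minimal sketch, assuming the learner count is the product of the dimensions after the colon (so `b:4x2` describes 8 learners). The helper names are hypothetical and are not part of DDL.

```python
import math


def learners_in_mode(mode):
    """Return the learner count implied by a mode string such as 'b:4x2'.

    Assumption: the count is the product of the 'x'-separated dimensions.
    """
    _algorithm, _sep, topology = mode.partition(":")
    dims = [int(d) for d in topology.split("x")]
    return math.prod(dims)


def check_mpi_clients(mode, n_clients):
    """Raise ValueError if `mpirun -n` would not match the topology."""
    expected = learners_in_mode(mode)
    if expected != n_clients:
        raise ValueError(
            f"mode {mode!r} describes {expected} learners, "
            f"but mpirun -n is {n_clients}"
        )
    return expected


if __name__ == "__main__":
    for mode in ("b:4x2", "r:4x4", "p:4x16x4"):
        print(mode, learners_in_mode(mode))
```

Running such a check up front gives a clearer message than the `invalid dim size=A usr_size=B` error that DDL reports on a mismatch.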

**Example of 2 racks of 8 S822LC HPC systems with 4 GPUs on each host**

Generate an appropriate rank file:

$ python rank_gen.py 4x8x2 host0,host1,host2,...,host15 > 4x8x2.rf

To start IBM Caffe with `mpirun`, specifying rank file and DDL options:

$ source /opt/DL/caffe/bin/caffe-activate

$ mpirun -x PATH -x LD_LIBRARY_PATH -n 16 -rf 4x8x2.rf caffe train -solver solver.prototxt -gpu 0 -bvlc -ddl "-mode b:4x8x2 -dump_iter 100"

To start TensorFlow with `mpirun` using custom operator for DDL:

- Update `train_image_classifier.py` to specify DDL options during
initialization:

ddl.Init(4, mode="-mode b:4x8x2 -dump_iter 100")

- Execute with `mpirun`:

$ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate

$ mpirun -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -n 16 -rf 4x8x2.rf python train_image_classifier.py ...

GPU capacity sizing for inference systems, and the Large Model Support (LMS) option in IBM caffe

Today's topic is inference, and specifically how to size a GPU system for inference. The concrete case here is running inference on image data with caffe. Along the way, we will also look at what you gain from -lms (Large Model Support), an option available only on IBM Minsky servers.

The basic method is described at the site below. I asked the PhDs on IBM China's Deep Learning development team, and they confirmed that this approach is correct.

https://stackoverflow.com/questions/36867591/how-to-estimate-inference-time-from-average-forward-pass-time-in-caffe

The key part is the following.

For instance, if I run the default command that comes with Caffe:

build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --gpu=0
I get the following output

...
I0426 13:07:32.701490 30417 layer_factory.hpp:77] Creating layer data
I0426 13:07:32.701513 30417 net.cpp:91] Creating Layer data
I0426 13:07:32.701529 30417 net.cpp:399] data -> data
I0426 13:07:32.709048 30417 net.cpp:141] Setting up data
I0426 13:07:32.709079 30417 net.cpp:148] Top shape: 10 3 227 227 (1545870)
I0426 13:07:32.709084 30417 net.cpp:156] Memory required for data: 6183480
...
I0426 13:07:34.390281 30417 caffe.cpp:377] Average Forward pass: 16.7818 ms.
I0426 13:07:34.390290 30417 caffe.cpp:379] Average Backward pass: 12.923 ms.
I0426 13:07:34.390296 30417 caffe.cpp:381] Average Forward-Backward: 29.7969 ms.
The following line:

I0426 13:07:32.709079 30417 net.cpp:148] Top shape: 10 3 227 227 (1545870)
is super important. It says that your input layer is 10x3x227x227-dimensional. In this case, the batch size is 10 images, each of size 3x227x227 (the 3 refers to each of the rgb channels in an image).

So effectively, it took 1.67818 ms/image to do a forward pass or inference time per image.

In other words, the Average Forward pass time reported by time, the benchmarking sub-command of caffe, is the inference time for that model and that input, measured for the whole batch; divide it by the batch size to get the time per image. Naturally, the larger the Top shape of the model's data layer (here 10 3 227 227, i.e. batch size 10 x 3 RGB channels x height 227 x width 227), the longer it takes.
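
The arithmetic above (Average Forward pass divided by the batch size taken from the first Top shape line) can be scripted. The following is a minimal sketch, assuming the log format shown in the quoted output; `per_image_forward_ms` is a hypothetical helper, not part of caffe.

```python
import re


def per_image_forward_ms(log_text):
    """Return (batch_size, forward_ms_per_image) from `caffe time` output.

    Assumptions: the first "Top shape:" line belongs to the input layer and
    its first number is the batch size; "Average Forward pass:" is in ms.
    """
    shape = re.search(r"Top shape: (\d+)", log_text)
    forward = re.search(r"Average Forward pass: ([\d.]+) ms", log_text)
    if shape is None or forward is None:
        raise ValueError("log does not look like `caffe time` output")
    batch_size = int(shape.group(1))
    return batch_size, float(forward.group(1)) / batch_size


if __name__ == "__main__":
    sample = (
        "I0426 13:07:32.709079 30417 net.cpp:148] Top shape: 10 3 227 227 (1545870)\n"
        "I0426 13:07:34.390281 30417 caffe.cpp:377] Average Forward pass: 16.7818 ms.\n"
    )
    batch, ms = per_image_forward_ms(sample)
    print(f"batch={batch}, {ms:.5f} ms/image")
```

In practice the input would be the captured output of a `caffe time` run rather than an inline string.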

I had the chance to use a single-P100 virtual machine on a Minsky server provided by Nimbix (nimbix.net/powerai), an HPC cloud service provider, so I ran this test there. For reference, Nimbix offers docker-based virtual machines with NVLink P100 GPUs, which I hope to cover in a later post.

First, let's see how long an NVLink P100 takes to run GoogleNet inference on a single 1200x1200 image. To do this, edit the deploy.prototxt that ships with GoogleNet as shown below. The original line is commented out with # underneath.

nimbix@JARVICENAE-0A0A1844:/data$ vi bvlc_googlenet/deploy.prototxt
name: "GoogleNet"
layer {
name: "data"
type: "Input"
top: "data"
input_param { shape: { dim: 1 dim: 3 dim: 1200 dim: 1200 } }
# input_param { shape: { dim: 10 dim: 3 dim: 224 dim: 224 } }
}

Now run caffe time with the modified model.

nimbix@JARVICENAE-0A0A1844:/data$ caffe time -gpu 0 -model=/data/bvlc_googlenet/deploy.prototxt --iterations=1

There is no need to read through the whole output. All that matters is the Average Forward pass time in the benchmark results at the very end.

I0908 05:39:36.830621 567 caffe.cpp:513] prob forward: 0.020864 ms.
I0908 05:39:36.830627 567 caffe.cpp:516] prob backward: 0.00368 ms.
I0908 05:39:36.830641 567 caffe.cpp:521] Average Forward pass: 45.3671 ms.
I0908 05:39:36.830649 567 caffe.cpp:523] Average Backward pass: 102.551 ms.
I0908 05:39:36.830657 567 caffe.cpp:525] Average Forward-Backward: 150.178 ms.
I0908 05:39:36.830673 567 caffe.cpp:527] Total Time: 150.178 ms.
I0908 05:39:36.830689 567 caffe.cpp:528] *** Benchmark ends ***

If we had set the batch size (the first dim) to 10, we would have to divide that Average Forward pass time by 10. Since we set dim to 1, we can use the number as is. In other words, GoogleNet inference on one 3-channel RGB 1200x1200 image takes about 0.045 seconds on a P100 GPU.

The benchmark output from the test above gives a rough picture of how Deep Learning is structured. As shown below, the Top shape starts at 1 x 3 x 1200 x 1200, becomes 1 x 64 x 600 x 600 at the next step, then 300 x 300, and keeps halving until it ends up at 31 x 31. By the last stage the channel count has grown to no less than 1024, the meaning of which escapes an ignorant HW engineer like me. What matters to a HW engineer is the memory this requires. After each top shape, the log prints a 'Memory required for data' figure, a running total over all layers so far: it starts at about 17MB and reaches nearly 1.6GB by the final stage.

...
I0908 05:39:25.035709 567 net.cpp:135] Top shape: 1 3 1200 1200 (4320000)
I0908 05:39:25.035733 567 net.cpp:143] Memory required for data: 17280000
I0908 05:39:25.035754 567 layer_factory.hpp:77] Creating layer conv1/7x7_s2
I0908 05:39:25.035786 567 net.cpp:90] Creating Layer conv1/7x7_s2
I0908 05:39:25.035799 567 net.cpp:635] conv1/7x7_s2 <- data
I0908 05:39:25.035816 567 net.cpp:609] conv1/7x7_s2 -> conv1/7x7_s2
I0908 05:39:29.695616 567 net.cpp:128] Setting up conv1/7x7_s2
I0908 05:39:29.695672 567 net.cpp:135] Top shape: 1 64 600 600 (23040000)
I0908 05:39:29.695695 567 net.cpp:143] Memory required for data: 109440000
...
I0908 05:39:29.862272 567 net.cpp:128] Setting up pool5/drop_7x7_s1
I0908 05:39:29.862279 567 net.cpp:135] Top shape: 1 1024 31 31 (984064)
I0908 05:39:29.862287 567 net.cpp:143] Memory required for data: 1587930496
I0908 05:39:29.862294 567 layer_factory.hpp:77] Creating layer loss3/classifier
I0908 05:39:29.862305 567 net.cpp:90] Creating Layer loss3/classifier
I0908 05:39:29.862311 567 net.cpp:635] loss3/classifier <- pool5/7x7_s1
I0908 05:39:29.862320 567 net.cpp:609] loss3/classifier -> loss3/classifier
I0908 05:39:36.385628 567 net.cpp:128] Setting up loss3/classifier
I0908 05:39:36.385684 567 net.cpp:135] Top shape: 1 1000 (1000)
I0908 05:39:36.385696 567 net.cpp:143] Memory required for data: 1587934496
I0908 05:39:36.385712 567 layer_factory.hpp:77] Creating layer prob
I0908 05:39:36.385728 567 net.cpp:90] Creating Layer prob
I0908 05:39:36.385737 567 net.cpp:635] prob <- loss3/classifier
I0908 05:39:36.385749 567 net.cpp:609] prob -> prob
I0908 05:39:36.386745 567 net.cpp:128] Setting up prob
I0908 05:39:36.386756 567 net.cpp:135] Top shape: 1 1000 (1000)
I0908 05:39:36.386765 567 net.cpp:143] Memory required for data: 1587938496
I0908 05:39:36.386771 567 net.cpp:206] prob does not need backward computation.
...
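
The 'Memory required for data' figures in this log can be reproduced by hand: each layer adds its Top shape element count times 4 bytes to a running total. The following is a minimal sketch, assuming 4-byte (single-precision) elements; `cumulative_memory` is a hypothetical helper for illustration.

```python
from math import prod

BYTES_PER_ELEMENT = 4  # assumption: blobs hold 32-bit floats


def cumulative_memory(top_shapes):
    """Return the running total of blob bytes after each top shape."""
    totals = []
    total = 0
    for shape in top_shapes:
        total += prod(shape) * BYTES_PER_ELEMENT
        totals.append(total)
    return totals


if __name__ == "__main__":
    # First two top shapes from the batch-size-1 log.
    shapes = [(1, 3, 1200, 1200), (1, 64, 600, 600)]
    print(cumulative_memory(shapes))  # [17280000, 109440000], as in the log
```

The same rule explains the tail of the log: each `1 1000` blob adds 4000 bytes, taking the total from 1587930496 to 1587934496 and then to 1587938496.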

Wait, 1.6GB? A P100 has only 16GB of GPU memory, so what happens if we run inference on 10 such images at once? Surely it won't fail with an error? Let's try it. We use the same model as above and change only the first dim, the batch size, from 1 to 10.

nimbix@JARVICENAE-0A0A1844:/data$ vi bvlc_googlenet/deploy.prototxt
name: "GoogleNet"
layer {
name: "data"
type: "Input"
top: "data"
input_param { shape: { dim: 10 dim: 3 dim: 1200 dim: 1200 } }
# input_param { shape: { dim: 1 dim: 3 dim: 1200 dim: 1200 } }
# input_param { shape: { dim: 10 dim: 3 dim: 224 dim: 224 } }
}

Now run caffe time the same way.

nimbix@JARVICENAE-0A0A1844:/data$ caffe time -gpu 0 -model=/data/bvlc_googlenet/deploy.prototxt --iterations=1

I0908 05:43:44.249899 646 net.cpp:135] Top shape: 10 3 1200 1200 (43200000)
I0908 05:43:44.249914 646 net.cpp:143] Memory required for data: 172800000
I0908 05:43:44.249928 646 layer_factory.hpp:77] Creating layer conv1/7x7_s2
I0908 05:43:44.249949 646 net.cpp:90] Creating Layer conv1/7x7_s2
I0908 05:43:44.249956 646 net.cpp:635] conv1/7x7_s2 <- data
I0908 05:43:44.249967 646 net.cpp:609] conv1/7x7_s2 -> conv1/7x7_s2
I0908 05:43:44.614331 646 net.cpp:128] Setting up conv1/7x7_s2
I0908 05:43:44.614367 646 net.cpp:135] Top shape: 10 64 600 600 (230400000)
I0908 05:43:44.614382 646 net.cpp:143] Memory required for data: 1094400000
...
I0908 05:43:44.763245 646 net.cpp:135] Top shape: 10 1024 31 31 (9840640)
I0908 05:43:44.763254 646 net.cpp:143] Memory required for data: 15839942400
I0908 05:43:44.763260 646 layer_factory.hpp:77] Creating layer pool5/drop_7x7_s1
I0908 05:43:44.763272 646 net.cpp:90] Creating Layer pool5/drop_7x7_s1
I0908 05:43:44.763278 646 net.cpp:635] pool5/drop_7x7_s1 <- pool5/7x7_s1
I0908 05:43:44.763285 646 net.cpp:596] pool5/drop_7x7_s1 -> pool5/7x7_s1 (in-place)
I0908 05:43:44.763319 646 net.cpp:128] Setting up pool5/drop_7x7_s1
I0908 05:43:44.763325 646 net.cpp:135] Top shape: 10 1024 31 31 (9840640)
I0908 05:43:44.763334 646 net.cpp:143] Memory required for data: 15879304960
I0908 05:43:44.763340 646 layer_factory.hpp:77] Creating layer loss3/classifier
I0908 05:43:44.763352 646 net.cpp:90] Creating Layer loss3/classifier
I0908 05:43:44.763358 646 net.cpp:635] loss3/classifier <- pool5/7x7_s1
I0908 05:43:44.763367 646 net.cpp:609] loss3/classifier -> loss3/classifier
I0908 05:43:51.338423 646 net.cpp:128] Setting up loss3/classifier
I0908 05:43:51.345638 646 net.cpp:135] Top shape: 10 1000 (10000)
I0908 05:43:51.345651 646 net.cpp:143] Memory required for data: 15879344960
I0908 05:43:51.345667 646 layer_factory.hpp:77] Creating layer prob
I0908 05:43:51.345683 646 net.cpp:90] Creating Layer prob
I0908 05:43:51.345693 646 net.cpp:635] prob <- loss3/classifier
I0908 05:43:51.345705 646 net.cpp:609] prob -> prob
I0908 05:43:51.346666 646 net.cpp:128] Setting up prob
I0908 05:43:51.346678 646 net.cpp:135] Top shape: 10 1000 (10000)
I0908 05:43:51.346685 646 net.cpp:143] Memory required for data: 15879384960
...
I0908 05:43:51.724148 646 caffe.cpp:465] Initial loss: 0
I0908 05:43:51.724202 646 caffe.cpp:466] Performing Backward
I0908 05:43:51.724215 646 caffe.cpp:474] *** Benchmark begins ***
I0908 05:43:51.724222 646 caffe.cpp:475] Testing for 1 iterations.
F0908 05:43:51.915272 646 syncedmem.cpp:651] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x100000f5ce0c google::LogMessage::Fail()
@ 0x100000f5f284 google::LogMessage::SendToLog()
@ 0x100000f5c768 google::LogMessage::Flush()
@ 0x100000f611c4 google::LogMessageFatal::~LogMessageFatal()
@ 0x10000026e3a0 caffe::SyncedMemory::mutable_gpu_data()
@ 0x1000002736c4 caffe::Blob<>::mutable_gpu_diff()
@ 0x1000004e774c caffe::InnerProductLayer<>::Backward_gpu()
@ 0x10018ca8 (unknown)
@ 0x10012974 (unknown)
@ 0x100001c2309c (unknown)
@ 0x100001c23298 __libc_start_main
@ (nil) (unknown)

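Ah! It really does fail. The log says the data alone needs a whopping 15.8GB of memory, and as soon as the actual benchmark starts, the run aborts with an out of memory error. This is the moment you truly realize that what holds a GPU back is the limit on GPU memory size.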
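
The batch-10 total is exactly ten times the batch-1 total, which suggests a quick sizing rule. The following is a rough sketch, assuming data memory scales linearly with batch size and treating a 16GB card as 16e9 bytes; the function names are hypothetical.

```python
# Final "Memory required for data" value from the batch-size-1 log.
PER_IMAGE_DATA_BYTES = 1587938496


def data_bytes(batch_size, per_image_bytes=PER_IMAGE_DATA_BYTES):
    """Estimate blob memory for a batch, assuming linear scaling."""
    return batch_size * per_image_bytes


def largest_batch(gpu_bytes, per_image_bytes=PER_IMAGE_DATA_BYTES):
    """Optimistic upper bound on batch size: counts data blobs only."""
    return gpu_bytes // per_image_bytes


if __name__ == "__main__":
    gpu_bytes = 16 * 10**9  # assumption: 16GB taken as 16e9 bytes
    print(data_bytes(10))            # 15879384960, matching the batch-10 log
    print(largest_batch(gpu_bytes))  # 10
```

The bound is optimistic. The batch-10 run fits this estimate yet still runs out of memory once the benchmark allocates further buffers: the stack trace shows the failure in `mutable_gpu_diff` during the backward pass. Treat the result as a ceiling and leave headroom below it.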

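But IBM and NVIDIA do not give up here. NVIDIA's CUDA introduced a feature called Unified Memory, which lets the GPU use CPU memory as if it were GPU memory. In practice, though, the path the GPU takes to reach CPU memory is sluggish PCIe, so it was common knowledge that Unified Memory, however convenient, cut performance to roughly one tenth. The same was true of the DGX-1 server with its NVLink P100s: in the DGX-1 only the GPUs are connected to each other by NVLink, while the link between CPU and GPU is still PCIe. As a result, nobody thought of using unified memory in caffe.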

IBM Minsky is different. The POWER8 processor has NVLink ports built in, so the CPU and GPU are connected directly over NVLink, and two NVLink links are ganged together for a full 80GB/sec. That is 2.5 times the bandwidth of PCIe. With this, caffe can move data directly between CPU and GPU! IBM has in fact applied this to IBM caffe (caffe-ibm), included in the recently announced PowerAI 4.0. As a result, IBM caffe offers a new option that neither regular bvlc caffe nor NV caffe has: -lms (LMS, Large Model Support).
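
To get a feel for what the bandwidth gap means, here is a back-of-the-envelope sketch of how long it takes to move one large blob between CPU and GPU. It assumes peak link bandwidth with no overhead, and it takes the PCIe figure as 32GB/sec, which is simply the 80GB/sec above divided by the quoted factor of 2.5. `transfer_ms` is a hypothetical helper.

```python
NVLINK_GB_PER_S = 80.0  # figure quoted in the text for two ganged NVLink links
PCIE_GB_PER_S = 32.0    # assumption: 80 / 2.5, derived from the text


def transfer_ms(num_bytes, gb_per_s):
    """Idealized transfer time in ms at the given bandwidth (1 GB = 1e9 bytes)."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    # Output blob of conv1/7x7_s2 at batch size 10: 10 x 64 x 600 x 600 floats.
    blob_bytes = 921600000
    print(f"NVLink: {transfer_ms(blob_bytes, NVLINK_GB_PER_S):.2f} ms")
    print(f"PCIe:   {transfer_ms(blob_bytes, PCIE_GB_PER_S):.2f} ms")
```

Real transfers will be slower than these idealized numbers, but the ratio between the two links is the point.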

The -lms option is described in the document below.

https://public.dhe.ibm.com/software/server/POWER/Linux/mldl/ubuntu/README.html

For those who would rather not read it, here is a brief summary.

-lms 8000000 : memory chunks of 8000000 (in KB, i.e. 8GB) or more are simply kept in CPU memory.

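In other words, the larger the number after -lms, the more GPU memory is used, with CPU memory used only when really necessary. Naturally the practical maximum is around 16000000, and any larger value effectively disables the LMS option. Conversely, a very small -lms value, say 100, effectively means: do not use GPU memory, keep everything in CPU memory.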

There is also an option -lms_frac <0~1.0>. With -lms_frac 0.4, for example, LMS is not used until GPU memory utilization reaches 40%. When running a small model there is no need to fall back on slower CPU memory, so a value around -lms_frac 0.9 is a good choice.
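
The threshold rule for -lms can be pictured as a simple size comparison. The following sketch models only the description given in this post, not the actual caffe-ibm implementation. It assumes 1 KB = 1024 bytes, and the `placement` function is hypothetical.

```python
def placement(chunk_bytes, lms_kb):
    """Return 'cpu' if the chunk is at least the -lms threshold, else 'gpu'.

    Assumptions: the -lms value is in KB with 1 KB = 1024 bytes, and the
    decision depends only on the size of the individual chunk.
    """
    threshold_bytes = lms_kb * 1024
    return "cpu" if chunk_bytes >= threshold_bytes else "gpu"


if __name__ == "__main__":
    # Blob sizes for the first layers at batch size 10: the input blob
    # and the output of conv1/7x7_s2.
    chunks = [172800000, 921600000]
    for lms_kb in (8192, 8000000):
        print(lms_kb, [placement(c, lms_kb) for c in chunks])
```

Under this model a small threshold such as 8192 pushes both example chunks to CPU memory, while 8000000 keeps both on the GPU, which matches the intuition that a larger -lms value favors GPU memory.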

Now let's apply the -lms option to the model that ran out of memory above. First I used -lms 8192, which tells caffe to keep every memory chunk of 8MB or more in CPU memory.

nimbix@JARVICENAE-0A0A1844:/data$ caffe time -gpu 0 -lms 8192 -model=/data/bvlc_googlenet/deploy.prototxt --iterations=1

I0908 05:47:44.949090 676 net.cpp:135] Top shape: 10 3 1200 1200 (43200000)
I0908 05:47:44.949105 676 net.cpp:143] Memory required for data: 172800000
I0908 05:47:44.949124 676 layer_factory.hpp:77] Creating layer conv1/7x7_s2
I0908 05:47:44.949146 676 net.cpp:90] Creating Layer conv1/7x7_s2
I0908 05:47:44.949153 676 net.cpp:635] conv1/7x7_s2 <- data
I0908 05:47:44.949167 676 net.cpp:609] conv1/7x7_s2 -> conv1/7x7_s2
I0908 05:47:45.580006 676 net.cpp:128] Setting up conv1/7x7_s2
I0908 05:47:45.580046 676 net.cpp:135] Top shape: 10 64 600 600 (230400000)
I0908 05:47:45.580060 676 net.cpp:143] Memory required for data: 1094400000
...
I0908 05:47:57.704324 676 caffe.cpp:465] Initial loss: 0
I0908 05:47:57.704356 676 caffe.cpp:466] Performing Backward
I0908 05:47:57.704371 676 caffe.cpp:474] *** Benchmark begins ***
I0908 05:47:57.704377 676 caffe.cpp:475] Testing for 1 iterations.
I0908 05:47:57.711424 676 syncedmem.cpp:355] [LMS] memory[0x110024232400] device_=0 size_ = 921600000 allocation=7349057792 fragmented size = 655558000 gpu_ptr_=1155371368464
I0908 05:47:57.769644 676 syncedmem.cpp:355] [LMS] memory[0x110024258aa0] device_=0 size_ = 230400000 allocation=7579458048 fragmented size = 425158224 gpu_ptr_=1122381070352
I0908 05:47:57.778683 676 syncedmem.cpp:355] [LMS] memory[0x110024286d30] device_=0 size_ = 230400000 allocation=7809858304 fragmented size = 425158464 gpu_ptr_=1122842444032
I0908 05:47:57.790587 676 syncedmem.cpp:355] [LMS] memory[0x1100242c0be0] device_=0 size_ = 691200000 allocation=8731458560 fragmented size = 655558704 gpu_ptr_=1156294115344
I0908 05:47:57.838747 676 syncedmem.cpp:355] [LMS] memory[0x1100242df300] device_=0 size_ = 691200000 allocation=9653058816 fragmented size = 885958944 gpu_ptr_=1157447262464
...
I0908 05:47:58.203995 676 caffe.cpp:513] pool5/7x7_s1 forward: 4.48429 ms.
I0908 05:47:58.204002 676 caffe.cpp:516] pool5/7x7_s1 backward: 0.002144 ms.
I0908 05:47:58.204010 676 caffe.cpp:513] pool5/drop_7x7_s1 forward: 0.367552 ms.
I0908 05:47:58.204015 676 caffe.cpp:516] pool5/drop_7x7_s1 backward: 0.002112 ms.
I0908 05:47:58.204022 676 caffe.cpp:513] loss3/classifier forward: 18.1078 ms.
I0908 05:47:58.204033 676 caffe.cpp:516] loss3/classifier backward: 0.002112 ms.
I0908 05:47:58.204041 676 caffe.cpp:513] prob forward: 0.022848 ms.
I0908 05:47:58.204047 676 caffe.cpp:516] prob backward: 0.011328 ms.
I0908 05:47:58.204061 676 caffe.cpp:521] Average Forward pass: 495.206 ms.
I0908 05:47:58.204067 676 caffe.cpp:523] Average Backward pass: 2.21437 ms.
I0908 05:47:58.204074 676 caffe.cpp:525] Average Forward-Backward: 499.65 ms.
I0908 05:47:58.204092 676 caffe.cpp:527] Total Time: 499.65 ms.
I0908 05:47:58.204107 676 caffe.cpp:528] *** Benchmark ends ***

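Yes! The run completed successfully, with messages along the way showing that LMS was in use! Since it relies on slower CPU memory, performance must have dropped. By how much? The result here is Average Forward pass: 495.206 ms, and with a batch size of 10 that is 0.0495 seconds per image. That is about 10% slower than the 0.045 seconds from the single-image test. Running in batches of 10 should normally be faster per image than running one at a time, so being 10% slower is a real slowdown.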

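So is a serious performance hit unavoidable with LMS? Not necessarily. What I just ran was the extreme case of telling caffe to keep nearly every memory chunk in CPU memory. If we instead tell it to use GPU memory aggressively and fall back to CPU memory only for chunks too large for the GPU, performance should be much better.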

This time, let's run it that way, with the -lms 160000000 option.

nimbix@JARVICENAE-0A0A1844:/data$ caffe time -gpu 0 -lms 160000000 -model=/data/bvlc_googlenet/deploy.prototxt --iterations=1

I0908 06:32:20.006875 1126 net.cpp:135] Top shape: 10 3 1200 1200 (43200000)
I0908 06:32:20.006891 1126 net.cpp:143] Memory required for data: 172800000
I0908 06:32:20.006904 1126 layer_factory.hpp:77] Creating layer conv1/7x7_s2
I0908 06:32:20.006927 1126 net.cpp:90] Creating Layer conv1/7x7_s2
I0908 06:32:20.006933 1126 net.cpp:635] conv1/7x7_s2 <- data
I0908 06:32:20.006944 1126 net.cpp:609] conv1/7x7_s2 -> conv1/7x7_s2
I0908 06:32:20.591289 1126 net.cpp:128] Setting up conv1/7x7_s2
I0908 06:32:20.591329 1126 net.cpp:135] Top shape: 10 64 600 600 (230400000)
I0908 06:32:20.591343 1126 net.cpp:143] Memory required for data: 1094400000
...
I0908 06:32:28.272960 1126 net.cpp:296] [LMS] BuildLargeModelSupport
W0908 06:32:28.273018 1126 net.cpp:348] [LMS] ######################################################
W0908 06:32:28.273172 1126 net.cpp:349] [LMS] uncovered layer type: Softmax
W0908 06:32:28.273182 1126 net.cpp:350] [LMS] ######################################################
W0908 06:32:28.273310 1126 net.cpp:348] [LMS] ######################################################
W0908 06:32:28.273320 1126 net.cpp:349] [LMS] uncovered layer type: Input
W0908 06:32:28.273329 1126 net.cpp:350] [LMS] ######################################################
I0908 06:32:28.273347 1126 net.cpp:425] [LMS] data_forward [0] data: -> data: 0x110009bfa4f0(172800000) ### flag=0 data:
I0908 06:32:28.273361 1126 net.cpp:425] [LMS] conv1/7x7_s2_forward [1] data: 0x110009bfa4f0(172800000) -> data: 0x1100233f7520(921600000) ### flag=0 data: 0x110009bfa4f0(1,1)
...
I0908 06:32:29.055697 1126 caffe.cpp:513] prob forward: 0.022016 ms.
I0908 06:32:29.055704 1126 caffe.cpp:516] prob backward: 0.006848 ms.
I0908 06:32:29.055716 1126 caffe.cpp:521] Average Forward pass: 263.516 ms.
I0908 06:32:29.055724 1126 caffe.cpp:523] Average Backward pass: 2.21066 ms.
I0908 06:32:29.055730 1126 caffe.cpp:525] Average Forward-Backward: 267.967 ms.
I0908 06:32:29.055748 1126 caffe.cpp:527] Total Time: 267.967 ms.
I0908 06:32:29.055764 1126 caffe.cpp:528] *** Benchmark ends ***

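This time the 10 images took 263.516 ms, i.e. 0.0263 seconds per image. In throughput terms that is a full 71% faster than the 0.045 seconds from the single-image test (0.045 / 0.0263 ≈ 1.71)! Thanks to LMS, running in batches of 10 became faster. In the end, using LMS can actually deliver better performance.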
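
The three measurements in this post can be put side by side with a few lines of arithmetic. The following is a small sketch using only the Average Forward pass values from the logs above; the helper names are hypothetical.

```python
# (batch size, Average Forward pass in ms), taken from the three runs above.
RUNS = {
    "batch 1, no LMS": (1, 45.3671),
    "batch 10, -lms 8192": (10, 495.206),
    "batch 10, -lms 160000000": (10, 263.516),
}


def ms_per_image(batch_size, forward_ms):
    """Forward-pass time per image in milliseconds."""
    return forward_ms / batch_size


def throughput_ratio(baseline_ms, other_ms):
    """How many times more images per second than the baseline."""
    return baseline_ms / other_ms


if __name__ == "__main__":
    baseline = ms_per_image(*RUNS["batch 1, no LMS"])
    for name, (batch, fwd) in RUNS.items():
        per_image = ms_per_image(batch, fwd)
        ratio = throughput_ratio(baseline, per_image)
        print(f"{name}: {per_image:.2f} ms/image, {ratio:.2f}x baseline throughput")
```

Each figure comes from a single benchmark iteration (`--iterations=1`), so treat the ratios as indicative rather than precise.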