Deep Learning for HW Engineers: large model support

Friday, June 12, 2020

Demo of TensorFlow Large Model Support (LMS) using tf_cnn_benchmarks.py

This post shows how to demo large model support (LMS) with TensorFlow 1.15, which ships in IBM Watson Machine Learning Community Edition (WML-CE) 1.6.2.

** Note: the environment below uses a P100 GPU in a virtualized environment provided by the IBM CECC cloud, so please keep in mind that the absolute training performance is on the low side.

The simplest approach is to use the tf_cnn_benchmarks suite included in WML-CE.

First, copy the tf_cnn_benchmarks suite to a directory of your choice with the command below. (This step is optional; you can also just go directly into the directory where it is installed.)

(wmlce_162) [cecuser@p1290-kvm1 ~]$ ./anaconda3/envs/wmlce_162/bin/install_tf_cnn_benchmarks .

(wmlce_162) [cecuser@p1290-kvm1 ~]$ cd tf_cnn_benchmarks

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ ls
all_reduce_benchmark.py cnn_util.py mlperf_test.py test_data
all_reduce_benchmark_test.py cnn_util_test.py models test_util.py
allreduce.py coco_metric.py platforms tf_cnn_benchmarks.py
allreduce_test.py constants.py preprocessing.py variable_mgr.py
batch_allreduce.py convnet_builder.py __pycache__ variable_mgr_util.py
benchmark_cnn_distributed_test.py datasets.py README.md variable_mgr_util_test.py
benchmark_cnn_distributed_test_runner.py flags.py run_tests.py
benchmark_cnn.py leading_indicators_test.py ssd_constants.py
benchmark_cnn_test.py mlperf.py ssd_dataloader.py

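Before running it, however, part of the source code in benchmark_cnn.py has to be modified. This is because of a bug related to lms_swapout_threshold, one of the LMS parameters defined in that file. Leaving the original value of -1 is supposed to trigger auto-tuning, but that step fails with an error because of compatibility problems with the TF version and so on, so for now simply change the default to 1.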

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ vi benchmark_cnn.py

...

#flags.DEFINE_integer('lms_swapout_threshold', -1,

flags.DEFINE_integer('lms_swapout_threshold', 1,

...
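If you would rather not edit the file by hand, the same one-line change can be scripted. This is only a minimal sketch: it assumes the flag definition appears in benchmark_cnn.py exactly as in the excerpt above, and the helper name set_swapout_default is made up for this illustration.

```python
import re

def set_swapout_default(source: str, new_default: int = 1) -> str:
    """Return `source` with the default of the lms_swapout_threshold flag replaced.

    Assumes the definition looks like the line shown in the post:
        flags.DEFINE_integer('lms_swapout_threshold', -1,
    """
    pattern = r"(flags\.DEFINE_integer\('lms_swapout_threshold',\s*)-?\d+"
    patched, count = re.subn(
        pattern, lambda m: f"{m.group(1)}{new_default}", source)
    if count != 1:
        raise ValueError(f"expected exactly one flag definition, found {count}")
    return patched

# Demonstration on the single line quoted in the post.
sample = "flags.DEFINE_integer('lms_swapout_threshold', -1,\n"
print(set_swapout_default(sample), end="")
```

To apply it to the real file, read benchmark_cnn.py into a string, pass it through the function and write the result back. Keeping a backup copy first is a good idea, since the exact layout of the file in other WML-CE versions is not something this sketch checks beyond the one pattern.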

After that, run tf_cnn_benchmarks.py as shown below. Here batch_size is set to 150. As you can see, it runs fine.

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=150 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10
...
I0612 04:14:43.269922 140736229690240 session_manager.py:502] Done running local_init_op.
Running warm up
2020-06-12 04:14:45.534222: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-12 04:14:45.728889: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step Img/sec total_loss
1 images/sec: 255.8 +/- 0.0 (jitter = 0.0) 7.820
10 images/sec: 255.8 +/- 0.1 (jitter = 0.4) 8.082
20 images/sec: 255.7 +/- 0.1 (jitter = 0.4) 7.856
30 images/sec: 255.6 +/- 0.1 (jitter = 0.3) 7.832
40 images/sec: 255.5 +/- 0.0 (jitter = 0.3) 7.879
50 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.701
60 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.918
70 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.845
80 images/sec: 255.4 +/- 0.0 (jitter = 0.2) 7.750
90 images/sec: 255.3 +/- 0.0 (jitter = 0.2) 7.806
100 images/sec: 255.3 +/- 0.0 (jitter = 0.3) 7.856
----------------------------------------------------------------
total images/sec: 255.22
----------------------------------------------------------------

Watching the OS with the nmon tool at this point shows that host RAM usage is only about 5 GB, and that the tf_cnn_benchmarks process has a size (the Size column) of about 32 GB and a resident set size of only about 2.1 GB.

x Memory and Swap qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqx
x PageSize:64KB RAM-Memory Swap-Space High-Memory Low-Memory x
x Total (MB) 63392.2 4094.5 - not in use - not in use x
x Free (MB) 58734.8 3518.5 x
x Free Percent 92.7% 85.9% x

x Top Processes Procs=365-mode=3-1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Argsqqqqqx
x PID %CPU Size Res Res Res Res Shared Faults Command x
x Used KB Set Text Data Lib KB Min Maj x
x 13671 42.1 32142m 2153m 3200 2866m 0 633216 12 0 tf_cnn_benchmar x

Now let's increase batch_size to 200. This fills up the memory of the P100 GPU, which is only 16 GB, and the run ends in an Out-Of-Memory (OOM) error.

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=200 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10
...
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[200,2048,7,7] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node tower_0/v/cg/resnet_v113/conv46/conv2d/Conv2D (defined at /home/cecuser/anaconda3/envs/wmlce_162/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[main_fetch_group/_566]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[200,2048,7,7] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node tower_0/v/cg/resnet_v113/conv46/conv2d/Conv2D (defined at /home/cecuser/anaconda3/envs/wmlce_162/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
...
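It helps to check how large the tensor named in the OOM message actually is. The sketch below assumes 4 bytes per element, since the log reports the type as float; the helper name tensor_bytes is only for illustration.

```python
from math import prod

def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (4 bytes per element for float32)."""
    return prod(shape) * bytes_per_element

oom_shape = (200, 2048, 7, 7)  # shape reported in the OOM message
size = tensor_bytes(oom_shape)
print(f"{size} bytes = {size / 2**20:.1f} MiB")
```

The failing allocation is under 100 MiB, so this one tensor is not what is too big. The allocations made before it had already used up nearly all of the 16 GB.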

However, with the same batch_size, if you enable LMS by adding the --lms=True option, you can see that it runs without any error (although it is much slower).

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=200 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10 --lms=True
....
I0612 04:27:06.439511 140735558208384 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: -0.09 GiB (memory ratio: 0.9)
I0612 04:27:06.439677 140735558208384 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
I0612 04:27:06.440271 140735558208384 lms.py:1275] [LMS][0] LMS will use the latest parameter set found by Simulator for the best performance. However, if you encounter an out-of-memory error, please manually use the previous parameter set found by Simulator.
I0612 04:27:06.440359 140735558208384 lms.py:1275] [LMS][0] sync_mode: 3 (Synchronous memory copy between host and device)
I0612 04:27:06.440439 140735558208384 lms.py:1275] [LMS][0] swapout_threshold: 1
I0612 04:27:06.440520 140735558208384 lms.py:1275] [LMS][0] swapin_ahead: -1 (ignored since sync_mode is 3)
I0612 04:27:06.440600 140735558208384 lms.py:1275] [LMS][0] swapin_groupby: -1 (ignored since sync_mode is 3)
I0612 04:27:06.869183 140735558208384 lms.py:1275] [LMS][0] Added 425 operations to the model (180 swap-out operations (20.33 GiB) and 245 swap-in operations (31.36 GiB))
I0612 04:27:06.869335 140735558208384 lms.py:1275] [LMS][0] Editing model for LMS, took: 799.3814945220947 ms
...
Running warm up
2020-06-12 04:27:15.098435: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-12 04:27:15.371592: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step Img/sec total_loss
1 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.988
10 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.900
20 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.914
30 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 8.043
40 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.880
50 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.903
60 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.889
70 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.770
80 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.906
90 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.813
100 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.824
----------------------------------------------------------------
total images/sec: 21.13
----------------------------------------------------------------
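To put a number on "much slower", the two "total images/sec" summary lines can be compared directly. This is a small sketch; the function name total_images_per_sec is made up here, and the two strings are copied from the outputs above.

```python
import re

def total_images_per_sec(log_text):
    """Extract the value from a 'total images/sec: N' summary line."""
    match = re.search(r"total images/sec:\s*([\d.]+)", log_text)
    if match is None:
        raise ValueError("no 'total images/sec' line found")
    return float(match.group(1))

baseline = total_images_per_sec("total images/sec: 255.22")  # batch_size=150, no LMS
with_lms = total_images_per_sec("total images/sec: 21.13")   # batch_size=200, --lms=True
slowdown = baseline / with_lms
print(f"LMS run is about {slowdown:.1f}x slower")
```

Keep in mind that this compares two different batch sizes, and that the LMS run used swapout_threshold 1 with synchronous copies (sync_mode 3) as shown in its log, so it is probably close to the slowest LMS configuration rather than a tuned one.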

Watching the host OS at this point shows that host RAM usage has jumped to about 22 GB, and that the tf_cnn_benchmarks process size (about 50 GB) and resident set size (about 19 GB) have also grown substantially.

x Memory and Swap qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqx
x PageSize:64KB RAM-Memory Swap-Space High-Memory Low-Memory x
x Total (MB) 63392.2 4094.5 - not in use - not in use x
x Free (MB) 41505.6 3490.5 x
x Free Percent 65.5% 85.2% x

x Top Processes Procs=365-mode=3-1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Argsqqqqqx
x PID %CPU Size Res Res Res Res Shared Faults Command x
x Used KB Set Text Data Lib KB Min Maj x
x 13427 10.4 49577m19322m 3200 3399m 0 176710 0 0 tf_cnn_benchmar x
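The host RAM figures quoted above come from subtracting the Free value from the Total value in the two nmon screens. As a quick check, using only the numbers shown in those screens:

```python
def used_mb(total_mb, free_mb):
    """Host RAM in use, in the MB units that nmon reports."""
    return total_mb - free_mb

without_lms = used_mb(63392.2, 58734.8)  # batch_size=150, no LMS
with_lms = used_mb(63392.2, 41505.6)     # batch_size=200, --lms=True

print(f"without LMS: {without_lms:.1f} MB")
print(f"with LMS:    {with_lms:.1f} MB")
print(f"increase:    {with_lms - without_lms:.1f} MB")
```

The increase of roughly 17 GB is host memory that the LMS run is using to hold tensors swapped out of the 16 GB GPU.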

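Wednesday, August 22, 2018

LMS test with caffe-ibm in PowerAI 5.2 (cifar10)

First, prepare the cifar10 dataset. Running get_cifar10.sh, which normally ships with caffe, downloads the dataset in one go. (As the output below shows, what gets downloaded is the binary archive; in upstream Caffe the LMDB referenced by the prototxt is then built from it with examples/cifar10/create_cifar10.sh.)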

[bsyu@p57a22 caffe-ibm]$ pwd
/opt/DL/caffe-ibm

[bsyu@p57a22 caffe-ibm]$ ./data/cifar10/get_cifar10.sh
Downloading...
--2018-08-21 04:14:32-- http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
Resolving www.cs.toronto.edu (www.cs.toronto.edu)... 128.100.3.30
Connecting to www.cs.toronto.edu (www.cs.toronto.edu)|128.100.3.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170052171 (162M) [application/x-gzip]
Saving to: ‘cifar-10-binary.tar.gz’

100%[============================================================================>] 170,052,171 37.4MB/s in 4.6s

2018-08-21 04:14:37 (35.4 MB/s) - ‘cifar-10-binary.tar.gz’ saved [170052171/170052171]

Unzipping...
Done.

To see the benefit of LMS the images really need to be large, but cifar10 consists of 60,000 32x32 color images, which are very small. Instead, we will fill up GPU memory by making batch_size, the number of images processed on the GPU at once, very large.

[bsyu@p57a22 caffe-ibm]$ vi examples/cifar10/cifar10_full_train_test.prototxt
...
data_param {
  source: "examples/cifar10/cifar10_train_lmdb"
  batch_size: 22000 # originally 100
  backend: LMDB
}
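A little arithmetic shows why the batch size has to be this extreme. The sketch below assumes one byte per color value in the raw images and 4-byte float32 values once a batch sits in a Caffe blob; those storage details are assumptions, not something stated in this post.

```python
IMAGE_VALUES = 32 * 32 * 3  # height x width x color channels of one cifar10 image

def raw_dataset_bytes(num_images=60000):
    """Raw pixel bytes of the whole dataset at one byte per value."""
    return num_images * IMAGE_VALUES

def input_blob_bytes(batch_size, bytes_per_value=4):
    """Bytes of one input batch stored as float32 (assumed)."""
    return batch_size * IMAGE_VALUES * bytes_per_value

print(f"whole dataset, raw pixels: {raw_dataset_bytes() / 2**20:.1f} MiB")
for batch_size in (100, 22000):
    print(f"input blob, batch_size {batch_size}: "
          f"{input_blob_bytes(batch_size) / 2**20:.1f} MiB")
```

Even the full dataset is far smaller than a 16 GB GPU, so with images this small the only way to fill GPU memory is through the per-batch activations, which grow with batch_size.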

[bsyu@p57a22 caffe-ibm]$ which caffe
/opt/DL/caffe-ibm/bin/caffe

Now run it. With batch_size: 22000, training still runs fine without LMS.

[bsyu@p57a22 caffe-ibm]$ caffe train -gpu 2 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
I0821 03:51:25.516746 52459 solver.cpp:497] Snapshotting to HDF5 file examples/cifar10/cifar10_full_iter_200.caffemodel.h5
I0821 03:51:28.164131 52459 sgd_solver.cpp:373] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_full_iter_200.solverstate.h5
I0821 03:51:28.753823 52459 solver.cpp:336] Iteration 200, loss = 1.71708
I0821 03:51:28.753847 52459 solver.cpp:341] Optimization Done.
I0821 03:51:28.753859 52459 caffe.cpp:421] Optimization Done.

Checking GPU memory usage with nvidia-smi at this point shows that GPU memory is very nearly full.

Next, increase batch_size to 24000 and run the same training again. This time it fails with an error.

[bsyu@p57a22 caffe-ibm]$ caffe train -gpu 2 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
F0821 04:26:41.953693 60726 syncedmem.cpp:569] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x3fffa30acf8c (unknown)
@ 0x3fffa30afa0c (unknown)
@ 0x3fffa30ac9b4 (unknown)
@ 0x3fffa30b0634 (unknown)
@ 0x3fffaac7c154 caffe::SyncedMemory::get_gpu_ptr()
@ 0x3fffaac77650 caffe::SyncedMemory::mutable_gpu_data()
@ 0x3fffaaa8aa28 caffe::Blob<>::mutable_gpu_diff()
@ 0x3fffaaced94c caffe::PoolingLayer<>::Backward_gpu()
@ 0x3fffaac207a4 caffe::Net<>::BackwardFromTo()
@ 0x3fffaac20ad8 caffe::Net<>::Backward()
@ 0x3fffaac5aabc caffe::Solver<>::Step()
@ 0x3fffaac5b328 caffe::Solver<>::Solve()
@ 0x1000e9e4 train()
@ 0x1000a848 main
@ 0x3fff88155100 generic_start_main.isra.0
@ 0x3fff881552f4 __libc_start_main
@ (nil) (unknown)

Now keep batch_size: 24000, but add the -lms option when starting the training. This time LMS-related messages appear and training completes without any problem.

[bsyu@p57a22 caffe-ibm]$ caffe train -gpu 2 -lms --solver=examples/cifar10/cifar10_full_solver.prototxt
…
I0821 04:30:40.643229 75342 syncedmem.cpp:349] [LMS:2] allocate: size=786432008 [count=80 demanded=159% allocated=35% available=19%]
I0821 04:30:42.045287 75342 syncedmem.cpp:349] [LMS:2] allocate: size=786432008 [count=80 demanded=159% allocated=39% available=36%]
I0821 04:30:44.427603 75342 syncedmem.cpp:349] [LMS:2] allocate: size=3145728008 [count=80 demanded=159% allocated=53% available=61%]
I0821 04:30:44.605427 75342 solver.cpp:244] Iteration 0 (0 iter/s, 0s/200 iters), loss = 2.30264
I0821 04:30:44.605461 75342 solver.cpp:263] Train net output #0: loss = 2.30264 (* 1 = 2.30264 loss)
I0821 04:30:44.605484 75342 sgd_solver.cpp:128] Iteration 0, lr = 0.001
I0821 04:30:44.636401 75346 data_layer.cpp:86] Restarting data prefetching from start.
…
I0821 04:42:50.161826 75342 solver.cpp:497] Snapshotting to HDF5 file examples/cifar10/cifar10_full_iter_200.caffemodel.h5
I0821 04:42:50.164942 75342 sgd_solver.cpp:373] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_full_iter_200.solverstate.h5
I0821 04:42:50.908159 75342 solver.cpp:336] Iteration 200, loss = 1.72997
I0821 04:42:50.908186 75342 solver.cpp:341] Optimization Done.
I0821 04:42:50.908195 75342 caffe.cpp:421] Optimization Done.
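The sizes in the "[LMS:2] allocate" lines can be roughly explained from the batch size. The sketch below assumes float32 blobs and assumes activation shapes of 32 channels at 32x32 and 32 channels at 16x16, which is what the stock cifar10_full network should produce after its first convolution and first pooling layer. Neither assumption is confirmed by this post, and the extra 8 bytes per allocation are not explained here.

```python
BATCH_SIZE = 24000   # batch_size used in the LMS run
FLOAT_BYTES = 4      # assumption: blobs hold float32 values

def blob_bytes(channels, height, width, batch=BATCH_SIZE):
    """Bytes for one batch x channels x height x width float32 blob."""
    return batch * channels * height * width * FLOAT_BYTES

# Sizes reported by the "[LMS:2] allocate" lines above.
logged_sizes = [786432008, 3145728008]

# Assumed activation shapes; not stated in the post.
assumed_blobs = {
    "32 channels at 16x16": blob_bytes(32, 16, 16),
    "32 channels at 32x32": blob_bytes(32, 32, 32),
}

for logged in logged_sizes:
    for name, expected in assumed_blobs.items():
        if 0 <= logged - expected < 1024:
            print(f"{logged} bytes ~ {name} "
                  f"({expected} bytes, +{logged - expected})")
```

If the assumption holds, a single activation blob at this batch size is about 3 GB, which makes it easy to see how a handful of them plus their gradients exceed the GPU's memory.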

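And thanks to LMS, you can see that GPU memory usage drops sharply.

However, using lms definitely makes performance somewhat slower. Even with NVLink instead of PCIe, the host server's DRAM is naturally slower to reach than GPU memory. Still, lms can be tuned a little. The main way is to use the following two additional options.

-lms_size_threshold <size in KB> : Default is 1000.
Memory chunks smaller than the size given here are not swapped out/in by LMS and stay resident in GPU memory.

-lms_exclude <size in MB> : Default is 0.
The GPU memory size minus this value is the soft limit on the GPU memory allocated for use as the LMS cache. The smaller this value, the more GPU memory is used and the better the performance.

For example, the command-line options below gave the best LMS performance when using the GoogleNet model on a high-resolution image dataset with crop size 2240x2240 and batch size 5. However, the ideal tuning values differ depending on the model's network structure, the data size, batch_size, and the GPU memory size.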

$ caffe train -solver=solver.prototxt -gpu all -lms -lms_size_threshold 1000 -lms_exclude 1400
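The effect of the two options can be written down as simple arithmetic. This sketch only restates the descriptions above and is not taken from the caffe-ibm source. The function names are made up, and the 16 GB GPU size is a hypothetical value chosen for illustration.

```python
def lms_cache_soft_limit_mb(gpu_memory_mb, lms_exclude_mb=0):
    """Soft limit on GPU memory used as LMS cache: GPU memory minus -lms_exclude."""
    return gpu_memory_mb - lms_exclude_mb

def stays_on_gpu(chunk_kb, lms_size_threshold_kb=1000):
    """Chunks smaller than -lms_size_threshold are never swapped out."""
    return chunk_kb < lms_size_threshold_kb

GPU_MEMORY_MB = 16 * 1024  # hypothetical 16 GB GPU, for illustration only

print(lms_cache_soft_limit_mb(GPU_MEMORY_MB))        # default -lms_exclude 0
print(lms_cache_soft_limit_mb(GPU_MEMORY_MB, 1400))  # -lms_exclude 1400
print(stays_on_gpu(512), stays_on_gpu(4096))         # 512 KB stays, 4096 KB may be swapped
```

Seen this way, -lms_exclude moves the limit on how much GPU memory the cache may use, while -lms_size_threshold decides which chunks are candidates for swapping at all.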

So how does it do on actual cifar10? First, let's run non-LMS caffe training with the largest batch_size that works without lms. To process about the same number of images as the LMS run with batch_size 24000 * max_iter 200, the non-LMS run should use batch_size 22000 * max_iter 219.

[bsyu@p57a22 caffe-ibm]$ time caffe train -gpu 2 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
real 11m46.926s

Now let's process almost the same number of images, batch_size 24000 * max_iter 200, with LMS and no extra tuning options. It is indeed a little slower.

[bsyu@p57a22 caffe-ibm]$ time caffe train -gpu 2 -lms --solver=examples/cifar10/cifar10_full_solver.prototxt
…
real 12m12.219s

This time, run the same training with -lms_size_threshold and -lms_exclude added. Performance is now almost as fast as non-LMS.

[bsyu@p57a22 caffe-ibm]$ time caffe train -gpu 2 -lms -lms_size_threshold 1000 -lms_exclude 1400 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
real 11m47.405s

And even though a larger batch_size is being used, GPU memory usage, which LMS had reduced, has risen back to nearly the non-LMS level.
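The three timings above can be put on a per-image basis, since the non-LMS run processes slightly more images than the LMS runs. A minimal sketch using only the batch sizes, iteration counts and "real" times quoted above (the helper name real_seconds is made up here):

```python
import re

def real_seconds(real):
    """Convert a `time` value such as '11m46.926s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if match is None:
        raise ValueError(f"unexpected time format: {real!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

# Smallest max_iter at batch_size 22000 that covers 24000 * 200 images.
lms_images = 24000 * 200
non_lms_iters = -(-lms_images // 22000)  # ceiling division
print("max_iter for batch_size 22000:", non_lms_iters)

runs = {
    "non-LMS, 22000 x 219": (22000 * 219, "11m46.926s"),
    "LMS, 24000 x 200": (lms_images, "12m12.219s"),
    "LMS tuned, 24000 x 200": (lms_images, "11m47.405s"),
}
for name, (images, real) in runs.items():
    seconds = real_seconds(real)
    print(f"{name}: {images} images in {seconds:.1f} s "
          f"= {images / seconds:.0f} images/s")
```

By this measure the untuned LMS run comes out roughly 4% lower in throughput than the non-LMS run, and the tuned run well under 1% lower. The `time` figure also includes startup and snapshot writing, so this is only a rough comparison.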

Tuesday, August 21, 2018

MNIST Python code with LMS applied to Session-based TensorFlow training

This TensorFlow Python code for MNIST training is the original /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/examples/tutorials/mnist/mnist_deep.py with LMS applied. As you can see, the word graph appears in it, so this is Session-based TensorFlow training, and LMS is applied accordingly.

The part that actually implements LMS was highlighted in bold blue in the original post; it is the block under the comment '# Enable Large Model Support'. As you can see, it is surprisingly simple. Removing those lines leaves ordinary MNIST training code without LMS.

This example code is also taken as-is from /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/mnist_deep_lms.py, which is installed with PowerAI 5.2.

When actually run, it behaves as follows, and you can see that 27 tensors are swapped out to and back in from the host server's RAM.

[bsyu@p57a22 examples]$ CUDA_VISIBLE_DEVICES=3 python mnist_deep_lms.py

...

INFO:tensorflow:[LMS][0] Editing model for LMS
INFO:tensorflow:[LMS][0] n_tensors: all tensors
INFO:tensorflow:[LMS][0] lb: 3
INFO:tensorflow:[LMS][0] Edited model is valid and logically equivalent to the original one
INFO:tensorflow:[LMS][0] Added 53 ops into the model
INFO:tensorflow:[LMS][0] Editing model for LMS, took: 179.62193489074707 ms
INFO:tensorflow:[LMS][0] 27 tensors will be swapped out(in) to(from) the host
Saving graph to: /tmp/tmpz5ozc9jr
2018-08-20 21:37:47.954552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
....

step 19900, training accuracy 1
test accuracy 0.9918

[bsyu@p57a22 doc]$ cd /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/

[bsyu@p57a22 examples]$ source /opt/DL/tensorflow/bin/tensorflow-activate

[bsyu@p57a22 examples]$ cat /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/mnist_deep_lms.py

# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""A deep MNIST classifier using convolutional layers.

See extensive documentation at
https://www.tensorflow.org/get_started/mnist/pros
"""
# Disable linter warnings to maintain consistency with tutorial.
# pylint: disable=invalid-name
# pylint: disable=g-bad-import-order

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys
import tempfile

from tensorflow.examples.tutorials.mnist import input_data

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)
FLAGS = None

def deepnn(x):
  """deepnn builds the graph for a deep net for classifying digits.

  Args:
    x: an input tensor with the dimensions (N_examples, 784), where 784 is the
    number of pixels in a standard MNIST image.

  Returns:
    A tuple (y, keep_prob). y is a tensor of shape (N_examples, 10), with values
    equal to the logits of classifying the digit into one of 10 classes (the
    digits 0-9). keep_prob is a scalar placeholder for the probability of
    dropout.
  """
  # Reshape to use within a convolutional neural net.
  # Last dimension is for "features" - there is only one here, since images are
  # grayscale -- it would be 3 for an RGB image, 4 for RGBA, etc.
  with tf.name_scope('reshape'):
    x_image = tf.reshape(x, [-1, 28, 28, 1])

  # First convolutional layer - maps one grayscale image to 32 feature maps.
  with tf.name_scope('conv1'):
    W_conv1 = weight_variable([5, 5, 1, 32])
    b_conv1 = bias_variable([32])
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

  # Pooling layer - downsamples by 2X.
  with tf.name_scope('pool1'):
    h_pool1 = max_pool_2x2(h_conv1)

  # Second convolutional layer -- maps 32 feature maps to 64.
  with tf.name_scope('conv2'):
    W_conv2 = weight_variable([5, 5, 32, 64])
    b_conv2 = bias_variable([64])
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

  # Second pooling layer.
  with tf.name_scope('pool2'):
    h_pool2 = max_pool_2x2(h_conv2)

  # Fully connected layer 1 -- after 2 round of downsampling, our 28x28 image
  # is down to 7x7x64 feature maps -- maps this to 1024 features.
  with tf.name_scope('fc1'):
    W_fc1 = weight_variable([7 * 7 * 64, 1024])
    b_fc1 = bias_variable([1024])

    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

  # Dropout - controls the complexity of the model, prevents co-adaptation of
  # features.
  with tf.name_scope('dropout'):
    keep_prob = tf.placeholder(tf.float32)
    h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

  # Map the 1024 features to 10 classes, one for each digit
  with tf.name_scope('fc2'):
    W_fc2 = weight_variable([1024, 10])
    b_fc2 = bias_variable([10])

    y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
  return y_conv, keep_prob

def conv2d(x, W):
  """conv2d returns a 2d convolution layer with full stride."""
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  """max_pool_2x2 downsamples a feature map by 2X."""
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

def weight_variable(shape):
  """weight_variable generates a weight variable of a given shape."""
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  """bias_variable generates a bias variable of a given shape."""
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

def main(_):
  # Import data
  mnist = input_data.read_data_sets(FLAGS.data_dir)

  # Create the model
  x = tf.placeholder(tf.float32, [None, 784])

  # Define loss and optimizer
  y_ = tf.placeholder(tf.int64, [None])

  # Build the graph for the deep net
  y_conv, keep_prob = deepnn(x)

  with tf.name_scope('loss'):
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(
        labels=y_, logits=y_conv)
  cross_entropy = tf.reduce_mean(cross_entropy)

  with tf.name_scope('adam_optimizer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

  with tf.name_scope('accuracy'):
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), y_)
    correct_prediction = tf.cast(correct_prediction, tf.float32)
  accuracy = tf.reduce_mean(correct_prediction)

  # Enable Large Model Support
  from tensorflow.contrib.lms import LMS
  lms_model = LMS({'adam_optimizer'},
                  excl_scopes = {'loss', 'accuracy', 'dropout'},
                  lb=3)
  lms_model.run(tf.get_default_graph())

  graph_location = tempfile.mkdtemp()
  print('Saving graph to: %s' % graph_location)
  train_writer = tf.summary.FileWriter(graph_location)
  train_writer.add_graph(tf.get_default_graph())

  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
      batch = mnist.train.next_batch(50)
      if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict={
            x: batch[0], y_: batch[1], keep_prob: 1.0})
        print('step %d, training accuracy %g' % (i, train_accuracy))
      train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

    print('test accuracy %g' % accuracy.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument('--data_dir', type=str,
                      default='/tmp/tensorflow/mnist/input_data',
                      help='Directory for storing input data')
  FLAGS, unparsed = parser.parse_known_args()

  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

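MNIST Python code with LMS applied to Estimator-based TensorFlow training

This TensorFlow Python code for MNIST training is the original /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/examples/tutorials/layers/cnn_mnist.py with LMS applied. As you can see, it uses hooks, so this is Estimator-based TensorFlow training, and LMS is applied accordingly.

The code also contains parts for showing the LMS messages and for handling permissions with multiple users, but those have nothing to do with LMS itself; they were marked in red in the original post (the tempfile import, the logging verbosity line and the graph_location block, each commented as not related to LMS). The part that actually implements LMS was marked in bold blue; it is the block under the comment '# Hook for Large Model Support' together with lms_hook in the hooks list. As you can see, it is surprisingly simple. Removing those lines leaves ordinary MNIST training code without LMS.

This example code is also taken as-is from /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py, which is installed with PowerAI 5.2.

When actually run, it behaves as follows, and you can see that 12 tensors are swapped out to and back in from the host server's RAM.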

[bsyu@p57a22 ~]$ cd /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples

[bsyu@p57a22 examples]$ source /opt/DL/tensorflow/bin/tensorflow-activate

[bsyu@p57a22 examples]$ python cnn_mnist_lms.py
...
INFO:tensorflow:[LMS][1] Tensor sparse_softmax_cross_entropy_loss/Sum_1:0 will be placed on /cpu:0
INFO:tensorflow:[LMS][1] Operation: adam_optimizer/gradients/sparse_softmax_cross_entropy_loss/div_grad/RealDiv_1, order 23, type RealDiv
INFO:tensorflow:[LMS][1] Tensor adam_optimizer/gradients/sparse_softmax_cross_entropy_loss/div_grad/RealDiv_1:0 will be placed on /cpu:0
INFO:tensorflow:[LMS][1] Consuming op adam_optimizer/gradients/sparse_softmax_cross_entropy_loss/div_grad/RealDiv_2 (order 24) swaps in adam_optimizer/gradients/sparse_softmax_cross_entropy_loss/div_grad/RealDiv_1:0
INFO:tensorflow:[LMS][1] No control dependency op needed for swap in of op adam_optimizer/gradients/sparse_softmax_cross_entropy_loss/div_grad/RealDiv_1.
INFO:tensorflow:[LMS][0] Edited model is valid and logically equivalent to the original one
INFO:tensorflow:[LMS][0] Added 25 ops into the model
INFO:tensorflow:[LMS][0] Editing model for LMS, took: 88.58513832092285 ms
INFO:tensorflow:[LMS][0] 12 tensors will be swapped out(in) to(from) the host
INFO:tensorflow:Graph was finalized.
...
{'accuracy': 0.9702, 'loss': 0.098796368, 'global_step': 20000}

The full code is shown below.

[bsyu@p57a22 examples]$ cat /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py

# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convolutional Neural Network Estimator for MNIST, built with tf.layers."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tempfile # Change not related to LMS
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO) # Not related to the LMS feature itself; enables INFO logging so the LMS messages are visible.

def cnn_model_fn(features, labels, mode):
  """Model function for CNN."""
  # Input Layer
  # Reshape X to 4-D tensor: [batch_size, width, height, channels]
  # MNIST images are 28x28 pixels, and have one color channel
  input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])

  # Convolutional Layer #1
  # Computes 32 features using a 5x5 filter with ReLU activation.
  # Padding is added to preserve width and height.
  # Input Tensor Shape: [batch_size, 28, 28, 1]
  # Output Tensor Shape: [batch_size, 28, 28, 32]
  conv1 = tf.layers.conv2d(
      inputs=input_layer,
      filters=32,
      kernel_size=[5, 5],
      padding="same",
      activation=tf.nn.relu)

  # Pooling Layer #1
  # First max pooling layer with a 2x2 filter and stride of 2
  # Input Tensor Shape: [batch_size, 28, 28, 32]
  # Output Tensor Shape: [batch_size, 14, 14, 32]
  pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)

  # Convolutional Layer #2
  # Computes 64 features using a 5x5 filter.
  # Padding is added to preserve width and height.
  # Input Tensor Shape: [batch_size, 14, 14, 32]
  # Output Tensor Shape: [batch_size, 14, 14, 64]
  conv2 = tf.layers.conv2d(
      inputs=pool1,
      filters=64,
      kernel_size=[5, 5],
      padding="same",
      activation=tf.nn.relu)

  # Pooling Layer #2
  # Second max pooling layer with a 2x2 filter and stride of 2
  # Input Tensor Shape: [batch_size, 14, 14, 64]
  # Output Tensor Shape: [batch_size, 7, 7, 64]
  pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)

  # Flatten tensor into a batch of vectors
  # Input Tensor Shape: [batch_size, 7, 7, 64]
  # Output Tensor Shape: [batch_size, 7 * 7 * 64]
  pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])

  # Dense Layer
  # Densely connected layer with 1024 neurons
  # Input Tensor Shape: [batch_size, 7 * 7 * 64]
  # Output Tensor Shape: [batch_size, 1024]
  dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)

  # Add dropout operation; 0.6 probability that element will be kept
  dropout = tf.layers.dropout(
      inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)

  # Logits layer
  # Input Tensor Shape: [batch_size, 1024]
  # Output Tensor Shape: [batch_size, 10]
  logits = tf.layers.dense(inputs=dropout, units=10)

  predictions = {
      # Generate predictions (for PREDICT and EVAL mode)
      "classes": tf.argmax(input=logits, axis=1),
      # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
      # `logging_hook`.
      "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
  }
  if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

  # Calculate Loss (for both TRAIN and EVAL modes)
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

  # Configure the Training Op (for TRAIN mode)
  if mode == tf.estimator.ModeKeys.TRAIN:
    with tf.name_scope('adam_optimizer'):
      optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
      train_op = optimizer.minimize(
          loss=loss,
          global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

  # Add evaluation metrics (for EVAL mode)
  eval_metric_ops = {
      "accuracy": tf.metrics.accuracy(
          labels=labels, predictions=predictions["classes"])}
  return tf.estimator.EstimatorSpec(
      mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

def main(unused_argv):
  # Load training and eval data
  mnist = tf.contrib.learn.datasets.load_dataset("mnist")
  train_data = mnist.train.images # Returns np.array
  train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
  eval_data = mnist.test.images # Returns np.array
  eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)

  # The graph_location changes are not related to LMS enablement
  # but rather allow multiple users to run the example without
  # having permission issues on temp directories.
  graph_location = tempfile.mkdtemp()
  print('Saving graph to: %s' % graph_location)

  # Create the Estimator
  mnist_classifier = tf.estimator.Estimator(
      model_fn=cnn_model_fn, model_dir=graph_location)

  # Set up logging for predictions
  # Log the values in the "Softmax" tensor with label "probabilities"
  tensors_to_log = {"probabilities": "softmax_tensor"}
  logging_hook = tf.train.LoggingTensorHook(
      tensors=tensors_to_log, every_n_iter=50)

  # Train the model
  train_input_fn = tf.estimator.inputs.numpy_input_fn(
      x={"x": train_data},
      y=train_labels,
      batch_size=100,
      num_epochs=None,
      shuffle=True)

  # Hook for Large Model Support
  from tensorflow.contrib.lms import LMSHook
  lms_hook = LMSHook({'adam_optimizer'}, lb=3, debug=True)

  mnist_classifier.train(
      input_fn=train_input_fn,
      steps=20000,
      hooks=[logging_hook, lms_hook])

  # Evaluate the model and print results
  eval_input_fn = tf.estimator.inputs.numpy_input_fn(
      x={"x": eval_data},
      y=eval_labels,
      num_epochs=1,
      shuffle=False)
  eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
  print(eval_results)

if __name__ == "__main__":
  tf.app.run()

Python coding guide for TensorFlow LMS (README-LMS.md)

TensorFlow 1.8 in PowerAI 5.2 integrates LMS (Large Model Support), a feature IBM is proud of. When GPU memory runs short, it lets the host server's RAM be used as if it were GPU memory. You can think of it as similar to swapping in/out to swap space on disk when real memory runs short.

In caffe-ibm this is implemented as the -lms option, so you can use it without any coding. TensorFlow, however, is itself a Python library, so using LMS requires a little Python coding. There is no guide for this on the internet yet, but /opt/DL/tensorflow/doc/README-LMS.md, which is installed with PowerAI 5.2, contains a fairly detailed explanation.

IBM PowerAI 5.2 is free software that customers who own a Minsky or AC922 server can order at no additional cost. (You do have to place an official order to receive it, though.) For your convenience, I am posting that README-LMS.md here.

It is all described in detail in the document below, but there are broadly two ways to train in TensorFlow, and accordingly two ways to implement LMS. Here is a summary of my own rough understanding.

1) Session-based training :
If the word graph appears in the existing TensorFlow Python code, implement LMS by inserting the block below before the training part.

from tensorflow.contrib.lms import LMS
lms_obj = LMS({'adam_optimizer'})
lms_obj.run(graph=tf.get_default_graph()) # insert these three lines (shown in blue in the original post)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  batch = mnist.train.next_batch(50)
  train_step.run(feed_dict={x: batch[0], y_: batch[1]})

2) Estimator-based training :
If the word hook appears in the existing TensorFlow Python code, implement LMS by inserting the block below.

from tensorflow.contrib.lms import LMSHook
lms_hook = LMSHook({'adam_optimizer'}) # insert these two lines (shown in blue in the original post)
...

mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook, lms_hook]) # add lms_hook here
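Because tensorflow.contrib.lms exists only in TensorFlow builds that include IBM's LMS contribution, a script that should also run elsewhere can guard the import. The following is a minimal sketch for the Session-based case; the helper name enable_lms is made up, and only the LMS(...) constructor and run(graph=...) call shown above are used.

```python
def enable_lms(graph, optimizer_scopes):
    """Apply the LMS graph edits if tensorflow.contrib.lms can be imported.

    Returns True when LMS was applied, and False when the module is not
    available (for example on a TensorFlow build without LMS).
    """
    try:
        # Per the post, this module is installed with PowerAI's TensorFlow.
        from tensorflow.contrib.lms import LMS
    except ImportError:
        return False
    LMS(set(optimizer_scopes)).run(graph=graph)
    return True
```

It would be called once, before tf.Session() is created, for example as enable_lms(tf.get_default_graph(), {'adam_optimizer'}). When it returns False the script simply trains without LMS, so an out-of-memory error then has to be handled by reducing the batch size instead.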

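For LMS tuning, two parameters matter. In practice you will rarely need to set n_tensors directly in your code, while lb is used occasionally.

n_tensors : The number of tensors to swap out to the host server's RAM and swap back in when needed. Naturally, the more there are, the less GPU memory is used, but training performance gets worse. The default is -1, which makes all tensors swap candidates, although TensorFlow LMS automatically works out and uses a suitable value after an initial estimation.

lb : Short for Lower Bound; the number of tensors to swap in ahead of time. The larger it is, the better the performance should be, but at the cost of more GPU memory. The default is 1.

See README-LMS.md below for the full details.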

[bsyu@p57a22 doc]$ vi /opt/DL/tensorflow/doc/README-LMS.md

# TFLMS: Graph Editing Library for Large Model Support (LMS) in TensorFlow

This library provides an approach to training large models that cannot be fit into GPU memory.
It takes a computational graph defined by users, and automatically adds swap-in and swap-out nodes for transferring tensors from GPUs to the host and vice versa.
The computational graph is statically modified. Hence, it needs to be done before a session actually starts.

## How to use
TFLMS needs to know some information about user-defined models.
There is one requirement for a user-defined model: it must have scopes for the optimizers/solvers.

Enabling LMS for a model depends on how users write their training. The following are guidelines for two ways: [Session](https://www.tensorflow.org/programmers_guide/graphs)-based training and [Estimator](https://www.tensorflow.org/programmers_guide/estimators)-based training.

### [Session](https://www.tensorflow.org/programmers_guide/graphs)-based training
#### Step 1: define optimizer/solver scopes
```python
with tf.name_scope('adam_optimizer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
```
#### Step 2: define an LMS object and run it
```python
from tensorflow.contrib.lms import LMS
lms_obj = LMS({'adam_optimizer'})
lms_obj.run(graph=tf.get_default_graph())
```
The above lines must be put before starting a training session, for example:
- Before inserting LMS code
```python
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
```
- After inserting LMS code
```python
from tensorflow.contrib.lms import LMS
lms_obj = LMS({'adam_optimizer'})
lms_obj.run(graph=tf.get_default_graph())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
```
For a working example of LMS integration with Session based training see:
`/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/contrib/lms/examples/mnist_deep_lms.py`
which is an LMS enabled version of `/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/examples/tutorials/mnist/mnist_deep.py`.

### [Estimator](https://www.tensorflow.org/programmers_guide/estimators)-based training
#### Step 1: define optimizer/solver scopes
```python
with tf.name_scope('adam_optimizer'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
    train_op = optimizer.minimize(
        loss=loss,
        global_step=tf.train.get_global_step())
```
#### Step 2: define an LMSHook (LMSHook and LMS share the same set of parameters)
```python
# Hook for Large Model Support
from tensorflow.contrib.lms import LMSHook
lms_hook = LMSHook({'adam_optimizer'})
```
#### Step 3: add the LMSHook into Estimator's hook list
```python
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook, lms_hook])
```

For a working example of LMS integration with Estimator based training see:
`/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py`
which is an LMS enabled version of `/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/examples/tutorials/layers/cnn_mnist.py`.

### High-Performance Models
A version of TensorFlow High-Performance Models which includes options to use Distributed Deep Learning is included in the tensorflow-performance-models package. These models also have integrated
Large Model Support. For more information, see:

`/opt/DL/tensorflow-performance-models/tf_cnn_benchmarks/README.md`

### GPU scaling tips

To achieve better scaling performance with LMS on multiple GPUs,
the training script should be updated to use PowerAI Distributed Deep Learning and
the training script should be run with the `ddlrun` command. Additionally,
if running on a single system without an Infiniband set up,
the `--mpiarg -pami_noib` parameter must be added to the ddlrun command
line, for example:

```bash
ddlrun --mpiarg -pami_noib -H host1 python tf_cnn_benchmarks.py --batch_size=512 --num_batches=100 --model=resnet50 --gpu_thread_mode=gpu_shared --num_gpus=1 --display_every=10 --lms=True --lms_lb=3 --lms_n_tensors=80 --variable_update=ddl
```

For more information on using ddlrun, see: `/opt/DL/ddl/doc/README.md` and `/opt/DL/ddl-tensorflow/doc/README.md`.

### Parameters for LMS/LMSHook
#### Required parameters
_graph_ :: the graph we will modify for LMS. This should be the graph of user-defined neural network. (not required in LMSHook)

_optimizer_scopes_ :: scopes for the optimizers/solvers.

#### Optional parameters
_starting_scope_ :: Tensors that are reachable from the operations in this scope will be swapped for LMS. Set this to the scope of the first layer if we would like to modify the whole graph. Default `None`.

_starting_op_names_ :: Tensors that are reachable from the operations with these names will be swapped for LMS. Default `None`.

_excl_scopes_ :: a set of scopes for operations whose tensors will not be swapped out to the host. Default `empty`.

_incl_scopes_ :: a set of scopes for operations whose tensors will be swapped out to the host. Default `empty`.

_excl_types_ :: a set of types for operations whose tensors will not be swapped out to the host. Default `empty`.

_incl_types_ :: a set of types for operations whose tensors will be swapped out to the host. Default `empty`.

_n_tensors_ :: The number of tensors for LMS, counting from the `starting_scope`. To turn off LMS, set `n_tensors` to `0`. Default `-1` (all reachable tensors will be swapped for LMS).

_lb_ :: Lowerbound value for LMS. A tensor will be swapped in during the backward phase at least `lb` nodes before it in the graph. Default `1`.

_ub_ :: Upperbound value for LMS. Default `10000`.

_fuse_swapins_ :: Fuse "close" swap-in operations into one operation. This may improve the performance. Default `False`.

_ctrld_strategy_ :: Two strategies to find control dependency ops for swapin ops: `chain_rule` and `direct_order`. `chain_rule` strategy starts from a forward operation, goes forward and finds a corresponding backward operation to be a control dependency operation. `direct_order` strategy directly gets a backward op in the topological order to be a control dependency operation. Both strategies depend on `lb` and `ub` to choose a control dependency operation. While the `direct_order` is more exact than `chain_rule` in relation to `lb` and `ub`, it experimentally often results in smaller maximum batch size than `chain_rule`. Default `chain_rule`.

_swap_branches_ :: If True, LMS will swap tensors in branches in the forward phase. Default `False`.

_branch_threshold_ :: If `swap_branches` is enabled and the topological-sort distance between the consuming operation and generating operation of a tensor is greater than `branch_threshold`, then swap the tensor. Default `0`.

_debug_ :: Debug mode for LMS. Default `False`.

_debug_level_ :: Debug level for LMS (1 or 2). Default `1`.

### Performance Tuning LMS

Once you have enabled LMS graph modification in your code you will want to find the combination of tuning parameters that gives the fastest training time and best accuracy with your model. The goal of the performance tuning is to swap out enough tensors to allow your training to run without hitting out of memory errors, while not swapping too many such that the extra swapping communication overhead degrades performance.

The two tuning parameters you should focus on are `n_tensors` and `lb`. Since `n_tensors` controls the number of tensors that will be swapped, the higher this is set, the lower the peak GPU memory usage will be. The `lb` controls how soon the tensor is swapped back in before use. A low value of `lb` can make the training on the GPU pause and wait while the swap in finishes. This will degrade performance. A higher value of `lb` can allow the tensor swap in to finish before it's needed and allow training to run without pause. The downside to swapping in too early is that more tensors will be in GPU memory at any point in time, resulting in higher peak GPU memory usage.

The tuning thus becomes finding the correct balance between `n_tensors` and `lb` that provides the best performance for a given model. To start the performance tuning it's suggested that `n_tensors` be set to -1 which will swap all reachable tensors. The `lb` should be set to the default 1, which is the latest possible swap in. If `tf.logging` verbosity is set to `tf.logging.INFO`, LMS will output a log statement with a count of the number of tensors swapped. It is useful to run with `n_tensors=-1` for the first run to find this maximum value and then adjust it downward. If your model has branches like some UNet models do, you will likely want to set `swap_branches=True` and tune the branch threshold as well.

By default LMS will analyze your graph to find the starting operations to use for finding tensor swap candidates. You can bypass this analysis by placing your starting operations in a named scope and providing the scope on the `starting_scope` parameter, or by providing the names of the starting operations on the `starting_op_names` parameter. This can speed up repeated runs of LMS during tuning. Furthermore, you can enable `debug=True` and `debug_level=1` and LMS will print out the name and type of the starting operations it finds. These names could be passed in on the `starting_op_names` parameter on subsequent runs.

It is recommended that you start with tuning training on a single GPU before enabling your code for multi-GPU with DDL.
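
The tuning procedure in the README (first run with `n_tensors=-1` and `lb=1`, then lower `n_tensors` and raise `lb`) can be sketched as a small sweep generator. This is only an illustration: the helper name `lms_tuning_candidates`, the step fractions, the `lb` values and the count of 80 are all made up, and passing each setting on as keyword arguments to `LMS` is an assumption based on the parameter list above.

```python
def lms_tuning_candidates(max_swapped, lbs=(1, 3, 5), fractions=(1.0, 0.75, 0.5, 0.25)):
    """Return (n_tensors, lb) settings to try, most aggressive swapping first."""
    # First run, as the README suggests: swap every reachable tensor, latest swap-in.
    candidates = [{"n_tensors": -1, "lb": 1}]
    for fraction in fractions:
        n_tensors = max(1, int(max_swapped * fraction))
        for lb in lbs:
            candidates.append({"n_tensors": n_tensors, "lb": lb})
    return candidates


# max_swapped=80 is a made-up count standing in for the number LMS logs at INFO level.
configs = lms_tuning_candidates(max_swapped=80)
for cfg in configs[:4]:
    # Assumption: each dict would be passed on as LMS({'adam_optimizer'}, **cfg).
    print(cfg)
```

Each candidate would then be timed on a single GPU, keeping the fastest setting that still avoids out of memory errors.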

Friday, October 27, 2017

An explanation of the LMS feature in caffe-ibm

After the earlier DDL post (https://hwengineer.blogspot.kr/2017/10/caffe-ddl-alexnet-training.html), a natural question is: why not rely on LMS (large model support) and boldly raise batch_size to 2048 or 4096? The short answer is that even with LMS you cannot increase batch_size without limit.

First, as that post explains, LMS is 'a feature that lets you run large models that previously could not run because of the GPU memory limit'; it does not necessarily make anything faster. Even when data comes over NVLink, host server memory is still slower than GPU memory.

Separately from that, LMS cannot use host memory without limit either. According to what I heard from the Lab, even with LMS the following must always reside in GPU memory.

- input tensor
- output tensor
- weight
- gradient (when training)

The memory these items occupy grows along with batch_size, which is why there is ultimately a limit. The core idea of LMS is that when a deep learning network is processed layer by layer, the layer currently being processed stays in GPU memory while layers already processed, or to be processed later, can be kept in host memory. Therefore the largest neural network LMS can handle is determined by the size of the largest layer in that network.
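
The "largest layer decides" argument can be sketched as a toy model. It is a deliberate simplification, not how caffe-ibm actually accounts for memory: it assumes that without LMS every layer's data stays on the GPU, and that with LMS only the layer being processed must be resident. The layer sizes and the 16 GiB capacity are hypothetical.

```python
GIB = 1024 ** 3


def gpu_bytes_needed(layer_bytes, use_lms):
    """Toy estimate of the GPU memory a network needs, given per-layer sizes in bytes."""
    # Without LMS every layer's data stays on the GPU; with LMS only the
    # layer being processed has to be resident (simplifying assumption).
    return max(layer_bytes) if use_lms else sum(layer_bytes)


def fits(layer_bytes, use_lms, gpu_bytes=16 * GIB):
    """True if the toy estimate fits in the given GPU memory."""
    return gpu_bytes_needed(layer_bytes, use_lms) <= gpu_bytes


deep_net = [2 * GIB] * 20   # hypothetical: 20 layers of 2 GiB each
wide_net = [20 * GIB] * 2   # hypothetical: 2 layers of 20 GiB each

print(fits(deep_net, use_lms=False), fits(deep_net, use_lms=True))   # False True
print(fits(wide_net, use_lms=False), fits(wide_net, use_lms=True))   # False False
```

In this model the deep network only becomes trainable with LMS, while the wide one stays out of reach because a single layer already exceeds the GPU.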

For example, when running 'caffe time' with AlexNet's deploy.prototxt as in the earlier test, a total of 24 layers such as data, conv1, and relu1 are created, as shown below.

$ grep "Creating Layer" caffe_time.log
…
I1025 16:37:01.848961 29514 net.cpp:90] Creating Layer data
I1025 16:37:01.867213 29514 net.cpp:90] Creating Layer conv1
…
I1025 16:37:03.477823 29514 net.cpp:90] Creating Layer fc8
I1025 16:37:03.481210 29514 net.cpp:90] Creating Layer prob

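For each layer, the Top shape is determined and "Memory required for data" is computed, as shown below. That value is a running total across layers, so it is small at the first layer but becomes very large later on.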

…
I1025 16:37:01.867137 29514 net.cpp:135] Top shape: 10 3 1600 1200 (57600000)
I1025 16:37:01.867183 29514 net.cpp:143] Memory required for data: 230400000
…
I1025 16:37:02.231456 29514 net.cpp:135] Top shape: 10 96 398 298 (113859840)
I1025 16:37:02.231468 29514 net.cpp:143] Memory required for data: 685839360
…
I1025 16:37:02.246103 29514 net.cpp:135] Top shape: 10 256 49 37 (4641280)
I1025 16:37:02.246112 29514 net.cpp:143] Memory required for data: 3315185920
…
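
The numbers in this log can be reproduced by hand. The sketch below parses "Top shape" lines and multiplies the element count by 4 bytes per value; the 4-byte size is an assumption, chosen because it matches the log (57600000 values, 230400000 bytes). The helper name `blob_bytes` is made up for this example.

```python
import re
from math import prod

TOP_SHAPE = re.compile(r"Top shape: ([\d ]+) \((\d+)\)")


def blob_bytes(log_line, bytes_per_value=4):
    """Return the size in bytes of the blob described by a 'Top shape' log line."""
    match = TOP_SHAPE.search(log_line)
    if match is None:
        return None
    dims = [int(d) for d in match.group(1).split()]
    count = prod(dims)
    # The number in parentheses is the element count printed by caffe.
    if count != int(match.group(2)):
        raise ValueError("element count does not match the shape")
    return count * bytes_per_value


lines = [
    "I1025 16:37:01.867137 29514 net.cpp:135] Top shape: 10 3 1600 1200 (57600000)",
    "I1025 16:37:02.231456 29514 net.cpp:135] Top shape: 10 96 398 298 (113859840)",
]
running_total = 0
for line in lines:
    running_total += blob_bytes(line)
    print(running_total)   # 230400000, then 685839360
```

The two printed totals match the "Memory required for data" values in the log, which shows that the figure accumulates from layer to layer.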

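Because of these characteristics, a deep neural network (many layers) can benefit from LMS far more than a wide one (high memory requirement per layer).

The real advantages of LMS can be summarized as follows.

1) In width, models roughly 10-30 times larger can be used
2) In depth, models of effectively unlimited size can be used

LMS is said to be especially useful for RNNs, which by nature have many layers. Neural networks are also said to be evolving toward greater depth rather than greater width. AlexNet (2012), for example, has 8 learned layers, while ResNet, published three years later, was trained with up to 152 layers on ImageNet and experimentally with more than 1,000 layers. The range of applications for LMS keeps widening.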

Wednesday, September 13, 2017

Sizing GPU capacity for an inference system, and the Large Model Support (LMS) option in IBM caffe

Today's topic is inference, and specifically how to size a GPU system for inference. The concrete case here is inference on image data with caffe. In addition, we will look at what benefit the -lms (Large Model Support) option, available only on IBM Minsky servers, provides.

The basic method is described at the site below. I asked the PhDs on the Deep Learning development team at IBM China, and they confirmed that this method is correct.

https://stackoverflow.com/questions/36867591/how-to-estimate-inference-time-from-average-forward-pass-time-in-caffe

The key part is the following.

For instance, if I run the default command that comes with Caffe:

build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --gpu=0
I get the following output

...
I0426 13:07:32.701490 30417 layer_factory.hpp:77] Creating layer data
I0426 13:07:32.701513 30417 net.cpp:91] Creating Layer data
I0426 13:07:32.701529 30417 net.cpp:399] data -> data
I0426 13:07:32.709048 30417 net.cpp:141] Setting up data
I0426 13:07:32.709079 30417 net.cpp:148] Top shape: 10 3 227 227 (1545870)
I0426 13:07:32.709084 30417 net.cpp:156] Memory required for data: 6183480
...
I0426 13:07:34.390281 30417 caffe.cpp:377] Average Forward pass: 16.7818 ms.
I0426 13:07:34.390290 30417 caffe.cpp:379] Average Backward pass: 12.923 ms.
I0426 13:07:34.390296 30417 caffe.cpp:381] Average Forward-Backward: 29.7969 ms.
The following line:

I0426 13:07:32.709079 30417 net.cpp:148] Top shape: 10 3 227 227 (1545870)
is super important. It says that your input layer is 10x3x227x227-dimensional. In this case, the batch size is 10 images, each of size 3x227x227 (the 3 refers to each of the rgb channels in an image).

So effectively, it took 1.67818 ms/image to do a forward pass or inference time per image.

In other words, in the output of the time sub-command of caffe, which is a performance benchmark run with caffe, the average forward pass time divided by the batch size is the inference time per image for that model and image size. Naturally, the larger the Top shape of the model's data layer (here 10 3 227 227, that is, batch size 10 x channels (RGB) 3 x height 227 x width 227), the longer it takes.
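
The sizing arithmetic can be written down as a small sketch. The per-image time follows the quoted answer directly; the `gpus_needed` helper and the target of 1000 images per second are hypothetical, and the helper assumes throughput scales linearly with the number of GPUs and ignores data loading and preprocessing.

```python
import math


def per_image_ms(avg_forward_ms, batch_size):
    """Inference time per image, from caffe's 'Average Forward pass' figure."""
    return avg_forward_ms / batch_size


def gpus_needed(target_images_per_sec, avg_forward_ms, batch_size):
    """GPUs required for a target throughput, assuming linear scaling across GPUs."""
    images_per_sec_per_gpu = 1000.0 / per_image_ms(avg_forward_ms, batch_size)
    return math.ceil(target_images_per_sec / images_per_sec_per_gpu)


latency = per_image_ms(16.7818, 10)
print(f"{latency:.5f} ms/image")          # 1.67818 ms/image
print(gpus_needed(1000, 16.7818, 10))     # 2
```

With these numbers one GPU handles a little under 600 images per second, so the hypothetical target of 1000 images per second needs two.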

I had the chance to use a virtual machine with one P100 on a Minsky server from Nimbix (nimbix.net/powerai), an HPC cloud service provider, so I ran this test there. For reference, Nimbix provides docker-based virtual machines with NVLink P100 GPUs, which I expect to cover in a later post.

First, let's see how long an NVLink P100 takes to run GoogleNet inference on a single 1200x1200 image. To do this, edit the deploy.prototxt that comes with GoogleNet as shown below. The original line is kept underneath, commented out with #.

nimbix@JARVICENAE-0A0A1844:/data$ vi bvlc_googlenet/deploy.prototxt
name: "GoogleNet"
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 1 dim: 3 dim: 1200 dim: 1200 } }
#  input_param { shape: { dim: 10 dim: 3 dim: 224 dim: 224 } }
}

Now run caffe time with the modified model.

nimbix@JARVICENAE-0A0A1844:/data$ caffe time -gpu 0 -model=/data/bvlc_googlenet/deploy.prototxt --iterations=1

You don't need to read through the whole output; only the Average Forward pass time in the benchmark results at the very end matters.

I0908 05:39:36.830621 567 caffe.cpp:513] prob forward: 0.020864 ms.
I0908 05:39:36.830627 567 caffe.cpp:516] prob backward: 0.00368 ms.
I0908 05:39:36.830641 567 caffe.cpp:521] Average Forward pass: 45.3671 ms.
I0908 05:39:36.830649 567 caffe.cpp:523] Average Backward pass: 102.551 ms.
I0908 05:39:36.830657 567 caffe.cpp:525] Average Forward-Backward: 150.178 ms.
I0908 05:39:36.830673 567 caffe.cpp:527] Total Time: 150.178 ms.
I0908 05:39:36.830689 567 caffe.cpp:528] *** Benchmark ends ***

If we had set the batch size (the first dim) to 10, we would have to divide that Average Forward pass time by 10. Since we set dim to 1, we can use the value as is. In other words, GoogleNet inference on one 3-channel RGB 1200x1200 image takes about 0.045 seconds on a P100 GPU.

The benchmark output displayed in the test above gives a rough picture of how deep learning is structured. As shown below, the Top shape starts at 1 x 3 x 1200 x 1200, becomes 1 x 64 x 600 x 600 at the next stage, then 300 x 300, and keeps halving until it ends at 31 x 31. At the last stage the number of channels has grown to as many as 1024; what that means is beyond an ignorant HW engineer like me. What matters to a HW engineer is the memory size required. The memory needed is displayed as 'Memory required for data' after each Top shape; it starts at about 17 MB at the first stage and reaches nearly 1.6 GB by the last.

...
I0908 05:39:25.035709 567 net.cpp:135] Top shape: 1 3 1200 1200 (4320000)
I0908 05:39:25.035733 567 net.cpp:143] Memory required for data: 17280000
I0908 05:39:25.035754 567 layer_factory.hpp:77] Creating layer conv1/7x7_s2
I0908 05:39:25.035786 567 net.cpp:90] Creating Layer conv1/7x7_s2
I0908 05:39:25.035799 567 net.cpp:635] conv1/7x7_s2 <- data
I0908 05:39:25.035816 567 net.cpp:609] conv1/7x7_s2 -> conv1/7x7_s2
I0908 05:39:29.695616 567 net.cpp:128] Setting up conv1/7x7_s2
I0908 05:39:29.695672 567 net.cpp:135] Top shape: 1 64 600 600 (23040000)
I0908 05:39:29.695695 567 net.cpp:143] Memory required for data: 109440000
...
I0908 05:39:29.862272 567 net.cpp:128] Setting up pool5/drop_7x7_s1
I0908 05:39:29.862279 567 net.cpp:135] Top shape: 1 1024 31 31 (984064)
I0908 05:39:29.862287 567 net.cpp:143] Memory required for data: 1587930496
I0908 05:39:29.862294 567 layer_factory.hpp:77] Creating layer loss3/classifier
I0908 05:39:29.862305 567 net.cpp:90] Creating Layer loss3/classifier
I0908 05:39:29.862311 567 net.cpp:635] loss3/classifier <- pool5/7x7_s1
I0908 05:39:29.862320 567 net.cpp:609] loss3/classifier -> loss3/classifier
I0908 05:39:36.385628 567 net.cpp:128] Setting up loss3/classifier
I0908 05:39:36.385684 567 net.cpp:135] Top shape: 1 1000 (1000)
I0908 05:39:36.385696 567 net.cpp:143] Memory required for data: 1587934496
I0908 05:39:36.385712 567 layer_factory.hpp:77] Creating layer prob
I0908 05:39:36.385728 567 net.cpp:90] Creating Layer prob
I0908 05:39:36.385737 567 net.cpp:635] prob <- loss3/classifier
I0908 05:39:36.385749 567 net.cpp:609] prob -> prob
I0908 05:39:36.386745 567 net.cpp:128] Setting up prob
I0908 05:39:36.386756 567 net.cpp:135] Top shape: 1 1000 (1000)
I0908 05:39:36.386765 567 net.cpp:143] Memory required for data: 1587938496
I0908 05:39:36.386771 567 net.cpp:206] prob does not need backward computation.
...

Wait, 1.6 GB? A P100 has only 16 GB of GPU memory, so what happens if we run inference on 10 such images at once? Will it really fail? Let's try. We use the same model as above and only change the first dim, the batch size, from 1 to 10.

nimbix@JARVICENAE-0A0A1844:/data$ vi bvlc_googlenet/deploy.prototxt
name: "GoogleNet"
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 10 dim: 3 dim: 1200 dim: 1200 } }
#  input_param { shape: { dim: 1 dim: 3 dim: 1200 dim: 1200 } }
#  input_param { shape: { dim: 10 dim: 3 dim: 224 dim: 224 } }
}

Now run caffe time the same way.

nimbix@JARVICENAE-0A0A1844:/data$ caffe time -gpu 0 -model=/data/bvlc_googlenet/deploy.prototxt --iterations=1

I0908 05:43:44.249899 646 net.cpp:135] Top shape: 10 3 1200 1200 (43200000)
I0908 05:43:44.249914 646 net.cpp:143] Memory required for data: 172800000
I0908 05:43:44.249928 646 layer_factory.hpp:77] Creating layer conv1/7x7_s2
I0908 05:43:44.249949 646 net.cpp:90] Creating Layer conv1/7x7_s2
I0908 05:43:44.249956 646 net.cpp:635] conv1/7x7_s2 <- data
I0908 05:43:44.249967 646 net.cpp:609] conv1/7x7_s2 -> conv1/7x7_s2
I0908 05:43:44.614331 646 net.cpp:128] Setting up conv1/7x7_s2
I0908 05:43:44.614367 646 net.cpp:135] Top shape: 10 64 600 600 (230400000)
I0908 05:43:44.614382 646 net.cpp:143] Memory required for data: 1094400000
...
I0908 05:43:44.763245 646 net.cpp:135] Top shape: 10 1024 31 31 (9840640)
I0908 05:43:44.763254 646 net.cpp:143] Memory required for data: 15839942400
I0908 05:43:44.763260 646 layer_factory.hpp:77] Creating layer pool5/drop_7x7_s1
I0908 05:43:44.763272 646 net.cpp:90] Creating Layer pool5/drop_7x7_s1
I0908 05:43:44.763278 646 net.cpp:635] pool5/drop_7x7_s1 <- pool5/7x7_s1
I0908 05:43:44.763285 646 net.cpp:596] pool5/drop_7x7_s1 -> pool5/7x7_s1 (in-place)
I0908 05:43:44.763319 646 net.cpp:128] Setting up pool5/drop_7x7_s1
I0908 05:43:44.763325 646 net.cpp:135] Top shape: 10 1024 31 31 (9840640)
I0908 05:43:44.763334 646 net.cpp:143] Memory required for data: 15879304960
I0908 05:43:44.763340 646 layer_factory.hpp:77] Creating layer loss3/classifier
I0908 05:43:44.763352 646 net.cpp:90] Creating Layer loss3/classifier
I0908 05:43:44.763358 646 net.cpp:635] loss3/classifier <- pool5/7x7_s1
I0908 05:43:44.763367 646 net.cpp:609] loss3/classifier -> loss3/classifier
I0908 05:43:51.338423 646 net.cpp:128] Setting up loss3/classifier
I0908 05:43:51.345638 646 net.cpp:135] Top shape: 10 1000 (10000)
I0908 05:43:51.345651 646 net.cpp:143] Memory required for data: 15879344960
I0908 05:43:51.345667 646 layer_factory.hpp:77] Creating layer prob
I0908 05:43:51.345683 646 net.cpp:90] Creating Layer prob
I0908 05:43:51.345693 646 net.cpp:635] prob <- loss3/classifier
I0908 05:43:51.345705 646 net.cpp:609] prob -> prob
I0908 05:43:51.346666 646 net.cpp:128] Setting up prob
I0908 05:43:51.346678 646 net.cpp:135] Top shape: 10 1000 (10000)
I0908 05:43:51.346685 646 net.cpp:143] Memory required for data: 15879384960
...
I0908 05:43:51.724148 646 caffe.cpp:465] Initial loss: 0
I0908 05:43:51.724202 646 caffe.cpp:466] Performing Backward
I0908 05:43:51.724215 646 caffe.cpp:474] *** Benchmark begins ***
I0908 05:43:51.724222 646 caffe.cpp:475] Testing for 1 iterations.
F0908 05:43:51.915272 646 syncedmem.cpp:651] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x100000f5ce0c google::LogMessage::Fail()
@ 0x100000f5f284 google::LogMessage::SendToLog()
@ 0x100000f5c768 google::LogMessage::Flush()
@ 0x100000f611c4 google::LogMessageFatal::~LogMessageFatal()
@ 0x10000026e3a0 caffe::SyncedMemory::mutable_gpu_data()
@ 0x1000002736c4 caffe::Blob<>::mutable_gpu_diff()
@ 0x1000004e774c caffe::InnerProductLayer<>::Backward_gpu()
@ 0x10018ca8 (unknown)
@ 0x10012974 (unknown)
@ 0x100001c2309c (unknown)
@ 0x100001c23298 __libc_start_main
@ (nil) (unknown)

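Ah, it really does fail. The log said the data alone needs as much as 15.8 GB of memory, and as soon as the actual benchmark started it aborted with an out of memory error. This is the moment you truly realize that what holds a GPU back is the limit on GPU memory size.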
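
The batch-10 figure could have been predicted from the batch-1 run. The sketch below scales the batch-1 "Memory required for data" total linearly with batch size and compares it with the GPU capacity. Treating "16 GB" as 16 GiB is an assumption, and the figure covers only the data blobs reported in the log, not weights, gradients or workspace.

```python
GIB = 1024 ** 3


def data_bytes(bytes_at_batch_1, batch_size):
    """Scale the 'Memory required for data' total linearly with batch size."""
    return bytes_at_batch_1 * batch_size


total = data_bytes(1_587_938_496, 10)
fraction = total / (16 * GIB)   # assumption: the P100's 16 GB means 16 GiB
print(total)                    # 15879384960, the same total as in the batch-10 log
print(f"{fraction:.1%} of GPU memory for data blobs alone")
```

The data blobs alone take up more than 90% of the card in this estimate, which leaves very little room for everything else the benchmark has to allocate.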

But IBM and NVIDIA do not give up here. NVIDIA's CUDA already offered a feature called Unified Memory, which lets the GPU use CPU memory as if it were GPU memory. In practice, however, the path from the GPU to CPU memory is slow PCIe, so it was common knowledge that Unified Memory, while convenient, drops performance to roughly 1/10. The same held for the DGX-1 server equipped with NVLink P100s. In the DGX-1, only the GPUs are connected to each other by NVLink; the CPUs and GPUs are still connected by PCIe. As a result, nobody considered using unified memory in caffe.

IBM Minsky is different. The POWER8 processor has NVLink ports built in, so the CPU and GPU are connected directly by NVLink, and with two NVLinks ganged together the connection runs at 80 GB/sec. That is 2.5 times the bandwidth of PCIe. This makes it possible for caffe to move data directly between CPU and GPU. IBM has in fact applied this to the IBM caffe (caffe-ibm) included in the recently announced PowerAI 4.0. As a result, IBM caffe has a new option that neither the regular bvlc caffe nor NV caffe has: -lms (LMS, Large Model Support).
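
To get a feel for what the bandwidth difference means, here is a back-of-the-envelope sketch of how long one large blob takes to move between CPU and GPU memory. It uses only the 80 GB/sec figure and the 2.5x ratio from the text; these are peak numbers, so real transfers will be slower, and the 4 bytes per value is an assumption consistent with the logs.

```python
def transfer_ms(num_bytes, gb_per_sec):
    """Time to move num_bytes at a given peak bandwidth, in milliseconds."""
    return num_bytes / (gb_per_sec * 1e9) * 1e3


NVLINK_GB_S = 80.0               # figure quoted in the text for two ganged NVLinks
PCIE_GB_S = NVLINK_GB_S / 2.5    # implied by the "2.5 times" comparison in the text

blob = 10 * 64 * 600 * 600 * 4   # a 10 x 64 x 600 x 600 blob at 4 bytes per value
print(f"NVLink: {transfer_ms(blob, NVLINK_GB_S):.2f} ms")   # 11.52 ms
print(f"PCIe:   {transfer_ms(blob, PCIE_GB_S):.2f} ms")     # 28.80 ms
```

A network with dozens of such blobs pays this cost on every swap, which is why the CPU-GPU link matters so much for LMS.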

The -lms option is described in the document below.

https://public.dhe.ibm.com/software/server/POWER/Linux/mldl/ubuntu/README.html

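For those who would rather not read it, here is a brief summary.

-lms 8000000 : This means that memory chunks of 8000000 (in KB, so about 8 GB) or larger are simply kept in CPU memory.

In other words, the larger the number after -lms, the more the GPU memory is favored, with CPU memory used only when really necessary. The practical maximum is therefore around 16000000, and any larger number effectively disables the LMS option. Conversely, a very small -lms value such as 100 effectively means: do not use GPU memory, put everything in CPU memory.

There is also an option -lms_frac <0~1.0>. With -lms_frac 0.4, for example, LMS is not used until GPU memory utilization reaches 40%. When running a small model there is no reason to use the slower CPU memory, so a value around -lms_frac 0.9 is a good choice.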
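
The two options can be pictured as a simple placement rule. The sketch below models only the description given here; it is not caffe-ibm's actual allocator, whose real behavior may differ. The function name `place_chunk`, treating 1 KB as 1024 bytes, and the 16 GiB capacity are all assumptions.

```python
KB = 1024  # assumption: the -lms value is counted in units of 1024 bytes


def place_chunk(chunk_bytes, lms_kb, lms_frac=0.0,
                gpu_used_bytes=0, gpu_total_bytes=16 * 1024 ** 3):
    """Return 'gpu' or 'cpu' for one memory chunk under the rule described in the text."""
    # Below the -lms_frac utilization, LMS stays out of the way.
    if gpu_used_bytes / gpu_total_bytes < lms_frac:
        return "gpu"
    # Otherwise chunks at or above the -lms threshold are kept in CPU memory.
    return "cpu" if chunk_bytes >= lms_kb * KB else "gpu"


chunk = 10 * 64 * 600 * 600 * 4   # a 921,600,000-byte blob (10 x 64 x 600 x 600 values)
print(place_chunk(chunk, lms_kb=8192))                  # cpu
print(place_chunk(chunk, lms_kb=16000000))              # gpu
print(place_chunk(chunk, lms_kb=8192, lms_frac=0.9))    # gpu
```

The third call returns gpu because the modeled GPU is still empty, so the 90% utilization threshold has not been reached yet.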

Now let's actually apply the -lms option to the model that ran out of memory above. First I specified -lms 8192, meaning that all memory chunks of 8 MB or larger are to be kept in CPU memory.

nimbix@JARVICENAE-0A0A1844:/data$ caffe time -gpu 0 -lms 8192 -model=/data/bvlc_googlenet/deploy.prototxt --iterations=1

I0908 05:47:44.949090 676 net.cpp:135] Top shape: 10 3 1200 1200 (43200000)
I0908 05:47:44.949105 676 net.cpp:143] Memory required for data: 172800000
I0908 05:47:44.949124 676 layer_factory.hpp:77] Creating layer conv1/7x7_s2
I0908 05:47:44.949146 676 net.cpp:90] Creating Layer conv1/7x7_s2
I0908 05:47:44.949153 676 net.cpp:635] conv1/7x7_s2 <- data
I0908 05:47:44.949167 676 net.cpp:609] conv1/7x7_s2 -> conv1/7x7_s2
I0908 05:47:45.580006 676 net.cpp:128] Setting up conv1/7x7_s2
I0908 05:47:45.580046 676 net.cpp:135] Top shape: 10 64 600 600 (230400000)
I0908 05:47:45.580060 676 net.cpp:143] Memory required for data: 1094400000
...
I0908 05:47:57.704324 676 caffe.cpp:465] Initial loss: 0
I0908 05:47:57.704356 676 caffe.cpp:466] Performing Backward
I0908 05:47:57.704371 676 caffe.cpp:474] *** Benchmark begins ***
I0908 05:47:57.704377 676 caffe.cpp:475] Testing for 1 iterations.
I0908 05:47:57.711424 676 syncedmem.cpp:355] [LMS] memory[0x110024232400] device_=0 size_ = 921600000 allocation=7349057792 fragmented size = 655558000 gpu_ptr_=1155371368464
I0908 05:47:57.769644 676 syncedmem.cpp:355] [LMS] memory[0x110024258aa0] device_=0 size_ = 230400000 allocation=7579458048 fragmented size = 425158224 gpu_ptr_=1122381070352
I0908 05:47:57.778683 676 syncedmem.cpp:355] [LMS] memory[0x110024286d30] device_=0 size_ = 230400000 allocation=7809858304 fragmented size = 425158464 gpu_ptr_=1122842444032
I0908 05:47:57.790587 676 syncedmem.cpp:355] [LMS] memory[0x1100242c0be0] device_=0 size_ = 691200000 allocation=8731458560 fragmented size = 655558704 gpu_ptr_=1156294115344
I0908 05:47:57.838747 676 syncedmem.cpp:355] [LMS] memory[0x1100242df300] device_=0 size_ = 691200000 allocation=9653058816 fragmented size = 885958944 gpu_ptr_=1157447262464
...
I0908 05:47:58.203995 676 caffe.cpp:513] pool5/7x7_s1 forward: 4.48429 ms.
I0908 05:47:58.204002 676 caffe.cpp:516] pool5/7x7_s1 backward: 0.002144 ms.
I0908 05:47:58.204010 676 caffe.cpp:513] pool5/drop_7x7_s1 forward: 0.367552 ms.
I0908 05:47:58.204015 676 caffe.cpp:516] pool5/drop_7x7_s1 backward: 0.002112 ms.
I0908 05:47:58.204022 676 caffe.cpp:513] loss3/classifier forward: 18.1078 ms.
I0908 05:47:58.204033 676 caffe.cpp:516] loss3/classifier backward: 0.002112 ms.
I0908 05:47:58.204041 676 caffe.cpp:513] prob forward: 0.022848 ms.
I0908 05:47:58.204047 676 caffe.cpp:516] prob backward: 0.011328 ms.
I0908 05:47:58.204061 676 caffe.cpp:521] Average Forward pass: 495.206 ms.
I0908 05:47:58.204067 676 caffe.cpp:523] Average Backward pass: 2.21437 ms.
I0908 05:47:58.204074 676 caffe.cpp:525] Average Forward-Backward: 499.65 ms.
I0908 05:47:58.204092 676 caffe.cpp:527] Total Time: 499.65 ms.
I0908 05:47:58.204107 676 caffe.cpp:528] *** Benchmark ends ***

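Yes! Messages showing LMS in use were displayed along the way, and the run completed successfully. Since the slower CPU memory is being used, performance must have dropped. By how much? The result here is Average Forward pass: 495.206 ms, and since the batch size is 10, that is 0.0495 seconds per image. That is about 10% slower than the 0.045 seconds measured with one image at a time. Running 10 images as a batch should actually be faster per image than running them one at a time, so being 10% slower is a substantial slowdown.

So is a serious performance loss unavoidable with LMS? Not necessarily. What I just ran was the extreme case of telling caffe to keep almost every memory chunk in CPU memory. If we tell it to use GPU memory actively and fall back to CPU memory only for chunks larger than the GPU memory size, performance should be much better.

This time let's run it with the -lms 160000000 option.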

nimbix@JARVICENAE-0A0A1844:/data$ caffe time -gpu 0 -lms 160000000 -model=/data/bvlc_googlenet/deploy.prototxt --iterations=1

I0908 06:32:20.006875 1126 net.cpp:135] Top shape: 10 3 1200 1200 (43200000)
I0908 06:32:20.006891 1126 net.cpp:143] Memory required for data: 172800000
I0908 06:32:20.006904 1126 layer_factory.hpp:77] Creating layer conv1/7x7_s2
I0908 06:32:20.006927 1126 net.cpp:90] Creating Layer conv1/7x7_s2
I0908 06:32:20.006933 1126 net.cpp:635] conv1/7x7_s2 <- data
I0908 06:32:20.006944 1126 net.cpp:609] conv1/7x7_s2 -> conv1/7x7_s2
I0908 06:32:20.591289 1126 net.cpp:128] Setting up conv1/7x7_s2
I0908 06:32:20.591329 1126 net.cpp:135] Top shape: 10 64 600 600 (230400000)
I0908 06:32:20.591343 1126 net.cpp:143] Memory required for data: 1094400000
...
I0908 06:32:28.272960 1126 net.cpp:296] [LMS] BuildLargeModelSupport
W0908 06:32:28.273018 1126 net.cpp:348] [LMS] ######################################################
W0908 06:32:28.273172 1126 net.cpp:349] [LMS] uncovered layer type: Softmax
W0908 06:32:28.273182 1126 net.cpp:350] [LMS] ######################################################
W0908 06:32:28.273310 1126 net.cpp:348] [LMS] ######################################################
W0908 06:32:28.273320 1126 net.cpp:349] [LMS] uncovered layer type: Input
W0908 06:32:28.273329 1126 net.cpp:350] [LMS] ######################################################
I0908 06:32:28.273347 1126 net.cpp:425] [LMS] data_forward [0] data: -> data: 0x110009bfa4f0(172800000) ### flag=0 data:
I0908 06:32:28.273361 1126 net.cpp:425] [LMS] conv1/7x7_s2_forward [1] data: 0x110009bfa4f0(172800000) -> data: 0x1100233f7520(921600000) ### flag=0 data: 0x110009bfa4f0(1,1)
...
I0908 06:32:29.055697 1126 caffe.cpp:513] prob forward: 0.022016 ms.
I0908 06:32:29.055704 1126 caffe.cpp:516] prob backward: 0.006848 ms.
I0908 06:32:29.055716 1126 caffe.cpp:521] Average Forward pass: 263.516 ms.
I0908 06:32:29.055724 1126 caffe.cpp:523] Average Backward pass: 2.21066 ms.
I0908 06:32:29.055730 1126 caffe.cpp:525] Average Forward-Backward: 267.967 ms.
I0908 06:32:29.055748 1126 caffe.cpp:527] Total Time: 267.967 ms.
I0908 06:32:29.055764 1126 caffe.cpp:528] *** Benchmark ends ***

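This time 10 images took 263.516 ms, or 0.0263 seconds per image. Compared with the 0.045 seconds measured one image at a time, that is about 1.7 times the throughput, or roughly 42% less time per image. Thanks to LMS we could run 10 images per batch, and that made things faster. So using LMS can even yield better performance.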
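
The three runs can be put side by side with a few lines of arithmetic. The figures are the Average Forward pass times from the logs above; the batch-1 run is used as the baseline.

```python
runs = [
    ("batch 1, no LMS", 45.3671, 1),
    ("batch 10, -lms 8192", 495.206, 10),
    ("batch 10, -lms 160000000", 263.516, 10),
]

baseline_ms = runs[0][1] / runs[0][2]
summary = {}
for name, avg_forward_ms, batch_size in runs:
    ms_per_image = avg_forward_ms / batch_size
    relative_throughput = baseline_ms / ms_per_image
    summary[name] = (ms_per_image, relative_throughput)
    print(f"{name}: {ms_per_image:.1f} ms/image, "
          f"{relative_throughput:.2f}x baseline throughput")
```

The aggressive -lms 8192 setting lands at roughly 0.92x of the baseline throughput, while -lms 160000000 reaches roughly 1.72x.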