Deep Learning for HW Engineers: LMS

Friday, June 12, 2020

A demo of TensorFlow Large Model Support (LMS) using tf_cnn_benchmarks.py

This post shows how to run a demo of large model support (LMS) with TensorFlow 1.15, which ships with IBM Watson Machine Learning Community Edition (WML-CE) 1.6.2.

** Please note that the environment below uses a P100 GPU in a virtualized environment provided by the IBM CECC cloud, so the training performance itself is on the low side.

The simplest approach is to use the tf_cnn_benchmarks suite included in WML-CE.

First, copy the tf_cnn_benchmarks suite to a directory of your choice with the command below. (This step is optional; you can also just go straight into the directory where it is installed.)

(wmlce_162) [cecuser@p1290-kvm1 ~]$ ./anaconda3/envs/wmlce_162/bin/install_tf_cnn_benchmarks .

(wmlce_162) [cecuser@p1290-kvm1 ~]$ cd tf_cnn_benchmarks

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ ls
all_reduce_benchmark.py cnn_util.py mlperf_test.py test_data
all_reduce_benchmark_test.py cnn_util_test.py models test_util.py
allreduce.py coco_metric.py platforms tf_cnn_benchmarks.py
allreduce_test.py constants.py preprocessing.py variable_mgr.py
batch_allreduce.py convnet_builder.py __pycache__ variable_mgr_util.py
benchmark_cnn_distributed_test.py datasets.py README.md variable_mgr_util_test.py
benchmark_cnn_distributed_test_runner.py flags.py run_tests.py
benchmark_cnn.py leading_indicators_test.py ssd_constants.py
benchmark_cnn_test.py mlperf.py ssd_dataloader.py

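However, part of the source code in benchmark_cnn.py has to be modified first. This is because of a bug related to lms_swapout_threshold, one of the LMS parameters defined in the source. Leaving the original value of -1 is supposed to trigger auto-tuning, but that step fails with an error because of compatibility problems (with the TF version, among other things), so for now we simply change it to 1.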

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ vi benchmark_cnn.py

...

#flags.DEFINE_integer('lms_swapout_threshold', -1,

flags.DEFINE_integer('lms_swapout_threshold', 1,

...
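If you would rather not edit the file by hand in vi, the same one-line change can be scripted. The following is a minimal sketch, assuming the flag definition in benchmark_cnn.py starts exactly as shown above; the helper name `patch_swapout_default` is hypothetical and not part of tf_cnn_benchmarks.

```python
import re

# Hypothetical helper (not part of tf_cnn_benchmarks): change the default of
# the lms_swapout_threshold flag from -1 to 1 in the text of benchmark_cnn.py.
FLAG_PATTERN = re.compile(
    r"(flags\.DEFINE_integer\('lms_swapout_threshold',\s*)-1(\s*,)")


def patch_swapout_default(source_text, new_default=1):
    """Return (patched_text, number_of_replacements)."""
    return FLAG_PATTERN.subn(
        lambda m: m.group(1) + str(new_default) + m.group(2), source_text)


if __name__ == "__main__":
    sample = "flags.DEFINE_integer('lms_swapout_threshold', -1,\n"
    patched, count = patch_swapout_default(sample)
    print(count, patched, end="")
```

To apply it for real, you would read benchmark_cnn.py, pass its contents through the function, and write the result back (after keeping a backup copy). A replacement count of 0 means the line did not match and nothing was changed.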

After that, run tf_cnn_benchmarks.py as shown below, here with batch_size set to 150. As you can see, it runs fine.

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=150 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10
...
I0612 04:14:43.269922 140736229690240 session_manager.py:502] Done running local_init_op.
Running warm up
2020-06-12 04:14:45.534222: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-12 04:14:45.728889: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step Img/sec total_loss
1 images/sec: 255.8 +/- 0.0 (jitter = 0.0) 7.820
10 images/sec: 255.8 +/- 0.1 (jitter = 0.4) 8.082
20 images/sec: 255.7 +/- 0.1 (jitter = 0.4) 7.856
30 images/sec: 255.6 +/- 0.1 (jitter = 0.3) 7.832
40 images/sec: 255.5 +/- 0.0 (jitter = 0.3) 7.879
50 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.701
60 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.918
70 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.845
80 images/sec: 255.4 +/- 0.0 (jitter = 0.2) 7.750
90 images/sec: 255.3 +/- 0.0 (jitter = 0.2) 7.806
100 images/sec: 255.3 +/- 0.0 (jitter = 0.3) 7.856
----------------------------------------------------------------
total images/sec: 255.22
----------------------------------------------------------------

If you watch the OS with the nmon tool at this point, host RAM usage is only about 5 GB, and the tf_cnn_benchmarks process has a size (the Size column) of only about 32 GB and a resident set size of only about 2.1 GB.

Memory and Swap
 PageSize:64KB   RAM-Memory   Swap-Space   High-Memory    Low-Memory
 Total (MB)         63392.2       4094.5   - not in use   - not in use
 Free  (MB)         58734.8       3518.5
 Free Percent         92.7%        85.9%

Top Processes  Procs=365  mode=3  1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Args
   PID  %CPU    Size     Res    Res    Res    Res  Shared   Faults   Command
        Used      KB     Set   Text   Data    Lib      KB  Min  Maj
 13671  42.1  32142m   2153m   3200  2866m      0  633216   12    0  tf_cnn_benchmar

Now let's raise batch_size to 200. This fills up the memory of the P100 GPU, which is only 16 GB, and the run ends with an Out-Of-Memory (OOM) error.

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=200 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10
...
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[200,2048,7,7] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node tower_0/v/cg/resnet_v113/conv46/conv2d/Conv2D (defined at /home/cecuser/anaconda3/envs/wmlce_162/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[main_fetch_group/_566]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[200,2048,7,7] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node tower_0/v/cg/resnet_v113/conv46/conv2d/Conv2D (defined at /home/cecuser/anaconda3/envs/wmlce_162/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
...
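To get a feel for the numbers in the OOM message, the size of the tensor that failed to allocate can be worked out from its shape. This is a rough sketch, assuming that "type float" in the message means 4-byte float32; the function name `tensor_bytes` is made up for illustration.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor; 4 bytes/element assumes float32."""
    return reduce(mul, shape, 1) * bytes_per_element


if __name__ == "__main__":
    # Shape taken from the OOM message above.
    size = tensor_bytes([200, 2048, 7, 7])
    print("%d bytes = %.1f MiB" % (size, size / 2**20))
```

This one tensor is small compared with 16 GB, so it is presumably just the allocation that happened to come last, after the rest of GPU memory had already been taken by the other tensors of the batch.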

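However, with the same batch_size, if you apply LMS by adding the --lms=True option, the run completes without error, although it is much slower (about 21 images/sec, compared with about 255 images/sec at batch_size 150 without LMS).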

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=200 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10 --lms=True
....
I0612 04:27:06.439511 140735558208384 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: -0.09 GiB (memory ratio: 0.9)
I0612 04:27:06.439677 140735558208384 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
I0612 04:27:06.440271 140735558208384 lms.py:1275] [LMS][0] LMS will use the latest parameter set found by Simulator for the best performance. However, if you encounter an out-of-memory error, please manually use the previous parameter set found by Simulator.
I0612 04:27:06.440359 140735558208384 lms.py:1275] [LMS][0] sync_mode: 3 (Synchronous memory copy between host and device)
I0612 04:27:06.440439 140735558208384 lms.py:1275] [LMS][0] swapout_threshold: 1
I0612 04:27:06.440520 140735558208384 lms.py:1275] [LMS][0] swapin_ahead: -1 (ignored since sync_mode is 3)
I0612 04:27:06.440600 140735558208384 lms.py:1275] [LMS][0] swapin_groupby: -1 (ignored since sync_mode is 3)
I0612 04:27:06.869183 140735558208384 lms.py:1275] [LMS][0] Added 425 operations to the model (180 swap-out operations (20.33 GiB) and 245 swap-in operations (31.36 GiB))
I0612 04:27:06.869335 140735558208384 lms.py:1275] [LMS][0] Editing model for LMS, took: 799.3814945220947 ms
...
Running warm up
2020-06-12 04:27:15.098435: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-12 04:27:15.371592: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step Img/sec total_loss
1 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.988
10 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.900
20 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.914
30 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 8.043
40 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.880
50 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.903
60 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.889
70 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.770
80 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.906
90 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.813
100 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.824
----------------------------------------------------------------
total images/sec: 21.13
----------------------------------------------------------------
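For a quick comparison of the two runs, the summary line can be pulled out of each log and turned into a ratio. This is a small sketch, assuming the "total images/sec" line has the format shown in the logs above; the function name is hypothetical. Keep in mind that the two runs use different batch sizes (150 without LMS, 200 with LMS), because batch_size 200 does not run at all without LMS.

```python
import re

TOTAL_RE = re.compile(r"^total images/sec:\s*([0-9.]+)\s*$", re.MULTILINE)


def total_images_per_sec(log_text):
    """Extract the 'total images/sec' figure from tf_cnn_benchmarks output."""
    match = TOTAL_RE.search(log_text)
    if match is None:
        raise ValueError("no 'total images/sec' line found")
    return float(match.group(1))


if __name__ == "__main__":
    # Values copied from the two runs above.
    no_lms = total_images_per_sec("total images/sec: 255.22\n")
    with_lms = total_images_per_sec("total images/sec: 21.13\n")
    print("slowdown with LMS: %.1fx" % (no_lms / with_lms))
```

With the figures from this post, the LMS run is roughly 12 times slower in images/sec.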

Watching the host OS at this point, host RAM usage has jumped to about 22 GB, and the tf_cnn_benchmarks process size (the Size column) has grown to about 50 GB, with a resident set size of about 19 GB.

Memory and Swap
 PageSize:64KB   RAM-Memory   Swap-Space   High-Memory    Low-Memory
 Total (MB)         63392.2       4094.5   - not in use   - not in use
 Free  (MB)         41505.6       3490.5
 Free Percent         65.5%        85.2%

Top Processes  Procs=365  mode=3  1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Args
   PID  %CPU    Size     Res    Res    Res    Res  Shared   Faults   Command
        Used      KB     Set   Text   Data    Lib      KB  Min  Maj
 13427  10.4  49577m  19322m   3200  3399m      0  176710    0    0  tf_cnn_benchmar
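The "about 5 GB" and "about 22 GB" figures for host RAM come from subtracting nmon's Free row from its Total row. A minimal sketch of that arithmetic, using the values from the two nmon captures in this post (the function name is just for illustration):

```python
def used_mb(total_mb, free_mb):
    """Used host RAM in MB, from nmon's Total and Free rows."""
    return total_mb - free_mb


TOTAL_MB = 63392.2  # RAM-Memory Total (MB), identical in both captures

without_lms = used_mb(TOTAL_MB, 58734.8)  # batch_size=150, no LMS
with_lms = used_mb(TOTAL_MB, 41505.6)     # batch_size=200, --lms=True

print("without LMS: %.1f MB used" % without_lms)
print("with LMS:    %.1f MB used" % with_lms)
print("increase:    %.1f MB" % (with_lms - without_lms))
```

Note that this is usage for the whole host, not only for the tf_cnn_benchmarks process, so it is only a rough indication of how much host memory LMS is using for swapped-out tensors.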

Wednesday, August 22, 2018

LMS test with TensorFlow in PowerAI 5.2

Unlike Caffe, TensorFlow is itself a Python library, so using LMS in TensorFlow requires some Python coding, and I posted a guide for that in an earlier post. This time, let's take example code that has LMS applied in that way and test the effect of LMS.

First, copy the High Performance Models provided with PowerAI to a location of your choice using the script below.

[bsyu@p57a22 ~]$ /opt/DL/tensorflow-performance-models/bin/tensorflow-install-models ~/models

The tf_cnn_benchmarks.py provided there can be used for a simple test as follows. When data_dir is not specified, this Python code trains on synthetic data, that is, randomly generated fake data. Here, image_size=6944 specifies an image size of 6944 * 6944 = about 48 megapixels. A color image takes 3 bytes per pixel, and a black-and-white image takes 1 byte per pixel. With image_size=6944 as below, each image is large, about 140 MB, so you get an out-of-memory error even with batch_size=1.

[bsyu@p57a22 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=1 --num_batches=30 --model=googlenet --num_gpus=1 --image_size=6944
TensorFlow: 1.8
Model: googlenet
Dataset: imagenet (synthetic)
Mode: training
...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,480,868,868] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: v0/tower_0/cg/incept_v10_1/concat = ConcatV2[N=4, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](v0/tower_0/cg/incept_v10_1/conv9/Relu, v0/tower_0/cg/incept_v10_1/conv11/Relu, v0/tower_0/cg/incept_v10_1/conv13/Relu, v0/tower_0/cg/incept_v10_1/conv14/Relu, v0/tower_0/cg/incept_v10_8/concat/axis)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
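The image-size figures quoted above are easy to check. The sketch below assumes an uncompressed square image with 3 bytes per pixel for color and 1 byte per pixel for black-and-white, as stated in the text; the function name is hypothetical.

```python
def image_bytes(side_pixels, bytes_per_pixel):
    """Uncompressed size in bytes of a square image."""
    return side_pixels * side_pixels * bytes_per_pixel


SIDE = 6944  # the --image_size value used above

print("pixels: %.1f megapixels" % (SIDE * SIDE / 1e6))
print("color:  %.1f MB" % (image_bytes(SIDE, 3) / 1e6))
print("b/w:    %.1f MB" % (image_bytes(SIDE, 1) / 1e6))
```

This gives about 48.2 megapixels and about 144.7 MB for a color image, in line with the "about 48 megapixel" and "about 140 MB" figures. Incidentally, the 868 x 868 spatial size of the tensor in the OOM message is 6944 / 8.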

It depends on the environment, but with the googlenet model, the largest image that 16 GB of GPU memory can handle is said to be roughly 5000^2, that is, about 25 megapixels. With LMS, however, even larger sizes can be processed without error.

[bsyu@p57a22 tf_cnn_benchmarks]$ time CUDA_VISIBLE_DEVICES=2 python tf_cnn_benchmarks.py --batch_size=1 --num_batches=30 --model=googlenet --num_gpus=1 --lms=True --image_size=8192
TensorFlow: 1.8
Model: googlenet
Dataset: imagenet (synthetic)
…
I0820 08:25:47.774425 70366577052288 tf_logging.py:116] [LMS][0] n_tensors: all tensors
I0820 08:25:47.774482 70366577052288 tf_logging.py:116] [LMS][0] lb: 1
I0820 08:25:49.216497 70366577052288 tf_logging.py:116] [LMS][0] Edited model is valid and logically equivalent to the original one
I0820 08:25:49.216679 70366577052288 tf_logging.py:116] [LMS][0] Added 227 ops into the model
I0820 08:25:49.216750 70366577052288 tf_logging.py:116] [LMS][0] Editing model for LMS, took: 1442.2016143798828 ms
I0820 08:25:49.216805 70366577052288 tf_logging.py:116] [LMS][0] 83 tensors will be swapped out(in) to(from) the host
…
----------------------------------------------------------------
total images/sec: 0.21
----------------------------------------------------------------
real 3m44.957s

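As the messages above show, TensorFlow LMS also has a few tunable parameters.

n_tensors :
The number of tensors that are swapped out to the host server's RAM and swapped back in when needed. The more there are, the less GPU memory is used, but training performance gets worse. The default is -1, which first treats all tensors as swap candidates and then, after an initial estimation, automatically settles on an appropriate value.

lb :
Short for Lower Bound; it is the number of tensors to swap in ahead of time. The larger it is, the better the performance, but the more GPU memory is used. The default is 1.

This tf_cnn_benchmarks.py also accepts those tuning parameters as command-line options. Let's try it with lb set to 3, as follows. It is indeed a little faster (0.23 images/sec versus 0.21).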

[bsyu@p57a22 tf_cnn_benchmarks]$ time CUDA_VISIBLE_DEVICES=2 python tf_cnn_benchmarks.py --batch_size=1 --num_batches=30 --model=googlenet --num_gpus=1 --lms=True --image_size=8192 --lms_lb=3
…
I0821 07:06:26.437339 70366510140032 tf_logging.py:116] [LMS][0] Editing model for LMS
I0821 07:06:26.437493 70366510140032 tf_logging.py:116] [LMS][0] n_tensors: all tensors
I0821 07:06:26.437549 70366510140032 tf_logging.py:116] [LMS][0] lb: 3
I0821 07:06:27.872007 70366510140032 tf_logging.py:116] [LMS][0] Edited model is valid and logically equivalent to the original one
I0821 07:06:27.872166 70366510140032 tf_logging.py:116] [LMS][0] Added 227 ops into the model
I0821 07:06:27.872235 70366510140032 tf_logging.py:116] [LMS][0] Editing model for LMS, took: 1434.621810913086 ms
I0821 07:06:27.872288 70366510140032 tf_logging.py:116] [LMS][0] 83 tensors will be swapped out(in) to(from) the host
…
----------------------------------------------------------------
total images/sec: 0.23
----------------------------------------------------------------
real 3m28.466s
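If you want to try several lb values, it helps to generate the command lines rather than retype them. The following is a minimal sketch that only builds the argument list and does not run anything; it uses only the options that appear in the commands above, and the function name `lms_benchmark_cmd` is hypothetical.

```python
import shlex


def lms_benchmark_cmd(image_size, lms_lb=None, batch_size=1, num_batches=30,
                      model="googlenet", num_gpus=1):
    """Build a tf_cnn_benchmarks.py command line with LMS enabled."""
    cmd = [
        "python", "tf_cnn_benchmarks.py",
        "--batch_size=%d" % batch_size,
        "--num_batches=%d" % num_batches,
        "--model=%s" % model,
        "--num_gpus=%d" % num_gpus,
        "--lms=True",
        "--image_size=%d" % image_size,
    ]
    if lms_lb is not None:
        # Only add the tuning option when a value is requested.
        cmd.append("--lms_lb=%d" % lms_lb)
    return cmd


if __name__ == "__main__":
    for lb in (None, 3):
        args = lms_benchmark_cmd(8192, lms_lb=lb)
        print(" ".join(shlex.quote(arg) for arg in args))
```

To actually launch a run, the list could be handed to subprocess.run from inside the tf_cnn_benchmarks directory, with CUDA_VISIBLE_DEVICES set in the environment as in the commands above.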

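For reference, even when I deliberately set n_tensors to a large value, LMS still handled only 83 tensors, and when I set lb to 5, I got an out-of-memory error even with LMS.

LMS test with caffe-ibm in PowerAI 5.2 (cifar10)

First, prepare the cifar10 dataset. Running get_cifar10.sh, which is normally included with caffe, downloads the dataset in one go. (As the log below shows, what is downloaded is the binary archive; in stock Caffe, the conversion to the lmdb format referenced by the prototxt is a separate step, examples/cifar10/create_cifar10.sh.)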

[bsyu@p57a22 caffe-ibm]$ pwd
/opt/DL/caffe-ibm

[bsyu@p57a22 caffe-ibm]$ ./data/cifar10/get_cifar10.sh
Downloading...
--2018-08-21 04:14:32-- http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
Resolving www.cs.toronto.edu (www.cs.toronto.edu)... 128.100.3.30
Connecting to www.cs.toronto.edu (www.cs.toronto.edu)|128.100.3.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170052171 (162M) [application/x-gzip]
Saving to: ‘cifar-10-binary.tar.gz’

100%[============================================================================>] 170,052,171 37.4MB/s in 4.6s

2018-08-21 04:14:37 (35.4 MB/s) - ‘cifar-10-binary.tar.gz’ saved [170052171/170052171]

Unzipping...
Done.

To see the benefit of LMS, the images need to be large, but cifar10 consists of 60,000 32x32 color images, which are very small. Instead, let's increase batch_size, the number of images processed on the GPU at once, to fill up the GPU memory.

[bsyu@p57a22 caffe-ibm]$ vi examples/cifar10/cifar10_full_train_test.prototxt
...
  data_param {
    source: "examples/cifar10/cifar10_train_lmdb"
    batch_size: 22000 # originally 100
    backend: LMDB
  }

[bsyu@p57a22 caffe-ibm]$ which caffe
/opt/DL/caffe-ibm/bin/caffe

Now run it. With batch_size: 22000, it still runs fine without LMS.

[bsyu@p57a22 caffe-ibm]$ caffe train -gpu 2 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
I0821 03:51:25.516746 52459 solver.cpp:497] Snapshotting to HDF5 file examples/cifar10/cifar10_full_iter_200.caffemodel.h5
I0821 03:51:28.164131 52459 sgd_solver.cpp:373] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_full_iter_200.solverstate.h5
I0821 03:51:28.753823 52459 solver.cpp:336] Iteration 200, loss = 1.71708
I0821 03:51:28.753847 52459 solver.cpp:341] Optimization Done.
I0821 03:51:28.753859 52459 caffe.cpp:421] Optimization Done.

If you check GPU memory usage with nvidia-smi at this point, you can see that the GPU memory is only just barely holding everything.

This time, increase to batch_size: 24000 and run the same training again. Now it fails with an error.

[bsyu@p57a22 caffe-ibm]$ caffe train -gpu 2 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
F0821 04:26:41.953693 60726 syncedmem.cpp:569] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x3fffa30acf8c (unknown)
@ 0x3fffa30afa0c (unknown)
@ 0x3fffa30ac9b4 (unknown)
@ 0x3fffa30b0634 (unknown)
@ 0x3fffaac7c154 caffe::SyncedMemory::get_gpu_ptr()
@ 0x3fffaac77650 caffe::SyncedMemory::mutable_gpu_data()
@ 0x3fffaaa8aa28 caffe::Blob<>::mutable_gpu_diff()
@ 0x3fffaaced94c caffe::PoolingLayer<>::Backward_gpu()
@ 0x3fffaac207a4 caffe::Net<>::BackwardFromTo()
@ 0x3fffaac20ad8 caffe::Net<>::Backward()
@ 0x3fffaac5aabc caffe::Solver<>::Step()
@ 0x3fffaac5b328 caffe::Solver<>::Solve()
@ 0x1000e9e4 train()
@ 0x1000a848 main
@ 0x3fff88155100 generic_start_main.isra.0
@ 0x3fff881552f4 __libc_start_main
@ (nil) (unknown)

This time, keep batch_size: 24000 but add the -lms option when training. Now LMS-related messages appear, and training completes without problems.

[bsyu@p57a22 caffe-ibm]$ caffe train -gpu 2 -lms --solver=examples/cifar10/cifar10_full_solver.prototxt
…
I0821 04:30:40.643229 75342 syncedmem.cpp:349] [LMS:2] allocate: size=786432008 [count=80 demanded=159% allocated=35% available=19%]
I0821 04:30:42.045287 75342 syncedmem.cpp:349] [LMS:2] allocate: size=786432008 [count=80 demanded=159% allocated=39% available=36%]
I0821 04:30:44.427603 75342 syncedmem.cpp:349] [LMS:2] allocate: size=3145728008 [count=80 demanded=159% allocated=53% available=61%]
I0821 04:30:44.605427 75342 solver.cpp:244] Iteration 0 (0 iter/s, 0s/200 iters), loss = 2.30264
I0821 04:30:44.605461 75342 solver.cpp:263] Train net output #0: loss = 2.30264 (* 1 = 2.30264 loss)
I0821 04:30:44.605484 75342 sgd_solver.cpp:128] Iteration 0, lr = 0.001
I0821 04:30:44.636401 75346 data_layer.cpp:86] Restarting data prefetching from start.
…
I0821 04:42:50.161826 75342 solver.cpp:497] Snapshotting to HDF5 file examples/cifar10/cifar10_full_iter_200.caffemodel.h5
I0821 04:42:50.164942 75342 sgd_solver.cpp:373] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_full_iter_200.solverstate.h5
I0821 04:42:50.908159 75342 solver.cpp:336] Iteration 200, loss = 1.72997
I0821 04:42:50.908186 75342 solver.cpp:341] Optimization Done.
I0821 04:42:50.908195 75342 caffe.cpp:421] Optimization Done.

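And thanks to LMS, GPU memory usage has dropped sharply.

However, using lms definitely makes things somewhat slower. Even with NVLink instead of PCIe, it is only natural that the host server's DRAM is slower than GPU memory. Still, lms can be tuned a little, mainly with the following two additional options.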

-lms_size_threshold <size in KB> : Default is 1000.
Memory chunks smaller than the size given here are not swapped out/in by LMS and stay resident in GPU memory.

-lms_exclude <size in MB> : Default is 0.
The GPU memory size minus this value is the soft limit on the GPU memory allocated for the LMS cache. The smaller this value, the more GPU memory is used and the better the performance.
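As a reading aid, the two rules described above can be written out as simple arithmetic. This is only a toy model of the descriptions in this post, not caffe-ibm's actual implementation; the function names, the 16 GB GPU size (taken here as 16 * 1024 MB), and the conversion 1 KB = 1024 bytes are all assumptions.

```python
def stays_resident(chunk_bytes, size_threshold_kb=1000):
    """True if a chunk is below the threshold and so is never swapped.

    Assumes 1 KB = 1024 bytes; the default mirrors -lms_size_threshold 1000.
    """
    return chunk_bytes < size_threshold_kb * 1024


def lms_soft_limit_mb(gpu_memory_mb, exclude_mb=0):
    """Soft limit for the LMS cache: GPU memory minus -lms_exclude."""
    return gpu_memory_mb - exclude_mb


if __name__ == "__main__":
    # Chunk size taken from an "allocate: size=..." line in the log above.
    print(stays_resident(786432008))
    # Assumed 16 GB GPU with -lms_exclude 1400.
    print(lms_soft_limit_mb(16 * 1024, exclude_mb=1400))
```

Under these assumptions, the roughly 786 MB chunk from the log is far above the 1000 KB threshold and is therefore eligible for swapping, and excluding 1400 MB leaves a soft limit of 14984 MB.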

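For example, the command options below are the ones that gave the best LMS performance when using the GoogleNet model on a high-resolution image dataset with crop size 2240x2240 and batch size 5. However, the ideal tuning values differ depending on the network architecture of the model, the data size, the batch_size, and the GPU memory size.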

$ caffe train -solver=solver.prototxt -gpu all -lms -lms_size_threshold 1000 -lms_exclude 1400

So how does it go with the actual cifar10 run? First, let's do non-LMS caffe training with the largest batch_size that works without lms. To process about as many images in the non-LMS run as the LMS run did with batch_size 24000 * max_iter 200 (4,800,000 images), use batch_size 22000 * max_iter 219 (4,818,000 images).
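The max_iter value of 219 comes from rounding up the number of iterations needed to cover the same image count with the smaller batch. A minimal sketch of that calculation (the function name is just for illustration):

```python
def matching_iterations(ref_batch, ref_iters, new_batch):
    """Smallest iteration count at new_batch that covers ref_batch * ref_iters images."""
    total_images = ref_batch * ref_iters
    return (total_images + new_batch - 1) // new_batch  # integer ceiling


if __name__ == "__main__":
    iters = matching_iterations(24000, 200, 22000)
    print("max_iter:", iters)
    print("images with LMS settings:    ", 24000 * 200)
    print("images with non-LMS settings:", 22000 * iters)
```

The non-LMS run therefore processes 18,000 more images than the LMS run, which is well under 1% and small enough for a rough timing comparison.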

[bsyu@p57a22 caffe-ibm]$ time caffe train -gpu 2 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
real 11m46.926s

Now let's process nearly the same number of images, batch_size 24000 * max_iter 200, with LMS and no extra tuning options. It is indeed a little slower.

[bsyu@p57a22 caffe-ibm]$ time caffe train -gpu 2 -lms --solver=examples/cifar10/cifar10_full_solver.prototxt
…
real 12m12.219s

This time, run the same training with -lms_size_threshold and -lms_exclude added. You can see that performance is now almost as fast as non-LMS.

[bsyu@p57a22 caffe-ibm]$ time caffe train -gpu 2 -lms -lms_size_threshold 1000 -lms_exclude 1400 --solver=examples/cifar10/cifar10_full_solver.prototxt
…
real 11m47.405s

You can also see that, even though a larger batch_size is used, the GPU memory usage that LMS had reduced has gone back up to nearly the non-LMS level.
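Because the non-LMS run processes slightly more images than the LMS runs, images per second is a fairer way to compare the three timings than wall-clock time alone. The sketch below parses the "real" values reported by time in this post; the function name is hypothetical, and since wall-clock time also includes startup and snapshotting, the results are only rough estimates.

```python
import re


def real_seconds(real_field):
    """Convert a `time` value such as '11m46.926s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", real_field.strip())
    if match is None:
        raise ValueError("unexpected time format: %r" % real_field)
    return int(match.group(1)) * 60 + float(match.group(2))


# (label, images processed, 'real' time) copied from the three runs above.
RUNS = [
    ("non-LMS, 22000 x 219", 22000 * 219, "11m46.926s"),
    ("LMS, 24000 x 200", 24000 * 200, "12m12.219s"),
    ("LMS tuned, 24000 x 200", 24000 * 200, "11m47.405s"),
]

for label, images, real in RUNS:
    print("%-24s %8.1f images/sec" % (label, images / real_seconds(real)))
```

On these numbers, untuned LMS comes out a few percent slower than non-LMS, and the tuned LMS run is within about half a percent of it.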

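Tuesday, August 21, 2018

MNIST Python code with LMS applied to Session-based TensorFlow training

This TensorFlow Python code for MNIST training is the original /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/examples/tutorials/mnist/mnist_deep.py with LMS applied. As you can see from the places where it works on the graph, this is Session-based TensorFlow training, and LMS is applied accordingly.

In the original post, the part that actually implements LMS was marked in bold blue; in the listing below, it is the block under the "# Enable Large Model Support" comment. As you can see, it is surprisingly simple. If you remove those lines, you are left with ordinary MNIST training code without LMS.

This example code is also simply a copy of /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/mnist_deep_lms.py, which comes with a PowerAI 5.2 installation.

When you actually run it, it behaves as follows, and you can see that 27 tensors are swapped out to and back in from the host server's RAM.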

[bsyu@p57a22 examples]$ CUDA_VISIBLE_DEVICES=3 python mnist_deep_lms.py

...

INFO:tensorflow:[LMS][0] Editing model for LMS
INFO:tensorflow:[LMS][0] n_tensors: all tensors
INFO:tensorflow:[LMS][0] lb: 3
INFO:tensorflow:[LMS][0] Edited model is valid and logically equivalent to the original one
INFO:tensorflow:[LMS][0] Added 53 ops into the model
INFO:tensorflow:[LMS][0] Editing model for LMS, took: 179.62193489074707 ms
INFO:tensorflow:[LMS][0] 27 tensors will be swapped out(in) to(from) the host
Saving graph to: /tmp/tmpz5ozc9jr
2018-08-20 21:37:47.954552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
....

step 19900, training accuracy 1
test accuracy 0.9918

[bsyu@p57a22 doc]$ cd /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/

[bsyu@p57a22 examples]$ source /opt/DL/tensorflow/bin/tensorflow-activate

[bsyu@p57a22 examples]$ cat /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/mnist_deep_lms.py

# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""A deep MNIST classifier using convolutional layers.

See extensive documentation at
https://www.tensorflow.org/get_started/mnist/pros
"""
# Disable linter warnings to maintain consistency with tutorial.
# pylint: disable=invalid-name
# pylint: disable=g-bad-import-order

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys
import tempfile

from tensorflow.examples.tutorials.mnist import input_data

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)
FLAGS = None

def deepnn(x):
  """deepnn builds the graph for a deep net for classifying digits.

  Args:
    x: an input tensor with the dimensions (N_examples, 784), where 784 is the
    number of pixels in a standard MNIST image.

  Returns:
    A tuple (y, keep_prob). y is a tensor of shape (N_examples, 10), with values
    equal to the logits of classifying the digit into one of 10 classes (the
    digits 0-9). keep_prob is a scalar placeholder for the probability of
    dropout.
  """
  # Reshape to use within a convolutional neural net.
  # Last dimension is for "features" - there is only one here, since images are
  # grayscale -- it would be 3 for an RGB image, 4 for RGBA, etc.
  with tf.name_scope('reshape'):
    x_image = tf.reshape(x, [-1, 28, 28, 1])

  # First convolutional layer - maps one grayscale image to 32 feature maps.
  with tf.name_scope('conv1'):
    W_conv1 = weight_variable([5, 5, 1, 32])
    b_conv1 = bias_variable([32])
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

  # Pooling layer - downsamples by 2X.
  with tf.name_scope('pool1'):
    h_pool1 = max_pool_2x2(h_conv1)

  # Second convolutional layer -- maps 32 feature maps to 64.
  with tf.name_scope('conv2'):
    W_conv2 = weight_variable([5, 5, 32, 64])
    b_conv2 = bias_variable([64])
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

  # Second pooling layer.
  with tf.name_scope('pool2'):
    h_pool2 = max_pool_2x2(h_conv2)

  # Fully connected layer 1 -- after 2 round of downsampling, our 28x28 image
  # is down to 7x7x64 feature maps -- maps this to 1024 features.
  with tf.name_scope('fc1'):
    W_fc1 = weight_variable([7 * 7 * 64, 1024])
    b_fc1 = bias_variable([1024])

    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

  # Dropout - controls the complexity of the model, prevents co-adaptation of
  # features.
  with tf.name_scope('dropout'):
    keep_prob = tf.placeholder(tf.float32)
    h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

  # Map the 1024 features to 10 classes, one for each digit
  with tf.name_scope('fc2'):
    W_fc2 = weight_variable([1024, 10])
    b_fc2 = bias_variable([10])

    y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
  return y_conv, keep_prob

def conv2d(x, W):
  """conv2d returns a 2d convolution layer with full stride."""
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  """max_pool_2x2 downsamples a feature map by 2X."""
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

def weight_variable(shape):
  """weight_variable generates a weight variable of a given shape."""
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  """bias_variable generates a bias variable of a given shape."""
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

def main(_):
  # Import data
  mnist = input_data.read_data_sets(FLAGS.data_dir)

  # Create the model
  x = tf.placeholder(tf.float32, [None, 784])

  # Define loss and optimizer
  y_ = tf.placeholder(tf.int64, [None])

  # Build the graph for the deep net
  y_conv, keep_prob = deepnn(x)

  with tf.name_scope('loss'):
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(
        labels=y_, logits=y_conv)
  cross_entropy = tf.reduce_mean(cross_entropy)

  with tf.name_scope('adam_optimizer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

  with tf.name_scope('accuracy'):
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), y_)
    correct_prediction = tf.cast(correct_prediction, tf.float32)
  accuracy = tf.reduce_mean(correct_prediction)

  # Enable Large Model Support
  from tensorflow.contrib.lms import LMS
  lms_model = LMS({'adam_optimizer'},
                  excl_scopes = {'loss', 'accuracy', 'dropout'},
                  lb=3)
  lms_model.run(tf.get_default_graph())

  graph_location = tempfile.mkdtemp()
  print('Saving graph to: %s' % graph_location)
  train_writer = tf.summary.FileWriter(graph_location)
  train_writer.add_graph(tf.get_default_graph())

  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
      batch = mnist.train.next_batch(50)
      if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict={
            x: batch[0], y_: batch[1], keep_prob: 1.0})
        print('step %d, training accuracy %g' % (i, train_accuracy))
      train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

    print('test accuracy %g' % accuracy.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument('--data_dir', type=str,
                      default='/tmp/tensorflow/mnist/input_data',
                      help='Directory for storing input data')
  FLAGS, unparsed = parser.parse_known_args()

  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
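Since the LMS part of the listing is only a few lines, it can be isolated into a small helper that also lets the same script run where the LMS module is missing. This is a sketch under assumptions: the helper name `try_enable_lms` is hypothetical, and the LMS constructor is called only with the arguments that appear in the listing above (a set of optimizer scopes, excl_scopes, and lb).

```python
def try_enable_lms(optimizer_scopes, excl_scopes=(), lb=1):
    """Apply LMS to the default graph if tensorflow.contrib.lms is importable.

    Returns True if LMS was applied, and False if the module is not available
    (for example on a TensorFlow build that does not include PowerAI's LMS).
    """
    try:
        import tensorflow as tf
        from tensorflow.contrib.lms import LMS
    except ImportError:
        return False
    # Same call pattern as in the listing above.
    lms_model = LMS(set(optimizer_scopes), excl_scopes=set(excl_scopes), lb=lb)
    lms_model.run(tf.get_default_graph())
    return True
```

In main(), the block under "# Enable Large Model Support" would then become a single call such as try_enable_lms({'adam_optimizer'}, excl_scopes={'loss', 'accuracy', 'dropout'}, lb=3), made after the graph has been built and before the Session is created.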

MNIST Python code with LMS applied to Estimator-based TensorFlow training

This TensorFlow Python code for MNIST training is the original /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/examples/tutorials/layers/cnn_mnist.py with LMS applied. As you can see, it uses a hook, so this is Estimator-based TensorFlow training, and LMS is applied accordingly.

There are also parts for checking the messages that show LMS at work and for handling permissions for multiple users, but those are unrelated to LMS itself; in the original post they were marked in red. The part that actually implements LMS was marked in bold blue. As you can see, it is surprisingly simple. If you remove those parts, you are left with ordinary MNIST training code without LMS.

This example code is also simply a copy of /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py, which comes with a PowerAI 5.2 installation.

When you actually run it, it behaves as follows, and you can see that 12 tensors are swapped out to and back in from the host server's RAM.

[bsyu@p57a22 ~]$ cd /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples

[bsyu@p57a22 examples]$ source /opt/DL/tensorflow/bin/tensorflow-activate

[bsyu@p57a22 examples]$ python cnn_mnist_lms.py
...
INFO:tensorflow:[LMS][1] Tensor sparse_softmax_cross_entropy_loss/Sum_1:0 will be placed on /cpu:0
INFO:tensorflow:[LMS][1] Operation: adam_optimizer/gradients/sparse_softmax_cross_entropy_loss/div_grad/RealDiv_1, order 23, type RealDiv
INFO:tensorflow:[LMS][1] Tensor adam_optimizer/gradients/sparse_softmax_cross_entropy_loss/div_grad/RealDiv_1:0 will be placed on /cpu:0
INFO:tensorflow:[LMS][1] Consuming op adam_optimizer/gradients/sparse_softmax_cross_entropy_loss/div_grad/RealDiv_2 (order 24) swaps in adam_optimizer/gradients/sparse_softmax_cross_entropy_loss/div_grad/RealDiv_1:0
INFO:tensorflow:[LMS][1] No control dependency op needed for swap in of op adam_optimizer/gradients/sparse_softmax_cross_entropy_loss/div_grad/RealDiv_1.
INFO:tensorflow:[LMS][0] Edited model is valid and logically equivalent to the original one
INFO:tensorflow:[LMS][0] Added 25 ops into the model
INFO:tensorflow:[LMS][0] Editing model for LMS, took: 88.58513832092285 ms
INFO:tensorflow:[LMS][0] 12 tensors will be swapped out(in) to(from) the host
INFO:tensorflow:Graph was finalized.
...
{'accuracy': 0.9702, 'loss': 0.098796368, 'global_step': 20000}

See below for the full code.

[bsyu@p57a22 examples]$ cat /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py

# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convolutional Neural Network Estimator for MNIST, built with tf.layers."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tempfile # Change not related to LMS
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)  # Not related to the LMS feature itself; set so that the LMS messages are shown.

def cnn_model_fn(features, labels, mode):
  """Model function for CNN."""
  # Input Layer
  # Reshape X to 4-D tensor: [batch_size, width, height, channels]
  # MNIST images are 28x28 pixels, and have one color channel
  input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])

  # Convolutional Layer #1
  # Computes 32 features using a 5x5 filter with ReLU activation.
  # Padding is added to preserve width and height.
  # Input Tensor Shape: [batch_size, 28, 28, 1]
  # Output Tensor Shape: [batch_size, 28, 28, 32]
  conv1 = tf.layers.conv2d(
      inputs=input_layer,
      filters=32,
      kernel_size=[5, 5],
      padding="same",
      activation=tf.nn.relu)

  # Pooling Layer #1
  # First max pooling layer with a 2x2 filter and stride of 2
  # Input Tensor Shape: [batch_size, 28, 28, 32]
  # Output Tensor Shape: [batch_size, 14, 14, 32]
  pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)

  # Convolutional Layer #2
  # Computes 64 features using a 5x5 filter.
  # Padding is added to preserve width and height.
  # Input Tensor Shape: [batch_size, 14, 14, 32]
  # Output Tensor Shape: [batch_size, 14, 14, 64]
  conv2 = tf.layers.conv2d(
      inputs=pool1,
      filters=64,
      kernel_size=[5, 5],
      padding="same",
      activation=tf.nn.relu)

  # Pooling Layer #2
  # Second max pooling layer with a 2x2 filter and stride of 2
  # Input Tensor Shape: [batch_size, 14, 14, 64]
  # Output Tensor Shape: [batch_size, 7, 7, 64]
  pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)

  # Flatten tensor into a batch of vectors
  # Input Tensor Shape: [batch_size, 7, 7, 64]
  # Output Tensor Shape: [batch_size, 7 * 7 * 64]
  pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])

  # Dense Layer
  # Densely connected layer with 1024 neurons
  # Input Tensor Shape: [batch_size, 7 * 7 * 64]
  # Output Tensor Shape: [batch_size, 1024]
  dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)

  # Add dropout operation; 0.6 probability that element will be kept
  dropout = tf.layers.dropout(
      inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)

  # Logits layer
  # Input Tensor Shape: [batch_size, 1024]
  # Output Tensor Shape: [batch_size, 10]
  logits = tf.layers.dense(inputs=dropout, units=10)

  predictions = {
      # Generate predictions (for PREDICT and EVAL mode)
      "classes": tf.argmax(input=logits, axis=1),
      # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
      # `logging_hook`.
      "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
  }
  if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

  # Calculate Loss (for both TRAIN and EVAL modes)
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

  # Configure the Training Op (for TRAIN mode)
  if mode == tf.estimator.ModeKeys.TRAIN:
    with tf.name_scope('adam_optimizer'):
      optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
      train_op = optimizer.minimize(
          loss=loss,
          global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

  # Add evaluation metrics (for EVAL mode)
  eval_metric_ops = {
      "accuracy": tf.metrics.accuracy(
          labels=labels, predictions=predictions["classes"])}
  return tf.estimator.EstimatorSpec(
      mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

def main(unused_argv):
    # Load training and eval data
    mnist = tf.contrib.learn.datasets.load_dataset("mnist")
    train_data = mnist.train.images  # Returns np.array
    train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
    eval_data = mnist.test.images  # Returns np.array
    eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)

    # The graph_location changes are not related to LMS enablement
    # but rather allow multiple users to run the example without
    # having permission issues on temp directories.
    graph_location = tempfile.mkdtemp()
    print('Saving graph to: %s' % graph_location)

    # Create the Estimator
    mnist_classifier = tf.estimator.Estimator(
        model_fn=cnn_model_fn, model_dir=graph_location)

    # Set up logging for predictions
    # Log the values in the "Softmax" tensor with label "probabilities"
    tensors_to_log = {"probabilities": "softmax_tensor"}
    logging_hook = tf.train.LoggingTensorHook(
        tensors=tensors_to_log, every_n_iter=50)

    # Train the model
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": train_data},
        y=train_labels,
        batch_size=100,
        num_epochs=None,
        shuffle=True)

    # Hook for Large Model Support
    from tensorflow.contrib.lms import LMSHook
    lms_hook = LMSHook({'adam_optimizer'}, lb=3, debug=True)

    mnist_classifier.train(
        input_fn=train_input_fn,
        steps=20000,
        hooks=[logging_hook, lms_hook])

    # Evaluate the model and print results
    eval_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": eval_data},
        y=eval_labels,
        num_epochs=1,
        shuffle=False)
    eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
    print(eval_results)

if __name__ == "__main__":
    tf.app.run()

Python coding guide for TensorFlow LMS (README-LMS.md)

TensorFlow 1.8, as shipped in PowerAI 5.2, integrates IBM's LMS (Large Model Support) feature. When GPU memory runs short, LMS lets the host server's RAM be used as if it were GPU memory, much as an operating system swaps pages in and out of swap space on disk when real memory runs short.

In caffe-ibm this is implemented as the -lms option, so it can be used without any coding. TensorFlow, however, is itself a Python library, so using LMS requires a little Python coding. There is no guide for this on the internet yet, but /opt/DL/tensorflow/doc/README-LMS.md, which is installed along with PowerAI 5.2, explains it in reasonable detail.

IBM PowerAI 5.2 is free software that customers who own a Minsky or AC922 server can order at no extra charge. (It does have to be ordered officially to be received.) For readers' convenience, that README-LMS.md is reproduced here.

The document below covers the details, but in short there are two main ways to train in TensorFlow, and accordingly two ways to enable LMS. Here is a summary of my own rough understanding.

1) Session-based training:
If the word graph appears in your existing TensorFlow Python code, insert the block below before the training part to enable LMS.

# Insert the following three lines (highlighted in blue in the original post)
from tensorflow.contrib.lms import LMS
lms_obj = LMS({'adam_optimizer'})
lms_obj.run(graph=tf.get_default_graph())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})

2) Estimator-based training:
If the word hook appears in your existing TensorFlow Python code, insert the block below to enable LMS.

# Insert the following two lines (highlighted in blue in the original post)
from tensorflow.contrib.lms import LMSHook
lms_hook = LMSHook({'adam_optimizer'})
...

mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook, lms_hook])  # add lms_hook to the hook list

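Two parameters matter most when tuning LMS. In practice you will rarely need to set n_tensors directly in code, while lb is adjusted occasionally.

n_tensors : the number of tensors that are swapped out to the host server's RAM and swapped back in when needed. The more tensors are swapped, the less GPU memory is used, but training performance suffers. The default is -1, which makes all reachable tensors swap candidates; as I understand it, TensorFlow LMS then works out a suitable value automatically after an initial estimation.

lb : short for Lower Bound. It controls how far ahead of its use a tensor is swapped back in: a tensor is swapped in at least lb nodes before it is needed in the backward phase. A larger value tends to improve performance but uses more GPU memory. The default is 1.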
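
The trade-off between the two parameters can be made concrete with a toy counting model. The sketch below is not how TFLMS actually schedules swaps. It assumes a hypothetical linear network in which each layer produces one activation tensor, the first `n_tensors` of them are swapped out, and a swapped tensor is back on the GPU once the backward pass is within `lb` layers of it. The function name `peak_resident_tensors` is made up for this illustration, and the model ignores the stall time that a small `lb` can cause.

```python
def peak_resident_tensors(num_layers, n_tensors=-1, lb=1):
    """Toy model: peak number of activation tensors on the GPU during backward."""
    # n_tensors < 0 means "swap every tensor", mirroring the -1 default.
    swapped = num_layers if n_tensors < 0 else min(n_tensors, num_layers)
    peak = 0
    # Backward pass visits layers from last to first.
    for k in range(num_layers - 1, -1, -1):
        resident = 0
        for j in range(k + 1):  # tensors of layers 0..k are still needed
            if j >= swapped:
                resident += 1   # never swapped: stays on the GPU
            elif k - j <= lb:
                resident += 1   # swapped, but already prefetched
        peak = max(peak, resident)
    return peak


print(peak_resident_tensors(10, n_tensors=0))         # 10 (LMS off)
print(peak_resident_tensors(10, n_tensors=-1, lb=1))  # 2
print(peak_resident_tensors(10, n_tensors=-1, lb=3))  # 4
print(peak_resident_tensors(10, n_tensors=5, lb=1))   # 5
```

In this simplified picture, swapping more tensors lowers the peak and a larger lb raises it again, which matches the direction of the trade-off described above.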

See README-LMS.md below for the full details.

[bsyu@p57a22 doc]$ vi /opt/DL/tensorflow/doc/README-LMS.md

# TFLMS: Graph Editing Library for Large Model Support (LMS) in TensorFlow

This library provides an approach to training large models that cannot be fit into GPU memory.
It takes a computational graph defined by users, and automatically adds swap-in and swap-out nodes for transferring tensors from GPUs to the host and vice versa.
The computational graph is statically modified. Hence, it needs to be done before a session actually starts.

## How to use
TFLMS needs to know some information about user-defined models.
There is one requirement for a user-defined model: it must have scopes for the optimizers/solvers.

Enabling LMS for a model depends on how users write their training. The following are guidelines for two ways: [Session](https://www.tensorflow.org/programmers_guide/graphs)-based training and [Estimator](https://www.tensorflow.org/programmers_guide/estimators)-based training.

### [Session](https://www.tensorflow.org/programmers_guide/graphs)-based training
#### Step 1: define optimizer/solver scopes
```python
with tf.name_scope('adam_optimizer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
```
#### Step 2: define an LMS object and run it
```python
from tensorflow.contrib.lms import LMS
lms_obj = LMS({'adam_optimizer'})
lms_obj.run(graph=tf.get_default_graph())
```
The above lines must be put before starting a training session, for example:
- Before inserting LMS code
```python
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
```
- After inserting LMS code
```python
from tensorflow.contrib.lms import LMS
lms_obj = LMS({'adam_optimizer'})
lms_obj.run(graph=tf.get_default_graph())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
```
For a working example of LMS integration with Session based training see:
`/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/contrib/lms/examples/mnist_deep_lms.py`
which is an LMS enabled version of `/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/examples/tutorials/mnist/mnist_deep.py`.

### [Estimator](https://www.tensorflow.org/programmers_guide/estimators)-based training
#### Step 1: define optimizer/solver scopes
```python
with tf.name_scope('adam_optimizer'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
    train_op = optimizer.minimize(
        loss=loss,
        global_step=tf.train.get_global_step())
```
#### Step 2: define an LMSHook (LMSHook and LMS share the same set of parameters)
```python
# Hook for Large Model Support
from tensorflow.contrib.lms import LMSHook
lms_hook = LMSHook({'adam_optimizer'})
```
#### Step 3: add the LMSHook into Estimator's hook list
```python
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook, lms_hook])
```

For a working example of LMS integration with Estimator based training see:
`/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py`
which is an LMS enabled version of `/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/examples/tutorials/layers/cnn_mnist.py`.

### High-Performance Models
A version of TensorFlow High-Performance Models which includes options to use Distributed Deep Learning is included in the tensorflow-performance-models package. These models also have integrated
Large Model Support. For more information, see:

`/opt/DL/tensorflow-performance-models/tf_cnn_benchmarks/README.md`

### GPU scaling tips

To achieve better scaling performance with LMS on multiple GPUs,
the training script should be updated to use PowerAI Distributed Deep Learning and
the training script should be run with the `ddlrun` command. Additionally,
if running on a single system without an Infiniband set up,
the `--mpiarg -pami_noib` parameter must be added to the ddlrun command
line, for example:

```bash
ddlrun --mpiarg -pami_noib -H host1 python tf_cnn_benchmarks.py --batch_size=512 --num_batches=100 --model=resnet50 --gpu_thread_mode=gpu_shared --num_gpus=1 --display_every=10 --lms=True --lms_lb=3 --lms_n_tensors=80 --variable_update=ddl
```

For more information on using ddlrun, see: `/opt/DL/ddl/doc/README.md` and `/opt/DL/ddl-tensorflow/doc/README.md`.

### Parameters for LMS/LMSHook
#### Required parameters
_graph_ :: the graph we will modify for LMS. This should be the graph of user-defined neural network. (not required in LMSHook)

_optimizer_scopes_ :: scopes for the optimizers/solvers.

#### Optional parameters
_starting_scope_ :: Tensors that are reachable from the operations in this scope will be swapped for LMS. Set this to the scope of the first layer if we would like to modify the whole graph. Default `None`.

_starting_op_names_ :: Tensors that are reachable from the operations with these names will be swapped for LMS. Default `None`.

_excl_scopes_ :: a set of scopes for operations whose tensors will not be swapped out to the host. Default `empty`.

_incl_scopes_ :: a set of scopes for operations whose tensors will be swapped out to the host. Default `empty`.

_excl_types_ :: a set of types for operations whose tensors will not be swapped out to the host. Default `empty`.

_incl_types_ :: a set of types for operations whose tensors will be swapped out to the host. Default `empty`.

_n_tensors_ :: The number of tensors for LMS, counting from the `starting_scope`. To turn off LMS, set `n_tensors` to `0`. Default `-1` (all reachable tensors will be swapped for LMS).

_lb_ :: Lowerbound value for LMS. A tensor will be swapped in during the backward phase at least `lb` nodes before it in the graph. Default `1`.

_ub_ :: Upperbound value for LMS. Default `10000`.

_fuse_swapins_ :: Fuse "close" swap-in operations into one operation. This may improve the performance. Default `False`.

_ctrld_strategy_ :: Two strategies to find control dependency ops for swap-in ops: `chain_rule` and `direct_order`. The `chain_rule` strategy starts from a forward operation, goes forward and finds a corresponding backward operation to be a control dependency operation. The `direct_order` strategy directly gets a backward operation in the topological order to be a control dependency operation. Both strategies depend on `lb` and `ub` to choose a control dependency operation. While `direct_order` is more exact than `chain_rule` in relation to `lb` and `ub`, it experimentally often results in a smaller maximum batch size than `chain_rule`. Default `chain_rule`.

_swap_branches_ :: If True, LMS will swap tensors in branches in the forward phase. Default `False`.

_branch_threshold_ :: If `swap_branches` is enabled and the topological-sort distance between the consuming operation and generating operation of a tensor is greater than `branch_threshold`, then swap the tensor. Default `0`.

_debug_ :: Debug mode for LMS. Default `False`.

_debug_level_ :: Debug level for LMS (1 or 2). Default `1`.
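
The `branch_threshold` parameter above is defined in terms of a topological-sort distance between the operation that generates a tensor and the operation that consumes it. The sketch below only illustrates that distance measure on a hypothetical five-op graph with one UNet-style skip connection. The op names and the helper functions are assumptions made for this example, and the code is not taken from TFLMS.

```python
from collections import deque


def topological_order(consumers):
    """Kahn's algorithm over a dict mapping each op to the ops that consume it."""
    indegree = {op: 0 for op in consumers}
    for outs in consumers.values():
        for op in outs:
            indegree[op] += 1
    ready = deque(op for op, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in consumers[op]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order


def long_edges(consumers, branch_threshold=0):
    """Edges whose topological-sort distance exceeds branch_threshold."""
    position = {op: i for i, op in enumerate(topological_order(consumers))}
    return [(src, dst)
            for src, outs in consumers.items()
            for dst in outs
            if position[dst] - position[src] > branch_threshold]


# Hypothetical graph: enc1 feeds enc2 and, through a skip connection, dec1.
graph = {
    "enc1": ["enc2", "dec1"],
    "enc2": ["bottleneck"],
    "bottleneck": ["dec2"],
    "dec2": ["dec1"],
    "dec1": [],
}

print(topological_order(graph))
# ['enc1', 'enc2', 'bottleneck', 'dec2', 'dec1']
print(long_edges(graph, branch_threshold=2))
# [('enc1', 'dec1')]
```

Only the skip connection spans more than two positions in the ordering, so in this toy view it is the tensor that stays idle on the GPU longest and is the natural candidate for swapping.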

### Performance Tuning LMS

Once you have enabled LMS graph modification in your code you will want to find the combination of tuning parameters that gives the fastest training time and best accuracy with your model. The goal of the performance tuning is to swap out enough tensors to allow your training to run without hitting out of memory errors, while not swapping too many such that the extra swapping communication overhead degrades performance.

The two tuning parameters you should focus on are `n_tensors` and `lb`. Since `n_tensors` controls the number of tensors that will be swapped, the higher this is set, the lower the peak GPU memory usage will be. The `lb` controls how soon the tensor is swapped back in before use. A low value of `lb` can make the training on the GPU pause and wait while the swap in finishes. This will degrade performance. A higher value of `lb` can allow the tensor swap in to finish before it's needed and allow training to run without pause. The downside to swapping in too early is that more tensors will be in GPU memory at any point in time, resulting in higher peak GPU memory usage.

The tuning thus becomes finding the correct balance between `n_tensors` and `lb` that provides the best performance for a given model. To start the performance tuning it's suggested that `n_tensors` be set to -1 which will swap all reachable tensors. The `lb` should be set to the default 1, which is the latest possible swap in. If `tf.logging` verbosity is set to `tf.logging.INFO`, LMS will output a log statement with a count of the number of tensors swapped. It is useful to run with `n_tensors=-1` for the first run to find this maximum value and then adjust it downward. If your model has branches like some UNet models do, you will likely want to set `swap_branches=True` and tune the branch threshold as well.
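
The procedure above can be sketched as a small search loop. This is only one possible way to organize the tuning, not something shipped with TFLMS. It assumes a hypothetical callable `try_run(n_tensors, lb)` that trains for a few batches, returns a throughput figure, and raises `MemoryError` when the GPU runs out of memory (in a real TensorFlow 1.x run an out-of-memory condition would more likely surface as `tf.errors.ResourceExhaustedError`). `max_tensors` stands for the swap count that LMS logs on a first run with `n_tensors=-1`. The `fake_run` function and its numbers are made up so that the sketch runs on its own.

```python
def tune_lms(try_run, max_tensors, lb_values=(1, 2, 4), step=10):
    """Return (throughput, n_tensors, lb) for the fastest setting that fits.

    try_run is a hypothetical callable: it trains briefly with the given
    n_tensors and lb, returns a throughput figure such as images/sec,
    and raises MemoryError when the GPU runs out of memory.
    """
    best = None
    # Start from the maximum swap count and work downward.
    for n_tensors in range(max_tensors, 0, -step):
        for lb in lb_values:
            try:
                throughput = try_run(n_tensors=n_tensors, lb=lb)
            except MemoryError:
                continue  # too little swapping or too much prefetch
            if best is None or throughput > best[0]:
                best = (throughput, n_tensors, lb)
    return best


def fake_run(n_tensors, lb):
    """Stand-in for a real training run, with made-up numbers."""
    if n_tensors < 40 or lb > 2:
        raise MemoryError("simulated GPU out-of-memory")
    return 1000.0 - 2.0 * n_tensors + 10.0 * lb


print(tune_lms(fake_run, max_tensors=80))  # (940.0, 40, 2)
```

With a real model, `try_run` would launch the training script with the chosen values and parse the reported throughput. The grid is deliberately coarse to keep the number of trial runs small.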

By default LMS will analyze your graph to find the starting operations to use for finding tensor swap candidates. You can bypass this analysis by placing your starting operations in a named scope and providing the scope on the `starting_scope` parameter, or by providing the names of the starting operations on the `starting_op_names` parameter. This can speed up repeated runs of LMS during tuning. Furthermore, you can enable `debug=True` and `debug_level=1` and LMS will print out the name and type of the starting operations it finds. These names could be passed in on the `starting_op_names` parameter on subsequent runs.

It is recommended that you start with tuning training on a single GPU before enabling your code for multi-GPU with DDL.

Friday, October 27, 2017

An explanation of the LMS feature in caffe-ibm

An earlier post on DDL (https://hwengineer.blogspot.kr/2017/10/caffe-ddl-alexnet-training.html) may raise the question of what happens if we trust LMS (large model support) and boldly raise batch_size to 2048 or 4096. The short answer is that even with LMS, batch_size cannot be increased without limit.

First, as that post explains, LMS is 'a feature that lets you run large models that previously could not run because of GPU memory limits'; it does not necessarily deliver faster performance. Even when data is fetched over NVLink, host server memory is still slower than GPU memory.

Separately, LMS cannot use host memory without limit either. According to what I heard from the lab, even with LMS the following must always reside in GPU memory.

- input tensor
- output tensor
- weight
- gradient (when training)

The memory these items occupy grows with batch_size, which ultimately sets a limit. The core idea of LMS is that, when a deep learning network is trained layer by layer, the layer currently being processed stays in GPU memory while layers that have already been processed, or will be processed later, can be kept in host memory. The largest neural network LMS can handle therefore depends on the size of the largest layer in that network.
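
A rough way to see this is to compare the sum of all layer footprints with the largest single footprint. The sketch below is a deliberately simplified model, not a description of how caffe-ibm accounts for memory: it assumes that without LMS every layer's data stays on the GPU, that with LMS only the layer being processed does, and it ignores weights, gradients, and workspace. The function name and the layer sizes are hypothetical.

```python
def peak_gpu_bytes(layer_bytes, lms=False):
    """Toy estimate of peak GPU memory for a list of per-layer footprints."""
    # Without LMS every layer's data stays resident; with LMS only the
    # layer being processed needs to be on the GPU in this simplification.
    return max(layer_bytes) if lms else sum(layer_bytes)


MB = 1024 ** 2
deep_net = [100 * MB] * 50   # many small layers (hypothetical)
wide_net = [2500 * MB] * 2   # few large layers (hypothetical)

for name, net in (("deep", deep_net), ("wide", wide_net)):
    print(name, peak_gpu_bytes(net) // MB, peak_gpu_bytes(net, lms=True) // MB)
# deep 5000 100
# wide 5000 2500
```

Both hypothetical networks need the same 5000 MB without LMS, but under this model the one made of many small layers shrinks far more once layers can be parked in host memory.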

For example, running 'caffe time' with the AlexNet deploy.prototxt from the earlier test creates 24 layers in total, such as data, conv1, and relu1, as shown below.

$ grep "Creating Layer" caffe_time.log
…
I1025 16:37:01.848961 29514 net.cpp:90] Creating Layer data
I1025 16:37:01.867213 29514 net.cpp:90] Creating Layer conv1
…
I1025 16:37:03.477823 29514 net.cpp:90] Creating Layer fc8
I1025 16:37:03.481210 29514 net.cpp:90] Creating Layer prob

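For each layer, the Top shape is then determined and "Memory required for data" is computed, as shown below. The reported figure is a running total, so it is small for the first layers but becomes very large later on.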

…
I1025 16:37:01.867137 29514 net.cpp:135] Top shape: 10 3 1600 1200 (57600000)
I1025 16:37:01.867183 29514 net.cpp:143] Memory required for data: 230400000
…
I1025 16:37:02.231456 29514 net.cpp:135] Top shape: 10 96 398 298 (113859840)
I1025 16:37:02.231468 29514 net.cpp:143] Memory required for data: 685839360
…
I1025 16:37:02.246103 29514 net.cpp:135] Top shape: 10 256 49 37 (4641280)
I1025 16:37:02.246112 29514 net.cpp:143] Memory required for data: 3315185920
…
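
The first two entries in the log above can be checked by hand. The sketch below assumes 4 bytes per value (32-bit floats) and treats "Memory required for data" as a running total over the top blobs created so far, which is consistent with the logged numbers. The third entry is separated from the others by elided lines, so it is left out.

```python
from itertools import accumulate
from math import prod

BYTES_PER_VALUE = 4  # assumes 32-bit floats

# Top shapes of the first two layers in the log (data, then conv1).
top_shapes = [(10, 3, 1600, 1200), (10, 96, 398, 298)]

counts = [prod(shape) for shape in top_shapes]
running_bytes = [total * BYTES_PER_VALUE for total in accumulate(counts)]

for shape, count, total in zip(top_shapes, counts, running_bytes):
    print(shape, count, total)
# (10, 3, 1600, 1200) 57600000 230400000
# (10, 96, 398, 298) 113859840 685839360
```

The leading 10 in each shape is the batch size, so every one of these figures scales linearly with batch_size.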

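Because of these characteristics, a deep neural network (one with many layers) benefits from LMS far more than a wide one (one that needs a lot of memory per layer).

The real advantages of LMS can be summarized as follows.

1) In width, models roughly 10-30 times larger can be used
2) In depth, arbitrarily deep models can in principle be used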

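LMS is said to be particularly useful for RNNs, which by nature have many layers. Neural networks are also said to be growing deeper rather than wider. AlexNet (2012), for example, has 8 learned layers, whereas ResNet, published about three years later, was trained with 152 layers on ImageNet and experimentally with more than 1,000 layers. The range of problems where LMS applies can therefore be expected to keep widening.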