Deep Learning for HW Engineers: June 2020

This post shows how to run a demo of large model support (LMS) using TensorFlow 1.15, which is included in IBM Watson Machine Learning Community Edition (WML-CE) 1.6.2.

** Please keep in mind that the environment below uses a P100 GPU in a virtualized environment provided by the IBM CECC cloud, so the training performance itself is on the low side.

The simplest approach is to use the tf_cnn_benchmarks suite that ships with WML-CE.

First, copy the tf_cnn_benchmarks suite into a directory of your choice with the command below. (This step is optional; you can also go straight to the directory where the suite is installed and work there.)

(wmlce_162) [cecuser@p1290-kvm1 ~]$ ./anaconda3/envs/wmlce_162/bin/install_tf_cnn_benchmarks .

(wmlce_162) [cecuser@p1290-kvm1 ~]$ cd tf_cnn_benchmarks

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ ls
all_reduce_benchmark.py cnn_util.py mlperf_test.py test_data
all_reduce_benchmark_test.py cnn_util_test.py models test_util.py
allreduce.py coco_metric.py platforms tf_cnn_benchmarks.py
allreduce_test.py constants.py preprocessing.py variable_mgr.py
batch_allreduce.py convnet_builder.py __pycache__ variable_mgr_util.py
benchmark_cnn_distributed_test.py datasets.py README.md variable_mgr_util_test.py
benchmark_cnn_distributed_test_runner.py flags.py run_tests.py
benchmark_cnn.py leading_indicators_test.py ssd_constants.py
benchmark_cnn_test.py mlperf.py ssd_dataloader.py

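Before running it, however, part of the source code in benchmark_cnn.py has to be modified. The reason is a bug around lms_swapout_threshold, one of the LMS-related parameters in that file. Leaving the default value of -1 in place is supposed to enable auto-tuning, but that path fails with an error because of compatibility problems with the TF version and the like, so for now simply change the default to 1.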

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ vi benchmark_cnn.py

...

#flags.DEFINE_integer('lms_swapout_threshold', -1,

flags.DEFINE_integer('lms_swapout_threshold', 1,

...
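The same one-line change can also be scripted instead of made by hand in vi. The following is a minimal sketch, assuming the flag definition in benchmark_cnn.py is formatted exactly as in the excerpt above; the helper name `patch_swapout_default` is hypothetical and not part of WML-CE.

```python
import re

# Matches the flag definition and isolates the default value (-1 in the
# shipped file). Assumes the formatting shown in the excerpt above.
_PATTERN = re.compile(
    r"(flags\.DEFINE_integer\(\s*'lms_swapout_threshold',\s*)-1(\s*,)"
)


def patch_swapout_default(source: str, new_default: int = 1) -> str:
    """Return `source` with the lms_swapout_threshold default replaced."""
    patched, count = _PATTERN.subn(
        lambda m: f"{m.group(1)}{new_default}{m.group(2)}", source
    )
    if count != 1:
        # Refuse to guess if the file does not look as expected.
        raise ValueError(f"expected exactly one match, found {count}")
    return patched


if __name__ == "__main__":
    sample = "flags.DEFINE_integer('lms_swapout_threshold', -1,\n"
    print(patch_swapout_default(sample), end="")
```

To apply it to the real file, read benchmark_cnn.py into a string, pass it through the function, and write the result back, keeping a backup copy first. Running it a second time on an already patched file raises an error rather than silently doing nothing.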

After that, run tf_cnn_benchmarks.py as shown below, here with batch_size set to 150. As you can see, the run completes without problems.

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=150 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10
...
I0612 04:14:43.269922 140736229690240 session_manager.py:502] Done running local_init_op.
Running warm up
2020-06-12 04:14:45.534222: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-12 04:14:45.728889: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step Img/sec total_loss
1 images/sec: 255.8 +/- 0.0 (jitter = 0.0) 7.820
10 images/sec: 255.8 +/- 0.1 (jitter = 0.4) 8.082
20 images/sec: 255.7 +/- 0.1 (jitter = 0.4) 7.856
30 images/sec: 255.6 +/- 0.1 (jitter = 0.3) 7.832
40 images/sec: 255.5 +/- 0.0 (jitter = 0.3) 7.879
50 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.701
60 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.918
70 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.845
80 images/sec: 255.4 +/- 0.0 (jitter = 0.2) 7.750
90 images/sec: 255.3 +/- 0.0 (jitter = 0.2) 7.806
100 images/sec: 255.3 +/- 0.0 (jitter = 0.3) 7.856
----------------------------------------------------------------
total images/sec: 255.22
----------------------------------------------------------------
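Since the rest of this post repeats the same invocation with a different batch size and with LMS turned on, it can be convenient to build the command line in one place. This is only a sketch: the function name `build_benchmark_cmd` is hypothetical, and the flags are limited to the ones that actually appear in the commands in this post (including `--lms=True`, which is used further down).

```python
from typing import List


def build_benchmark_cmd(batch_size: int, lms: bool = False,
                        num_batches: int = 100, model: str = "resnet50",
                        num_gpus: int = 1,
                        display_every: int = 10) -> List[str]:
    """Build the tf_cnn_benchmarks.py argument list used in this post."""
    cmd = [
        "python", "tf_cnn_benchmarks.py",
        f"--batch_size={batch_size}",
        f"--num_batches={num_batches}",
        f"--model={model}",
        f"--num_gpus={num_gpus}",
        f"--display_every={display_every}",
    ]
    if lms:
        # Flag as used in the LMS run later in this post.
        cmd.append("--lms=True")
    return cmd


if __name__ == "__main__":
    # Print the command for the batch_size=150 run shown above.
    print(" ".join(build_benchmark_cmd(150)))
```

The resulting list can be handed to `subprocess.run(..., cwd="tf_cnn_benchmarks", check=True)` from inside the activated wmlce_162 environment.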

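If you watch the OS with the nmon tool during this run, you can see that host RAM usage is only about 5 GB, and that the tf_cnn_benchmarks process has a total size (the Size column) of only about 32 GB and a resident set size (Res Set) of only about 2.1 GB.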

Memory and Swap
PageSize:64KB   RAM-Memory   Swap-Space   High-Memory    Low-Memory
Total (MB)         63392.2       4094.5   - not in use   - not in use
Free  (MB)         58734.8       3518.5
Free Percent         92.7%        85.9%

Top Processes Procs=365-mode=3-1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Args
  PID  %CPU    Size     Res    Res    Res   Res  Shared  Faults    Command
       Used      KB     Set   Text   Data   Lib      KB  Min  Maj
13671  42.1  32142m   2153m   3200  2866m     0  633216   12    0  tf_cnn_benchmar
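The figures quoted above can be read off the nmon capture mechanically. The sketch below parses the process row into named columns; the column names are my own labels for the two-line nmon header, and the unit rule (a trailing "m" means MB, a bare number means KB) is an assumption based on the "KB" header, not something the capture states explicitly.

```python
# Labels for the columns of the nmon "Top Processes" row (hypothetical names).
COLUMNS = ["pid", "cpu_pct", "size", "res_set", "res_text", "res_data",
           "res_lib", "shared", "faults_min", "faults_maj", "command"]
SIZE_COLUMNS = {"size", "res_set", "res_text", "res_data", "res_lib", "shared"}


def nmon_size_to_mb(field: str) -> float:
    """Assumption: 'NNNNm' is already in MB, a bare number is in KB."""
    if field.endswith("m"):
        return float(field[:-1])
    return float(field) / 1024.0


def parse_process_row(row: str) -> dict:
    # Drop stray 'x' border characters left over from the terminal capture.
    fields = [f for f in row.split() if f != "x"]
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    parsed = dict(zip(COLUMNS, fields))
    for name in SIZE_COLUMNS:
        parsed[name] = nmon_size_to_mb(parsed[name])
    return parsed


row = parse_process_row(
    "13671 42.1 32142m 2153m 3200 2866m 0 633216 12 0 tf_cnn_benchmar"
)
used_ram_mb = 63392.2 - 58734.8  # Total (MB) minus Free (MB)

if __name__ == "__main__":
    print(f"size={row['size']:.0f} MB, res_set={row['res_set']:.0f} MB, "
          f"used RAM={used_ram_mb:.1f} MB")
```

Used RAM comes out at roughly 4,657 MB, which is where the "about 5 GB" figure comes from.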

Now let's increase batch_size to 200. This fills up the memory of the P100 GPU, which is only 16 GB, and the run ends with an out-of-memory (OOM) error.

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=200 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10
...
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[200,2048,7,7] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node tower_0/v/cg/resnet_v113/conv46/conv2d/Conv2D (defined at /home/cecuser/anaconda3/envs/wmlce_162/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[main_fetch_group/_566]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[200,2048,7,7] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node tower_0/v/cg/resnet_v113/conv46/conv2d/Conv2D (defined at /home/cecuser/anaconda3/envs/wmlce_162/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
...
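It is worth checking how large the tensor named in the error actually is. The sketch below computes the size of a tensor of shape [200, 2048, 7, 7] with 4-byte elements (the "type float" in the message is 32-bit float); the helper name `tensor_bytes` is hypothetical.

```python
from functools import reduce
from operator import mul
from typing import Sequence


def tensor_bytes(shape: Sequence[int], bytes_per_element: int = 4) -> int:
    """Size in bytes of a dense tensor with the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


size = tensor_bytes([200, 2048, 7, 7])

if __name__ == "__main__":
    print(f"{size} bytes = {size / 2**20:.4f} MiB")
```

The result is about 77 MiB, a tiny fraction of 16 GB. The tensor in the message is therefore just the allocation that happened to fail once GPU memory was already full of other tensors, not the cause of the memory pressure by itself.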

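With the same batch_size, however, enabling LMS through the --lms=True option lets the run complete without error, although much more slowly (21.13 images/sec, versus 255.22 images/sec in the batch_size=150 run without LMS).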

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=200 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10 --lms=True
....
I0612 04:27:06.439511 140735558208384 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: -0.09 GiB (memory ratio: 0.9)
I0612 04:27:06.439677 140735558208384 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
I0612 04:27:06.440271 140735558208384 lms.py:1275] [LMS][0] LMS will use the latest parameter set found by Simulator for the best performance. However, if you encounter an out-of-memory error, please manually use the previous parameter set found by Simulator.
I0612 04:27:06.440359 140735558208384 lms.py:1275] [LMS][0] sync_mode: 3 (Synchronous memory copy between host and device)
I0612 04:27:06.440439 140735558208384 lms.py:1275] [LMS][0] swapout_threshold: 1
I0612 04:27:06.440520 140735558208384 lms.py:1275] [LMS][0] swapin_ahead: -1 (ignored since sync_mode is 3)
I0612 04:27:06.440600 140735558208384 lms.py:1275] [LMS][0] swapin_groupby: -1 (ignored since sync_mode is 3)
I0612 04:27:06.869183 140735558208384 lms.py:1275] [LMS][0] Added 425 operations to the model (180 swap-out operations (20.33 GiB) and 245 swap-in operations (31.36 GiB))
I0612 04:27:06.869335 140735558208384 lms.py:1275] [LMS][0] Editing model for LMS, took: 799.3814945220947 ms
...
Running warm up
2020-06-12 04:27:15.098435: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-12 04:27:15.371592: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step Img/sec total_loss
1 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.988
10 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.900
20 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.914
30 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 8.043
40 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.880
50 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.903
60 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.889
70 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.770
80 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.906
90 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.813
100 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.824
----------------------------------------------------------------
total images/sec: 21.13
----------------------------------------------------------------
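To put a number on "much slower", the totals can be pulled out of the two logs and compared. This is a rough sketch with hypothetical helper names. Note that the two runs use different batch sizes (150 without LMS, 200 with LMS), so the ratio is indicative rather than a like-for-like comparison. The last figure also assumes that the swap volumes in the LMS log line are moved once per training step, which the log itself does not state.

```python
import re

TOTAL_RE = re.compile(r"total images/sec:\s*([0-9.]+)")
SWAP_RE = re.compile(
    r"swap-out operations \(([0-9.]+) GiB\) and \d+ swap-in operations "
    r"\(([0-9.]+) GiB\)"
)


def total_images_per_sec(log_text: str) -> float:
    """Extract the 'total images/sec' summary value from benchmark output."""
    match = TOTAL_RE.search(log_text)
    if match is None:
        raise ValueError("no 'total images/sec' line found")
    return float(match.group(1))


def seconds_per_step(batch_size: int, images_per_sec: float) -> float:
    return batch_size / images_per_sec


baseline = total_images_per_sec("total images/sec: 255.22")  # batch_size=150
with_lms = total_images_per_sec("total images/sec: 21.13")   # batch_size=200
slowdown = baseline / with_lms
step_no_lms = seconds_per_step(150, baseline)
step_lms = seconds_per_step(200, with_lms)

lms_line = ("Added 425 operations to the model (180 swap-out operations "
            "(20.33 GiB) and 245 swap-in operations (31.36 GiB))")
swap_out_gib, swap_in_gib = map(float, SWAP_RE.search(lms_line).groups())
# Assumption: this volume is transferred once per step.
gib_per_sec = (swap_out_gib + swap_in_gib) / step_lms

if __name__ == "__main__":
    print(f"slowdown {slowdown:.2f}x, step {step_no_lms:.2f}s -> {step_lms:.2f}s")
    print(f"swap traffic {swap_out_gib + swap_in_gib:.2f} GiB, "
          f"about {gib_per_sec:.1f} GiB/s if moved once per step")
```

Throughput drops by roughly a factor of 12, and a single step goes from well under a second to more than nine seconds, which fits a run that spends most of its time on synchronous host-device copies (sync_mode: 3 in the log above).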

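If you watch the host OS during this run, you can see that host RAM usage has jumped to about 22 GB, and that the tf_cnn_benchmarks process has grown considerably as well, to a total size of about 50 GB and a resident set size of about 19 GB.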

Memory and Swap
PageSize:64KB   RAM-Memory   Swap-Space   High-Memory    Low-Memory
Total (MB)         63392.2       4094.5   - not in use   - not in use
Free  (MB)         41505.6       3490.5
Free Percent         65.5%        85.2%

Top Processes Procs=365-mode=3-1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Args
  PID  %CPU    Size     Res    Res    Res   Res  Shared  Faults    Command
       Used      KB     Set   Text   Data   Lib      KB  Min  Maj
13427  10.4  49577m  19322m   3200  3399m     0  176710    0    0  tf_cnn_benchmar
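Putting the two nmon captures side by side makes the effect of LMS on the host easier to see. The sketch below only does arithmetic on the numbers already shown in this post; the `HostSnapshot` class is a hypothetical container introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSnapshot:
    total_mb: float  # "Total (MB)" under RAM-Memory
    free_mb: float   # "Free (MB)" under RAM-Memory
    rss_mb: float    # "Res Set" of the tf_cnn_benchmarks process

    @property
    def used_mb(self) -> float:
        return self.total_mb - self.free_mb


no_lms = HostSnapshot(total_mb=63392.2, free_mb=58734.8, rss_mb=2153)
lms_run = HostSnapshot(total_mb=63392.2, free_mb=41505.6, rss_mb=19322)

delta_used_mb = lms_run.used_mb - no_lms.used_mb
delta_rss_mb = lms_run.rss_mb - no_lms.rss_mb

if __name__ == "__main__":
    print(f"host RAM used: {no_lms.used_mb:.0f} MB -> {lms_run.used_mb:.0f} MB "
          f"(+{delta_used_mb:.0f} MB)")
    print(f"process Res Set: +{delta_rss_mb:.0f} MB")
```

Both host RAM usage and the resident set of the benchmark process grow by roughly 17 GB, which is consistent with LMS keeping swapped-out tensors in host memory while the GPU works on other parts of the graph.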


Friday, June 12, 2020

A demo of TensorFlow Large Model Support (LMS) using tf_cnn_benchmarks.py