HW 엔지니어를 위한 Deep Learning: benchmark

2018년 9월 10일 월요일

ILSVRC2012 dataset 중 강아지 사진만을 이용한 짧은 TF inception v3 테스트 방법

다음과 같이 benchmark/tensorflow 디렉토리에 들어가서, exec_img.sh를 수행하시면 됩니다. 이때 아래와 같이 nohup으로 수행하시면 도중에 연결 세션이 끊어져도 백그라운드로 job은 계속 수행될 뿐만 아니라, 수행 기록이 nohup.out에도 기록되므로 편리하실 것입니다.

[root@ac922 tensorflow]# pwd
/home/files/ilsvrc12/tensorflow

[root@ac922 tensorflow]# ls
benchmark exec_img.sh models nohup.out.final.tf output_final

[root@ac922 tensorflow]# nohup ./exec_img.sh &

위와 같이 exec_img.sh를 한번 수행하시면 그 속에서 아래와 같이 ./models/run_model.sh 스크립트가 batch_size=128로 순차적으로 GPU 개수 4, 2, 1에 대해서 각각 1번씩 총 3번을 수행합니다. 각 수행이 끝날 때마다 time 명령에 의해 수행 시간에 걸린 시간이 nohup.out에 기록됩니다. 원래 NVIDIA에서 준 script를 수행해보니, 매번 exec을 수행할 때마다 output directory를 새로 만들어야 제대로 수행되는 것 같아 아래와 같이 exec 수행시마다 ouput directory를 다른 이름으로 옮기고 새로 output을 만드는 문장을 추가했습니다.

mkdir output
time exec tf inception3 128 4 0
mv output output4gpu
mkdir output
time exec tf inception3 128 2 0
mv output output2gpu
mkdir output
time exec tf inception3 128 1 0
mv output output1gpu

결과 확인은 ouput directory에 쌓이는 아래의 log를 보셔도 되고, nohup.out을 보셔도 됩니다. 이 script에서는 total images/sec이 python 자체적으로 합산되어 표시되므로 그것을 기록하시면 됩니다. 단, python에서 계산되는 Elapsed Time은 일부 로직이 잘못되어 분:초 단위만 맞고 시간 단위는 9시간으로 나오니 그건 무시하십시요.

이 테스트를 위해 필요한 python code 및 model file을 아래 google drive에 올려 놓았습니다.

https://drive.google.com/open?id=1DNn-Nv4rlOiv2NLqk6Y0j2ANlJjw9VP6

그리고 이 테스트를 위해 필요한 종류별로 labeling된 강아지 사진을 tfrecord 포맷으로 만든 dataset을 아래 google drive에 올려 놓았습니다.

https://drive.google.com/open?id=1rQcxAWeNbByy0Yooj6IbROyVRsdQPn5-

위 dataset을 추출하고 tfrecord로 포맷하는 과정은 아래에 정리되어 있습니다.

http://hwengineer.blogspot.com/2018/04/ilsvrc2012imgtraint3tar-training-dataset.html

** 별첨 : tfrecord file들의 이름과 size

[root@ac922 ilsvrc12]# cd tfrecord/

[root@ac922 tfrecord]# ls -l | head
total 1509860
-rw-rw-r--. 1 1001 1001 6920780 Apr 11 19:20 train-00000-of-00120
-rw-rw-r--. 1 1001 1001 6422535 Apr 11 19:20 train-00001-of-00120
-rw-rw-r--. 1 1001 1001 6959007 Apr 11 19:21 train-00002-of-00120
-rw-rw-r--. 1 1001 1001 6885268 Apr 11 19:21 train-00003-of-00120
-rw-rw-r--. 1 1001 1001 5969364 Apr 11 19:21 train-00004-of-00120
-rw-rw-r--. 1 1001 1001 6143260 Apr 11 19:21 train-00005-of-00120
-rw-rw-r--. 1 1001 1001 6123517 Apr 11 19:21 train-00006-of-00120
-rw-rw-r--. 1 1001 1001 8585788 Apr 11 19:21 train-00007-of-00120
-rw-rw-r--. 1 1001 1001 6149957 Apr 11 19:21 train-00008-of-00120

[root@ac922 tfrecord]# ls -l | tail
-rw-rw-r--. 1 1001 1001 24124729 Apr 11 19:20 validation-00022-of-00032
-rw-rw-r--. 1 1001 1001 23741822 Apr 11 19:20 validation-00023-of-00032
-rw-rw-r--. 1 1001 1001 24759230 Apr 11 19:20 validation-00024-of-00032
-rw-rw-r--. 1 1001 1001 25225023 Apr 11 19:20 validation-00025-of-00032
-rw-rw-r--. 1 1001 1001 25273559 Apr 11 19:20 validation-00026-of-00032
-rw-rw-r--. 1 1001 1001 26820464 Apr 11 19:20 validation-00027-of-00032
-rw-rw-r--. 1 1001 1001 24115323 Apr 11 19:20 validation-00028-of-00032
-rw-rw-r--. 1 1001 1001 24459085 Apr 11 19:20 validation-00029-of-00032
-rw-rw-r--. 1 1001 1001 25246485 Apr 11 19:20 validation-00030-of-00032
-rw-rw-r--. 1 1001 1001 23561132 Apr 11 19:20 validation-00031-of-00032

[root@ac922 tfrecord]# du -sm .
1475 .

2018년 5월 10일 목요일

RNN PTB benchmark 수행방법

CNN은 Image classification이나 object detection 같은 정적인 image 처리에 많이 사용됩니다. 그러나 기계 번역(machine translation)이나 동영상 captioning 등을 deep learning으로 처리할 때는 시계열(time-series) 분석 등을 통해 미래를 예측하는 것이 필요합니다. 여기에는 CNN 대신 LSTM과 같은 RNN을 사용합니다.

문제는 CNN과는 달리, RNN/LSTM은 그 본질상 data history를 참조해야 하므로 메모리 사용량이 많다는 점입니다. 당연히 시스템 대역폭이 전체 시스템 성능에 영향을 끼치게 됩니다.

RNN 관련 가장 일반적인 벤치마크는 tensorflow models에 포함되어 있는 language modeling이며, 이는 영어 단어 모음인 PTB dataset을 이용합니다. 이것을 이용하여 적절한 성능 벤치마크를 해볼 수 있습니다. 먼저, python3에 tensorflow 1.5.1을 설치한 환경을 준비합니다.

[u0017649@sys-93214 ~]$ git clone https://github.com/tensorflow/models.git

[u0017649@sys-93214 ~]$ cd models/tutorials/rnn/ptb

이 벤치마크에서 사용하는 PTB dataset은 아래와 같이 download 받을 수 있습니다.

[u0017649@sys-93214 ptb]$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz

[u0017649@sys-93215 ptb]$ tar -zxvf simple-examples.tgz

이제 다음과 같이 ptb_word_lm.py를 수행하면 됩니다.

[u0017649@sys-93214 ptb]$ time python ptb_word_lm.py --data_path=./simple-examples/data/ --model=large
...
Epoch: 1 Learning rate: 1.000
0.008 perplexity: 25072.184 speed: 1565 wps
0.107 perplexity: 1574.659 speed: 2033 wps
0.206 perplexity: 974.553 speed: 2057 wps
0.306 perplexity: 754.209 speed: 2065 wps
0.405 perplexity: 643.568 speed: 2069 wps
...
0.704 perplexity: 133.906 speed: 2085 wps
0.803 perplexity: 133.743 speed: 2085 wps
0.903 perplexity: 132.101 speed: 2085 wps
Epoch: 10 Train Perplexity: 131.618
Epoch: 10 Valid Perplexity: 117.277
Test Perplexity: 113.380
...

다만 이를 그대로 수행하면 무려 55 epochs를 수행하므로 (P100 4장으로 해도 약 3시간 정도), 좀 짧게 수행하시려면 아래와 같이 max_epoch과 max_max_epoch을 수정하시면 됩니다. 또 좀더 많은 hidden parameter를 사용하면 per word perplexity를 더 줄일 수 있는데, 대신 시간도 더 많이 걸리고 더 많은 메모리를 사용하게 됩니다.

[u0017649@sys-93214 ptb]$
...
class LargeConfig(object):
"""Large config."""
init_scale = 0.04
learning_rate = 1.0
max_grad_norm = 10
num_layers = 2
num_steps = 2
hidden_size = 1000
max_epoch = 4 #원래 14
max_max_epoch = 10 #원래 55
keep_prob = 0.35
lr_decay = 1 / 1.15
batch_size = 20
vocab_size = 10000
rnn_mode = BLOCK
...

저는 --num_gpus=0 옵션을 쓰서 GPU가 없는 CPU 환경에서 수행했는데, 이때 위의 python program이 차지하는 real memory 사용량(ps aux에서 봤을 때의 Res Set 항목)을 보면 RNN이 정말 메모리를 dynamic하게 늘였다줄였다를 반복하는 것을 보실 수 있습니다. 아래는 10분 동안만 2초 간격으로 그 메모리 사용량을 모니터링한 결과입니다. 계속 저 패턴이 반복됩니다.