
Monday, September 10, 2018

A quick way to test TF inception v3 using only the dog photos from the ILSVRC2012 dataset



Go into the benchmark/tensorflow directory and run exec_img.sh, as shown below. If you launch it with nohup as shown, the job keeps running in the background even if your session gets disconnected, and the run log is also captured in nohup.out, which is convenient.

[root@ac922 tensorflow]# pwd
/home/files/ilsvrc12/tensorflow

[root@ac922 tensorflow]# ls
benchmark  exec_img.sh  models  nohup.out.final.tf  output_final

[root@ac922 tensorflow]# nohup ./exec_img.sh &
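
To watch the progress from another session while the job runs in the background, you can simply tail the log file mentioned above:

[root@ac922 tensorflow]# tail -f nohup.out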

Running exec_img.sh once makes it invoke the ./models/run_model.sh script inside it with batch_size=128, once each for 4, 2, and 1 GPUs in sequence, i.e. three runs in total. As each run finishes, the elapsed time measured by the time command is recorded in nohup.out. When I ran the script originally provided by NVIDIA, each exec run seemed to work properly only with a freshly created output directory, so I added statements that rename the output directory and create a new one before every exec run, as below.

mkdir output
time  exec  tf inception3  128  4  0
mv output output4gpu
mkdir output
time  exec  tf inception3  128  2  0
mv output output2gpu
mkdir output
time  exec  tf inception3  128  1  0
mv output output1gpu
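
For reference, the same pattern can be written as a loop over the GPU counts instead of three repeated stanzas. This is only a sketch under the same assumptions as above: exec here is the NVIDIA-provided benchmark wrapper used in the original script (if a shell builtin of the same name shadows it, invoke it by path, e.g. ./exec), not the bash builtin.

#!/bin/bash
for NGPU in 4 2 1
do
    mkdir output                            # the wrapper expects a fresh output directory
    time exec tf inception3 128 ${NGPU} 0   # batch_size=128 on ${NGPU} GPU(s)
    mv output output${NGPU}gpu              # preserve this run's logs before the next run
done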


To check the results, you can look either at the logs accumulating in the output directory or at nohup.out. In this script the total images/sec is summed up by the python code itself and printed, so that is the figure to record (a small grep sketch follows). One caveat: the Elapsed Time computed in python has a logic error, so only the minutes:seconds part is correct while the hours always show as 9 hours; just ignore it.
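
A minimal sketch for collecting those figures, assuming the logs contain lines with the phrase "total images/sec" as described above (the log file layout under the output*gpu directories is an assumption; adjust the pattern and paths to your environment):

[root@ac922 tensorflow]# grep -rh "total images/sec" nohup.out output4gpu output2gpu output1gpu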

The python code and model files needed for this test have been uploaded to the google drive below.

https://drive.google.com/open?id=1DNn-Nv4rlOiv2NLqk6Y0j2ANlJjw9VP6

The dataset needed for this test, dog photos labeled by breed and packaged in tfrecord format, has also been uploaded to the google drive below.

https://drive.google.com/open?id=1rQcxAWeNbByy0Yooj6IbROyVRsdQPn5-

The process of extracting this dataset and formatting it as tfrecord is written up here:

http://hwengineer.blogspot.com/2018/04/ilsvrc2012imgtraint3tar-training-dataset.html


** Appendix: names and sizes of the tfrecord files

[root@ac922 ilsvrc12]# cd tfrecord/

[root@ac922 tfrecord]# ls -l | head
total 1509860
-rw-rw-r--. 1 1001 1001  6920780 Apr 11 19:20 train-00000-of-00120
-rw-rw-r--. 1 1001 1001  6422535 Apr 11 19:20 train-00001-of-00120
-rw-rw-r--. 1 1001 1001  6959007 Apr 11 19:21 train-00002-of-00120
-rw-rw-r--. 1 1001 1001  6885268 Apr 11 19:21 train-00003-of-00120
-rw-rw-r--. 1 1001 1001  5969364 Apr 11 19:21 train-00004-of-00120
-rw-rw-r--. 1 1001 1001  6143260 Apr 11 19:21 train-00005-of-00120
-rw-rw-r--. 1 1001 1001  6123517 Apr 11 19:21 train-00006-of-00120
-rw-rw-r--. 1 1001 1001  8585788 Apr 11 19:21 train-00007-of-00120
-rw-rw-r--. 1 1001 1001  6149957 Apr 11 19:21 train-00008-of-00120

[root@ac922 tfrecord]# ls -l | tail
-rw-rw-r--. 1 1001 1001 24124729 Apr 11 19:20 validation-00022-of-00032
-rw-rw-r--. 1 1001 1001 23741822 Apr 11 19:20 validation-00023-of-00032
-rw-rw-r--. 1 1001 1001 24759230 Apr 11 19:20 validation-00024-of-00032
-rw-rw-r--. 1 1001 1001 25225023 Apr 11 19:20 validation-00025-of-00032
-rw-rw-r--. 1 1001 1001 25273559 Apr 11 19:20 validation-00026-of-00032
-rw-rw-r--. 1 1001 1001 26820464 Apr 11 19:20 validation-00027-of-00032
-rw-rw-r--. 1 1001 1001 24115323 Apr 11 19:20 validation-00028-of-00032
-rw-rw-r--. 1 1001 1001 24459085 Apr 11 19:20 validation-00029-of-00032
-rw-rw-r--. 1 1001 1001 25246485 Apr 11 19:20 validation-00030-of-00032
-rw-rw-r--. 1 1001 1001 23561132 Apr 11 19:20 validation-00031-of-00032

[root@ac922 tfrecord]# du -sm .
1475    .


Thursday, June 8, 2017

Training inception v3 with a tensorflow docker image on POWER8 using LSF


Here we build an LSF cluster out of the sys-87548 and sys-87549 servers using the free community edition of Spectrum LSF. The sys-87548 server is both master and slave, and the sys-87549 server is the secondary master and likewise a slave.

Download the LSF community edition from the internet and extract the tar as follows. Inside you can see it contains LSF and the Platform Application Center (the pac directory). Here we install only LSF for now.

root@sys-87548:/home/u0017496# tar -zxvf lsfce10.1-ppc64le.tar.gz
lsfce10.1-ppc64le/
lsfce10.1-ppc64le/lsf/
lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall_linux_ppc64le.tar.Z
lsfce10.1-ppc64le/lsf/lsf10.1_lnx310-lib217-ppc64le.tar.Z
lsfce10.1-ppc64le/pac/
lsfce10.1-ppc64le/pac/pac10.1_basic_linux-ppc64le.tar.Z

root@sys-87548:/home/u0017496# cd lsfce10.1-ppc64le/lsf/

As shown below, LSF comes with two *.tar.Z files, but only lsf10.1_lsfinstall_linux_ppc64le.tar.Z needs to be uncompressed. Leave lsf10.1_lnx310-lib217-ppc64le.tar.Z alone, still compressed.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# ls
lsf10.1_lnx310-lib217-ppc64le.tar.Z  lsf10.1_lsfinstall_linux_ppc64le.tar.Z

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# zcat lsf10.1_lsfinstall_linux_ppc64le.tar.Z | tar xvf -

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# cd lsf10.1_lsfinstall

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ls
conf_tmpl  install.config  lap         lsf_unix_install.pdf  patchlib   README      rpm      slave.config
hostsetup  instlib         lsfinstall  patchinstall          pversions  rhostsetup  scripts

First, edit install.config as follows. Following the order in LSF_MASTER_LIST, sys-87548 is the primary master and sys-87549 the secondary master, and following LSF_ADD_SERVERS, sys-87548 and sys-87549 are each host servers, i.e. slaves. If you have only one server, that server can be both the master and the single slave (a minimal variant is sketched after the config below).

[root@sys-87538 lsf10.1_lsfinstall]# vi install.config
LSF_TOP="/usr/share/lsf"
LSF_ADMINS="u0017496"
LSF_CLUSTER_NAME="cluster1"
LSF_MASTER_LIST="sys-87548 sys-87549"
LSF_TARDIR="/home/u0017496/lsfce10.1-ppc64le/lsf"
# CONFIGURATION_TEMPLATE="DEFAULT|PARALLEL|HIGH_THROUGHPUT"
LSF_ADD_SERVERS="sys-87548 sys-87549"
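
For the single-server case mentioned above, a hypothetical minimal variant would simply list the same host in both variables, leaving the other lines unchanged:

LSF_MASTER_LIST="sys-87548"
LSF_ADD_SERVERS="sys-87548"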


Start the installation with the lsfinstall command as follows. Along the way it asks for the location of the LSF distribution tar file; this is the file we left compressed earlier. Just press 1 to take the default.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./lsfinstall -f install.config
...
Press Enter to continue viewing the license agreement, or
enter "1" to accept the agreement, "2" to decline it, "3"
to print it, "4" to read non-IBM terms, or "99" to go back
to the previous screen.
1
...
Searching LSF 10.1 distribution tar files in /home/u0017496/lsfce10.1-ppc64le/lsf Please wait ...
  1) linux3.10-glibc2.17-ppc64le
Press 1 or Enter to install this host type: 1


Once the LSF engine is installed, run the following command to configure this server as a host server, i.e. a slave. --boot="y" means the LSF daemons are started automatically when the server boots.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./hostsetup --top="/usr/share/lsf" --boot="y"
Logging installation sequence in /usr/share/lsf/log/Install.log

------------------------------------------------------------
    L S F    H O S T S E T U P    U T I L I T Y
------------------------------------------------------------
This script sets up local host (LSF server, client or slave) environment.

Setting up LSF server host "sys-87548" ...
Checking LSF installation for host "sys-87548" ... Done
grep: /etc/init/rc-sysinit.conf: No such file or directory
Copying /etc/init.d/lsf, /etc/rc3.d/S95lsf and /etc/rc3.d/K05lsf
Installing LSF RC scripts on host "sys-87548" ... Done
LSF service ports are defined in /usr/share/lsf/conf/lsf.conf.
Checking LSF service ports definition on host "sys-87548" ... Done
You are installing IBM Spectrum LSF -  Community Edition.

... Setting up LSF server host "sys-87548" is done
... LSF host setup is done.


Then put an entry in the .bashrc of root and of each LSF user so that profile.lsf is executed, as follows.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# vi /root/.bashrc
. /usr/share/lsf/conf/profile.lsf

u0017496@sys-87481:~$ vi ~/.bashrc
. /usr/share/lsf/conf/profile.lsf

Also set LSF_RSH in lsf.conf as below so that the servers use ssh, not rsh, between themselves.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# vi /usr/share/lsf/conf/lsf.conf
LSF_RSH=ssh

And register the LSF users in /etc/lsf.sudoers.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# sudo vi /etc/lsf.sudoers
LSB_PRE_POST_EXEC_USER=u0017496
LSF_STARTUP_PATH=/usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/etc
LSF_STARTUP_USERS="u0017496"


Passwordless ssh must be set up for the lsfadmin user (u0017496 here) as well as for root, from each server both to itself and to the other server.

u0017496@sys-87548:~$ ssh-keygen -t rsa
u0017496@sys-87548:~$ ssh-copy-id sys-87548
u0017496@sys-87548:~$ ssh-copy-id sys-87549


root@sys-87548:/home/u0017496# ssh-keygen -t rsa

root@sys-87548:/home/u0017496# ls -l /root/.ssh
total 12
-rw------- 1 root root 1595 Jun  8 03:43 authorized_keys
-rw------- 1 root root 1679 Jun  8 03:42 id_rsa
-rw-r--r-- 1 root root  396 Jun  8 03:42 id_rsa.pub

root@sys-87548:/home/u0017496# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys


root@sys-87548:/home/u0017496# cp /root/.ssh/id_rsa.pub /tmp/sys-87548.id_rsa.pub
root@sys-87548:/home/u0017496# chmod 666 /tmp/sys-87548.id_rsa.pub
root@sys-87548:/home/u0017496# exit

u0017496@sys-87548:~$ scp /tmp/sys-87548.id_rsa.pub sys-87549:/tmp
sys-87548.id_rsa.pub                                                         100%  396     0.4KB/s   00:00

root@sys-87548:/home/u0017496# cat /tmp/sys-87549.id_rsa.pub >> /root/.ssh/authorized_keys
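
The reverse direction is assumed but not shown in full above: the /tmp/sys-87549.id_rsa.pub file appended in the last command would come from performing the mirror-image steps on sys-87549, roughly as sketched here.

root@sys-87549:/home/u0017496# ssh-keygen -t rsa
root@sys-87549:/home/u0017496# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
root@sys-87549:/home/u0017496# cp /root/.ssh/id_rsa.pub /tmp/sys-87549.id_rsa.pub
root@sys-87549:/home/u0017496# chmod 666 /tmp/sys-87549.id_rsa.pub
u0017496@sys-87549:~$ scp /tmp/sys-87549.id_rsa.pub sys-87548:/tmp
root@sys-87549:/home/u0017496# cat /tmp/sys-87548.id_rsa.pub >> /root/.ssh/authorized_keys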


Now start the LSF daemons with the lsfstartup command.

root@sys-87548:/home/u0017496# lsfstartup

root@sys-87548:/home/u0017496# lsid
IBM Spectrum LSF Community Edition 10.1.0.0, Jun 15 2016
Copyright IBM Corp. 1992, 2016. All rights reserved.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

My cluster name is cluster1
My master name is sys-87548.dal-ebis.ihost.com


Note that since LSF is installed only on sys-87548 so far, the LSF daemons see sys-87549 as being down.

u0017496@sys-87548:~$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
sys-87548.dal-ebis ok              -      1      0      0      0      0      0
sys-87549.dal-ebis unavail         -      1      0      0      0      0      0

Now submit a job that trains inception v3 using the tensorflow docker image with the bsub command, as follows. By default it goes into the normal queue. Issue the same command twice more, so that three training jobs are sitting in the queue.

u0017496@sys-87548:~$ bsub -n 1 sudo docker run  --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <1> is submitted to queue <normal>.
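
If you would rather have the job's stdout written to a file, LSF's standard bsub -o option can be added; %J expands to the job ID. The docker arguments are the same as above and elided here:

u0017496@sys-87548:~$ bsub -n 1 -o /tmp/inception.%J.out sudo docker run --rm ... --batch_size=8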

You can see that of the 3 jobs, 1 is running and 2 are pending.

u0017496@sys-87548:~$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
owners           43  Open:Active       -    -    -    -     0     0     0     0
priority         43  Open:Active       -    -    -    -     0     0     0     0
night            40  Open:Active       -    -    -    -     0     0     0     0
short            35  Open:Active       -    -    -    -     0     0     0     0
normal           30  Open:Active       -    -    -    -     3     2     1     0
interactive      30  Open:Active       -    -    -    -     0     0     0     0
idle             20  Open:Active       -    -    -    -     0     0     0     0

Now let's submit the same job again, this time to the short queue rather than the default normal queue.

u0017496@sys-87548:~$ bsub -n 1 -q short sudo docker run  --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <4> is submitted to queue <short>.


u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56
3       u001749 PEND  normal     sys-87548.d             help       Jun  8 04:02

Job 3, which was submitted by mistake, can be killed with the bkill command below.

u0017496@sys-87548:~$ bkill 3
Job <3> is being terminated

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56

A job in the middle of running can be suspended for a while with the bstop command. Its in-flight state is written to disk in temporary files so that it can be resumed later.

u0017496@sys-87548:~$ bstop 1
Job <1> is being stopped

u0017496@sys-87548:~$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
owners           43  Open:Active       -    -    -    -     0     0     0     0
priority         43  Open:Active       -    -    -    -     0     0     0     0
night            40  Open:Active       -    -    -    -     0     0     0     0
short            35  Open:Active       -    -    -    -     1     1     0     0
normal           30  Open:Active       -    -    -    -     2     1     0     1
interactive      30  Open:Active       -    -    -    -     0     0     0     0
idle             20  Open:Active       -    -    -    -     0     0     0     0

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 USUSP normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56

A suspended job can be resumed with the bresume command. At that point some I/O occurs while the temporary files are read back from disk.

u0017496@sys-87548:~$ bresume 1
Job <1> is being resumed

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56

u0017496@sys-87548:~$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
owners           43  Open:Active       -    -    -    -     0     0     0     0
priority         43  Open:Active       -    -    -    -     0     0     0     0
night            40  Open:Active       -    -    -    -     0     0     0     0
short            35  Open:Active       -    -    -    -     1     1     0     0
normal           30  Open:Active       -    -    -    -     2     1     1     0
interactive      30  Open:Active       -    -    -    -     0     0     0     0
idle             20  Open:Active       -    -    -    -     0     0     0     0

u0017496@sys-87548:~$ bsub -n 1 -q short sudo docker run  --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <5> is submitted to queue <short>.

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
5       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:14
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56


If one of the waiting jobs needs to be handled quickly, the btop command can move it to the top position in the pending queue.

u0017496@sys-87548:~$ btop 5
Job <5> has been moved to position 1 from top.

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
5       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:14
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56

You can peek at the standard output of a currently running job with the bpeek command. It is not necessarily displayed in real time; some output may only appear after a bit of buffering.

u0017496@sys-87548:~$ bpeek 1
<< output from stdout >>
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: d61a01411a1c
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1065] LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1066] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (d61a01411a1c): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 07:56:19.985253: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 07:59:19.471085: step 0, loss = 2.89 (0.1 examples/sec; 55.400 sec/batch)
2017-06-08 08:04:13.045726: step 10, loss = 2.46 (0.4 examples/sec; 19.288 sec/batch)
2017-06-08 08:07:22.977419: step 20, loss = 2.49 (0.4 examples/sec; 19.100 sec/batch)
2017-06-08 08:10:37.519711: step 30, loss = 2.18 (0.4 examples/sec; 20.120 sec/batch)
2017-06-08 08:13:48.625858: step 40, loss = 2.28 (0.4 examples/sec; 20.342 sec/batch)

The bhist command shows a brief history of each job, and a detailed one with the -l option.

u0017496@sys-87548:~$ bhist
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
2       u001749 *_size=8  1275    0       106     0       0       0       1381
4       u001749 *_size=8  775     0       0       0       0       0       775
5       u001749 *_size=8  321     0       0       0       0       0       321

u0017496@sys-87548:~$ bhist -l

Job <2>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
                     home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
                     nial /home/inception/models/inception/bazel-bin/inception/
                     flowers_train --train_dir=/home/inception/models/inception
                     /train --data_dir=/home/inception/models/inception/data --
                     pretrained_model_checkpoint_path=/home/inception/inception
                     -v3/model.ckpt-157585 --fine_tune=True --initial_learning_
                     rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
                     um_gpus 1 --batch_size=8>
Thu Jun  8 03:56:31: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
                     ue <normal>, CWD <$HOME>;
Thu Jun  8 04:13:10: Job moved to position 1 relative to <top> by user or admin
                     istrator <u0017496>;
Thu Jun  8 04:17:46: Dispatched 1 Task(s) on Host(s) <sys-87548.dal-ebis.ihost.
                     com>, Allocated 1 Slot(s) on Host(s) <sys-87548.dal-ebis.i
                     host.com>, Effective RES_REQ <select[type == local] order[
                     r15s:pg] >;
Thu Jun  8 04:17:46: Starting (Pid 25530);
Thu Jun  8 04:17:46: Running with execution home </home/u0017496>, Execution CW
                     D </home/u0017496>, Execution Pid <25530>;

Summary of time in seconds spent in various states by  Thu Jun  8 04:19:36
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  1275     0        110      0        0        0        1385
------------------------------------------------------------------------------

Job <4>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
                     home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
                     nial /home/inception/models/inception/bazel-bin/inception/
                     flowers_train --train_dir=/home/inception/models/inception
                     /train --data_dir=/home/inception/models/inception/data --
                     pretrained_model_checkpoint_path=/home/inception/inception
                     -v3/model.ckpt-157585 --fine_tune=True --initial_learning_
                     rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
                     um_gpus 1 --batch_size=8>
Thu Jun  8 04:06:37: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
                     ue <short>, CWD <$HOME>;

Summary of time in seconds spent in various states by  Thu Jun  8 04:19:36
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  779      0        0        0        0        0        779
------------------------------------------------------------------------------

Job <5>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
                     home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
                     nial /home/inception/models/inception/bazel-bin/inception/
                     flowers_train --train_dir=/home/inception/models/inception
                     /train --data_dir=/home/inception/models/inception/data --
                     pretrained_model_checkpoint_path=/home/inception/inception
                     -v3/model.ckpt-157585 --fine_tune=True --initial_learning_
                     rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
                     um_gpus 1 --batch_size=8>
Thu Jun  8 04:14:11: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
                     ue <short>, CWD <$HOME>;
Thu Jun  8 04:14:34: Job moved to position 1 relative to <top> by user or admin
                     istrator <u0017496>;

Summary of time in seconds spent in various states by  Thu Jun  8 04:19:36
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  325      0        0        0        0        0        325

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
2       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:56
5       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:14
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06

u0017496@sys-87548:~$ bjobs -a
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
2       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:56
5       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:14
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
3       u001749 EXIT  normal     sys-87548.d    -        help       Jun  8 04:02
1       u001749 DONE  normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54



Now bring the second server, sys-87549, online as well: install LSF using the same install.config and run the same hostsetup command there, as sketched below. Afterwards, the bhosts command shows that sys-87548, which has no spare capacity because a job is already running on it, now appears as closed.
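
A sketch of that sequence on the second server, assuming the same file locations as on sys-87548:

root@sys-87549:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./lsfinstall -f install.config
root@sys-87549:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./hostsetup --top="/usr/share/lsf" --boot="y"

The daemons on sys-87549 can then be brought up, for example by running lsfstartup again from the master, which reaches the remote host through the LSF_RSH=ssh setting configured earlier.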

u0017496@sys-87548:~$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
sys-87548.dal-ebis closed          -      1      1      1      0      0      0
sys-87549.dal-ebis ok              -      1      0      0      0      0      0

u0017496@sys-87548:~$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
owners           43  Open:Active       -    -    -    -     0     0     0     0
priority         43  Open:Active       -    -    -    -     0     0     0     0
night            40  Open:Active       -    -    -    -     0     0     0     0
short            35  Open:Active       -    -    -    -     1     0     1     0
normal           30  Open:Active       -    -    -    -     0     0     0     0
interactive      30  Open:Active       -    -    -    -     0     0     0     0
idle             20  Open:Active       -    -    -    -     0     0     0     0

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5       u001749 RUN   short      sys-87548.d sys-87548.d *ch_size=8 Jun  8 04:14

Now, in this state, submit 4 more training jobs using the docker image with bsub.

u0017496@sys-87548:~$ bsub -n 1  sudo docker run  --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <8> is submitted to default queue <normal>.
... (the same command repeated 4 times in total)
Job <11> is submitted to default queue <normal>.


Since there are now 2 host servers, you can see 2 jobs running at the same time.

u0017496@sys-87548:~$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
owners           43  Open:Active       -    -    -    -     0     0     0     0
priority         43  Open:Active       -    -    -    -     0     0     0     0
night            40  Open:Active       -    -    -    -     0     0     0     0
short            35  Open:Active       -    -    -    -     0     0     0     0
normal           30  Open:Active       -    -    -    -     4     2     2     0
interactive      30  Open:Active       -    -    -    -     0     0     0     0
idle             20  Open:Active       -    -    -    -     0     0     0     0

u0017496@sys-87548:~$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
sys-87548.dal-ebis closed          -      1      1      1      0      0      0
sys-87549.dal-ebis closed          -      1      1      1      0      0      0

With bjobs you can see that job 8 is running on the sys-87548 server and job 9 on the sys-87549 server.

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
8       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 05:57
9       u001749 RUN   normal     sys-87548.d sys-87549.d *ch_size=8 Jun  8 05:57
10      u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 05:57
11      u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 05:57

u0017496@sys-87548:~$ bqueues -l normal

QUEUE: normal
  -- For normal low priority jobs, running only if hosts are lightly loaded.  This is the default queue.

PARAMETERS/STATISTICS
PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV
 30    0  Open:Active       -    -    -    -     4     2     2     0     0    0
Interval for a host to accept two jobs is 0 seconds

SCHEDULING PARAMETERS
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

SCHEDULING POLICIES:  FAIRSHARE  NO_INTERACTIVE
USER_SHARES:  [default, 1]

SHARE_INFO_FOR: normal/
 USER/GROUP   SHARES  PRIORITY  STARTED  RESERVED  CPU_TIME  RUN_TIME   ADJUST
u0017496        1       0.111      2        0        11.2      116       0.000

USERS: all
HOSTS:  all


u0017496@sys-87548:~$ bjobs -l

Job <8>, User <u0017496>, Project <default>, Status <RUN>, Queue <normal>, Comm
                     and <sudo docker run --rm -v /home/inception:/home/incepti
                     on bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/
                     inception/bazel-bin/inception/flowers_train --train_dir=/h
                     ome/inception/models/inception/train --data_dir=/home/ince
                     ption/models/inception/data --pretrained_model_checkpoint_
                     path=/home/inception/inception-v3/model.ckpt-157585 --fine
                     _tune=True --initial_learning_rate=0.001 -input_queue_memo
                     ry_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>, S
                     hare group charged </u0017496>
Thu Jun  8 05:57:23: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
                     HOME>;
Thu Jun  8 05:57:24: Started 1 Task(s) on Host(s) <sys-87548.dal-ebis.ihost.com
                     >, Allocated 1 Slot(s) on Host(s) <sys-87548.dal-ebis.ihos
                     t.com>, Execution Home </home/u0017496>, Execution CWD </h
                     ome/u0017496>;
Thu Jun  8 05:57:40: Resource usage collected.
                     MEM: 33 Mbytes;  SWAP: 85 Mbytes;  NTHREAD: 11
                     PGID: 29697;  PIDs: 29697 29699 29701 29702


 MEMORY USAGE:
 MAX MEM: 33 Mbytes;  AVG MEM: 33 Mbytes

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
------------------------------------------------------------------------------

Job <9>, User <u0017496>, Project <default>, Status <RUN>, Queue <normal>, Comm
                     and <sudo docker run --rm -v /home/inception:/home/incepti
                     on bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/
                     inception/bazel-bin/inception/flowers_train --train_dir=/h
                     ome/inception/models/inception/train --data_dir=/home/ince
                     ption/models/inception/data --pretrained_model_checkpoint_
                     path=/home/inception/inception-v3/model.ckpt-157585 --fine
                     _tune=True --initial_learning_rate=0.001 -input_queue_memo
                     ry_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>, S
                     hare group charged </u0017496>
Thu Jun  8 05:57:27: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
                     HOME>;
Thu Jun  8 05:57:28: Started 1 Task(s) on Host(s) <sys-87549.dal-ebis.ihost.com
                     >, Allocated 1 Slot(s) on Host(s) <sys-87549.dal-ebis.ihos
                     t.com>, Execution Home </home/u0017496>, Execution CWD </h
                     ome/u0017496>;
Thu Jun  8 05:58:30: Resource usage collected.
                     MEM: 33 Mbytes;  SWAP: 148 Mbytes;  NTHREAD: 11
                     PGID: 14493;  PIDs: 14493 14494 14496 14497


 MEMORY USAGE:
 MAX MEM: 33 Mbytes;  AVG MEM: 23 Mbytes

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
------------------------------------------------------------------------------

Job <10>, User <u0017496>, Project <default>, Status <PEND>, Queue <normal>, Co
                     mmand <sudo docker run --rm -v /home/inception:/home/incep
                     tion bsyu/tensor_r1.0:ppc64le-xenial /home/inception/model
                     s/inception/bazel-bin/inception/flowers_train --train_dir=
                     /home/inception/models/inception/train --data_dir=/home/in
                     ception/models/inception/data --pretrained_model_checkpoin
                     t_path=/home/inception/inception-v3/model.ckpt-157585 --fi
                     ne_tune=True --initial_learning_rate=0.001 -input_queue_me
                     mory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>
Thu Jun  8 05:57:31: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
                     HOME>;
 PENDING REASONS:
 Job slot limit reached: 2 hosts;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
------------------------------------------------------------------------------

Job <11>, User <u0017496>, Project <default>, Status <PEND>, Queue <normal>, Co
                     mmand <sudo docker run --rm -v /home/inception:/home/incep
                     tion bsyu/tensor_r1.0:ppc64le-xenial /home/inception/model
                     s/inception/bazel-bin/inception/flowers_train --train_dir=
                     /home/inception/models/inception/train --data_dir=/home/in
                     ception/models/inception/data --pretrained_model_checkpoin
                     t_path=/home/inception/inception-v3/model.ckpt-157585 --fi
                     ne_tune=True --initial_learning_rate=0.001 -input_queue_me
                     mory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>
Thu Jun  8 05:57:33: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
                     HOME>;
 PENDING REASONS:
 Job slot limit reached: 2 hosts;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

Building a tensorflow docker image on POWER8 and running inception v3 training with it

First, pull the ppc64le-based Ubuntu 16.04 LTS docker image with the cuda8 and cudnn5-devel packages installed, which was built earlier, from docker hub.

root@sys-87548:/home/u0017496# docker pull bsyu/cuda8-cudnn5-devel:cudnn5-devel
cudnn5-devel: Pulling from bsyu/cuda8-cudnn5-devel
ffa99da61f7b: Extracting 41.78 MB/72.3 MB
6b239e02a89e: Download complete
aecbc9abccdc: Downloading 110.8 MB/415.3 MB
8f458a3f0497: Download complete
4903f7ce6675: Download complete
0c588ac98d19: Downloading   107 MB/450.9 MB
12e624e884fc: Download complete
18dd28bbb571: Downloading 45.37 MB/103.2 MB
...

On top of this, build a docker image with the tensorflow included in PowerAI installed. First, write the dockerfile as follows.

root@sys-87548:/home/u0017496# vi dockerfile.tensorflow

FROM bsyu/cuda8-cudnn5-devel:cudnn5-devel
RUN apt-get update && apt-get install -y nvidia-modprobe

RUN mkdir /tmp/temp
COPY libcudnn5* /tmp/temp/
COPY cuda-repo* /tmp/temp/
COPY mldl-repo* /tmp/temp/

RUN dpkg -i /tmp/temp/cuda-repo*deb && \
    dpkg -i /tmp/temp/libcudnn5*deb && \
    dpkg -i /tmp/temp/mldl-repo*deb && \
    rm -rf /tmp/temp && \
    apt-get update && apt-get install -y tensorflow && \
    rm -rf /var/lib/apt/lists/* && \
    dpkg -r mldl-repo-local

# set the working directory
WORKDIR /opt/DL/caffe-nv/bin
ENV LD_LIBRARY_PATH="/usr/local/nvidia/lib64:/usr/local/cuda-8.0/targets/ppc64le-linux/lib/stubs:/usr/lib/powerpc64le-linux-gnu/stubs:/usr/lib/powerpc64le-linux-gnu:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib"
ENV PATH="/opt/ibm/xlC/current/bin:/opt/ibm/xlf/current/bin:/opt/at10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/DL/bazel/bin:/opt/DL/tensorflow/bin"
ENV PYTHONPATH="/opt/DL/tensorflow/lib/python2.7/site-packages"

CMD /bin/bash

Now build the docker image bsyu/tensor_r1.0:ppc64le-xenial from this dockerfile.

root@sys-87548:/home/u0017496# docker build -t bsyu/tensor_r1.0:ppc64le-xenial -f dockerfile.tensorflow .
Sending build context to Docker daemon 3.436 GB
Step 1 : FROM bsyu/cuda8-cudnn5-devel:cudnn5-devel
 ---> d8d0da2fbdf2
Step 2 : RUN apt-get update && apt-get install -y nvidia-modprobe
 ---> Running in 204fe4e2c5f6
Ign:1 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  InRelease
Get:2 http://ports.ubuntu.com/ubuntu-ports xenial InRelease [247 kB]
Get:3 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  Release [565 B]
Get:4 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  Release.gpg [819 B]
Get:5 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  Packages [24.9 kB]
Get:6 http://ports.ubuntu.com/ubuntu-ports xenial-updates InRelease [102 kB]
Get:7 http://ports.ubuntu.com/ubuntu-ports xenial-security InRelease [102 kB]
Get:8 http://ports.ubuntu.com/ubuntu-ports xenial/main ppc64el Packages [1470 kB]
Get:9 http://ports.ubuntu.com/ubuntu-ports xenial/universe ppc64el Packages [9485 kB]
Get:10 http://ports.ubuntu.com/ubuntu-ports xenial/multiverse ppc64el Packages [152 kB]
Get:11 http://ports.ubuntu.com/ubuntu-ports xenial-updates/main ppc64el Packages [613 kB]
Get:12 http://ports.ubuntu.com/ubuntu-ports xenial-updates/universe ppc64el Packages [528 kB]
Get:13 http://ports.ubuntu.com/ubuntu-ports xenial-updates/multiverse ppc64el Packages [5465 B]
Get:14 http://ports.ubuntu.com/ubuntu-ports xenial-security/main ppc64el Packages [286 kB]
Get:15 http://ports.ubuntu.com/ubuntu-ports xenial-security/universe ppc64el Packages [138 kB]
Fetched 13.2 MB in 10s (1230 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
  nvidia-modprobe
0 upgraded, 1 newly installed, 0 to remove and 83 not upgraded.
Need to get 16.3 kB of archives.
After this operation, 85.0 kB of additional disk space will be used.
Get:1 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  nvidia-modprobe 375.51-0ubuntu1 [16.3 kB]
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
Fetched 16.3 kB in 0s (191 kB/s)
Selecting previously unselected package nvidia-modprobe.
(Reading database ... 17174 files and directories currently installed.)
Preparing to unpack .../nvidia-modprobe_375.51-0ubuntu1_ppc64el.deb ...
Unpacking nvidia-modprobe (375.51-0ubuntu1) ...
Setting up nvidia-modprobe (375.51-0ubuntu1) ...
 ---> 5411319bbc05
Removing intermediate container 204fe4e2c5f6
Step 3 : RUN mkdir /tmp/temp
 ---> Running in cf13b03845f1
 ---> 66b2b250777f
Removing intermediate container cf13b03845f1
Step 4 : COPY libcudnn5* /tmp/temp/
 ---> 16d921e53451
Removing intermediate container 9d1efa9ed269
Step 5 : COPY cuda-repo* /tmp/temp/
...
Step 9 : ENV LD_LIBRARY_PATH "/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib"
 ---> Running in fe30af7c944e
 ---> f5faa1760ac7
Removing intermediate container fe30af7c944e
Step 10 : ENV PATH "/opt/ibm/xlC/current/bin:/opt/ibm/xlf/current/bin:/opt/at10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/DL/bazel/bin:/opt/DL/tensorflow/bin"
 ---> Running in 98a0e5bfd008
 ---> 7cfb0feaaee1
Removing intermediate container 98a0e5bfd008
Step 11 : ENV PYTHONPATH "/opt/DL/tensorflow/lib/python2.7/site-packages"
 ---> Running in d98d5352108e
 ---> affda7b26276
Removing intermediate container d98d5352108e
Step 12 : CMD /bin/bash
 ---> Running in d54a20fb7e3c
 ---> 4692368fb7ad
Removing intermediate container d54a20fb7e3c
Successfully built 4692368fb7ad


Check the docker image that was just built.

root@sys-87548:/home/u0017496# docker images
REPOSITORY                TAG                 IMAGE ID            CREATED             SIZE
bsyu/tensor_r1.0          ppc64le-xenial      4692368fb7ad        3 minutes ago       6.448 GB
nvidia-docker             deb                 2830f66f0418        41 hours ago        429.8 MB
nvidia-docker             build               fa764787622c        41 hours ago        1.014 GB
ppc64le/ubuntu            14.04               0e6701cbf611        2 weeks ago         228.5 MB
bsyu/cuda8-cudnn5-devel   cudnn5-devel        d8d0da2fbdf2        4 months ago        1.895 GB
ppc64le/golang            1.6.3               6a579d02d32f        9 months ago        704.7 MB
golang                    1.5                 99668503de15        10 months ago       725.3 MB


Push this docker image to docker hub so it can be used on other servers later.

root@sys-87548:/home/u0017496# docker push bsyu/tensor_r1.0:ppc64le-xenial
The push refers to a repository [docker.io/bsyu/tensor_r1.0]
f42db0829239: Pushed
6a6b4d4d9d2a: Pushing 184.1 MB/2.738 GB
6458d0633f20: Pushing 172.7 MB/390.2 MB
726e25ffdf3c: Pushing 173.2 MB/1.321 GB
1535936ab54b: Pushed
bc0917851737: Pushed
9a1e25cd5998: Pushed
c0fe73e43621: Mounted from bsyu/cuda8-cudnn5-devel
4ce979019d1d: Mounted from bsyu/cuda8-cudnn5-devel
724befd94678: Mounted from bsyu/cuda8-cudnn5-devel
84f99f1bf79b: Mounted from bsyu/cuda8-cudnn5-devel
7f7c1dccec82: Mounted from bsyu/cuda8-cudnn5-devel
5b8880a35736: Mounted from bsyu/cuda8-cudnn5-devel
41b97cb9a404: Mounted from bsyu/cuda8-cudnn5-devel
08f34ce6b3fb: Mounted from bsyu/cuda8-cudnn5-devel


Now that the docker image is ready, prepare to run inception v3 with tensorflow. Using bazel, build the python package that will run the inception v3 training under the /home/inception directory. This directory will later be mounted into the docker image.

root@sys-87548:/home# mkdir inception
root@sys-87548:/home# export INCEPTION_DIR=/home/inception

root@sys-87548:/home# cd inception/
root@sys-87548:/home/inception# curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  380M  100  380M    0     0  4205k      0  0:01:32  0:01:32 --:--:-- 4988k

root@sys-87548:/home/inception# tar -xvf inception-v3-2016-03-01.tar.gz
inception-v3/
inception-v3/checkpoint
inception-v3/README.txt
inception-v3/model.ckpt-157585

root@sys-87548:/home/inception# git clone https://github.com/tensorflow/models.git
Cloning into 'models'...
remote: Counting objects: 4866, done.
remote: Total 4866 (delta 0), reused 0 (delta 0), pack-reused 4866
Receiving objects: 100% (4866/4866), 153.36 MiB | 5.23 MiB/s, done.
Resolving deltas: 100% (2467/2467), done.
Checking connectivity... done.

root@sys-87548:/home/inception# export FLOWERS_DIR=/home/inception/models/inception
root@sys-87548:/home/inception# mkdir -p $FLOWERS_DIR/data
root@sys-87548:/home/inception# cd models/inception/
root@sys-87548:/home/inception/models/inception# . /opt/DL/bazel/bin/bazel-activate
root@sys-87548:/home/inception/models/inception# . /opt/DL/tensorflow/bin/tensorflow-activate

root@sys-87548:/home/inception/models/inception# export TEST_TMPDIR=/home/inception/.cache

root@sys-87548:/home/inception/models/inception# bazel build inception/download_and_preprocess_flowers
INFO: $TEST_TMPDIR defined: output root default is '/home/inception/.cache'.
Extracting Bazel installation...
..............
INFO: Found 1 target...
Target //inception:download_and_preprocess_flowers up-to-date:
  bazel-bin/inception/download_and_preprocess_flowers
INFO: Elapsed time: 5.831s, Critical Path: 0.02s

root@sys-87548:/home/inception/models/inception# ls -l
total 76
lrwxrwxrwx 1 root root   116 Jun  8 02:36 bazel-bin -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/bin
lrwxrwxrwx 1 root root   121 Jun  8 02:36 bazel-genfiles -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/genfiles
lrwxrwxrwx 1 root root    86 Jun  8 02:36 bazel-inception -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception
lrwxrwxrwx 1 root root    96 Jun  8 02:36 bazel-out -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out
lrwxrwxrwx 1 root root   121 Jun  8 02:36 bazel-testlogs -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/testlogs
drwxr-xr-x 2 root root  4096 Jun  8 02:32 data
drwxr-xr-x 2 root root  4096 Jun  8 02:29 g3doc
drwxr-xr-x 4 root root  4096 Jun  8 02:29 inception
-rw-r--r-- 1 root root 38480 Jun  8 02:29 README.md
-rw-r--r-- 1 root root    30 Jun  8 02:29 WORKSPACE


root@sys-87548:/home/inception/models/inception# bazel-bin/inception/download_and_preprocess_flowers $FLOWERS_DIR/data
Downloading flower data set.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  218M  100  218M    0     0  4649k      0  0:00:48  0:00:48 --:--:-- 5105k
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
...
Found 3170 JPEG files across 5 labels inside /home/u0017496/inception/models/inception/data/raw-data/train.
Launching 2 threads for spacings: [[0, 1585], [1585, 3170]]
2017-06-08 02:01:56.169564 [thread 1]: Processed 1000 of 1585 images in thread batch.
2017-06-08 02:01:56.268917 [thread 0]: Processed 1000 of 1585 images in thread batch.
2017-06-08 02:02:01.252583 [thread 1]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00001-of-00002
2017-06-08 02:02:01.252638 [thread 1]: Wrote 1585 images to 1585 shards.
2017-06-08 02:02:01.306138 [thread 0]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00000-of-00002
2017-06-08 02:02:01.306178 [thread 0]: Wrote 1585 images to 1585 shards.
2017-06-08 02:02:01.578737: Finished writing all 3170 images in data set.

Inception v3 here is a model that recognizes flower photos by type, as shown below.

root@sys-87548:/home/inception/models/inception# du -sm data/raw-data/train/*
29      data/raw-data/train/daisy
43      data/raw-data/train/dandelion
1       data/raw-data/train/LICENSE.txt
34      data/raw-data/train/roses
47      data/raw-data/train/sunflowers
48      data/raw-data/train/tulips

Now build flowers_train with bazel.

root@sys-87548:/home/inception/models/inception# bazel build inception/flowers_train
INFO: Found 1 target...
Target //inception:flowers_train up-to-date:
  bazel-bin/inception/flowers_train
INFO: Elapsed time: 0.311s, Critical Path: 0.03s

Preparation is now complete. Before running it in docker, run this flowers_train directly as-is.

root@sys-87548:/home/inception/models/inception# time bazel-bin/inception/flowers_train --train_dir=$FLOWERS_DIR/train --data_dir=$FLOWERS_DIR/data --pretrained_model_checkpoint_path=$INCEPTION_DIR/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8

This server has cuda installed but no GPU, so the error messages below are to be expected. With no GPU, the job simply runs on the CPU. Since it takes 20+ minutes, I will cut it off partway through.

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
NVIDIA: no NVIDIA devices found
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (sys-87548): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 02:41:53.587744: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 02:44:28.213350: step 0, loss = 2.85 (0.2 examples/sec; 38.569 sec/batch)
...


Now let's run inception v3 using the docker image named bsyu/tensor_r1.0:ppc64le-xenial that we built earlier. The actual flowers_train lives under /home/inception, so mount that directory inside the docker image with the -v option.

root@sys-87548:/home/inception/models/inception# docker run  --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
...
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (b85c9a819a6a): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 06:48:27.996200: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 06:51:10.935895: step 0, loss = 2.83 (0.2 examples/sec; 39.389 sec/batch)
2017-06-08 06:56:21.408996: step 10, loss = 2.55 (0.4 examples/sec; 19.373 sec/batch)
2017-06-08 06:59:29.431547: step 20, loss = 2.33 (0.4 examples/sec; 19.856 sec/batch)
2017-06-08 07:02:36.828205: step 30, loss = 2.33 (0.4 examples/sec; 19.014 sec/batch)
2017-06-08 07:05:46.372759: step 40, loss = 2.17 (0.4 examples/sec; 18.428 sec/batch)

You can see it runs well. While it is running, observing from the parent OS with nmon shows python consuming most of the CPU. The parent of this python process is, naturally, a docker process (the docker-containerd-shim, as the ps output below shows).


root@sys-87548:/home/u0017496# ps -ef | grep 14190 | grep -v grep
root     14190 14173 78 02:46 ?        00:00:53 /usr/bin/python /home/inception/models/inception/bazel-bin/inception/flowers_train.runfiles/inception/inception/flowers_train.py --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8

root@sys-87548:/home/u0017496# ps -ef | grep 14173 | grep -v grep
root     14173 15050  0 02:46 ?        00:00:00 docker-containerd-shim b85c9a819a6a497466ea5036a16abc036f0a26809be678224b59ad1b31646178 /var/run/docker/libcontainerd/b85c9a819a6a497466ea5036a16abc036f0a26809be678224b59ad1b31646178 docker-runc
root     14190 14173 80 02:46 ?        00:01:06 /usr/bin/python /home/inception/models/inception/bazel-bin/inception/flowers_train.runfiles/inception/inception/flowers_train.py --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8


In the next posting we will look at training inception v3 with this docker image through LSF, on this server and another. Besides this server (sys-87548), install docker on sys-87549 as well, pull the docker image there, and copy the /home/inception directory built here over to sys-87549 with scp.

root@sys-87549:/home/u0017496# docker pull bsyu/tensor_r1.0:ppc64le-xenial

root@sys-87549:/home/u0017496# scp -r sys-87548:/home/inception /home



Monday, May 15, 2017

Installing Continuum Anaconda on a Minsky server + training inception v3 with Tensorflow

Continuum's Anaconda has so far existed only for x86, but they recently released mini-conda for the ARM and POWER8 processors. The site below has the download links for the packages and the installation instructions.

https://www.continuum.io/content/conda-support-raspberry-pi-2-and-power8-le

Unfortunately the link for ppc64le points to the wrong place and returns '404 Not Found', but only the link is broken; as shown below, the files do exist. The first is for python version 2, the second for python 3.

https://repo.continuum.io/miniconda/Miniconda2-4.3.14-Linux-ppc64le.sh
https://repo.continuum.io/miniconda/Miniconda3-4.3.14-Linux-ppc64le.sh


Here we will install the python 3 version.

u0017496@sys-87250:~$ wget https://repo.continuum.io/miniconda/Miniconda3-4.3.14-Linux-ppc64le.sh
--2017-05-14 21:54:26--  https://repo.continuum.io/miniconda/Miniconda3-4.3.14-Linux-ppc64le.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.16.19.10, 104.16.18.10, 2400:cb00:2048:1::6810:120a, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.16.19.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34765794 (33M) [application/x-sh]
Saving to: ‘Miniconda3-4.3.14-Linux-ppc64le.sh’

Miniconda3-4.3.14-Linux-pp 100%[=====================================>]  33.15M  10.2MB/s    in 3.3s

2017-05-17 22:05:18 (10.0 MB/s) - ‘Miniconda3-4.3.14-Linux-ppc64le.sh’ saved [34765794/34765794]


u0017496@sys-87250:~$ chmod a+x Miniconda3-4.3.14-Linux-ppc64le.sh

Now run this shell script; you will have to answer the license agreement and enter several other values. For most of them, just press enter.

u0017496@sys-87250:~$ ./Miniconda3-4.3.14-Linux-ppc64le.sh

Welcome to Miniconda3 4.3.14 (by Continuum Analytics, Inc.)

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>

(output omitted)

[/home/u0017496/miniconda3] >>>
PREFIX=/home/u0017496/miniconda3
installing: python-3.6.0-0 ...
installing: cffi-1.9.1-py36_0 ...
installing: conda-env-2.6.0-0 ...
installing: cryptography-1.7.1-py36_0 ...
installing: idna-2.2-py36_0 ...
installing: libffi-3.2.1-1 ...
installing: openssl-1.0.2k-1 ...
installing: pyasn1-0.2.3-py36_0 ...
installing: pycosat-0.6.2-py36_0 ...
installing: pycparser-2.17-py36_0 ...
installing: pyopenssl-16.2.0-py36_0 ...
installing: requests-2.13.0-py36_0 ...
installing: ruamel_yaml-0.11.14-py36_1 ...
installing: setuptools-27.2.0-py36_0 ...
installing: six-1.10.0-py36_0 ...
installing: sqlite-3.13.0-0 ...
installing: xz-5.2.2-1 ...
installing: yaml-0.1.6-0 ...
installing: zlib-1.2.8-3 ...
installing: conda-4.3.14-py36_0 ...
installing: pip-9.0.1-py36_1 ...
installing: wheel-0.29.0-py36_0 ...
Python 3.6.0 :: Continuum Analytics, Inc.
creating default environment...
installation finished.
Do you wish the installer to prepend the Miniconda3 install location
to PATH in your /home/u0017496/.bashrc ? [yes|no]

[no] >>> yes

Prepending PATH=/home/u0017496/miniconda3/bin to PATH in /home/u0017496/.bashrc
A backup will be made to: /home/u0017496/.bashrc-miniconda3.bak


For this change to become active, you have to open a new terminal.

Thank you for installing Miniconda2!

Share your notebooks and packages on Anaconda Cloud!
Sign up for free: https://anaconda.org
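
Incidentally, if you ever need to script this installation, the Miniconda installer also supports an unattended batch mode. A minimal sketch (-b accepts the license non-interactively, -p sets the install prefix):

./Miniconda3-4.3.14-Linux-ppc64le.sh -b -p $HOME/miniconda3
export PATH=$HOME/miniconda3/bin:$PATH   # batch mode does not touch .bashrc, so set PATH yourself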


Now that Miniconda is installed, the conda command should be available. But as you can see, conda is nowhere to be found.

u0017496@sys-87250:~$ which conda

This is because, although the installer did add conda's PATH to ~/.bashrc, .bashrc has not been re-executed in this session. Source it, and you will see that conda is on your PATH.

u0017496@sys-87250:~$ . ~/.bashrc
u0017496@sys-87250:~$ which conda
/home/u0017496/miniconda3/bin/conda

You can now list the python libraries included with Anaconda as follows.

u0017496@sys-87250:~$ conda list
# packages in environment at /home/u0017496/miniconda3:
#
cffi                      1.9.1                    py36_0
conda                     4.3.14                   py36_0
conda-env                 2.6.0                         0
cryptography              1.7.1                    py36_0
idna                      2.2                      py36_0
libffi                    3.2.1                         1
openssl                   1.0.2k                        1
pip                       9.0.1                    py36_1
pyasn1                    0.2.3                    py36_0
pycosat                   0.6.2                    py36_0
pycparser                 2.17                     py36_0
pyopenssl                 16.2.0                   py36_0
python                    3.6.0                         0
requests                  2.13.0                   py36_0
ruamel_yaml               0.11.14                  py36_1
setuptools                27.2.0                   py36_0
six                       1.10.0                   py36_0
sqlite                    3.13.0                        0
wheel                     0.29.0                   py36_0
xz                        5.2.2                         1
yaml                      0.1.6                         0
zlib                      1.2.8                         3


The loop below runs conda install on every package currently in the environment, pulling in the latest builds Continuum has published for each (a simpler one-line alternative is shown after the output below):

u0017496@sys-87250:~$ for i in `conda list | awk '{print $1}' | grep -v \#`
> do
> conda install $i
> done

(omitted)
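
Note that conda can do the same thing in one step; the sketch below uses conda's --all and -y flags to upgrade everything without per-package prompts:

conda update --all -y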

Now let's check with conda search that pip, one of the packages handled above, is properly installed. As shown below, the entry marked with * is the installed one.

u0017496@sys-87250:~$ conda search pip
Fetching package metadata .........
pip                          7.1.0                    py27_0  defaults
                             7.1.0                    py34_0  defaults
                             7.1.0                    py27_1  defaults
                             7.1.0                    py34_1  defaults
                             7.1.2                    py27_0  defaults
                             7.1.2                    py34_0  defaults
                             8.1.0                    py27_0  defaults
                             8.1.0                    py34_0  defaults
                             8.1.0                    py35_0  defaults
                             8.1.2                    py27_0  defaults
                             8.1.2                    py34_0  defaults
                             8.1.2                    py35_0  defaults
                             9.0.0                    py27_0  defaults
                             9.0.0                    py34_0  defaults
                             9.0.0                    py35_0  defaults
                             9.0.1                    py27_1  defaults
                             9.0.1                    py35_1  defaults
                          *  9.0.1                    py36_1  defaults

u0017496@sys-87250:~$ which pip
/home/u0017496/miniconda3/bin/pip

u0017496@sys-87250:~$ pip --version
pip 9.0.1 from /home/u0017496/miniconda3/lib/python3.6/site-packages (python 3.6)

Let's use the pip provided by conda to install keras 2.0.4.

u0017496@sys-87250:~/miniconda3/lib$ pip install keras==2.0.4
Collecting keras==2.0.4
  Downloading Keras-2.0.4.tar.gz (199kB)
    100% |████████████████████████████████| 204kB 3.1MB/s
Collecting theano (from keras==2.0.4)
  Downloading Theano-0.9.0.tar.gz (3.1MB)
    100% |████████████████████████████████| 3.1MB 310kB/s
Collecting pyyaml (from keras==2.0.4)
  Downloading PyYAML-3.12.tar.gz (253kB)
    100% |████████████████████████████████| 256kB 3.6MB/s
Requirement already satisfied: six in ./python3.6/site-packages (from keras==2.0.4)
Requirement already satisfied: numpy>=1.9.1 in ./python3.6/site-packages (from theano->keras==2.0.4)
Requirement already satisfied: scipy>=0.14 in ./python3.6/site-packages (from theano->keras==2.0.4)
Building wheels for collected packages: keras, theano, pyyaml
  Running setup.py bdist_wheel for keras ... done
  Stored in directory: /home/u0017496/.cache/pip/wheels/48/82/42/f06a8c03a8f95ada523a81ba723e89f059693e6ad868d09727
  Running setup.py bdist_wheel for theano ... done
  Stored in directory: /home/u0017496/.cache/pip/wheels/d5/5b/93/433299b86e3e9b25f0f600e4e4ebf18e38eb7534ea518eba13
  Running setup.py bdist_wheel for pyyaml ... done
  Stored in directory: /home/u0017496/.cache/pip/wheels/2c/f7/79/13f3a12cd723892437c0cfbde1230ab4d82947ff7b3839a4fc
Successfully built keras theano pyyaml
Installing collected packages: theano, pyyaml, keras
Successfully installed keras-2.0.4 pyyaml-3.12 theano-0.9.0
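
A quick import check that the install worked (a minimal sanity test; the backend banner keras prints on import may vary):

python -c "import keras; print(keras.__version__)"   # expected to print 2.0.4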


Let's also install gensim 2.0.0 and KoNLPy.

u0017496@sys-87250:~$ pip install gensim==2.0.0
Collecting gensim==2.0.0
  Downloading gensim-2.0.0.tar.gz (14.1MB)
    100% |████████████████████████████████| 14.2MB 88kB/s
Requirement already satisfied: numpy>=1.3 in ./miniconda3/lib/python3.6/site-packages (from gensim==2.0.0)
Requirement already satisfied: scipy>=0.7.0 in ./miniconda3/lib/python3.6/site-packages (from gensim==2.0.0)
Requirement already satisfied: six>=1.5.0 in ./miniconda3/lib/python3.6/site-packages (from gensim==2.0.0)
Requirement already satisfied: smart_open>=1.2.1 in ./miniconda3/lib/python3.6/site-packages (from gensim==2.0.0)
Requirement already satisfied: boto>=2.32 in ./miniconda3/lib/python3.6/site-packages (from smart_open>=1.2.1->gensim==2.0.0)
Requirement already satisfied: bz2file in ./miniconda3/lib/python3.6/site-packages (from smart_open>=1.2.1->gensim==2.0.0)
Requirement already satisfied: requests in ./miniconda3/lib/python3.6/site-packages (from smart_open>=1.2.1->gensim==2.0.0)
Building wheels for collected packages: gensim
  Running setup.py bdist_wheel for gensim ... done
  Stored in directory: /home/u0017496/.cache/pip/wheels/e9/5f/e7/4ff23a3fe4b181b44f37eed5602f179c1cc92a0a34f337e745
Successfully built gensim
Installing collected packages: gensim
  Found existing installation: gensim 1.0.1
    Uninstalling gensim-1.0.1:
      Successfully uninstalled gensim-1.0.1
Successfully installed gensim-2.0.0


u0017496@sys-87250:~$ pip install konlpy
Collecting konlpy
  Downloading konlpy-0.4.4-py2.py3-none-any.whl (22.5MB)
    100% |████████████████████████████████| 22.5MB 57kB/s
Installing collected packages: konlpy
Successfully installed konlpy-0.4.4
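
For gensim, a toy Word2Vec run makes a convenient smoke test. The sketch below uses a made-up three-sentence corpus and illustrative parameters (in gensim 2.x, size is the embedding dimension):

python << 'EOF'
from gensim.models import Word2Vec

# tiny made-up corpus, only to verify the installation
sentences = [["dog", "barks"], ["cat", "meows"], ["dog", "runs"]]
model = Word2Vec(sentences, size=16, min_count=1)
print(model.wv["dog"].shape)   # -> (16,)
EOF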



Now let's install numpy, matplotlib, scipy, and scikit-learn via conda. Since numpy is a prerequisite of matplotlib and scipy is a prerequisite of scikit-learn, those two are installed automatically, so in practice only two conda commands are needed.


u0017496@sys-87250:~$ conda install matplotlib
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/u0017496/miniconda3:

The following NEW packages will be INSTALLED:

    cycler:          0.10.0-py36_0
    freetype:        2.5.5-2
    libpng:          1.6.27-0
    matplotlib:      2.0.2-np112py36_0
    numpy:           1.12.1-py36_0
    openblas:        0.2.19-0
    python-dateutil: 2.6.0-py36_0
    pytz:            2017.2-py36_0

Proceed ([y]/n)? y

openblas-0.2.1 100% |###########################################################| Time: 0:00:00  10.21 MB/s
libpng-1.6.27- 100% |###########################################################| Time: 0:00:00  12.75 MB/s
freetype-2.5.5 100% |###########################################################| Time: 0:00:00  10.53 MB/s
numpy-1.12.1-p 100% |###########################################################| Time: 0:00:00  15.12 MB/s
pytz-2017.2-py 100% |###########################################################| Time: 0:00:00  13.25 MB/s
cycler-0.10.0- 100% |###########################################################| Time: 0:00:00  15.61 MB/s
python-dateuti 100% |###########################################################| Time: 0:00:00   6.43 MB/s
matplotlib-2.0 100% |###########################################################| Time: 0:00:00  14.62 MB/s


u0017496@sys-87250:~$ conda install scikit-learn
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/u0017496/miniconda3:

The following NEW packages will be INSTALLED:

    scikit-learn: 0.18.1-np112py36_1
    scipy:        0.19.0-np112py36_0

Proceed ([y]/n)? y

scipy-0.19.0-n 100% |###########################################################| Time: 0:00:02  14.75 MB/s
scikit-learn-0 100% |###########################################################| Time: 0:00:00  15.56 MB/s
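
A one-liner to confirm that the whole scientific stack imports cleanly (the versions printed are simply whatever conda installed above):

python -c "import numpy, scipy, sklearn, matplotlib; print(numpy.__version__, scipy.__version__, sklearn.__version__)"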



Packages installed this way end up in /home/u0017496/miniconda3/lib/python3.6/site-packages, as shown below.

u0017496@sys-87250:~$ ls /home/u0017496/miniconda3/lib/python3.6/site-packages/
asn1crypto                         mpl_toolkits                                  python_dateutil-2.6.0-py3.6.egg-info
asn1crypto-0.22.0-py3.6.egg-info   numpy-1.12.1.dist-info                        pytz
cffi                               OpenSSL                                       pytz-2017.2-py3.6.egg-info
cffi-1.10.0-py3.6.egg-info         packaging                                     README.txt
_cffi_backend.so                   packaging-16.8-py3.6.egg-info                 requests
conda                              pip                                           requests-2.14.2-py3.6.egg-info
conda-4.3.18-py3.6.egg-info        pip-9.0.1-py3.6.egg-info                      ruamel_yaml
conda_env                          pyasn1                                        scikit_learn-0.18.1-py3.6.egg-info
cryptography                       pyasn1-0.2.3-py3.6.egg-info                   scipy
cryptography-1.8.1-py3.6.egg-info  __pycache__                                   scipy-0.19.0-py3.6.egg-info
cycler-0.10.0-py3.6.egg-info       pycosat-0.6.2-py3.6.egg-info                  setuptools-27.2.0-py3.6.egg
cycler.py                          pycosat.cpython-36m-powerpc64le-linux-gnu.so  setuptools.pth
dateutil                           pycparser                                     six-1.10.0-py3.6.egg-info
easy-install.pth                   pycparser-2.17-py3.6.egg-info                 six.py
idna                               pylab.py                                      sklearn
idna-2.5-py3.6.egg-info            pyOpenSSL-17.0.0-py3.6.egg-info               test_pycosat.py
matplotlib                         pyparsing-2.1.4-py3.6.egg-info                wheel
matplotlib-2.0.2-py3.6.egg-info    pyparsing.py                                  wheel-0.29.0-py3.6.egg-info


Accordingly, to use these packages, set PYTHONPATH as follows.

u0017496@sys-87250:~$ export PYTHONPATH=/home/u0017496/miniconda3/lib/python3.6/site-packages:$PYTHONPATH


Now let's also install bazel, tensorflow, and tensorflow-gpu with conda (as opposed to the tensorflow that ships with PowerAI).

u0017496@sys-87250:~$ conda install bazel
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/u0017496/miniconda3:

The following NEW packages will be INSTALLED:

    bazel: 0.4.5-0

Proceed ([y]/n)? y

bazel-0.4.5-0. 100% |#############################################| Time: 0:00:09  13.37 MB/s

u0017496@sys-87250:~$ conda install tensorflow
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/u0017496/miniconda3:

The following NEW packages will be INSTALLED:

    libprotobuf: 3.2.0-0
    protobuf:    3.2.0-py36_0
    tensorflow:  1.1.0-np112py36_0
    werkzeug:    0.12.2-py36_0

Proceed ([y]/n)? y

libprotobuf-3. 100% |#############################################| Time: 0:00:00  13.84 MB/s
werkzeug-0.12. 100% |#############################################| Time: 0:00:00  18.67 MB/s
protobuf-3.2.0 100% |#############################################| Time: 0:00:00  10.39 MB/s
tensorflow-1.1 100% |#############################################| Time: 0:00:01  15.16 MB/s

u0017496@sys-87250:~$ conda install tensorflow-gpu
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/u0017496/miniconda3:

The following NEW packages will be INSTALLED:

    cudatoolkit:    8.0-0
    cudnn:          6.0.21-0
    tensorflow-gpu: 1.1.0-np112py36_0

Proceed ([y]/n)? y

cudatoolkit-8. 100% |#############################################| Time: 0:00:29  11.24 MB/s
cudnn-6.0.21-0 100% |#############################################| Time: 0:00:11  15.97 MB/s
tensorflow-gpu 100% |#############################################| Time: 0:00:06  14.27 MB/s


conda list now shows the following packages installed.

u0017496@sys-87250:~$ conda list
# packages in environment at /home/u0017496/miniconda3:
#
asn1crypto                0.22.0                   py36_0
bazel                     0.4.5                         0
boto                      2.46.1                   py36_0
bz2file                   0.98                     py36_0
cffi                      1.10.0                   py36_0
conda                     4.3.18                   py36_0
conda-env                 2.6.0                         0
cryptography              1.8.1                    py36_0
cudatoolkit               8.0                           0
cudnn                     6.0.21                        0
cycler                    0.10.0                   py36_0
freetype                  2.5.5                         2
gensim                    1.0.1               np112py36_0
gensim                    2.0.0                     <pip>
idna                      2.5                      py36_0
Keras                     2.0.4                     <pip>
konlpy                    0.4.4                     <pip>
libffi                    3.2.1                         1
libpng                    1.6.27                        0
libprotobuf               3.2.0                         0
matplotlib                2.0.2               np112py36_0
numpy                     1.12.1                    <pip>
numpy                     1.12.1                   py36_0
openblas                  0.2.19                        0
openssl                   1.0.2k                        2
packaging                 16.8                     py36_0
pip                       9.0.1                    py36_1
protobuf                  3.2.0                    py36_0
pyasn1                    0.2.3                    py36_0
pycosat                   0.6.2                    py36_0
pycparser                 2.17                     py36_0
pyopenssl                 17.0.0                   py36_0
pyparsing                 2.1.4                    py36_0
python                    3.6.1                         2
python-dateutil           2.6.0                    py36_0
pytz                      2017.2                   py36_0
PyYAML                    3.12                      <pip>
requests                  2.14.2                   py36_0
ruamel_yaml               0.11.14                  py36_1
scikit-learn              0.18.1              np112py36_1
scipy                     0.19.0              np112py36_0
setuptools                27.2.0                   py36_0
six                       1.10.0                   py36_0
smart_open                1.5.2                    py36_0
sqlite                    3.13.0                        0
tensorflow                1.1.0               np112py36_0
tensorflow-gpu            1.1.0               np112py36_0
Theano                    0.9.0                     <pip>
werkzeug                  0.12.2                   py36_0
wheel                     0.29.0                   py36_0
xz                        5.2.2                         1
yaml                      0.1.6                         0
zlib                      1.2.8                         3
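
Before training, it is worth confirming that tensorflow imports and, on a GPU machine, that it actually sees the GPUs. A minimal check (device_lib is TensorFlow 1.1's device-listing helper):

python -c "import tensorflow as tf; print(tf.__version__)"
python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"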




While we are at it, let's train the inception v3 model using the tensorflow we just installed with conda. Simply follow the steps below in order.


u0017496@sys-87250:~/inception$ pwd
/home/u0017496/inception

u0017496@sys-87250:~/inception$ export INCEPTION_DIR=/home/u0017496/inception

u0017496@sys-87250:~/inception$ curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  380M  100  380M    0     0  5918k      0  0:01:05  0:01:05 --:--:-- 4233k

u0017496@sys-87250:~/inception$ tar -xvf inception-v3-2016-03-01.tar.gz
inception-v3/
inception-v3/checkpoint
inception-v3/README.txt
inception-v3/model.ckpt-157585

u0017496@sys-87250:~/inception$ git clone https://github.com/tensorflow/models.git
Cloning into 'models'...
remote: Counting objects: 4703, done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 4703 (delta 17), reused 31 (delta 11), pack-reused 4649
Receiving objects: 100% (4703/4703), 153.34 MiB | 5.62 MiB/s, done.
Resolving deltas: 100% (2374/2374), done.
Checking connectivity... done.

u0017496@sys-87250:~/inception/models/inception$ export FLOWERS_DIR=/home/u0017496/inception/models/inception

u0017496@sys-87250:~/inception/models/inception$ mkdir -p $FLOWERS_DIR/data

u0017496@sys-87250:~/inception/models/inception$ which bazel
/home/u0017496/miniconda3/bin/bazel

u0017496@sys-87250:~/inception/models/inception$ bazel build inception/download_and_preprocess_flowers
Extracting Bazel installation...
....................
INFO: Found 1 target...
Target //inception:download_and_preprocess_flowers up-to-date:
  bazel-bin/inception/download_and_preprocess_flowers
INFO: Elapsed time: 6.943s, Critical Path: 0.05s

u0017496@sys-87250:~/inception/models/inception$ export TEST_TMPDIR=/home/u0017496/.cache

u0017496@sys-87250:~/inception/models/inception$ bazel build inception/download_and_preprocess_flowers
INFO: $TEST_TMPDIR defined: output root default is '/home/u0017496/.cache'.
Extracting Bazel installation...
.............
INFO: Found 1 target...
Target //inception:download_and_preprocess_flowers up-to-date:
  bazel-bin/inception/download_and_preprocess_flowers
INFO: Elapsed time: 4.867s, Critical Path: 0.03s

u0017496@sys-87250:~/inception/models/inception$ bazel-bin/inception/download_and_preprocess_flowers $FLOWERS_DIR/data
Downloading flower data set.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  218M  100  218M    0     0  9372k      0  0:00:23  0:00:23 --:--:-- 10.1M
(omitted)
Found 3170 JPEG files across 5 labels inside /home/u0017496/inception/models/inception/data/raw-data/train.
Launching 2 threads for spacings: [[0, 1585], [1585, 3170]]
2017-05-19 05:33:44.191446 [thread 0]: Processed 1000 of 1585 images in thread batch.
2017-05-19 05:33:44.213856 [thread 1]: Processed 1000 of 1585 images in thread batch.
2017-05-19 05:33:54.902070 [thread 1]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00001-of-00002
2017-05-19 05:33:54.902172 [thread 1]: Wrote 1585 images to 1585 shards.
2017-05-19 05:33:54.911283 [thread 0]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00000-of-00002
2017-05-19 05:33:54.911360 [thread 0]: Wrote 1585 images to 1585 shards.
2017-05-19 05:33:55.171141: Finished writing all 3170 images in data set.

As you can see below, this inception v3 example is a neural network that classifies flower photos across five classes.

u0017496@sys-87250:~/inception/models/inception$ du -sm data/raw-data/train/*
29      data/raw-data/train/daisy
44      data/raw-data/train/dandelion
1       data/raw-data/train/LICENSE.txt
33      data/raw-data/train/roses
47      data/raw-data/train/sunflowers
48      data/raw-data/train/tulips

u0017496@sys-87250:~/inception/models/inception$ bazel build inception/flowers_train
INFO: $TEST_TMPDIR defined: output root default is '/home/u0017496/.cache'.
............................
INFO: Found 1 target...
Target //inception:flowers_train up-to-date:
  bazel-bin/inception/flowers_train
INFO: Elapsed time: 6.502s, Critical Path: 0.03s

At last, the preparation for training inception v3 is complete. Start the training with the following command.

u0017496@sys-87250:~/inception/models/inception$ time bazel-bin/inception/flowers_train --train_dir=$FLOWERS_DIR/train --data_dir=$FLOWERS_DIR/data --pretrained_model_checkpoint_path=$INCEPTION_DIR/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=32

NVIDIA: no NVIDIA devices found
2017-05-19 05:41:03.740213: E tensorflow/stream_executor/cuda/cuda_driver.cc:405] failed call to cuInit: CUDA_ERROR_UNKNOWN
2017-05-19 05:41:03.740670: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (sys-87250): /proc/driver/nvidia/version does not exist
2017-05-19 05:41:51.947244: Pre-trained model restored from /home/u0017496/inception/inception-v3/model.ckpt-157585
2017-05-19 05:47:22.023602: step 0, loss = 2.79 (0.2 examples/sec; 182.713 sec/batch)
2017-05-19 06:05:58.942671: step 10, loss = 2.53 (0.4 examples/sec; 78.882 sec/batch)
2017-05-19 06:19:26.875533: step 20, loss = 2.40 (0.4 examples/sec; 82.410 sec/batch)
2017-05-19 06:33:10.333275: step 30, loss = 2.20 (0.4 examples/sec; 77.844 sec/batch)
2017-05-19 06:48:27.688993: step 40, loss = 2.24 (0.3 examples/sec; 96.148 sec/batch)

real    84m30.882s
user    135m20.864s
sys     2m30.832s


A confession: the server used for this installation demo is actually a POWER8 server with no GPUs. Without a GPU the training falls back to the CPU, and as you can see, it then takes a very, very long time to complete. The output above shows roughly 0.4 examples processed per second, whereas with P100 GPUs you see on the order of 50 to 200 examples per second, depending on the number of GPUs and the batch size.
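
As a rough sanity check on those numbers: 50 steps × 32 images per batch = 1,600 images; at 0.4 examples/sec that is about 4,000 seconds, or roughly 67 minutes of pure training, which lines up with the 84-minute wall clock once model restore and input preprocessing are included. At the ~210 examples/sec of the P100 log below, the same 1,600 images take under 10 seconds.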

Below is part of the log from an inception v3 run performed earlier on a Minsky server with PowerAI installed.

2017-05-16 03:48:46.352210: Pre-trained model restored from /gpfs/gpfs_gl4_16mb/b7p088za/inception-v3/model.ckpt-157585
2017-05-16 03:52:44.322381: step 0, loss = 2.72 (17.6 examples/sec; 21.830 sec/batch)
2017-05-16 03:55:29.550791: step 10, loss = 2.57 (213.6 examples/sec; 1.797 sec/batch)
2017-05-16 03:55:47.619990: step 20, loss = 2.35 (212.1 examples/sec; 1.810 sec/batch)
2017-05-16 03:56:05.953991: step 30, loss = 2.17 (206.6 examples/sec; 1.859 sec/batch)
2017-05-16 03:56:24.306742: step 40, loss = 1.98 (209.4 examples/sec; 1.834 sec/batch)
2017-05-16 03:56:42.490063: step 50, loss = 1.92 (217.8 examples/sec; 1.763 sec/batch)
2017-05-16 03:57:00.444537: step 60, loss = 1.67 (216.6 examples/sec; 1.773 sec/batch)
2017-05-16 03:57:18.366941: step 70, loss = 1.58 (212.7 examples/sec; 1.806 sec/batch)
2017-05-16 03:57:36.467837: step 80, loss = 1.55 (213.6 examples/sec; 1.798 sec/batch)