
Monday, September 10, 2018

A quick way to test TF inception v3 using only the dog photos from the ILSVRC2012 dataset



Go into the benchmark/tensorflow directory and run exec_img.sh, as shown below. If you launch it with nohup as shown, the job keeps running in the background even if your session gets disconnected, and the run log is also captured in nohup.out, which is convenient.

[root@ac922 tensorflow]# pwd
/home/files/ilsvrc12/tensorflow

[root@ac922 tensorflow]# ls
benchmark  exec_img.sh  models  nohup.out.final.tf  output_final

[root@ac922 tensorflow]# nohup ./exec_img.sh &
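
To watch the progress from another session while the job runs in the background, you can simply tail the log file mentioned above:

[root@ac922 tensorflow]# tail -f nohup.out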

Running exec_img.sh once makes it invoke the ./models/run_model.sh script inside it with batch_size=128, once each for 4, 2, and 1 GPUs in sequence, i.e. three runs in total. As each run finishes, the elapsed time measured by the time command is recorded in nohup.out. When I ran the script originally provided by NVIDIA, each exec run seemed to work properly only with a freshly created output directory, so I added statements that rename the output directory and create a new one before every exec run, as below.

mkdir output
time  exec  tf inception3  128  4  0
mv output output4gpu
mkdir output
time  exec  tf inception3  128  2  0
mv output output2gpu
mkdir output
time  exec  tf inception3  128  1  0
mv output output1gpu
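
For reference, the same pattern can be written as a loop over the GPU counts instead of three repeated stanzas. This is only a sketch under the same assumptions as above: exec here is the NVIDIA-provided benchmark wrapper used in the original script (if a shell builtin of the same name shadows it, invoke it by path, e.g. ./exec), not the bash builtin.

#!/bin/bash
for NGPU in 4 2 1
do
    mkdir output                            # the wrapper expects a fresh output directory
    time exec tf inception3 128 ${NGPU} 0   # batch_size=128 on ${NGPU} GPU(s)
    mv output output${NGPU}gpu              # preserve this run's logs before the next run
done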


To check the results, you can look either at the logs accumulating in the output directory or at nohup.out. In this script the total images/sec is summed up by the python code itself and printed, so that is the figure to record (a small grep sketch follows). One caveat: the Elapsed Time computed in python has a logic error, so only the minutes:seconds part is correct while the hours always show as 9 hours; just ignore it.
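
A minimal sketch for collecting those figures, assuming the logs contain lines with the phrase "total images/sec" as described above (the log file layout under the output*gpu directories is an assumption; adjust the pattern and paths to your environment):

[root@ac922 tensorflow]# grep -rh "total images/sec" nohup.out output4gpu output2gpu output1gpu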

The python code and model files needed for this test have been uploaded to the google drive below.

https://drive.google.com/open?id=1DNn-Nv4rlOiv2NLqk6Y0j2ANlJjw9VP6

The dataset needed for this test, dog photos labeled by breed and packaged in tfrecord format, has also been uploaded to the google drive below.

https://drive.google.com/open?id=1rQcxAWeNbByy0Yooj6IbROyVRsdQPn5-

The process of extracting this dataset and formatting it as tfrecord is written up here:

http://hwengineer.blogspot.com/2018/04/ilsvrc2012imgtraint3tar-training-dataset.html


** Appendix: names and sizes of the tfrecord files

[root@ac922 ilsvrc12]# cd tfrecord/

[root@ac922 tfrecord]# ls -l | head
total 1509860
-rw-rw-r--. 1 1001 1001  6920780 Apr 11 19:20 train-00000-of-00120
-rw-rw-r--. 1 1001 1001  6422535 Apr 11 19:20 train-00001-of-00120
-rw-rw-r--. 1 1001 1001  6959007 Apr 11 19:21 train-00002-of-00120
-rw-rw-r--. 1 1001 1001  6885268 Apr 11 19:21 train-00003-of-00120
-rw-rw-r--. 1 1001 1001  5969364 Apr 11 19:21 train-00004-of-00120
-rw-rw-r--. 1 1001 1001  6143260 Apr 11 19:21 train-00005-of-00120
-rw-rw-r--. 1 1001 1001  6123517 Apr 11 19:21 train-00006-of-00120
-rw-rw-r--. 1 1001 1001  8585788 Apr 11 19:21 train-00007-of-00120
-rw-rw-r--. 1 1001 1001  6149957 Apr 11 19:21 train-00008-of-00120

[root@ac922 tfrecord]# ls -l | tail
-rw-rw-r--. 1 1001 1001 24124729 Apr 11 19:20 validation-00022-of-00032
-rw-rw-r--. 1 1001 1001 23741822 Apr 11 19:20 validation-00023-of-00032
-rw-rw-r--. 1 1001 1001 24759230 Apr 11 19:20 validation-00024-of-00032
-rw-rw-r--. 1 1001 1001 25225023 Apr 11 19:20 validation-00025-of-00032
-rw-rw-r--. 1 1001 1001 25273559 Apr 11 19:20 validation-00026-of-00032
-rw-rw-r--. 1 1001 1001 26820464 Apr 11 19:20 validation-00027-of-00032
-rw-rw-r--. 1 1001 1001 24115323 Apr 11 19:20 validation-00028-of-00032
-rw-rw-r--. 1 1001 1001 24459085 Apr 11 19:20 validation-00029-of-00032
-rw-rw-r--. 1 1001 1001 25246485 Apr 11 19:20 validation-00030-of-00032
-rw-rw-r--. 1 1001 1001 23561132 Apr 11 19:20 validation-00031-of-00032

[root@ac922 tfrecord]# du -sm .
1475    .


Thursday, June 8, 2017

Training inception v3 with a tensorflow docker image on POWER8 using LSF


Here we build an LSF cluster out of the sys-87548 and sys-87549 servers using the free community edition of Spectrum LSF. The sys-87548 server is both master and slave, and the sys-87549 server is the secondary master and likewise a slave.

Download the LSF community edition from the internet and extract the tar as follows. Inside you can see it contains LSF and the Platform Application Center (the pac directory). Here we install only LSF for now.

root@sys-87548:/home/u0017496# tar -zxvf lsfce10.1-ppc64le.tar.gz
lsfce10.1-ppc64le/
lsfce10.1-ppc64le/lsf/
lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall_linux_ppc64le.tar.Z
lsfce10.1-ppc64le/lsf/lsf10.1_lnx310-lib217-ppc64le.tar.Z
lsfce10.1-ppc64le/pac/
lsfce10.1-ppc64le/pac/pac10.1_basic_linux-ppc64le.tar.Z

root@sys-87548:/home/u0017496# cd lsfce10.1-ppc64le/lsf/

As shown below, LSF comes with two *.tar.Z files, but only lsf10.1_lsfinstall_linux_ppc64le.tar.Z needs to be uncompressed. Leave lsf10.1_lnx310-lib217-ppc64le.tar.Z alone, still compressed.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# ls
lsf10.1_lnx310-lib217-ppc64le.tar.Z  lsf10.1_lsfinstall_linux_ppc64le.tar.Z

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# zcat lsf10.1_lsfinstall_linux_ppc64le.tar.Z | tar xvf -

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# cd lsf10.1_lsfinstall

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ls
conf_tmpl  install.config  lap         lsf_unix_install.pdf  patchlib   README      rpm      slave.config
hostsetup  instlib         lsfinstall  patchinstall          pversions  rhostsetup  scripts

First, edit install.config as follows. Following the order in LSF_MASTER_LIST, sys-87548 is the primary master and sys-87549 the secondary master, and following LSF_ADD_SERVERS, sys-87548 and sys-87549 are each host servers, i.e. slaves. If you have only one server, that server can be both the master and the single slave (a minimal variant is sketched after the config below).

[root@sys-87538 lsf10.1_lsfinstall]# vi install.config
LSF_TOP="/usr/share/lsf"
LSF_ADMINS="u0017496"
LSF_CLUSTER_NAME="cluster1"
LSF_MASTER_LIST="sys-87548 sys-87549"
LSF_TARDIR="/home/u0017496/lsfce10.1-ppc64le/lsf"
# CONFIGURATION_TEMPLATE="DEFAULT|PARALLEL|HIGH_THROUGHPUT"
LSF_ADD_SERVERS="sys-87548 sys-87549"
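
For the single-server case mentioned above, a hypothetical minimal variant would simply list the same host in both variables, leaving the other lines unchanged:

LSF_MASTER_LIST="sys-87548"
LSF_ADD_SERVERS="sys-87548"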


Start the installation with the lsfinstall command as follows. Along the way it asks for the location of the LSF distribution tar file; this is the file we left compressed earlier. Just press 1 to take the default.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./lsfinstall -f install.config
...
Press Enter to continue viewing the license agreement, or
enter "1" to accept the agreement, "2" to decline it, "3"
to print it, "4" to read non-IBM terms, or "99" to go back
to the previous screen.
1
...
Searching LSF 10.1 distribution tar files in /home/u0017496/lsfce10.1-ppc64le/lsf Please wait ...
  1) linux3.10-glibc2.17-ppc64le
Press 1 or Enter to install this host type: 1


Once the LSF engine is installed, run the following command to configure this server as a host server, i.e. a slave. --boot="y" means the LSF daemons are started automatically when the server boots.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./hostsetup --top="/usr/share/lsf" --boot="y"
Logging installation sequence in /usr/share/lsf/log/Install.log

------------------------------------------------------------
    L S F    H O S T S E T U P    U T I L I T Y
------------------------------------------------------------
This script sets up local host (LSF server, client or slave) environment.

Setting up LSF server host "sys-87548" ...
Checking LSF installation for host "sys-87548" ... Done
grep: /etc/init/rc-sysinit.conf: No such file or directory
Copying /etc/init.d/lsf, /etc/rc3.d/S95lsf and /etc/rc3.d/K05lsf
Installing LSF RC scripts on host "sys-87548" ... Done
LSF service ports are defined in /usr/share/lsf/conf/lsf.conf.
Checking LSF service ports definition on host "sys-87548" ... Done
You are installing IBM Spectrum LSF -  Community Edition.

... Setting up LSF server host "sys-87548" is done
... LSF host setup is done.


Then put an entry in the .bashrc of root and of each LSF user so that profile.lsf is executed, as follows.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# vi /root/.bashrc
. /usr/share/lsf/conf/profile.lsf

u0017496@sys-87481:~$ vi ~/.bashrc
. /usr/share/lsf/conf/profile.lsf

Also set LSF_RSH in lsf.conf as below so that the servers use ssh, not rsh, between themselves.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# vi /usr/share/lsf/conf/lsf.conf
LSF_RSH=ssh

And register the LSF users in /etc/lsf.sudoers.

root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# sudo vi /etc/lsf.sudoers
LSB_PRE_POST_EXEC_USER=u0017496
LSF_STARTUP_PATH=/usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/etc
LSF_STARTUP_USERS="u0017496"


Passwordless ssh must be set up for the lsfadmin user (u0017496 here) as well as for root, from each server both to itself and to the other server.

u0017496@sys-87548:~$ ssh-keygen -t rsa
u0017496@sys-87548:~$ ssh-copy-id sys-87548
u0017496@sys-87548:~$ ssh-copy-id sys-87549


root@sys-87548:/home/u0017496# ssh-keygen -t rsa

root@sys-87548:/home/u0017496# ls -l /root/.ssh
total 12
-rw------- 1 root root 1595 Jun  8 03:43 authorized_keys
-rw------- 1 root root 1679 Jun  8 03:42 id_rsa
-rw-r--r-- 1 root root  396 Jun  8 03:42 id_rsa.pub

root@sys-87548:/home/u0017496# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys


root@sys-87548:/home/u0017496# cp /root/.ssh/id_rsa.pub /tmp/sys-87548.id_rsa.pub
root@sys-87548:/home/u0017496# chmod 666 /tmp/sys-87548.id_rsa.pub
root@sys-87548:/home/u0017496# exit

u0017496@sys-87548:~$ scp /tmp/sys-87548.id_rsa.pub sys-87549:/tmp
sys-87548.id_rsa.pub                                                         100%  396     0.4KB/s   00:00

root@sys-87548:/home/u0017496# cat /tmp/sys-87549.id_rsa.pub >> /root/.ssh/authorized_keys
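
The reverse direction is assumed but not shown in full above: the /tmp/sys-87549.id_rsa.pub file appended in the last command would come from performing the mirror-image steps on sys-87549, roughly as sketched here.

root@sys-87549:/home/u0017496# ssh-keygen -t rsa
root@sys-87549:/home/u0017496# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
root@sys-87549:/home/u0017496# cp /root/.ssh/id_rsa.pub /tmp/sys-87549.id_rsa.pub
root@sys-87549:/home/u0017496# chmod 666 /tmp/sys-87549.id_rsa.pub
u0017496@sys-87549:~$ scp /tmp/sys-87549.id_rsa.pub sys-87548:/tmp
root@sys-87549:/home/u0017496# cat /tmp/sys-87548.id_rsa.pub >> /root/.ssh/authorized_keys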


Now start the LSF daemons with the lsfstartup command.

root@sys-87548:/home/u0017496# lsfstartup

root@sys-87548:/home/u0017496# lsid
IBM Spectrum LSF Community Edition 10.1.0.0, Jun 15 2016
Copyright IBM Corp. 1992, 2016. All rights reserved.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

My cluster name is cluster1
My master name is sys-87548.dal-ebis.ihost.com


Note that since LSF is installed only on sys-87548 so far, the LSF daemons see sys-87549 as being down.

u0017496@sys-87548:~$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
sys-87548.dal-ebis ok              -      1      0      0      0      0      0
sys-87549.dal-ebis unavail         -      1      0      0      0      0      0

Now submit a job that trains inception v3 using the tensorflow docker image with the bsub command, as follows. By default it goes into the normal queue. Issue the same command twice more, so that three training jobs are sitting in the queue.

u0017496@sys-87548:~$ bsub -n 1 sudo docker run  --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <1> is submitted to queue <normal>.
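
If you would rather have the job's stdout written to a file, LSF's standard bsub -o option can be added; %J expands to the job ID. The docker arguments are the same as above and elided here:

u0017496@sys-87548:~$ bsub -n 1 -o /tmp/inception.%J.out sudo docker run --rm ... --batch_size=8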

You can see that of the 3 jobs, 1 is running and 2 are pending.

u0017496@sys-87548:~$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
owners           43  Open:Active       -    -    -    -     0     0     0     0
priority         43  Open:Active       -    -    -    -     0     0     0     0
night            40  Open:Active       -    -    -    -     0     0     0     0
short            35  Open:Active       -    -    -    -     0     0     0     0
normal           30  Open:Active       -    -    -    -     3     2     1     0
interactive      30  Open:Active       -    -    -    -     0     0     0     0
idle             20  Open:Active       -    -    -    -     0     0     0     0

Now let's submit the same job again, this time to the short queue rather than the default normal queue.

u0017496@sys-87548:~$ bsub -n 1 -q short sudo docker run  --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <4> is submitted to queue <short>.


u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56
3       u001749 PEND  normal     sys-87548.d             help       Jun  8 04:02

Job 3, which was submitted by mistake, can be killed with the bkill command below.

u0017496@sys-87548:~$ bkill 3
Job <3> is being terminated

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56

A job in the middle of running can be suspended for a while with the bstop command. Its in-flight state is written to disk in temporary files so that it can be resumed later.

u0017496@sys-87548:~$ bstop 1
Job <1> is being stopped

u0017496@sys-87548:~$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
owners           43  Open:Active       -    -    -    -     0     0     0     0
priority         43  Open:Active       -    -    -    -     0     0     0     0
night            40  Open:Active       -    -    -    -     0     0     0     0
short            35  Open:Active       -    -    -    -     1     1     0     0
normal           30  Open:Active       -    -    -    -     2     1     0     1
interactive      30  Open:Active       -    -    -    -     0     0     0     0
idle             20  Open:Active       -    -    -    -     0     0     0     0

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 USUSP normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56

A suspended job can be resumed with the bresume command. At that point some I/O occurs while the temporary files are read back from disk.

u0017496@sys-87548:~$ bresume 1
Job <1> is being resumed

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56

u0017496@sys-87548:~$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
owners           43  Open:Active       -    -    -    -     0     0     0     0
priority         43  Open:Active       -    -    -    -     0     0     0     0
night            40  Open:Active       -    -    -    -     0     0     0     0
short            35  Open:Active       -    -    -    -     1     1     0     0
normal           30  Open:Active       -    -    -    -     2     1     1     0
interactive      30  Open:Active       -    -    -    -     0     0     0     0
idle             20  Open:Active       -    -    -    -     0     0     0     0

u0017496@sys-87548:~$ bsub -n 1 -q short sudo docker run  --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <5> is submitted to queue <short>.

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
5       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:14
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56


If one of the waiting jobs needs to be handled quickly, the btop command can move it to the top position in the pending queue.

u0017496@sys-87548:~$ btop 5
Job <5> has been moved to position 1 from top.

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54
5       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:14
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
2       u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 03:56

You can peek at the standard output of a currently running job with the bpeek command. It is not necessarily displayed in real time; some output may only appear after a bit of buffering.

u0017496@sys-87548:~$ bpeek 1
<< output from stdout >>
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: d61a01411a1c
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1065] LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1066] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (d61a01411a1c): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 07:56:19.985253: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 07:59:19.471085: step 0, loss = 2.89 (0.1 examples/sec; 55.400 sec/batch)
2017-06-08 08:04:13.045726: step 10, loss = 2.46 (0.4 examples/sec; 19.288 sec/batch)
2017-06-08 08:07:22.977419: step 20, loss = 2.49 (0.4 examples/sec; 19.100 sec/batch)
2017-06-08 08:10:37.519711: step 30, loss = 2.18 (0.4 examples/sec; 20.120 sec/batch)
2017-06-08 08:13:48.625858: step 40, loss = 2.28 (0.4 examples/sec; 20.342 sec/batch)

The bhist command shows a brief history of each job, and a detailed one with the -l option.

u0017496@sys-87548:~$ bhist
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
2       u001749 *_size=8  1275    0       106     0       0       0       1381
4       u001749 *_size=8  775     0       0       0       0       0       775
5       u001749 *_size=8  321     0       0       0       0       0       321

u0017496@sys-87548:~$ bhist -l

Job <2>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
                     home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
                     nial /home/inception/models/inception/bazel-bin/inception/
                     flowers_train --train_dir=/home/inception/models/inception
                     /train --data_dir=/home/inception/models/inception/data --
                     pretrained_model_checkpoint_path=/home/inception/inception
                     -v3/model.ckpt-157585 --fine_tune=True --initial_learning_
                     rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
                     um_gpus 1 --batch_size=8>
Thu Jun  8 03:56:31: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
                     ue <normal>, CWD <$HOME>;
Thu Jun  8 04:13:10: Job moved to position 1 relative to <top> by user or admin
                     istrator <u0017496>;
Thu Jun  8 04:17:46: Dispatched 1 Task(s) on Host(s) <sys-87548.dal-ebis.ihost.
                     com>, Allocated 1 Slot(s) on Host(s) <sys-87548.dal-ebis.i
                     host.com>, Effective RES_REQ <select[type == local] order[
                     r15s:pg] >;
Thu Jun  8 04:17:46: Starting (Pid 25530);
Thu Jun  8 04:17:46: Running with execution home </home/u0017496>, Execution CW
                     D </home/u0017496>, Execution Pid <25530>;

Summary of time in seconds spent in various states by  Thu Jun  8 04:19:36
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  1275     0        110      0        0        0        1385
------------------------------------------------------------------------------

Job <4>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
                     home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
                     nial /home/inception/models/inception/bazel-bin/inception/
                     flowers_train --train_dir=/home/inception/models/inception
                     /train --data_dir=/home/inception/models/inception/data --
                     pretrained_model_checkpoint_path=/home/inception/inception
                     -v3/model.ckpt-157585 --fine_tune=True --initial_learning_
                     rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
                     um_gpus 1 --batch_size=8>
Thu Jun  8 04:06:37: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
                     ue <short>, CWD <$HOME>;

Summary of time in seconds spent in various states by  Thu Jun  8 04:19:36
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  779      0        0        0        0        0        779
------------------------------------------------------------------------------

Job <5>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
                     home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
                     nial /home/inception/models/inception/bazel-bin/inception/
                     flowers_train --train_dir=/home/inception/models/inception
                     /train --data_dir=/home/inception/models/inception/data --
                     pretrained_model_checkpoint_path=/home/inception/inception
                     -v3/model.ckpt-157585 --fine_tune=True --initial_learning_
                     rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
                     um_gpus 1 --batch_size=8>
Thu Jun  8 04:14:11: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
                     ue <short>, CWD <$HOME>;
Thu Jun  8 04:14:34: Job moved to position 1 relative to <top> by user or admin
                     istrator <u0017496>;

Summary of time in seconds spent in various states by  Thu Jun  8 04:19:36
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  325      0        0        0        0        0        325

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
2       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:56
5       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:14
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06

u0017496@sys-87548:~$ bjobs -a
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
2       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:56
5       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:14
4       u001749 PEND  short      sys-87548.d             *ch_size=8 Jun  8 04:06
3       u001749 EXIT  normal     sys-87548.d    -        help       Jun  8 04:02
1       u001749 DONE  normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 03:54



Now bring the second server, sys-87549, online as well: install LSF using the same install.config and run the same hostsetup command there, as sketched below. Afterwards, the bhosts command shows that sys-87548, which has no spare capacity because a job is already running on it, now appears as closed.
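
A sketch of that sequence on the second server, assuming the same file locations as on sys-87548:

root@sys-87549:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./lsfinstall -f install.config
root@sys-87549:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./hostsetup --top="/usr/share/lsf" --boot="y"

The daemons on sys-87549 can then be brought up, for example by running lsfstartup again from the master, which reaches the remote host through the LSF_RSH=ssh setting configured earlier.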

u0017496@sys-87548:~$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
sys-87548.dal-ebis closed          -      1      1      1      0      0      0
sys-87549.dal-ebis ok              -      1      0      0      0      0      0

u0017496@sys-87548:~$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
owners           43  Open:Active       -    -    -    -     0     0     0     0
priority         43  Open:Active       -    -    -    -     0     0     0     0
night            40  Open:Active       -    -    -    -     0     0     0     0
short            35  Open:Active       -    -    -    -     1     0     1     0
normal           30  Open:Active       -    -    -    -     0     0     0     0
interactive      30  Open:Active       -    -    -    -     0     0     0     0
idle             20  Open:Active       -    -    -    -     0     0     0     0

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5       u001749 RUN   short      sys-87548.d sys-87548.d *ch_size=8 Jun  8 04:14

Now, in this state, submit 4 more training jobs using the docker image with bsub.

u0017496@sys-87548:~$ bsub -n 1  sudo docker run  --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <8> is submitted to default queue <normal>.
... (the same command repeated 4 times in total)
Job <11> is submitted to default queue <normal>.


Since there are now 2 host servers, you can see 2 jobs running at the same time.

u0017496@sys-87548:~$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
owners           43  Open:Active       -    -    -    -     0     0     0     0
priority         43  Open:Active       -    -    -    -     0     0     0     0
night            40  Open:Active       -    -    -    -     0     0     0     0
short            35  Open:Active       -    -    -    -     0     0     0     0
normal           30  Open:Active       -    -    -    -     4     2     2     0
interactive      30  Open:Active       -    -    -    -     0     0     0     0
idle             20  Open:Active       -    -    -    -     0     0     0     0

u0017496@sys-87548:~$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
sys-87548.dal-ebis closed          -      1      1      1      0      0      0
sys-87549.dal-ebis closed          -      1      1      1      0      0      0

With bjobs you can see that job 8 is running on the sys-87548 server and job 9 on the sys-87549 server.

u0017496@sys-87548:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
8       u001749 RUN   normal     sys-87548.d sys-87548.d *ch_size=8 Jun  8 05:57
9       u001749 RUN   normal     sys-87548.d sys-87549.d *ch_size=8 Jun  8 05:57
10      u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 05:57
11      u001749 PEND  normal     sys-87548.d             *ch_size=8 Jun  8 05:57

u0017496@sys-87548:~$ bqueues -l normal

QUEUE: normal
  -- For normal low priority jobs, running only if hosts are lightly loaded.  This is the default queue.

PARAMETERS/STATISTICS
PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV
 30    0  Open:Active       -    -    -    -     4     2     2     0     0    0
Interval for a host to accept two jobs is 0 seconds

SCHEDULING PARAMETERS
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

SCHEDULING POLICIES:  FAIRSHARE  NO_INTERACTIVE
USER_SHARES:  [default, 1]

SHARE_INFO_FOR: normal/
 USER/GROUP   SHARES  PRIORITY  STARTED  RESERVED  CPU_TIME  RUN_TIME   ADJUST
u0017496        1       0.111      2        0        11.2      116       0.000

USERS: all
HOSTS:  all


u0017496@sys-87548:~$ bjobs -l

Job <8>, User <u0017496>, Project <default>, Status <RUN>, Queue <normal>, Comm
                     and <sudo docker run --rm -v /home/inception:/home/incepti
                     on bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/
                     inception/bazel-bin/inception/flowers_train --train_dir=/h
                     ome/inception/models/inception/train --data_dir=/home/ince
                     ption/models/inception/data --pretrained_model_checkpoint_
                     path=/home/inception/inception-v3/model.ckpt-157585 --fine
                     _tune=True --initial_learning_rate=0.001 -input_queue_memo
                     ry_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>, S
                     hare group charged </u0017496>
Thu Jun  8 05:57:23: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
                     HOME>;
Thu Jun  8 05:57:24: Started 1 Task(s) on Host(s) <sys-87548.dal-ebis.ihost.com
                     >, Allocated 1 Slot(s) on Host(s) <sys-87548.dal-ebis.ihos
                     t.com>, Execution Home </home/u0017496>, Execution CWD </h
                     ome/u0017496>;
Thu Jun  8 05:57:40: Resource usage collected.
                     MEM: 33 Mbytes;  SWAP: 85 Mbytes;  NTHREAD: 11
                     PGID: 29697;  PIDs: 29697 29699 29701 29702


 MEMORY USAGE:
 MAX MEM: 33 Mbytes;  AVG MEM: 33 Mbytes

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
------------------------------------------------------------------------------

Job <9>, User <u0017496>, Project <default>, Status <RUN>, Queue <normal>, Comm
                     and <sudo docker run --rm -v /home/inception:/home/incepti
                     on bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/
                     inception/bazel-bin/inception/flowers_train --train_dir=/h
                     ome/inception/models/inception/train --data_dir=/home/ince
                     ption/models/inception/data --pretrained_model_checkpoint_
                     path=/home/inception/inception-v3/model.ckpt-157585 --fine
                     _tune=True --initial_learning_rate=0.001 -input_queue_memo
                     ry_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>, S
                     hare group charged </u0017496>
Thu Jun  8 05:57:27: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
                     HOME>;
Thu Jun  8 05:57:28: Started 1 Task(s) on Host(s) <sys-87549.dal-ebis.ihost.com
                     >, Allocated 1 Slot(s) on Host(s) <sys-87549.dal-ebis.ihos
                     t.com>, Execution Home </home/u0017496>, Execution CWD </h
                     ome/u0017496>;
Thu Jun  8 05:58:30: Resource usage collected.
                     MEM: 33 Mbytes;  SWAP: 148 Mbytes;  NTHREAD: 11
                     PGID: 14493;  PIDs: 14493 14494 14496 14497


 MEMORY USAGE:
 MAX MEM: 33 Mbytes;  AVG MEM: 23 Mbytes

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
------------------------------------------------------------------------------

Job <10>, User <u0017496>, Project <default>, Status <PEND>, Queue <normal>, Co
                     mmand <sudo docker run --rm -v /home/inception:/home/incep
                     tion bsyu/tensor_r1.0:ppc64le-xenial /home/inception/model
                     s/inception/bazel-bin/inception/flowers_train --train_dir=
                     /home/inception/models/inception/train --data_dir=/home/in
                     ception/models/inception/data --pretrained_model_checkpoin
                     t_path=/home/inception/inception-v3/model.ckpt-157585 --fi
                     ne_tune=True --initial_learning_rate=0.001 -input_queue_me
                     mory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>
Thu Jun  8 05:57:31: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
                     HOME>;
 PENDING REASONS:
 Job slot limit reached: 2 hosts;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
------------------------------------------------------------------------------

Job <11>, User <u0017496>, Project <default>, Status <PEND>, Queue <normal>, Co
                     mmand <sudo docker run --rm -v /home/inception:/home/incep
                     tion bsyu/tensor_r1.0:ppc64le-xenial /home/inception/model
                     s/inception/bazel-bin/inception/flowers_train --train_dir=
                     /home/inception/models/inception/train --data_dir=/home/in
                     ception/models/inception/data --pretrained_model_checkpoin
                     t_path=/home/inception/inception-v3/model.ckpt-157585 --fi
                     ne_tune=True --initial_learning_rate=0.001 -input_queue_me
                     mory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>
Thu Jun  8 05:57:33: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
                     HOME>;
 PENDING REASONS:
 Job slot limit reached: 2 hosts;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

Building a tensorflow docker image on POWER8 and running inception v3 training with it

First, pull the ppc64le-based Ubuntu 16.04 LTS docker image with the cuda8 and cudnn5-devel packages installed, which was built earlier, from docker hub.

root@sys-87548:/home/u0017496# docker pull bsyu/cuda8-cudnn5-devel:cudnn5-devel
cudnn5-devel: Pulling from bsyu/cuda8-cudnn5-devel
ffa99da61f7b: Extracting 41.78 MB/72.3 MB
6b239e02a89e: Download complete
aecbc9abccdc: Downloading 110.8 MB/415.3 MB
8f458a3f0497: Download complete
4903f7ce6675: Download complete
0c588ac98d19: Downloading   107 MB/450.9 MB
12e624e884fc: Download complete
18dd28bbb571: Downloading 45.37 MB/103.2 MB
...

On top of this, build a docker image with the tensorflow included in PowerAI installed. First, write the dockerfile as follows.

root@sys-87548:/home/u0017496# vi dockerfile.tensorflow

FROM bsyu/cuda8-cudnn5-devel:cudnn5-devel
RUN apt-get update && apt-get install -y nvidia-modprobe

RUN mkdir /tmp/temp
COPY libcudnn5* /tmp/temp/
COPY cuda-repo* /tmp/temp/
COPY mldl-repo* /tmp/temp/

RUN dpkg -i /tmp/temp/cuda-repo*deb && \
    dpkg -i /tmp/temp/libcudnn5*deb && \
    dpkg -i /tmp/temp/mldl-repo*deb && \
    rm -rf /tmp/temp && \
    apt-get update && apt-get install -y tensorflow && \
    rm -rf /var/lib/apt/lists/* && \
    dpkg -r mldl-repo-local

# set the working directory
WORKDIR /opt/DL/caffe-nv/bin
ENV LD_LIBRARY_PATH="/usr/local/nvidia/lib64:/usr/local/cuda-8.0/targets/ppc64le-linux/lib/stubs:/usr/lib/powerpc64le-linux-gnu/stubs:/usr/lib/powerpc64le-linux-gnu:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib"
ENV PATH="/opt/ibm/xlC/current/bin:/opt/ibm/xlf/current/bin:/opt/at10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/DL/bazel/bin:/opt/DL/tensorflow/bin"
ENV PYTHONPATH="/opt/DL/tensorflow/lib/python2.7/site-packages"

CMD /bin/bash

Now build the docker image bsyu/tensor_r1.0:ppc64le-xenial from this dockerfile.

root@sys-87548:/home/u0017496# docker build -t bsyu/tensor_r1.0:ppc64le-xenial -f dockerfile.tensorflow .
Sending build context to Docker daemon 3.436 GB
Step 1 : FROM bsyu/cuda8-cudnn5-devel:cudnn5-devel
 ---> d8d0da2fbdf2
Step 2 : RUN apt-get update && apt-get install -y nvidia-modprobe
 ---> Running in 204fe4e2c5f6
Ign:1 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  InRelease
Get:2 http://ports.ubuntu.com/ubuntu-ports xenial InRelease [247 kB]
Get:3 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  Release [565 B]
Get:4 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  Release.gpg [819 B]
Get:5 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  Packages [24.9 kB]
Get:6 http://ports.ubuntu.com/ubuntu-ports xenial-updates InRelease [102 kB]
Get:7 http://ports.ubuntu.com/ubuntu-ports xenial-security InRelease [102 kB]
Get:8 http://ports.ubuntu.com/ubuntu-ports xenial/main ppc64el Packages [1470 kB]
Get:9 http://ports.ubuntu.com/ubuntu-ports xenial/universe ppc64el Packages [9485 kB]
Get:10 http://ports.ubuntu.com/ubuntu-ports xenial/multiverse ppc64el Packages [152 kB]
Get:11 http://ports.ubuntu.com/ubuntu-ports xenial-updates/main ppc64el Packages [613 kB]
Get:12 http://ports.ubuntu.com/ubuntu-ports xenial-updates/universe ppc64el Packages [528 kB]
Get:13 http://ports.ubuntu.com/ubuntu-ports xenial-updates/multiverse ppc64el Packages [5465 B]
Get:14 http://ports.ubuntu.com/ubuntu-ports xenial-security/main ppc64el Packages [286 kB]
Get:15 http://ports.ubuntu.com/ubuntu-ports xenial-security/universe ppc64el Packages [138 kB]
Fetched 13.2 MB in 10s (1230 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
  nvidia-modprobe
0 upgraded, 1 newly installed, 0 to remove and 83 not upgraded.
Need to get 16.3 kB of archives.
After this operation, 85.0 kB of additional disk space will be used.
Get:1 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  nvidia-modprobe 375.51-0ubuntu1 [16.3 kB]
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
Fetched 16.3 kB in 0s (191 kB/s)
Selecting previously unselected package nvidia-modprobe.
(Reading database ... 17174 files and directories currently installed.)
Preparing to unpack .../nvidia-modprobe_375.51-0ubuntu1_ppc64el.deb ...
Unpacking nvidia-modprobe (375.51-0ubuntu1) ...
Setting up nvidia-modprobe (375.51-0ubuntu1) ...
 ---> 5411319bbc05
Removing intermediate container 204fe4e2c5f6
Step 3 : RUN mkdir /tmp/temp
 ---> Running in cf13b03845f1
 ---> 66b2b250777f
Removing intermediate container cf13b03845f1
Step 4 : COPY libcudnn5* /tmp/temp/
 ---> 16d921e53451
Removing intermediate container 9d1efa9ed269
Step 5 : COPY cuda-repo* /tmp/temp/
...
Step 9 : ENV LD_LIBRARY_PATH "/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib"
 ---> Running in fe30af7c944e
 ---> f5faa1760ac7
Removing intermediate container fe30af7c944e
Step 10 : ENV PATH "/opt/ibm/xlC/current/bin:/opt/ibm/xlf/current/bin:/opt/at10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/DL/bazel/bin:/opt/DL/tensorflow/bin"
 ---> Running in 98a0e5bfd008
 ---> 7cfb0feaaee1
Removing intermediate container 98a0e5bfd008
Step 11 : ENV PYTHONPATH "/opt/DL/tensorflow/lib/python2.7/site-packages"
 ---> Running in d98d5352108e
 ---> affda7b26276
Removing intermediate container d98d5352108e
Step 12 : CMD /bin/bash
 ---> Running in d54a20fb7e3c
 ---> 4692368fb7ad
Removing intermediate container d54a20fb7e3c
Successfully built 4692368fb7ad


Check the docker image that was just built.

root@sys-87548:/home/u0017496# docker images
REPOSITORY                TAG                 IMAGE ID            CREATED             SIZE
bsyu/tensor_r1.0          ppc64le-xenial      4692368fb7ad        3 minutes ago       6.448 GB
nvidia-docker             deb                 2830f66f0418        41 hours ago        429.8 MB
nvidia-docker             build               fa764787622c        41 hours ago        1.014 GB
ppc64le/ubuntu            14.04               0e6701cbf611        2 weeks ago         228.5 MB
bsyu/cuda8-cudnn5-devel   cudnn5-devel        d8d0da2fbdf2        4 months ago        1.895 GB
ppc64le/golang            1.6.3               6a579d02d32f        9 months ago        704.7 MB
golang                    1.5                 99668503de15        10 months ago       725.3 MB


Push this docker image to docker hub so it can be used on other servers later.

root@sys-87548:/home/u0017496# docker push bsyu/tensor_r1.0:ppc64le-xenial
The push refers to a repository [docker.io/bsyu/tensor_r1.0]
f42db0829239: Pushed
6a6b4d4d9d2a: Pushing 184.1 MB/2.738 GB
6458d0633f20: Pushing 172.7 MB/390.2 MB
726e25ffdf3c: Pushing 173.2 MB/1.321 GB
1535936ab54b: Pushed
bc0917851737: Pushed
9a1e25cd5998: Pushed
c0fe73e43621: Mounted from bsyu/cuda8-cudnn5-devel
4ce979019d1d: Mounted from bsyu/cuda8-cudnn5-devel
724befd94678: Mounted from bsyu/cuda8-cudnn5-devel
84f99f1bf79b: Mounted from bsyu/cuda8-cudnn5-devel
7f7c1dccec82: Mounted from bsyu/cuda8-cudnn5-devel
5b8880a35736: Mounted from bsyu/cuda8-cudnn5-devel
41b97cb9a404: Mounted from bsyu/cuda8-cudnn5-devel
08f34ce6b3fb: Mounted from bsyu/cuda8-cudnn5-devel


Now that the docker image is ready, prepare to run inception v3 with tensorflow. Using bazel, build the python package that will run the inception v3 training under the /home/inception directory. This directory will later be mounted into the docker image.

root@sys-87548:/home# mkdir inception
root@sys-87548:/home# export INCEPTION_DIR=/home/inception

root@sys-87548:/home# cd inception/
root@sys-87548:/home/inception# curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  380M  100  380M    0     0  4205k      0  0:01:32  0:01:32 --:--:-- 4988k

root@sys-87548:/home/inception# tar -xvf inception-v3-2016-03-01.tar.gz
inception-v3/
inception-v3/checkpoint
inception-v3/README.txt
inception-v3/model.ckpt-157585

root@sys-87548:/home/inception# git clone https://github.com/tensorflow/models.git
Cloning into 'models'...
remote: Counting objects: 4866, done.
remote: Total 4866 (delta 0), reused 0 (delta 0), pack-reused 4866
Receiving objects: 100% (4866/4866), 153.36 MiB | 5.23 MiB/s, done.
Resolving deltas: 100% (2467/2467), done.
Checking connectivity... done.

root@sys-87548:/home/inception# export FLOWERS_DIR=/home/inception/models/inception
root@sys-87548:/home/inception# mkdir -p $FLOWERS_DIR/data
root@sys-87548:/home/inception# cd models/inception/
root@sys-87548:/home/inception/models/inception# . /opt/DL/bazel/bin/bazel-activate
root@sys-87548:/home/inception/models/inception# . /opt/DL/tensorflow/bin/tensorflow-activate

root@sys-87548:/home/inception/models/inception# export TEST_TMPDIR=/home/inception/.cache

root@sys-87548:/home/inception/models/inception# bazel build inception/download_and_preprocess_flowers
INFO: $TEST_TMPDIR defined: output root default is '/home/inception/.cache'.
Extracting Bazel installation...
..............
INFO: Found 1 target...
Target //inception:download_and_preprocess_flowers up-to-date:
  bazel-bin/inception/download_and_preprocess_flowers
INFO: Elapsed time: 5.831s, Critical Path: 0.02s

root@sys-87548:/home/inception/models/inception# ls -l
total 76
lrwxrwxrwx 1 root root   116 Jun  8 02:36 bazel-bin -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/bin
lrwxrwxrwx 1 root root   121 Jun  8 02:36 bazel-genfiles -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/genfiles
lrwxrwxrwx 1 root root    86 Jun  8 02:36 bazel-inception -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception
lrwxrwxrwx 1 root root    96 Jun  8 02:36 bazel-out -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out
lrwxrwxrwx 1 root root   121 Jun  8 02:36 bazel-testlogs -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/testlogs
drwxr-xr-x 2 root root  4096 Jun  8 02:32 data
drwxr-xr-x 2 root root  4096 Jun  8 02:29 g3doc
drwxr-xr-x 4 root root  4096 Jun  8 02:29 inception
-rw-r--r-- 1 root root 38480 Jun  8 02:29 README.md
-rw-r--r-- 1 root root    30 Jun  8 02:29 WORKSPACE


root@sys-87548:/home/inception/models/inception# bazel-bin/inception/download_and_preprocess_flowers $FLOWERS_DIR/data
Downloading flower data set.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  218M  100  218M    0     0  4649k      0  0:00:48  0:00:48 --:--:-- 5105k
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
...
Found 3170 JPEG files across 5 labels inside /home/u0017496/inception/models/inception/data/raw-data/train.
Launching 2 threads for spacings: [[0, 1585], [1585, 3170]]
2017-06-08 02:01:56.169564 [thread 1]: Processed 1000 of 1585 images in thread batch.
2017-06-08 02:01:56.268917 [thread 0]: Processed 1000 of 1585 images in thread batch.
2017-06-08 02:02:01.252583 [thread 1]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00001-of-00002
2017-06-08 02:02:01.252638 [thread 1]: Wrote 1585 images to 1585 shards.
2017-06-08 02:02:01.306138 [thread 0]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00000-of-00002
2017-06-08 02:02:01.306178 [thread 0]: Wrote 1585 images to 1585 shards.
2017-06-08 02:02:01.578737: Finished writing all 3170 images in data set.

Inception v3 here is a model that recognizes flower photos by type, as shown below.

root@sys-87548:/home/inception/models/inception# du -sm data/raw-data/train/*
29      data/raw-data/train/daisy
43      data/raw-data/train/dandelion
1       data/raw-data/train/LICENSE.txt
34      data/raw-data/train/roses
47      data/raw-data/train/sunflowers
48      data/raw-data/train/tulips

Now build flowers_train with bazel.

root@sys-87548:/home/inception/models/inception# bazel build inception/flowers_train
INFO: Found 1 target...
Target //inception:flowers_train up-to-date:
  bazel-bin/inception/flowers_train
INFO: Elapsed time: 0.311s, Critical Path: 0.03s

Preparation is now complete. Before running it in docker, run this flowers_train directly as-is.

root@sys-87548:/home/inception/models/inception# time bazel-bin/inception/flowers_train --train_dir=$FLOWERS_DIR/train --data_dir=$FLOWERS_DIR/data --pretrained_model_checkpoint_path=$INCEPTION_DIR/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8

This server has cuda installed but no GPU, so the error messages below are to be expected. With no GPU, the job simply runs on the CPU. Since it takes 20+ minutes, I will cut it off partway through.

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
NVIDIA: no NVIDIA devices found
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (sys-87548): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 02:41:53.587744: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 02:44:28.213350: step 0, loss = 2.85 (0.2 examples/sec; 38.569 sec/batch)
...


Now let's run inception v3 using the docker image named bsyu/tensor_r1.0:ppc64le-xenial that we built earlier. The actual flowers_train lives under /home/inception, so mount that directory inside the docker image with the -v option.

root@sys-87548:/home/inception/models/inception# docker run  --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
...
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (b85c9a819a6a): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 06:48:27.996200: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 06:51:10.935895: step 0, loss = 2.83 (0.2 examples/sec; 39.389 sec/batch)
2017-06-08 06:56:21.408996: step 10, loss = 2.55 (0.4 examples/sec; 19.373 sec/batch)
2017-06-08 06:59:29.431547: step 20, loss = 2.33 (0.4 examples/sec; 19.856 sec/batch)
2017-06-08 07:02:36.828205: step 30, loss = 2.33 (0.4 examples/sec; 19.014 sec/batch)
2017-06-08 07:05:46.372759: step 40, loss = 2.17 (0.4 examples/sec; 18.428 sec/batch)

You can see it runs well. While it is running, observing from the parent OS with nmon shows python consuming most of the CPU. The parent of this python process is, naturally, a docker process (the docker-containerd-shim, as the ps output below shows).


root@sys-87548:/home/u0017496# ps -ef | grep 14190 | grep -v grep
root     14190 14173 78 02:46 ?        00:00:53 /usr/bin/python /home/inception/models/inception/bazel-bin/inception/flowers_train.runfiles/inception/inception/flowers_train.py --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8

root@sys-87548:/home/u0017496# ps -ef | grep 14173 | grep -v grep
root     14173 15050  0 02:46 ?        00:00:00 docker-containerd-shim b85c9a819a6a497466ea5036a16abc036f0a26809be678224b59ad1b31646178 /var/run/docker/libcontainerd/b85c9a819a6a497466ea5036a16abc036f0a26809be678224b59ad1b31646178 docker-runc
root     14190 14173 80 02:46 ?        00:01:06 /usr/bin/python /home/inception/models/inception/bazel-bin/inception/flowers_train.runfiles/inception/inception/flowers_train.py --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8


In the next posting we will look at training inception v3 with this docker image through LSF, on this server and another. Besides this server (sys-87548), install docker on sys-87549 as well, pull the docker image there, and copy the /home/inception directory built here over to sys-87549 with scp.

root@sys-87549:/home/u0017496# docker pull bsyu/tensor_r1.0:ppc64le-xenial

root@sys-87549:/home/u0017496# scp -r sys-87548:/home/inception /home



Monday, May 15, 2017

Installing Continuum Anaconda on a Minsky server + training inception v3 with Tensorflow

Continuum's Anaconda has so far existed only for x86, but they recently released mini-conda for the ARM and POWER8 processors. The site below has the download links for the packages and the installation instructions.

https://www.continuum.io/content/conda-support-raspberry-pi-2-and-power8-le

Unfortunately the link for ppc64le points to the wrong place and returns '404 Not Found', but only the link is broken; as shown below, the files do exist. The first is for python version 2, the second for python 3.

https://repo.continuum.io/miniconda/Miniconda2-4.3.14-Linux-ppc64le.sh
https://repo.continuum.io/miniconda/Miniconda3-4.3.14-Linux-ppc64le.sh


Here we will install the python 3 version.

u0017496@sys-87250:~$ wget https://repo.continuum.io/miniconda/Miniconda3-4.3.14-Linux-ppc64le.sh
--2017-05-14 21:54:26--  https://repo.continuum.io/miniconda/Miniconda3-4.3.14-Linux-ppc64le.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.16.19.10, 104.16.18.10, 2400:cb00:2048:1::6810:120a, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.16.19.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34765794 (33M) [application/x-sh]
Saving to: ‘Miniconda3-4.3.14-Linux-ppc64le.sh’

Miniconda3-4.3.14-Linux-pp 100%[=====================================>]  33.15M  10.2MB/s    in 3.3s

2017-05-17 22:05:18 (10.0 MB/s) - ‘Miniconda3-4.3.14-Linux-ppc64le.sh’ saved [34765794/34765794]


u0017496@sys-87250:~$ chmod a+x Miniconda3-4.3.14-Linux-ppc64le.sh

Now run this shell script; you will have to answer the license agreement and enter several other values. For most of them, just press enter.

u0017496@sys-87250:~$ ./Miniconda3-4.3.14-Linux-ppc64le.sh

Welcome to Miniconda3 4.3.14 (by Continuum Analytics, Inc.)

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>

(output omitted)

[/home/u0017496/miniconda3] >>>
PREFIX=/home/u0017496/miniconda3
installing: python-3.6.0-0 ...
installing: cffi-1.9.1-py36_0 ...
installing: conda-env-2.6.0-0 ...
installing: cryptography-1.7.1-py36_0 ...
installing: idna-2.2-py36_0 ...
installing: libffi-3.2.1-1 ...
installing: openssl-1.0.2k-1 ...
installing: pyasn1-0.2.3-py36_0 ...
installing: pycosat-0.6.2-py36_0 ...
installing: pycparser-2.17-py36_0 ...
installing: pyopenssl-16.2.0-py36_0 ...
installing: requests-2.13.0-py36_0 ...
installing: ruamel_yaml-0.11.14-py36_1 ...
installing: setuptools-27.2.0-py36_0 ...
installing: six-1.10.0-py36_0 ...
installing: sqlite-3.13.0-0 ...
installing: xz-5.2.2-1 ...
installing: yaml-0.1.6-0 ...
installing: zlib-1.2.8-3 ...
installing: conda-4.3.14-py36_0 ...
installing: pip-9.0.1-py36_1 ...
installing: wheel-0.29.0-py36_0 ...
Python 3.6.0 :: Continuum Analytics, Inc.
creating default environment...
installation finished.
Do you wish the installer to prepend the Miniconda3 install location
to PATH in your /home/u0017496/.bashrc ? [yes|no]

[no] >>> yes

Prepending PATH=/home/u0017496/miniconda3/bin to PATH in /home/u0017496/.bashrc
A backup will be made to: /home/u0017496/.bashrc-miniconda3.bak


For this change to become active, you have to open a new terminal.

Thank you for installing Miniconda2!

Share your notebooks and packages on Anaconda Cloud!
Sign up for free: https://anaconda.org
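
Incidentally, if you ever need to script this installation, the Miniconda installer also supports an unattended batch mode. A minimal sketch (-b accepts the license non-interactively, -p sets the install prefix):

./Miniconda3-4.3.14-Linux-ppc64le.sh -b -p $HOME/miniconda3
export PATH=$HOME/miniconda3/bin:$PATH   # batch mode does not touch .bashrc, so set PATH yourself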


Now that Miniconda is installed, the conda command should be available. But as you can see, conda is nowhere to be found.

u0017496@sys-87250:~$ which conda

This is because, although the installer did add conda's PATH to ~/.bashrc, .bashrc has not been re-executed in this session. Source it, and you will see that conda is on your PATH.

u0017496@sys-87250:~$ . ~/.bashrc
u0017496@sys-87250:~$ which conda
/home/u0017496/miniconda3/bin/conda

You can now list the python libraries included with Anaconda as follows.

u0017496@sys-87250:~$ conda list
# packages in environment at /home/u0017496/miniconda3:
#
cffi                      1.9.1                    py36_0
conda                     4.3.14                   py36_0
conda-env                 2.6.0                         0
cryptography              1.7.1                    py36_0
idna                      2.2                      py36_0
libffi                    3.2.1                         1
openssl                   1.0.2k                        1
pip                       9.0.1                    py36_1
pyasn1                    0.2.3                    py36_0
pycosat                   0.6.2                    py36_0
pycparser                 2.17                     py36_0
pyopenssl                 16.2.0                   py36_0
python                    3.6.0                         0
requests                  2.13.0                   py36_0
ruamel_yaml               0.11.14                  py36_1
setuptools                27.2.0                   py36_0
six                       1.10.0                   py36_0
sqlite                    3.13.0                        0
wheel                     0.29.0                   py36_0
xz                        5.2.2                         1
yaml                      0.1.6                         0
zlib                      1.2.8                         3


The loop below runs conda install on every package currently in the environment, pulling in the latest builds Continuum has published for each (a simpler one-line alternative is shown after the output below):

u0017496@sys-87250:~$ for i in `conda list | awk '{print $1}' | grep -v \#`
> do
> conda install $i
> done

(omitted)
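
Note that conda can do the same thing in one step; the sketch below uses conda's --all and -y flags to upgrade everything without per-package prompts:

conda update --all -y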

Now let's check with conda search that pip, one of the packages handled above, is properly installed. As shown below, the entry marked with * is the installed one.

u0017496@sys-87250:~$ conda search pip
Fetching package metadata .........
pip                          7.1.0                    py27_0  defaults
                             7.1.0                    py34_0  defaults
                             7.1.0                    py27_1  defaults
                             7.1.0                    py34_1  defaults
                             7.1.2                    py27_0  defaults
                             7.1.2                    py34_0  defaults
                             8.1.0                    py27_0  defaults
                             8.1.0                    py34_0  defaults
                             8.1.0                    py35_0  defaults
                             8.1.2                    py27_0  defaults
                             8.1.2                    py34_0  defaults
                             8.1.2                    py35_0  defaults
                             9.0.0                    py27_0  defaults
                             9.0.0                    py34_0  defaults
                             9.0.0                    py35_0  defaults
                             9.0.1                    py27_1  defaults
                             9.0.1                    py35_1  defaults
                          *  9.0.1                    py36_1  defaults

u0017496@sys-87250:~$ which pip
/home/u0017496/miniconda3/bin/pip

u0017496@sys-87250:~$ pip --version
pip 9.0.1 from /home/u0017496/miniconda3/lib/python3.6/site-packages (python 3.6)

Let's use the pip provided by conda to install keras 2.0.4.

u0017496@sys-87250:~/miniconda3/lib$ pip install keras==2.0.4
Collecting keras==2.0.4
  Downloading Keras-2.0.4.tar.gz (199kB)
    100% |████████████████████████████████| 204kB 3.1MB/s
Collecting theano (from keras==2.0.4)
  Downloading Theano-0.9.0.tar.gz (3.1MB)
    100% |████████████████████████████████| 3.1MB 310kB/s
Collecting pyyaml (from keras==2.0.4)
  Downloading PyYAML-3.12.tar.gz (253kB)
    100% |████████████████████████████████| 256kB 3.6MB/s
Requirement already satisfied: six in ./python3.6/site-packages (from keras==2.0.4)
Requirement already satisfied: numpy>=1.9.1 in ./python3.6/site-packages (from theano->keras==2.0.4)
Requirement already satisfied: scipy>=0.14 in ./python3.6/site-packages (from theano->keras==2.0.4)
Building wheels for collected packages: keras, theano, pyyaml
  Running setup.py bdist_wheel for keras ... done
  Stored in directory: /home/u0017496/.cache/pip/wheels/48/82/42/f06a8c03a8f95ada523a81ba723e89f059693e6ad868d09727
  Running setup.py bdist_wheel for theano ... done
  Stored in directory: /home/u0017496/.cache/pip/wheels/d5/5b/93/433299b86e3e9b25f0f600e4e4ebf18e38eb7534ea518eba13
  Running setup.py bdist_wheel for pyyaml ... done
  Stored in directory: /home/u0017496/.cache/pip/wheels/2c/f7/79/13f3a12cd723892437c0cfbde1230ab4d82947ff7b3839a4fc
Successfully built keras theano pyyaml
Installing collected packages: theano, pyyaml, keras
Successfully installed keras-2.0.4 pyyaml-3.12 theano-0.9.0
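
A quick import check that the install worked (a minimal sanity test; the backend banner keras prints on import may vary):

python -c "import keras; print(keras.__version__)"   # expected to print 2.0.4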


Let's also install gensim 2.0.0 and KoNLPy.

u0017496@sys-87250:~$ pip install gensim==2.0.0
Collecting gensim==2.0.0
  Downloading gensim-2.0.0.tar.gz (14.1MB)
    100% |████████████████████████████████| 14.2MB 88kB/s
Requirement already satisfied: numpy>=1.3 in ./miniconda3/lib/python3.6/site-packages (from gensim==2.0.0)
Requirement already satisfied: scipy>=0.7.0 in ./miniconda3/lib/python3.6/site-packages (from gensim==2.0.0)
Requirement already satisfied: six>=1.5.0 in ./miniconda3/lib/python3.6/site-packages (from gensim==2.0.0)
Requirement already satisfied: smart_open>=1.2.1 in ./miniconda3/lib/python3.6/site-packages (from gensim==2.0.0)
Requirement already satisfied: boto>=2.32 in ./miniconda3/lib/python3.6/site-packages (from smart_open>=1.2.1->gensim==2.0.0)
Requirement already satisfied: bz2file in ./miniconda3/lib/python3.6/site-packages (from smart_open>=1.2.1->gensim==2.0.0)
Requirement already satisfied: requests in ./miniconda3/lib/python3.6/site-packages (from smart_open>=1.2.1->gensim==2.0.0)
Building wheels for collected packages: gensim
  Running setup.py bdist_wheel for gensim ... done
  Stored in directory: /home/u0017496/.cache/pip/wheels/e9/5f/e7/4ff23a3fe4b181b44f37eed5602f179c1cc92a0a34f337e745
Successfully built gensim
Installing collected packages: gensim
  Found existing installation: gensim 1.0.1
    Uninstalling gensim-1.0.1:
      Successfully uninstalled gensim-1.0.1
Successfully installed gensim-2.0.0


u0017496@sys-87250:~$ pip install konlpy
Collecting konlpy
  Downloading konlpy-0.4.4-py2.py3-none-any.whl (22.5MB)
    100% |████████████████████████████████| 22.5MB 57kB/s
Installing collected packages: konlpy
Successfully installed konlpy-0.4.4
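
For gensim, a toy Word2Vec run makes a convenient smoke test. The sketch below uses a made-up three-sentence corpus and illustrative parameters (in gensim 2.x, size is the embedding dimension):

python << 'EOF'
from gensim.models import Word2Vec

# tiny made-up corpus, only to verify the installation
sentences = [["dog", "barks"], ["cat", "meows"], ["dog", "runs"]]
model = Word2Vec(sentences, size=16, min_count=1)
print(model.wv["dog"].shape)   # -> (16,)
EOF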



Now let's install numpy, matplotlib, scipy, and scikit-learn via conda. Since numpy is a prerequisite of matplotlib and scipy is a prerequisite of scikit-learn, those two are installed automatically, so in practice only two conda commands are needed.


u0017496@sys-87250:~$ conda install matplotlib
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/u0017496/miniconda3:

The following NEW packages will be INSTALLED:

    cycler:          0.10.0-py36_0
    freetype:        2.5.5-2
    libpng:          1.6.27-0
    matplotlib:      2.0.2-np112py36_0
    numpy:           1.12.1-py36_0
    openblas:        0.2.19-0
    python-dateutil: 2.6.0-py36_0
    pytz:            2017.2-py36_0

Proceed ([y]/n)? y

openblas-0.2.1 100% |###########################################################| Time: 0:00:00  10.21 MB/s
libpng-1.6.27- 100% |###########################################################| Time: 0:00:00  12.75 MB/s
freetype-2.5.5 100% |###########################################################| Time: 0:00:00  10.53 MB/s
numpy-1.12.1-p 100% |###########################################################| Time: 0:00:00  15.12 MB/s
pytz-2017.2-py 100% |###########################################################| Time: 0:00:00  13.25 MB/s
cycler-0.10.0- 100% |###########################################################| Time: 0:00:00  15.61 MB/s
python-dateuti 100% |###########################################################| Time: 0:00:00   6.43 MB/s
matplotlib-2.0 100% |###########################################################| Time: 0:00:00  14.62 MB/s


u0017496@sys-87250:~$ conda install scikit-learn
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/u0017496/miniconda3:

The following NEW packages will be INSTALLED:

    scikit-learn: 0.18.1-np112py36_1
    scipy:        0.19.0-np112py36_0

Proceed ([y]/n)? y

scipy-0.19.0-n 100% |###########################################################| Time: 0:00:02  14.75 MB/s
scikit-learn-0 100% |###########################################################| Time: 0:00:00  15.56 MB/s
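
A one-liner to confirm that the whole scientific stack imports cleanly (the versions printed are simply whatever conda installed above):

python -c "import numpy, scipy, sklearn, matplotlib; print(numpy.__version__, scipy.__version__, sklearn.__version__)"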



Packages installed this way end up in /home/u0017496/miniconda3/lib/python3.6/site-packages, as shown below.

u0017496@sys-87250:~$ ls /home/u0017496/miniconda3/lib/python3.6/site-packages/
asn1crypto                         mpl_toolkits                                  python_dateutil-2.6.0-py3.6.egg-info
asn1crypto-0.22.0-py3.6.egg-info   numpy-1.12.1.dist-info                        pytz
cffi                               OpenSSL                                       pytz-2017.2-py3.6.egg-info
cffi-1.10.0-py3.6.egg-info         packaging                                     README.txt
_cffi_backend.so                   packaging-16.8-py3.6.egg-info                 requests
conda                              pip                                           requests-2.14.2-py3.6.egg-info
conda-4.3.18-py3.6.egg-info        pip-9.0.1-py3.6.egg-info                      ruamel_yaml
conda_env                          pyasn1                                        scikit_learn-0.18.1-py3.6.egg-info
cryptography                       pyasn1-0.2.3-py3.6.egg-info                   scipy
cryptography-1.8.1-py3.6.egg-info  __pycache__                                   scipy-0.19.0-py3.6.egg-info
cycler-0.10.0-py3.6.egg-info       pycosat-0.6.2-py3.6.egg-info                  setuptools-27.2.0-py3.6.egg
cycler.py                          pycosat.cpython-36m-powerpc64le-linux-gnu.so  setuptools.pth
dateutil                           pycparser                                     six-1.10.0-py3.6.egg-info
easy-install.pth                   pycparser-2.17-py3.6.egg-info                 six.py
idna                               pylab.py                                      sklearn
idna-2.5-py3.6.egg-info            pyOpenSSL-17.0.0-py3.6.egg-info               test_pycosat.py
matplotlib                         pyparsing-2.1.4-py3.6.egg-info                wheel
matplotlib-2.0.2-py3.6.egg-info    pyparsing.py                                  wheel-0.29.0-py3.6.egg-info


Accordingly, to use these packages, set PYTHONPATH as follows.

u0017496@sys-87250:~$ export PYTHONPATH=/home/u0017496/miniconda3/lib/python3.6/site-packages:$PYTHONPATH


Now let's also install bazel, tensorflow, and tensorflow-gpu with conda (as opposed to the tensorflow that ships with PowerAI).

u0017496@sys-87250:~$ conda install bazel
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/u0017496/miniconda3:

The following NEW packages will be INSTALLED:

    bazel: 0.4.5-0

Proceed ([y]/n)? y

bazel-0.4.5-0. 100% |#############################################| Time: 0:00:09  13.37 MB/s

u0017496@sys-87250:~$ conda install tensorflow
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/u0017496/miniconda3:

The following NEW packages will be INSTALLED:

    libprotobuf: 3.2.0-0
    protobuf:    3.2.0-py36_0
    tensorflow:  1.1.0-np112py36_0
    werkzeug:    0.12.2-py36_0

Proceed ([y]/n)? y

libprotobuf-3. 100% |#############################################| Time: 0:00:00  13.84 MB/s
werkzeug-0.12. 100% |#############################################| Time: 0:00:00  18.67 MB/s
protobuf-3.2.0 100% |#############################################| Time: 0:00:00  10.39 MB/s
tensorflow-1.1 100% |#############################################| Time: 0:00:01  15.16 MB/s

u0017496@sys-87250:~$ conda install tensorflow-gpu
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/u0017496/miniconda3:

The following NEW packages will be INSTALLED:

    cudatoolkit:    8.0-0
    cudnn:          6.0.21-0
    tensorflow-gpu: 1.1.0-np112py36_0

Proceed ([y]/n)? y

cudatoolkit-8. 100% |#############################################| Time: 0:00:29  11.24 MB/s
cudnn-6.0.21-0 100% |#############################################| Time: 0:00:11  15.97 MB/s
tensorflow-gpu 100% |#############################################| Time: 0:00:06  14.27 MB/s


conda list now shows the following packages installed.

u0017496@sys-87250:~$ conda list
# packages in environment at /home/u0017496/miniconda3:
#
asn1crypto                0.22.0                   py36_0
bazel                     0.4.5                         0
boto                      2.46.1                   py36_0
bz2file                   0.98                     py36_0
cffi                      1.10.0                   py36_0
conda                     4.3.18                   py36_0
conda-env                 2.6.0                         0
cryptography              1.8.1                    py36_0
cudatoolkit               8.0                           0
cudnn                     6.0.21                        0
cycler                    0.10.0                   py36_0
freetype                  2.5.5                         2
gensim                    1.0.1               np112py36_0
gensim                    2.0.0                     <pip>
idna                      2.5                      py36_0
Keras                     2.0.4                     <pip>
konlpy                    0.4.4                     <pip>
libffi                    3.2.1                         1
libpng                    1.6.27                        0
libprotobuf               3.2.0                         0
matplotlib                2.0.2               np112py36_0
numpy                     1.12.1                    <pip>
numpy                     1.12.1                   py36_0
openblas                  0.2.19                        0
openssl                   1.0.2k                        2
packaging                 16.8                     py36_0
pip                       9.0.1                    py36_1
protobuf                  3.2.0                    py36_0
pyasn1                    0.2.3                    py36_0
pycosat                   0.6.2                    py36_0
pycparser                 2.17                     py36_0
pyopenssl                 17.0.0                   py36_0
pyparsing                 2.1.4                    py36_0
python                    3.6.1                         2
python-dateutil           2.6.0                    py36_0
pytz                      2017.2                   py36_0
PyYAML                    3.12                      <pip>
requests                  2.14.2                   py36_0
ruamel_yaml               0.11.14                  py36_1
scikit-learn              0.18.1              np112py36_1
scipy                     0.19.0              np112py36_0
setuptools                27.2.0                   py36_0
six                       1.10.0                   py36_0
smart_open                1.5.2                    py36_0
sqlite                    3.13.0                        0
tensorflow                1.1.0               np112py36_0
tensorflow-gpu            1.1.0               np112py36_0
Theano                    0.9.0                     <pip>
werkzeug                  0.12.2                   py36_0
wheel                     0.29.0                   py36_0
xz                        5.2.2                         1
yaml                      0.1.6                         0
zlib                      1.2.8                         3
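
Before training, it is worth confirming that tensorflow imports and, on a GPU machine, that it actually sees the GPUs. A minimal check (device_lib is TensorFlow 1.1's device-listing helper):

python -c "import tensorflow as tf; print(tf.__version__)"
python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"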




While we are at it, let's train the inception v3 model using the tensorflow we just installed with conda. Simply follow the steps below in order.


u0017496@sys-87250:~/inception$ pwd
/home/u0017496/inception

u0017496@sys-87250:~/inception$ export INCEPTION_DIR=/home/u0017496/inception

u0017496@sys-87250:~/inception$ curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  380M  100  380M    0     0  5918k      0  0:01:05  0:01:05 --:--:-- 4233k

u0017496@sys-87250:~/inception$ tar -xvf inception-v3-2016-03-01.tar.gz
inception-v3/
inception-v3/checkpoint
inception-v3/README.txt
inception-v3/model.ckpt-157585

u0017496@sys-87250:~/inception$ git clone https://github.com/tensorflow/models.git
Cloning into 'models'...
remote: Counting objects: 4703, done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 4703 (delta 17), reused 31 (delta 11), pack-reused 4649
Receiving objects: 100% (4703/4703), 153.34 MiB | 5.62 MiB/s, done.
Resolving deltas: 100% (2374/2374), done.
Checking connectivity... done.

u0017496@sys-87250:~/inception/models/inception$ export FLOWERS_DIR=/home/u0017496/inception/models/inception

u0017496@sys-87250:~/inception/models/inception$ mkdir -p $FLOWERS_DIR/data

u0017496@sys-87250:~/inception/models/inception$ which bazel
/home/u0017496/miniconda3/bin/bazel

u0017496@sys-87250:~/inception/models/inception$ bazel build inception/download_and_preprocess_flowers
Extracting Bazel installation...
....................
INFO: Found 1 target...
Target //inception:download_and_preprocess_flowers up-to-date:
  bazel-bin/inception/download_and_preprocess_flowers
INFO: Elapsed time: 6.943s, Critical Path: 0.05s

u0017496@sys-87250:~/inception/models/inception$ export TEST_TMPDIR=/home/u0017496/.cache

u0017496@sys-87250:~/inception/models/inception$ bazel build inception/download_and_preprocess_flowers
INFO: $TEST_TMPDIR defined: output root default is '/home/u0017496/.cache'.
Extracting Bazel installation...
.............
INFO: Found 1 target...
Target //inception:download_and_preprocess_flowers up-to-date:
  bazel-bin/inception/download_and_preprocess_flowers
INFO: Elapsed time: 4.867s, Critical Path: 0.03s

u0017496@sys-87250:~/inception/models/inception$ bazel-bin/inception/download_and_preprocess_flowers $FLOWERS_DIR/data
Downloading flower data set.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  218M  100  218M    0     0  9372k      0  0:00:23  0:00:23 --:--:-- 10.1M
(omitted)
Found 3170 JPEG files across 5 labels inside /home/u0017496/inception/models/inception/data/raw-data/train.
Launching 2 threads for spacings: [[0, 1585], [1585, 3170]]
2017-05-19 05:33:44.191446 [thread 0]: Processed 1000 of 1585 images in thread batch.
2017-05-19 05:33:44.213856 [thread 1]: Processed 1000 of 1585 images in thread batch.
2017-05-19 05:33:54.902070 [thread 1]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00001-of-00002
2017-05-19 05:33:54.902172 [thread 1]: Wrote 1585 images to 1585 shards.
2017-05-19 05:33:54.911283 [thread 0]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00000-of-00002
2017-05-19 05:33:54.911360 [thread 0]: Wrote 1585 images to 1585 shards.
2017-05-19 05:33:55.171141: Finished writing all 3170 images in data set.

As you can see below, this inception v3 example is a neural network that classifies flower photos across five classes.

u0017496@sys-87250:~/inception/models/inception$ du -sm data/raw-data/train/*
29      data/raw-data/train/daisy
44      data/raw-data/train/dandelion
1       data/raw-data/train/LICENSE.txt
33      data/raw-data/train/roses
47      data/raw-data/train/sunflowers
48      data/raw-data/train/tulips

u0017496@sys-87250:~/inception/models/inception$ bazel build inception/flowers_train
INFO: $TEST_TMPDIR defined: output root default is '/home/u0017496/.cache'.
............................
INFO: Found 1 target...
Target //inception:flowers_train up-to-date:
  bazel-bin/inception/flowers_train
INFO: Elapsed time: 6.502s, Critical Path: 0.03s

At last, the preparation for training inception v3 is complete. Start the training with the following command.

u0017496@sys-87250:~/inception/models/inception$ time bazel-bin/inception/flowers_train --train_dir=$FLOWERS_DIR/train --data_dir=$FLOWERS_DIR/data --pretrained_model_checkpoint_path=$INCEPTION_DIR/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=32

NVIDIA: no NVIDIA devices found
2017-05-19 05:41:03.740213: E tensorflow/stream_executor/cuda/cuda_driver.cc:405] failed call to cuInit: CUDA_ERROR_UNKNOWN
2017-05-19 05:41:03.740670: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (sys-87250): /proc/driver/nvidia/version does not exist
2017-05-19 05:41:51.947244: Pre-trained model restored from /home/u0017496/inception/inception-v3/model.ckpt-157585
2017-05-19 05:47:22.023602: step 0, loss = 2.79 (0.2 examples/sec; 182.713 sec/batch)
2017-05-19 06:05:58.942671: step 10, loss = 2.53 (0.4 examples/sec; 78.882 sec/batch)
2017-05-19 06:19:26.875533: step 20, loss = 2.40 (0.4 examples/sec; 82.410 sec/batch)
2017-05-19 06:33:10.333275: step 30, loss = 2.20 (0.4 examples/sec; 77.844 sec/batch)
2017-05-19 06:48:27.688993: step 40, loss = 2.24 (0.3 examples/sec; 96.148 sec/batch)

real    84m30.882s
user    135m20.864s
sys     2m30.832s


A confession: the server used for this installation demo is actually a POWER8 server with no GPUs. Without a GPU the training falls back to the CPU, and as you can see, it then takes a very, very long time to complete. The output above shows roughly 0.4 examples processed per second, whereas with P100 GPUs you see on the order of 50 to 200 examples per second, depending on the number of GPUs and the batch size.
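
As a rough sanity check on those numbers: 50 steps × 32 images per batch = 1,600 images; at 0.4 examples/sec that is about 4,000 seconds, or roughly 67 minutes of pure training, which lines up with the 84-minute wall clock once model restore and input preprocessing are included. At the ~210 examples/sec of the P100 log below, the same 1,600 images take under 10 seconds.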

Below is part of the log from an inception v3 run performed earlier on a Minsky server with PowerAI installed.

2017-05-16 03:48:46.352210: Pre-trained model restored from /gpfs/gpfs_gl4_16mb/b7p088za/inception-v3/model.ckpt-157585
2017-05-16 03:52:44.322381: step 0, loss = 2.72 (17.6 examples/sec; 21.830 sec/batch)
2017-05-16 03:55:29.550791: step 10, loss = 2.57 (213.6 examples/sec; 1.797 sec/batch)
2017-05-16 03:55:47.619990: step 20, loss = 2.35 (212.1 examples/sec; 1.810 sec/batch)
2017-05-16 03:56:05.953991: step 30, loss = 2.17 (206.6 examples/sec; 1.859 sec/batch)
2017-05-16 03:56:24.306742: step 40, loss = 1.98 (209.4 examples/sec; 1.834 sec/batch)
2017-05-16 03:56:42.490063: step 50, loss = 1.92 (217.8 examples/sec; 1.763 sec/batch)
2017-05-16 03:57:00.444537: step 60, loss = 1.67 (216.6 examples/sec; 1.773 sec/batch)
2017-05-16 03:57:18.366941: step 70, loss = 1.58 (212.7 examples/sec; 1.806 sec/batch)
2017-05-16 03:57:36.467837: step 80, loss = 1.55 (213.6 examples/sec; 1.798 sec/batch)