Tuesday, July 23, 2019
Using H2O DriverlessAI on shared HW infrastructure
If multiple teams in your company are going to use H2O DriverlessAI (DAI hereafter), the first things to understand are the following.
1) A DAI training run (an "experiment" in DAI terminology) uses only as much memory as it actually needs, whether host memory or GPU memory. So as long as the system has enough free resources, several trainings can run concurrently on one server.
2) As a rule of thumb, DAI uses memory equal to about 10x the dataset size. For example, training on a 1GB dataset takes about 10GB of memory. Lowering the "Accuracy" dial setting allows training with less memory. (A small sizing sketch follows this list.)
3) DAI uses a lot of CPU, but it is especially sensitive to memory. If one DAI training is already consuming nearly all of the system's memory and another one is started, both trainings can fail from lack of memory.
4) If there is no GPU, or a GPU is present but its memory is insufficient, DAI simply trains on the CPU instead. So a missing GPU or insufficient GPU memory never makes a DAI training fail.
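As a minimal sketch of the 10x rule of thumb from item 2 (the dataset path, the rounding, and the multiplier are illustrative assumptions, not a DAI-documented formula), the memory cap for a user's container could be sized like this:
# Hypothetical sizing helper: cap the container's memory at 10x the dataset size.
DATASET=/user01/data/train.csv                                       # assumed dataset location
DS_GB=$(( ( $(stat -c%s "$DATASET") + 1073741823 ) / 1073741824 ))   # dataset size in GB, rounded up
MEM_LIMIT="$(( DS_GB * 10 ))g"                                       # 10x rule of thumb
echo "dataset=${DS_GB}GB -> --memory=${MEM_LIMIT}"
sudo docker run --runtime=nvidia --init --rm --memory=${MEM_LIMIT} -p 12311:12345 -v /user01/data:/data h2oai/dai-redhat7-ppc64le:v0.1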
Once the above is understood, a small number of users who know each other's workloads and are considerate of one another can run several trainings at once on a single server without much trouble. What is hard is having many users from many teams, who know little about each other, share one server.
For such an unspecified set of users to run DAI trainings on limited server resources, the most stable approach is to give each user a fixed share of the system resources (CPU, memory, GPU). However, handing each user a fully isolated virtual machine wastes far too many resources.
Considering all this, the best solution is docker (nvidia-docker in environments with GPUs). The advantages of using docker are as follows.
1) The CPU, memory, and GPU available to each user's DAI docker container can be capped, so every user gets stable training.
2) Resources allocated to one user but not currently in use can be used by other users' docker containers, raising overall system utilization.
3) There is almost no virtualization overhead for CPU, memory, or disk space. Each user does not even need a separate IP; each user just gets a different port number.
4) Security between users relies on ordinary linux security policy. So for each user group that needs isolation, simply create and manage a separate linux OS user id.
5) Creating a service takes only seconds with a single command line or script, and only a system administrator with super user (root) privileges can do it.
6) It can be built simply, with 100% open source.
Running DAI under docker is as simple as starting a docker image with DAI installed, like this:
$ sudo docker run --runtime=nvidia --init --rm -p 12311:12345 -v /user01/data:/data -v /user01/log:/log -v /user01/tmp:/tmp h2oai/dai-redhat7-ppc64le:v0.1
--runtime=nvidia : specifies an nvidia-docker environment that uses GPUs
-p 12311:12345 : forwards port 12345, which DAI uses inside the container, to port 12311 on the parent OS
-v /user01/data:/data : mounts the parent OS's /user01/data directory as /data inside the container
h2oai/dai-redhat7-ppc64le:v0.1 : the name and tag of the docker image with DAI installed
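For a second user, the same image can simply be started with a different host port and that user's own directories; for example (user02's paths and host port 12312 are assumptions for illustration):
$ sudo docker run --runtime=nvidia --init --rm -p 12312:12345 -v /user02/data:/data -v /user02/log:/log -v /user02/tmp:/tmp h2oai/dai-redhat7-ppc64le:v0.1
Each user then reaches his or her own DAI instance at the same server IP but a different port (12311, 12312, ...).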
The container above was started without any resource limits, so it can use all of the system's resources. In this case, while training an AML dataset of about 1GB, the DAI process was observed using 10GB of memory (by Res Data, i.e. actual resident memory) and all 8 CPU cores (not counting HW threads).
Run without resource limits this way, the 1GB dataset training (with the accuracy-time-interpretability dials set to 2-2-2) took 23m 29s.
This time, let's run it with CPU limited to 4 cores and memory limited to 5GB.
$ sudo docker run --runtime=nvidia --init --rm --cpus=4 --memory=5g -p 12311:12345 -v /user01/data:/data -v /user01/log:/log -v /user01/tmp:/tmp h2oai/dai-redhat7-ppc64le:v0.1
In this case, training the same dataset with the same dial settings, the DAI process was observed using 4GB of memory (by Res Data, actual resident memory) and only half of the 8 CPU cores (again not counting HW threads).
Capped at 4 CPU cores and 5GB of memory, the same 1GB dataset (dials 2-2-2) took 41m 12s, roughly twice as long as the unrestricted run.
To assign specific GPUs when running a docker container, set the NVIDIA_VISIBLE_DEVICES environment variable as below. The example below assigns two GPUs, numbers 2 and 3.
$ sudo docker run --runtime=nvidia --init --rm --cpus=4 --memory=5g -e NVIDIA_VISIBLE_DEVICES=2,3 -p 12311:12345 -v /user01/data:/data -v /user01/log:/log -v /user01/tmp:/tmp h2oai/dai-redhat7-ppc64le:v0.1
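To confirm the assignment, nvidia-smi can be run inside such a container; with NVIDIA_VISIBLE_DEVICES=2,3 it should list only those two GPUs, renumbered 0 and 1 inside the container (a quick check, assuming the nvidia runtime injects nvidia-smi into the container as usual):
$ sudo docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=2,3 h2oai/dai-redhat7-ppc64le:v0.1 nvidia-smi -L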
Tuesday, July 3, 2018
PowerAI R5.2 is now also available as a docker image
Up through v1.4, PowerAI could be downloaded freely from the internet, but starting with v1.5 it changed so that you must place an order to download it. Even then, entering just a machine serial # makes the order free of charge, so there is no cost burden; still, it was undeniably inconvenient.
That part has not changed, but starting with v1.5.2, i.e. R5.2, PowerAI is also distributed as a docker image that contains PowerAI.
This docker image, usable on an AC922 with Redhat 7.5, docker 1.13.1, and nvidia-docker 1.0 installed, is built on Ubuntu 16.04 with CUDA 9.2. The PowerAI R5.2 inside it contains the following components.
Notably, it also includes Snap ML, which IBM built and contributed separately.
Component | Version |
---|---|
Distributed Deep Learning (DDL) | 1.0.0 |
TensorFlow | 1.8.0 |
TensorBoard | 1.8.0 |
IBM Caffe | 1.0.0 |
BVLC Caffe | 1.0.0 |
PyTorch | 0.4.0 |
Snap ML | 1.0.0 |
Spectrum MPI | 10.2 |
Bazel | 0.10.0 |
OpenBLAS | 0.2.20 |
Protobuf | 3.4.0 |
Usage is the same as ordinary docker, and simple. Pull it as below...
# docker pull ibmcom/powerai
...and run it like this:
# nvidia-docker run -ti --env LICENSE=yes ibmcom/powerai bash
The tags currently available are 1.5.2-all-ubuntu16.04 and 1.5.2-all-ubuntu16.04-py3. For example, to use tensorflow under python3, pull/run with the py3 tag as below.
# docker pull ibmcom/powerai:1.5.2-all-ubuntu16.04-py3
# nvidia-docker run -ti --env LICENSE=yes ibmcom/powerai:1.5.2-all-ubuntu16.04-py3 bash
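Inside the container, the PowerAI frameworks live under /opt/DL and generally need to be activated before use. A minimal check might look like the following (a sketch assuming the usual PowerAI /opt/DL layout; the exact script names may differ by release):
# source /opt/DL/tensorflow/bin/tensorflow-activate
# python -c "import tensorflow as tf; print(tf.__version__)"   # expect 1.8.0, per the table above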
For more details, see the sites below.
https://developer.ibm.com/linuxonpower/deep-learning-powerai/releases/
https://hub.docker.com/r/ibmcom/powerai/
Monday, January 15, 2018
Running caffe and tensorflow on the AC922 with Ubuntu-based container images
Good news. According to the original announcements, official Ubuntu support on the AC922 was not expected until 2Q 2018, and caffe, tensorflow, etc. were to be supported only then as well.
Testing today on an AC922 in a CUDA 9.1 + Redhat 7.4 + nvidia-docker v1 environment, however, the tensorflow 1.3 and caffe 1.0 that I had previously built on ubuntu 16.04 both run fine.
Performance is also about 1.7~1.8x that of the existing Minsky with P100s. Given that the V100's official TFLOPS rating is 1.5x the P100's, the extra seems to come from POWER9 and NVLink 2.0.
[root@ac922 ~]# nvidia-docker run -ti --rm -v /nvme:/nvme bsyu/tf1.3-ppc64le:v0.1 bash
root@2be8a3ffc5fd:/nvme/models/tutorials/image/cifar10# which python
/opt/anaconda3/bin/python
root@2be8a3ffc5fd:/nvme/models/tutorials/image/cifar10# time python cifar10_multi_gpu_train.py --batch_size 512
2018-01-15 13:10:46.585492: step 8230, loss = 0.60 (27071.8 examples/sec; 0.019 sec/batch)
2018-01-15 13:10:47.353298: step 8240, loss = 0.59 (26355.8 examples/sec; 0.019 sec/batch)
2018-01-15 13:10:48.130751: step 8250, loss = 0.67 (26219.3 examples/sec; 0.020 sec/batch)
2018-01-15 13:10:48.909557: step 8260, loss = 0.61 (27011.5 examples/sec; 0.019 sec/batch)
2018-01-15 13:10:49.681004: step 8270, loss = 0.65 (27131.1 examples/sec; 0.019 sec/batch)
Note that, as before, GPU auto boost must be enabled to get this performance; for the V100, set it up as follows.
[root@ac922 ~]# cat /etc/rc.local
#!/bin/bash
# THIS FILE IS ADDED FOR COMPATIBILITY PURPOSES
#
# It is highly advisable to create own systemd services or udev rules
# to run scripts during boot instead of using this file.
#
# In contrast to previous versions due to parallel execution during boot
# this script will NOT be run after all other services.
#
# Please note that you must run 'chmod +x /etc/rc.d/rc.local' to ensure
# that this script will be executed during boot.
touch /var/lock/subsys/local
/usr/bin/nvidia-smi -pm ENABLED
/usr/bin/nvidia-smi -ac 877,1530
/usr/sbin/ppc64_cpu --smt=off
sleep 30
/usr/bin/cpupower frequency-set --governor performance
/usr/bin/nvidia-docker-plugin &
Also, when using a docker container this way, CUDA initialization takes quite a long time. Whether this is because an ubuntu-based container runs on top of Redhat, because of nvidia-docker v1, or a CUDA 9 issue is still unclear.
In addition, running nvidia-smi as-is produces a "failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED" error. That problem can be resolved as follows; the steps below were written up by Boran Lee of IBM.
* See the NVIDIA installation guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-metas
# vi /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
# sudo dracut --force
# vi /usr/lib/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target
[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
[Install]
WantedBy=multi-user.target
# sudo systemctl enable nvidia-persistenced
$ vi /lib/udev/rules.d/40-redhat.rules (comment out the line below with #)
#SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"
# /usr/bin/nvidia-persistenced --verbose
# sudo yum install freeglut-devel libX11-devel libXi-devel libXmu-devel make mesa-libGLU-devel
Error downloading packages:
libXext-devel-1.3.3-3.el7.ppc64le: [Errno 256] No more mirrors to try.
libXdamage-devel-1.1.4-4.1.el7.ppc64le: [Errno 256] No more mirrors to try. (omitted)
==> The install step hits download errors like the above, but I simply ignored them and moved on.
Afterwards, with PYTHONPATH, PATH, and LD_LIBRARY_PATH set, running either cifar10 or mnist under tensorflow 1.4 produces the following error.
[root@ac922 cifar10]# python cifar10_train.py
Traceback (most recent call last):
File "cifar10_train.py", line 42, in <module>
import tensorflow as tf
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
from tensorflow.python import *
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 30, in <module>
import traceback
File "/opt/anaconda3/lib/python3.6/traceback.py", line 5, in <module>
import linecache
File "/opt/anaconda3/lib/python3.6/linecache.py", line 11, in <module>
import tokenize
File "/opt/anaconda3/lib/python3.6/tokenize.py", line 33, in <module>
import re
File "/opt/anaconda3/lib/python3.6/re.py", line 142, in <module>
class RegexFlag(enum.IntFlag):
AttributeError: module 'enum' has no attribute 'IntFlag'
This error occurs because enum34 is installed; from Python 3.4 onward, enum34 is incompatible and has to be removed. To remove it, first switch PYTHONPATH to the Python 2.7 path.
# export PYTHONPATH=/opt/DL/tensorflow/lib/python2.7/site-packages
# pip uninstall enum34
Then set PYTHONPATH back to the python3.6 site-packages path:
# export PYTHONPATH=/opt/anaconda3/lib/python3.6/site-packages
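To confirm the removal worked, a quick standard-library check (an illustrative one-liner, not from the original write-up) is:
# python -c "import enum; print(hasattr(enum, 'IntFlag'))"   # should print True on Python 3.6 once enum34 is gone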
After that, running cifar10 or mnist no longer raises the enum error, and nvidia-smi shows proper GPU memory status instead of an unknown error.
Thursday, January 11, 2018
Configuring flannel on ppc64le: docker container networking beyond the host server
This is how to set up networking so that docker containers sitting on two different physical servers can talk to each other. In one line: you configure flanneld, and to do that you must first configure etcd.
The following sites were referenced for this test.
http://docker-k8s-lab.readthedocs.io/en/latest/docker/docker-etcd.html
http://docker-k8s-lab.readthedocs.io/en/latest/docker/docker-flannel.html
http://cloudgeekz.com/1016/configure-flannel-docker-power.html
https://www.ibm.com/developerworks/community/blogs/mhhaque/entry/Docker_And_Kubernetes_Cluster_on_Power_with_RHEL7_Part_2_etcd_flanneld_daemons_on_master_node?lang=en
https://www.ibm.com/developerworks/community/blogs/mhhaque/entry/Docker_And_Kubernetes_Cluster_on_Power_with_RHEL7_Part_3_flanneld_docker_daemons_on_Docker_Container_node?lang=en
The server environment is IBM POWER8 with Ubuntu 16.04 ppc64le; the hostnames and IPs of the two servers are as follows.
Physical server #1 sys-90725 172.29.160.221
Physical server #2 sys-90754 172.29.160.207
First, install etcd. It is a kind of key-value store required for the flanneld setup. The one included with Ubuntu can be used as-is.
u0017649@sys-90725:~$ sudo apt-get install etcd
The settings go into /etc/default/etcd as below.
u0017649@sys-90725:~$ sudo vi /etc/default/etcd
ETCD_NAME=sys-90725
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_CLIENT_URLS="http://172.29.160.221:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://172.29.160.221:2379"
On server #2, of course, enter the hostname and IP that match that server.
u0017649@sys-90754:~$ sudo vi /etc/default/etcd
ETCD_NAME=sys-90754
ETCD_DATA_DIR="/var/lib/etcd/default"
ETCD_LISTEN_CLIENT_URLS="http://172.29.160.207:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://172.29.160.207:2379"
Start etcd and confirm it is running properly.
u0017649@sys-90725:~$ sudo systemctl enable etcd
u0017649@sys-90725:~$ sudo systemctl start etcd
u0017649@sys-90725:~$ sudo systemctl status etcd
● etcd.service - etcd - highly-available key value store
Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabl
Active: active (running) since Mon 2018-01-08 03:11:40 EST; 6min ago
Docs: https://github.com/coreos/etcd
man:etcd
Main PID: 6100 (etcd)
CGroup: /system.slice/etcd.service
└─6100 /usr/bin/etcd
Next, write the flanneld configuration as a json file as follows, with identical contents on both servers.
u0017649@sys-90725:~$ vi flannel-config-vxlan.json
{
"Network": "10.172.29.0/16",
"SubnetLen": 24,
"Backend": {
"Type": "vxlan",
"VNI": 1
}
}
Now load this file into etcd as below. Note that /atomic.io/network/config is neither a physical file name nor an internet address; it is simply an etcd key.
u0017649@sys-90725:~$ sudo etcdctl set /atomic.io/network/config < flannel-config-vxlan.json
Verify it was stored correctly like this.
u0017649@sys-90725:~$ sudo etcdctl get /atomic.io/network/config
{
"Network": "10.172.29.0/16",
"SubnetLen": 24,
"Backend": {
"Type": "vxlan",
"VNI": 1
}
}
You can check that etcd itself is healthy in the following way.
u0017649@sys-90725:~$ curl -L http://sys-90725:2379/v2/keys/atomic.io/network/config
{"action":"get","node":{"key":"/atomic.io/network/config","value":"{\n \"Network\": \"10.172.29.0/16\",\n \"SubnetLen\": 24,\n \"Backend\": {\n \"Type\": \"vxlan\",\n \"VNI\": 1\n }\n}\n","modifiedIndex":4,"createdIndex":4}}
u0017649@sys-90725:~$ etcdctl cluster-health
member ce2a822cea30bfca is healthy: got healthy result from http://0.0.0.0:2379
cluster is healthy
Now flannel must be installed, but it is not yet shipped with Ubuntu. So it has to be built, and for that you need golang version 1.7 or later. Beware that the default golang 1.6 produces errors.
u0017649@sys-90725:~$ sudo apt-get install linux-libc-dev golang-1.9 golang-1.9-go
u0017649@sys-90754:~$ sudo rm /usr/bin/go
u0017649@sys-90754:~$ sudo ln -s /usr/lib/go-1.9/bin/go /usr/bin/go
Cloning the flannel github repo as-is and building it also errors out, so a small trick is needed: first create the directory src/github.com/coreos, cd into it, and clone the repo there.
u0017649@sys-90725:~$ export GOPATH=$(pwd)
u0017649@sys-90725:~$ mkdir -p src/github.com/coreos
u0017649@sys-90725:~$ cd src/github.com/coreos
u0017649@sys-90725:~/src/github.com/coreos$ git clone https://github.com/coreos/flannel.git
Before starting the build, the Makefile needs a little editing. The Makefile does list ppc64le, but amd64 is the default architecture, so we change the default to ppc64le. Since we will run make tar.gz, keep only that part among the many entries and comment the rest out with #.
(Alternatively, just cd into ~/src/github.com/coreos/flannel and run "make dist/flanneld"; flanneld is all we actually need.)
u0017649@sys-90725:~/src/github.com/coreos$ cd flannel
u0017649@sys-90725:~/src/github.com/coreos/flannel$ export CGO_ENABLED=1
u0017649@sys-90725:~/src/github.com/coreos/flannel$ vi Makefile
# Default tag and architecture. Can be overridden
#TAG?=$(shell git describe --tags --dirty)
TAG?=v0.9.0-34-gab368026-ppc64le # from https://quay.io/repository/coreos/flannel-git?tab=tags
#ARCH?=amd64
ARCH?=ppc64le
#GO_VERSION=1.8.3
GO_VERSION=1.9.2
...
tar.gz:
# ARCH=ppc64le make dist/flanneld-ppc64le
# tar --transform='flags=r;s|-ppc64le||' -zcvf dist/flannel-$(TAG)-linux-ppc64le.tar.gz -C dist flanneld-ppc64le mk-docker-opts.sh ../README.md
# tar -tvf dist/flannel-$(TAG)-linux-ppc64le.tar.gz
# ARCH=ppc64le make dist/flanneld.exe
# tar --transform='flags=r;s|-ppc64le||' -zcvf dist/flannel-$(TAG)-windows-ppc64le.tar.gz -C dist flanneld.exe mk-docker-opts.sh ../README.md
# tar -tvf dist/flannel-$(TAG)-windows-ppc64le.tar.gz
ARCH=ppc64le make dist/flanneld-ppc64le
tar --transform='flags=r;s|-ppc64le||' -zcvf dist/flannel-$(TAG)-linux-ppc64le.tar.gz -C dist flanneld-ppc64le mk-docker-opts.sh ../README.md
tar -tvf dist/flannel-$(TAG)-linux-ppc64le.tar.gz
# ARCH=arm make dist/flanneld-arm
# tar --transform='flags=r;s|-arm||' -zcvf dist/flannel-$(TAG)-linux-arm.tar.gz -C dist flanneld-arm mk-docker-opts.sh ../README.md
# tar -tvf dist/flannel-$(TAG)-linux-arm.tar.gz
# ARCH=arm64 make dist/flanneld-arm64
# tar --transform='flags=r;s|-arm64||' -zcvf dist/flannel-$(TAG)-linux-arm64.tar.gz -C dist flanneld-arm64 mk-docker-opts.sh ../README.md
# tar -tvf dist/flannel-$(TAG)-linux-arm64.tar.gz
# ARCH=s390x make dist/flanneld-s390x
# tar --transform='flags=r;s|-s390x||' -zcvf dist/flannel-$(TAG)-linux-s390x.tar.gz -C dist flanneld-s390x mk-docker-opts.sh ../README.md
# tar -tvf dist/flannel-$(TAG)-linux-s390x.tar.gz
Then run make tar.gz.
u0017649@sys-90725:~/src/github.com/coreos/flannel$ make tar.gz
Before long, the tar.gz is produced as below.
u0017649@sys-90725:~/src/github.com/coreos/flannel$ ls -l dist/*.tar.gz
-rw-rw-r-- 1 u0017649 u0017649 8179227 Jan 8 23:26 dist/flannel-v0.9.0-34-gab368026-ppc64le-linux-ppc64le.tar.gz
Extract it to an appropriate location (here, /usr/local/bin).
u0017649@sys-90725:~/src/github.com/coreos/flannel$ cd /usr/local/bin
u0017649@sys-90725:/usr/local/bin$ sudo tar -xvf ~/src/github.com/coreos/flannel/dist/flannel-v0.9.0-34-gab368026-ppc64le-linux-ppc64le.tar.gz
flanneld
mk-docker-opts.sh
README.md
Before configuring flanneld, first stop the docker daemon and delete docker0.
u0017649@sys-90725:~$ sudo systemctl stop docker.service
u0017649@sys-90725:~$ sudo ip link delete docker0
Now create the files /lib/systemd/system/flanneld.service and /etc/default/flanneld as follows to register the flanneld service with systemctl.
u0017649@sys-90725:~$ sudo vi /lib/systemd/system/flanneld.service
[Unit]
Description=Flanneld overlay address etcd agent
After=network.target
After=network-online.target
Wants=network-online.target
After=etcd.service
Before=docker.service
[Service]
Type=notify
EnvironmentFile=-/etc/default/flanneld
ExecStart=/usr/local/bin/flanneld -etcd-endpoints=${FLANNEL_ETCD} -etcd-prefix=${FLANNEL_ETCD_KEY} $FLANNEL_OPTIONS
Restart=on-failure
[Install]
WantedBy=multi-user.target
RequiredBy=docker.service
u0017649@sys-90725:~$ sudo vi /etc/default/flanneld
# Flanneld configuration options
# etcd url location. Point this to the server where etcd runs
FLANNEL_ETCD="http://172.29.160.221:2379"
# etcd config key. This is the configuration key that flannel queries
# For address range assignment
FLANNEL_ETCD_KEY="/atomic.io/network"
# Any additional options that you want to pass
FLANNEL_OPTIONS="-ip-masq=true"
Now start flanneld and confirm it is working.
u0017649@sys-90725:~$ sudo systemctl start flanneld
u0017649@sys-90725:~$ sudo systemctl status flanneld
● flanneld.service - Flanneld overlay address etcd agent
Loaded: loaded (/lib/systemd/system/flanneld.service; disabled; vendor preset: enabled)
Active: active (running) since Wed 2018-01-10 20:21:19 EST; 28s ago
Main PID: 10666 (flanneld)
Tasks: 10
Memory: 8.8M
CPU: 266ms
CGroup: /system.slice/flanneld.service
└─10666 /usr/local/bin/flanneld -etcd-endpoints=http://172.29.160.221:2379 -etcd-prefix=/atomic.io
Jan 10 20:21:19 sys-90725 flanneld[10666]: I0110 20:21:19.761758 10666 main.go:238] Installing signal handl
Jan 10 20:21:19 sys-90725 flanneld[10666]: I0110 20:21:19.764142 10666 main.go:353] Found network config -
Jan 10 20:21:19 sys-90725 flanneld[10666]: I0110 20:21:19.764234 10666 vxlan.go:120] VXLAN config: VNI=1 Po
Jan 10 20:21:19 sys-90725 flanneld[10666]: I0110 20:21:19.830878 10666 local_manager.go:201] Found previous
Jan 10 20:21:19 sys-90725 flanneld[10666]: I0110 20:21:19.859766 10666 local_manager.go:220] Allocated leas
Jan 10 20:21:19 sys-90725 flanneld[10666]: I0110 20:21:19.860247 10666 main.go:300] Wrote subnet file to /r
Jan 10 20:21:19 sys-90725 flanneld[10666]: I0110 20:21:19.860273 10666 main.go:304] Running backend.
Jan 10 20:21:19 sys-90725 systemd[1]: Started Flanneld overlay address etcd agent.
Jan 10 20:21:19 sys-90725 flanneld[10666]: I0110 20:21:19.862149 10666 vxlan_network.go:60] watching for ne
Jan 10 20:21:19 sys-90725 flanneld[10666]: I0110 20:21:19.889547 10666 main.go:396] Waiting for 22h59m59.94
You can now see that a file /run/flannel/subnet.env has been created, recording the relevant environment variables.
u0017649@sys-90725:~$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.172.0.0/16
FLANNEL_SUBNET=10.172.61.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
A new interface called flannel.1 is also created, as follows.
u0017649@sys-90725:~$ ifconfig flannel.1
flannel.1 Link encap:Ethernet HWaddr 66:bc:02:fe:88:b3
inet addr:10.172.61.0 Bcast:0.0.0.0 Mask:255.255.255.255
inet6 addr: fe80::64bc:2ff:fefe:88b3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:846 errors:0 dropped:0 overruns:0 frame:0
TX packets:679 errors:0 dropped:8 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:77685 (77.6 KB) TX bytes:77345 (77.3 KB)
Now use this information to change the docker daemon's settings.
u0017649@sys-90725:~$ sudo vi /etc/systemd/system/multi-user.target.wants/docker.service
...
#ExecStart=/usr/bin/dockerd -H fd://
EnvironmentFile=-/run/flannel/subnet.env
ExecStart=/usr/bin/dockerd -H fd:// --bip=${FLANNEL_SUBNET} --mtu=${FLANNEL_MTU} --iptables=false --ip-masq=false
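As an alternative to hard-coding the flags, the mk-docker-opts.sh script extracted alongside flanneld can generate a docker options file from /run/flannel/subnet.env; a sketch of that approach (the output file name /run/docker_opts.env is an arbitrary choice for illustration) is:
u0017649@sys-90725:~$ sudo /usr/local/bin/mk-docker-opts.sh -k DOCKER_OPTS -d /run/docker_opts.env
docker.service would then reference EnvironmentFile=/run/docker_opts.env and pass $DOCKER_OPTS to dockerd instead of the individual --bip/--mtu flags.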
Then bring docker back up.
u0017649@sys-90725:~$ sudo systemctl daemon-reload
u0017649@sys-90725:~$ sudo systemctl start docker.service
Now netstat -rn shows the flannel.1 interface registered as a gateway, as follows.
u0017649@sys-90725:~$ netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 172.29.128.13 0.0.0.0 UG 0 0 0 ibmveth0
10.172.23.0 10.172.23.0 255.255.255.0 UG 0 0 0 flannel.1
10.172.61.0 0.0.0.0 255.255.255.0 U 0 0 0 docker0
172.29.128.0 0.0.0.0 255.255.192.0 U 0 0 0 ibmveth0
u0017649@sys-90725:~$ ip -4 a|grep inet
inet 127.0.0.1/8 scope host lo
inet 172.29.160.221/18 brd 172.29.191.255 scope global ibmveth0
inet 10.172.61.0/32 scope global flannel.1
inet 10.172.61.1/24 scope global docker0
Now start a docker container on each of servers #1 and #2 and check their ip addresses.
Server #1
u0017649@sys-90725:~$ docker run -ti --rm bsyu/caffe-ibm:v0.2 bash
root@73b40d9d2a5b:/# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 02:42:0a:ac:3d:02
inet addr:10.172.61.2 Bcast:0.0.0.0 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:950 errors:0 dropped:0 overruns:0 frame:0
TX packets:851 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3402207 (3.4 MB) TX bytes:84232 (84.2 KB)
Server #2
u0017649@sys-90754:~$ docker run -ti --rm bsyu/caffe-ibm:v0.2 bash
root@bfb4da5e6b06:/# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 02:42:0a:ac:05:02
inet addr:10.172.5.2 Bcast:0.0.0.0 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:86 errors:0 dropped:0 overruns:0 frame:0
TX packets:106 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:12926 (12.9 KB) TX bytes:11738 (11.7 KB)
Let's ping from server #2's container to server #1's container, and then ssh. (Of course, openssh-server must be installed and root login enabled in server #1's container beforehand; see the note at the end of this post.)
root@bfb4da5e6b06:/# ping 10.172.61.2
PING 10.172.61.2 (10.172.61.2) 56(84) bytes of data.
64 bytes from 10.172.61.2: icmp_seq=1 ttl=62 time=1.01 ms
64 bytes from 10.172.61.2: icmp_seq=2 ttl=62 time=0.739 ms
^C
--- 10.172.61.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.739/0.877/1.016/0.141 ms
root@bfb4da5e6b06:/# ssh 10.172.61.2
root@10.172.61.2's password:
Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-104-generic ppc64le)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Last login: Thu Jan 11 01:44:26 2018 from 10.172.5.2
root@73b40d9d2a5b:~# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 02:42:0a:ac:3d:02
inet addr:10.172.61.2 Bcast:0.0.0.0 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:1030 errors:0 dropped:0 overruns:0 frame:0
TX packets:904 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3410576 (3.4 MB) TX bytes:91883 (91.8 KB)
root@73b40d9d2a5b:~# df -h
Filesystem Size Used Avail Use% Mounted on
none 35G 27G 6.6G 80% /
tmpfs 2.0G 0 2.0G 0% /dev
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/sda2 35G 27G 6.6G 80% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 2.0G 0 2.0G 0% /sys/firmware
You can see everything works. File transfer by scp also works fine, as follows.
root@73b40d9d2a5b:~# echo "I love donut" > /tmp/docker1
root@73b40d9d2a5b:~# logout
Connection to 10.172.61.2 closed.
root@bfb4da5e6b06:/# scp 10.172.61.2:/tmp/docker1 /tmp
root@10.172.61.2's password:
docker1 100% 13 0.0KB/s 00:00
root@bfb4da5e6b06:/# cat /tmp/docker1
I love donut
After installing traceroute in container #2, the route to container #1 looks like this.
root@bfb4da5e6b06:/# apt-get install traceroute
root@bfb4da5e6b06:/# traceroute 10.172.61.2
traceroute to 10.172.61.2 (10.172.61.2), 30 hops max, 60 byte packets
1 10.172.5.1 (10.172.5.1) 0.215 ms 0.030 ms 0.026 ms
2 10.172.61.0 (10.172.61.0) 3.198 ms 3.390 ms 3.353 ms
3 10.172.61.2 (10.172.61.2) 3.654 ms 3.926 ms 3.890 ms
That is, traffic first leaves server #2 through docker0 and out via flannel.1, then passes through server #1's flannel.1 into its docker0.
u0017649@sys-90754:~$ ip -4 a|grep inet
inet 127.0.0.1/8 scope host lo
inet 172.29.160.207/18 brd 172.29.191.255 scope global ibmveth0
inet 10.172.5.0/32 scope global flannel.1
inet 10.172.5.1/24 scope global docker0
u0017649@sys-90725:~$ ip -4 a|grep inet
inet 127.0.0.1/8 scope host lo
inet 172.29.160.221/18 brd 172.29.191.255 scope global ibmveth0
inet 10.172.61.0/32 scope global flannel.1
inet 10.172.61.1/24 scope global docker0
----------------------------------------
For reference, here is how to start sshd and allow root login inside a docker container.
root@73b40d9d2a5b:/# apt-get install openssh-server
root@73b40d9d2a5b:/# vi /etc/ssh/sshd_config
#PermitRootLogin prohibit-password
PermitRootLogin yes
root@73b40d9d2a5b:/# /etc/init.d/ssh start
Thursday, June 8, 2017
Training Inception v3 with a tensorflow docker image under LSF on POWER8
Here we build an LSF cluster out of the servers sys-87548 and sys-87549 using the free community edition of Spectrum LSF. sys-87548 is the master and also a slave; sys-87549 is the secondary master and likewise a slave.
Download the LSF community edition from the internet and untar it as follows. Inside you can see LSF and Platform Application Center; here we install only LSF.
root@sys-87548:/home/u0017496# tar -zxvf lsfce10.1-ppc64le.tar.gz
lsfce10.1-ppc64le/
lsfce10.1-ppc64le/lsf/
lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall_linux_ppc64le.tar.Z
lsfce10.1-ppc64le/lsf/lsf10.1_lnx310-lib217-ppc64le.tar.Z
lsfce10.1-ppc64le/pac/
lsfce10.1-ppc64le/pac/pac10.1_basic_linux-ppc64le.tar.Z
root@sys-87548:/home/u0017496# cd lsfce10.1-ppc64le/lsf/
As shown below, LSF ships with two *.tar.Z files, but only lsf10.1_lsfinstall_linux_ppc64le.tar.Z needs to be uncompressed. Leave lsf10.1_lnx310-lib217-ppc64le.tar.Z compressed as it is.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# ls
lsf10.1_lnx310-lib217-ppc64le.tar.Z lsf10.1_lsfinstall_linux_ppc64le.tar.Z
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# zcat lsf10.1_lsfinstall_linux_ppc64le.tar.Z | tar xvf -
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# cd lsf10.1_lsfinstall
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ls
conf_tmpl install.config lap lsf_unix_install.pdf patchlib README rpm slave.config
hostsetup instlib lsfinstall patchinstall pversions rhostsetup scripts
First, edit install.config as follows. Per the order in LSF_MASTER_LIST, sys-87548 is the primary master and sys-87549 the secondary master; per LSF_ADD_SERVERS, sys-87548 and sys-87549 are each host servers, i.e. slaves. If you have only one server, that one server can be the master and the single slave at the same time.
[root@sys-87538 lsf10.1_lsfinstall]# vi install.config
LSF_TOP="/usr/share/lsf"
LSF_ADMINS="u0017496"
LSF_CLUSTER_NAME="cluster1"
LSF_MASTER_LIST="sys-87548 sys-87549"
LSF_TARDIR="/home/u0017496/lsfce10.1-ppc64le/lsf"
# CONFIGURATION_TEMPLATE="DEFAULT|PARALLEL|HIGH_THROUGHPUT"
LSF_ADD_SERVERS="sys-87548 sys-87549"
Start the installation with the lsfinstall command as follows. During the install it asks for the location of the LSF distribution tar file; that is the file we left compressed earlier. Just press 1 to take the default.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./lsfinstall -f install.config
...
Press Enter to continue viewing the license agreement, or
enter "1" to accept the agreement, "2" to decline it, "3"
to print it, "4" to read non-IBM terms, or "99" to go back
to the previous screen.
1
...
Searching LSF 10.1 distribution tar files in /home/u0017496/lsfce10.1-ppc64le/lsf Please wait ...
1) linux3.10-glibc2.17-ppc64le
Press 1 or Enter to install this host type: 1
Once the LSF engine install finishes, run the following command to set this server up as a host server, i.e. a slave. --boot="y" means the LSF daemons start automatically when the server boots.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./hostsetup --top="/usr/share/lsf" --boot="y"
Logging installation sequence in /usr/share/lsf/log/Install.log
------------------------------------------------------------
L S F H O S T S E T U P U T I L I T Y
------------------------------------------------------------
This script sets up local host (LSF server, client or slave) environment.
Setting up LSF server host "sys-87548" ...
Checking LSF installation for host "sys-87548" ... Done
grep: /etc/init/rc-sysinit.conf: No such file or directory
Copying /etc/init.d/lsf, /etc/rc3.d/S95lsf and /etc/rc3.d/K05lsf
Installing LSF RC scripts on host "sys-87548" ... Done
LSF service ports are defined in /usr/share/lsf/conf/lsf.conf.
Checking LSF service ports definition on host "sys-87548" ... Done
You are installing IBM Spectrum LSF - Community Edition.
... Setting up LSF server host "sys-87548" is done
... LSF host setup is done.
Then add an entry to the .bashrc of root and of each LSF user so that profile.lsf is sourced.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# vi /root/.bashrc
. /usr/share/lsf/conf/profile.lsf
u0017496@sys-87481:~$ vi ~/.bashrc
. /usr/share/lsf/conf/profile.lsf
Also set LSF_RSH in lsf.conf as below so that the servers use ssh rather than rsh among themselves.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# vi /usr/share/lsf/conf/lsf.conf
LSF_RSH=ssh
And register the LSF users in /etc/lsf.sudoers.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# sudo vi /etc/lsf.sudoers
LSB_PRE_POST_EXEC_USER=u0017496
LSF_STARTUP_PATH=/usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/etc
LSF_STARTUP_USERS="u0017496"
The lsfadmin user, and root as well, must be able to ssh without a password both to the local server itself and to the other server.
u0017496@sys-87548:~$ ssh-keygen -t rsa
u0017496@sys-87548:~$ ssh-copy-id sys-87548
u0017496@sys-87548:~$ ssh-copy-id sys-87549
root@sys-87548:/home/u0017496# ssh-keygen -t rsa
root@sys-87548:/home/u0017496# ls -l /root/.ssh
total 12
-rw------- 1 root root 1595 Jun 8 03:43 authorized_keys
-rw------- 1 root root 1679 Jun 8 03:42 id_rsa
-rw-r--r-- 1 root root 396 Jun 8 03:42 id_rsa.pub
root@sys-87548:/home/u0017496# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
root@sys-87548:/home/u0017496# cp /root/.ssh/id_rsa.pub /tmp/sys-87548.id_rsa.pub
root@sys-87548:/home/u0017496# chmod 666 /tmp/sys-87548.id_rsa.pub
root@sys-87548:/home/u0017496# exit
u0017496@sys-87548:~$ scp /tmp/sys-87548.id_rsa.pub sys-87549:/tmp
sys-87548.id_rsa.pub 100% 396 0.4KB/s 00:00
root@sys-87548:/home/u0017496# cat /tmp/sys-87549.id_rsa.pub >> /root/.ssh/authorized_keys
Now start the LSF daemons with the lsfstartup command.
root@sys-87548:/home/u0017496# lsfstartup
root@sys-87548:/home/u0017496# lsid
IBM Spectrum LSF Community Edition 10.1.0.0, Jun 15 2016
Copyright IBM Corp. 1992, 2016. All rights reserved.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
My cluster name is cluster1
My master name is sys-87548.dal-ebis.ihost.com
Note that since LSF is installed only on sys-87548 so far, the LSF daemons see sys-87549 as down.
u0017496@sys-87548:~$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
sys-87548.dal-ebis ok - 1 0 0 0 0 0
sys-87549.dal-ebis unavail - 1 0 0 0 0 0
Now submit a job that trains inception v3 using a tensorflow docker image with the bsub command, as follows. By default these jobs go into the normal queue. Issue the same command twice more, so that three training jobs sit in the queue.
u0017496@sys-87548:~$ bsub -n 1 sudo docker run --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <1> is submitted to queue <normal>.
You can see that 1 of the 3 jobs is running and 2 are pending.
u0017496@sys-87548:~$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 0 0 0 0
night 40 Open:Active - - - - 0 0 0 0
short 35 Open:Active - - - - 0 0 0 0
normal 30 Open:Active - - - - 3 2 1 0
interactive 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0
Now submit the same job again, this time to the short queue instead of the default normal queue.
u0017496@sys-87548:~$ bsub -n 1 -q short sudo docker run --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <4> is submitted to queue <short>.
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
3 u001749 PEND normal sys-87548.d help Jun 8 04:02
Job 3, which was submitted by mistake, can be killed with the bkill command below.
u0017496@sys-87548:~$ bkill 3
Job <3> is being terminated
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
A job in the middle of running can be temporarily suspended with the bstop command. Its in-flight state is written to temporary files on disk so it can be resumed later.
u0017496@sys-87548:~$ bstop 1
Job <1> is being stopped
u0017496@sys-87548:~$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 0 0 0 0
night 40 Open:Active - - - - 0 0 0 0
short 35 Open:Active - - - - 1 1 0 0
normal 30 Open:Active - - - - 2 1 0 1
interactive 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 USUSP normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
A suspended job can be run again with the bresume command. This causes some I/O while the temporary files are read back from disk.
u0017496@sys-87548:~$ bresume 1
Job <1> is being resumed
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
u0017496@sys-87548:~$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 0 0 0 0
night 40 Open:Active - - - - 0 0 0 0
short 35 Open:Active - - - - 1 1 0 0
normal 30 Open:Active - - - - 2 1 1 0
interactive 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0
u0017496@sys-87548:~$ bsub -n 1 -q short sudo docker run --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <5> is submitted to queue <short>.
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
5 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:14
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
A pending job that needs to be handled quickly can be moved to the top of the waiting queue with the btop command.
u0017496@sys-87548:~$ btop 5
Job <5> has been moved to position 1 from top.
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
5 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:14
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
The standard output of a currently running job can be peeked at with the bpeek command. It is not necessarily displayed in real time; some buffering may have to accumulate before output appears.
u0017496@sys-87548:~$ bpeek 1
<< output from stdout >>
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: d61a01411a1c
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1065] LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1066] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (d61a01411a1c): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187] StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 07:56:19.985253: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 07:59:19.471085: step 0, loss = 2.89 (0.1 examples/sec; 55.400 sec/batch)
2017-06-08 08:04:13.045726: step 10, loss = 2.46 (0.4 examples/sec; 19.288 sec/batch)
2017-06-08 08:07:22.977419: step 20, loss = 2.49 (0.4 examples/sec; 19.100 sec/batch)
2017-06-08 08:10:37.519711: step 30, loss = 2.18 (0.4 examples/sec; 20.120 sec/batch)
2017-06-08 08:13:48.625858: step 40, loss = 2.28 (0.4 examples/sec; 20.342 sec/batch)
The bhist command shows each job's history briefly, or in detail with the -l option.
u0017496@sys-87548:~$ bhist
Summary of time in seconds spent in various states:
JOBID USER JOB_NAME PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
2 u001749 *_size=8 1275 0 106 0 0 0 1381
4 u001749 *_size=8 775 0 0 0 0 0 775
5 u001749 *_size=8 321 0 0 0 0 0 321
u0017496@sys-87548:~$ bhist -l
Job <2>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
nial /home/inception/models/inception/bazel-bin/inception/
flowers_train --train_dir=/home/inception/models/inception
/train --data_dir=/home/inception/models/inception/data --
pretrained_model_checkpoint_path=/home/inception/inception
-v3/model.ckpt-157585 --fine_tune=True --initial_learning_
rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
um_gpus 1 --batch_size=8>
Thu Jun 8 03:56:31: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
ue <normal>, CWD <$HOME>;
Thu Jun 8 04:13:10: Job moved to position 1 relative to <top> by user or admin
istrator <u0017496>;
Thu Jun 8 04:17:46: Dispatched 1 Task(s) on Host(s) <sys-87548.dal-ebis.ihost.
com>, Allocated 1 Slot(s) on Host(s) <sys-87548.dal-ebis.i
host.com>, Effective RES_REQ <select[type == local] order[
r15s:pg] >;
Thu Jun 8 04:17:46: Starting (Pid 25530);
Thu Jun 8 04:17:46: Running with execution home </home/u0017496>, Execution CW
D </home/u0017496>, Execution Pid <25530>;
Summary of time in seconds spent in various states by Thu Jun 8 04:19:36
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1275 0 110 0 0 0 1385
------------------------------------------------------------------------------
Job <4>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
nial /home/inception/models/inception/bazel-bin/inception/
flowers_train --train_dir=/home/inception/models/inception
/train --data_dir=/home/inception/models/inception/data --
pretrained_model_checkpoint_path=/home/inception/inception
-v3/model.ckpt-157585 --fine_tune=True --initial_learning_
rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
um_gpus 1 --batch_size=8>
Thu Jun 8 04:06:37: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
ue <short>, CWD <$HOME>;
Summary of time in seconds spent in various states by Thu Jun 8 04:19:36
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
779 0 0 0 0 0 779
------------------------------------------------------------------------------
Job <5>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
nial /home/inception/models/inception/bazel-bin/inception/
flowers_train --train_dir=/home/inception/models/inception
/train --data_dir=/home/inception/models/inception/data --
pretrained_model_checkpoint_path=/home/inception/inception
-v3/model.ckpt-157585 --fine_tune=True --initial_learning_
rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
um_gpus 1 --batch_size=8>
Thu Jun 8 04:14:11: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
ue <short>, CWD <$HOME>;
Thu Jun 8 04:14:34: Job moved to position 1 relative to <top> by user or admin
istrator <u0017496>;
Summary of time in seconds spent in various states by Thu Jun 8 04:19:36
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
325 0 0 0 0 0 325
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
2 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:56
5 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:14
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
u0017496@sys-87548:~$ bjobs -a
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
2 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:56
5 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:14
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
3 u001749 EXIT normal sys-87548.d - help Jun 8 04:02
1 u001749 DONE normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
Now bring the second server, sys-87549, online as well: install LSF using the same install.config and run the same hostsetup command. Looking at bhosts, you can see that sys-87548, which is currently running a job and has no spare capacity, shows as closed.
u0017496@sys-87548:~$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
sys-87548.dal-ebis closed - 1 1 1 0 0 0
sys-87549.dal-ebis ok - 1 0 0 0 0 0
u0017496@sys-87548:~$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 0 0 0 0
night 40 Open:Active - - - - 0 0 0 0
short 35 Open:Active - - - - 1 0 1 0
normal 30 Open:Active - - - - 0 0 0 0
interactive 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5 u001749 RUN short sys-87548.d sys-87548.d *ch_size=8 Jun 8 04:14
In this state, submit 4 more training jobs using the docker image with bsub.
u0017496@sys-87548:~$ bsub -n 1 sudo docker run --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <8> is submitted to default queue <normal>.
... (repeated 4 times)
Job <11> is submitted to default queue <normal>.
Since there are now two host servers, you can see two jobs running at the same time.
u0017496@sys-87548:~$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 0 0 0 0
night 40 Open:Active - - - - 0 0 0 0
short 35 Open:Active - - - - 0 0 0 0
normal 30 Open:Active - - - - 4 2 2 0
interactive 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0
u0017496@sys-87548:~$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
sys-87548.dal-ebis closed - 1 1 1 0 0 0
sys-87549.dal-ebis closed - 1 1 1 0 0 0
With bjobs you can see that job 8 runs on server sys-87548 and job 9 on server sys-87549.
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
8 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 05:57
9 u001749 RUN normal sys-87548.d sys-87549.d *ch_size=8 Jun 8 05:57
10 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 05:57
11 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 05:57
u0017496@sys-87548:~$ bqueues -l normal
QUEUE: normal
-- For normal low priority jobs, running only if hosts are lightly loaded. This is the default queue.
PARAMETERS/STATISTICS
PRIO NICE STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SSUSP USUSP RSV
30 0 Open:Active - - - - 4 2 2 0 0 0
Interval for a host to accept two jobs is 0 seconds
SCHEDULING PARAMETERS
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
SCHEDULING POLICIES: FAIRSHARE NO_INTERACTIVE
USER_SHARES: [default, 1]
SHARE_INFO_FOR: normal/
USER/GROUP SHARES PRIORITY STARTED RESERVED CPU_TIME RUN_TIME ADJUST
u0017496 1 0.111 2 0 11.2 116 0.000
USERS: all
HOSTS: all
u0017496@sys-87548:~$ bjobs -l
Job <8>, User <u0017496>, Project <default>, Status <RUN>, Queue <normal>, Comm
and <sudo docker run --rm -v /home/inception:/home/incepti
on bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/
inception/bazel-bin/inception/flowers_train --train_dir=/h
ome/inception/models/inception/train --data_dir=/home/ince
ption/models/inception/data --pretrained_model_checkpoint_
path=/home/inception/inception-v3/model.ckpt-157585 --fine
_tune=True --initial_learning_rate=0.001 -input_queue_memo
ry_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>, S
hare group charged </u0017496>
Thu Jun 8 05:57:23: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
HOME>;
Thu Jun 8 05:57:24: Started 1 Task(s) on Host(s) <sys-87548.dal-ebis.ihost.com
>, Allocated 1 Slot(s) on Host(s) <sys-87548.dal-ebis.ihos
t.com>, Execution Home </home/u0017496>, Execution CWD </h
ome/u0017496>;
Thu Jun 8 05:57:40: Resource usage collected.
MEM: 33 Mbytes; SWAP: 85 Mbytes; NTHREAD: 11
PGID: 29697; PIDs: 29697 29699 29701 29702
MEMORY USAGE:
MAX MEM: 33 Mbytes; AVG MEM: 33 Mbytes
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
------------------------------------------------------------------------------
Job <9>, User <u0017496>, Project <default>, Status <RUN>, Queue <normal>, Comm
and <sudo docker run --rm -v /home/inception:/home/incepti
on bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/
inception/bazel-bin/inception/flowers_train --train_dir=/h
ome/inception/models/inception/train --data_dir=/home/ince
ption/models/inception/data --pretrained_model_checkpoint_
path=/home/inception/inception-v3/model.ckpt-157585 --fine
_tune=True --initial_learning_rate=0.001 -input_queue_memo
ry_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>, S
hare group charged </u0017496>
Thu Jun 8 05:57:27: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
HOME>;
Thu Jun 8 05:57:28: Started 1 Task(s) on Host(s) <sys-87549.dal-ebis.ihost.com
>, Allocated 1 Slot(s) on Host(s) <sys-87549.dal-ebis.ihos
t.com>, Execution Home </home/u0017496>, Execution CWD </h
ome/u0017496>;
Thu Jun 8 05:58:30: Resource usage collected.
MEM: 33 Mbytes; SWAP: 148 Mbytes; NTHREAD: 11
PGID: 14493; PIDs: 14493 14494 14496 14497
MEMORY USAGE:
MAX MEM: 33 Mbytes; AVG MEM: 23 Mbytes
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
------------------------------------------------------------------------------
Job <10>, User <u0017496>, Project <default>, Status <PEND>, Queue <normal>, Co
mmand <sudo docker run --rm -v /home/inception:/home/incep
tion bsyu/tensor_r1.0:ppc64le-xenial /home/inception/model
s/inception/bazel-bin/inception/flowers_train --train_dir=
/home/inception/models/inception/train --data_dir=/home/in
ception/models/inception/data --pretrained_model_checkpoin
t_path=/home/inception/inception-v3/model.ckpt-157585 --fi
ne_tune=True --initial_learning_rate=0.001 -input_queue_me
mory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>
Thu Jun 8 05:57:31: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
HOME>;
PENDING REASONS:
Job slot limit reached: 2 hosts;
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
------------------------------------------------------------------------------
Job <11>, User <u0017496>, Project <default>, Status <PEND>, Queue <normal>, Co
mmand <sudo docker run --rm -v /home/inception:/home/incep
tion bsyu/tensor_r1.0:ppc64le-xenial /home/inception/model
s/inception/bazel-bin/inception/flowers_train --train_dir=
/home/inception/models/inception/train --data_dir=/home/in
ception/models/inception/data --pretrained_model_checkpoin
t_path=/home/inception/inception-v3/model.ckpt-157585 --fi
ne_tune=True --initial_learning_rate=0.001 -input_queue_me
mory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>
Thu Jun 8 05:57:33: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
HOME>;
PENDING REASONS:
Job slot limit reached: 2 hosts;
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
Building a tensorflow docker image on POWER8 and running Inception v3 training
First, pull the ppc64le Ubuntu 16.04 LTS docker image with cuda8 and the cudnn5-devel package installed, which I had made earlier and pushed to docker hub.
root@sys-87548:/home/u0017496# docker pull bsyu/cuda8-cudnn5-devel:cudnn5-devel
cudnn5-devel: Pulling from bsyu/cuda8-cudnn5-devel
ffa99da61f7b: Extracting 41.78 MB/72.3 MB
6b239e02a89e: Download complete
aecbc9abccdc: Downloading 110.8 MB/415.3 MB
8f458a3f0497: Download complete
4903f7ce6675: Download complete
0c588ac98d19: Downloading 107 MB/450.9 MB
12e624e884fc: Download complete
18dd28bbb571: Downloading 45.37 MB/103.2 MB
...
Based on this, build a docker image with the tensorflow shipped in PowerAI installed. First create a dockerfile as follows.
root@sys-87548:/home/u0017496# vi dockerfile.tensorflow
FROM bsyu/cuda8-cudnn5-devel:cudnn5-devel
RUN apt-get update && apt-get install -y nvidia-modprobe
RUN mkdir /tmp/temp
COPY libcudnn5* /tmp/temp/
COPY cuda-repo* /tmp/temp/
COPY mldl-repo* /tmp/temp/
RUN dpkg -i /tmp/temp/cuda-repo*deb && \
dpkg -i /tmp/temp/libcudnn5*deb && \
dpkg -i /tmp/temp/mldl-repo*deb && \
rm -rf /tmp/temp && \
apt-get update && apt-get install -y tensorflow && \
rm -rf /var/lib/apt/lists/* && \
dpkg -r mldl-repo-local
# set the working directory
WORKDIR /opt/DL/caffe-nv/bin
ENV LD_LIBRARY_PATH="/usr/local/nvidia/lib64:/usr/local/cuda-8.0/targets/ppc64le-linux/lib/stubs:/usr/lib/powerpc64le-linux-gnu/stubs:/usr/lib/powerpc64le-linux-gnu:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib"
ENV PATH="/opt/ibm/xlC/current/bin:/opt/ibm/xlf/current/bin:/opt/at10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/DL/bazel/bin:/opt/DL/tensorflow/bin"
ENV PYTHONPATH="/opt/DL/tensorflow/lib/python2.7/site-packages"
CMD /bin/bash
Now build the docker image bsyu/tensor_r1.0:ppc64le-xenial from this dockerfile.
root@sys-87548:/home/u0017496# docker build -t bsyu/tensor_r1.0:ppc64le-xenial -f dockerfile.tensorflow .
Sending build context to Docker daemon 3.436 GB
Step 1 : FROM bsyu/cuda8-cudnn5-devel:cudnn5-devel
---> d8d0da2fbdf2
Step 2 : RUN apt-get update && apt-get install -y nvidia-modprobe
---> Running in 204fe4e2c5f6
Ign:1 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el InRelease
Get:2 http://ports.ubuntu.com/ubuntu-ports xenial InRelease [247 kB]
Get:3 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el Release [565 B]
Get:4 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el Release.gpg [819 B]
Get:5 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el Packages [24.9 kB]
Get:6 http://ports.ubuntu.com/ubuntu-ports xenial-updates InRelease [102 kB]
Get:7 http://ports.ubuntu.com/ubuntu-ports xenial-security InRelease [102 kB]
Get:8 http://ports.ubuntu.com/ubuntu-ports xenial/main ppc64el Packages [1470 kB]
Get:9 http://ports.ubuntu.com/ubuntu-ports xenial/universe ppc64el Packages [9485 kB]
Get:10 http://ports.ubuntu.com/ubuntu-ports xenial/multiverse ppc64el Packages [152 kB]
Get:11 http://ports.ubuntu.com/ubuntu-ports xenial-updates/main ppc64el Packages [613 kB]
Get:12 http://ports.ubuntu.com/ubuntu-ports xenial-updates/universe ppc64el Packages [528 kB]
Get:13 http://ports.ubuntu.com/ubuntu-ports xenial-updates/multiverse ppc64el Packages [5465 B]
Get:14 http://ports.ubuntu.com/ubuntu-ports xenial-security/main ppc64el Packages [286 kB]
Get:15 http://ports.ubuntu.com/ubuntu-ports xenial-security/universe ppc64el Packages [138 kB]
Fetched 13.2 MB in 10s (1230 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
nvidia-modprobe
0 upgraded, 1 newly installed, 0 to remove and 83 not upgraded.
Need to get 16.3 kB of archives.
After this operation, 85.0 kB of additional disk space will be used.
Get:1 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el nvidia-modprobe 375.51-0ubuntu1 [16.3 kB]
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
Fetched 16.3 kB in 0s (191 kB/s)
Selecting previously unselected package nvidia-modprobe.
(Reading database ... 17174 files and directories currently installed.)
Preparing to unpack .../nvidia-modprobe_375.51-0ubuntu1_ppc64el.deb ...
Unpacking nvidia-modprobe (375.51-0ubuntu1) ...
Setting up nvidia-modprobe (375.51-0ubuntu1) ...
---> 5411319bbc05
Removing intermediate container 204fe4e2c5f6
Step 3 : RUN mkdir /tmp/temp
---> Running in cf13b03845f1
---> 66b2b250777f
Removing intermediate container cf13b03845f1
Step 4 : COPY libcudnn5* /tmp/temp/
---> 16d921e53451
Removing intermediate container 9d1efa9ed269
Step 5 : COPY cuda-repo* /tmp/temp/
...
Step 9 : ENV LD_LIBRARY_PATH "/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib"
---> Running in fe30af7c944e
---> f5faa1760ac7
Removing intermediate container fe30af7c944e
Step 10 : ENV PATH "/opt/ibm/xlC/current/bin:/opt/ibm/xlf/current/bin:/opt/at10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/DL/bazel/bin:/opt/DL/tensorflow/bin"
---> Running in 98a0e5bfd008
---> 7cfb0feaaee1
Removing intermediate container 98a0e5bfd008
Step 11 : ENV PYTHONPATH "/opt/DL/tensorflow/lib/python2.7/site-packages"
---> Running in d98d5352108e
---> affda7b26276
Removing intermediate container d98d5352108e
Step 12 : CMD /bin/bash
---> Running in d54a20fb7e3c
---> 4692368fb7ad
Removing intermediate container d54a20fb7e3c
Successfully built 4692368fb7ad
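Incidentally, the debconf warnings seen during Step 2 ("unable to initialize frontend: Dialog") are harmless; they only mean there is no interactive terminal during the build. If you want to silence them, a common trick is to set DEBIAN_FRONTEND for the duration of the build, for example by adding a line like this near the top of the dockerfile (an optional tweak, not used in the image built above):

ARG DEBIAN_FRONTEND=noninteractive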
Check the docker image that was just built.
root@sys-87548:/home/u0017496# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
bsyu/tensor_r1.0 ppc64le-xenial 4692368fb7ad 3 minutes ago 6.448 GB
nvidia-docker deb 2830f66f0418 41 hours ago 429.8 MB
nvidia-docker build fa764787622c 41 hours ago 1.014 GB
ppc64le/ubuntu 14.04 0e6701cbf611 2 weeks ago 228.5 MB
bsyu/cuda8-cudnn5-devel cudnn5-devel d8d0da2fbdf2 4 months ago 1.895 GB
ppc64le/golang 1.6.3 6a579d02d32f 9 months ago 704.7 MB
golang 1.5 99668503de15 10 months ago 725.3 MB
We push this docker image to docker hub so that it can be used later on other servers as well.
root@sys-87548:/home/u0017496# docker push bsyu/tensor_r1.0:ppc64le-xenial
The push refers to a repository [docker.io/bsyu/tensor_r1.0]
f42db0829239: Pushed
6a6b4d4d9d2a: Pushing 184.1 MB/2.738 GB
6458d0633f20: Pushing 172.7 MB/390.2 MB
726e25ffdf3c: Pushing 173.2 MB/1.321 GB
1535936ab54b: Pushed
bc0917851737: Pushed
9a1e25cd5998: Pushed
c0fe73e43621: Mounted from bsyu/cuda8-cudnn5-devel
4ce979019d1d: Mounted from bsyu/cuda8-cudnn5-devel
724befd94678: Mounted from bsyu/cuda8-cudnn5-devel
84f99f1bf79b: Mounted from bsyu/cuda8-cudnn5-devel
7f7c1dccec82: Mounted from bsyu/cuda8-cudnn5-devel
5b8880a35736: Mounted from bsyu/cuda8-cudnn5-devel
41b97cb9a404: Mounted from bsyu/cuda8-cudnn5-devel
08f34ce6b3fb: Mounted from bsyu/cuda8-cudnn5-devel
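Note that docker push only works after logging in to docker hub; if you try this on a fresh machine and the push is denied with an authentication error, log in first (account bsyu in this example) and retry:

$ docker login -u bsyu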
Now that the docker image is ready, we prepare to run inception v3 with tensorflow. Using bazel, we build the python package that runs the inception v3 training under the /home/inception directory. We will later mount this directory inside the docker container.
root@sys-87548:/home# mkdir inception
root@sys-87548:/home# export INCEPTION_DIR=/home/inception
root@sys-87548:/home# cd inception/
root@sys-87548:/home/inception# curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 380M 100 380M 0 0 4205k 0 0:01:32 0:01:32 --:--:-- 4988k
root@sys-87548:/home/inception# tar -xvf inception-v3-2016-03-01.tar.gz
inception-v3/
inception-v3/checkpoint
inception-v3/README.txt
inception-v3/model.ckpt-157585
root@sys-87548:/home/inception# git clone https://github.com/tensorflow/models.git
Cloning into 'models'...
remote: Counting objects: 4866, done.
remote: Total 4866 (delta 0), reused 0 (delta 0), pack-reused 4866
Receiving objects: 100% (4866/4866), 153.36 MiB | 5.23 MiB/s, done.
Resolving deltas: 100% (2467/2467), done.
Checking connectivity... done.
root@sys-87548:/home/inception# export FLOWERS_DIR=/home/inception/models/inception
root@sys-87548:/home/inception# mkdir -p $FLOWERS_DIR/data
root@sys-87548:/home/inception# cd models/inception/
root@sys-87548:/home/inception/models/inception# . /opt/DL/bazel/bin/bazel-activate
root@sys-87548:/home/inception/models/inception# . /opt/DL/tensorflow/bin/tensorflow-activate
root@sys-87548:/home/inception/models/inception# export TEST_TMPDIR=/home/inception/.cache
root@sys-87548:/home/inception/models/inception# bazel build inception/download_and_preprocess_flowers
INFO: $TEST_TMPDIR defined: output root default is '/home/inception/.cache'.
Extracting Bazel installation...
..............
INFO: Found 1 target...
Target //inception:download_and_preprocess_flowers up-to-date:
bazel-bin/inception/download_and_preprocess_flowers
INFO: Elapsed time: 5.831s, Critical Path: 0.02s
root@sys-87548:/home/inception/models/inception# ls -l
total 76
lrwxrwxrwx 1 root root 116 Jun 8 02:36 bazel-bin -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/bin
lrwxrwxrwx 1 root root 121 Jun 8 02:36 bazel-genfiles -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/genfiles
lrwxrwxrwx 1 root root 86 Jun 8 02:36 bazel-inception -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception
lrwxrwxrwx 1 root root 96 Jun 8 02:36 bazel-out -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out
lrwxrwxrwx 1 root root 121 Jun 8 02:36 bazel-testlogs -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/testlogs
drwxr-xr-x 2 root root 4096 Jun 8 02:32 data
drwxr-xr-x 2 root root 4096 Jun 8 02:29 g3doc
drwxr-xr-x 4 root root 4096 Jun 8 02:29 inception
-rw-r--r-- 1 root root 38480 Jun 8 02:29 README.md
-rw-r--r-- 1 root root 30 Jun 8 02:29 WORKSPACE
root@sys-87548:/home/inception/models/inception# bazel-bin/inception/download_and_preprocess_flowers $FLOWERS_DIR/data
Downloading flower data set.
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 218M 100 218M 0 0 4649k 0 0:00:48 0:00:48 --:--:-- 5105k
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
...
Found 3170 JPEG files across 5 labels inside /home/u0017496/inception/models/inception/data/raw-data/train.
Launching 2 threads for spacings: [[0, 1585], [1585, 3170]]
2017-06-08 02:01:56.169564 [thread 1]: Processed 1000 of 1585 images in thread batch.
2017-06-08 02:01:56.268917 [thread 0]: Processed 1000 of 1585 images in thread batch.
2017-06-08 02:02:01.252583 [thread 1]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00001-of-00002
2017-06-08 02:02:01.252638 [thread 1]: Wrote 1585 images to 1585 shards.
2017-06-08 02:02:01.306138 [thread 0]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00000-of-00002
2017-06-08 02:02:01.306178 [thread 0]: Wrote 1585 images to 1585 shards.
2017-06-08 02:02:01.578737: Finished writing all 3170 images in data set.
Inception v3 here is a model that recognizes flower photos by category; the downloaded flower data set contains the following five classes.
root@sys-87548:/home/inception/models/inception# du -sm data/raw-data/train/*
29 data/raw-data/train/daisy
43 data/raw-data/train/dandelion
1 data/raw-data/train/LICENSE.txt
34 data/raw-data/train/roses
47 data/raw-data/train/sunflowers
48 data/raw-data/train/tulips
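If you want to confirm that the preprocessed TFRecord shards were actually written, you can simply list them in the data directory (using the FLOWERS_DIR variable set above; the train shards are the ones shown being written in the log):

$ ls $FLOWERS_DIR/data/train-*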
Now we build flowers_train itself with bazel.
root@sys-87548:/home/inception/models/inception# bazel build inception/flowers_train
INFO: Found 1 target...
Target //inception:flowers_train up-to-date:
bazel-bin/inception/flowers_train
INFO: Elapsed time: 0.311s, Critical Path: 0.03s
Preparation is now complete. Before running it in docker, let's first run this flowers_train directly on the host.
root@sys-87548:/home/inception/models/inception# time bazel-bin/inception/flowers_train --train_dir=$FLOWERS_DIR/train --data_dir=$FLOWERS_DIR/data --pretrained_model_checkpoint_path=$INCEPTION_DIR/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
CUDA is installed on this server but there is no GPU, so the error messages below are expected. With no GPU present, training simply runs on the CPU. Since it takes 20 minutes or more, we will interrupt it partway through.
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
NVIDIA: no NVIDIA devices found
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (sys-87548): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187] StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 02:41:53.587744: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 02:44:28.213350: step 0, loss = 2.85 (0.2 examples/sec; 38.569 sec/batch)
...
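You can double-check from the parent OS that this VM really has no NVIDIA driver loaded, which is exactly what the kernel-driver message above refers to (illustrative command and output on this GPU-less host):

$ ls /proc/driver/nvidia/version
ls: cannot access '/proc/driver/nvidia/version': No such file or directory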
Now let's run inception v3 using the docker image we built earlier, bsyu/tensor_r1.0:ppc64le-xenial. The actual flowers_train lives under /home/inception, so we mount that directory into the container with the -v option.
root@sys-87548:/home/inception/models/inception# docker run --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
...
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (b85c9a819a6a): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187] StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 06:48:27.996200: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 06:51:10.935895: step 0, loss = 2.83 (0.2 examples/sec; 39.389 sec/batch)
2017-06-08 06:56:21.408996: step 10, loss = 2.55 (0.4 examples/sec; 19.373 sec/batch)
2017-06-08 06:59:29.431547: step 20, loss = 2.33 (0.4 examples/sec; 19.856 sec/batch)
2017-06-08 07:02:36.828205: step 30, loss = 2.33 (0.4 examples/sec; 19.014 sec/batch)
2017-06-08 07:05:46.372759: step 40, loss = 2.17 (0.4 examples/sec; 18.428 sec/batch)
You can see that it runs fine. While it runs, watching from the parent OS with nmon shows python using most of the CPU. As the ps output below shows, the parent of this python process is the docker-containerd-shim for the container, i.e. it is managed by the docker daemon.
root@sys-87548:/home/u0017496# ps -ef | grep 14190 | grep -v grep
root 14190 14173 78 02:46 ? 00:00:53 /usr/bin/python /home/inception/models/inception/bazel-bin/inception/flowers_train.runfiles/inception/inception/flowers_train.py --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
root@sys-87548:/home/u0017496# ps -ef | grep 14173 | grep -v grep
root 14173 15050 0 02:46 ? 00:00:00 docker-containerd-shim b85c9a819a6a497466ea5036a16abc036f0a26809be678224b59ad1b31646178 /var/run/docker/libcontainerd/b85c9a819a6a497466ea5036a16abc036f0a26809be678224b59ad1b31646178 docker-runc
root 14190 14173 80 02:46 ? 00:01:06 /usr/bin/python /home/inception/models/inception/bazel-bin/inception/flowers_train.runfiles/inception/inception/flowers_train.py --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
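The same mapping between the container and its host-side processes can also be read directly from docker itself; docker top lists a running container's processes with their PIDs as seen from the parent OS (illustrative commands, using the container id b85c9a819a6a seen above):

$ docker ps                    # find the running container's id
$ docker top b85c9a819a6a      # list its processes with host-side PIDs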
In the next posting we will train inception v3 with this docker image through LSF, on this server and on another one. In addition to this server, sys-87548, we install docker on sys-87549 as well, pull the docker image there, and copy the /home/inception directory built here to sys-87549 via scp.
root@sys-87549:/home/u0017496# docker pull bsyu/tensor_r1.0:ppc64le-xenial
root@sys-87549:/home/u0017496# scp -r sys-87548:/home/inception /home
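Before moving on, a quick sanity check on sys-87549 that both the image and the build directory arrived (illustrative commands):

$ docker images | grep tensor_r1.0
$ ls /home/inception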