Here we will build an LSF cluster from the sys-87548 and sys-87549 servers, using the free community edition of Spectrum LSF. sys-87548 is the primary master and also a slave (execution host), and sys-87549 is the secondary master and likewise a slave.
Download the LSF community edition from the internet and extract the tar archive as follows. You can see that it contains both LSF and Platform Application Center (pac). Here we install only LSF.
root@sys-87548:/home/u0017496# tar -zxvf lsfce10.1-ppc64le.tar.gz
lsfce10.1-ppc64le/
lsfce10.1-ppc64le/lsf/
lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall_linux_ppc64le.tar.Z
lsfce10.1-ppc64le/lsf/lsf10.1_lnx310-lib217-ppc64le.tar.Z
lsfce10.1-ppc64le/pac/
lsfce10.1-ppc64le/pac/pac10.1_basic_linux-ppc64le.tar.Z
root@sys-87548:/home/u0017496# cd lsfce10.1-ppc64le/lsf/
As shown below, the lsf directory contains two *.tar.Z files, but only lsf10.1_lsfinstall_linux_ppc64le.tar.Z needs to be extracted. Leave lsf10.1_lnx310-lib217-ppc64le.tar.Z compressed as it is.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# ls
lsf10.1_lnx310-lib217-ppc64le.tar.Z lsf10.1_lsfinstall_linux_ppc64le.tar.Z
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# zcat lsf10.1_lsfinstall_linux_ppc64le.tar.Z | tar xvf -
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf# cd lsf10.1_lsfinstall
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ls
conf_tmpl install.config lap lsf_unix_install.pdf patchlib README rpm slave.config
hostsetup instlib lsfinstall patchinstall pversions rhostsetup scripts
First, edit install.config as shown below. Following the order given in LSF_MASTER_LIST, sys-87548 is the primary master and sys-87549 the secondary master, while LSF_ADD_SERVERS makes sys-87548 and sys-87549 each a host server, i.e. a slave. If you have only one server, that single server can serve as both the master and the sole slave.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# vi install.config
LSF_TOP="/usr/share/lsf"
LSF_ADMINS="u0017496"
LSF_CLUSTER_NAME="cluster1"
LSF_MASTER_LIST="sys-87548 sys-87549"
LSF_TARDIR="/home/u0017496/lsfce10.1-ppc64le/lsf"
# CONFIGURATION_TEMPLATE="DEFAULT|PARALLEL|HIGH_THROUGHPUT"
LSF_ADD_SERVERS="sys-87548 sys-87549"
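If you are testing with a single server, that same host simply goes into both lists; a sketch of install.config for that case (assuming the hostname sys-87548 from above; as far as I know there is no separate single-cluster switch in this file):

```shell
# install.config for a single-server test cluster (sketch)
LSF_TOP="/usr/share/lsf"
LSF_ADMINS="u0017496"
LSF_CLUSTER_NAME="cluster1"
LSF_MASTER_LIST="sys-87548"          # one host: primary master only
LSF_TARDIR="/home/u0017496/lsfce10.1-ppc64le/lsf"
LSF_ADD_SERVERS="sys-87548"          # the same host is also the sole slave
```

The daemons still contact the host by name, so even in this case passwordless ssh to the machine itself is needed.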
Start the installation with the lsfinstall command as shown below. During the install it asks for the location of the LSF distribution tar file; this is the file we left compressed earlier. Simply enter 1 to choose the default.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./lsfinstall -f install.config
...
Press Enter to continue viewing the license agreement, or
enter "1" to accept the agreement, "2" to decline it, "3"
to print it, "4" to read non-IBM terms, or "99" to go back
to the previous screen.
1
...
Searching LSF 10.1 distribution tar files in /home/u0017496/lsfce10.1-ppc64le/lsf Please wait ...
1) linux3.10-glibc2.17-ppc64le
Press 1 or Enter to install this host type: 1
Once the LSF engine is installed, run the following command to set this server up as a host server, i.e. a slave. --boot="y" means the LSF daemons are started automatically when the server boots.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# ./hostsetup --top="/usr/share/lsf" --boot="y"
Logging installation sequence in /usr/share/lsf/log/Install.log
------------------------------------------------------------
L S F H O S T S E T U P U T I L I T Y
------------------------------------------------------------
This script sets up local host (LSF server, client or slave) environment.
Setting up LSF server host "sys-87548" ...
Checking LSF installation for host "sys-87548" ... Done
grep: /etc/init/rc-sysinit.conf: No such file or directory
Copying /etc/init.d/lsf, /etc/rc3.d/S95lsf and /etc/rc3.d/K05lsf
Installing LSF RC scripts on host "sys-87548" ... Done
LSF service ports are defined in /usr/share/lsf/conf/lsf.conf.
Checking LSF service ports definition on host "sys-87548" ... Done
You are installing IBM Spectrum LSF - Community Edition.
... Setting up LSF server host "sys-87548" is done
... LSF host setup is done.
Next, add an entry to the .bashrc of root and of each LSF user so that profile.lsf is sourced:
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# vi /root/.bashrc
. /usr/share/lsf/conf/profile.lsf
u0017496@sys-87548:~$ vi ~/.bashrc
. /usr/share/lsf/conf/profile.lsf
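If this step is scripted, a guard keeps the entry from being appended twice; a minimal sketch (the path assumes LSF_TOP=/usr/share/lsf as configured above):

```shell
# Append the profile.lsf entry to .bashrc only if it is not already there
BASHRC="$HOME/.bashrc"
ENTRY=". /usr/share/lsf/conf/profile.lsf"
grep -qxF "$ENTRY" "$BASHRC" 2>/dev/null || echo "$ENTRY" >> "$BASHRC"
```

Running it repeatedly leaves exactly one copy of the line in .bashrc.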
Also set LSF_RSH in lsf.conf as shown below, so that the servers use ssh rather than rsh between themselves.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# vi /usr/share/lsf/conf/lsf.conf
LSF_RSH=ssh
Then register the LSF users in /etc/lsf.sudoers.
root@sys-87548:/home/u0017496/lsfce10.1-ppc64le/lsf/lsf10.1_lsfinstall# sudo vi /etc/lsf.sudoers
LSB_PRE_POST_EXEC_USER=u0017496
LSF_STARTUP_PATH=/usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/etc
LSF_STARTUP_USERS="u0017496"
The LSF admin user, and root as well, must be able to ssh without a password both to the server itself and to the peer server. Set this up as follows.
u0017496@sys-87548:~$ ssh-keygen -t rsa
u0017496@sys-87548:~$ ssh-copy-id sys-87548
u0017496@sys-87548:~$ ssh-copy-id sys-87549
root@sys-87548:/home/u0017496# ssh-keygen -t rsa
root@sys-87548:/home/u0017496# ls -l /root/.ssh
total 12
-rw------- 1 root root 1595 Jun 8 03:43 authorized_keys
-rw------- 1 root root 1679 Jun 8 03:42 id_rsa
-rw-r--r-- 1 root root 396 Jun 8 03:42 id_rsa.pub
root@sys-87548:/home/u0017496# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
root@sys-87548:/home/u0017496# cp /root/.ssh/id_rsa.pub /tmp/sys-87548.id_rsa.pub
root@sys-87548:/home/u0017496# chmod 666 /tmp/sys-87548.id_rsa.pub
root@sys-87548:/home/u0017496# exit
u0017496@sys-87548:~$ scp /tmp/sys-87548.id_rsa.pub sys-87549:/tmp
sys-87548.id_rsa.pub 100% 396 0.4KB/s 00:00
After generating a key pair on sys-87549 in the same way and copying its public key back to sys-87548, append it as well:
root@sys-87548:/home/u0017496# cat /tmp/sys-87549.id_rsa.pub >> /root/.ssh/authorized_keys
Now start the LSF daemons with the lsfstartup command.
root@sys-87548:/home/u0017496# lsfstartup
root@sys-87548:/home/u0017496# lsid
IBM Spectrum LSF Community Edition 10.1.0.0, Jun 15 2016
Copyright IBM Corp. 1992, 2016. All rights reserved.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
My cluster name is cluster1
My master name is sys-87548.dal-ebis.ihost.com
Note that since LSF is installed only on sys-87548 at this point, the LSF daemons see sys-87549 as down.
u0017496@sys-87548:~$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
sys-87548.dal-ebis ok - 1 0 0 0 0 0
sys-87549.dal-ebis unavail - 1 0 0 0 0 0
Now use the bsub command to submit a job that trains Inception v3 with a tensorflow docker image. By default, these jobs go into the normal queue. Run the same command twice more, so that three training jobs sit in the queue.
u0017496@sys-87548:~$ bsub -n 1 sudo docker run --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <1> is submitted to queue <normal>.
You can see that one of the three jobs is running and two are pending.
u0017496@sys-87548:~$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 0 0 0 0
night 40 Open:Active - - - - 0 0 0 0
short 35 Open:Active - - - - 0 0 0 0
normal 30 Open:Active - - - - 3 2 1 0
interactive 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0
Now submit the same job again, this time to the short queue rather than the default normal queue.
u0017496@sys-87548:~$ bsub -n 1 -q short sudo docker run --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <4> is submitted to queue <short>.
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
3 u001749 PEND normal sys-87548.d help Jun 8 04:02
Job 3, which was submitted by mistake, can be killed with the bkill command as shown below.
u0017496@sys-87548:~$ bkill 3
Job <3> is being terminated
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
A running job can be temporarily suspended with the bstop command. By default LSF does this by sending the job's processes a stop signal (SIGSTOP), so the job's in-memory state is preserved for a later resume; it is not checkpointed to disk unless checkpointing has been configured separately.
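The mechanism behind bstop and bresume is plain signal delivery, which can be seen with ordinary shell tools (a sketch using sleep as a stand-in for a job):

```shell
# Start a stand-in "job", freeze it with SIGSTOP, then resume with SIGCONT
sleep 30 &
pid=$!
kill -STOP "$pid"
sleep 1                       # give the kernel a moment to record the state
ps -o stat= -p "$pid"         # process state contains "T" (stopped)
kill -CONT "$pid"
sleep 1
ps -o stat= -p "$pid"         # back to a state containing "S" (sleeping)
kill "$pid" 2>/dev/null       # clean up the stand-in job
```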
u0017496@sys-87548:~$ bstop 1
Job <1> is being stopped
u0017496@sys-87548:~$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 0 0 0 0
night 40 Open:Active - - - - 0 0 0 0
short 35 Open:Active - - - - 1 1 0 0
normal 30 Open:Active - - - - 2 1 0 1
interactive 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 USUSP normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
A suspended job can be run again with the bresume command, which sends the processes a continue signal (SIGCONT). If the suspended job's memory was swapped out in the meantime, some paging I/O may occur on resume.
u0017496@sys-87548:~$ bresume 1
Job <1> is being resumed
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
u0017496@sys-87548:~$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 0 0 0 0
night 40 Open:Active - - - - 0 0 0 0
short 35 Open:Active - - - - 1 1 0 0
normal 30 Open:Active - - - - 2 1 1 0
interactive 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0
u0017496@sys-87548:~$ bsub -n 1 -q short sudo docker run --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <5> is submitted to queue <short>.
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
5 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:14
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
If a pending job needs to be handled sooner, the btop command can move it to the top position in its queue.
u0017496@sys-87548:~$ btop 5
Job <5> has been moved to position 1 from top.
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
5 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:14
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
2 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 03:56
The standard output of a running job can be peeked at with the bpeek command. It is not necessarily displayed in real time; some output may appear only after a bit of buffering.
u0017496@sys-87548:~$ bpeek 1
<< output from stdout >>
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: d61a01411a1c
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1065] LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1066] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (d61a01411a1c): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187] StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 07:56:19.985253: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 07:59:19.471085: step 0, loss = 2.89 (0.1 examples/sec; 55.400 sec/batch)
2017-06-08 08:04:13.045726: step 10, loss = 2.46 (0.4 examples/sec; 19.288 sec/batch)
2017-06-08 08:07:22.977419: step 20, loss = 2.49 (0.4 examples/sec; 19.100 sec/batch)
2017-06-08 08:10:37.519711: step 30, loss = 2.18 (0.4 examples/sec; 20.120 sec/batch)
2017-06-08 08:13:48.625858: step 40, loss = 2.28 (0.4 examples/sec; 20.342 sec/batch)
The bhist command shows a brief history of each job, and the -l option shows it in detail.
u0017496@sys-87548:~$ bhist
Summary of time in seconds spent in various states:
JOBID USER JOB_NAME PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
2 u001749 *_size=8 1275 0 106 0 0 0 1381
4 u001749 *_size=8 775 0 0 0 0 0 775
5 u001749 *_size=8 321 0 0 0 0 0 321
u0017496@sys-87548:~$ bhist -l
Job <2>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
nial /home/inception/models/inception/bazel-bin/inception/
flowers_train --train_dir=/home/inception/models/inception
/train --data_dir=/home/inception/models/inception/data --
pretrained_model_checkpoint_path=/home/inception/inception
-v3/model.ckpt-157585 --fine_tune=True --initial_learning_
rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
um_gpus 1 --batch_size=8>
Thu Jun 8 03:56:31: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
ue <normal>, CWD <$HOME>;
Thu Jun 8 04:13:10: Job moved to position 1 relative to <top> by user or admin
istrator <u0017496>;
Thu Jun 8 04:17:46: Dispatched 1 Task(s) on Host(s) <sys-87548.dal-ebis.ihost.
com>, Allocated 1 Slot(s) on Host(s) <sys-87548.dal-ebis.i
host.com>, Effective RES_REQ <select[type == local] order[
r15s:pg] >;
Thu Jun 8 04:17:46: Starting (Pid 25530);
Thu Jun 8 04:17:46: Running with execution home </home/u0017496>, Execution CW
D </home/u0017496>, Execution Pid <25530>;
Summary of time in seconds spent in various states by Thu Jun 8 04:19:36
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1275 0 110 0 0 0 1385
------------------------------------------------------------------------------
Job <4>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
nial /home/inception/models/inception/bazel-bin/inception/
flowers_train --train_dir=/home/inception/models/inception
/train --data_dir=/home/inception/models/inception/data --
pretrained_model_checkpoint_path=/home/inception/inception
-v3/model.ckpt-157585 --fine_tune=True --initial_learning_
rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
um_gpus 1 --batch_size=8>
Thu Jun 8 04:06:37: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
ue <short>, CWD <$HOME>;
Summary of time in seconds spent in various states by Thu Jun 8 04:19:36
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
779 0 0 0 0 0 779
------------------------------------------------------------------------------
Job <5>, User <u0017496>, Project <default>, Command <sudo docker run --rm -v /
home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xe
nial /home/inception/models/inception/bazel-bin/inception/
flowers_train --train_dir=/home/inception/models/inception
/train --data_dir=/home/inception/models/inception/data --
pretrained_model_checkpoint_path=/home/inception/inception
-v3/model.ckpt-157585 --fine_tune=True --initial_learning_
rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --n
um_gpus 1 --batch_size=8>
Thu Jun 8 04:14:11: Submitted from host <sys-87548.dal-ebis.ihost.com>, to Que
ue <short>, CWD <$HOME>;
Thu Jun 8 04:14:34: Job moved to position 1 relative to <top> by user or admin
istrator <u0017496>;
Summary of time in seconds spent in various states by Thu Jun 8 04:19:36
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
325 0 0 0 0 0 325
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
2 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:56
5 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:14
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
u0017496@sys-87548:~$ bjobs -a
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
2 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:56
5 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:14
4 u001749 PEND short sys-87548.d *ch_size=8 Jun 8 04:06
3 u001749 EXIT normal sys-87548.d - help Jun 8 04:02
1 u001749 DONE normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 03:54
Now bring the second server, sys-87549, online as well: install LSF there with the same install.config and run the same hostsetup command. In the bhosts output you can see that sys-87548, which is currently running a job and has no spare slots, now shows as closed.
u0017496@sys-87548:~$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
sys-87548.dal-ebis closed - 1 1 1 0 0 0
sys-87549.dal-ebis ok - 1 0 0 0 0 0
u0017496@sys-87548:~$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 0 0 0 0
night 40 Open:Active - - - - 0 0 0 0
short 35 Open:Active - - - - 1 0 1 0
normal 30 Open:Active - - - - 0 0 0 0
interactive 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5 u001749 RUN short sys-87548.d sys-87548.d *ch_size=8 Jun 8 04:14
In this state, submit four more training jobs based on the docker image with the bsub command.
u0017496@sys-87548:~$ bsub -n 1 sudo docker run --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8
Job <8> is submitted to default queue <normal>.
... (repeated 4 times)
Job <11> is submitted to default queue <normal>.
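Repeating a submission like this can be scripted with a small loop; a dry-run sketch (the echo only prints the commands, and job.sh is a hypothetical wrapper around the long docker command used above):

```shell
# Submit the same job N times; with DRY_RUN=1 (the default here)
# the bsub commands are only printed, not executed.
N=4
for i in $(seq 1 "$N"); do
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "bsub -n 1 ./job.sh"
  else
    bsub -n 1 ./job.sh
  fi
done
```

Set DRY_RUN=0 to actually submit on a host where LSF is installed.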
Since there are now two host servers, you can see two jobs running at the same time.
u0017496@sys-87548:~$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 0 0 0 0
night 40 Open:Active - - - - 0 0 0 0
short 35 Open:Active - - - - 0 0 0 0
normal 30 Open:Active - - - - 4 2 2 0
interactive 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0
u0017496@sys-87548:~$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
sys-87548.dal-ebis closed - 1 1 1 0 0 0
sys-87549.dal-ebis closed - 1 1 1 0 0 0
The bjobs output shows that job 8 is running on server sys-87548 and job 9 on server sys-87549.
u0017496@sys-87548:~$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
8 u001749 RUN normal sys-87548.d sys-87548.d *ch_size=8 Jun 8 05:57
9 u001749 RUN normal sys-87548.d sys-87549.d *ch_size=8 Jun 8 05:57
10 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 05:57
11 u001749 PEND normal sys-87548.d *ch_size=8 Jun 8 05:57
u0017496@sys-87548:~$ bqueues -l normal
QUEUE: normal
-- For normal low priority jobs, running only if hosts are lightly loaded. This is the default queue.
PARAMETERS/STATISTICS
PRIO NICE STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SSUSP USUSP RSV
30 0 Open:Active - - - - 4 2 2 0 0 0
Interval for a host to accept two jobs is 0 seconds
SCHEDULING PARAMETERS
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
SCHEDULING POLICIES: FAIRSHARE NO_INTERACTIVE
USER_SHARES: [default, 1]
SHARE_INFO_FOR: normal/
USER/GROUP SHARES PRIORITY STARTED RESERVED CPU_TIME RUN_TIME ADJUST
u0017496 1 0.111 2 0 11.2 116 0.000
USERS: all
HOSTS: all
u0017496@sys-87548:~$ bjobs -l
Job <8>, User <u0017496>, Project <default>, Status <RUN>, Queue <normal>, Comm
and <sudo docker run --rm -v /home/inception:/home/incepti
on bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/
inception/bazel-bin/inception/flowers_train --train_dir=/h
ome/inception/models/inception/train --data_dir=/home/ince
ption/models/inception/data --pretrained_model_checkpoint_
path=/home/inception/inception-v3/model.ckpt-157585 --fine
_tune=True --initial_learning_rate=0.001 -input_queue_memo
ry_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>, S
hare group charged </u0017496>
Thu Jun 8 05:57:23: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
HOME>;
Thu Jun 8 05:57:24: Started 1 Task(s) on Host(s) <sys-87548.dal-ebis.ihost.com
>, Allocated 1 Slot(s) on Host(s) <sys-87548.dal-ebis.ihos
t.com>, Execution Home </home/u0017496>, Execution CWD </h
ome/u0017496>;
Thu Jun 8 05:57:40: Resource usage collected.
MEM: 33 Mbytes; SWAP: 85 Mbytes; NTHREAD: 11
PGID: 29697; PIDs: 29697 29699 29701 29702
MEMORY USAGE:
MAX MEM: 33 Mbytes; AVG MEM: 33 Mbytes
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
------------------------------------------------------------------------------
Job <9>, User <u0017496>, Project <default>, Status <RUN>, Queue <normal>, Comm
and <sudo docker run --rm -v /home/inception:/home/incepti
on bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/
inception/bazel-bin/inception/flowers_train --train_dir=/h
ome/inception/models/inception/train --data_dir=/home/ince
ption/models/inception/data --pretrained_model_checkpoint_
path=/home/inception/inception-v3/model.ckpt-157585 --fine
_tune=True --initial_learning_rate=0.001 -input_queue_memo
ry_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>, S
hare group charged </u0017496>
Thu Jun 8 05:57:27: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
HOME>;
Thu Jun 8 05:57:28: Started 1 Task(s) on Host(s) <sys-87549.dal-ebis.ihost.com
>, Allocated 1 Slot(s) on Host(s) <sys-87549.dal-ebis.ihos
t.com>, Execution Home </home/u0017496>, Execution CWD </h
ome/u0017496>;
Thu Jun 8 05:58:30: Resource usage collected.
MEM: 33 Mbytes; SWAP: 148 Mbytes; NTHREAD: 11
PGID: 14493; PIDs: 14493 14494 14496 14497
MEMORY USAGE:
MAX MEM: 33 Mbytes; AVG MEM: 23 Mbytes
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
------------------------------------------------------------------------------
Job <10>, User <u0017496>, Project <default>, Status <PEND>, Queue <normal>, Co
mmand <sudo docker run --rm -v /home/inception:/home/incep
tion bsyu/tensor_r1.0:ppc64le-xenial /home/inception/model
s/inception/bazel-bin/inception/flowers_train --train_dir=
/home/inception/models/inception/train --data_dir=/home/in
ception/models/inception/data --pretrained_model_checkpoin
t_path=/home/inception/inception-v3/model.ckpt-157585 --fi
ne_tune=True --initial_learning_rate=0.001 -input_queue_me
mory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>
Thu Jun 8 05:57:31: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
HOME>;
PENDING REASONS:
Job slot limit reached: 2 hosts;
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
------------------------------------------------------------------------------
Job <11>, User <u0017496>, Project <default>, Status <PEND>, Queue <normal>, Co
mmand <sudo docker run --rm -v /home/inception:/home/incep
tion bsyu/tensor_r1.0:ppc64le-xenial /home/inception/model
s/inception/bazel-bin/inception/flowers_train --train_dir=
/home/inception/models/inception/train --data_dir=/home/in
ception/models/inception/data --pretrained_model_checkpoin
t_path=/home/inception/inception-v3/model.ckpt-157585 --fi
ne_tune=True --initial_learning_rate=0.001 -input_queue_me
mory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8>
Thu Jun 8 05:57:33: Submitted from host <sys-87548.dal-ebis.ihost.com>, CWD <$
HOME>;
PENDING REASONS:
Job slot limit reached: 2 hosts;
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
Reader question: For a single-server test setup with the community edition, is it enough to put the same hostname in install.config, e.g. LSF_MASTER_LIST="sys-87548" and LSF_ADD_SERVERS="sys-87548"? Is there a separate single-cluster setting in the config? With the settings above, the daemons try to ssh to the server itself when they start.