Deep Learning for HW Engineers: LSF


Wednesday, April 8, 2020

LSF Q&A: suspending and resuming jobs, and the ports to open in a firewalled environment

Q1. Can a job that uses GPUs also be suspended and resumed?

: I have heard that when a far more urgent and important job B comes up while an existing job A is already using all the resources, so that nothing is available right away, LSF can temporarily suspend job A (which is in the RUNNING state), run job B on the resources freed that way, and resume job A once job B has finished. That presumably describes jobs running on CPUs. Does suspend-resume also work well for jobs running on GPUs?

A1. It is possible, but for deep learning training jobs such as TensorFlow it is of little practical use.

: As you already know, the bkill command literally kills an LSF job that has been submitted and is running, whereas bstop does not terminate the job but suspends it. (More precisely, bkill can also suspend a job, depending on the options used.) A job suspended this way can later be resumed with the bresume command (or with bkill -s CONT).
Note, however, that a suspended job releases only its CPU resources; the memory occupied by its processes is not freed. So for a job to be suspended with bstop, something else to run in the meantime, and the job to be resumed later with bresume, the server needs enough free memory to hold both. HPC cluster nodes usually have large amounts of RAM, so pausing and resuming CPU jobs with bstop and bresume generally works well.

On GPUs the story is more complicated. A deep learning application running on GPUs, for example a Python process using TensorFlow, in most cases takes all of the memory on the GPU for efficient training. If you issue bstop against such a TensorFlow Python process, the process is suspended rather than killed, but the memory TensorFlow was using stays fully allocated. If you check GPU usage with a command such as nvidia-smi, GPU compute utilization shows 0%, yet the TensorFlow Python process still holds the GPU and, in particular, GPU memory usage still shows close to 100%.

Suspend and resume still work in that case. But the whole reason for suspending a running job is to run another, more urgent job, and in that state the other job cannot run, because it will find no available GPU memory. Of course, for applications such as caffe or h2o that do not take all of the GPU memory but occupy only what they need, you can suspend the job with bstop, run a more important job, and resume with bresume, just as on CPUs.
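
One way to see whether a suspended job has actually left room for another job is to look at free GPU memory. The following is only a minimal sketch: it parses text in the format printed by `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`. The function name `free_gpu_memory_mib` and the sample numbers are hypothetical and chosen for illustration.

```python
def free_gpu_memory_mib(csv_text):
    """Return {gpu_index: free MiB} from nvidia-smi CSV text (no header, no units)."""
    free = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        free[index] = total - used
    return free


# Hypothetical sample: GPU 0 is still held by a suspended TensorFlow process,
# GPU 1 is idle. The numbers are made up for illustration.
sample = """0, 15500, 16160
1, 10, 16160"""

print(free_gpu_memory_mib(sample))
```

If the GPU of a suspended job shows almost no free memory, a newly dispatched job on that GPU is likely to fail at allocation time, which is the situation described above.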

Q2. Which network ports does LSF use for normal operation?

: I am trying to set up LSF in a network environment with a firewall, and it is not working well. I suspect the firewall is blocking network ports that LSF needs. Besides port 22 for ssh, which ports do I need to open?

A2. You can check the basic network ports needed to run LSF as follows.

$ cd $EGO_TOP/conf

$ grep PORT lsf.conf
LSF_LIM_PORT=7869
LSF_RES_PORT=6878
LSB_MBD_PORT=6881
LSB_SBD_PORT=6882
LSB_QUERY_PORT=6891

Note that LSF_LIM_PORT=7869 must be opened for both tcp and udp; all the others are tcp only. See the URLs below for details.

https://www.ibm.com/support/pages/configure-firewall-ports-lsf-master-platform-rtm-monitoring
https://www.ibm.com/support/pages/purpose-different-lsf-ports
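
As a rough illustration of the rule above (LIM port on tcp and udp, everything else on tcp), the sketch below turns the `grep PORT lsf.conf` output into a list of firewall openings. The function name `firewall_openings` is hypothetical, and the protocol rule is taken only from the note in this post, so check it against the IBM pages for your version.

```python
def firewall_openings(lsf_conf_text):
    """Map each *_PORT entry in lsf.conf text to (port, protocols)."""
    openings = {}
    for line in lsf_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if not name.endswith("_PORT"):
            continue
        # Rule from the note above: LIM needs tcp and udp, the rest tcp only.
        protocols = ("tcp", "udp") if name == "LSF_LIM_PORT" else ("tcp",)
        openings[name] = (int(value.strip().strip('"')), protocols)
    return openings


conf = """LSF_LIM_PORT=7869
LSF_RES_PORT=6878
LSB_MBD_PORT=6881
LSB_SBD_PORT=6882
LSB_QUERY_PORT=6891"""

for name, (port, protocols) in firewall_openings(conf).items():
    print(name, port, "/".join(protocols))
```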

Thursday, September 6, 2018

How to run TensorFlow on the LSF cluster at the Poughkeepsie center

** This is a guide for those running tests with LSF on the cluster at the IBM Poughkeepsie benchmark center.

First, connect to the Poughkeepsie benchmark center cluster over VPN, using the VPN software and the VPN id/passwd you were given separately.

Next, ssh to p10login4.pbm.ihost.com, the address you were given, using putty or a similar client. (See the figure below.)

The userid and passwd for this login are the ones for the server's Linux OS, which you were given separately from the VPN id/passwd.

After logging in to p10login4, first install anaconda3. It is best to accept the directory the installer offers as the default. The reason is that the directories under /gpfs/gpfs_gl4_16mb/ are on GPFS (now called Spectrum Scale), a shared file system mounted on every work node in the cluster. Installing there lets you use this anaconda no matter which work node later runs your job.

[b8p217za@p10login4 ~]$ wget https://repo.continuum.io/archive/Anaconda3-5.2.0-Linux-ppc64le.sh

[b8p217za@p10login4 ~]$ chmod a+x Anaconda3-5.2.0-Linux-ppc64le.sh

[b8p217za@p10login4 ~]$ ./Anaconda3-5.2.0-Linux-ppc64le.sh
...
[/gpfs/gpfs_gl4_16mb/b8p217/b8p217za/anaconda3] >>>
...
Do you wish the installer to prepend the Anaconda3 install location
to PATH in your /gpfs/gpfs_gl4_16mb/b8p217/b8p217za/.bashrc ? [yes|no]
[no] >>> yes

When the installation finishes, source .bashrc again and check that the default python now comes from anaconda3.

[b8p217za@p10login4 ~]$ . /gpfs/gpfs_gl4_16mb/b8p217/b8p217za/.bashrc

[b8p217za@p10login4 ~]$ which python
~/anaconda3/bin/python

Next, install the dependent packages that TensorFlow needs. Run the following command.

[b8p217za@p10login4 ~]$ /opt/DL/tensorflow/bin/install_dependencies
...

The following NEW packages will be INSTALLED:

absl-py: 0.1.10-py36_0 file://opt/DL/conda-pkgs
astor: 0.6.2-py_0 file://opt/DL/conda-pkgs
blas: 1.0-openblas
gast: 0.2.0-py36_0 file://opt/DL/conda-pkgs
grpcio: 1.10.0-py36hf484d3e_0 file://opt/DL/conda-pkgs
libprotobuf: 3.5.0-hf484d3e_0 file://opt/DL/conda-pkgs
powerai-tensorflow-prereqs: 1.8.0_31721.7987738-py36_0 file:///opt/DL/tensorflow/conda-pkgs
protobuf: 3.5.0-py36_0 file://opt/DL/conda-pkgs
termcolor: 1.1.0-py36_0 file://opt/DL/conda-pkgs
toposort: 1.5-py36_0 file://opt/DL/conda-pkgs
...
Proceed ([y]/n)? y

Now run the environment setup script as follows so that you can use TensorFlow. Running it sets PATH, PYTHONPATH, and related variables as shown below, which makes the TensorFlow included in PowerAI available.

[b8p217za@p10login4 ~]$ source /opt/DL/tensorflow/bin/tensorflow-activate

[b8p217za@p10login4 ~]$ env | grep PATH
MANPATH=/vol/ibmplatform/lsf/10.1/man:/usr/share/lmod/lmod/share/man::
MODULEPATH_ROOT=/gpfs/gpfs_gl4_16mb/lmod/P8
LD_LIBRARY_PATH=/vol/ibmplatform/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib:/usr/local/cuda-9.2/lib64:/usr/local/cuda-9.2/extras/CUPTI/lib64:/opt/DL/tensorflow/lib
PATH=/gpfs/gpfs_gl4_16mb/b8p217/b8p217za/anaconda3/bin:/vol/ibmplatform/lsf/10.1/linux3.10-glibc2.17-ppc64le/etc:/vol/ibmplatform/lsf/10.1/linux3.10-glibc2.17-ppc64le/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/gpfs/gpfs_gl4_16mb/b8p217/b8p217za/.local/bin:/gpfs/gpfs_gl4_16mb/b8p217/b8p217za/bin:/opt/DL/tensorflow/bin
MODULEPATH=/etc/modulefiles:/usr/share/modulefiles:/gpfs/gpfs_gl4_16mb/lmod/P8/Linux:/gpfs/gpfs_gl4_16mb/lmod/P8/Core:/gpfs/gpfs_gl4_16mb/lmod/P8/rhel/7.5/core
PYTHONPATH=/opt/DL/tensorflow/lib/python3.6/site-packages:/opt/DL/tensorflow/lib/python3.6/site-packages/external/protobuf_archive/python:/opt/DL/tensorflow/lib/python3.6/site-packages/external/six_archive

You are now ready to use TensorFlow. However, do not run TensorFlow training directly on this server. It is only a login server, not an AC922 server equipped with V100 GPUs. Run your training by submitting a job to LSF, the job scheduler, as follows.

First, check the TensorFlow Python code you want to run. Here we will run cnn_mnist_lms.py, shown below, through LSF. For reference, cnn_mnist_lms.py is example code that converts /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/examples/tutorials/layers/cnn_mnist.py to use LMS.

[b8p217za@p10login4 ~]$ ls -l /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py
-rw-r--r-- 1 root root 6203 Jun 5 08:31 /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py

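To run the code above through LSF, do the following. You only need to prepend the bsub portion, that is, everything before python in the command below (highlighted in blue in the original post).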

[b8p217za@p10login4 ~]$ bsub -R "rusage[ngpus_excl_p=4]" -q ac922_v100 -o ./1.out -e ./1.err python /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py
Job <243511> is submitted to queue <ac922_v100>.

-q gives the queue name; use ac922_v100 as you were instructed. -o specifies the output file and -e the error message file.

With the options above, the results are saved in a file named 1.out in the current directory on the login server, and any error messages are saved in a file named 1.err.

"rusage[ngpus_excl_p=4]" means that the job will use 4 GPUs in exclusive mode, so it should run on a server that has 4 GPUs available. If you will use only 1 GPU, set the number to 1. If you are running MPI code through the mpirun command, you can set ngpus_excl_p to a number greater than 4, such as 8 or 12, even though a single AC922 has only 4 GPUs installed. For ordinary deep learning training you would of course use 4 or fewer.
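
If you submit many jobs with different GPU counts, it can help to build the bsub command in one place. The sketch below only assembles the argument list from the options described above (-R, -q, -o, -e); the helper name `build_bsub_command` and the script name `train.py` are hypothetical.

```python
import shlex


def build_bsub_command(script_args, queue, ngpus, out_file, err_file):
    """Return the bsub argument list for a job that requests exclusive GPUs."""
    if ngpus < 1:
        raise ValueError("ngpus must be at least 1")
    return [
        "bsub",
        "-R", f"rusage[ngpus_excl_p={ngpus}]",  # number of exclusive-mode GPUs
        "-q", queue,                            # queue name
        "-o", out_file,                         # stdout file
        "-e", err_file,                         # stderr file
        *script_args,
    ]


# train.py is a placeholder for your own training script.
cmd = build_bsub_command(["python", "train.py"], "ac922_v100", 1, "./1.out", "./1.err")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to `subprocess.run` on the login server; printing it first, as here, is a cheap way to check the command before submitting.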

Running bhist on the job number above shows which work node the job was allocated to, when it started, and (if it has finished) when it ended. In the case below, the job was assigned to the work node p10a02 but failed 6 seconds after starting because of a problem.

[b8p217za@p10login4 ~]$ bhist -l 243511

Job <243511>, User <b8p217za>, Project <default>, Command <python /opt/DL/tenso
rflow/lib/python3.6/site-packages/tensorflow/contrib/lms/e
xamples/cnn_mnist_lms.py>
Wed Sep 5 22:45:59: Submitted from host <p10login4>, to Queue <ac922_v100>, CW
D <$HOME>, Output File <./1.out>, Error File <./1.err>, Re
quested Resources <rusage[ngpus_excl_p=4]>;
Wed Sep 5 22:45:59: Dispatched 1 Task(s) on Host(s) <p10a02>, Allocated 1 Slot
(s) on Host(s) <p10a02>, Effective RES_REQ <select[type ==
local] order[r15s:pg] rusage[ngpus_excl_p=4.00] >;
Wed Sep 5 22:46:00: Starting (Pid 7426);
Wed Sep 5 22:46:06: External Message "p10a02:gpus=0,1,2,3;" was posted from "b
8p217za" to message box 0;
Wed Sep 5 22:46:06: Running with execution home </gpfs/gpfs_gl4_16mb/b8p217/b8
p217za>, Execution CWD </gpfs/gpfs_gl4_16mb/b8p217/b8p217z
a>, Execution Pid <7426>;
Wed Sep 5 22:46:06: Exited with exit code 2. The CPU time used is 0.4 seconds;
Wed Sep 5 22:46:06: Completed <exit>;

MEMORY USAGE:
MAX MEM: 7 Mbytes; AVG MEM: 7 Mbytes

Summary of time in seconds spent in various states by Wed Sep 5 22:46:06
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
0 0 7 0 0 0 7

The error occurred because, as shown below, the example code was not installed in the local directory of the work node p10a02.

[b8p217za@p10login4 ~]$ cat 1.err
python: can't open file '/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py': [Errno 2] No such file or directory

To fix this error, I contacted the POK center and asked them to update the required filesets on each node with # yum install power-mldl-py3. (This requires sudo privileges.)

---------------

The POK center confirmed that the update had been done, and the result of running the job again is shown below.

[b8p217za@p10login4 ~]$ bsub -R "rusage[ngpus_excl_p=4]" -q ac922_v100 -o ./1.out -e ./1.err python /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py
Job <244143> is submitted to queue <ac922_v100>.

[b8p217za@p10login4 ~]$ bhist -l 244143

Job <244143>, User <b8p217za>, Project <default>, Command <python /opt/DL/tenso
rflow/lib/python3.6/site-packages/tensorflow/contrib/lms/e
xamples/cnn_mnist_lms.py>
Thu Sep 6 21:27:03: Submitted from host <p10login4>, to Queue <ac922_v100>, CW
D <$HOME>, Output File <./1.out>, Error File <./1.err>, Re
quested Resources <rusage[ngpus_excl_p=4]>;
Thu Sep 6 21:27:04: Dispatched 1 Task(s) on Host(s) <p10a01>, Allocated 1 Slot
(s) on Host(s) <p10a01>, Effective RES_REQ <select[type ==
local] order[r15s:pg] rusage[ngpus_excl_p=4.00] >;
Thu Sep 6 21:27:05: Starting (Pid 76857);
Thu Sep 6 21:27:12: External Message "p10a01:gpus=0,1,2,3;" was posted from "b
8p217za" to message box 0;
Thu Sep 6 21:27:12: Running with execution home </gpfs/gpfs_gl4_16mb/b8p217/b8
p217za>, Execution CWD </gpfs/gpfs_gl4_16mb/b8p217/b8p217z
a>, Execution Pid <76857>;
Thu Sep 6 21:29:28: Done successfully. The CPU time used is 204.2 seconds;
Thu Sep 6 21:29:35: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 4.2 Gbytes; AVG MEM: 3 Gbytes

Summary of time in seconds spent in various states by Thu Sep 6 21:29:35
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1 0 144 0 0 0 145

The job was dispatched to node "p10a01" within 1 second and completed training successfully in 2 minutes 16 seconds.

Opening the 1.out file, where the results are saved, shows the following.
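
The state summary at the end of the bhist -l output is easy to read programmatically. The sketch below pairs the header row with the value row; the function name `parse_state_summary` is hypothetical. Note that in the log above the RUN column (144 seconds) matches the time from dispatch to completion, so it is slightly longer than the 2 minutes 16 seconds between the "Running" and "Done" lines.

```python
def parse_state_summary(header_line, value_line):
    """Pair the state names from a bhist -l summary with the seconds spent in each."""
    names = header_line.split()
    values = [int(v) for v in value_line.split()]
    if len(names) != len(values):
        raise ValueError("header and value columns do not match")
    return dict(zip(names, values))


summary = parse_state_summary(
    "PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL",
    "1 0 144 0 0 0 145",
)
minutes, seconds = divmod(summary["RUN"], 60)
print(f"pending {summary['PEND']} s, running {minutes} min {seconds} s")
```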

[b8p217za@p10login4 ~]$ cat 1.out

Sender: LSF System <p10lsf@p10a01>
Subject: Job 244143: <python /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py> in cluster <pok_tc_cloud> Done

Job <python /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py> was submitted from host <p10login4> by user <b8p217za> in cluster <pok_tc_cloud> at Thu Sep 6 21:27:03 2018.
Job was executed on host(s) <p10a01>, in queue <ac922_v100>, as user <b8p217za> in cluster <pok_tc_cloud> at Thu Sep 6 21:27:04 2018.
</gpfs/gpfs_gl4_16mb/b8p217/b8p217za> was used as the home directory.
</gpfs/gpfs_gl4_16mb/b8p217/b8p217za> was used as the working directory.
Started at Thu Sep 6 21:27:04 2018.
Terminated at Thu Sep 6 21:29:29 2018.
Results reported at Thu Sep 6 21:29:29 2018.

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python /opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py
------------------------------------------------------------

Successfully completed.

Resource usage summary:

CPU time : 204.17 sec.
Max Memory : 4393 MB
Average Memory : 3106.20 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : 3
Max Threads : 346
Run time : 147 sec.
Turnaround time : 146 sec.

The output (if any) follows:

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting MNIST-data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting MNIST-data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting MNIST-data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting MNIST-data/t10k-labels-idx1-ubyte.gz
Saving graph to: /tmp/tmpeuq76pnh
{'accuracy': 0.96890002, 'loss': 0.10233325, 'global_step': 20000}

PS:

Read file <./1.err> for stderr output of this job.

Messages produced during training are saved in 1.err, as shown below. Only part of the file is copied here.

[b8p217za@p10login4 ~]$ cat 1.err

...

INFO:tensorflow:Saving checkpoints for 20000 into /tmp/tmpeuq76pnh/model.ckpt.

INFO:tensorflow:Loss for final step: 0.103957.

INFO:tensorflow:Calling model_fn.

INFO:tensorflow:Done calling model_fn.

INFO:tensorflow:Starting evaluation at 2018-09-07-01:29:27

INFO:tensorflow:Graph was finalized.

2018-09-06 21:29:27.207180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3

2018-09-06 21:29:27.207305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] ...

2018-09-06 21:29:27.208973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14127 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)

INFO:tensorflow:Restoring parameters from /tmp/tmpeuq76pnh/model.ckpt-20000

INFO:tensorflow:Running local_init_op.

INFO:tensorflow:Done running local_init_op.

INFO:tensorflow:Finished evaluation at 2018-09-07-01:29:27

INFO:tensorflow:Saving dict for global step 20000: accuracy = 0.9689, global_step = 20000, loss = 0.102333

Wednesday, August 2, 2017

Why does GPU-based deep learning need the LSF job scheduler?

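IBM LSF (Load Sharing Facility) is, in short, a queue-based job scheduler, a software product used mainly on supercomputer clusters. Deep learning training, by contrast, usually runs on a single server or desktop workstation with a few GPUs, so it is easy to assume that it is not a good match for LSF.

That is not the case. Deep learning training on GPUs is exactly the kind of work that benefits greatly from LSF. The main reason is that GPU resources are expensive, yet their actual utilization is low.

Traditionally, deep learning training ran on one or two servers or workstations that researchers bought individually or per team. With the recent AI boom, more and more researchers are running more and more training, and companies and research institutes have been buying more servers at their request.

Also, because researchers have traditionally preferred an independent environment for each team, it was uncommon for them to share one GPU server with other teams or researchers for training. Rather than buying one better GPU server and sharing it, they preferred to buy two lower-spec GPU servers and run one each.

The problem is that, on top of the purchase cost, managing GPU servers, which draw a lot of power and produce a lot of noise and heat, started to become difficult. The bigger problem is that the utilization of GPU servers bought at such expense is much lower than expected.

No matter how hard researchers work, they do not train models around the clock without a break. They also have to read new papers and supervise data labeling. As a result, lab A may be short of GPU resources and need to buy a GPU server while the GPU server in lab B next door sits idle, just consuming power. Worse still, the following month lab A's GPUs may sit idle while lab B says it is short of GPUs and asks for one more GPU server.

IBM LSF solves this problem. LSF monitors and manages the GPU resources and automatically places the training jobs that researchers submit to a queue on the most suitable GPUs. This raises utilization so that overall GPU resources are not wasted, and it also makes the work much more convenient for researchers.

The two scenarios below should make this easy to understand.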

#1. I have a training run that urgently needs 2 GPUs right now. But Dr. Kim is already running something on 4 GPUs and Dr. Lee on 3 GPUs. When I call them, Dr. Kim is not sure when the job will finish, and Dr. Lee expects to finish around 2 a.m. Do I really have to wait until 2 a.m., confirm that Dr. Lee's job has finished, start my training, and only then go home?

#2. Researcher A saw that a server with 4 GPUs was idle and started a caffe training run on 2 GPUs, gpu0 and gpu1. At almost the same moment, researcher B also thought the server was idle and started a caffe training run on gpu0 and gpu1 as well. What happens then? Does everyone have to check who is about to use which GPUs on this server and edit their job scripts every time?
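
To make scenario #2 concrete, here is a toy model of what a scheduler does for you: it hands out GPUs exclusively, so two requests never land on the same device, and a request that cannot be satisfied simply waits. This is a deliberately simplified sketch, not how LSF is actually implemented; the class `GpuPool` and the job names are hypothetical.

```python
class GpuPool:
    """Toy exclusive GPU allocator; an illustration, not LSF's actual algorithm."""

    def __init__(self, num_gpus):
        self.free = list(range(num_gpus))
        self.owner = {}

    def allocate(self, job, count):
        """Give `job` `count` free GPUs, or return None so the job can wait in a queue."""
        if count > len(self.free):
            return None
        granted, self.free = self.free[:count], self.free[count:]
        for gpu in granted:
            self.owner[gpu] = job
        return granted

    def release(self, job):
        """Return all GPUs held by `job` to the free list."""
        freed = sorted(g for g, j in self.owner.items() if j == job)
        for gpu in freed:
            del self.owner[gpu]
        self.free = sorted(self.free + freed)


pool = GpuPool(4)
print(pool.allocate("researcher_A", 2))  # A gets two GPUs
print(pool.allocate("researcher_B", 2))  # B gets the other two, not the same ones
print(pool.allocate("researcher_C", 1))  # nothing free, so C has to wait
```

With a scheduler in the middle, neither researcher hard-codes gpu0 and gpu1; they only state how many GPUs they need.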

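In conclusion, building an LSF environment at a modest cost is far more efficient than simply buying more GPU servers. It helps cut not only the GPU server hardware purchase cost but also floor space, power consumption, and cooling costs. Researchers, too, can concentrate on their research with much more convenience and peace of mind.

The clearest reference for deep learning with LSF is the supercomputer at IBM's own Poughkeepsie benchmark center. To see how simple using LSF there is, see the post on running deep learning training with LSF at the IBM Poughkeepsie benchmark center.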

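Running deep learning training with LSF at the IBM Poughkeepsie benchmark center

This post looks at how to run deep learning performance benchmark tests on Minsky servers at the IBM Poughkeepsie (POK) benchmark center. Its main content is only a guide to the overall GPFS and LSF environment of the POK center supercomputer cluster and how to use it; the application and approval process for using the supercomputer is not covered. This supercomputer cluster is used for PoCs, such as capacity sizing, and for performance benchmark tests for customers considering the purchase of IBM HW/SW.

First, once you have obtained approval through IBM sales to use the POK benchmark center, you will receive VPN connection instructions and the related id/passwd.

The server you connect to after the VPN is up is not the server on which you will actually run performance tests, but what is called the login server. The POK benchmark center supercomputer consists of dozens of POWER8 servers. Customers are not assigned one of these servers for testing; instead they are allocated the computing power of these servers through a job scheduler called LSF. The login server that customers connect to acts as the master server of the job scheduler, and you can do the following there.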

- Compile and install the application and data you intend to run
- Write the shell scripts and other files needed for the run, and do simple functional tests

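Occasionally someone runs the performance test itself on this login server. Not only does that make it hard to get proper performance numbers, it also inconveniences the many other customers around the world who use this supercomputer, so please never do that. To prevent this, many supercomputer clusters deliberately give the login server a small configuration or leave out GPUs.

The login server and the supercomputer nodes are all tied together by a parallel file system called Spectrum Scale (formerly GPFS). That is, whichever server you log in to, you see the same file systems mounted (except for some file systems on internal disks), and files saved on the login server can be read and written from any server in the supercomputer. Each user id is also created identically on the login server and the supercomputer nodes, and user home directories are on this GPFS file system, so whatever is stored in your home directory on the login server looks the same from any node.

When you connect to the login server you will see a number of file systems mounted, as shown below. Those whose mount points begin with /gpfs are Spectrum Scale (GPFS) file systems. When you receive your system userid/passwd, you will be told which GPFS file system to use. In most cases you will be told to use /gpfs/gpfs_gl4_16mb, and your home directory will already be on that file system.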

b7p193aa@p10login1:~$ pwd
/gpfs/gpfs_gl4_16mb/home/b7p193aa

b7p193aa@p10login1:~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 243G 0 243G 0% /dev
tmpfs 52G 778M 51G 2% /run
/dev/sda2 879G 42G 793G 5% /
tmpfs 256G 17M 256G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 256G 0 256G 0% /sys/fs/cgroup
cgmfs 128K 0 128K 0% /run/cgmanager/fs
fserv3.pbm.ihost.com:/export/ibmplatform 98G 38G 61G 39% /vol/ibmplatform
tmpfs 52G 0 52G 0% /run/user/0
gpfs_gl4_16mb_bench 221T 123T 98T 56% /gpfs/gpfs_gl4_16mb_bench
gpfs_gl4_8mb 75T 23T 53T 30% /gpfs/gpfs_gl4_8mb
gpfs_gs2_512k 2.1T 1.9T 130G 94% /gpfs/gpfs_gs2_512k
gpfs_stage1 66T 57T 8.7T 87% /gpfs/gpfs_stage1
gpfs_2gl4_8mb 61T 8.6T 52T 15% /gpfs/gpfs_2gl4_8mb
gpfs_gl4_16mb 165T 126T 39T 77% /gpfs/gpfs_gl4_16mb
/dev/nvme0n1p1 2.9T 332M 2.8T 1% /nvme3T
....

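The specifications and OS of the nodes in this supercomputer cluster differ slightly by purpose and group. Some have Redhat installed for traditional HPC tests, and some have Ubuntu 16.04 with the IBM PowerAI toolkit installed for deep learning. You do not need to worry about which to log in to, because you never log in to those nodes directly; you use them only by submitting LSF jobs from the login node. Let us walk through that process.

LSF is job scheduler software, and you only need to learn a few simple commands to use it. If you do deep learning on a single node in particular, a handful of very simple commands is all you need.

bqueues : shows information about the queues that jobs can be submitted to
bsub : submits a job to a queue
bjobs : shows the status of jobs submitted to a queue
bhist : shows the history of a job that is running or has already finished
bkill : kills a submitted job that is currently running
bhosts : shows the status of the nodes in the supercomputer cluster

Now for the details. You can use the bqueues command to see which queues are available for job submission on this supercomputer cluster.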

b7p193aa@p10login1:~$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
test-stream 30 Open:Inact - - - - 0 0 0 0
s822lc_p100_k80 30 Open:Active - - - - 8616 6568 2048 0
822normal 30 Open:Inact - - - - 0 0 0 0
s822lc_p100 30 Open:Active - - - - 3 0 3 0
s822lc_p100nvme 30 Open:Active - - - - 151 0 151 0
normal 30 Open:Active - - - - 0 0 0 0
s822lc_k80 30 Closed:Inact - - - - 0 0 0 0

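Customers doing deep learning should submit jobs to the queue named s822lc_p100nvme. Those doing traditional HPC should use the queue named s822lc_p100.

It is convenient to prepare a shell script for the job in advance. Here I prepared a shell script, cifar10.sh, for training with the TensorFlow included in PowerAI. (Despite its CIFAR-10 name, the script as listed below launches the Inception flowers_train binary.) Because the script runs not in your current shell but, through LSF, on another server that mounts the same GPFS file system, it is best to write all paths as absolute paths wherever possible.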

b7p193aa@p10login1:~$ cat cifar10.sh
#!/bin/bash
source /opt/DL/tensorflow/bin/tensorflow-activate
source /opt/DL/bazel/bin/bazel-activate
export FLOWERS_DIR=/gpfs/gpfs_gl4_16mb/b7p193aa/inception/models/inception
export INCEPTION_DIR=/gpfs/gpfs_gl4_16mb/b7p193aa/inception
/gpfs/gpfs_gl4_16mb/b7p193aa/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=$FLOWERS_DIR/train --data_dir=$FLOWERS_DIR/data --pretrained_model_checkpoint_path=$INCEPTION_DIR/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.005 -input_queue_memory_factor=1 --max_steps=500 --num_gpus 4 --batch_size=64

Now let us submit cifar10.sh to the LSF queue named s822lc_p100nvme.

b7p193aa@p10login1:~$ bsub -q s822lc_p100nvme /gpfs/gpfs_gl4_16mb/home/b7p193aa/cifar10.sh
Job <113856> is submitted to queue <s822lc_p100nvme>.

You can use job ID 113856 to check how the job is doing. First, let us use the bjobs command to look at the job status.

b7p193aa@p10login1:~$ bjobs 113856
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
113856 b7p193a RUN s822lc_p10 p10login1 p10a106 *ifar10.sh Aug 2 00:38

The job is currently running, and you can see that it is running on the server p10a106.
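
If you want to check the status from a script rather than by eye, a row of the default bjobs output can be split into named fields. This is only a sketch under the assumption that every column is filled in, as in the RUN row above (a pending job has no EXEC_HOST yet, so this simple split would not apply to it). The function name `parse_bjobs_line` is hypothetical.

```python
def parse_bjobs_line(line):
    """Split one fully populated default-format bjobs row into named fields."""
    keys = ["JOBID", "USER", "STAT", "QUEUE", "FROM_HOST",
            "EXEC_HOST", "JOB_NAME", "SUBMIT_TIME"]
    # Split on whitespace at most 7 times so SUBMIT_TIME keeps its internal spaces.
    fields = line.split(None, 7)
    if len(fields) != len(keys):
        raise ValueError("unexpected bjobs row: " + line)
    return dict(zip(keys, fields))


row = parse_bjobs_line(
    "113856 b7p193a RUN s822lc_p10 p10login1 p10a106 *ifar10.sh Aug 2 00:38"
)
print(row["STAT"], row["EXEC_HOST"])
```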

The bhist command shows that the job has just been dispatched and started on that node with pid 142480.

b7p193aa@p10login1:~$ bhist -l 113856

Job <113856>, User <b7p193aa>, Project <default>, Command </gpfs/gpfs_gl4_16mb/
home/b7p193aa/cifar10.sh>
Wed Aug 2 00:38:06: Submitted from host <p10login1>, to Queue <s822lc_p100nvme
>, CWD <$HOME>;
Wed Aug 2 00:38:07: Dispatched 1 Task(s) on Host(s) <p10a106>, Allocated 1 Slo
t(s) on Host(s) <p10a106>, Effective RES_REQ <select[type
== local] order[r15s:pg] >;
Wed Aug 2 00:38:08: Starting (Pid 142480);

Summary of time in seconds spent in various states by Wed Aug 2 00:38:08
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1 0 1 0 0 0 2

Checking next with the bhosts command shows that node p10a106 is busy running jobs.

b7p193aa@p10login1:~$ bhosts p10a106
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
p10a106 ok - 160 150 150 0 0 0

To watch the job as it runs, use the bpeek command. It lets you peek at the messages that would normally be displayed on the console.

b7p193aa@p10login1:~$ bpeek 113856
<< output from stdout >>

<< output from stderr >>
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally

Run bhist again after some time and you will see that the job has completed.

b7p193aa@p10login1:~$ bhist -l 113856

Job <113856>, User <b7p193aa>, Project <default>, Command </gpfs/gpfs_gl4_16mb/
home/b7p193aa/cifar10.sh>
Wed Aug 2 00:38:06: Submitted from host <p10login1>, to Queue <s822lc_p100nvme
>, CWD <$HOME>, Error File <./err.2>;
Wed Aug 2 00:38:07: Dispatched 1 Task(s) on Host(s) <p10a106>, Allocated 1 Slo
t(s) on Host(s) <p10a106>, Effective RES_REQ <select[type
== local] order[r15s:pg] >;
Wed Aug 2 00:38:08: Starting (Pid 142480);
Wed Aug 2 00:38:14: Running with execution home </gpfs/gpfs_gl4_16mb/home/b7p1
93aa>, Execution CWD </gpfs/gpfs_gl4_16mb/home/b7p193aa>,
Execution Pid <142480>;
Wed Aug 2 02:14:55: Done successfully. The CPU time used is 692931.6 seconds;
Wed Aug 2 02:15:00: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 20.7 Gbytes; AVG MEM: 16.2 Gbytes

Summary of time in seconds spent in various states by Wed Aug 2 02:15:00
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1 0 5808 0 0 0 5809

The resulting model files are created under the predefined location $FLOWERS_DIR/train (that is, $INCEPTION_DIR/models/inception/train), as shown below.

b7p193aa@p10login1:~$ ls /gpfs/gpfs_gl4_16mb/b7p193aa/inception/models/inception/train
checkpoint model.ckpt-0.data-00000-of-00001 model.ckpt-0.index model.ckpt-0.meta

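Sometimes the shell script you wrote does not run properly and fails with an error. You need to see the error message to fix it, but the walkthrough above did not cover that. To capture it, pass the -e option to bsub.

As shown below, give -e a file name, including its path, and error messages accumulate in that file.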

b7p193aa@p10login1:~$ bsub -q s822lc_p100nvme -e ./err.1 /gpfs/gpfs_gl4_16mb/home/b7p193aa/cifar10.sh
Job <113855> is submitted to queue <s822lc_p100nvme>.

As you can see below, this job died right after starting, with exit code 127.

b7p193aa@p10login1:~$ bhist -l 113855

Job <113855>, User <b7p193aa>, Project <default>, Command </gpfs/gpfs_gl4_16mb/
home/b7p193aa/cifar10.sh>
Wed Aug 2 00:36:20: Submitted from host <p10login1>, to Queue <s822lc_p100nvme
>, CWD <$HOME>, Error File <./err.1>;
Wed Aug 2 00:36:21: Dispatched 1 Task(s) on Host(s) <p10a119>, Allocated 1 Slo
t(s) on Host(s) <p10a119>, Effective RES_REQ <select[type
== local] order[r15s:pg] >;
Wed Aug 2 00:36:22: Starting (Pid 96410);
Wed Aug 2 00:36:28: Running with execution home </gpfs/gpfs_gl4_16mb/home/b7p1
93aa>, Execution CWD </gpfs/gpfs_gl4_16mb/home/b7p193aa>,
Execution Pid <96410>;
Wed Aug 2 00:36:28: Exited with exit code 127. The CPU time used is 0.4 seconds;
Wed Aug 2 00:36:28: Completed <exit>;

Summary of time in seconds spent in various states by Wed Aug 2 00:36:28
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1 0 7 0 0 0 8

Opening ./err.1 shows that the failure occurred because I gave the wrong path name, as shown below.

b7p193aa@p10login1:~$ cat ./err.1
/gpfs/gpfs_gl4_16mb/home/b7p193aa/cifar10.sh: line 6: /gpfs_gl4_16mb/b7p193aa/inception/models/inception/bazel-bin/inception/flowers_train: No such file or directory

Friday, July 7, 2017

Setting up an LSF environment for Caffe training on GPUs

The main reason to use LSF in a GPU environment is that it makes it convenient for several deep learning researchers to share expensive GPU resources. Suppose I need to start a training job that uses 2 GPUs, but other researchers are using 3 of the 4 GPUs. I have to wait until their jobs finish, and how would I know when that will be? I would like to just launch the job and go home, or focus on other research, but I cannot do that either, because launching a job blindly is likely to fail with an error or, worse, ruin a training job that another researcher has been carefully running.

This is where IBM Spectrum LSF comes in. This post describes how to configure LSF for GPUs, using caffe as the example.

Here I used a single S822LC server, an IBM POWER8 GPU server commonly known by the code name Firestone, with 2 NVIDIA K80 cards installed (4 GK210 GPUs). The OS is, of course, Ubuntu 16.04 LTS for ppc64le.

First, install Spectrum LSF HPC Suite as follows. Strictly speaking, we do not install the whole HPC Suite, only the LSF inside it. For reference, HPC Suite contains not only LSF but also LS (License Scheduler), PAC (Platform Application Center), PPM (Platform Process Manager), SMPI (Spectrum MPI), and so on. None of these are needed here; LSF alone is enough.

To use the LSF in this HPC Suite you need the entitlement file for the standard edition, named lsf_std_entitlement.dat, which is provided separately when you purchase a license. There is also a free Community Edition as an alternative to the licensed LSF; it is installed and used the same way as the standard edition, except that some features are restricted.

root@ubuntu02:/home/test# tar -zxvf lsfshpc10.1.1-ppc64le.tar.gz

test@ubuntu02:~/lsfshpc10.1.1-ppc64le$ ls
ls lsf pac ppm smpi

root@ubuntu02:/home/test# cd lsfshpc10.1.1-ppc64le/lsf/

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf# ls
lsf10.1_lnx310-lib217-ppc64le.tar.Z lsf10.1_lsfinstall_linux_ppc64le.tar.Z

As shown above, the lsf directory contains two .Z compressed files. Extract only the one with lsfinstall in its name. Do not extract the one with lib in its name.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf# zcat lsf10.1_lsfinstall_linux_ppc64le.tar.Z | tar xvf -

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf# cd lsf10.1_lsfinstall

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# ls
conf_tmpl instlib lsf_unix_install.pdf pversions rpm
hostsetup lap patchinstall README scripts
install.config lsfinstall patchlib rhostsetup slave.config

Now edit install.config. The parameter names are all self-explanatory. LSF_MASTER_LIST normally takes several server names separated by spaces: the first server in the list is the active master, and the ones after it are secondary masters. Here there is only one server (ubuntu02), which is both master and slave, so only that name is listed.
LSF_ADD_SERVERS lists the slave servers that actually run jobs, again as space-separated server names. Here only ubuntu02 is listed.
LSF_TARDIR is the name of the directory that contains lsf10.1_lnx310-lib217-ppc64le.tar.Z, the file that, as noted above, should not be extracted.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# vi install.config
LSF_TOP="/usr/share/lsf"
LSF_ADMINS="test"
LSF_CLUSTER_NAME="firestone"
LSF_MASTER_LIST="ubuntu02"
LSF_TARDIR="/home/test/lsfshpc10.1.1-ppc64le/lsf"
# CONFIGURATION_TEMPLATE="DEFAULT|PARALLEL|HIGH_THROUGHPUT"
LSF_ADD_SERVERS="ubuntu02"

When you have finished editing, run the lsfinstall command with that config file, as shown below.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# ./lsfinstall -f install.config

Then copy the standard edition entitlement file mentioned above, which you obtained in advance, into place as follows.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# cp /home/test/lsf_std_entitlement.dat /usr/share/lsf/conf/lsf.entitlement

After that, run the hostsetup command, which would normally be run on the slave servers. (Again, the ubuntu02 server here is both master and slave.) With the --boot="y" option, the LSF daemons start automatically at every boot.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# ./hostsetup --top="/usr/share/lsf" --boot="y"

Then register /usr/share/lsf/conf/profile.lsf in .bashrc or a similar file, as shown below, so that it is always sourced. Put the same entry in the .bashrc of the test user, registered above as the LSF admin, as well as in root's.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# vi /root/.bashrc
. /usr/share/lsf/conf/profile.lsf

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# . /root/.bashrc

Also register the test user in the file below so that this user can run LSF with sudo privileges. Note that /etc/lsf.sudoers must have permission 600 and be owned by root:root.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# sudo vi /etc/lsf.sudoers
LSB_PRE_POST_EXEC_USER=test
LSF_STARTUP_PATH=/usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/etc
LSF_STARTUP_USERS="test"

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# ls -l /etc/lsf.sudoers
-rw------- 1 root root 126 Jun 29 17:38 /etc/lsf.sudoers
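
The permission and ownership requirement is easy to get wrong after editing the file, so a small check can be handy. The sketch below uses only the Python standard library; the function name `check_sudoers_file` is hypothetical, and the expected values (mode 600, uid 0, gid 0) simply restate the requirement above.

```python
import os
import stat


def check_sudoers_file(path, expected_mode=0o600, expected_uid=0, expected_gid=0):
    """Return a list of problems with the file's mode or ownership (empty if none)."""
    info = os.stat(path)
    problems = []
    mode = stat.S_IMODE(info.st_mode)  # permission bits only
    if mode != expected_mode:
        problems.append(f"mode is {mode:o}, expected {expected_mode:o}")
    if (info.st_uid, info.st_gid) != (expected_uid, expected_gid):
        problems.append(
            f"owner is {info.st_uid}:{info.st_gid}, "
            f"expected {expected_uid}:{expected_gid}"
        )
    return problems


# Only runs the check where the file actually exists.
if os.path.exists("/etc/lsf.sudoers"):
    print(check_sudoers_file("/etc/lsf.sudoers") or "ok")
```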

Now start the LSF daemons. There are several of them, but you do not have to handle them one by one: start them with lsfstartup and stop them with lsfshutdown. From the master you can bring the daemons up and down on every host in the cluster at once. For this to work, ssh keys must be copied in advance so that ssh works without a password prompt. Here a single server plays both master and slave, but passwordless ssh to itself still has to be set up beforehand. (That step is omitted here; see https://hwengineer.blogspot.kr/2017/06/power8-lsf-tensorflow-docker-image.html for details.)

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# which lsfstartup
/usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/bin/lsfstartup

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# lsfstartup
Starting up all LIMs ...
Do you really want to start up LIM on all hosts ? [y/n]y
Start up LIM on <ubuntu02> ...... done

Waiting for Master LIM to start up ... Master LIM is ok
Starting up all RESes ...
Do you really want to start up RES on all hosts ? [y/n]y
Start up RES on <ubuntu02> ...... done

Starting all slave daemons on LSBATCH hosts ...
Do you really want to start up slave batch daemon on all hosts ? [y/n] y
Start up slave batch daemon on <ubuntu02> ...... done

Done starting up LSF daemons on the local LSF cluster ...

The LSF cluster is now set up. However, if you submit a caffe job that uses GPUs at this point, you will see it fail with an error. Let us go through the reason and the fix step by step.

First, let us prepare to train a CIFAR-10 model with caffe.

For convenience, configure the test user to run sudo without a password prompt, as follows.

test@ubuntu02:/opt/DL/caffe-nv$ sudo vi /etc/sudoers
...
test ALL=(ALL) NOPASSWD: ALL

Now prepare the CIFAR-10 data and scripts using the NVIDIA Caffe (caffe-nv) included in the PowerAI toolkit. It is easy.

test@ubuntu02:/opt/DL/caffe-nv$ sudo ./data/cifar10/get_cifar10.sh
Downloading...
--2017-07-04 10:37:53-- http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
Resolving www.cs.toronto.edu (www.cs.toronto.edu)... 128.100.3.30
Connecting to www.cs.toronto.edu (www.cs.toronto.edu)|128.100.3.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170052171 (162M) [application/x-gzip]
Saving to: ‘cifar-10-binary.tar.gz’

cifar-10-binary.tar.gz 100%[================================>] 162.17M 4.10MB/s in 25s

2017-07-04 10:38:19 (6.59 MB/s) - ‘cifar-10-binary.tar.gz’ saved [170052171/170052171]

Unzipping...
Done.

Some of the scripts contain wrong paths, so fix them as shown below.

test@ubuntu02:/opt/DL/caffe-nv$ sudo vi ./examples/cifar10/create_cifar10.sh
...
if [ -z "$CAFFE_BIN" ]; then
# EXAMPLES=./build/$EXAMPLE
EXAMPLES=./bin
# TOOLS=./build/tools
TOOLS=./bin
else
...

Now create the CIFAR-10 LMDB as shown below.

test@ubuntu02:/opt/DL/caffe-nv$ sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib ./examples/cifar10/create_cifar10.sh
Creating lmdb...
I0704 10:58:13.721052 84760 db_lmdb.cpp:35] Opened lmdb examples/cifar10/cifar10_train_lmdb
I0704 10:58:13.721252 84760 convert_cifar_data.cpp:52] Writing Training data
I0704 10:58:13.721264 84760 convert_cifar_data.cpp:55] Training Batch 1
I0704 10:58:13.764257 84760 convert_cifar_data.cpp:55] Training Batch 2
I0704 10:58:13.801908 84760 convert_cifar_data.cpp:55] Training Batch 3
I0704 10:58:13.830626 84760 convert_cifar_data.cpp:55] Training Batch 4
I0704 10:58:13.877624 84760 convert_cifar_data.cpp:55] Training Batch 5
I0704 10:58:18.264618 84760 convert_cifar_data.cpp:73] Writing Testing data
I0704 10:58:18.264998 84760 db_lmdb.cpp:35] Opened lmdb examples/cifar10/cifar10_test_lmdb
Computing image mean...
Done.

Everything is now ready for CIFAR-10 training. Running the stock train_quick.sh as is trains successfully on a single GPU, as shown below. (Here too, part of the script must be edited to point at the bin directory instead of build.)

test@ubuntu02:/opt/DL/caffe-nv$ sudo vi ./examples/cifar10/train_quick.sh
...
if [ -z "$CAFFE_BIN" ]; then
# TOOLS=./build/tools
TOOLS=./bin
else
TOOLS=$CAFFE_BIN
fi

$TOOLS/caffe train \
--solver=examples/cifar10/cifar10_quick_solver.prototxt

test@ubuntu02:/opt/DL/caffe-nv$ sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
...
I0704 11:48:33.841511 87435 sgd_solver.cpp:106] Iteration 4800, lr = 0.0001
I0704 11:48:35.339391 87435 solver.cpp:242] Iteration 4900 (66.76 iter/s, 1.4979s/100 iter), loss = 0.38117
I0704 11:48:35.339428 87435 solver.cpp:261] Train net output #0: loss = 0.38117 (* 1 = 0.38117 loss)
I0704 11:48:35.339442 87435 sgd_solver.cpp:106] Iteration 4900, lr = 0.0001
I0704 11:48:36.822921 87435 solver.cpp:489] Snapshotting to HDF5 file examples/cifar10/cifar10_quick_iter_5000.caffemodel.h5
I0704 11:48:36.824291 87435 sgd_solver.cpp:283] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_quick_iter_5000.solverstate.h5
I0704 11:48:36.829028 87435 solver.cpp:342] Iteration 5000, loss = 0.456113
I0704 11:48:36.829043 87435 solver.cpp:362] Iteration 5000, Testing net (#0)
I0704 11:48:37.224135 87435 solver.cpp:429] Test net output #0: accuracy = 0.7594
I0704 11:48:37.224155 87435 solver.cpp:429] Test net output #1: loss = 0.734521 (* 1 = 0.734521 loss)
I0704 11:48:37.224179 87435 solver.cpp:347] Optimization Done.
I0704 11:48:37.224186 87435 caffe.cpp:234] Optimization Done.
138.60user 30.86system 1:23.55elapsed 202%CPU (0avgtext+0avgdata 678784maxresident)k
16inputs+6016outputs (0major+24918minor)pagefaults 0swaps

As the 'nvidia-smi -l 5' monitoring output below shows, this training run uses one GPU. By default, caffe always sends the job to the first GPU. (Here, the GPU that nvidia-smi lists as GPU 2 is the one caffe treats as the first GPU.)

Tue Jul 4 11:48:09 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.119 Driver Version: 361.119 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0002:03:00.0 Off | 0 |
| N/A 39C P8 26W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0002:04:00.0 Off | 0 |
| N/A 36C P8 30W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0004:03:00.0 Off | 0 |
| N/A 57C P0 129W / 149W | 216MiB / 11441MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0004:04:00.0 Off | 0 |
| N/A 38C P8 29W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 87308 C ./bin/caffe 214MiB |
+-----------------------------------------------------------------------------+

This time, let's train on 2 GPUs at once. Edit the train_quick.sh script so that the caffe command uses the -gpu option.

test@ubuntu02:/opt/DL/caffe-nv$ sudo vi ./examples/cifar10/train_quick.sh
if [ -z "$CAFFE_BIN" ]; then
# TOOLS=./build/tools
TOOLS=./bin
else
TOOLS=$CAFFE_BIN
fi

$TOOLS/caffe train -gpu 0,1 \
--solver=examples/cifar10/cifar10_quick_solver.prototxt

# reduce learning rate by factor of 10 after 8 epochs
$TOOLS/caffe train -gpu 0,1 \
--solver=examples/cifar10/cifar10_quick_solver_lr1.prototxt \
--snapshot=examples/cifar10/cifar10_quick_iter_4000.solverstate.h5

Running it now shows two GPUs in use. It uses GPUs 2 and 3 in nvidia-smi's numbering.

test@ubuntu02:/opt/DL/caffe-nv$ sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
...
I0704 11:51:57.520256 87780 solver.cpp:242] Iteration 4800 (94.5059 iter/s, 1.05814s/100 iter), loss = 0.429975
I0704 11:51:57.520298 87780 solver.cpp:261] Train net output #0: loss = 0.429975 (* 1 = 0.429975 loss)
I0704 11:51:57.520318 87780 sgd_solver.cpp:106] Iteration 4800, lr = 0.0001
I0704 11:51:58.578877 87780 solver.cpp:242] Iteration 4900 (94.4687 iter/s, 1.05855s/100 iter), loss = 0.631555
I0704 11:51:58.578930 87780 solver.cpp:261] Train net output #0: loss = 0.631555 (* 1 = 0.631555 loss)
I0704 11:51:58.578975 87780 sgd_solver.cpp:106] Iteration 4900, lr = 0.0001
I0704 11:51:59.628901 87780 solver.cpp:489] Snapshotting to HDF5 file examples/cifar10/cifar10_quick_iter_5000.caffemodel.h5
I0704 11:51:59.630488 87780 sgd_solver.cpp:283] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_quick_iter_5000.solverstate.h5
I0704 11:51:59.633839 87780 solver.cpp:342] Iteration 5000, loss = 0.444928
I0704 11:51:59.633874 87780 solver.cpp:362] Iteration 5000, Testing net (#0)
I0704 11:52:00.025651 87780 solver.cpp:429] Test net output #0: accuracy = 0.7373
I0704 11:52:00.025693 87780 solver.cpp:429] Test net output #1: loss = 0.784022 (* 1 = 0.784022 loss)
I0704 11:52:00.025703 87780 solver.cpp:347] Optimization Done.
I0704 11:52:00.041434 87780 caffe.cpp:234] Optimization Done.
162.54user 28.23system 1:02.22elapsed 306%CPU (0avgtext+0avgdata 997696maxresident)k
0inputs+6016outputs (0major+36442minor)pagefaults 0swaps

Tue Jul 4 11:51:21 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.119 Driver Version: 361.119 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0002:03:00.0 Off | 0 |
| N/A 39C P8 26W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0002:04:00.0 Off | 0 |
| N/A 36C P8 30W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0004:03:00.0 Off | 0 |
| N/A 50C P0 113W / 149W | 191MiB / 11441MiB | 92% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0004:04:00.0 Off | 0 |
| N/A 49C P0 128W / 149W | 152MiB / 11441MiB | 92% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 87633 C ./bin/caffe 189MiB |
| 3 87633 C ./bin/caffe 150MiB |
+-----------------------------------------------------------------------------+

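This time, let's launch this script twice in quick succession so that the two runs overlap. Because the script specifies -gpu 0,1, both jobs will try to use the same two GPUs. What happens then? As shown above, each job uses only about 150 to 190 MiB of memory per GPU, so it looks as if two jobs could run side by side without trouble.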

Wed Jul 5 10:42:55 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.119 Driver Version: 361.119 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0002:03:00.0 Off | 0 |
| N/A 39C P8 27W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0002:04:00.0 Off | 0 |
| N/A 36C P8 30W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0004:03:00.0 Off | 0 |
| N/A 44C P0 75W / 149W | 383MiB / 11441MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0004:04:00.0 Off | 0 |
| N/A 44C P0 85W / 149W | 304MiB / 11441MiB | 99% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 7906 C ./bin/caffe 189MiB |
| 2 8023 C ./bin/caffe 189MiB |
| 3 7906 C ./bin/caffe 150MiB |
| 3 8023 C ./bin/caffe 150MiB |
+-----------------------------------------------------------------------------+

The upshot is that both jobs do start, but instead of each simply taking twice as long, they hang in the state shown above. In other words, the two jobs lock each other up.

To avoid this, you have to check which GPUs are currently idle and edit the caffe script to name those GPUs. Here the second job must use -gpu 2,3 rather than -gpu 0,1. With that change, both jobs run fine, as shown below.

$TOOLS/caffe train -gpu 0,1 --solver=examples/cifar10/cifar10_quick_solver.prototxt

$TOOLS/caffe train -gpu 2,3 --solver=examples/cifar10/cifar10_quick_solver.prototxt

Wed Jul 5 11:09:03 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.119 Driver Version: 361.119 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0002:03:00.0 Off | 0 |
| N/A 46C P0 111W / 149W | 191MiB / 11441MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0002:04:00.0 Off | 0 |
| N/A 44C P0 124W / 149W | 152MiB / 11441MiB | 87% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0004:03:00.0 Off | 0 |
| N/A 48C P0 110W / 149W | 191MiB / 11441MiB | 92% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0004:04:00.0 Off | 0 |
| N/A 47C P0 127W / 149W | 152MiB / 11441MiB | 92% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 5343 C ./bin/caffe 189MiB |
| 1 5343 C ./bin/caffe 150MiB |
| 2 5221 C ./bin/caffe 189MiB |
| 3 5221 C ./bin/caffe 150MiB |
+-----------------------------------------------------------------------------+
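The manual check described above (look at nvidia-smi, then pick the idle GPUs) can be scripted. The following is a minimal sketch, not part of the original setup: the function name idle_gpus and its thresholds are hypothetical, and it assumes the installed nvidia-smi supports the --query-gpu CSV output. Also keep in mind that, as seen earlier in this post, caffe's -gpu numbering does not necessarily match nvidia-smi's numbering; setting CUDA_DEVICE_ORDER=PCI_BUS_ID for the caffe process is one commonly used way to align them, but verify that on your own system.

```shell
#!/bin/sh
# idle_gpus: read "index, utilization, memory_used" CSV rows on stdin and
# print a comma-separated list of the GPU indices that look idle.
# Both thresholds are hypothetical defaults; tune them for your site.
idle_gpus() {
    max_util="${1:-5}"     # GPU utilization (%) at or below which a GPU counts as idle
    max_mem="${2:-50}"     # used memory (MiB) at or below which a GPU counts as idle
    awk -F', *' -v u="$max_util" -v m="$max_mem" '
        $2 + 0 <= u && $3 + 0 <= m { list = list (list == "" ? "" : ",") $1 }
        END { print list }
    '
}

# Intended use on a real node (assumes nvidia-smi supports --query-gpu):
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#              --format=csv,noheader,nounits | idle_gpus
```

Even when scripted, this is only a workaround: two users running the check at the same moment can still pick the same GPUs.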

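However, having to monitor the GPUs and edit the run script every time is obviously inconvenient. The whole point of using LSF is that you can write a simple run script, submit it, and let LSF handle the scheduling without any such monitoring, so editing the script for every run defeats the purpose.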

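caffe's behavior when the -gpu option is omitted is a problem too. Without -gpu, the job is always assigned to the first GPU. So if you don't use the -gpu option, concurrent jobs are bound to collide on that GPU and fail, whether you launch them by hand or through LSF.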

Solving this problem takes the following steps. First, change the GPU compute mode from default (shared) mode to exclusive mode.

test@ubuntu02:~$ nvidia-smi -q | grep -i compute
Compute Mode : Default
Compute Mode : Default
Compute Mode : Default
Compute Mode : Default

According to the documentation, compute mode 1 is EXCLUSIVE_THREAD, but on this CUDA 8.0 system nvidia-smi warns that the mode is deprecated and sets EXCLUSIVE_PROCESS (mode 3) instead.

test@ubuntu02:~$ sudo nvidia-smi -c 1
Warning: Exclusive_Thread was deprecated! Setting Exclusive_Process instead.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0002:03:00.0.
Warning: Exclusive_Thread was deprecated! Setting Exclusive_Process instead.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0002:04:00.0.
Warning: Exclusive_Thread was deprecated! Setting Exclusive_Process instead.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0004:03:00.0.
Warning: Exclusive_Thread was deprecated! Setting Exclusive_Process instead.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0004:04:00.0.
All done.

For reference, compute mode 2 is PROHIBITED, a mode that blocks compute work on the GPU altogether.

test@ubuntu02:~$ sudo nvidia-smi -c 2
Set compute mode to PROHIBITED for GPU 0002:03:00.0.
Set compute mode to PROHIBITED for GPU 0002:04:00.0.
Set compute mode to PROHIBITED for GPU 0004:03:00.0.
Set compute mode to PROHIBITED for GPU 0004:04:00.0.
All done.

In practice, simply choose mode 3. Choosing mode 1 ends up as EXCLUSIVE_PROCESS anyway. The setting is lost on reboot, so to make it permanent, register the command in /etc/rc.local or a similar startup script.

test@ubuntu02:~$ sudo nvidia-smi -c 3
Set compute mode to EXCLUSIVE_PROCESS for GPU 0002:03:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0002:04:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0004:03:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0004:04:00.0.
All done.
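Because the compute mode is lost on reboot, it can be useful to check it before relying on it. The following is a minimal sketch under stated assumptions: the function name all_exclusive_process is hypothetical, and it assumes that 'nvidia-smi -q' reports the mode as Exclusive_Process in the same "Compute Mode : ..." layout that it used for Default in the output earlier in this post.

```shell
#!/bin/sh
# all_exclusive_process: succeed only if every "Compute Mode" line on stdin
# reports Exclusive_Process, and at least one such line is present.
# The mode string is an assumption about nvidia-smi -q output; adjust if needed.
all_exclusive_process() {
    awk -F': *' '
        /Compute Mode/ {
            gsub(/[ \t\r]+$/, "", $2)          # drop trailing whitespace
            total++
            if ($2 == "Exclusive_Process") ok++
        }
        END { exit !(total > 0 && ok == total) }
    '
}

# Possible use from a startup script such as /etc/rc.local:
#   nvidia-smi -q | grep -i compute | all_exclusive_process || nvidia-smi -c 3
```

The check reads from stdin so that it can be tried against saved nvidia-smi output before it is wired into a boot script.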

Now, while one training run is using gpu 0, a second training run that tries to use the same gpu 0 fails with the error shown below. The job that started first is unaffected and continues normally.

test@ubuntu02:/opt/DL/caffe-nv$ sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
F0705 11:32:12.339743 7616 gpu_memory.cpp:168] Check failed: error == cudaSuccess (46 vs. 0) all CUDA-capable devices are busy or unavailable
*** Check failure stack trace: ***
@ 0x3fff9f28ce0c google::LogMessage::Fail()
@ 0x3fff9f28f284 google::LogMessage::SendToLog()
@ 0x3fff9f28c768 google::LogMessage::Flush()
@ 0x3fff9f2911c4 google::LogMessageFatal::~LogMessageFatal()
@ 0x3fff9f5d9c50 caffe::GPUMemory::Manager::update_dev_info()
@ 0x3fff9f5daf74 caffe::GPUMemory::Manager::init()
@ 0x1000b128 (unknown)
@ 0x10007b54 (unknown)
@ 0x3fff9e97309c (unknown)
@ 0x3fff9e973298 __libc_start_main
@ (nil) (unknown)
Aborted (core dumped)
F0705 11:32:12.533152 7620 gpu_memory.cpp:168] Check failed: error == cudaSuccess (46 vs. 0) all CUDA-capable devices are busy or unavailable
*** Check failure stack trace: ***
@ 0x3fff9fb7ce0c google::LogMessage::Fail()
@ 0x3fff9fb7f284 google::LogMessage::SendToLog()
@ 0x3fff9fb7c768 google::LogMessage::Flush()
@ 0x3fff9fb811c4 google::LogMessageFatal::~LogMessageFatal()
@ 0x3fff9fec9c50 caffe::GPUMemory::Manager::update_dev_info()
@ 0x3fff9fecaf74 caffe::GPUMemory::Manager::init()
@ 0x1000b128 (unknown)
@ 0x10007b54 (unknown)
@ 0x3fff9f26309c (unknown)
@ 0x3fff9f263298 __libc_start_main
@ (nil) (unknown)
Aborted (core dumped)
Command exited with non-zero status 134
0.07user 0.08system 0:00.37elapsed 42%CPU (0avgtext+0avgdata 64512maxresident)k
0inputs+1024outputs (0major+3069minor)pagefaults 0swaps

This time, let's submit the caffe job through LSF. No special options are needed; just put bsub in front of the command.

test@ubuntu02:/opt/DL/caffe-nv$ bsub sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <220> is submitted to default queue <normal>.

test@ubuntu02:/opt/DL/caffe-nv$ bsub sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <221> is submitted to default queue <normal>.

The submission itself succeeded, but we need to check whether the jobs actually ran. The bhist command shows this. As expected, the first job submitted completed successfully.

test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 220

Job <220>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/D
L/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_quick.sh>
Wed Jul 5 11:34:11: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>;
Wed Jul 5 11:34:12: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[typ
e == local] order[r15s:pg] >;
Wed Jul 5 11:34:12: Starting (Pid 7824);
Wed Jul 5 11:34:12: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <7824>;
Wed Jul 5 11:35:43: Done successfully. The CPU time used is 183.6 seconds;
Wed Jul 5 11:35:45: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 271 Mbytes; AVG MEM: 269 Mbytes

Summary of time in seconds spent in various states by Wed Jul 5 11:35:45
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1 0 91 0 0 0 92

The problem is the second job, which did fail as expected. It exits with code 134, the same error as in the manual run.

test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 221

Job <221>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/D
L/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_quick.sh>
Wed Jul 5 11:35:15: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>;
Wed Jul 5 11:35:15: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[typ
e == local] order[r15s:pg] >;
Wed Jul 5 11:35:15: Starting (Pid 7963);
Wed Jul 5 11:35:15: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <7963>;
Wed Jul 5 11:35:18: Exited with exit code 134. The CPU time used is 0.7 second
s;
Wed Jul 5 11:35:18: Completed <exit>;

MEMORY USAGE:
MAX MEM: 37 Mbytes; AVG MEM: 37 Mbytes

Summary of time in seconds spent in various states by Wed Jul 5 11:35:18
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
0 0 3 0 0 0 3
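Exit code 134 is worth decoding: by shell convention, a status above 128 means the process was terminated by signal (status - 128), and 134 - 128 = 6 is SIGABRT, which matches the "Aborted (core dumped)" message from the manual run. The helper below is a small illustrative sketch; the function name describe_exit is hypothetical.

```shell
#!/bin/sh
# describe_exit: explain a shell-style exit status.
# A status above 128 conventionally means "killed by signal (status - 128)".
describe_exit() {
    status="$1"
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with status $status"
    fi
}

describe_exit 134
```

So the job did not fail inside LSF itself; caffe aborted on its own when it could not get the GPU.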

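What causes this error? It is simple once you think about it. LSF currently has no mechanism for monitoring GPU resources, and I gave no GPU requirements when submitting the jobs. LSF therefore placed the jobs by looking only at its default requirement, slots (CPU resources), which resulted in two caffe jobs running against the same GPU.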

The first step toward fixing this is to register the GPU resources with LSF so that it recognizes and monitors them.

First, let's see which GPU resource items LSF 10.1 provides. To do that, run the elim.gpu command. It lives under /usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/etc and does not exit on its own, so stop it with Ctrl-C.

test@ubuntu02:/opt/DL/caffe-nv$ elim.gpu
4 ngpus 4 ngpus_shared 0 ngpus_excl_t 0 ngpus_excl_p 4
^C

The leading 4 means that four parameters follow: the number of GPUs (ngpus) is 4, the number of GPUs in shared mode (ngpus_shared) is 0, the number in exclusive thread mode (ngpus_excl_t) is 0, and the number in exclusive process mode (ngpus_excl_p) is 4. The values come out this way because I set the GPU compute mode to 3, EXCLUSIVE_PROCESS, just above.
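The report format just described (a count followed by name/value pairs) is easy to unpack mechanically. The following sketch splits one such line into name=value lines and rejects a line whose leading count does not match the number of pairs. The function name parse_elim is hypothetical; the sample line is the elim.gpu output shown above.

```shell
#!/bin/sh
# parse_elim: turn one ELIM report line ("N name1 value1 ... nameN valueN")
# into "name=value" lines; fail if the leading count does not match.
parse_elim() {
    awk '
        {
            n = $1
            if (NF != 2 * n + 1) {
                print "malformed ELIM line: " $0 > "/dev/stderr"
                bad = 1
                next
            }
            for (i = 2; i < NF; i += 2) print $i "=" $(i + 1)
        }
        END { exit bad }
    '
}

echo "4 ngpus 4 ngpus_shared 0 ngpus_excl_t 0 ngpus_excl_p 4" | parse_elim
```

For the sample line this prints ngpus=4, ngpus_shared=0, ngpus_excl_t=0, and ngpus_excl_p=4, one per line.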

Now register these items with LSF. Go to the LSF conf directory and edit the lsf.shared file. In the existing stanza, between Begin Resource and End Resource, you will see entries such as mips and sparc, and the name aix as well.

test@ubuntu02:/usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/etc$ cd $LSF_ENVDIR

test@ubuntu02:/usr/share/lsf/conf$ vi lsf.shared

Begin Resource
mips Boolean () () (MIPS architecture)
sparc Boolean () () (SUN SPARC)
hpux Boolean () () (HP-UX UNIX)
aix Boolean () () (AIX UNIX)
irix Boolean () () (IRIX UNIX)
... (omitted) ...
openmpi Boolean () () (OPENMPI)
bluegene Boolean () () (BLUEGENE)
define_ncpus_procs Boolean () () (ncpus := procs)
define_ncpus_cores Boolean () () (ncpus := cores)
define_ncpus_threads Boolean () () (ncpus := threads)
vnode Boolean () () (Simulation node used by integrations for example Cray Linux)
craylinux Boolean () () (Cray Linux Environment: CRAY XT/XE login nodes and compute nodes)
gpu Boolean () () (gpu availability)
End Resource

Leave these entries as they are, and insert a new Begin Resource ~ End Resource stanza like the one below underneath them.

Begin Resource

RESOURCENAME TYPE INTERVAL INCREASING CONSUMABLE DESCRIPTION # Keywords
ngpus Numeric 60 N N (Number of GPUs)
ngpus_shared Numeric 60 N N (Number of GPUs in Shared Mode)
ngpus_excl_t Numeric 60 N Y (Number of GPUs in Exclusive thread Mode)
ngpuprohibited Numeric 60 N N (Number of GPUs in Prohibited Mode)
ngpus_excl_p Numeric 60 N Y (Number of GPUs in Exclusive process Mode)

End Resource

Next, edit the lsf.cluster.<cluster_name> file as well. My cluster is named firestone. Again, leave the existing entries alone and add the Begin ResourceMap ~ End ResourceMap section below.

test@ubuntu02:/usr/share/lsf/conf$ vi lsf.cluster.firestone
...
Begin ResourceMap
RESOURCENAME LOCATION
ngpus ([default])
ngpus_shared ([default])
ngpus_excl_t ([default])
ngpuprohibited ([default])
ngpus_excl_p ([default])
End ResourceMap

Now run the reconfig commands.

test@ubuntu02:/usr/share/lsf/conf$ lsadmin reconfig

Checking configuration files ...
No errors found.

Restart only the master candidate hosts? [y/n] n
Do you really want to restart LIMs on all hosts? [y/n] y
Restart LIM on <ubuntu02> ...... done

test@ubuntu02:/usr/share/lsf/conf$ badmin reconfig

Checking configuration files ...

No errors found.

Reconfiguration initiated

Now run the bhosts -l command.

test@ubuntu02:~$ bhosts -l
HOST ubuntu02
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
ok 250.00 - 16 0 0 0 0 0 -

CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem slots ngpus
Total 0.0 0.0 0.0 0% 0.0 29 1 0 782G 37.6G 125G 16 4.0
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M - 0.0

ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
Total 0.0 0.0 0.0 4.0
Reserved 0.0 0.0 - 0.0

LOAD THRESHOLD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -

ngpus ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
loadSched - - - - -
loadStop - - - - -

CONFIGURED AFFINITY CPU LIST: all

You can see that the items I just registered, ngpus_shared, ngpus_excl_t, ngpuprohibited, and ngpus_excl_p, are now being monitored.
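To watch a single value from this listing without reading the whole table, the columns can be picked out with awk. This is a rough sketch that assumes the 'bhosts -l' layout shown above (a header line of resource names directly followed by Total and Reserved rows); the function name bhosts_value is hypothetical, and the parsing may need adjusting for other LSF versions.

```shell
#!/bin/sh
# bhosts_value NAME ROW: read `bhosts -l` text on stdin and print the value
# in row ROW ("Total" or "Reserved") under the column headed NAME.
# Exits non-zero if the column is not found.
bhosts_value() {
    awk -v name="$1" -v row="$2" '
        $1 == "Total" || $1 == "Reserved" {
            if ($1 == row)
                for (i = 1; i <= n; i++)
                    if (hdr[i] == name) { print $(i + 1); found = 1; exit }
            next
        }
        { n = split($0, hdr, " ") }   # any other line may be a column header
        END { exit !found }
    '
}

# Intended use on a real LSF host:
#   bhosts -l ubuntu02 | bhosts_value ngpus_excl_p Reserved
```

The header row has no label in its first column, which is why the value is taken from field i + 1 of the Total or Reserved row.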

Now let's submit a caffe job with nothing but bsub in front of the command.

test@ubuntu02:/opt/DL/caffe-nv$ bsub sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <1033> is submitted to default queue <normal>.

As shown below, this job itself runs fine. Note that its status is RUN.

test@ubuntu02:/opt/DL/caffe-nv$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1033 test RUN normal ubuntu02 ubuntu02 *_quick.sh Jul 7 10:27

However, bhosts -l still shows ngpus_excl_p with a Total of 4 and a Reserved value of 0, even though nvidia-smi shows one GPU busy running caffe at that very moment.

test@ubuntu02:/opt/DL/caffe-nv$ bhosts -l
HOST ubuntu02
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
ok 250.00 - 16 1 1 0 0 0 -

CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem slots ngpus
Total 0.0 0.0 0.0 0% 0.0 58 1 0 782G 37.6G 124.6G 15 4.0
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M - 0.0

ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
Total 0.0 0.0 0.0 4.0
Reserved 0.0 0.0 - 0.0

LOAD THRESHOLD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -

ngpus ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
loadSched - - - - -
loadStop - - - - -

CONFIGURED AFFINITY CPU LIST: all

test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 1033

Job <1033>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/
DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_quick.sh>
Fri Jul 7 10:27:55: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>;
Fri Jul 7 10:27:55: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[typ
e == local] order[r15s:pg] >;
Fri Jul 7 10:27:55: Starting (Pid 14241);
Fri Jul 7 10:27:55: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <14241>;

Summary of time in seconds spent in various states by Fri Jul 7 10:28:23
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
0 0 28 0 0 0 28

If you submit one more caffe job in this situation, it fails with exit code 134, as shown below. caffe, following its default behavior, places the new job on the first GPU again, so the job dies with an error. This happens because LSF does not recognize a job submitted this way as one that needs GPU resources.

test@ubuntu02:/opt/DL/caffe-nv$ bsub sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <1034> is submitted to default queue <normal>.
test@ubuntu02:/opt/DL/caffe-nv$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1033 test RUN normal ubuntu02 ubuntu02 *_quick.sh Jul 7 10:27
test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 1034

Job <1034>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/
DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_quick.sh>
Fri Jul 7 10:28:40: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>;
Fri Jul 7 10:28:41: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[typ
e == local] order[r15s:pg] >;
Fri Jul 7 10:28:41: Starting (Pid 14363);
Fri Jul 7 10:28:41: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <14363>;
Fri Jul 7 10:28:42: Exited with exit code 134. The CPU time used is 0.2 second
s;
Fri Jul 7 10:28:42: Completed <exit>;

Summary of time in seconds spent in various states by Fri Jul 7 10:28:42
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1 0 1 0 0 0 2

To fix this, you must tell LSF at submission time that the job needs GPU resources and must be placed accordingly. That is what the select and rusage options are for.

In the example below, select[ngpus>0] tells LSF to assign the job to a server with at least one GPU, and rusage[ngpus_excl_p=1] declares that the job uses one GPU in EXCLUSIVE_PROCESS mode.

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=1]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <1153> is submitted to default queue <normal>.
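Since the same -R string is typed again and again with only the GPU count changing, it can be generated by a small helper. This is a sketch only: the function name gpu_resreq is hypothetical, and it simply reproduces the select/rusage string used in this post with the count substituted.

```shell
#!/bin/sh
# gpu_resreq: print an LSF resource requirement string that asks for N GPUs
# in exclusive-process mode. N must be a positive integer (default 1).
gpu_resreq() {
    n="${1:-1}"
    case "$n" in
        ''|*[!0-9]*|0)
            echo "gpu_resreq: GPU count must be a positive integer" >&2
            return 1
            ;;
    esac
    printf 'select[ngpus>0] rusage[ngpus_excl_p=%s]\n' "$n"
}

# Hypothetical wrapper use:
#   bsub -R "$(gpu_resreq 2)" ./examples/cifar10/train_01.sh
gpu_resreq 1
```

Keeping the string in one place makes it harder to submit a job with a GPU count that does not match what the script actually uses.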

With these options, the bhosts command shows the ngpus_excl_p value dropping from 4 to 3, while the Reserved value beneath it changes to 1.

test@ubuntu02:/opt/DL/caffe-nv$ bhosts -l
HOST ubuntu02
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
ok 250.00 - 16 1 1 0 0 0 -

CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem slots ngpus
Total 0.0 0.0 0.0 0% 0.0 123 1 0 781G 37.6G 124.1G 15 4.0
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M - 0.0

ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
Total 0.0 0.0 0.0 3.0
Reserved 0.0 0.0 - 1.0

What happens if we submit a second job in this state?

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=1]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <1154> is submitted to default queue <normal>.

test@ubuntu02:/opt/DL/caffe-nv$ bhosts -l
HOST ubuntu02
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
ok 250.00 - 16 1 1 0 0 0 -

CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem slots ngpus
Total 0.0 0.0 0.0 1% 0.0 28 1 0 781G 37.6G 123.7G 15 4.0
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M - 0.0

ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
Total 0.0 0.0 0.0 2.0
Reserved 0.0 0.0 - 2.0

As you can see, ngpus_excl_p drops to 2 and Reserved rises to 2. In other words, LSF no longer lets the caffe job land on its default GPU; it dispatches the job with the already occupied GPUs excluded from its environment!

Let's look at the first job, id 1153, with the bjobs command. Near the bottom, ubuntu02:gpus=2 shows that GPU 2 was allocated to it.

test@ubuntu02:/opt/DL/caffe-nv$ bjobs -l 1153

Job <1153>, User <test>, Project <default>, Status <RUN>, Queue <normal>, Comma
nd <sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl
/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_qu
ick.sh>, Share group charged </test>
Fri Jul 7 17:34:02: Submitted from host <ubuntu02>, CWD </opt/DL/caffe-nv>, Re
quested Resources <select[ngpus>0] rusage[ngpus_excl_p=1]>
;
Fri Jul 7 17:34:02: Started 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Slot(
s) on Host(s) <ubuntu02>, Execution Home </home/test>, Exe
cution CWD </opt/DL/caffe-nv>;
Fri Jul 7 17:35:01: Resource usage collected.
The CPU time used is 117 seconds.
MEM: 277 Mbytes; SWAP: 0 Mbytes; NTHREAD: 78
PGID: 12981; PIDs: 12981 12985 12987 12988 12989 12990

MEMORY USAGE:
MAX MEM: 277 Mbytes; AVG MEM: 274 Mbytes

SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -

ngpus ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
loadSched - - - - -
loadStop - - - - -

EXTERNAL MESSAGES:
MSG_ID FROM POST_TIME MESSAGE ATTACHMENT
0 test Jul 7 17:34 ubuntu02:gpus=2; N

RESOURCE REQUIREMENT DETAILS:
Combined: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
l_p=1.00]
Effective: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_ex
cl_p=1.00]

Now let's look at the second job, id 1154, with the bjobs command. Near the bottom, ubuntu02:gpus=3 shows that GPU 3 was allocated to it.

test@ubuntu02:/opt/DL/caffe-nv$ bjobs -l 1154

Job <1154>, User <test>, Project <default>, Status <EXIT>, Queue <normal>, Comm
and <sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/ncc
l/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_q
uick.sh>, Share group charged </test>
Fri Jul 7 17:34:04: Submitted from host <ubuntu02>, CWD </opt/DL/caffe-nv>, Re
quested Resources <select[ngpus>0] rusage[ngpus_excl_p=1]>
;
Fri Jul 7 17:34:04: Started 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Slot(
s) on Host(s) <ubuntu02>, Execution Home </home/test>, Exe
cution CWD </opt/DL/caffe-nv>;
...

EXTERNAL MESSAGES:
MSG_ID FROM POST_TIME MESSAGE ATTACHMENT
0 test Jul 7 17:34 ubuntu02:gpus=3; N

RESOURCE REQUIREMENT DETAILS:
Combined: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
l_p=1.00]
Effective: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_ex
cl_p=1.00]

This time, let's add the -gpu option to the caffe command so that each job uses 2 GPUs. Specifying -gpu 0,1 as below makes the job ask for gpu0 and gpu1 explicitly.

test@ubuntu02:/opt/DL/caffe-nv$ cat ./examples/cifar10/train_01.sh
#!/usr/bin/env sh

# Check if CAFFE_BIN is unset
if [ -z "$CAFFE_BIN" ]; then
# TOOLS=./build/tools
TOOLS=./bin
else
TOOLS=$CAFFE_BIN
fi

$TOOLS/caffe train -gpu 0,1 \
--solver=examples/cifar10/cifar10_quick_solver.prototxt

# reduce learning rate by factor of 10 after 8 epochs
$TOOLS/caffe train -gpu 0,1 \
--solver=examples/cifar10/cifar10_quick_solver_lr1.prototxt \
--snapshot=examples/cifar10/cifar10_quick_iter_4000.solverstate.h5

Let's run this script twice in a row. Since it names GPUs 0 and 1, will the two jobs fight over the same two GPUs?

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=2]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_01.sh
Job <1155> is submitted to default queue <normal>.

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=2]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_01.sh
Job <1156> is submitted to default queue <normal>.
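The resource requirement string passed to bsub -R above has a fixed shape, with only the GPU count changing. As a minimal sketch, a small helper (the name gpu_res_req is hypothetical, not part of LSF) could generate it so that the count is not typed by hand each time:

```shell
# Build the -R resource requirement string for a job that needs N GPUs
# in exclusive-process mode, in the same form as the bsub commands above.
# gpu_res_req is a hypothetical helper name, not an LSF command.
gpu_res_req() {
    n="${1:?usage: gpu_res_req <number-of-gpus>}"
    printf 'select[ngpus>0] rusage[ngpus_excl_p=%s]\n' "$n"
}
```

It could then be used as bsub -R "$(gpu_res_req 2)" ./examples/cifar10/train_01.sh, which should be equivalent to the hand-written form used in this post.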

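As shown below, in the bhosts output ngpus_excl_p has dropped to 0 and Reserved has risen to 4. In other words, LSF saw that two of the GPUs were already held by one job and placed the other job on the remaining two, so the two jobs do not collide even though the script says -gpu 0,1 in both cases.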

test@ubuntu02:/opt/DL/caffe-nv$ bhosts -l
HOST ubuntu02
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
ok 250.00 - 16 1 1 0 0 0 -

CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem slots ngpus
Total 0.0 0.0 0.0 0% 0.0 11 1 0 781G 37.6G 124.1G 15 4.0
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M - 0.0

ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
Total 0.0 0.0 0.0 0.0
Reserved 0.0 0.0 - 4.0

The bjobs command confirms this more directly: job 1155 was given GPUs 2 and 3, and job 1156 was given GPUs 0 and 1.

test@ubuntu02:/opt/DL/caffe-nv$ bjobs -l 1155 | grep gpu
quested Resources <select[ngpus>0] rusage[ngpus_excl_p=2]>
ngpus ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
0 test Jul 7 17:39 ubuntu02:gpus=2,3; N
Combined: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
Effective: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_ex

test@ubuntu02:/opt/DL/caffe-nv$ bjobs -l 1156 | grep gpu
quested Resources <select[ngpus>0] rusage[ngpus_excl_p=2]>
ngpus ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
0 test Jul 7 17:39 ubuntu02:gpus=0,1; N
Combined: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
Effective: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_ex
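The GPU assignment appears in the external message line in the form host:gpus=...;. As a rough sketch, that field can be pulled out with sed instead of reading the grep output by eye. The function name job_gpus is hypothetical, and the pattern assumes the message is not split across lines by the 80-column wrapping of bjobs -l:

```shell
# Print the GPU ids that LSF posted for a job, e.g. "2,3".
# Reads `bjobs -l <jobid>` (or `bhist -l <jobid>`) output on stdin,
# so the function itself never calls LSF.
# Assumption: the "host:gpus=...;" message sits on a single line.
job_gpus() {
    sed -n 's/.*:gpus=\([0-9,]*\);.*/\1/p' | head -n 1
}
```

For example, bjobs -l 1155 | job_gpus would be expected to print 2,3 for the output shown above.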

Now, what happens if we submit this two-GPU job three times in a row? Will we get an error?

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=2]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_01.sh
Job <1269> is submitted to default queue <normal>.

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=2]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_01.sh
Job <1270> is submitted to default queue <normal>.

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=2]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_01.sh
Job <1271> is submitted to default queue <normal>.

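No. The first and second jobs take two GPUs each and so use up all four, which leaves no GPU immediately available for the third job. In that case the third job simply waits in the PEND state until a running job finishes and releases its GPUs.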

test@ubuntu02:/opt/DL/caffe-nv$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1269 test RUN normal ubuntu02 ubuntu02 *ain_01.sh Jul 7 18:02
1270 test RUN normal ubuntu02 ubuntu02 *ain_01.sh Jul 7 18:02
1271 test PEND normal ubuntu02 *ain_01.sh Jul 7 18:02
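When many jobs are queued, counting them by state is easier than scanning the table. The sketch below assumes the default bjobs column layout shown above, where STAT is the third column. The function name count_stat is hypothetical:

```shell
# Count jobs in a given state (RUN, PEND, ...) from default `bjobs` output.
# $1 is the state to count; the bjobs table is read from stdin.
# Assumption: line 1 is the header and STAT is the third column.
count_stat() {
    awk -v want="$1" 'NR > 1 && $3 == want { n++ } END { print n + 0 }'
}
```

With the table above, bjobs | count_stat RUN would print 2 and bjobs | count_stat PEND would print 1.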

Once all the jobs have completed, let's look at their history with the bhist command. Job 1270, the second one submitted, spent only 16 seconds in the PEND state before it was dispatched and started using its GPUs.

test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 1270

Job <1270>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/
DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_01.sh>
Fri Jul 7 18:02:11: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>, Requested Resources <select[ngpus>0] rusa
ge[ngpus_excl_p=2]>;
Fri Jul 7 18:02:27: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[(ng
pus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
l_p=2.00] >;
Fri Jul 7 18:02:27: Starting (Pid 5830);
Fri Jul 7 18:02:27: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <5830>;
Fri Jul 7 18:02:28: External Message "ubuntu02:gpus=0,1;" was posted from "tes
t" to message box 0;
Fri Jul 7 18:03:44: Done successfully. The CPU time used is 222.3 seconds;
Fri Jul 7 18:03:45: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 517 Mbytes; AVG MEM: 467 Mbytes

Summary of time in seconds spent in various states by Fri Jul 7 18:03:45
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
16 0 77 0 0 0 93

Job 1271, the third one submitted, on the other hand stayed in the PEND state for about 86 seconds, until the first job (1269) finished, and only then was it dispatched.

test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 1271

Job <1271>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/
DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_01.sh>
Fri Jul 7 18:02:15: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>, Requested Resources <select[ngpus>0] rusa
ge[ngpus_excl_p=2]>;
Fri Jul 7 18:03:41: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[(ng
pus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
l_p=2.00] >;
Fri Jul 7 18:03:42: Starting (Pid 6729);
Fri Jul 7 18:03:42: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <6729>;
Fri Jul 7 18:04:52: Done successfully. The CPU time used is 207.9 seconds;
Fri Jul 7 18:04:53: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 530 Mbytes; AVG MEM: 463 Mbytes

Summary of time in seconds spent in various states by Fri Jul 7 18:04:53
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
86 0 71 0 0 0 157
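The PEND and RUN figures in the two summaries can be cross-checked against the Submitted, Dispatched and Done timestamps in the same bhist output. The following is a minimal sketch of that arithmetic. The function names are hypothetical, and it assumes both timestamps fall on the same day:

```shell
# Convert an HH:MM:SS timestamp to seconds since midnight.
# awk is used so that fields such as "08" are read as decimal numbers.
hms_to_sec() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds elapsed between two HH:MM:SS timestamps on the same day.
# $1 = earlier time (e.g. Submitted), $2 = later time (e.g. Dispatched)
elapsed_seconds() {
    echo $(( $(hms_to_sec "$2") - $(hms_to_sec "$1") ))
}
```

For job 1270, elapsed_seconds 18:02:11 18:02:27 gives 16, and for job 1271, elapsed_seconds 18:02:15 18:03:41 gives 86, matching the PEND columns above. Applied to the Dispatched and Done times, the same function gives 77 and 71, matching the RUN columns.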

With that, a basic LSF environment for GPU-based deep learning is ready.