First, run bqueues to list the available queues and pick the one to submit to (s822lc_p100nvme in this example).
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/models/official/resnet$ bqueues
QUEUE_NAME       PRIO  STATUS        MAX  JL/U  JL/P  JL/H  NJOBS   PEND   RUN  SUSP
b7p268             30  Open:Active     -     -     -     -      0      0     0     0
pmr                30  Open:Active     -     -     -     -      0      0     0     0
test-pmr           30  Open:Active     -     -     -     -      0      0     0     0
test-redhat        30  Open:Active     -     -     -     -      0      0     0     0
s822lc_p100_k80    30  Closed:Inact    -     -     -     -      0      0     0     0
822normal          30  Open:Active     -     -     -     -      0      0     0     0
s822lc_p100        30  Open:Active     -     -     -     -  22901  20021  2880     0
b7s004             30  Open:Active     -     -     -     -      0      0     0     0
coral_power9       30  Open:Active     -     -     -     -      0      0     0     0
s822lc_p100nvme    30  Open:Active     -     -     -     -      3      0     3     0
normal             30  Open:Active     -     -     -     -      0      0     0     0
s822lc_k80         30  Closed:Inact    -     -     -     -      0      0     0     0
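When the queue list is long, it can help to filter it down to the queues that are actually open and see how many jobs are already pending in each. The following is only a sketch: open_queues is a hypothetical helper name, and it assumes the default bqueues column layout shown above (queue name in column 1, STATUS in column 3, PEND in column 9).

```shell
# Print "<queue name> <pending jobs>" for every queue whose STATUS is
# Open:Active. Reads bqueues output on stdin and skips the header line.
# Assumes the default column layout: name=$1, STATUS=$3, PEND=$9.
open_queues() {
    awk 'NR > 1 && $3 == "Open:Active" { print $1, $9 }'
}

# Typical use on a login node:
#   bqueues | open_queues
```

In the listing above, this would make it easy to see that s822lc_p100 has a large backlog of pending jobs while s822lc_p100nvme has none.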
Next, use the bsub command to submit the job to that queue, requesting a node that has at least one GPU.
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/models/official/resnet$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=1]" -x -q s822lc_p100nvme PYTHONPATH=/gpfs/gpfs_gl4_16mb/b7p286za/anaconda3/lib/python3.6/site-packages time /gpfs/gpfs_gl4_16mb/b7p286za/anaconda3/bin/python /gpfs/gpfs_gl4_16mb/b7p286za/models/official/resnet/cifar10_main.py
Job <158267> is submitted to queue <s822lc_p100nvme>.
Here the -x option means exclusive: the job gets the entire node to itself even if it uses only one GPU. The operators of the Poughkeepsie center reportedly prefer that users specify -x, because it makes the systems easier to manage.
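The same options can also be kept in a job script as #BSUB directives instead of being typed on one long command line. This is a sketch rather than the method used in this post: the file name resnet_job.sh and the -o log file pattern are assumptions added for illustration, while the queue, resource request, -x flag, and paths are copied from the command above.

```shell
# Write a job script whose #BSUB lines carry the same options as the
# bsub command line above. The file name and the -o line are
# illustrative additions, not part of the original command.
cat > resnet_job.sh <<'EOF'
#!/bin/bash
#BSUB -q s822lc_p100nvme
#BSUB -R "select[ngpus>0] rusage[ngpus_excl_p=1]"
#BSUB -x
#BSUB -o resnet.%J.out

export PYTHONPATH=/gpfs/gpfs_gl4_16mb/b7p286za/anaconda3/lib/python3.6/site-packages
time /gpfs/gpfs_gl4_16mb/b7p286za/anaconda3/bin/python \
    /gpfs/gpfs_gl4_16mb/b7p286za/models/official/resnet/cifar10_main.py
EOF

# Submit it by feeding the script to bsub on standard input:
#   bsub < resnet_job.sh
```

LSF replaces %J in the -o file name with the job ID, so each run writes its own log file.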
Use the bhist command to check whether the job is running properly and on which host. In the output below, the job was dispatched to host p10a113.
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/models/official/resnet$ bhist -l 158267
Job <158267>, User <b7p286za>, Project <default>, Command <PYTHONPATH=/gpfs/gpf
s_gl4_16mb/b7p286za/anaconda3/lib/python3.6/site-packages
time /gpfs/gpfs_gl4_16mb/b7p286za/anaconda3/bin/python /gp
fs/gpfs_gl4_16mb/b7p286za/models/official/resnet/cifar10_m
ain.py>
Mon Nov 6 01:10:45: Submitted from host <p10login1>, to Queue <s822lc_p100nvme
>, CWD </gpfs/gpfs_gl4_16mb/b7p286za/models/official/resne
t>, Requested Resources <select[ngpus>0] rusage[ngpus_excl
_p=1]>;
Mon Nov 6 01:10:46: Dispatched 1 Task(s) on Host(s) <p10a113>, Allocated 1 Slo
t(s) on Host(s) <p10a113>, Effective RES_REQ <select[(ngpu
s>0) && (type == local)] order[r15s:pg] rusage[ngpus_excl_
p=1.00] >;
Mon Nov 6 01:10:46: Starting (Pid 157797);
Mon Nov 6 01:10:53: Running with execution home </u/b7p286za>, Execution CWD <
/gpfs/gpfs_gl4_16mb/b7p286za/models/official/resnet>, Exec
ution Pid <157797>;
Mon Nov 6 01:10:53: External Message "p10a113:gpus=1;" was posted from "b7p286
za" to message box 0;
Summary of time in seconds spent in various states by Mon Nov 6 01:11:19
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  1        0        33       0        0        0        34
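To script the submit-then-check flow shown above, the job ID and the execution host can be pulled out of the command output. This is a minimal sketch: job_id_from_bsub and exec_host_from_bhist are hypothetical helper names, and the patterns assume the exact message formats seen in this post (a single-host job, with the host name on the first line of the wrapped "Dispatched" event).

```shell
# Extract the numeric job ID from bsub's confirmation line, e.g.
# "Job <158267> is submitted to queue <s822lc_p100nvme>."
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Extract the execution host from "bhist -l" output by matching the
# "Dispatched ... on Host(s) <name>" event line. Only the first match
# is printed.
exec_host_from_bhist() {
    sed -n 's/.*Dispatched .* on Host(s) <\([^>]*\)>.*/\1/p' | head -n 1
}

# Typical use on a login node (bsub options elided here):
#   jobid=$(bsub ... | job_id_from_bsub)
#   bhist -l "$jobid" | exec_host_from_bhist
```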