HW 엔지니어를 위한 Deep Learning: ppc64le 아키텍처 cluster에서 HPCC 수행하는 방법

HPCC는 간단하게 수퍼컴의 성능을 측정할 수 있는, HPL (High Performance LINPACK)을 포함한 7개 HPC code들의 묶음 suite입니다. 아래가 홈페이지입니다.

http://icl.cs.utk.edu/hpcc/software/index.html

여기 나온 정보만으로는 컴파일해서 돌리는 것이 쉽지 않은데, 아래 site의 HPL 수행 방법을 보면 그나마 좀 이해가 됩니다.

http://www.crc.nd.edu/~rich/CRC_Summer_Scholars_2014/HPL-HowTo.pdf

여기서 돌리는 테스트들의 내용 등은 수학적 지식이 있어야 어느 정도 이해가 됩니다만, 시스템 엔지니어 입장에서는 그런 것 모르고도 대충 돌릴 수는 있습니다. 아래에는 ppc64le 아키텍처, 즉 IBM POWER8 프로세서 환경에서 어떻게 수행하면 되는지를 step by step으로 정리했습니다. 실은 x86 아키텍처가 아닌 ppc64le라고 해서 딱히 수행 방법이 다르지는 않습니다.

여기서는 PDP (Power Development Cloud) 환경의 1-core짜리 ppc64le Ubuntu 16.04 가상머신을 2대 이용했습니다.
(* Power Development Cloud, https://www-356.ibm.com/partnerworld/wps/servlet/ContentHandler/stg_com_sys_power-development-platform 에서 신청하면 무료로 2주간 1-core짜리 Linux on POWER 환경을 제공. 2주 후 다시 또 무료로 재신청 가능. 최대 5개 VM을 한꺼번에 신청 가능)

다음과 같이 openmpi와 BLAS가 기본으로 설치되어 있어야 합니다. apt-get install libopenmpi-dev libblas-dev 명령으로 쉽게 설치됩니다.

u0017649@sys-90393:~/hpcc-1.5.0$ dpkg -l | grep openmpi
ii libopenmpi-dev 1.10.2-8ubuntu1 ppc64el high performance message passing library -- header files
ii libopenmpi1.10 1.10.2-8ubuntu1 ppc64el high performance message passing library -- shared library
ii openmpi-bin 1.10.2-8ubuntu1 ppc64el high performance message passing library -- binaries
ii openmpi-common 1.10.2-8ubuntu1 all high performance message passing library -- common files

u0017649@sys-90393:~/hpcc-1.5.0$ dpkg -l | grep blas
ii libblas-common 3.6.0-2ubuntu2 ppc64el Dependency package for all BLAS implementations
ii libblas-dev 3.6.0-2ubuntu2 ppc64el Basic Linear Algebra Subroutines 3, static library
ii libblas3 3.6.0-2ubuntu2 ppc64el Basic Linear Algebra Reference implementations, shared library

먼저 source를 download 받고, tar를 풉니다.

u0017649@sys-90393:~$ wget http://icl.cs.utk.edu/projectsfiles/hpcc/download/hpcc-1.5.0.tar.gz

u0017649@sys-90393:~$ tar -zxf hpcc-1.5.0.tar.gz
u0017649@sys-90393:~$ cd hpcc-1.5.0

먼저 hpl/setup 디렉토리에 있는 make_generic을 수행하여 Make.UNKNOWN을 생성합니다. 여기서 대략 이 환경에 맞는 값들로 Makefile이 만들어집니다.

u0017649@sys-90393:~/hpcc-1.5.0$ cd hpl/setup

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ sh make_generic

여기서 만들어진 Make.UNKNOWN은 다음과 같은 내용을 담고 있습니다.

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ grep -v \# Make.UNKNOWN
SHELL = /bin/sh
CD = cd
CP = cp
LN_S = ln -s
MKDIR = mkdir
RM = /bin/rm -f
TOUCH = touch
ARCH = $(arch)
TOPdir = ../../..
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
HPLlib = $(LIBdir)/libhpl.a
MPdir =
MPinc =
MPlib =
LAdir =
LAinc =
LAlib = -lblas
F2CDEFS = -DAdd_ -DF77_INTEGER=int -DStringSunStyle
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib) -lm
HPL_OPTS =
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CC = mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS)
LINKER = mpif77
LINKFLAGS =
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo

이제 이 Make.UNKNOWN를 상위 디렉토리인 hpl 디렉토리에 Make.Linux라는 이름으로 복사합니다.

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ cp Make.UNKNOWN ../Make.Linux

그리고난 뒤 TOPdir, 즉 hpcc-1.5.0으로 올라와서 make arch=Linux를 수행합니다. 약간 헷갈릴 수 있는데, Make.Linux를 복사해둔 hpl 디렉토리가 아니라 그 위의 hpcc-1.5.0 디렉토리에서 make를 수행한다는 점에 유의하십시오. 그러면 아래처럼 mpicc가 수행되면서 7개 HPC code들을 모두 build합니다.

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ cd ../..

u0017649@sys-90393:~/hpcc-1.5.0$ make arch=Linux
...
mpicc -o ../../../../FFT/wrapfftw.o -c ../../../../FFT/wrapfftw.c -I../../../../include -DAdd_ -DF77_INTEGER=int -DStringSunStyle -I../../../include -I../../../include/Linux
mpicc -o ../../../../FFT/wrapmpifftw.o -c ../../../../FFT/wrapmpifftw.c -I../../../../include -DAdd_ -DF77_INTEGER=int -DStringSunStyle -I../../../include -I../../../include/Linux
...
ar: creating ../../../lib/Linux/libhpl.a
echo ../../../lib/Linux/libhpl.a
../../../lib/Linux/libhpl.a
mpif77 -o ../../../../hpcc ../../../lib/Linux/libhpl.a -lblas -lm
make[1]: Leaving directory '/home/u0017649/hpcc-1.5.0/hpl/lib/arch/build'

결과로 hpcc-1.5.0 디렉토리에 hpcc라는 실행 파일이 생성됩니다.

u0017649@sys-90393:~/hpcc-1.5.0$ file hpcc
hpcc: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), dynamically linked, interpreter /opt/at10.0/lib64/ld64.so.2, for GNU/Linux 4.4.0, BuildID[sha1]=b47fb43c4d96819e25da7469049a780f8251458b, not stripped

이 hpcc 파일을 수행하면 7개 HPC code들을 순차적으로 모두 수행하는 것입니다. 이를 위해서 먼저 LD_LIBRARY_PATH를 다음과 같이 설정합니다.

u0017649@sys-90393:~/hpcc-1.5.0$ export LD_LIBRARY_PATH=/usr/lib:/usr/lib/powerpc64le-linux-gnu:$LD_LIBRARY_PATH

그리고 INPUT data file을 만들어야 합니다. 함께 제공되는 _hpccinf.txt를 hpccinf.txt라는 이름으로 복사하여 그대로 사용하셔도 됩니다만, 여기서는 http://www.netlib.org/benchmark/hpl/tuning.html 에 나오는 내용대로 해보겠습니다. INPUT data file은 몇번째 줄에는 무슨 정보가 들어가야 한다는 일정한 format이 정해져 있어서 그대로 입력하셔야 하고, 각 줄의 의미는 앞에서 언급한 tuning.html 을 참조하시면 됩니다. 다만, 여기에 나오는 것처럼 P x Q 정보를 2 x 8 로 하면 총 16개 processor가 있어야 수행을 할 수 있습니다.

u0017649@sys-90393:~/hpcc-1.5.0$ vi hpccinf.txt
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
3000 6000 10000 Ns
5 # of NBs
80 100 120 140 160 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 2 Ps
6 8 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)

이대로 수행하면 다음과 같이 최소 16개 process가 필요하다면서 error가 납니다. 제가 수행하는 PDP 환경에는 1-core만 있기 때문입니다.

u0017649@sys-90393:~/hpcc-1.5.0$ ./hpcc
HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 16 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 16 processes for these tests <<<

따라서 위의 11~12번째 줄, 즉 P x Q 정보를 아래처럼 1로 바꿔주겠습니다.

1 1 Ps
1 1 Qs

이걸 그대로 수행하면 최소 18시간 이상 계속 돌아가더군요. 그래서 도중에 중단시키고, problem size인 6번째 줄의 값들을 1/10 씩으로 줄이겠습니다.

300 600 1000 Ns

이제 single node에서 돌릴 준비가 끝났습니다. 다음과 같이 hpccinf.txt를 만드셔서 수행하시면 됩니다.

u0017649@sys-90393:~/hpcc-1.5.0$ cat hpccinf.txt
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
300 600 1000 Ns
5 # of NBs
80 100 120 140 160 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 1 Ps
1 1 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)

수행 방법은 그냥 hpcc를 수행하는 것 뿐입니다. 위의 input data로 하면 1-core POWER8 환경에서는 약 20분 정도 걸립니다.

u0017649@sys-90393:~/hpcc-1.5.0$ time ./hpcc

그 결과물은 hpccoutf.txt 이라는 이름의 파일에 쌓이는데, 약 1.6MB 정도의 크기로 쌓이고 그 끝부분 내용은 아래와 같습니다.

...
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11R3R8 1000 160 1 1 0.08 8.655e+00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0055214 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11R4R8 1000 160 1 1 0.07 8.982e+00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0059963 ...... PASSED
================================================================================

Finished 3240 tests with the following results:
3240 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
Current time (1513215290) is Wed Dec 13 20:34:50 2017

End of HPL section.
Begin of Summary section.
VersionMajor=1
VersionMinor=5
VersionMicro=0
VersionRelease=f
LANG=C
Success=1
sizeof_char=1
sizeof_short=2
sizeof_int=4
sizeof_long=8
sizeof_void_ptr=8
sizeof_size_t=8
sizeof_float=4
sizeof_double=8
sizeof_s64Int=8
sizeof_u64Int=8
sizeof_struct_double_double=16
CommWorldProcs=1
MPI_Wtick=1.000000e-06
HPL_Tflops=0.00934813
HPL_time=0.071476
HPL_eps=1.11022e-16
HPL_RnormI=1.71035e-12
HPL_Anorm1=263.865
HPL_AnormI=262.773
HPL_Xnorm1=2619.63
HPL_XnormI=11.3513
HPL_BnormI=0.499776
HPL_N=1000
HPL_NB=80
HPL_nprow=1
HPL_npcol=1
HPL_depth=1
HPL_nbdiv=4
HPL_nbmin=4
HPL_cpfact=R
HPL_crfact=R
HPL_ctop=1
HPL_order=R
HPL_dMACH_EPS=1.110223e-16
HPL_dMACH_SFMIN=2.225074e-308
HPL_dMACH_BASE=2.000000e+00
HPL_dMACH_PREC=2.220446e-16
HPL_dMACH_MLEN=5.300000e+01
HPL_dMACH_RND=1.000000e+00
HPL_dMACH_EMIN=-1.021000e+03
HPL_dMACH_RMIN=2.225074e-308
HPL_dMACH_EMAX=1.024000e+03
HPL_dMACH_RMAX=1.797693e+308
HPL_sMACH_EPS=5.960464e-08
HPL_sMACH_SFMIN=1.175494e-38
HPL_sMACH_BASE=2.000000e+00
HPL_sMACH_PREC=1.192093e-07
HPL_sMACH_MLEN=2.400000e+01
HPL_sMACH_RND=1.000000e+00
HPL_sMACH_EMIN=-1.250000e+02
HPL_sMACH_RMIN=1.175494e-38
HPL_sMACH_EMAX=1.280000e+02
HPL_sMACH_RMAX=3.402823e+38
dweps=1.110223e-16
sweps=5.960464e-08
HPLMaxProcs=1
HPLMinProcs=1
DGEMM_N=576
StarDGEMM_Gflops=12.5581
SingleDGEMM_Gflops=11.5206
PTRANS_GBs=0.555537
PTRANS_time=0.000324011
PTRANS_residual=0
PTRANS_n=150
PTRANS_nb=120
PTRANS_nprow=1
PTRANS_npcol=1
MPIRandomAccess_LCG_N=524288
MPIRandomAccess_LCG_time=0.719245
MPIRandomAccess_LCG_CheckTime=0.076761
MPIRandomAccess_LCG_Errors=0
MPIRandomAccess_LCG_ErrorsFraction=0
MPIRandomAccess_LCG_ExeUpdates=2097152
MPIRandomAccess_LCG_GUPs=0.00291577
MPIRandomAccess_LCG_TimeBound=60
MPIRandomAccess_LCG_Algorithm=0
MPIRandomAccess_N=524288
MPIRandomAccess_time=0.751529
MPIRandomAccess_CheckTime=0.0741832
MPIRandomAccess_Errors=0
MPIRandomAccess_ErrorsFraction=0
MPIRandomAccess_ExeUpdates=2097152
MPIRandomAccess_GUPs=0.00279051
MPIRandomAccess_TimeBound=60
MPIRandomAccess_Algorithm=0
RandomAccess_LCG_N=524288
StarRandomAccess_LCG_GUPs=0.0547931
SingleRandomAccess_LCG_GUPs=0.0547702
RandomAccess_N=524288
StarRandomAccess_GUPs=0.0442224
SingleRandomAccess_GUPs=0.044326
STREAM_VectorSize=333333
STREAM_Threads=1
StarSTREAM_Copy=2.13593
StarSTREAM_Scale=2.09571
StarSTREAM_Add=3.23884
StarSTREAM_Triad=3.26659
SingleSTREAM_Copy=2.13593
SingleSTREAM_Scale=2.11213
SingleSTREAM_Add=3.23884
SingleSTREAM_Triad=3.26659
FFT_N=131072
StarFFT_Gflops=0.488238
SingleFFT_Gflops=0.488024
MPIFFT_N=65536
MPIFFT_Gflops=0.305709
MPIFFT_maxErr=1.23075e-15
MPIFFT_Procs=1
MaxPingPongLatency_usec=-1
RandomlyOrderedRingLatency_usec=-1
MinPingPongBandwidth_GBytes=-1
NaturallyOrderedRingBandwidth_GBytes=-1
RandomlyOrderedRingBandwidth_GBytes=-1
MinPingPongLatency_usec=-1
AvgPingPongLatency_usec=-1
MaxPingPongBandwidth_GBytes=-1
AvgPingPongBandwidth_GBytes=-1
NaturallyOrderedRingLatency_usec=-1
FFTEnblk=16
FFTEnp=8
FFTEl2size=1048576
M_OPENMP=-1
omp_get_num_threads=0
omp_get_max_threads=0
omp_get_num_procs=0
MemProc=-1
MemSpec=-1
MemVal=-1
MPIFFT_time0=0
MPIFFT_time1=0.00124693
MPIFFT_time2=0.00407505
MPIFFT_time3=0.000625134
MPIFFT_time4=0.0091598
MPIFFT_time5=0.00154018
MPIFFT_time6=0
CPS_HPCC_FFT_235=0
CPS_HPCC_FFTW_ESTIMATE=0
CPS_HPCC_MEMALLCTR=0
CPS_HPL_USE_GETPROCESSTIMES=0
CPS_RA_SANDIA_NOPT=0
CPS_RA_SANDIA_OPT2=0
CPS_USING_FFTW=0
End of Summary section.
########################################################################
End of HPC Challenge tests.
Current time (1513215290) is Wed Dec 13 20:34:50 2017

########################################################################

1184.15user 2.77system 19:47.13elapsed 99%CPU (0avgtext+0avgdata 34624maxresident)k

이제 multi-node로 수행하는 방법을 보겠습니다. 이 역시 매우 간단합니다. 먼저 다음과 같이 node 이름을 담은 파일을 만듭니다. 만약 노드들에 network interface가 여러개라면 그 중 10GbE 또는 Infiniband처럼 고속인 것을 적어주는 것이 좋습니다.

u0017649@sys-90393:~/hpcc-1.5.0$ cat nodes.rf
sys-90393
sys-90505

이어서 INPUT data 파일인 hpccinf.txt를 조금 수정해줍니다. 위에서 사용한 것과는 달리, 일단 2대를 사용하니까 2 process로 돌아야 합니다. 따라서 11~12번째 줄, 즉 P x Q 정보를 아래처럼 1 1, 그리고 1 2로 바꿔주겠습니다. 또, 이대로 돌리니 MPI overhead가 있어서인지 20분이 아니라 40분이 되도록 끝나질 않더군요. 그래서 6번째 줄의 problem size도 다시 1/10로 더 줄였습니다.

u0017649@sys-90393:~/hpcc-1.5.0$ cat hpccinf.txt
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
30 60 100 Ns
5 # of NBs
80 100 120 140 160 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 1 Ps
1 2 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)

그러고 난 다음에는 다음과 같이 mpirun으로 hpcc를 수행해주면 됩니다.

u0017649@sys-90393:~/hpcc-1.5.0$ mpirun -x PATH -x LD_LIBRARY_PATH -np 2 -hostfile nodes.rf ./hpcc | tee HPCC.out

시작과 동시에 양쪽 node에서 CPU를 100% 쓰면서 돌아가는 것을 확인하실 수 있습니다.

HW 엔지니어를 위한 Deep Learning

2017년 12월 14일 목요일

ppc64le 아키텍처 cluster에서 HPCC 수행하는 방법

댓글 없음:

댓글 쓰기