Deep Learning for HW Engineers: libopenblas.so undefined symbol: dtrsm

This post summarizes how to run HPL (High Performance Linpack) with CUDA on a POWER9 AC922. I mostly followed the steps described on the site below.

https://www.slothparadise.com/compile-hpl-linpack/

First, install the required packages as follows.

[user1@ac922 files]$ sudo yum install openmpi openmpi-devel mpich openblas openblas-static mpich-3.0-devel atlas lapack

In addition to atlas, atlas-devel is also required, and it ships on the Red Hat optional DVD. Since I do not have that DVD, I had to download the ppc64le Fedora builds of atlas-3.10.2 and atlas-devel-3.10.2 from rpmfind.net and install those instead.

[user1@ac922 files]$ wget https://rpmfind.net/linux/fedora-secondary/releases/25/Everything/ppc64le/os/Packages/a/atlas-3.10.2-12.fc24.ppc64le.rpm

[user1@ac922 files]$ wget https://rpmfind.net/linux/fedora-secondary/releases/25/Everything/ppc64le/os/Packages/a/atlas-devel-3.10.2-12.fc24.ppc64le.rpm

[user1@ac922 files]$ sudo rpm -Uvh atlas-3.10.2-12.fc24.ppc64le.rpm atlas-devel-3.10.2-12.fc24.ppc64le.rpm

Only liblapack.so.3.4.2 is installed, not liblapack.so, so create the latter as a symbolic link. The same applies to libopenblaso.so.

[user1@ac922 files]$ sudo ln -s /usr/lib64/liblapack.so.3.4.2 /usr/lib64/liblapack.so

[user1@ac922 files]$ sudo ln -s /usr/lib64/libopenblaso-r0.2.20.so /usr/lib64/libopenblaso.so

Next, get the source code of the CUDA version of HPL (it is the x86_64 version, though). You can download it from the NVIDIA site below after logging in. Creating a login ID requires registration, which is free.

https://developer.nvidia.com/rdp/assets/cuda-accelerated-linpack-linux64

After accepting the license terms there, you can download hpl-2.0_FERMI_v15.solitairetheme8. Despite the extension, it is a tar.gz file.

[user1@ac922 files]$ tar -zxvf hpl-2.0_FERMI_v15.solitairetheme8

[user1@ac922 files]$ cd hpl-2.0_FERMI_v15

First, cuda_dgemm.c is written with the Intel MKL library in mind, so its source needs a small modification.

[user1@ac922 hpl-2.0_FERMI_v15]$ vi ./src/cuda/cuda_dgemm.c
...
// handle2 = dlopen ("libmkl_intel_lp64.so", RTLD_LAZY);
handle2 = dlopen ("libopenblas.so", RTLD_LAZY);
...
// dgemm_mkl = (void(*)())dlsym(handle, "dgemm");
dgemm_mkl = (void(*)())dlsym(handle, "dgemm_");
...
// handle = dlopen ("libmkl_intel_lp64.so", RTLD_LAZY);
handle = dlopen ("libopenblas.so", RTLD_LAZY);
...
// mkl_dtrsm = (void(*)())dlsym(handle2, "dtrsm");
mkl_dtrsm = (void(*)())dlsym(handle2, "dtrsm_");
...

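Without these changes, run_linpack fails with runtime errors like the following. On the ppc64le architecture we use the open-source OpenBLAS instead of libmkl_intel_lp64, and OpenBLAS exports these routines under Fortran-style names with a trailing underscore (dgemm_, dtrsm_) rather than as dgemm and dtrsm.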

libmkl_intel_lp64.so: cannot open shared object file: No such file or directory
libopenblas.so.0: undefined symbol: dtrsm
libopenblas.so.0: undefined symbol: dgemm
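Before editing the dlsym() calls, it can help to confirm which spelling the installed library actually exports. The following is a minimal sketch using Python's ctypes, which resolves symbols through the same dynamic loader; the helper name first_exported is my own, and the library name libopenblas.so is taken from this post and assumed to be on the loader path.

```python
import ctypes


def first_exported(lib, candidates):
    """Return the first name in `candidates` that `lib` exports, else None."""
    for name in candidates:
        # ctypes looks the symbol up lazily; a missing symbol raises
        # AttributeError, which hasattr() turns into False.
        if hasattr(lib, name):
            return name
    return None


if __name__ == "__main__":
    try:
        # Assumption: libopenblas.so can be found by the dynamic loader.
        blas = ctypes.CDLL("libopenblas.so")
    except OSError as exc:
        print("could not load libopenblas.so:", exc)
    else:
        for base in ("dgemm", "dtrsm"):
            # Try the plain name first, then the Fortran-style name.
            print(base, "->", first_exported(blas, [base, base + "_"]))
```

If the plain names are reported as missing and the underscored ones are found, that matches the undefined symbol errors above.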

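Now edit Make.CUDA for the build. Not much changes just because the architecture is ppc64le. The verbose -L and -lmpich flags used below in place of libmpich.a are there because, again lacking the optional Red Hat DVD, I could not install mpich-devel in my environment and therefore have no libmpich.a. Note in particular that -lopenblas is used instead of -lmkl.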

[user1@ac922 hpl-2.0_FERMI_v15]$ vi Make.CUDA
...
#TOPdir = /home/mfatica/hpl-2.0_FERMI_v15
TOPdir = /home/user1/files/hpl-2.0_FERMI_v15
...
#MPdir = /opt/intel/mpi/3.0
#MPinc = -I$(MPdir)/include64
#MPlib = $(MPdir)/lib64/libmpi.a
#MPlib = $(MPdir)/lib64/libmpich.a
MPdir = /usr/lib64/openmpi
MPinc = -I /usr/include/openmpi-ppc64le
MPlib = -L /usr/lib64/openmpi/lib -lmpi -L /usr/lib64/mpich/lib -lmpich
...
#LAdir = $(TOPdir)/../../lib/em64t
#LAdir = /share/apps/intel/mkl/10.2.4.032/libem64t
#LAinc =
# CUDA
#LAlib = -L /home/cuda/Fortran_Cuda_Blas -ldgemm -L/usr/local/cuda/lib -lcublas -L$(LAdir) -lmkl -lguide -lpthread
LAdir = /usr/lib64
LAinc = -I /usr/include/openblas -I /usr/include
#LAlib = ${LAdir}/libopenblas.a
LAlib = -L $(TOPdir)/src/cuda -ldgemm -L /usr/lib64/atlas -lsatlas -ltatlas -L /usr/local/cuda-9.1/targets/ppc64le-linux/lib/stubs -lcuda -lcublas -L /usr/local/cuda-9.1/lib64 -lcudart -L$(LAdir) -lpthread -lopenblas -lopenblaso -lm -L /usr/lib/gcc/ppc64le-redhat-linux/4.8.2 -lgfortran ${LAdir}/libopenblas.a

...

#CC = mpicc
CC = /usr/lib64/openmpi/bin/mpicc

Now set the environment variables as shown below and run make arch=CUDA; the build should go through without a hitch.

[user1@ac922 hpl-2.0_FERMI_v15]$ export PATH=/usr/lib64/openmpi/bin:$PATH
[user1@ac922 CUDA]$ export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:/usr/lib64/mpich/lib:$LD_LIBRARY_PATH

[user1@ac922 hpl-2.0_FERMI_v15]$ make arch=CUDA
...
/usr/lib64/openmpi/bin/mpicc -DAdd__ -DF77_INTEGER=int -DStringSunStyle -DCUDA -I/home/user1/files/hpl-2.0_FERMI_v15/include -I/home/user1/files/hpl-2.0_FERMI_v15/include/CUDA -I /usr/include/openblas -I /usr/include -I /usr/include/openmpi-ppc64le -I/usr/local/cuda/include -fomit-frame-pointer -O3 -funroll-loops -W -Wall -fopenmp -o /home/user1/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl HPL_pddriver.o HPL_pdinfo.o HPL_pdtest.o /home/user1/files/hpl-2.0_FERMI_v15/lib/CUDA/libhpl.a -L /home/user1/files/hpl-2.0_FERMI_v15/src/cuda -ldgemm -L /usr/lib64/atlas -lsatlas -ltatlas -L /usr/local/cuda-9.1/targets/ppc64le-linux/lib/stubs -lcuda -lcublas -L /usr/local/cuda-9.1/lib64 -lcudart -L/usr/lib64 -lpthread -L /usr/lib64/openmpi/lib -lmpi -L /usr/lib64/mpich/lib -lmpich
make TOPdir=/home/user1/files/hpl-2.0_FERMI_v15 /home/user1/files/hpl-2.0_FERMI_v15/bin/CUDA/HPL.dat
make[3]: Entering directory `/home/user1/files/hpl-2.0_FERMI_v15/testing/ptest/CUDA'
make[3]: `/home/user1/files/hpl-2.0_FERMI_v15/bin/CUDA/HPL.dat' is up to date.
make[3]: Leaving directory `/home/user1/files/hpl-2.0_FERMI_v15/testing/ptest/CUDA'
touch dexe.grd
make[2]: Leaving directory `/home/user1/files/hpl-2.0_FERMI_v15/testing/ptest/CUDA'
make[1]: Leaving directory `/home/user1/files/hpl-2.0_FERMI_v15'

The executable is created under bin/CUDA with the name xhpl, as shown below.

[user1@ac922 hpl-2.0_FERMI_v15]$ cd bin/CUDA
[user1@ac922 CUDA]$ ls -l
total 264
-rw-r--r--. 1 user1 user1 1344 Jul 17 2012 HPL.dat
-rw-r--r--. 1 user1 user1 1333 Jul 17 2012 HPL.dat_example
-rw-r--r--. 1 user1 user1 6816 Jul 17 2012 output_example
-rwxr-xr-x. 1 user1 user1 607 Jul 17 2012 run_linpack
-rwxrwxr-x. 1 user1 user1 284552 Mar 22 17:56 xhpl

To run the benchmark, do not invoke xhpl directly; use the provided run_linpack script instead. Only HPL_DIR needs to be changed in it.

[user1@ac922 CUDA]$ vi run_linpack
...
#HPL_DIR=/home/mfatica/hpl-2.0_FERMI_v15
HPL_DIR=/home/user1/files/hpl-2.0_FERMI_v15

Next, edit HPL.dat, which serves as the input file. For how to edit it and what the main input parameters mean, see the URL below.

http://www.netlib.org/benchmark/hpl/tuning.html

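What matters here is how large to make the problem size (Ns), and which process grid (P x Q) and block size (NBs) to run it with.

The usual formula for the problem size (Ns) is as follows.

sqrt(memory size in bytes * number of nodes * target mem% / 8 bytes per double-precision (64-bit) value)

At first I thought that, for this CUDA version, the calculation should be based on GPU memory. Since four GPUs with 16GB of memory each are used here, that would give the following.

sqrt(GPU mem size * # of GPUs * target mem% / 8 bytes per double-precision (64-bit) value)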
sqrt(16 * 1024^3 * 4 * 0.8 / 8) = 82897

In practice, however, each GPU used only about 2.5GB of memory no matter what Ns I set. So Ns has to be calculated from the server's main memory. For a server with 512GB of RAM, it becomes the following.

sqrt(512 * 1024^3 * 0.8 / 8 ) = 234468
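The two calculations above can be reproduced with a few lines of Python. This is only a sketch of the formula as written in this post; the function name hpl_problem_size and its parameter names are my own, and the 0.8 memory fraction is the value used above.

```python
import math

BYTES_PER_DOUBLE = 8  # one double-precision (64-bit) matrix element


def hpl_problem_size(mem_bytes_per_node, nodes=1, mem_fraction=0.8):
    """Largest N whose N x N matrix of doubles fits in the given memory share."""
    usable_bytes = mem_bytes_per_node * nodes * mem_fraction
    return math.isqrt(int(usable_bytes // BYTES_PER_DOUBLE))


GIB = 1024 ** 3
# Main-memory based estimate for a server with 512GB of RAM.
print(hpl_problem_size(512 * GIB))          # 234468
# The initial GPU-memory based guess: 4 GPUs with 16GB each.
print(hpl_problem_size(16 * GIB, nodes=4))  # 82897
```

The HPL.dat used later in this post rounds the first value down to 234000.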

The process grid (P x Q) should match the number of GPUs installed in the AC922. You can use 2 x 2 = 4, 1 x 4 = 4, or run both. Here I simply use the flat grid, 1 x 4.

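As for the block size (NBs) used on the process grid, the usual advice is to pick a value between 32 and 256 for CPUs but to make it much larger, on the order of 1000, for CUDA. In my runs, 1024 seemed to work better than 2048.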
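As a small aid for choosing these values, the sketch below lists the possible P x Q grids for a given number of MPI processes and rounds a problem size down to a multiple of the block size. The rounding step is a common tuning habit and an assumption on my part, not something this post requires (the HPL.dat in this post simply uses 234000); both function names are hypothetical.

```python
import math


def process_grids(num_ranks):
    """All (P, Q) pairs with P * Q == num_ranks and P <= Q."""
    return [(p, num_ranks // p)
            for p in range(1, math.isqrt(num_ranks) + 1)
            if num_ranks % p == 0]


def round_down_to_block(n, nb):
    """Largest multiple of nb that does not exceed n."""
    return (n // nb) * nb


print(process_grids(4))                   # [(1, 4), (2, 2)]
print(round_down_to_block(234468, 1024))  # 233472
```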

[user1@ac922 CUDA]$ vi HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
234000 Ns
1 # of NBs
1024 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
4 Qs
16.0 threshold
1 # of panel fact
0 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
3 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
32 memory alignment in double (> 0)

Now run it. Since P x Q = 4, use mpirun to launch 4 processes.

[user1@ac922 CUDA]$ nohup time mpirun -np 4 ./run_linpack > linpack_out16.txt &
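The value given to -np has to match P x Q in HPL.dat. The following sketch reads the grid lines from an HPL.dat file and returns P * Q for each grid, so it can be compared with the -np value before launching. It relies on HPL.dat being positional (line 10 holds the number of grids, lines 11 and 12 the P and Q values); the function name grid_rank_counts and the SAMPLE text, which repeats the first 12 lines of the file above, are my own.

```python
def grid_rank_counts(hpl_dat_text):
    """Return P * Q for every process grid listed in an HPL.dat file."""
    lines = hpl_dat_text.splitlines()
    # HPL.dat is read by position: line 10 is the number of grids,
    # line 11 the P values and line 12 the Q values (1-based).
    num_grids = int(lines[9].split()[0])
    ps = [int(tok) for tok in lines[10].split()[:num_grids]]
    qs = [int(tok) for tok in lines[11].split()[:num_grids]]
    return [p * q for p, q in zip(ps, qs)]


SAMPLE = "\n".join([
    "HPLinpack benchmark input file",
    "Innovative Computing Laboratory, University of Tennessee",
    "HPL.out output file name (if any)",
    "6 device out (6=stdout,7=stderr,file)",
    "1 # of problems sizes (N)",
    "234000 Ns",
    "1 # of NBs",
    "1024 NBs",
    "0 PMAP process mapping (0=Row-,1=Column-major)",
    "1 # of process grids (P x Q)",
    "1 Ps",
    "4 Qs",
])

print(grid_rank_counts(SAMPLE))  # [4]
```

To check a real file, pass the result of open("HPL.dat").read() instead of SAMPLE.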

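You are probably curious about the results, but I would rather not publish numbers from my own rough, untuned run. Note, however, that hpl-2.0_FERMI_v15 is a CUDA port of HPL written around 2011 for the GPUs of that time, so it does not reach full performance on modern P100 or V100 GPUs. (See https://devtalk.nvidia.com/default/topic/991058/cuda-programming-and-performance/poor-results-from-cuda-linpack-on-k80/post/5074677/) NVIDIA presumably has an HPL-CUDA version written for newer GPUs, but I am told it is not publicly available. Indeed, my own results fall far short of the theoretical peak (Rpeak).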
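For readers who want to compare their own run against Rpeak, the sketch below converts a problem size and wall time into GFLOPS and an efficiency ratio. The operation count 2/3 N^3 + 3/2 N^2 is, to my knowledge, the one HPL uses when it reports Gflops, but treat it as an assumption; the function names and every input value in the usage part are hypothetical and are not measurements from this post.

```python
def hpl_flops(n):
    """Floating-point operations credited for solving an n x n system."""
    return (2.0 / 3.0) * n ** 3 + (3.0 / 2.0) * n ** 2


def hpl_gflops(n, seconds):
    """Achieved rate (Rmax) in GFLOPS for a run that took `seconds`."""
    return hpl_flops(n) / seconds / 1e9


def efficiency(rmax_gflops, rpeak_gflops):
    """Fraction of the theoretical peak that was actually achieved."""
    return rmax_gflops / rpeak_gflops


# Hypothetical inputs for illustration only; not results from this post.
n = 234000
elapsed_seconds = 3600.0
rpeak_gflops = 30000.0
rmax = hpl_gflops(n, elapsed_seconds)
print(f"Rmax = {rmax:.1f} GFLOPS, efficiency = {efficiency(rmax, rpeak_gflops):.1%}")
```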

During the run, CPU usage looks like the following: each MPI process (one per -np) keeps only a single core 100% busy.

+nmon-16g------[H for help]---Hostname=ac922--------Refresh= 2secs ---10:15.15
| CPU Utilisation ----------------------------------------------------------|
|---------------------------+-----------------------------------------------|
|CPU User% Sys% Wait% Idle|0 |25 |50 |75 1|
| 1 0.0 0.0 0.0 100.0| > |
| 2 100.0 0.0 0.0 0.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU|
| 3 0.0 0.0 0.0 100.0| > |
| 4 5.1 1.4 0.0 93.5|UU> |
| 5 1.0 0.0 0.0 99.0| > |
| 6 1.0 0.0 0.0 99.0| > |
| 7 1.5 0.0 0.0 98.5| > |
| 8 0.5 0.0 0.0 99.5| > |
| 9 2.0 1.5 0.0 96.6| > |
| 10 1.0 0.0 0.0 99.0| > |
| 11 1.0 0.0 0.0 99.0| > |
| 12 100.0 0.0 0.0 0.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU|
| 13 0.0 0.0 0.0 100.0| > |
| 14 0.0 0.0 0.0 100.0| > |
| 15 0.0 0.0 0.0 100.0| > |
| 16 0.0 0.0 0.0 100.0| > |
| 17 100.0 0.0 0.0 0.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU|
| 18 0.0 0.0 0.0 100.0| > |
| 19 1.0 0.0 0.0 99.0| > |
| 20 99.5 0.5 0.0 0.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU|
| 21 0.0 0.0 0.0 100.0| |
| 22 0.0 0.0 0.0 100.0| > |
| 23 1.5 0.0 0.0 98.5| > |
| 24 1.5 0.5 0.0 98.0| > |
| 25 1.0 0.0 0.0 99.0| > |
| 26 2.0 0.0 0.0 98.0| > |
| 27 1.5 0.0 0.0 98.5| > |
| 28 1.0 0.0 0.0 99.0| > |
| 29 0.0 0.0 0.0 100.0| > |
| 30 0.0 0.0 0.0 100.0|> |
| 31 0.0 0.0 0.0 100.0|> |
| 32 0.0 0.0 0.0 100.0| > |
|---------------------------+-----------------------------------------------|
|Avg 13.2 0.1 0.0 86.7|UUUUUU > |
|---------------------------+-----------------------------------------------|
| Top Processes Procs=1297-mode=3-1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Arg|
| PID %CPU Size Res Res Res Res Shared Faults Command |
| Used KB Set Text Data Lib KB Min Maj |
| 112610 108.3 34164m 14482m 256 14442m 0 1742724413 0 xhpl |

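GPU utilization does not stay at 100% continuously; it only reaches 100% from time to time, so overall it is not high. Contrary to what I expected, GPU memory usage always shows 2516MiB per process, as below, no matter what Ns I set.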

Wed Mar 28 10:14:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.36 Driver Version: 387.36 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 41C P0 62W / 300W | 2586MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 46C P0 64W / 300W | 2586MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 43C P0 63W / 300W | 2586MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 48C P0 63W / 300W | 2586MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 112607 C ...1/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2516MiB |
| 1 112609 C ...1/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2516MiB |
| 2 112610 C ...1/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2516MiB |
| 3 112611 C ...1/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2516MiB |
+-----------------------------------------------------------------------------+
Wed Mar 28 10:14:44 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.36 Driver Version: 387.36 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 45C P0 219W / 300W | 2586MiB / 16128MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 51C P0 235W / 300W | 2586MiB / 16128MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 46C P0 64W / 300W | 2586MiB / 16128MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 50C P0 64W / 300W | 2586MiB / 16128MiB | 89% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 112607 C ...1/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2516MiB |
| 1 112609 C ...1/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2516MiB |
| 2 112610 C ...1/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2516MiB |
| 3 112611 C ...1/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2516MiB |
+-----------------------------------------------------------------------------+

Friday, March 23, 2018

Compiling and running the CUDA version of HPL on POWER9 AC922