First, install the required packages as shown below. On Redhat, some of these, such as the lapack-devel package, are not on the base installation DVD and so could not be used, but Ubuntu provides all of them, which makes it convenient to install everything at once.
minsky@minsky:~$ sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev mpich libmpich12 libmpich-dev libopenblas-dev libopenblas libopenblas-base libatlas3-base libatlas-base-dev libatlas-test libatlas-dev liblapack3 liblapack-dev liblapack-test liblapacke-dev liblapacke
As before, download the CUDA-enabled HPL source code from the NVIDIA site (it is nominally the x86_64 version, but we will use it anyway).
https://developer.nvidia.com/rdp/assets/cuda-accelerated-linpack-linux64
After agreeing to the license there, you can download hpl-2.0_FERMI_v15.solitairetheme8 as shown below. Despite the odd extension, it is an ordinary tar.gz file.
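If you want to double-check, file(1) will report it as gzip-compressed data:
minsky@minsky:~/files$ file hpl-2.0_FERMI_v15.solitairetheme8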
minsky@minsky:~/files$ tar -zxvf hpl-2.0_FERMI_v15.solitairetheme8
minsky@minsky:~/files$ cd hpl-2.0_FERMI_v15
First, the source file cuda_dgemm.c, which is written with Intel MKL in mind, needs a few small changes.
minsky@minsky:~/files/hpl-2.0_FERMI_v15$ vi ./src/cuda/cuda_dgemm.c
...
// handle2 = dlopen ("libmkl_intel_lp64.so", RTLD_LAZY);
handle2 = dlopen ("libopenblas.so", RTLD_LAZY);
...
// dgemm_mkl = (void(*)())dlsym(handle, "dgemm");
dgemm_mkl = (void(*)())dlsym(handle, "dgemm_");
...
// handle = dlopen ("libmkl_intel_lp64.so", RTLD_LAZY);
handle = dlopen ("libopenblas.so", RTLD_LAZY);
...
// mkl_dtrsm = (void(*)())dlsym(handle2, "dtrsm");
mkl_dtrsm = (void(*)())dlsym(handle2, "dtrsm_");
...
If you skip these modifications, run_linpack fails at runtime with errors like the following, because on the ppc64le architecture the open-source OpenBLAS is used in place of libmkl_intel_lp64:
libmkl_intel_lp64.so: cannot open shared object file: No such file or directory
libopenblas.so.0: undefined symbol: dtrsm
libopenblas.so.0: undefined symbol: dgemm
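The trailing underscore matters here: OpenBLAS exports the Fortran-style symbols dgemm_ and dtrsm_ rather than dgemm/dtrsm, which is why the dlsym() names above were changed. You can check which symbols your copy of the library actually exports with nm (the library path may differ on your system):
minsky@minsky:~$ nm -D /usr/lib/libopenblas.so.0 | grep -E ' (dgemm_|dtrsm_)$'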
Now edit Make.CUDA for the build. Very little changes just because this is ppc64le. Unlike Redhat, Ubuntu provides libatlas-dev, libmpich-dev and the rest, so we can simply link against libatlas.a, libmpich.a and so on. Note that OpenBLAS (libopenblas) takes the place of the -lmkl libraries.
minsky@minsky:~/files/hpl-2.0_FERMI_v15$ vi Make.CUDA
...
#TOPdir = /home/mfatica/hpl-2.0_FERMI_v15
TOPdir = /home/minsky/files/hpl-2.0_FERMI_v15
...
#MPdir = /opt/intel/mpi/3.0
#MPinc = -I$(MPdir)/include64
#MPlib = $(MPdir)/lib64/libmpi.a
#MPlib = $(MPdir)/lib64/libmpich.a
MPdir = /usr/lib/openmpi/lib
MPinc = -I /usr/lib/openmpi/include
MPlib = -L /usr/lib/openmpi/lib -lmpi /usr/lib/powerpc64le-linux-gnu/libmpich.a
...
#LAdir = $(TOPdir)/../../lib/em64t
#LAdir = /share/apps/intel/mkl/10.2.4.032/libem64t
#LAinc =
# CUDA
#LAlib = -L /home/cuda/Fortran_Cuda_Blas -ldgemm -L/usr/local/cuda/lib -lcublas -L$(LAdir) -lmkl -lguide -lpthread
#LAlib = -L $(TOPdir)/src/cuda -ldgemm -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -L$(LAdir) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
LAdir = /usr/lib
LAinc = -I /usr/include/atlas -I /usr/include/openblas -I /usr/include
#LAlib = /usr/lib/libatlas.a ${LAdir}/libopenblas.a /usr/lib/atlas-base/atlas/libblas.a /usr/lib/atlas-base/atlas/liblapack.a
LAlib = -L $(TOPdir)/src/cuda -ldgemm -L /usr/local/cuda/targets/ppc64le-linux/lib/stubs -lcuda -lcublas -L /usr/local/cuda/lib64 -lcudart -L$(LAdir) -lpthread -lm /usr/lib/libatlas.a /usr/lib/atlas-base/atlas/liblapack.a /usr/lib/atlas-base/atlas/libblas.a ${LAdir}/libopenblas.a /usr/lib/gcc/powerpc64le-linux-gnu/5/libgfortran.a
...
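The library paths above are the ones on this particular Ubuntu ppc64le system. If yours differ, the static libraries can be located from the packages installed earlier, for example:
minsky@minsky:~/files/hpl-2.0_FERMI_v15$ dpkg -L libopenblas-dev libatlas-base-dev liblapack-dev | grep '\.a$'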
Now set LD_LIBRARY_PATH as shown below and run make arch=CUDA; the compile goes through without a hitch.
minsky@minsky:~/files/hpl-2.0_FERMI_v15$ export LD_LIBRARY_PATH=/usr/lib/openmpi/lib:/usr/lib/mpich/lib:$LD_LIBRARY_PATH
minsky@minsky:~/files/hpl-2.0_FERMI_v15$ make arch=CUDA
...
mpicc -o HPL_pdtest.o -c -DAdd__ -DF77_INTEGER=int -DStringSunStyle -DCUDA -I/home/minsky/files/hpl-2.0_FERMI_v15/include -I/home/minsky/files/hpl-2.0_FERMI_v15/include/CUDA -I /usr/include/atlas -I /usr/include/openblas -I /usr/include -I /usr/include/openmpi-ppc64le -I/usr/local/cuda/include -fomit-frame-pointer -O3 -funroll-loops -W -Wall -fopenmp ../HPL_pdtest.c
mpicc -DAdd__ -DF77_INTEGER=int -DStringSunStyle -DCUDA -I/home/minsky/files/hpl-2.0_FERMI_v15/include -I/home/minsky/files/hpl-2.0_FERMI_v15/include/CUDA -I /usr/include/atlas -I /usr/include/openblas -I /usr/include -I /usr/include/openmpi-ppc64le -I/usr/local/cuda/include -fomit-frame-pointer -O3 -funroll-loops -W -Wall -fopenmp -o /home/minsky/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl HPL_pddriver.o HPL_pdinfo.o HPL_pdtest.o /home/minsky/files/hpl-2.0_FERMI_v15/lib/CUDA/libhpl.a -L /home/minsky/files/hpl-2.0_FERMI_v15/src/cuda -ldgemm -L /usr/local/cuda/targets/ppc64le-linux/lib/stubs -lcuda -lcublas -L /usr/local/cuda/lib64 -lcudart -L/usr/lib -lpthread -lm /usr/lib/libatlas.a /usr/lib/atlas-base/atlas/liblapack.a /usr/lib/atlas-base/atlas/libblas.a /usr/lib/libopenblas.a /usr/lib/gcc/powerpc64le-linux-gnu/5/libgfortran.a -L /usr/lib/openmpi/lib -lmpi /usr/lib/powerpc64le-linux-gnu/libmpich.a
make TOPdir=/home/minsky/files/hpl-2.0_FERMI_v15 /home/minsky/files/hpl-2.0_FERMI_v15/bin/CUDA/HPL.dat
make[3]: Entering directory '/home/minsky/files/hpl-2.0_FERMI_v15/testing/ptest/CUDA'
make[3]: '/home/minsky/files/hpl-2.0_FERMI_v15/bin/CUDA/HPL.dat' is up to date.
make[3]: Leaving directory '/home/minsky/files/hpl-2.0_FERMI_v15/testing/ptest/CUDA'
touch dexe.grd
make[2]: Leaving directory '/home/minsky/files/hpl-2.0_FERMI_v15/testing/ptest/CUDA'
make[1]: Leaving directory '/home/minsky/files/hpl-2.0_FERMI_v15'
The executable is created under bin/CUDA with the name xhpl.
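Optionally, you can check with ldd that the new binary is linked against cuBLAS, the CUDA runtime and Open MPI (OpenBLAS will not show up here, since it is loaded at runtime via dlopen):
minsky@minsky:~/files/hpl-2.0_FERMI_v15$ ldd bin/CUDA/xhpl | grep -E 'cublas|cudart|mpi'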
minsky@minsky:~/files/hpl-2.0_FERMI_v15$ cd bin/CUDA
minsky@minsky:~/files/hpl-2.0_FERMI_v15/bin/CUDA$ vi run_linpack
...
#HPL_DIR=/home/mfatica/hpl-2.0_FERMI_v15
HPL_DIR=/home/minsky/files/hpl-2.0_FERMI_v15
...
#CPU_CORES_PER_GPU=4
CPU_CORES_PER_GPU=8
...
#export CUDA_DGEMM_SPLIT=0.80
export CUDA_DGEMM_SPLIT=0.99
...
#export CUDA_DTRSM_SPLIT=0.70
export CUDA_DTRSM_SPLIT=0.99
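For reference, the way I read these knobs (the comments inside run_linpack itself are the authority): CPU_CORES_PER_GPU sets how many host threads assist each GPU-bound MPI rank, and the two SPLIT variables set the fraction of each DGEMM / DTRSM call that is offloaded to the GPU, the remainder running on the host BLAS. Annotated:
# host (OpenMP) threads assisting each MPI rank / GPU
CPU_CORES_PER_GPU=8
# fraction of each DGEMM offloaded to the GPU; the rest runs on the host BLAS
export CUDA_DGEMM_SPLIT=0.99
# fraction of each DTRSM offloaded to the GPU
export CUDA_DTRSM_SPLIT=0.99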
Next, the HPL.dat file, which serves as the input file, has to be edited. The meaning of its main input parameters is explained at the URL below:
http://www.netlib.org/benchmark/hpl/tuning.html
The key decisions here are the problem size (Ns), the process grid (P x Q) to run it on, and the block size (NBs).
The problem size (Ns) can be estimated roughly as follows. This machine has four P100 GPUs with 16 GB of memory each, so:
sqrt(GPU memory size * number of GPUs * target memory fraction / 8 bytes per double-precision value)
sqrt(16 * 1024^3 * 4 * 0.8 / 8) ≈ 82,897
(The HPL.dat below actually uses a much larger Ns of 120000. As the memory figures later in this post suggest, this CUDA-accelerated HPL keeps the matrix in host memory and only stages pieces of it on the GPUs, so in practice Ns is bounded by system RAM rather than by GPU memory.)
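A quick way to evaluate that expression on the command line, using bc:
minsky@minsky:~$ echo "sqrt(16 * 1024^3 * 4 * 0.8 / 8)" | bc -l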
The process grid (P x Q) should match the number of GPUs installed in the Minsky, so four processes: 2 x 2 = 4, 1 x 4 = 4, or try both. In practice the 1 x 4 layout gives somewhat better performance, but with a large Ns it occasionally fails the verification step at the end. Here I will simply use the flat 1 x 4 grid.
As for the block size (NBs) fed to that process grid: for a CPU run you would normally pick something in the 32 to 256 range, but for the CUDA version the guidance is to go much larger, on the order of 1000; in my experience 1024 works better than 2048.
minsky@minsky:~/files/hpl-2.0_FERMI_v15/bin/CUDA$ vi HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
120000 Ns
1 # of NBs
1024 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
4 Qs
16.0 threshold
1 # of panel fact
0 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
3 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
32 memory alignment in double (> 0)
Now we can run it. Since P x Q = 4, we launch four processes with mpirun.
minsky@minsky:~/files/hpl-2.0_FERMI_v15/bin/CUDA$ nohup time mpirun -np 4 ./run_linpack > linpack_out1.txt &
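While it runs, GPU utilization and memory usage can be watched from another terminal, for example by looping nvidia-smi every couple of seconds:
minsky@minsky:~$ nvidia-smi -l 2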
One thing is different from the earlier run on the AC922 under Redhat. Here, too, each MPI rank mostly keeps just one CPU core at 100%, but during some phases each xhpl process goes multi-threaded and spreads across several CPU cores.
lnmonq14gqqqqqq[H for help]qqqHostname=minskyqqqqqqqRefresh= 2secs qqq00:07.57qq
x CPU Utilisation qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
x---------------------------+-------------------------------------------------+
xCPU User% Sys% Wait% Idle|0 |25 |50 |75 100|
x 1 50.2 9.9 0.0 39.9|UUUUUUUUUUUUUUUUUUUUUUUUUssss >
x 2 51.7 10.3 0.0 37.9|UUUUUUUUUUUUUUUUUUUUUUUUUsssss >
x 3 46.8 2.0 0.0 51.2|UUUUUUUUUUUUUUUUUUUUUUU >
x 4 60.3 13.2 0.0 26.5|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUssssss >
x 5 42.0 11.9 0.0 46.1|UUUUUUUUUUUUUUUUUUUUsssss >
x 6 51.3 9.1 0.0 39.6|UUUUUUUUUUUUUUUUUUUUUUUUUssss > |
x 7 83.3 3.9 0.0 12.7|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUs >
x 8 39.9 5.9 0.0 54.2|UUUUUUUUUUUUUUUUUUUss >
x 9 47.3 11.3 0.0 41.4|UUUUUUUUUUUUUUUUUUUUUUUsssss >
x 10 48.5 10.4 0.0 41.1|UUUUUUUUUUUUUUUUUUUUUUUUsssss >
x 11 43.8 9.9 0.0 46.3|UUUUUUUUUUUUUUUUUUUUUssss >
x 12 53.9 8.3 0.5 37.3|UUUUUUUUUUUUUUUUUUUUUUUUUUssss >
x 13 95.1 2.9 0.0 2.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUs >
x 14 35.0 2.5 0.0 62.6|UUUUUUUUUUUUUUUUUs >
x 15 51.5 6.9 0.0 41.6|UUUUUUUUUUUUUUUUUUUUUUUUUsss >
x 16 50.2 9.4 0.0 40.4|UUUUUUUUUUUUUUUUUUUUUUUUUssss >
x 17 49.3 8.9 0.0 41.9|UUUUUUUUUUUUUUUUUUUUUUUUssss >
x 18 49.8 11.8 0.0 38.4|UUUUUUUUUUUUUUUUUUUUUUUUsssss >
x 19 46.3 7.9 0.0 45.8|UUUUUUUUUUUUUUUUUUUUUUUsss >
x 20 68.6 9.8 0.0 21.6|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUssss >
x 21 37.3 8.3 0.0 54.4|UUUUUUUUUUUUUUUUUUssss >
x 22 48.5 7.9 0.0 43.6|UUUUUUUUUUUUUUUUUUUUUUUUsss >
x 23 50.7 12.8 0.0 36.5|UUUUUUUUUUUUUUUUUUUUUUUUUssssss >
x 24 44.1 8.3 0.0 47.5|UUUUUUUUUUUUUUUUUUUUUUssss >
x 25 49.8 8.4 0.0 41.9|UUUUUUUUUUUUUUUUUUUUUUUUssss >
x 26 43.8 7.9 0.0 48.3|UUUUUUUUUUUUUUUUUUUUUsss >
x 27 49.0 4.9 0.0 46.1|UUUUUUUUUUUUUUUUUUUUUUUUss >
x 28 97.5 1.5 0.0 1.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU >
x 29 75.4 1.5 0.0 23.2|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU >
x 30 52.0 9.3 0.0 38.7|UUUUUUUUUUUUUUUUUUUUUUUUUssss >
x 31 41.9 6.4 0.0 51.7|UUUUUUUUUUUUUUUUUUUUsss >
x 32 47.2 10.1 0.0 42.7|UUUUUUUUUUUUUUUUUUUUUUUsssss >
x---------------------------+-------------------------------------------------+
xAvg 53.2 8.0 0.0 38.8|UUUUUUUUUUUUUUUUUUUUUUUUUUsss >
x---------------------------+-------------------------------------------------+
x Top Processes Procs=1090 mode=3 (1=Basic, 3=Perf 4=Size 5=I/O)qqqqqqqqqqqqqqqq
x PID %CPU Size Res Res Res Res Shared Faults Command
x Used KB Set Text Data Lib KB Min Maj
x 16175 517.6 52070656 11392128 512 51713408 0 142080 15 0 xhpl x
x 16178 493.8 51681984 10993664 512 51324736 0 142144 44 0 xhpl x
x 16179 485.6 51503232 10828096 512 51145984 0 142144 97 0 xhpl x
x 16177 450.7 52079424 11429888 512 51722176 0 142080 46 0 xhpl x
lnmonq14gqqqqqqqqqqqqqqqqqqqqqHostname=minskyqqqqqqqRefresh= 2secs qqq03:00.56qk
x CPU +------------------------------------------------------------------------x
x100%-| | x
x 95%-| | x
x 90%-| | x
x 85%-| | x
x 80%-| | x
x 75%-| | x
x 70%-| | x
x 65%-|ssss ss s | x
x 60%-|UUssssss ss ss | x
x 55%-|UUUUUUUssUssss | x
x 50%-|UUUUUUUUUUssUU | x
x 45%-|UUUUUUUUUUUUUU | x
x 40%-|UUUUUUUUUUUUUU | x
x 35%-|UUUUUUUUUUUUUU + x
x 30%-|UUUUUUUUUUUUUU | x
x 25%-|UUUUUUUUUUUUUU | x
x 20%-|UUUUUUUUUUUUUU | x
x 15%-|UUUUUUUUUUUUUUUUUUUUUUUUUUUU| x
x 10%-|UUUUUUUUUUUUUUUUUUUUUUUUUUUU| x
x 5%-|UUUUUUUUUUUUUUUUUUUUUUUUUUUU| x
x +--------------------User---------System---------Wait--------------------x
x Top Processes Procs=1098 mode=3 (1=Basic, 3=Perf 4=Size 5=I/O)qqqqqqqqqqqqqqqx
x PID %CPU Size Res Res Res Res Shared Faults Command x
x Used KB Set Text Data Lib KB Min Maj x
x 16175 100.4 51502720 10825152 512 51145472 0 142144 17 0 xhpl x
x 16177 100.4 51503296 10853824 512 51146048 0 142144 17 0 xhpl x
x 16179 100.4 50927104 10252096 512 50569856 0 142272 17 0 xhpl x
x 16178 99.9 51105856 10417600 512 50748608 0 142208 17 0 xhpl x
x 12926 2.5 11520 8448 192 8128 0 2432 0 0 nmon x
x 14666 2.5 5696 4608 576 640 0 3584 0 0 nvidia-smi x
x 3056 0.5 2413632 53440 35776 2366976 0 31296 0 0 dockerd x
x 1 0.0 10944 9792 1728 2304 0 5376 0 0 systemd x
x 2 0.0 0 0 0 0 0 0 0 0 kthreadd
And with that CPU-core bottleneck relieved, GPU utilization is also much higher than it was on Redhat, even if the GPUs are not pegged at 100% the whole time. Whatever value of Ns is used, the reported GPU memory usage is always 2365 MiB; what a larger Ns does instead, as you can see above, is inflate the host memory footprint of the xhpl processes.
##################################
Next, let's run the plain CPU-only HPL. It can be downloaded from the site below.
minsky@minsky:~/files$ wget http://www.netlib.org/benchmark/hpl/hpl-2.2.tar.gz
minsky@minsky:~/files$ tar -zxf hpl-2.2.tar.gz
minsky@minsky:~/files$ cd hpl-2.2
Sample Make.{arch} files are provided under the setup directory; for ppc64le, copy the closest-looking one, Make.Linux_PII_CBLAS, and edit it as shown below.
minsky@minsky:~/files/hpl-2.2$ cp setup/Make.Linux_PII_CBLAS Make.Linux
minsky@minsky:~/files/hpl-2.2$ vi Make.Linux
...
#ARCH = Linux_PII_CBLAS
ARCH = Linux
...
#TOPdir = $(HOME)/hpl
TOPdir = $(HOME)/files/hpl-2.2
...
#MPdir = /usr/local/mpi
#MPinc = -I$(MPdir)/include
#MPlib = $(MPdir)/lib/libmpich.a
MPdir = /usr/lib/openmpi/lib
MPinc = -I /usr/lib/openmpi/include
MPlib = -L /usr/lib/openmpi/lib -lmpi /usr/lib/powerpc64le-linux-gnu/libmpich.a
...
#LAdir = $(HOME)/netlib/ARCHIVES/Linux_PII
#LAinc =
#LAlib = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
LAdir = /usr/lib
LAinc = -I /usr/include/atlas -I /usr/include/openblas -I /usr/include
LAlib = -L$(LAdir) -lpthread -lm /usr/lib/libatlas.a /usr/lib/atlas-base/atlas/liblapack.a /usr/lib/atlas-base/atlas/libblas.a ${LAdir}/libopenblas.a /usr/lib/gcc/powerpc64le-linux-gnu/5/libgfortran.a
...
#CC = /usr/bin/gcc
CC = /usr/bin/mpicc
...
#LINKER = /usr/bin/g77
LINKER = /usr/bin/mpicc
...
Now run make as follows, and the resulting xhpl binary is again created under bin/Linux.
minsky@minsky:~/files/hpl-2.2$ make arch=Linux
minsky@minsky:~/files/hpl-2.2$ cd bin/Linux
minsky@minsky:~/files/hpl-2.2/bin/Linux$ vi run_linpack
export HPL_DIR=/home/minsky/files/hpl-2.2
# FOR OMP
export OMP_NUM_THREADS=8
export LD_LIBRARY_PATH=$HPL_DIR/lib/Linux:$LD_LIBRARY_PATH
$HPL_DIR/bin/Linux/xhpl
minsky@minsky:~/files/hpl-2.2/bin/Linux$ chmod a+x run_linpack
HPL.dat can be written the same way as before. Since this machine has 512 GB of RAM, though, a much larger Ns can be used. For the process count you can try either the 16 physical cores or, with SMT-8, 16 * 8 = 128.
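Applying the same rule of thumb to the 512 GB of system RAM gives sqrt(512 * 1024^3 * 0.8 / 8) ≈ 234,500, which lines up with the Ns chosen below. As a bc one-liner:
minsky@minsky:~$ echo "sqrt(512 * 1024^3 * 0.8 / 8)" | bc -l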
minsky@minsky:~/files/hpl-2.2/bin/Linux$ cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
234000 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
16 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
2 4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
Run it as follows and all 16 cores are used. SMT is turned off in this setup.
minsky@minsky:~/files/hpl-2.2/bin/Linux$ export HPL_DIR=/home/minsky/files/hpl-2.2
minsky@minsky:~/files/hpl-2.2/bin/Linux$ export OMP_NUM_THREADS=8
minsky@minsky:~/files/hpl-2.2/bin/Linux$ export LD_LIBRARY_PATH=$HPL_DIR/lib/Linux:$LD_LIBRARY_PATH
minsky@minsky:~/files/hpl-2.2/bin/Linux$ nohup mpirun -np 16 -x HPL_DIR -x OMP_NUM_THREADS -x LD_LIBRARY_PATH ./xhpl > linpack3.txt &
lnmonq14gqqqqqq[H for help]qqqHostname=minskyqqqqqqqRefresh= 2secs qqq00:26.53qk
x CPU +------------------------------------------------------------------------x
x100%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 95%-|UUUUUUUUUUUUUUUUUUUUUUUU+UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 90%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 85%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 80%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 75%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 70%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 65%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 60%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 55%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 50%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 45%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 40%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 35%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 30%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 25%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 20%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 15%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 10%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x 5%-|UUUUUUUUUUUUUUUUUUUUUUUU|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUx
x +--------------------User---------System---------Wait--------------------x
x Memory Stats qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqx
x RAM High Low Swap Page Size=64 KB x
x Total MB 523135.2 -0.0 -0.0 44075.8 x
x Free MB 60951.1 -0.0 -0.0 44075.8 x
x Free Percent 11.7% 100.0% 100.0% 100.0% x
x MB MB MB x
x Cached= 15197.8 Active=453282.8 x
x Buffers= 367.3 Swapcached= 0.0 Inactive = 5598.4 x
x Dirty = 0.1 Writeback = 0.0 Mapped = 142.3 x
x Slab = 1392.9 Commit_AS =445432.4 PageTables= 114.8 x
x Top Processes Procs=1091 mode=3 (1=Basic, 3=Perf 4=Size 5=I/O)qqqqqqqqqqqqqqqx
x PID %CPU Size Res Res Res Res Shared Faults Command x
x Used KB Set Text Data Lib KB Min Maj x
x 74578 100.1 28825024 28579392 1664 28576704 0 14016 25 0 xhpl x
x 74579 100.1 28825792 28579520 1664 28577472 0 14208 32 0 xhpl x
x 74580 100.1 28825792 28579392 1664 28577472 0 14016 32 0 xhpl x
x 74581 100.1 28351360 28107904 1664 28103040 0 14144 31 0 xhpl x
x 74582 100.1 28584768 28339520 1664 28336448 0 14208 26 0 xhpl x
x 74583 100.1 28585792 28339328 1664 28337472 0 14016 41 0 xhpl x
x 74584 100.1 28584768 28339520 1664 28336448 0 14208 34 0 xhpl x
x 74589 100.1 28584768 28339328 1664 28336448 0 14016 32 0 xhpl x
You are probably wondering how the CUDA HPL compares with the CPU-only HPL in performance. I would rather not publish concrete numbers, but even this HPL-CUDA code, which is not optimized for the latest GPUs, is far, far faster than driving the whole server's CPUs and memory to 100%.