HW 엔지니어를 위한 Deep Learning: Minsky

레이블이 Minsky인 게시물을 표시합니다. 모든 게시물 표시

2018년 5월 10일 목요일

Minsky Ubuntu16.04에서 tensorflow 1.5.1 build하기

P100을 장착한 Ubuntu 16.04 환경의 Minsky 서버에는 원래 CUDA 8.0과 PowerAI에 딸린 tensorflow 1.0(python2)을 주로 썼습니다. 물론 Minsky 서버에도 Ubuntu 16.04를 그대로 유지한 채 CUDA 9.1을 설치한 뒤 tensorflow 1.5.1을 설치하여 쓸 수 있습니다. 여기서는 python3 환경입니다.

u0017649@sys-93315:~$ sudo apt-get install libaprutil1-dev ant cmake automake libtool-bin openssl libcurl4-openssl-dev

전에는 bazel-0.8.1을 썼습니다만, 요즘의 openjdk 1.8.0_151 환경에서는 이 bazel 버전은 다음과 같은 error를 냅니다.

ERROR: /home/minsky/files/bazel-0.8.1/src/main/java/com/google/devtools/common/options/BUILD:27:1: Building src/main/java/com/google/devtools/common/options/liboptions_internal.jar (35 source files) failed (Exit 1): java failed: error executing command
(cd /tmp/bazel_vQeTUQIe/out/execroot/io_bazel && \
exec env - \
LC_CTYPE=en_US.UTF-8 \
external/local_jdk/bin/java -XX:+TieredCompilation '-XX:TieredStopAtLevel=1' -Xbootclasspath/p:third_party/java/jdk/langtools/javac-9-dev-r4023-3.jar -jar bazel-out/host/bin/src/java_tools/buildjar/java/com/google/devtools/build/buildjar/bootstrap_deploy.jar @bazel-out/ppc-opt/bin/src/main/java/com/google/devtools/common/options/liboptions_internal.jar-2.params)
java.lang.InternalError: Cannot find requested resource bundle for locale en_US

이 error는 bazel-0.10.0 버전을 쓰면 없어집니다.

u0017649@sys-93315:~$ wget https://github.com/bazelbuild/bazel/releases/download/0.10.0/bazel-0.10.0-dist.zip

u0017649@sys-93315:~$ which python
/home/u0017649/anaconda3/bin/python

u0017649@sys-93315:~$ conda install protobuf

u0017649@sys-93315:~$ which protoc
/home/u0017649/anaconda3/bin/protoc

u0017649@sys-93315:~$ export PROTOC=/home/u0017649/anaconda3/bin/protoc

u0017649@sys-93315:~$ mkdir bazel-0.10.0 && cd bazel-0.10.0

u0017649@sys-93315:~/bazel-0.10.0$ unzip ../bazel-0.10.0-dist.zip

u0017649@sys-93315:~/bazel-0.10.0$ ./compile.sh

u0017649@sys-93315:~/bazel-0.10.0$ sudo cp output/bazel /usr/local/bin

u0017649@sys-93315:~$ git clone https://github.com/tensorflow/tensorflow

u0017649@sys-93315:~$ cd tensorflow

u0017649@sys-93315:~/tensorflow$ git checkout tags/v1.5.1

u0017649@sys-93315:~/tensorflow$ vi configure.py
...
# default_cc_opt_flags = '-mcpu=native'
default_cc_opt_flags = '-mcpu=power8'
else:
# default_cc_opt_flags = '-march=native'
default_cc_opt_flags = '-mcpu=power8'
...
# write_to_bazelrc('build:opt --host_copt=-march=native')
write_to_bazelrc('build:opt --host_copt=-mcpu=power8')
...

u0017649@sys-93315:~/tensorflow$ ./configure
...
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
...
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
...
Do you wish to build TensorFlow with CUDA support? [y/N]: y

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.1
...
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 7
...
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:/usr/local/cuda/lib64
...
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2]6.0,7.0 #6.0은 P100, 7.0은 V100 입니다.
...

u0017649@sys-93315:~/tensorflow$ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

u0017649@sys-93315:~/tensorflow$ bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg

u0017649@sys-93315:~/tensorflow$ pip install ~/tensorflow_pkg/tensorflow-1.5.1-cp36-cp36m-linux_ppc64le.whl

확인은 다음과 같이 합니다.

u0017649@sys-93315:~$ python
Python 3.6.4 |Anaconda, Inc.| (default, Feb 11 2018, 08:19:13)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> sess=tf.Session()

아래에 그렇게 만들어진 python3용 tensorflow-1.5.1-cp36-cp36m-linux_ppc64le.whl를 올려놓았습니다. 물론 이건 Ubuntu 환경에서든 Redhat 환경에서든 다 쓰실 수 있습니다.

https://drive.google.com/open?id=1CHIM-dgr0KcMJlHcc_I0fJUqBNs2czvL

그리고 아래는 python2용 tensorflow-1.5.1-cp27-cp27mu-linux_ppc64le.whl 입니다.

https://drive.google.com/open?id=1cTZAsLwyozPNufoZfeKIIaIJ7snYGC0M

2018년 2월 6일 화요일

Minsky 등 Linux on POWER 서버에서의 nmon을 이용한 자원 모니터링

nmon은 실시간 모니터링은 물론, CSV 포맷의 log를 남겨 장단시간에 걸친 사용률 추이를 볼 수 있는 유용한 freeware입니다. 원래 AIX용으로 개발되었으나, 지금은 Linux on POWER는 물론 Linux on x86도 지원됩니다.

nmon binary는 아래 site에서 download 받으실 수 있습니다.

http://nmon.sourceforge.net/pmwiki.php?n=Site.Download

위 site에서, IBM Minsky 또는 AC922인 경우는 nmon16g_power.tar.gz를 download 받으시면 됩니다. 이 속에는 Ubuntu와 SuSE, Redhat을 위해 compile된 binary들이 각각 들어있습니다.

nmon16g_power_ubuntu1604
nmon16g_power_sles122
nmon16f_power_rhel73LE

윗 파일을 PC에서 download 받아 Minsky 또는 AC922 서버에 upload하거나, 만약 서버가 internet에 연결되어 있다면 다음과 같이 직접 download 받으시면 됩니다.

u0017649@sys-91436:~$ wget http://sourceforge.net/projects/nmon/files/nmon16g_power.tar.gz

이 중에서 원하는 파일을 풀어냅니다. 가령 Minsky인 경우 Ubuntu일테니 다음과 같이 Ubuntu용 binary만 골라 풀어냅니다.

u0017649@sys-91436:~$ tar -zxvf nmon16g_power.tar.gz nmon16g_power_ubuntu1604
nmon16g_power_ubuntu1604

이제 이 파일에 실행 permission을 준 뒤, /usr/bin/nmon으로 copy를 하시면 됩니다. 꼭 /usr/bin 밑이 아니어도 상관없습니다.

u0017649@sys-91436:~$ chmod a+x nmon16g_power_ubuntu1604
u0017649@sys-91436:~$ sudo cp nmon16g_power_ubuntu1604 /usr/bin/nmon

실시간 모니터링을 위한 실행은 그냥 nmon이라고 치시면 됩니다. 이어서 키워드 문자를 입력하여 보고 싶은 패널을 보면 되는데, 기본적으로는 l (long term usage)와 t (top process), m (memory)과 d (disk) 등을 많이 사용합니다. c (CPU usage)를 치면 SMT의 많은 logical HW thread가 다 display되므로 c는 잘 안 씁니다.

일정한 시간에 걸친 시스템 자원 사용률 모니터링을 위해 nmon을 사용하려면 다음과 같이 합니다.

u0017649@sys-91436:~$ nmon -f -s 120 -c 720 -m /tmp/nmon

위에서 -f는 csv의 spreadsheet 형태로 log를 받으라는 뜻이고, -s는 초 단위 간격, 그리고 -c는 그 간격으로 몇 번 받으라는 count입니다. 즉, 2분 단위로 720번, 즉 24시간을 받게 됩니다. -m은 log를 떨어뜨릴 directory를 지정하는 옵션입니다.

여기서는 아래와 같이 떨군 경우의 log를 보겠습니다.
u0017649@sys-91436:~$ nmon -f -s 2 -c 720 -m /tmp/nmon

u0017649@sys-91436:~$ ls -l /tmp/nmon
-rw-rw-r-- 1 u0017649 u0017649 87917 Feb 5 20:45 sys-91436_180205_2042.nmon

이것을 PC로 download 받으신 뒤, 아래 site에서 nmon analyzer를 PC에 download 받으십시요.

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power+Systems/page/nmon_analyser

이 nmon analyzer는 Excel macro입니다. 이걸 열어보면 아래와 같이 버튼이 보이는데, 이걸 클릭한 뒤 아까 download 받은 *.nmon 파일을 선택하면 됩니다. 그러면 자동으로 로그를 분석하여 각종 그래프가 그려진 *.xls를 만들어 줍니다.

CPU나 disk는 직관적으로 보실 수 있습니다만, memory는 거의 언제나 free memory가 부족한 상태로 보일 겁니다. 이는 다른 OS들과 마찬가지로 Linux도 남는 free memory를 항상 file cache로 사용하기 때문입니다. 그러니 저 뒤쪽의 'VM' tab에 있는 'Swap space activity'가 0으로 유지된다면 memory 문제는 없다고 생각하셔도 됩니다.

단, 아직까지는 nmon에서 GPU에 대한 모니터링을 정식으로 지원하지는 않습니다. 일부 버전에서는 nvidia-smi에서 가져오는 정보를 이용하여 GPU usage를 display 해주기도 하는데 (a 옵션), 위와 같이 log를 받는 옵션은 없고, 또 nvidia-smi에서 보여주는 정보와 좀 다른 수치를 보여주는 등 아직 문제가 좀 있습니다. 현재 최신 nmon 버전에서는 이 GPU 모니터링 옵션은 빠져 있습니다. GPU 자원에 대한 모니터링은 아직까지는 nvidia-smi를 이용하셔야 합니다.

2017년 12월 28일 목요일

Ubuntu 16.04 ppc64le에 crashdump를 설정하는 방법

crashdump를 설정하는 자세한 내용은 https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html 에 나와 있으니 참조하시기 바랍니다.

먼저 다음과 같이 /proc/cmdline 및 dmesg 결과에서 crash라는 단어를 검색해봅니다. 없으면 crashdump가 enable되지 않은 상태입니다.

u0017649@sys-90592:~$ cat /proc/cmdline | grep crash

u0017649@sys-90592:~$ dmesg | grep -i crash

crashdump를 enable하기 위해서는 linux-crashdump를 설치합니다. 설치 과정 중에 다음과 같이 kexec-tools와 kdump-tools를 enable하겠냐는 것을 묻는데, 둘다 yes로 하셔야 합니다.

u0017649@sys-90592:~$ sudo apt install linux-crashdump

┌────────────────────────┤ Configuring kexec-tools ├────────────────────────┐
│ │
│ If you choose this option, a system reboot will trigger a restart into a │
│ kernel loaded by kexec instead of going through the full system boot │
│ loader process. │
│ │
│ Should kexec-tools handle reboots? │
│ │
│ <Yes> <No> │
│ │
└───────────────────────────────────────────────────────────────────────────┘

┌────────────────────────┤ Configuring kdump-tools ├────────────────────────┐
│ │
│ If you choose this option, the kdump-tools mechanism will be enabled. A │
│ reboot is still required in order to enable the crashkernel kernel │
│ parameter. │
│ │
│ Should kdump-tools be enabled by default? │
│ │
│ <Yes> <No> │
│ │

아마 고객사의 Minsky 환경에서는 그렇지 않겠습니다만, 제가 쓰는 PDP (Power Development Platform) cloud 환경처럼 혹시 아래 파일에 USE_GRUB_CONFIG=false로 되어 있다면 이걸 true로 고쳐 주셔야 crashdump 설정이 유효하게 됩니다.

u0017649@sys-90592:~$ sudo vi /etc/default/kexec
...
#USE_GRUB_CONFIG=false
USE_GRUB_CONFIG=true

이제 시스템을 reboot 합니다.

u0017649@sys-90592:~$ sudo shutdown -r now

시스템이 부팅되면, /var/crash 밑에 아무 것도 없는 것을 확인합니다.

u0017649@sys-90592:~$ ls /var/crash

아울러 아까는 없었던 crashkernel 관련 항목이 /proc/cmdline과 dmesg에 나오는 것을 확인합니다.

u0017649@sys-90592:~$ cat /proc/cmdline | grep crash
root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet crashkernel=384M-2G:128M,2G-:256M

u0017649@sys-90592:~$ dmesg | grep -i crash
[ 0.000000] Reserving 256MB of memory at 128MB for crashkernel (System RAM: 4096MB)
[ 0.000000] Kernel command line: root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet crashkernel=384M-2G:128M,2G-:256M

이제 kdump-config show 명령으로 crashdump 준비 상태를 확인합니다.

u0017649@sys-90592:~$ sudo kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.4.0-104-generic
kdump initrd:
/var/lib/kdump/initrd.img: broken symbolic link to /var/lib/kdump/initrd.img-4.4.0-104-generic
current state: Not ready to kdump

여기서는 Not ready인데, 이유는 제가 쓰는 PDP (Power Development Platform) cloud 환경에서는 저 /var/lib/kdump/initrd.img-4.4.0-104-generic 대신 /var/lib/kdump/initrd.img-4.4.0-98-generic이 들어있기 때문입니다. 아마도 고객사의 Minsky 서버에서는 이런 문제는 없을 것입니다. PDP에서만 있는 이 문제는 그냥 간단히 /var/lib/kdump/initrd.img-4.4.0-98-generic를 /var/lib/kdump/initrd.img-4.4.0-104-generic 라는 이름으로 copy해서 해결하겠습니다.

u0017649@sys-90592:~$ sudo cp /var/lib/kdump/initrd.img-4.4.0-98-generic /var/lib/kdump/initrd.img-4.4.0-104-generic

이제 kdump-config load를 수행한 뒤 다시 kdump-config show를 해보면 ready to kdump로 상태가 바뀐 것을 보실 수 있습니다.

u0017649@sys-90592:~$ sudo kdump-config load
Modified cmdline:root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service elfcorehdr=156864K
* loaded kdump kernel

u0017649@sys-90592:~$ sudo kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.4.0-104-generic
kdump initrd:
/var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.4.0-104-generic
current state: ready to kdump

kexec command:
/sbin/kexec -p --command-line="root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

crashdump를 테스트하는 방법은 다음과 같습니다. 단, 이것이 수행될 때는 일단 시스템이 죽고, 또 시스템 reboot이 제대로 되지 않을 수 있으므로 반드시 IPMI나 RGB 모니터 등으로 console을 확보한 뒤에 하시기 바랍니다.

sudo가 아닌, 반드시 "su -"를 통해 root 계정으로 login 한 뒤, 다음과 같이 명령을 날리면 시스템이 죽으면서 /var/crash 디렉토리에 dump를 쏟습니다.

root@sys-90592:~# echo c > /proc/sysrq-trigger

리부팅이 끝나면 다음과 같은 형태로 crash file이 만들어진 것을 보실 수 있습니다.

# ls /var/crash
linux-image-3.0.0-12-server.0.crash

이 파일을 분석하는 것은 Ubuntu 공식 문서에 따르면 https://www.dedoimedo.com/computers/crash-analyze.html 에 나온 것을 참조하여 하면 된다고는 하는데, 저도 경험이 없고, 또 쉬워 보이지도 않네요.

이 외에, 기술지원 계약자에게 시스템 정보를 모아서 보내려면 sosreport를 사용하시는 것이 좋습니다. 원래 redhat에 있던 이 기능은 ubuntu에서도 사용이 가능합니다. 먼저 아래와 같이 sosreport를 설치하시고...

u0017649@sys-90528:~$ sudo apt-get install sosreport

(반드시 root 권한으로) sosreport를 수행하시기만 하면 됩니다. 결과물은 /tmp 밑에 tar.xz 포맷으로 생성됩니다.

u0017649@sys-90528:~$ sudo sosreport
...
Please enter your first initial and last name [sys-90528]:
Please enter the case id that you are generating this report for []:
Setting up archive ...
Setting up plugins ...
Running plugins. Please wait ...

Running 23/61: kernel...
Creating compressed archive...

Your sosreport has been generated and saved in:
/tmp/sosreport-sys-90528-20171227231646.tar.xz

The checksum is: 939ce2d2b6a254fbb576a2f6728b5678

Please send this file to your support representative.

u0017649@sys-90528:~$ ls -l /tmp/sosreport-sys-90528-20171227231646.tar.xz
-rw------- 1 root root 5231340 Dec 27 23:17 /tmp/sosreport-sys-90528-20171227231646.tar.xz

저 파일 속의 내용은 뭐 대단한 것은 아니고 그냥 시스템의 주요 config 파일들을 모아 놓는 것입니다.

u0017649@sys-90528:~$ sudo tar -tf /tmp/sosreport-sys-90528-20171227231646.tar.xz | more

...

sosreport-sys-90528-20171227231646/lib/udev/rules.d/60-cdrom_id.rules
sosreport-sys-90528-20171227231646/lib/udev/rules.d/80-docker-ce.rules
sosreport-sys-90528-20171227231646/lib/nvidia-361/
sosreport-sys-90528-20171227231646/lib/nvidia-361/modprobe.conf
sosreport-sys-90528-20171227231646/lib/modules/
sosreport-sys-90528-20171227231646/lib/modules/4.4.0-98-generic/
sosreport-sys-90528-20171227231646/lib/modules/4.4.0-98-generic/modules.dep
...

sosreport-sys-90528-20171227231646/etc/rcS.d/S03udev
sosreport-sys-90528-20171227231646/etc/rcS.d/S02plymouth-log
sosreport-sys-90528-20171227231646/etc/rcS.d/S08checkfs.sh

2017년 12월 18일 월요일

ppc64le에서 사용가능한 open source anti-virus SW : CLAMAV

Minsky의 아키텍처인 ppc64le(IBM POWER8)에서도 사용 가능한 anti-virus SW가 있습니다. CLAM Anti-Virus (clamav)입니다.

CLAMAV는 open source 기반의 anti-virus SW로서, 다음이 홈페이지로 되어 있고, source를 download 받을 수도 있습니다.

http://www.clamav.net/

ppc64le에서 빌드하는 방법도 매우 간단하여, 그냥 ./configure && make && sudo make install 만 해주시면 됩니다.

그러나 deep learning에서 주로 사용하는 Ubuntu에는 아예 OS의 표준 apt repository에 포함되어 있어 손쉽게 설치 및 사용이 가능합니다.

설치는 다음과 같이 apt-get install 명령으로 하시면 됩니다.

u0017649@sys-89983:~$ sudo apt-get install clamav clamav-daemon clamav-freshclam clamav-base libclamav-dev clamav-testfiles

clamav-daemon은 다음과 같이 start 하시면 됩니다.

u0017649@sys-89983:~$ sudo systemctl start clamav-daemon.service

u0017649@sys-89983:~$ sudo systemctl status clamav-daemon.service
● clamav-daemon.service - Clam AntiVirus userspace daemon
Loaded: loaded (/lib/systemd/system/clamav-daemon.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2017-12-17 21:07:57 EST; 4s ago
Docs: man:clamd(8)
man:clamd.conf(5)
http://www.clamav.net/lang/en/doc/
Main PID: 9462 (clamd)
Tasks: 1
Memory: 234.0M
CPU: 4.121s
CGroup: /system.slice/clamav-daemon.service
└─9462 /usr/sbin/clamd --foreground=true

/home/u0017649/hpcc-1.5.0 라는 directory 내용을 scan하여, 혹시 virus에 감염된 파일이 있을 경우 경고(bell)를 울려주는 명령은 다음과 같이 하시면 됩니다.

u0017649@sys-89983:~$ clamscan -r --bell -i /home/u0017649/hpcc-1.5.0

----------- SCAN SUMMARY -----------
Known viruses: 6366898
Engine version: 0.99.2
Scanned directories: 34
Scanned files: 737
Infected files: 0
Data scanned: 9.64 MB
Data read: 6.11 MB (ratio 1.58:1)
Time: 18.255 sec (0 m 18 s)

만약 virus에 감염된 파일이 있을 경우 자동으로 제거까지 하기를 원한다면 다음과 같이 --remove 옵션을 사용하시면 됩니다. 다만, ppc64le 아키텍처에서 virus 감염 파일을 구하는 것은 정말 어려울 것이므로, 위에 언급된 clamav 홈페이지에서 clamav source code를 download 받아서 그 source를 scan해보겠습니다. 그 속에는 test용으로 들어있는 파일들이 있는 모양이더라구요.

u0017649@sys-89983:~$ tar -zxf clamav-0.99.2.tar.gz

u0017649@sys-89983:~$ clamscan -r --remove /home/u0017649/clamav-0.99.2 > clamscan.out

다음과 같이 3개 파일이 감염되었다고 제거된 것을 보실 수 있습니다.

u0017649@sys-89983:~$ grep -i removed clamscan.out
/home/u0017649/clamav-0.99.2/test/.split/split.clam_IScab_int.exeaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clam.isoaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clam_IScab_ext.exeaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clamjol.isoaa: Removed.

u0017649@sys-89983:~$ tail clamscan.out

----------- SCAN SUMMARY -----------
Known viruses: 6366898
Engine version: 0.99.2
Scanned directories: 227
Scanned files: 3231
Infected files: 4
Data scanned: 93.26 MB
Data read: 50.67 MB (ratio 1.84:1)
Time: 28.488 sec (0 m 28 s)

Anti-virus SW는 virus 목록 등이 계속 업데이트 되는 것이 중요하지요. 그런 일을 해주는 것이 freshclam 입니다. 이건 설치되면 자동으로 수행되는데, 그 log는 다음과 같이 확인하실 수 있습니다.

u0017649@sys-89983:~$ sudo tail -f /var/log/clamav/freshclam.log
Sun Dec 17 20:57:04 2017 -> ClamAV update process started at Sun Dec 17 20:57:04 2017
Sun Dec 17 20:58:02 2017 -> Downloading main.cvd [100%]
Sun Dec 17 20:58:13 2017 -> main.cvd updated (version: 58, sigs: 4566249, f-level: 60, builder: sigmgr)
Sun Dec 17 20:58:35 2017 -> Downloading daily.cvd [100%]
Sun Dec 17 20:58:39 2017 -> daily.cvd updated (version: 24138, sigs: 1806393, f-level: 63, builder: neo)
Sun Dec 17 20:58:40 2017 -> Downloading bytecode.cvd [100%]
Sun Dec 17 20:58:40 2017 -> bytecode.cvd updated (version: 319, sigs: 75, f-level: 63, builder: neo)
Sun Dec 17 20:58:44 2017 -> Database updated (6372717 signatures) from db.local.clamav.net (IP: 157.131.0.17)
Sun Dec 17 20:58:44 2017 -> WARNING: Clamd was NOT notified: Can't connect to clamd through /var/run/clamav/clamd.ctl: No such file or directory
Sun Dec 17 20:58:44 2017 -> --------------------------------------

위의 log를 보면 clamd.ctl 파일이 없어서 clamd에 대한 notification이 제대로 되지 않은 것을 보실 수 있습니다. 저 file은 clamav-daemon을 처음 살릴 때 자동 생성되는데, 제가 위에서 'systemctl start clamav-daemon.service' 명령으로 clamav-daemon을 살리기 전에 freshclam이 구동되는 바람에 벌어진 일 같습니다. 이제 clamav-daemon을 제가 살려 놓았으므로, 다음과 같이 freshclam을 죽였다가 살리면 해결됩니다.

u0017649@sys-89983:~$ ps -ef | grep freshclam
clamav 8894 1 1 20:57 ? 00:00:10 /usr/bin/freshclam -d --foreground=true
u0017649 9473 31958 0 21:08 pts/0 00:00:00 grep --color=auto freshclam

u0017649@sys-89983:~$ sudo kill -9 8894

u0017649@sys-89983:~$ sudo /usr/bin/freshclam -d --foreground=false

위에서는 freshcalm을 background daemon으로 살렸습니다. 다시 log를 보시지요.

u0017649@sys-89983:~$ sudo tail -f /var/log/clamav/freshclam.log
Sun Dec 17 20:58:44 2017 -> Database updated (6372717 signatures) from db.local.clamav.net (IP: 157.131.0.17)
Sun Dec 17 20:58:44 2017 -> WARNING: Clamd was NOT notified: Can't connect to clamd through /var/run/clamav/clamd.ctl: No such file or directory
Sun Dec 17 20:58:44 2017 -> --------------------------------------
Sun Dec 17 21:09:37 2017 -> --------------------------------------
Sun Dec 17 21:09:37 2017 -> freshclam daemon 0.99.2 (OS: linux-gnu, ARCH: ppc, CPU: powerpc64le)
Sun Dec 17 21:09:37 2017 -> ClamAV update process started at Sun Dec 17 21:09:37 2017
Sun Dec 17 21:09:37 2017 -> main.cvd is up to date (version: 58, sigs: 4566249, f-level: 60, builder: sigmgr)
Sun Dec 17 21:09:37 2017 -> daily.cvd is up to date (version: 24138, sigs: 1806393, f-level: 63, builder: neo)
Sun Dec 17 21:09:37 2017 -> bytecode.cvd is up to date (version: 319, sigs: 75, f-level: 63, builder: neo)
Sun Dec 17 21:09:37 2017 -> --------------------------------------

이제 error 없이 잘 update된 것을 보실 수 있습니다.

clamconf 명령은 clamav 관련 각종 config 파일을 점검해주는 명령입니다. 그 output은 아래와 같습니다.

u0017649@sys-89983:~$ clamconf
Checking configuration files in /etc/clamav

Config file: clamd.conf
-----------------------
LogFile = "/var/log/clamav/clamav.log"
StatsHostID = "auto"
StatsEnabled disabled
StatsPEDisabled = "yes"
StatsTimeout = "10"
LogFileUnlock disabled
LogFileMaxSize = "4294967295"
LogTime = "yes"
LogClean disabled
LogSyslog disabled
LogFacility = "LOG_LOCAL6"
LogVerbose disabled
LogRotate = "yes"
ExtendedDetectionInfo = "yes"
PidFile disabled
TemporaryDirectory disabled
DatabaseDirectory = "/var/lib/clamav"
OfficialDatabaseOnly disabled
LocalSocket = "/var/run/clamav/clamd.ctl"
LocalSocketGroup = "clamav"
LocalSocketMode = "666"
FixStaleSocket = "yes"
TCPSocket disabled
TCPAddr disabled
MaxConnectionQueueLength = "15"
StreamMaxLength = "26214400"
StreamMinPort = "1024"
StreamMaxPort = "2048"
MaxThreads = "12"
ReadTimeout = "180"
CommandReadTimeout = "5"
SendBufTimeout = "200"
MaxQueue = "100"
IdleTimeout = "30"
ExcludePath disabled
MaxDirectoryRecursion = "15"
FollowDirectorySymlinks disabled
FollowFileSymlinks disabled
CrossFilesystems = "yes"
SelfCheck = "3600"
DisableCache disabled
VirusEvent disabled
ExitOnOOM disabled
AllowAllMatchScan = "yes"
Foreground disabled
Debug disabled
LeaveTemporaryFiles disabled
User = "clamav"
AllowSupplementaryGroups disabled
Bytecode = "yes"
BytecodeSecurity = "TrustSigned"
BytecodeTimeout = "60000"
BytecodeUnsigned disabled
BytecodeMode = "Auto"
DetectPUA disabled
ExcludePUA disabled
IncludePUA disabled
AlgorithmicDetection = "yes"
ScanPE = "yes"
ScanELF = "yes"
DetectBrokenExecutables disabled
ScanMail = "yes"
ScanPartialMessages disabled
PhishingSignatures = "yes"
PhishingScanURLs = "yes"
PhishingAlwaysBlockCloak disabled
PhishingAlwaysBlockSSLMismatch disabled
PartitionIntersection disabled
HeuristicScanPrecedence disabled
StructuredDataDetection disabled
StructuredMinCreditCardCount = "3"
StructuredMinSSNCount = "3"
StructuredSSNFormatNormal = "yes"
StructuredSSNFormatStripped disabled
ScanHTML = "yes"
ScanOLE2 = "yes"
OLE2BlockMacros disabled
ScanPDF = "yes"
ScanSWF = "yes"
ScanXMLDOCS = "yes"
ScanHWP3 = "yes"
ScanArchive = "yes"
ArchiveBlockEncrypted disabled
ForceToDisk disabled
MaxScanSize = "104857600"
MaxFileSize = "26214400"
MaxRecursion = "16"
MaxFiles = "10000"
MaxEmbeddedPE = "10485760"
MaxHTMLNormalize = "10485760"
MaxHTMLNoTags = "2097152"
MaxScriptNormalize = "5242880"
MaxZipTypeRcg = "1048576"
MaxPartitions = "50"
MaxIconsPE = "100"
MaxRecHWP3 = "16"
PCREMatchLimit = "10000"
PCRERecMatchLimit = "5000"
PCREMaxFileSize = "26214400"
ScanOnAccess disabled
OnAccessMountPath disabled
OnAccessIncludePath disabled
OnAccessExcludePath disabled
OnAccessExcludeUID disabled
OnAccessMaxFileSize = "5242880"
OnAccessDisableDDD disabled
OnAccessPrevention disabled
OnAccessExtraScanning disabled
DevACOnly disabled
DevACDepth disabled
DevPerformance disabled
DevLiblog disabled
DisableCertCheck disabled

Config file: freshclam.conf
---------------------------
StatsHostID disabled
StatsEnabled disabled
StatsTimeout disabled
LogFileMaxSize = "4294967295"
LogTime = "yes"
LogSyslog disabled
LogFacility = "LOG_LOCAL6"
LogVerbose disabled
LogRotate = "yes"
PidFile disabled
DatabaseDirectory = "/var/lib/clamav"
Foreground disabled
Debug disabled
AllowSupplementaryGroups disabled
UpdateLogFile = "/var/log/clamav/freshclam.log"
DatabaseOwner = "clamav"
Checks = "24"
DNSDatabaseInfo = "current.cvd.clamav.net"
DatabaseMirror = "db.local.clamav.net", "database.clamav.net"
PrivateMirror disabled
MaxAttempts = "5"
ScriptedUpdates = "yes"
TestDatabases = "yes"
CompressLocalDatabase disabled
ExtraDatabase disabled
DatabaseCustomURL disabled
HTTPProxyServer disabled
HTTPProxyPort disabled
HTTPProxyUsername disabled
HTTPProxyPassword disabled
HTTPUserAgent disabled
NotifyClamd = "/etc/clamav/clamd.conf"
OnUpdateExecute disabled
OnErrorExecute disabled
OnOutdatedExecute disabled
LocalIPAddress disabled
ConnectTimeout = "30"
ReceiveTimeout = "30"
SubmitDetectionStats disabled
DetectionStatsCountry disabled
DetectionStatsHostID disabled
SafeBrowsing disabled
Bytecode = "yes"

clamav-milter.conf not found

Software settings
-----------------
Version: 0.99.2
Optional features supported: MEMPOOL IPv6 FRESHCLAM_DNS_FIX AUTOIT_EA06 BZIP2 LIBXML2 PCRE ICONV JSON

Database information
--------------------
Database directory: /var/lib/clamav
bytecode.cvd: version 319, sigs: 75, built on Wed Dec 6 21:17:11 2017
main.cvd: version 58, sigs: 4566249, built on Wed Jun 7 17:38:10 2017
daily.cvd: version 24138, sigs: 1806393, built on Sun Dec 17 16:10:39 2017
Total number of signatures: 6372717

Platform information
--------------------
uname: Linux 4.4.0-103-generic #126-Ubuntu SMP Mon Dec 4 16:22:09 UTC 2017 ppc64le
OS: linux-gnu, ARCH: ppc, CPU: powerpc64le
Full OS version: Ubuntu 16.04.2 LTS
zlib version: 1.2.8 (1.2.8), compile flags: a9
platform id: 0x0a3152520800000000050400

Build information
-----------------
GNU C: 5.4.0 20160609 (5.4.0)
CPPFLAGS: -Wdate-time -D_FORTIFY_SOURCE=2
CFLAGS: -g -O3 -fstack-protector-strong -Wformat -Werror=format-security -Wall -D_FILE_OFFSET_BITS=64 -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE
CXXFLAGS:
LDFLAGS: -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,--as-needed
Configure: '--build=powerpc64le-linux-gnu' '--prefix=/usr' '--includedir=/usr/include' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libexecdir=/usr/lib/clamav' '--disable-maintainer-mode' '--disable-dependency-tracking' 'CFLAGS=-g -O3 -fstack-protector-strong -Wformat -Werror=format-security -Wall -D_FILE_OFFSET_BITS=64' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O3 -fstack-protector-strong -Wformat -Werror=format-security -Wall -D_FILE_OFFSET_BITS=64' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,--as-needed' '--with-dbdir=/var/lib/clamav' '--sysconfdir=/etc/clamav' '--disable-clamav' '--disable-unrar' '--enable-milter' '--enable-dns-fix' '--with-libjson' '--with-gnu-ld' '--with-systemdsystemunitdir=/lib/systemd/system' 'build_alias=powerpc64le-linux-gnu'
sizeof(void*) = 8
Engine flevel: 82, dconf: 82

2017년 11월 20일 월요일

Minsky 서버에서의 JCuda 0.8.0 (CUDA 8용) build (ppc64le)

JCuda를 빌드하는 것은 아래 github에 나온 대로 따라하시면 됩니다.

https://github.com/jcuda/jcuda-main/blob/master/BUILDING.md

먼저, 아래와 같이 9개의 project에 대해 git clone을 수행합니다.

u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcuda-main.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcuda-common.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcuda.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcublas.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcufft.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcusparse.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcurand.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcusolver.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jnvgraph.git

각 directory로 들어가서, CUDA 8에 맞는 버전인 version-0.8.0으로 각각 checkout을 해줍니다.

u0017649@sys-90043:~/jcuda$ ls
jcublas jcuda jcuda-common jcuda-main jcufft jcurand jcusolver jcusparse jnvgraph

u0017649@sys-90043:~/jcuda$ cd jcublas
u0017649@sys-90043:~/jcuda/jcublas$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcublas$ cd ../jcuda
u0017649@sys-90043:~/jcuda/jcuda$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcuda$ cd ../jcuda-common
u0017649@sys-90043:~/jcuda/jcuda-common$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcuda-common$ cd ../jcuda-main
u0017649@sys-90043:~/jcuda/jcuda-main$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcuda-main$ cd ../jcufft
u0017649@sys-90043:~/jcuda/jcufft$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcufft$ cd ../jcurand
u0017649@sys-90043:~/jcuda/jcurand$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcurand$ cd ../jcusolver
u0017649@sys-90043:~/jcuda/jcusolver$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcusolver$ cd ../jcusparse
u0017649@sys-90043:~/jcuda/jcusparse$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcusparse$ cd ../jnvgraph
u0017649@sys-90043:~/jcuda/jnvgraph$ git checkout tags/version-0.8.0
u0017649@sys-90043:~/jcuda/jnvgraph$ cd ..

저는 이것을 Ubuntu 16.04 ppc64le에서 빌드했는데, 이걸 빌드할 때 GL/gl.h를 찾기 때문에 다음과 같이 libmesa-dev를 미리 설치해야 합니다.

u0017649@sys-90043:~/jcuda$ sudo apt-get install libmesa-dev

아래와 같이 기본적인 환경변수를 설정합니다.

u0017649@sys-90043:~/jcuda$ export LD_LIBRARY_PATH=/usr/local/cuda-8.0/targets/ppc64le-linux/lib:$LD_LIBRARY_PATH
u0017649@sys-90043:~/jcuda$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-ppc64el

이제 cmake를 수행합니다. 이때 CUDA_nvrtc_LIBRARY를 cmake에게 알려주기 위해 다음과 같이 -D 옵션을 붙입니다.

u0017649@sys-90043:~/jcuda$ cmake ./jcuda-main -DCUDA_nvrtc_LIBRARY="/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libnvrtc.so"
...
-- Found CUDA: /usr/local/cuda/bin/nvcc
-- Found JNI: /usr/lib/jvm/java-8-openjdk-ppc64el/jre/lib/ppc64le/libjawt.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/u0017649/jcuda

다음으로는 make all을 수행합니다.

u0017649@sys-90043:~/jcuda$ make all
...
/home/u0017649/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp:292:26: warning: deleting ‘void*’ is undefined [-Wdelete-incomplete]
delete nativeObject->nativeTopologyData;
^
[100%] Linking CXX shared library ../../nativeLibraries/linux/ppc_64/lib/libJNvgraph-0.8.0-linux-ppc_64.so
[100%] Built target JNvgraph

끝나고 나면 jcuda-main directory로 들어가서 mvn으로 clean install을 수행합니다. 단, 여기서는 maven test는 모두 skip 했습니다. 저는 여기에 CUDA 8.0이 설치되어있긴 하지만 실제 GPU가 설치된 환경은 아니라서, test를 하면 cuda device를 찾을 수 없다며 error가 나기 때문입니다. GPU가 설치된 환경에서라면 저 "-Dmaven.test.skip=true" 옵션을 빼고 그냥 "mvn clean install"을 수행하시기 바랍니다.

u0017649@sys-90043:~/jcuda$ cd jcuda-main

u0017649@sys-90043:~/jcuda/jcuda-main$ mvn -Dmaven.test.skip=true clean install
...
[INFO] Configured Artifact: org.jcuda:jnvgraph-natives:linux-ppc_64:0.8.0:jar
[INFO] Copying jcuda-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcuda-0.8.0.jar
[INFO] Copying jcuda-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcuda-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jcublas-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcublas-0.8.0.jar
[INFO] Copying jcublas-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcublas-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jcufft-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcufft-0.8.0.jar
[INFO] Copying jcufft-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcufft-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jcusparse-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcusparse-0.8.0.jar
[INFO] Copying jcusparse-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcusparse-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jcurand-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcurand-0.8.0.jar
[INFO] Copying jcurand-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcurand-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jcusolver-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcusolver-0.8.0.jar
[INFO] Copying jcusolver-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcusolver-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jnvgraph-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jnvgraph-0.8.0.jar
[INFO] Copying jnvgraph-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jnvgraph-natives-0.8.0-linux-ppc_64.jar
[INFO]
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ jcuda-main ---
[INFO] Installing /home/u0017649/jcuda/jcuda-main/pom.xml to /home/u0017649/.m2/repository/org/jcuda/jcuda-main/0.8.0/jcuda-main-0.8.0.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] JCuda .............................................. SUCCESS [ 2.596 s]
[INFO] jcuda-natives ...................................... SUCCESS [ 0.596 s]
[INFO] jcuda .............................................. SUCCESS [ 13.244 s]
[INFO] jcublas-natives .................................... SUCCESS [ 0.120 s]
[INFO] jcublas ............................................ SUCCESS [ 6.343 s]
[INFO] jcufft-natives ..................................... SUCCESS [ 0.029 s]
[INFO] jcufft ............................................. SUCCESS [ 2.843 s]
[INFO] jcurand-natives .................................... SUCCESS [ 0.036 s]
[INFO] jcurand ............................................ SUCCESS [ 2.428 s]
[INFO] jcusparse-natives .................................. SUCCESS [ 0.085 s]
[INFO] jcusparse .......................................... SUCCESS [ 7.853 s]
[INFO] jcusolver-natives .................................. SUCCESS [ 0.066 s]
[INFO] jcusolver .......................................... SUCCESS [ 4.158 s]
[INFO] jnvgraph-natives ................................... SUCCESS [ 0.057 s]
[INFO] jnvgraph ........................................... SUCCESS [ 2.932 s]
[INFO] jcuda-main ......................................... SUCCESS [ 1.689 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 45.413 s
[INFO] Finished at: 2017-11-20T01:41:41-05:00
[INFO] Final Memory: 53M/421M
[INFO] ------------------------------------------------------------------------

위와 같이 build는 잘 끝나고, 결과물로는 jcuda-main/target directory에 jar 파일들 14개가 생긴 것을 보실 수 있습니다.

u0017649@sys-90043:~/jcuda/jcuda-main$ cd target
u0017649@sys-90043:~/jcuda/jcuda-main/target$ ls -ltr
total 1680
-rw-rw-r-- 1 u0017649 u0017649 318740 Nov 20 01:41 jcuda-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 149350 Nov 20 01:41 jcuda-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 297989 Nov 20 01:41 jcublas-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 30881 Nov 20 01:41 jcublas-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 196292 Nov 20 01:41 jnvgraph-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 11081 Nov 20 01:41 jnvgraph-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 307435 Nov 20 01:41 jcusparse-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 38335 Nov 20 01:41 jcusparse-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 248028 Nov 20 01:41 jcusolver-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 22736 Nov 20 01:41 jcusolver-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 26684 Nov 20 01:41 jcurand-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 8372 Nov 20 01:41 jcurand-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 27414 Nov 20 01:41 jcufft-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 11052 Nov 20 01:41 jcufft-0.8.0.jar

이제 이 14개의 jar 파일을 적당한 directory에 옮겨서 tar로 묶어 주면 됩니다. 저는 /tmp 밑에 JCuda-All-0.8.0-bin-linux-ppc64le 라는 이름의 directory를 만들고 거기로 이 jar 파일들을 옮긴 뒤 그 directory를 tar로 다음과 같이 묶었습니다.

u0017649@sys-90043:/tmp$ tar -zcvf JCuda-All-0.8.0-bin-linux-ppc64le.tgz JCuda-All-0.8.0-bin-linux-ppc64le

내용은 다음과 같습니다.

u0017649@sys-90043:/tmp$ tar -ztvf JCuda-All-0.8.0-bin-linux-ppc64le.tgz
drwxrwxr-x u0017649/u0017649 0 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/
-rw-rw-r-- u0017649/u0017649 30881 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcublas-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 11052 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcufft-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 196292 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jnvgraph-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 149350 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcuda-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 307435 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcusparse-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 248028 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcusolver-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 11081 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jnvgraph-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 27414 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcufft-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 297989 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcublas-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 38335 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcusparse-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 8372 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcurand-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 26684 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcurand-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 22736 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcusolver-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 318740 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcuda-natives-0.8.0-linux-ppc_64.jar

이 JCuda-All-0.8.0-bin-linux-ppc64le.tgz 파일은 아래 link에서 download 받으실 수 있습니다.

https://drive.google.com/open?id=1CnlvJARkRWPDTbynUUlNBL_TQbu1-xbn

* 참고로, jcuda.org에서 x86용 binary를 download 받아보니 제 것과 마찬가지로 14개의 jar 파일이 들어있습니다. 아마 빌드는 제대로 된 것 같습니다.

** 참고로, 저 위에서 cmake를 할 때 -DCUDA_nvrtc_LIBRARY="/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libnvrtc.so" 옵션을 붙이는 이유는 아래의 error를 피하기 위한 것입니다.

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDA_nvrtc_LIBRARY
linked by target "JNvrtc" in directory /home/u0017649/jcuda/jcuda/JNvrtcJNI

*

2017년 10월 27일 금요일

caffe-ibm의 LMS 기능에 대한 설명

전에 올린 DDL 관련 포스팅(https://hwengineer.blogspot.kr/2017/10/caffe-ddl-alexnet-training.html)에서, LMS(large model support) 기능을 믿고 batch_size를 화끈하게 2048이나 4096으로 올리면 어떤가라는 질문이 있을 수 있습니다. 결론부터 말씀드리면 LMS를 쓴다고 해도 batch_size를 무한정 키울 수는 없습니다.

먼저 저 포스팅에 나와 있듯이, LMS 기능은 '기존 GPU memory 한계 때문에 돌릴 수 없었던 큰 모델도 돌릴 수 있게 해주는 기능'이지, 이 때문에 반드시 더 빠른 성능을 낼 수 있는 것은 아닙니다. 아무리 NVLink를 통해 가져온다고 해도, host server memory가 GPU memory보다는 느리니까요.

그와는 별도로, LMS도 무한정 host memory를 쓸 수는 없습니다. Lab에서 들은 이야기입니다만, LMS를 쓴다고 해도 아래 정보들은 반드시 GPU memory 상에 올라가야 한다고 합니다.

- input tensor
- output tensor
- weight
- gradient (training인 경우)

그리고 이 정보들이 차지하는 memory의 양은 batch_size가 늘어날 수록 함께 늘어나는데, 그로 인해 결국 한계가 있습니다. LMS의 핵심은, deep learning에서 layer별로 training을 할 때, 당장 처리하고 있는 layer는 GPU memory 위에 두더라도, 이미 처리했거나 나중에 처리할 layer들은 host memory에 저장할 수 있다는 것입니다. 따라서, LMS로 처리가능한 최대 neural network 크기는 그 neural network에서 가장 큰 layer의 크기에 달려 있다고 할 수 있습니다.

가령 전에 테스트했던, Alexnet의 deploy.prototxt 이용해서 'caffe time'을 수행할 때 보면 아래와 같이 data, conv1, relu1 등 총 24개의 layer가 만들어집니다.

$ grep "Creating Layer" caffe_time.log
…
I1025 16:37:01.848961 29514 net.cpp:90] Creating Layer data
I1025 16:37:01.867213 29514 net.cpp:90] Creating Layer conv1
…
I1025 16:37:03.477823 29514 net.cpp:90] Creating Layer fc8
I1025 16:37:03.481210 29514 net.cpp:90] Creating Layer prob

그리고 각 layer마다 다음과 같이 Top shape가 정해지면서 "Memory required for data"가 연산됩니다. 그리고 그 값은 처음 layer에서는 작아도 나중에는 매우 커지지요.

…
I1025 16:37:01.867137 29514 net.cpp:135] Top shape: 10 3 1600 1200 (57600000)
I1025 16:37:01.867183 29514 net.cpp:143] Memory required for data: 230400000
…
I1025 16:37:02.231456 29514 net.cpp:135] Top shape: 10 96 398 298 (113859840)
I1025 16:37:02.231468 29514 net.cpp:143] Memory required for data: 685839360
…
I1025 16:37:02.246103 29514 net.cpp:135] Top shape: 10 256 49 37 (4641280)
I1025 16:37:02.246112 29514 net.cpp:143] Memory required for data: 3315185920
…

이런 특성들로 인해, wide한 (layer별 메모리 필요량이 많은) neural network보다는 deep한 (layer 개수가 많은) neural network이 LMS의 잇점을 훨씬 더 잘 살릴 수 있습니다.

LMS의 진정한 장점을 정리하면 아래와 같습니다.

1) width로는 10-30배 정도 더 큰 모델 사용 가능
2) depth로는 무한대의 큰 모델을 사용 가능

특성상 layer들이 많은 RNN에서 LMS가 특히 유용하게 쓰일 수 있다고 합니다. 그리고 요즘 신경망 발전 방향이 wide해지는 방향이 아니라 점점 deep해지는 방향이라고 합니다. 가령 몇년 전에 나온 Alexnet 같은 경우 이론상 7개 layer라고 하는데, 그로부터 4년 뒤에 나온 ResNet 같은 경우 1000개 layer라고 하지요. LMS의 적용 범위는 점점 넓어지고 있다고 할 수 있습니다.

caffe DDL을 이용한 Alexnet training

지난 9월 포스팅(http://hwengineer.blogspot.kr/2017/09/ibm-powerai-40-caffe-distributed-deep.html)에서 PowerAI에 포함된 DDL(Distributed Deep Learning), 즉 MPI를 이용한 분산처리 기능에 대해 간단히 설명드린 바 있습니다. 이번에는 그것으로 ILSVRC2012의 128만장 image dataset을 caffe alexnet으로 training 해보겠습니다.

제가 잠깐 빌릴 수 있는 Minsky 서버가 딱 1대 뿐이라, 원래 여러대의 Minsky 서버를 묶어서 하나의 model을 train시킬 수 있지만 여기서는 1대의 서버에서 caffe DDL을 수행해보겠습니다. 잠깐, 1대라고요 ? 1대에서 MPI 분산처리가 의미가 있나요 ?

예, 없지는 않습니다. Multi-GPU를 이용한 training을 할 때 일반 caffe와 caffe DDL의 차이는 multi-thread냐, multi-process냐의 차이입니다. 좀더 쉽게 말해, 일반 caffe에서는 하나의 caffe process가 P2P를 통해 여러개의 GPU를 사용합니다. 그에 비해, caffe DDL에서는 GPU당 1개씩 별도의 caffe process가 떠서, 서로간에 MPI를 이용한 통신을 하며 여러개의 GPU를 사용합니다.

이를 그림으로 표현하면 아래와 같습니다.

실제로, 일반 caffe를 사용할 경우 nvidia-smi로 관찰해보면 다음과 같이 caffe의 PID가 모두 같지만, caffe DDL에서는 각 GPU를 사용하는 caffe PID가 서로 다릅니다.

일반 caffe :

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 30681 C caffe 15497MiB |
| 1 30681 C caffe 14589MiB |
| 2 30681 C caffe 14589MiB |
| 3 30681 C caffe 14589MiB |
+-----------------------------------------------------------------------------+

caffe DDL :

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 31227 C caffe 15741MiB |
| 1 31228 C caffe 14837MiB |
| 2 31229 C caffe 14837MiB |
| 3 31230 C caffe 14837MiB |
+-----------------------------------------------------------------------------+

자, 대략 차이를 이해하셨으면, alexnet training을 한번은 일반 caffe로, 또 한번은 caffe DDL로 training해보시지요. 물론 모두 같은 Minsky 서버, 즉 4-GPU 시스템 1대를 써서 테스트한 것입니다. 각각의 성능 측정은 128만장을 2-epochs, 즉 2회 반복 training할 때 걸린 시간으로 측정하겠습니다.

일반 caffe :

test@ubuntu:/nvme$ caffe train --solver=models/bvlc_alexnet/solver.prototxt -gpu all

caffe DDL :

test@ubuntu:/nvme$ mpirun -x PATH -x LD_LIBRARY_PATH -n 4 -rf 4x1x1.rf caffe train --solver=models/bvlc_alexnet/solver.prototxt -gpu 0 -ddl "-mode n:4x1x1 -dev_sync 1"

지난번에 잠깐 설명드린 것을 반복하자면 이렇습니다.

- mpirun은 여러대의 서버 노드에 동일한 명령을 동일한 환경변수 (-x 옵션)을 써서 수행해주는 병렬환경 명령어입니다.
- 4x1x1.rf라는 이름의 파일은 rank file입니다. 이 속에 병렬 서버 환경의 toplogy가 들어있습니다.
- -n 4라는 것은 MPI client의 총 숫자이며, 쉽게 말해 training에 이용하려는 GPU의 갯수입니다.
- -gpu 0에서, 왜 4개가 아니라 gpu 0이라고 1개로 지정했는지 의아하실 수 있는데, MPI 환경에서는 각각의 GPU가 하나의 learner가 됩니다. 따라서 실제 물리적 서버 1대에 GPU가 몇 장 장착되어있든 상관없이 모두 -gpu 0, 즉 GPU는 1개로 지정한 것입니다.
- "-mode b:4x1x1"에서 b라는 것은 가능하면 enhanced NCCL을 이용하라는 뜻입니다. 4x1x1은 4장의 GPU를 가진 서버 1대가 하나의 rack에 들어있다는 뜻입니다.
- dev_sync에서 0은 GPU간 sync를 하지 말라는 것이고, 1은 통신 시작할 때 sync하라는 뜻, 2는 시작할 때와 끝낼 때 각각 sync하라는 뜻입니다.

여기서 사용된 rank file 4x1x1.rf 속의 내용은 아래와 같습니다.

rank 0=minksy slot=0:0-3
rank 1=minksy slot=0:4-7
rank 2=minksy slot=1:0-3
rank 3=minksy slot=1:4-7

128만장 x 2-epochs를 처리하기 위해서는, solver.prototxt와 train_val.prototxt 속에 표시된 batch_size와 max_iter의 곱이 128만장 x 2 = 256만장이면 됩니다. batch_size를 조절함에 따라 training 속도가 꽤 달라지는데, 여기서는 256부터 512, 768 순으로 늘려가며 테스트해보겠습니다.

위 표에서 보시다시피, batch_size가 작을 때는 일반 caffe의 성능이 더 빨랐는데, batch_size가 점점 커지면서 caffe DDL의 성능이 점점 더 빨라져서 결국 역전하게 됩니다. batch_size와 MPI를 이용한 DDL의 성능과의 상관 관계가 있을 것 같기는 한데, 아직 그 이유는 파악을 못 했습니다. Lab에 문의해봤는데, batch_size와는 무관할 것이라는 답변을 받긴 했습니다.

여기서 batch_size를 1024보다 더 키우면 어떻게 될까요 ?

...
F1025 17:48:43.059572 30265 syncedmem.cpp:651] Check failed: error == cudaSuccess (2 vs. 0) out of memoryF1025 17:48:43.071281 30285 syncedmem.cpp:651] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x3fffb645ce0c google::LogMessage::Fail()
@ 0x3fffb69649cc caffe::Solver<>::Step()
@ 0x3fffb645f284 google::LogMessage::SendToLog()
...

일반 caffe든 caffe DDL이든 batch_size가 1100만 되어도 이렇게 out-of-memory (OOM) error를 내며 죽어버립니다. 그러니 아쉽게도 더 큰 batch_size에서는 테스트가 안되는 것이지요.

batch_size가 너무 커서 OOM error가 난다면 그걸 또 피해가는 방법이 있습니다. 역시 caffe-ibm에 포함된 LMS(large model support)입니다. 아래와 같이 -lms 옵션을 주면 caffe나 caffe DDL이나 모두 batch_size=1200 정도까지는 무난히 돌릴 수 있습니다. -lms 800000이라는 것은 800000KB 이상의 memory chunk는 GPU 말고 CPU에 남겨두라는 뜻입니다. (http://hwengineer.blogspot.kr/2017/09/inference-gpu-sizing-ibm-caffe-large.html 참조)

일반 caffe with LMS :

test@ubuntu:/nvme$ caffe train -lms 800000 --solver=models/bvlc_alexnet/solver.prototxt -gpu all

caffe DDL with LMS :

test@ubuntu:/nvme$ mpirun -x PATH -x LD_LIBRARY_PATH -n 4 -rf 4x1x1.rf caffe train -lms 800000 --solver=models/bvlc_alexnet/solver.prototxt -gpu 0 -ddl "-mode n:4x1x1 -dev_sync 1"

그 결과는 아래와 같습니다. 확실히 batch_size가 커질 수록 일반 caffe보다 caffe DDL의 성능이 더 잘 나옵니다.

궁금해하실 분들을 위해서, caffe DDL을 수행할 경우 나오는 메시지의 앞부분과 뒷부분 일부를 아래에 붙여놓습니다.

--------------------------------------------------------------------------

[[31653,1],2]: A high-performance Open MPI point-to-point messaging module

was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)

Host: minsky

Another transport will be used instead, although this may result in

lower performance.

--------------------------------------------------------------------------

ubuntu: n0(0) n1(0) n2(0) n3(0)

I1025 18:59:45.681555 31227 caffe.cpp:151] [MPI:0 ] spreading GPUs per MPI rank

I1025 18:59:45.681725 31227 caffe.cpp:153] [MPI:0 ] use gpu[0]

I1025 18:59:45.681541 31228 caffe.cpp:151] [MPI:1 ] spreading GPUs per MPI rank

I1025 18:59:45.681725 31228 caffe.cpp:153] [MPI:1 ] use gpu[1]

I1025 18:59:45.681541 31229 caffe.cpp:151] [MPI:2 ] spreading GPUs per MPI rank

I1025 18:59:45.681726 31229 caffe.cpp:153] [MPI:2 ] use gpu[2]

I1025 18:59:45.681735 31229 caffe.cpp:283] Using GPUs 2

I1025 18:59:45.681541 31230 caffe.cpp:151] [MPI:3 ] spreading GPUs per MPI rank

I1025 18:59:45.681726 31230 caffe.cpp:153] [MPI:3 ] use gpu[3]

I1025 18:59:45.681735 31230 caffe.cpp:283] Using GPUs 3

I1025 18:59:45.681735 31227 caffe.cpp:283] Using GPUs 0

I1025 18:59:45.681733 31228 caffe.cpp:283] Using GPUs 1

I1025 18:59:45.683846 31228 caffe.cpp:288] GPU 1: Tesla P100-SXM2-16GB

I1025 18:59:45.683897 31227 caffe.cpp:288] GPU 0: Tesla P100-SXM2-16GB

I1025 18:59:45.683955 31230 caffe.cpp:288] GPU 3: Tesla P100-SXM2-16GB

I1025 18:59:45.684010 31229 caffe.cpp:288] GPU 2: Tesla P100-SXM2-16GB

I1025 18:59:46.056959 31227 caffe.cpp:302] [MPI:0 ] name = minsky root = 1

I1025 18:59:46.067212 31228 caffe.cpp:302] [MPI:1 ] name = minsky root = 1

I1025 18:59:46.070734 31230 caffe.cpp:302] [MPI:3 ] name = minsky root = 1

I1025 18:59:46.071211 31229 caffe.cpp:302] [MPI:2 ] name = minsky root = 1

I1025 18:59:46.073958 31227 solver.cpp:44] Initializing solver from parameters:

test_iter: 1000

test_interval: 1000

base_lr: 0.01

display: 500

max_iter: 2500

lr_policy: "step"

gamma: 0.1

...중략...

I1025 19:24:53.536928 31227 solver.cpp:414] Test net output #0: accuracy = 0.20032

I1025 19:24:53.536965 31227 solver.cpp:414] Test net output #1: loss = 4.09802 (* 1 = 4.09802 loss)

I1025 19:24:54.180562 31227 solver.cpp:223] Iteration 2000 (1.27922 iter/s, 390.864s/500 iters), loss = 4.18248

I1025 19:24:54.180598 31227 solver.cpp:242] Train net output #0: loss = 4.18248 (* 1 = 4.18248 loss)

I1025 19:24:54.180613 31227 sgd_solver.cpp:121] Iteration 2000, lr = 0.01

I1025 19:26:57.349701 31256 data_layer.cpp:86] Restarting data prefetching from start.

I1025 19:30:28.547333 31256 data_layer.cpp:86] Restarting data prefetching from start.

I1025 19:30:29.081480 31227 solver.cpp:466] Snapshotting to binary proto file models/bvlc_alexnet/caffe_alexnet_train_iter_2500.caffemodel

I1025 19:30:29.283386 31228 solver.cpp:315] Iteration 2500, loss = 3.91634

I1025 19:30:29.283444 31228 solver.cpp:320] Optimization Done.

I1025 19:30:29.283535 31230 solver.cpp:315] Iteration 2500, loss = 3.9612

I1025 19:30:29.283582 31230 solver.cpp:320] Optimization Done.

I1025 19:30:29.285512 31228 caffe.cpp:357] Optimization Done.

I1025 19:30:29.285521 31228 caffe.cpp:359] [MPI:1 ] MPI_Finalize

I1025 19:30:29.285697 31230 caffe.cpp:357] Optimization Done.

I1025 19:30:29.285706 31230 caffe.cpp:359] [MPI:3 ] MPI_Finalize

I1025 19:30:29.286912 31229 solver.cpp:315] Iteration 2500, loss = 3.90313

I1025 19:30:29.286952 31229 solver.cpp:320] Optimization Done.

I1025 19:30:29.290489 31229 caffe.cpp:357] Optimization Done.

I1025 19:30:29.290498 31229 caffe.cpp:359] [MPI:2 ] MPI_Finalize

I1025 19:30:29.973234 31227 sgd_solver.cpp:356] Snapshotting solver state to binary proto file models/bvlc_alexnet/caffe_alexnet_train_iter_2500.solverstate

I1025 19:30:30.727695 31227 solver.cpp:315] Iteration 2500, loss = 3.89638

I1025 19:30:30.727744 31227 solver.cpp:320] Optimization Done.

I1025 19:30:30.729465 31227 caffe.cpp:357] Optimization Done.

I1025 19:30:30.729475 31227 caffe.cpp:359] [MPI:0 ] MPI_Finalize

2017년 10월 25일 수요일

Minsky에서의 Caffe2 설치 및 jupyter notebook에서 MNIST 돌려보기

PyTorch와 함께 Caffe2를 찾으시는 고객분들도 점점 늘고 계십니다. PyTorch처럼 Caffe2도 아직 PowerAI 속에 포함되어 있지는 않지만, 그래도 ppc64le 아키텍처에서도 소스에서 빌드하면 잘 됩니다. 현재의 최신 Caffe2 소스코드에서는 딱 한군데, benchmark 관련 모듈에서 error가 나는 부분이 있는데, 거기에 대해서 IBM이 patch를 제공합니다. 아래를 보시고 따라 하시면 됩니다.

먼저, 기본적인 caffe2의 build는 caffe2 홈페이지에 나온 안내(아래 링크)를 그대로 따르시면 됩니다.

https://caffe2.ai/docs/getting-started.html?platform=ubuntu&configuration=compile

저는 아래와 같은 환경에서 빌드했습니다. CUDA 8.0과 libcuDNN이 설치된 Ubuntu 16.04 ppc64le 환경입니다. 특기할 부분으로는, caffe2는 Anaconda와는 뭔가 맞지 않습니다. Anaconda가 혹시 설치되어 있다면 관련 PATH를 모두 unset하시고 빌드하시기 바랍니다.

u0017649@sys-89697:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial

u0017649@sys-89697:~$ dpkg -l | grep cuda | head -n 4
ii cuda 8.0.61-1 ppc64el CUDA meta-package
ii cuda-8-0 8.0.61-1 ppc64el CUDA 8.0 meta-package
ii cuda-command-line-tools-8-0 8.0.61-1 ppc64el CUDA command-line tools
ii cuda-core-8-0 8.0.61-1 ppc64el CUDA core tools

u0017649@sys-89697:~$ dpkg -l | grep libcudnn
ii libcudnn6 6.0.21-1+cuda8.0 ppc64el cuDNN runtime libraries
ii libcudnn6-dev 6.0.21-1+cuda8.0 ppc64el cuDNN development libraries and headers

위에서 언급한 대로, python이든 pip든 모두 Anaconda가 아닌, OS에서 제공하는 것을 사용하십시요. 여기서는 python 2.7을 썼습니다.

u0017649@sys-89697:~$ which python
/usr/bin/python

먼저, 기본적으로 아래의 Ubuntu OS package들을 설치하십시요.

u0017649@sys-89697:~$ sudo apt-get install -y --no-install-recommends build-essential cmake git libgoogle-glog-dev libprotobuf-dev protobuf-compiler

u0017649@sys-89697:~$ sudo apt-get install -y --no-install-recommends libgtest-dev libiomp-dev libleveldb-dev liblmdb-dev libopencv-dev libopenmpi-dev libsnappy-dev openmpi-bin openmpi-doc python-pydot libgflags-dev

그리고 PowerAI에서 제공하는 openblas와 nccl도 설치하십시요.

u0017649@sys-89697:~$ sudo dpkg -i mldl-repo-local_4.0.0_ppc64el.deb

u0017649@sys-89697:~$ sudo apt-get update

u0017649@sys-89697:~$ sudo apt-get install libnccl-dev
u0017649@sys-89697:~$ sudo apt-get install libopenblas-dev libopenblas

그리고 아래의 pip package들을 설치하십시요.

u0017649@sys-89697:~$ pip install flask future graphviz hypothesis jupyter matplotlib pydot python-nvd3 pyyaml requests scikit-image scipy setuptools six tornado protobuf

이제 caffe2의 source를 download 받습니다.

u0017649@sys-89697:~$ git clone --recursive https://github.com/caffe2/caffe2.git
Cloning into 'caffe2'...
remote: Counting objects: 36833, done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 36833 (delta 37), reused 34 (delta 12), pack-reused 36757
Receiving objects: 100% (36833/36833), 149.17 MiB | 11.42 MiB/s, done.
Resolving deltas: 100% (26960/26960), done.
Checking connectivity... done.
Submodule 'third_party/NNPACK' (https://github.com/Maratyszcza/NNPACK.git) registered for path 'third_party/NNPACK'
Submodule 'third_party/NNPACK_deps/FP16' (https://github.com/Maratyszcza/FP16.git) registered for path 'third_party/NNPACK_deps/FP16'
...
remote: Counting objects: 353, done.
remote: Total 353 (delta 0), reused 0 (delta 0), pack-reused 353
Receiving objects: 100% (353/353), 119.74 KiB | 0 bytes/s, done.
Resolving deltas: 100% (149/149), done.
Checking connectivity... done.
Submodule path 'third_party/pybind11/tools/clang': checked out '254c7a91e3c6aa254e113197604dafb443f4d429'

이제 ppc64le를 위한 source 수정을 하나 하겠습니다. Patch가 필요한 부분은 주요 부분은 아니고, 벤치마크를 위한 third party 모듈 중 하나입니다.

u0017649@sys-89697:~$ cd caffe2/third_party/benchmark/src

u0017649@sys-89697:~/caffe2/third_party/benchmark/src$ vi cycleclock.h
...
#elif defined(__powerpc__) || defined(__ppc__)
// This returns a time-base, which is not always precisely a cycle-count.
int64_t tbl, tbu0, tbu1;
asm("mftbu %0" : "=r"(tbu0));
asm("mftb %0" : "=r"(tbl));
asm("mftbu %0" : "=r"(tbu1));
tbl &= -static_cast<long long>(tbu0 == tbu1);
// tbl &= -static_cast<int64_t>(tbu0 == tbu1);

수정의 핵심은 저 int64 대신 long long을 넣는 것입니다.

** 그 사이에 하나 더 늘었네요. 아래와 같이 third_party/NNPACK/CMakeLists.txt의 30번째 줄에 ppc64le를 넣어줍니다.

u0017649@sys-89697:~/caffe2/build# vi ../third_party/NNPACK/CMakeLists.txt
...

MESSAGE(FATAL_ERROR "Unrecognized CMAKE_SYSTEM_PROCESSOR = ${CMAKE_SYSTEM_PROCESSOR}")

...

그 다음으로는 나중에 cblas 관련 error를 없애기 위해, 아래와 같이 Dependencies.cmake 파일에서 cblas 관련 1줄을 comment-out 시킵니다.

u0017649@sys-89697:~/caffe2/cmake$ vi Dependencies.cmake
...
elseif(BLAS STREQUAL "OpenBLAS")
find_package(OpenBLAS REQUIRED)
caffe2_include_directories(${OpenBLAS_INCLUDE_DIR})
list(APPEND Caffe2_DEPENDENCY_LIBS ${OpenBLAS_LIB})
# list(APPEND Caffe2_DEPENDENCY_LIBS cblas)
...

그리고 OpenBLAS_HOME을 아래와 같이 PowerAI에서 제공하는 openblas로 설정합니다.

u0017649@sys-89697:~/caffe2/cmake$ export OpenBLAS_HOME=/opt/DL/openblas/

또 하나 고칩니다. FindNCCL.cmake에서 아래 부분을 PowerAI의 NCCL directory로 명기하면 됩니다.

root@26aa285b6c46:/data/imsi/caffe2/cmake# vi Modules/FindNCCL.cmake
...
#set(NCCL_ROOT_DIR "" CACHE PATH "Folder contains NVIDIA NCCL")
set(NCCL_ROOT_DIR "/opt/DL/nccl" CACHE PATH "Folder contains NVIDIA NCCL")
...

이제 cmake 이후 make를 돌리면 됩니다. 단, default인 /usr/local이 아닌, /opt/caffe2에 caffe2 binary가 설치되도록 CMAKE_INSTALL_PREFIX 옵션을 줍니다.

u0017649@sys-89697:~$ cd caffe2
u0017649@sys-89697:~/caffe2$ mkdir build
u0017649@sys-89697:~/caffe2$ cd build

u0017649@sys-89697:~/caffe2/build$ cmake -DCMAKE_INSTALL_PREFIX=/opt/caffe2 ..

u0017649@sys-89697:~/caffe2/build$ make

약 1시간 넘게 꽤 긴 시간이 걸린 뒤에 make가 완료됩니다. 이제 make install을 하면 /opt/caffe2 밑으로 binary들이 설치됩니다.

u0017649@sys-89697:~/caffe2/build$ sudo make install

u0017649@sys-89697:~/caffe2/build$ ls -l /opt/caffe2
total 28
drwxr-xr-x 2 root root 4096 Oct 25 03:16 bin
drwxr-xr-x 3 root root 4096 Oct 25 03:16 caffe
drwxr-xr-x 24 root root 4096 Oct 25 03:16 caffe2
drwxr-xr-x 4 root root 4096 Oct 25 03:16 include
drwxr-xr-x 2 root root 4096 Oct 25 03:16 lib
drwxr-xr-x 3 root root 4096 Oct 25 03:16 share
drwxr-xr-x 2 root root 4096 Oct 25 03:16 test

다른 일반 user들도 사용할 수 있도록 필요에 따라 /opt/caffe2의 ownership을 바꿔줍니다. (optional)

u0017649@sys-89697:~/caffe2/build$ sudo chown -R u0017649:u0017649 /opt/caffe2

이제 테스트를 해봅니다. 먼저, PYTHONPATH를 설정해줍니다.

u0017649@sys-89697:~/caffe2/build$ export PYTHONPATH=/opt/caffe2

다음과 같이 기본 테스트를 2가지 해봅니다. 먼저 import core를 해봅니다.

u0017649@sys-89697:~/caffe2/build$ python -c 'from caffe2.python import core' 2>/dev/null && echo "Success" || echo "Failure"
Success

다음으로는 operator_test.relu_op_test를 수행해봅니다. 제가 가진 환경은 GPU가 안 달린 환경이라서 아래 붉은색의 warning이 있습니다만, 여기서는 무시하십시요.

u0017649@sys-89697:~/caffe2/build$ pip install hypothesis

u0017649@sys-89697:~$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/caffe2/lib

u0017649@sys-89697:~$ python -m caffe2.python.operator_test.relu_op_test
NVIDIA: no NVIDIA devices found
WARNING: Logging before InitGoogleLogging() is written to STDERR
E1025 05:09:33.528787 12222 common_gpu.cc:70] Found an unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. I will set the available devices to be zero.
Trying example: test_relu(self=<__main__.TestRelu testMethod=test_relu>, X=array([ 0.], dtype=float32), gc=, dc=[], engine=u'')
Trying example: test_relu(self=<__main__.TestRelu testMethod=test_relu>, X=array([[[-0.03770517, -0.03770517, -0.03770517, -0.03770517],
[-0.03770517, -0.03770517, -0.03770517, -0.03770517],
[-0.03770517, -0.03770517, -0.03770517, -0.03770517]],
...
[ 0.96481699, 0.96481699, 0.96481699, 0.96481699],
[ 0.96481699, 0.96481699, 0.96481699, -0.74859387]]], dtype=float32), gc=, dc=[], engine=u'')
Trying example: test_relu(self=<__main__.TestRelu testMethod=test_relu>, X=array([ 0.61015409], dtype=float32), gc=, dc=[], engine=u'CUDNN')
.
----------------------------------------------------------------------
Ran 1 test in 0.957s

OK

테스트도 정상적으로 완료되었습니다. 이제 jupyter notebook에서 MNIST를 수행해보시지요.

먼저 jupyter notebook을 외부 browser에서 접속할 수 있도록 config를 수정합니다.

u0017649@sys-89697:~$ jupyter notebook --generate-config
Writing default config to: /home/u0017649/.jupyter/jupyter_notebook_config.py

u0017649@sys-89697:~$ vi /home/u0017649/.jupyter/jupyter_notebook_config.py
...
#c.NotebookApp.ip = 'localhost'
c.NotebookApp.ip = '*'

jupyter notebook을 kill 했다가 다시 띄웁니다. 이제 아래 붉은색처럼 token이 display 됩니다.

u0017649@sys-89697:~$ jupyter notebook &

[I 05:13:29.492 NotebookApp] Writing notebook server cookie secret to /home/u0017649/.local/share/jupyter/runtime/notebook_cookie_secret
[W 05:13:29.578 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[I 05:13:29.665 NotebookApp] Serving notebooks from local directory: /home/u0017649
[I 05:13:29.666 NotebookApp] 0 active kernels
[I 05:13:29.666 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/?token=ca44acc48f0f2daa6dc9d935904f1de9a1496546efc95768
[I 05:13:29.666 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 05:13:29.667 NotebookApp] No web browser found: could not locate runnable browser.
[C 05:13:29.667 NotebookApp]

Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=ca44acc48f0f2daa6dc9d935904f1de9a1496546efc95768

이 token을 복사해서 외부의 browser에서 입력하면 됩니다. 여기서 사용된 서버의 ip가 172.29.160.94이므로, 먼저 browser에서 http://172.29.160.94:8888의 주소를 찾아들어갑니다.

접속이 완료되면 다시 명령어 창에서 아래의 MNIST python notebook을 download 받습니다.

u0017649@sys-89697:~$ wget https://raw.githubusercontent.com/caffe2/caffe2/master/caffe2/python/tutorials/MNIST.ipynb

이 MNIST.ipynb이 jupyter 화면에 뜨면, 그걸 클릭합니다.

이제 문단 하나하나씩 play button을 누르면서 진행해보실 수 있습니다.

2017년 10월 16일 월요일

Minsky 서버에서 Tensorflow r1.3을 source로부터 build하기

2017년 10월 현재 최신 PowerAI 버전은 4.0으로서, 이 속에 포함된 tensorflow는 버전 r1.1입니다.

root@ubuntu:/tmp/kkk# dpkg -l | grep mldl
ii mldl-repo-local 4.0.0 ppc64el IBM repository for Deep Learning tools for POWER linux
ii power-mldl 4.0.0 ppc64el Meta-package for Deep Learning frameworks for POWER

root@ubuntu:/tmp/kkk# dpkg -l | grep tensor
ii ddl-tensorflow 0.9.0-4ibm1 ppc64el TensorFlow operator for IBM PowerAI Distributed Deep Learning
ii tensorflow 1.1.0-4ibm1 ppc64el Open source software library for numerical computation

현재 github에 올라온 tensorflow 최신 버전은 r1.4이며, 몇몇 최신 버전을 사용하시는 고객께서는 r1.3을 테스트하시고자 합니다. PowerAI는 대략 3개월 단위로 최신 ML/DL framework 버전을 갱신하니까, 다음 버전의 PowerAI는 아마도 11월 경에 나올 가능성이 많습니다. 그렇다면 tensorflow나 기타 framework의 최신 버전을 쓰려면 그때까지 참고 기다려야 할까요 ?

일단 기다려주십사 하는 것이 IBM의 공식 입장입니다만, 사실 꼭 그러실 필요는 없습니다. 오픈 소스니까요. 그냥 tensorflow r1.3을 직접 소스로부터 빌드하셔서 사용하실 수도 있습니다. 여기서는 그 방법을 살펴보시고, r1.3이 포함된 site-packages.tar도 올려놓겠습니다.

먼저, 여기서 제가 빌드를 위해 사용한 환경은 Ubuntu 16.04.2 LTS ppc64le입니다. CUDA는 8.0.61, libcudnn은 6.0.21, 그리고 gcc는 5.4 입니다.

u0017649@sys-89490:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial

u0017649@sys-89490:~$ dpkg -l | grep cuda
ii cuda 8.0.61-1 ppc64el CUDA meta-package
ii cuda-8-0 8.0.61-1 ppc64el CUDA 8.0 meta-package
...
ii libcudnn6 6.0.21-1+cuda8.0 ppc64el cuDNN runtime libraries
ii libcudnn6-dev 6.0.21-1+cuda8.0 ppc64el cuDNN development libraries and headers

u0017649@sys-89490:~$ /usr/bin/gcc --version
gcc (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

여기서는 아래의 tensorflow 공식 홈페이지에 나온 source로부터의 빌드 방법을 거의 그대로 따라가겠습니다.

https://www.tensorflow.org/install/install_sources#ConfigureInstallation

Tensorflow 홈페이지에서는 아래의 libcupti-dev 외에 python 관련 패키지를 잔뜩 설치해야 하라고 나오는데, 실제로는 그냥 anaconda만 설치하고 진행해도 괜찮은 것 같습니다.

u0017649@sys-89490:~$ sudo apt-get install libcupti-dev

아래는 ananconda3 설치 절차입니다.

u0017649@sys-89490:~$ wget wget https://repo.continuum.io/archive/Anaconda3-4.4.0.1-Linux-ppc64le.sh

u0017649@sys-89490:~$ chmod u+x Anaconda*

u0017649@sys-89490:~$ sudo ./Anaconda3-4.4.0.1-Linux-ppc64le.sh
...
[/home/u0017649/anaconda3] >>> /usr/local/anaconda3

u0017649@sys-89490:~$ export PATH="/usr/local/anaconda3/bin:$PATH"

여기서는 주로 u0017649 userid를 사용하므로, 그 userid로 anaconda root directory(/usr/local/anaconda3)를 자유롭게 쓸 수 있도록 그 directory의 ownership을 바꿔줍니다.

u0017649@sys-89490:~$ sudo chown -R u0017649:u0017649 /usr/local/anaconda3

이제 conda 명령을 이용해서 bazel과 numpy를 설치합니다.

u0017649@sys-89490:~$ conda install bazel numpy
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /usr/local/anaconda3:

The following NEW packages will be INSTALLED:

bazel: 0.4.5-0

The following packages will be UPDATED:

anaconda: 4.4.0-np112py36_0 --> 4.4.0-np112py36_1
conda: 4.3.21-py36_0 --> 4.3.27-py36_0

Proceed ([y]/n)? y

bazel-0.4.5-0. 100% |####################################################| Time: 0:00:09 14.58 MB/s
conda-4.3.27-p 100% |####################################################| Time: 0:00:00 10.31 MB/s
anaconda-4.4.0 100% |####################################################| Time: 0:00:00 6.57 MB/s

u0017649@sys-89490:~$ which bazel
/usr/local/anaconda3/bin/bazel

이제 tensorflow의 source를 git clone 명령으로 download 받습니다.

u0017649@sys-89490:~$ git clone --recursive https://github.com/tensorflow/tensorflow.git
Cloning into 'tensorflow'...
remote: Counting objects: 246390, done.
remote: Total 246390 (delta 0), reused 0 (delta 0), pack-reused 246390
Receiving objects: 100% (246390/246390), 125.14 MiB | 6.34 MiB/s, done.
Resolving deltas: 100% (192026/192026), done.
Checking connectivity... done.

우리가 빌드하려는 것은 r1.3으로 그 branch로 switch합니다.

u0017649@sys-89490:~$ cd tensorflow

u0017649@sys-89490:~/tensorflow$ git checkout r1.3
Checking out files: 100% (3934/3934), done.
Branch r1.3 set up to track remote branch r1.3 from origin.
Switched to a new branch 'r1.3'

이제 build 준비가 끝났습니다. Build 절차는 기본적으로, configure --> bazel로 pip package build --> wheel file build --> pip 명령으로 wheel file 설치입니다.

Configure하실 때 대부분은 default를 선택하시되, 몇몇 부분에서는 별도로 지정하셔야 합니다. ppc64le 특성에 맞춰야 하는 부분은 딱 한군데, cuDNN 6 library가 위치한 directory를 지정하는 부분 뿐입니다.

u0017649@sys-89490:~/tensorflow$ ./configure
Extracting Bazel installation...
..................
You have bazel 0.4.5- installed.
Please specify the location of python. [Default is /usr/local/anaconda3/bin/python]:
Found possible Python library paths:
/usr/local/anaconda3/lib/python3.6/site-packages
Please input the desired Python library path to use. Default is [/usr/local/anaconda3/lib/python3.6/site-packages]
Do you wish to build TensorFlow with MKL support? [y/N] n
...
Do you wish to build TensorFlow with CUDA support? [y/N] y
...
Please specify which gcc should be used by nvcc as the host compiler. [Default is /opt/at10.0/bin/gcc]: /usr/bin/gcc
...
Please specify the location where cuDNN 6 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /usr/lib/powerpc64le-linux-gnu
...
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: "3.7,6.0" # 3.7은 K80, 6.0은 P100을 위한 compute capability입니다.
...

Configure가 끝나면 bazel로 TensorFlow with GPU support를 위한 pip package를 build합니다. 이 과정 중에, 아래 예시한 것과 같이 인터넷으로부터 일부 tar.gz 파일들을 download 받는 부분이 관측됩니다. 즉, 이 과정을 위해서는 인터넷 연결이 필요합니다.

u0017649@sys-89490:~/tensorflow$ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
...
INFO: Downloading http://mirror.bazel.build/github.com/google/boringssl/archive/bbcaa15b0647816b9a\
1a9b9e0d209cd6712f0105.tar.gz: 3,244,032 bytes
...
INFO: Downloading http://mirror.bazel.build/github.com/google/boringssl/archive/bbcaa15b0647816b9a\
1a9b9e0d209cd6712f0105.tar.gz: 4,032,059 bytes
...
INFO: From Compiling tensorflow/python/pywrap_tensorflow_internal.cc:
bazel-out/local_linux-py3-opt/bin/tensorflow/python/pywrap_tensorflow_internal.cc: In function 'PyObject* _wrap_PyRecordReader_New(PyObject*, PyObject*)':
bazel-out/local_linux-py3-opt/bin/tensorflow/python/pywrap_tensorflow_internal.cc:5173:138: warning: 'arg2' may be used uninitialized in this function [-Wmaybe-uninitialized]
result = (tensorflow::io::PyRecordReader *)tensorflow::io::PyRecordReader::New((string const &)*arg1,arg2,(string const &)*arg3,arg4);
^
At global scope:
cc1plus: warning: unrecognized command line option '-Wno-self-assign'
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 7106.091s, Critical Path: 4427.34s

이런저런 warning message만 나올 뿐, 잘 끝났습니다. 저는 이 과정을 1-core짜리 cloud 환경에서 돌렸는데, 이렇게 거의 2시간 가까이 걸렸습니다.

이제 여기서 build된 binary를 이용하여 .whl file을 build합니다. 생성될 .whl 파일은 /tmp/tensorflow_pkg directory에 들어가도록 지정합니다.

u0017649@sys-89490:~/tensorflow$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
Fri Oct 13 03:40:03 EDT 2017 : === Using tmpdir: /tmp/tmp.UreS6KLhqt
~/tensorflow/bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles ~/tensorflow
~/tensorflow
/tmp/tmp.UreS6KLhqt ~/tensorflow
Fri Oct 13 03:40:23 EDT 2017 : === Building wheel
warning: no files found matching '*.dll' under directory '*'
warning: no files found matching '*.lib' under directory '*'
warning: no files found matching '*.h' under directory 'tensorflow/include/tensorflow'
warning: no files found matching '*' under directory 'tensorflow/include/Eigen'
warning: no files found matching '*' under directory 'tensorflow/include/external'
warning: no files found matching '*.h' under directory 'tensorflow/include/google'
warning: no files found matching '*' under directory 'tensorflow/include/third_party'
warning: no files found matching '*' under directory 'tensorflow/include/unsupported'
~/tensorflow
Fri Oct 13 03:44:36 EDT 2017 : === Output wheel file is in: /tmp/tensorflow_pkg

u0017649@sys-89490:~$ ls /tmp/tensorflow_pkg
tensorflow-1.3.1-cp36-cp36m-linux_ppc64le.whl

잘 끝났습니다. 이제 이 wheel 파일을 pip 명령으로 설치합니다. 단, 그 전에 PYTHONPATH 환경변수를 확실히 지정할 것을 권고합니다. 여기서도 tensorflow 설치에 필요한 tensorboard나 protobuf 등을 인터넷으로부터 자동으로 download 받는 것 같습니다.

u0017649@sys-89490:~$ export PYTHONPATH=/usr/local/anaconda3/lib/python3.6/site-packages

u0017649@sys-89490:~$ pip install /tmp/tensorflow_pkg/tensorflow-1.3.1-cp36-cp36m-linux_ppc64le.whl
Processing /tmp/tensorflow_pkg/tensorflow-1.3.1-cp36-cp36m-linux_ppc64le.whl
Requirement already satisfied: six>=1.10.0 in /usr/local/anaconda3/lib/python3.6/site-packages (from tensorflow==1.3.1)
Collecting tensorflow-tensorboard<0.2.0,>=0.1.0 (from tensorflow==1.3.1)
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.python.org', port=443): Read timed out. (read timeout=15)",)': /simple/tensorflow-tensorboard/
Downloading tensorflow_tensorboard-0.1.8-py3-none-any.whl (1.6MB)
100% |????????????????????????????????| 1.6MB 772kB/s
Requirement already satisfied: wheel>=0.26 in /usr/local/anaconda3/lib/python3.6/site-packages (from tensorflow==1.3.1)
Requirement already satisfied: numpy>=1.11.0 in /usr/local/anaconda3/lib/python3.6/site-packages (from tensorflow==1.3.1)
Collecting protobuf>=3.3.0 (from tensorflow==1.3.1)
Downloading protobuf-3.4.0-py2.py3-none-any.whl (375kB)
100% |????????????????????????????????| 378kB 2.5MB/s
Requirement already satisfied: werkzeug>=0.11.10 in /usr/local/anaconda3/lib/python3.6/site-packages (from tensorflow-tensorboard<0.2.0,>=0.1.0->tensorflow==1.3.1)
...
Successfully installed html5lib-0.9999999 markdown-2.6.9 protobuf-3.4.0 tensorflow-1.3.1 tensorflow-tensorboard-0.1.8

자, 이제 설치가 끝났습니다. conda list 명령으로 tensorflow의 설치를 확인합니다.

u0017649@sys-89490:~$ conda list | grep tensor
tensorflow 1.3.1 <pip>
tensorflow-tensorboard 0.1.8 <pip>

이제 tensorflow가 설치된 PYTHONPATH, 즉 /usr/local/anaconda3/lib/python3.6/site-packages를 통째로 tar로 말아서 K80 GPU가 설치된 제2의 서버에 가져가겠습니다. 물론 그 제2의 서버에는 미리 anaconda3가 설치되어 있어야 합니다.

u0017649@sys-89498:/usr/local/anaconda3/lib/python3.6 $ tar -zcf site-packages.tgz site-packages

u0017649@sys-89498:~$ scp /usr/local/anaconda3/lib/python3.6/site-packages.tgz firestone:~/
site-packages.tgz 100% 313MB 19.6MB/s 00:16

이제 위에서 firestone이라고 되어 있는 그 서버로 이동해서 작업합니다. 참고로, 이 firestone이라는 서버는 Ubuntu 16.04.3으로서 빌드 서버의 16.04.2보다 약간 버전이 높습니다만, 별 문제 없이 잘 수행되더군요.

먼저 anaconda3가 설치되어 있는지 확인합니다.

root@ubuntu:~# which conda
/root/anaconda3/bin/conda

여기서는 root user로 /root/anaconda3 위치에 anaconda를 설치해놓았네요. 뭐 별로 좋은 위치는 아닙니다만, 여기서는 이걸 그대로 써보겠습니다. 여기서의 PYTHONPATH, 즉, /root/anaconda3/lib/python3.6/site-packages에 저 위의 서버에서 가져온 site-packages.tgz를 그대로 풀어놓겠습니다. 기존의 site-packages 속에 들어있는 package들은 overwrite 됩니다만, 만약 새로 풀리는 site-packages 밑에 없는 패키지들이 있었다면 그것들은 그대로 보존되니까, 그냥 풀어놓으셔도 됩니다.

root@ubuntu:~# cd /root/anaconda3/lib/python3.6

root@ubuntu:~/anaconda3/lib/python3.6# tar -zxf ~/site-packages.tgz

이제 준비 완료입니다. 먼저 conda list로 tensorflow가 설치되어 있는 것으로 나오는지 확인해봅니다.

root@ubuntu:~/anaconda3/lib/python3.6# conda list | grep tensorflow
tensorflow 1.3.1 <pip>
tensorflow-tensorboard 0.1.8 <pip>

예, 잘 보이네요. 이제 PYTHONPATH를 다시 한번 확실히 선언하고, 시험 삼아 cifar10을 수행해보겠습니다.

root@ubuntu:~# export PYTHONPATH=/root/anaconda3/lib/python3.6/site-packages

root@ubuntu:~# git clone https://github.com/tensorflow/models.git
Cloning into 'models'...
remote: Counting objects: 7644, done.
remote: Total 7644 (delta 0), reused 0 (delta 0), pack-reused 7644
Receiving objects: 100% (7644/7644), 157.90 MiB | 19.09 MiB/s, done.
Resolving deltas: 100% (4139/4139), done.
Checking connectivity... done.

root@ubuntu:~# cd models/tutorials/image/cifar10

root@ubuntu:~/models/tutorials/image/cifar10# ls
BUILD cifar10_input_test.py cifar10_train.py
cifar10_eval.py cifar10_multi_gpu_train.py __init__.py
cifar10_input.py cifar10.py README.md

root@ubuntu:~/models/tutorials/image/cifar10# which python
/root/anaconda3/bin/python

Download 받은 example script 중 cifar10_train.py을 수행하면 training dataset의 download부터 training까지 일사천리로 수행됩니다.

root@ubuntu:~/models/tutorials/image/cifar10# python cifar10_train.py
>> Downloading cifar-10-binary.tar.gz 100.0%
Successfully downloaded cifar-10-binary.tar.gz 170052171 bytes.
/root/anaconda3/lib/python3.6/site-packages/numpy/core/machar.py:127: RuntimeWarning: overflow encountered in add
a = a + a
/root/anaconda3/lib/python3.6/site-packages/numpy/core/machar.py:129: RuntimeWarning: invalid value encountered in subtract
temp1 = temp - a
/root/anaconda3/lib/python3.6/site-packages/numpy/core/machar.py:138: RuntimeWarning: invalid value encountered in subtract
itemp = int_conv(temp-a)
/root/anaconda3/lib/python3.6/site-packages/numpy/core/machar.py:162: RuntimeWarning: overflow encountered in add
a = a + a
/root/anaconda3/lib/python3.6/site-packages/numpy/core/machar.py:164: RuntimeWarning: invalid value encountered in subtract
temp1 = temp - a
/root/anaconda3/lib/python3.6/site-packages/numpy/core/machar.py:171: RuntimeWarning: invalid value encountered in subtract
if any(temp-a != zero):
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
...
2017-10-13 22:18:12.974909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 2: N N Y Y
2017-10-13 22:18:12.974915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 3: N N Y Y
2017-10-13 22:18:12.974933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:03:00.0)
2017-10-13 22:18:12.974942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:04:00.0)
2017-10-13 22:18:12.974952: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0020:03:00.0)
2017-10-13 22:18:12.974960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0020:04:00.0)
2017-10-13 22:18:22.502001: step 0, loss = 4.67 (52.2 examples/sec; 2.451 sec/batch)
2017-10-13 22:18:22.917980: step 10, loss = 4.62 (3076.9 examples/sec; 0.042 sec/batch)
2017-10-13 22:18:23.240470: step 20, loss = 4.43 (3969.1 examples/sec; 0.032 sec/batch)
2017-10-13 22:18:23.538250: step 30, loss = 4.41 (4298.5 examples/sec; 0.030 sec/batch)
2017-10-13 22:18:23.837076: step 40, loss = 4.29 (4283.4 examples/sec; 0.030 sec/batch)
2017-10-13 22:18:24.134108: step 50, loss = 4.33 (4309.3 examples/sec; 0.030 sec/batch)
2017-10-13 22:18:24.426799: step 60, loss = 4.30 (4373.2 examples/sec; 0.029 sec/batch)
...
2017-10-14 07:09:50.128824: step 999920, loss = 0.10 (4352.2 examples/sec; 0.029 sec/batch)
2017-10-14 07:09:50.428406: step 999930, loss = 0.11 (4272.7 examples/sec; 0.030 sec/batch)
2017-10-14 07:09:50.722993: step 999940, loss = 0.14 (4345.0 examples/sec; 0.029 sec/batch)
2017-10-14 07:09:51.019963: step 999950, loss = 0.11 (4310.2 examples/sec; 0.030 sec/batch)
2017-10-14 07:09:51.317448: step 999960, loss = 0.13 (4302.7 examples/sec; 0.030 sec/batch)
2017-10-14 07:09:51.617180: step 999970, loss = 0.11 (4270.7 examples/sec; 0.030 sec/batch)
2017-10-14 07:09:51.914633: step 999980, loss = 0.14 (4303.0 examples/sec; 0.030 sec/batch)
2017-10-14 07:09:52.213837: step 999990, loss = 0.13 (4278.1 examples/sec; 0.030 sec/batch)

완벽하게 잘 수행되었습니다.

아래의 제 구글 드라이브에 site-packages.tgz를 올려두었습니다. 사이즈는 330MB 정도인 이 파일은 ppc64le 아키텍처의 Ubuntu 16.04, CUDA 8.0, libcuDNN 6.0 환경의 K80 또는 P100을 활용하실 수 있도록 빌드된 것입니다. 필요하시면 마음대로 가져가 쓰셔도 됩니다. 제가 책임은 질 수 없는 패키지라는 것은 양해 바랍니다.

https://drive.google.com/open?id=0B-F0jEb44gqUZGdYb3A2QUNaenM