Deep Learning for HW Engineers: 2017

Thursday, December 28, 2017

How to set up crashdump on Ubuntu 16.04 ppc64le

For details on setting up crashdump, see https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html.

First, search /proc/cmdline and the dmesg output for the word crash, as shown below. If nothing is found, crashdump is not enabled.

u0017649@sys-90592:~$ cat /proc/cmdline | grep crash

u0017649@sys-90592:~$ dmesg | grep -i crash

To enable crashdump, install linux-crashdump. During installation you are asked whether to enable kexec-tools and kdump-tools, as shown below; answer Yes to both.

u0017649@sys-90592:~$ sudo apt install linux-crashdump

┌────────────────────────┤ Configuring kexec-tools ├────────────────────────┐
│ │
│ If you choose this option, a system reboot will trigger a restart into a │
│ kernel loaded by kexec instead of going through the full system boot │
│ loader process. │
│ │
│ Should kexec-tools handle reboots? │
│ │
│ <Yes> <No> │
│ │
└───────────────────────────────────────────────────────────────────────────┘

┌────────────────────────┤ Configuring kdump-tools ├────────────────────────┐
│ │
│ If you choose this option, the kdump-tools mechanism will be enabled. A │
│ reboot is still required in order to enable the crashkernel kernel │
│ parameter. │
│ │
│ Should kdump-tools be enabled by default? │
│ │
│ <Yes> <No> │
│ │
└───────────────────────────────────────────────────────────────────────────┘

This is probably not the case on a customer's Minsky system, but if the file below has USE_GRUB_CONFIG=false, as it does in the PDP (Power Development Platform) cloud environment I use, change it to true so that the crashdump settings take effect.

u0017649@sys-90592:~$ sudo vi /etc/default/kexec
...
#USE_GRUB_CONFIG=false
USE_GRUB_CONFIG=true
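
If you would rather not edit the file by hand, the same change can be sketched as a small filter. This is only a sketch under my own assumptions: set_use_grub_config is a made-up helper name, and it assumes the file already contains an uncommented USE_GRUB_CONFIG= line.

```shell
# Sketch only: set_use_grub_config is a hypothetical helper, not part of kexec-tools.
# It reads the contents of /etc/default/kexec on stdin and writes them to stdout
# with any uncommented USE_GRUB_CONFIG= line forced to true.
set_use_grub_config() {
    sed 's/^USE_GRUB_CONFIG=.*/USE_GRUB_CONFIG=true/'
}

# Possible usage (keeps a backup; not run here):
#   sudo cp /etc/default/kexec /etc/default/kexec.bak
#   set_use_grub_config < /etc/default/kexec.bak | sudo tee /etc/default/kexec > /dev/null
```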

Now reboot the system.

u0017649@sys-90592:~$ sudo shutdown -r now

Once the system has booted, confirm that /var/crash is empty.

u0017649@sys-90592:~$ ls /var/crash

Also confirm that crashkernel entries, which were absent before, now appear in /proc/cmdline and dmesg.

u0017649@sys-90592:~$ cat /proc/cmdline | grep crash
root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet crashkernel=384M-2G:128M,2G-:256M

u0017649@sys-90592:~$ dmesg | grep -i crash
[ 0.000000] Reserving 256MB of memory at 128MB for crashkernel (System RAM: 4096MB)
[ 0.000000] Kernel command line: root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet crashkernel=384M-2G:128M,2G-:256M
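
The crashkernel value is a list of RAM ranges and reservation sizes. With 384M-2G:128M,2G-:256M, a system with at least 384 MB but less than 2 GB of RAM reserves 128 MB, and one with 2 GB or more reserves 256 MB. That matches the 256MB reserved on this 4096MB VM in the dmesg output above. A rough sketch of that lookup follows; the helper names are my own invention and only the M and G suffixes are handled.

```shell
# Sketch only: to_mb and crashkernel_reserved_mb are hypothetical helpers.

# to_mb SIZE: convert a size such as 384M or 2G to whole megabytes.
to_mb() {
    case "$1" in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  return 1 ;;
    esac
}

# crashkernel_reserved_mb SPEC RAM_MB: print the reservation (in MB) that the
# range list SPEC selects for a machine with RAM_MB of memory, or 0 if none.
crashkernel_reserved_mb() {
    spec=$1
    ram=$2
    old_ifs=$IFS
    IFS=,
    for entry in $spec; do
        range=${entry%%:*}          # e.g. 384M-2G or 2G-
        size=${entry#*:}            # e.g. 128M
        start=$(to_mb "${range%%-*}")
        end=${range#*-}             # empty means "no upper bound"
        if [ -n "$end" ]; then end=$(to_mb "$end"); fi
        if [ "$ram" -ge "$start" ] && { [ -z "$end" ] || [ "$ram" -lt "$end" ]; }; then
            IFS=$old_ifs
            to_mb "$size"
            return 0
        fi
    done
    IFS=$old_ifs
    echo 0
}

# The VM in this post has 4096 MB of RAM:
crashkernel_reserved_mb '384M-2G:128M,2G-:256M' 4096   # prints 256
```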

Now check whether crashdump is ready with the kdump-config show command.

u0017649@sys-90592:~$ sudo kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.4.0-104-generic
kdump initrd:
/var/lib/kdump/initrd.img: broken symbolic link to /var/lib/kdump/initrd.img-4.4.0-104-generic
current state: Not ready to kdump

Here the state is Not ready. The reason is that in the PDP (Power Development Platform) cloud environment I use, /var/lib/kdump contains initrd.img-4.4.0-98-generic instead of the initrd.img-4.4.0-104-generic that the symbolic link points to. A customer's Minsky server will probably not have this problem. Since it is specific to PDP, I simply work around it by copying /var/lib/kdump/initrd.img-4.4.0-98-generic to the name /var/lib/kdump/initrd.img-4.4.0-104-generic.

u0017649@sys-90592:~$ sudo cp /var/lib/kdump/initrd.img-4.4.0-98-generic /var/lib/kdump/initrd.img-4.4.0-104-generic

Now run kdump-config load and then kdump-config show again. The state changes to ready to kdump.

u0017649@sys-90592:~$ sudo kdump-config load
Modified cmdline:root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service elfcorehdr=156864K
* loaded kdump kernel

u0017649@sys-90592:~$ sudo kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.4.0-104-generic
kdump initrd:
/var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.4.0-104-generic
current state: ready to kdump

kexec command:
/sbin/kexec -p --command-line="root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
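
If you want to check this state from a script rather than by eye, a minimal sketch could look like the following. The helper names are hypothetical, and it relies only on the "current state:" line shown in the output above.

```shell
# Sketch only: kdump_state and kdump_ready are hypothetical helper names.

# kdump_state: read `kdump-config show` output on stdin and print whatever
# follows "current state:".
kdump_state() {
    sed -n 's/^current state:[[:space:]]*//p'
}

# kdump_ready: succeed only when that state is exactly "ready to kdump".
kdump_ready() {
    [ "$(kdump_state)" = "ready to kdump" ]
}

# Possible usage (not run here):
#   sudo kdump-config show | kdump_ready && echo "kdump is armed"
```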

Crashdump can be tested as follows. Note that the system crashes when this is run and may not reboot properly, so be sure to secure console access first, for example through IPMI or an RGB monitor.

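Log in as root with "su -" rather than using sudo (with sudo, the redirection into /proc/sysrq-trigger is opened by your own unprivileged shell and is refused). Then run the following command; the system crashes and writes a dump into the /var/crash directory.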

root@sys-90592:~# echo c > /proc/sysrq-trigger
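
A commonly used alternative to a root login is to pipe the value into sudo tee, so that a root process opens the file. I did not use this in the steps here, so treat it as a sketch; write_as_root and the SUDO variable are names I made up for illustration.

```shell
# Sketch only: write_as_root is a hypothetical helper.
# write_as_root VALUE PATH: write VALUE to PATH through `sudo tee`, so the
# file is opened by a root process rather than by the calling shell.
# Setting SUDO to an empty string skips sudo, which is handy for a dry run
# against an ordinary file.
write_as_root() {
    printf '%s\n' "$1" | ${SUDO-sudo} tee "$2" > /dev/null
}

# Example (crashes the machine immediately; only with a console attached):
#   write_as_root c /proc/sysrq-trigger
```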

After the reboot completes, you will find a crash file named along the following lines.

# ls /var/crash
linux-image-3.0.0-12-server.0.crash

According to the official Ubuntu documentation, this file can be analyzed by following https://www.dedoimedo.com/computers/crash-analyze.html, but I have no experience with it and it does not look easy.

In addition, when you need to collect system information to send to your technical support provider, sosreport is a good choice. This tool originated on Red Hat but is also available on Ubuntu. First install sosreport as follows.

u0017649@sys-90528:~$ sudo apt-get install sosreport

Then simply run sosreport (root privileges are required). The result is created under /tmp in tar.xz format.

u0017649@sys-90528:~$ sudo sosreport
...
Please enter your first initial and last name [sys-90528]:
Please enter the case id that you are generating this report for []:
Setting up archive ...
Setting up plugins ...
Running plugins. Please wait ...

Running 23/61: kernel...
Creating compressed archive...

Your sosreport has been generated and saved in:
/tmp/sosreport-sys-90528-20171227231646.tar.xz

The checksum is: 939ce2d2b6a254fbb576a2f6728b5678

Please send this file to your support representative.

u0017649@sys-90528:~$ ls -l /tmp/sosreport-sys-90528-20171227231646.tar.xz
-rw------- 1 root root 5231340 Dec 27 23:17 /tmp/sosreport-sys-90528-20171227231646.tar.xz
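
Before sending the archive, it can be worth comparing it against the checksum that sosreport printed. The 32-hex-digit value above has the length of an MD5 digest, so the sketch below assumes MD5; verify_md5 is a hypothetical helper name.

```shell
# Sketch only: verify_md5 is a hypothetical helper, and MD5 is an assumption
# based on the length of the checksum printed by sosreport.
# verify_md5 FILE EXPECTED: succeed if FILE's MD5 digest equals EXPECTED.
verify_md5() {
    actual=$(md5sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Possible usage (the archive is readable only by root, so run this as root):
#   verify_md5 /tmp/sosreport-sys-90528-20171227231646.tar.xz \
#       939ce2d2b6a254fbb576a2f6728b5678 && echo "checksum matches"
```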

The archive contains nothing special; it is essentially a collection of the system's main config files and similar diagnostic data.

u0017649@sys-90528:~$ sudo tar -tf /tmp/sosreport-sys-90528-20171227231646.tar.xz | more

...

sosreport-sys-90528-20171227231646/lib/udev/rules.d/60-cdrom_id.rules
sosreport-sys-90528-20171227231646/lib/udev/rules.d/80-docker-ce.rules
sosreport-sys-90528-20171227231646/lib/nvidia-361/
sosreport-sys-90528-20171227231646/lib/nvidia-361/modprobe.conf
sosreport-sys-90528-20171227231646/lib/modules/
sosreport-sys-90528-20171227231646/lib/modules/4.4.0-98-generic/
sosreport-sys-90528-20171227231646/lib/modules/4.4.0-98-generic/modules.dep
...

sosreport-sys-90528-20171227231646/etc/rcS.d/S03udev
sosreport-sys-90528-20171227231646/etc/rcS.d/S02plymouth-log
sosreport-sys-90528-20171227231646/etc/rcS.d/S08checkfs.sh

Monday, December 18, 2017

ClamAV: open source anti-virus SW available on ppc64le

There is anti-virus SW that also works on ppc64le (IBM POWER8), the architecture of Minsky: Clam AntiVirus (ClamAV).

ClamAV is open source anti-virus SW. Its home page is below, and the source can be downloaded there.

http://www.clamav.net/

Building it on ppc64le is also very simple: just run ./configure && make && sudo make install.

On Ubuntu, however, which is the OS mainly used for deep learning, ClamAV is already in the standard apt repository, so it is easy to install and use.

Install it with apt-get install as follows.

u0017649@sys-89983:~$ sudo apt-get install clamav clamav-daemon clamav-freshclam clamav-base libclamav-dev clamav-testfiles

Start clamav-daemon as follows.

u0017649@sys-89983:~$ sudo systemctl start clamav-daemon.service

u0017649@sys-89983:~$ sudo systemctl status clamav-daemon.service
● clamav-daemon.service - Clam AntiVirus userspace daemon
Loaded: loaded (/lib/systemd/system/clamav-daemon.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2017-12-17 21:07:57 EST; 4s ago
Docs: man:clamd(8)
man:clamd.conf(5)
http://www.clamav.net/lang/en/doc/
Main PID: 9462 (clamd)
Tasks: 1
Memory: 234.0M
CPU: 4.121s
CGroup: /system.slice/clamav-daemon.service
└─9462 /usr/sbin/clamd --foreground=true

The following command scans the directory /home/u0017649/hpcc-1.5.0 recursively (-r), prints only infected files (-i), and sounds a bell (--bell) if a virus-infected file is found.

u0017649@sys-89983:~$ clamscan -r --bell -i /home/u0017649/hpcc-1.5.0

----------- SCAN SUMMARY -----------
Known viruses: 6366898
Engine version: 0.99.2
Scanned directories: 34
Scanned files: 737
Infected files: 0
Data scanned: 9.64 MB
Data read: 6.11 MB (ratio 1.58:1)
Time: 18.255 sec (0 m 18 s)

If you also want infected files removed automatically, use the --remove option as shown below. Since virus-infected files are very hard to come by on the ppc64le architecture, I downloaded the clamav source code from the clamav home page mentioned above and scanned that instead. It appears to include files intended for testing.

u0017649@sys-89983:~$ tar -zxf clamav-0.99.2.tar.gz

u0017649@sys-89983:~$ clamscan -r --remove /home/u0017649/clamav-0.99.2 > clamscan.out

As shown below, four files were reported as infected and removed.

u0017649@sys-89983:~$ grep -i removed clamscan.out
/home/u0017649/clamav-0.99.2/test/.split/split.clam_IScab_int.exeaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clam.isoaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clam_IScab_ext.exeaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clamjol.isoaa: Removed.

u0017649@sys-89983:~$ tail clamscan.out

----------- SCAN SUMMARY -----------
Known viruses: 6366898
Engine version: 0.99.2
Scanned directories: 227
Scanned files: 3231
Infected files: 4
Data scanned: 93.26 MB
Data read: 50.67 MB (ratio 1.84:1)
Time: 28.488 sec (0 m 28 s)
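
If you run a scan like this from a script, the "Infected files" line of the summary is the easiest thing to pick out. A minimal sketch, with infected_count as a hypothetical helper name:

```shell
# Sketch only: infected_count is a hypothetical helper.
# It reads clamscan output on stdin and prints the number on the
# "Infected files:" line of the scan summary.
infected_count() {
    awk -F': *' '/^Infected files:/ { print $2 }'
}

# Possible usage with the file produced above (not run here):
#   infected_count < clamscan.out
```

clamscan also reports this through its exit status: 0 when no virus is found, 1 when at least one is found, and 2 on errors.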

For anti-virus SW it is important that the virus signature database is kept up to date. That is the job of freshclam. It runs automatically once installed, and its log can be checked as follows.

u0017649@sys-89983:~$ sudo tail -f /var/log/clamav/freshclam.log
Sun Dec 17 20:57:04 2017 -> ClamAV update process started at Sun Dec 17 20:57:04 2017
Sun Dec 17 20:58:02 2017 -> Downloading main.cvd [100%]
Sun Dec 17 20:58:13 2017 -> main.cvd updated (version: 58, sigs: 4566249, f-level: 60, builder: sigmgr)
Sun Dec 17 20:58:35 2017 -> Downloading daily.cvd [100%]
Sun Dec 17 20:58:39 2017 -> daily.cvd updated (version: 24138, sigs: 1806393, f-level: 63, builder: neo)
Sun Dec 17 20:58:40 2017 -> Downloading bytecode.cvd [100%]
Sun Dec 17 20:58:40 2017 -> bytecode.cvd updated (version: 319, sigs: 75, f-level: 63, builder: neo)
Sun Dec 17 20:58:44 2017 -> Database updated (6372717 signatures) from db.local.clamav.net (IP: 157.131.0.17)
Sun Dec 17 20:58:44 2017 -> WARNING: Clamd was NOT notified: Can't connect to clamd through /var/run/clamav/clamd.ctl: No such file or directory
Sun Dec 17 20:58:44 2017 -> --------------------------------------

The log above shows that clamd was not notified because the clamd.ctl file did not exist. That file is created automatically when clamav-daemon first starts, and it seems this happened because freshclam ran before I started clamav-daemon with the 'systemctl start clamav-daemon.service' command above. Now that clamav-daemon is running, killing and restarting freshclam as follows resolves it.

u0017649@sys-89983:~$ ps -ef | grep freshclam
clamav 8894 1 1 20:57 ? 00:00:10 /usr/bin/freshclam -d --foreground=true
u0017649 9473 31958 0 21:08 pts/0 00:00:00 grep --color=auto freshclam

u0017649@sys-89983:~$ sudo kill -9 8894

u0017649@sys-89983:~$ sudo /usr/bin/freshclam -d --foreground=false

Above, freshclam was started as a background daemon. Let's look at the log again.

u0017649@sys-89983:~$ sudo tail -f /var/log/clamav/freshclam.log
Sun Dec 17 20:58:44 2017 -> Database updated (6372717 signatures) from db.local.clamav.net (IP: 157.131.0.17)
Sun Dec 17 20:58:44 2017 -> WARNING: Clamd was NOT notified: Can't connect to clamd through /var/run/clamav/clamd.ctl: No such file or directory
Sun Dec 17 20:58:44 2017 -> --------------------------------------
Sun Dec 17 21:09:37 2017 -> --------------------------------------
Sun Dec 17 21:09:37 2017 -> freshclam daemon 0.99.2 (OS: linux-gnu, ARCH: ppc, CPU: powerpc64le)
Sun Dec 17 21:09:37 2017 -> ClamAV update process started at Sun Dec 17 21:09:37 2017
Sun Dec 17 21:09:37 2017 -> main.cvd is up to date (version: 58, sigs: 4566249, f-level: 60, builder: sigmgr)
Sun Dec 17 21:09:37 2017 -> daily.cvd is up to date (version: 24138, sigs: 1806393, f-level: 63, builder: neo)
Sun Dec 17 21:09:37 2017 -> bytecode.cvd is up to date (version: 319, sigs: 75, f-level: 63, builder: neo)
Sun Dec 17 21:09:37 2017 -> --------------------------------------

Now you can see that the update check completed without errors.
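
To check for this warning without reading the log by eye, one option is to look only at the most recent update run. This is a sketch with hypothetical helper names; it treats everything from the last "ClamAV update process started" line onward as the latest run.

```shell
# Sketch only: last_update_run and clamd_not_notified are hypothetical helpers.

# last_update_run: read a freshclam log on stdin and print only the lines
# from the last "ClamAV update process started" line to the end.
last_update_run() {
    awk '/ClamAV update process started/ { buf = "" }
         { buf = buf $0 "\n" }
         END { printf "%s", buf }'
}

# clamd_not_notified: succeed if the latest run contains the warning.
clamd_not_notified() {
    last_update_run | grep -q 'Clamd was NOT notified'
}

# Possible usage (not run here):
#   sudo cat /var/log/clamav/freshclam.log | clamd_not_notified && echo "restart freshclam"
```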

The clamconf command checks the various clamav config files. Its output is shown below.

u0017649@sys-89983:~$ clamconf
Checking configuration files in /etc/clamav

Config file: clamd.conf
-----------------------
LogFile = "/var/log/clamav/clamav.log"
StatsHostID = "auto"
StatsEnabled disabled
StatsPEDisabled = "yes"
StatsTimeout = "10"
LogFileUnlock disabled
LogFileMaxSize = "4294967295"
LogTime = "yes"
LogClean disabled
LogSyslog disabled
LogFacility = "LOG_LOCAL6"
LogVerbose disabled
LogRotate = "yes"
ExtendedDetectionInfo = "yes"
PidFile disabled
TemporaryDirectory disabled
DatabaseDirectory = "/var/lib/clamav"
OfficialDatabaseOnly disabled
LocalSocket = "/var/run/clamav/clamd.ctl"
LocalSocketGroup = "clamav"
LocalSocketMode = "666"
FixStaleSocket = "yes"
TCPSocket disabled
TCPAddr disabled
MaxConnectionQueueLength = "15"
StreamMaxLength = "26214400"
StreamMinPort = "1024"
StreamMaxPort = "2048"
MaxThreads = "12"
ReadTimeout = "180"
CommandReadTimeout = "5"
SendBufTimeout = "200"
MaxQueue = "100"
IdleTimeout = "30"
ExcludePath disabled
MaxDirectoryRecursion = "15"
FollowDirectorySymlinks disabled
FollowFileSymlinks disabled
CrossFilesystems = "yes"
SelfCheck = "3600"
DisableCache disabled
VirusEvent disabled
ExitOnOOM disabled
AllowAllMatchScan = "yes"
Foreground disabled
Debug disabled
LeaveTemporaryFiles disabled
User = "clamav"
AllowSupplementaryGroups disabled
Bytecode = "yes"
BytecodeSecurity = "TrustSigned"
BytecodeTimeout = "60000"
BytecodeUnsigned disabled
BytecodeMode = "Auto"
DetectPUA disabled
ExcludePUA disabled
IncludePUA disabled
AlgorithmicDetection = "yes"
ScanPE = "yes"
ScanELF = "yes"
DetectBrokenExecutables disabled
ScanMail = "yes"
ScanPartialMessages disabled
PhishingSignatures = "yes"
PhishingScanURLs = "yes"
PhishingAlwaysBlockCloak disabled
PhishingAlwaysBlockSSLMismatch disabled
PartitionIntersection disabled
HeuristicScanPrecedence disabled
StructuredDataDetection disabled
StructuredMinCreditCardCount = "3"
StructuredMinSSNCount = "3"
StructuredSSNFormatNormal = "yes"
StructuredSSNFormatStripped disabled
ScanHTML = "yes"
ScanOLE2 = "yes"
OLE2BlockMacros disabled
ScanPDF = "yes"
ScanSWF = "yes"
ScanXMLDOCS = "yes"
ScanHWP3 = "yes"
ScanArchive = "yes"
ArchiveBlockEncrypted disabled
ForceToDisk disabled
MaxScanSize = "104857600"
MaxFileSize = "26214400"
MaxRecursion = "16"
MaxFiles = "10000"
MaxEmbeddedPE = "10485760"
MaxHTMLNormalize = "10485760"
MaxHTMLNoTags = "2097152"
MaxScriptNormalize = "5242880"
MaxZipTypeRcg = "1048576"
MaxPartitions = "50"
MaxIconsPE = "100"
MaxRecHWP3 = "16"
PCREMatchLimit = "10000"
PCRERecMatchLimit = "5000"
PCREMaxFileSize = "26214400"
ScanOnAccess disabled
OnAccessMountPath disabled
OnAccessIncludePath disabled
OnAccessExcludePath disabled
OnAccessExcludeUID disabled
OnAccessMaxFileSize = "5242880"
OnAccessDisableDDD disabled
OnAccessPrevention disabled
OnAccessExtraScanning disabled
DevACOnly disabled
DevACDepth disabled
DevPerformance disabled
DevLiblog disabled
DisableCertCheck disabled

Config file: freshclam.conf
---------------------------
StatsHostID disabled
StatsEnabled disabled
StatsTimeout disabled
LogFileMaxSize = "4294967295"
LogTime = "yes"
LogSyslog disabled
LogFacility = "LOG_LOCAL6"
LogVerbose disabled
LogRotate = "yes"
PidFile disabled
DatabaseDirectory = "/var/lib/clamav"
Foreground disabled
Debug disabled
AllowSupplementaryGroups disabled
UpdateLogFile = "/var/log/clamav/freshclam.log"
DatabaseOwner = "clamav"
Checks = "24"
DNSDatabaseInfo = "current.cvd.clamav.net"
DatabaseMirror = "db.local.clamav.net", "database.clamav.net"
PrivateMirror disabled
MaxAttempts = "5"
ScriptedUpdates = "yes"
TestDatabases = "yes"
CompressLocalDatabase disabled
ExtraDatabase disabled
DatabaseCustomURL disabled
HTTPProxyServer disabled
HTTPProxyPort disabled
HTTPProxyUsername disabled
HTTPProxyPassword disabled
HTTPUserAgent disabled
NotifyClamd = "/etc/clamav/clamd.conf"
OnUpdateExecute disabled
OnErrorExecute disabled
OnOutdatedExecute disabled
LocalIPAddress disabled
ConnectTimeout = "30"
ReceiveTimeout = "30"
SubmitDetectionStats disabled
DetectionStatsCountry disabled
DetectionStatsHostID disabled
SafeBrowsing disabled
Bytecode = "yes"

clamav-milter.conf not found

Software settings
-----------------
Version: 0.99.2
Optional features supported: MEMPOOL IPv6 FRESHCLAM_DNS_FIX AUTOIT_EA06 BZIP2 LIBXML2 PCRE ICONV JSON

Database information
--------------------
Database directory: /var/lib/clamav
bytecode.cvd: version 319, sigs: 75, built on Wed Dec 6 21:17:11 2017
main.cvd: version 58, sigs: 4566249, built on Wed Jun 7 17:38:10 2017
daily.cvd: version 24138, sigs: 1806393, built on Sun Dec 17 16:10:39 2017
Total number of signatures: 6372717

Platform information
--------------------
uname: Linux 4.4.0-103-generic #126-Ubuntu SMP Mon Dec 4 16:22:09 UTC 2017 ppc64le
OS: linux-gnu, ARCH: ppc, CPU: powerpc64le
Full OS version: Ubuntu 16.04.2 LTS
zlib version: 1.2.8 (1.2.8), compile flags: a9
platform id: 0x0a3152520800000000050400

Build information
-----------------
GNU C: 5.4.0 20160609 (5.4.0)
CPPFLAGS: -Wdate-time -D_FORTIFY_SOURCE=2
CFLAGS: -g -O3 -fstack-protector-strong -Wformat -Werror=format-security -Wall -D_FILE_OFFSET_BITS=64 -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE
CXXFLAGS:
LDFLAGS: -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,--as-needed
Configure: '--build=powerpc64le-linux-gnu' '--prefix=/usr' '--includedir=/usr/include' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libexecdir=/usr/lib/clamav' '--disable-maintainer-mode' '--disable-dependency-tracking' 'CFLAGS=-g -O3 -fstack-protector-strong -Wformat -Werror=format-security -Wall -D_FILE_OFFSET_BITS=64' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O3 -fstack-protector-strong -Wformat -Werror=format-security -Wall -D_FILE_OFFSET_BITS=64' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,--as-needed' '--with-dbdir=/var/lib/clamav' '--sysconfdir=/etc/clamav' '--disable-clamav' '--disable-unrar' '--enable-milter' '--enable-dns-fix' '--with-libjson' '--with-gnu-ld' '--with-systemdsystemunitdir=/lib/systemd/system' 'build_alias=powerpc64le-linux-gnu'
sizeof(void*) = 8
Engine flevel: 82, dconf: 82
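
The "Total number of signatures" in the Database information part of this output is just the sum of the three database files: 75 + 4566249 + 1806393 = 6372717. A small sketch that recomputes it from clamconf output, with total_sigs as a hypothetical helper name:

```shell
# Sketch only: total_sigs is a hypothetical helper.
# It reads clamconf output on stdin and sums the number after "sigs:" on
# every database line (*.cvd or *.cld).
total_sigs() {
    awk '/\.c[vl]d: version/ {
             for (i = 1; i < NF; i++)
                 if ($i == "sigs:") { n = $(i + 1); sub(/,/, "", n); sum += n }
         }
         END { print sum + 0 }'
}

# Possible usage (not run here):
#   clamconf | total_sigs
```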

Thursday, December 14, 2017

How to run HPCC on a ppc64le architecture cluster

HPCC is a suite of seven HPC codes, including HPL (High Performance LINPACK), for measuring supercomputer performance in a simple way. Its home page is below.

http://icl.cs.utk.edu/hpcc/software/index.html

The information there alone is not enough to compile and run it easily, but the HPL instructions at the site below make things somewhat clearer.

http://www.crc.nd.edu/~rich/CRC_Summer_Scholars_2014/HPL-HowTo.pdf

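Understanding what these tests actually do requires some mathematical background, but a system engineer can get them running without it. Below, I lay out step by step how to run HPCC on the ppc64le architecture, that is, in an IBM POWER8 processor environment. In fact, the procedure on ppc64le is not really different from the one on x86.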

Here I used two 1-core ppc64le Ubuntu 16.04 virtual machines in the PDP (Power Development Platform) environment.
(* Power Development Platform: apply at https://www-356.ibm.com/partnerworld/wps/servlet/ContentHandler/stg_com_sys_power-development-platform to get a 1-core Linux on POWER environment free for 2 weeks. After the 2 weeks you can apply again for free. Up to 5 VMs can be requested at once.)

openmpi and BLAS must be installed first, as shown below. They are easily installed with sudo apt-get install libopenmpi-dev libblas-dev.

u0017649@sys-90393:~/hpcc-1.5.0$ dpkg -l | grep openmpi
ii libopenmpi-dev 1.10.2-8ubuntu1 ppc64el high performance message passing library -- header files
ii libopenmpi1.10 1.10.2-8ubuntu1 ppc64el high performance message passing library -- shared library
ii openmpi-bin 1.10.2-8ubuntu1 ppc64el high performance message passing library -- binaries
ii openmpi-common 1.10.2-8ubuntu1 all high performance message passing library -- common files

u0017649@sys-90393:~/hpcc-1.5.0$ dpkg -l | grep blas
ii libblas-common 3.6.0-2ubuntu2 ppc64el Dependency package for all BLAS implementations
ii libblas-dev 3.6.0-2ubuntu2 ppc64el Basic Linear Algebra Subroutines 3, static library
ii libblas3 3.6.0-2ubuntu2 ppc64el Basic Linear Algebra Reference implementations, shared library

First download the source and extract the tar file.

u0017649@sys-90393:~$ wget http://icl.cs.utk.edu/projectsfiles/hpcc/download/hpcc-1.5.0.tar.gz

u0017649@sys-90393:~$ tar -zxf hpcc-1.5.0.tar.gz
u0017649@sys-90393:~$ cd hpcc-1.5.0

Next, run make_generic in the hpl/setup directory to generate Make.UNKNOWN. This produces a Makefile with values roughly suited to this environment.

u0017649@sys-90393:~/hpcc-1.5.0$ cd hpl/setup

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ sh make_generic

The generated Make.UNKNOWN contains the following.

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ grep -v \# Make.UNKNOWN
SHELL = /bin/sh
CD = cd
CP = cp
LN_S = ln -s
MKDIR = mkdir
RM = /bin/rm -f
TOUCH = touch
ARCH = $(arch)
TOPdir = ../../..
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
HPLlib = $(LIBdir)/libhpl.a
MPdir =
MPinc =
MPlib =
LAdir =
LAinc =
LAlib = -lblas
F2CDEFS = -DAdd_ -DF77_INTEGER=int -DStringSunStyle
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib) -lm
HPL_OPTS =
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CC = mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS)
LINKER = mpif77
LINKFLAGS =
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo

Now copy this Make.UNKNOWN to the parent hpl directory under the name Make.Linux.

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ cp Make.UNKNOWN ../Make.Linux

Then go up to the top-level directory, hpcc-1.5.0, and run make arch=Linux. This can be a little confusing: make is run not in the hpl directory where Make.Linux was copied, but in the hpcc-1.5.0 directory above it. mpicc then runs as shown below and builds all seven HPC codes.

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ cd ../..

u0017649@sys-90393:~/hpcc-1.5.0$ make arch=Linux
...
mpicc -o ../../../../FFT/wrapfftw.o -c ../../../../FFT/wrapfftw.c -I../../../../include -DAdd_ -DF77_INTEGER=int -DStringSunStyle -I../../../include -I../../../include/Linux
mpicc -o ../../../../FFT/wrapmpifftw.o -c ../../../../FFT/wrapmpifftw.c -I../../../../include -DAdd_ -DF77_INTEGER=int -DStringSunStyle -I../../../include -I../../../include/Linux
...
ar: creating ../../../lib/Linux/libhpl.a
echo ../../../lib/Linux/libhpl.a
../../../lib/Linux/libhpl.a
mpif77 -o ../../../../hpcc ../../../lib/Linux/libhpl.a -lblas -lm
make[1]: Leaving directory '/home/u0017649/hpcc-1.5.0/hpl/lib/arch/build'

As a result, an executable named hpcc is created in the hpcc-1.5.0 directory.

u0017649@sys-90393:~/hpcc-1.5.0$ file hpcc
hpcc: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), dynamically linked, interpreter /opt/at10.0/lib64/ld64.so.2, for GNU/Linux 4.4.0, BuildID[sha1]=b47fb43c4d96819e25da7469049a780f8251458b, not stripped

Running this hpcc file runs all seven HPC codes in sequence. Before doing so, set LD_LIBRARY_PATH as follows.

u0017649@sys-90393:~/hpcc-1.5.0$ export LD_LIBRARY_PATH=/usr/lib:/usr/lib/powerpc64le-linux-gnu:$LD_LIBRARY_PATH

You also need to create an INPUT data file. You can simply copy the supplied _hpccinf.txt to hpccinf.txt and use it as is, but here I follow http://www.netlib.org/benchmark/hpl/tuning.html. The INPUT data file has a fixed format in which each line position carries a specific piece of information, so enter it exactly; for the meaning of each line, see the tuning.html mentioned above. Note, however, that with the P x Q grids given there (1 x 6 and 2 x 8), the larger grid, 2 x 8, needs a total of 16 processes to run.

u0017649@sys-90393:~/hpcc-1.5.0$ vi hpccinf.txt
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
3000 6000 10000 Ns
5 # of NBs
80 100 120 140 160 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 2 Ps
6 8 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)

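Running it as is fails with an error saying at least 16 processes are needed, as shown below. This is because hpcc launched this way runs as a single process, and my PDP environment has only 1 core in any case.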

u0017649@sys-90393:~/hpcc-1.5.0$ ./hpcc
HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 16 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 16 processes for these tests <<<

So I change lines 11 and 12 above, the P x Q information, to 1 as shown below.

1 1 Ps
1 1 Qs
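
The number in the error message is the largest P x Q among the grids listed on the Ps and Qs lines. A small sketch of that calculation, with hpl_required_procs as a hypothetical helper name:

```shell
# Sketch only: hpl_required_procs is a hypothetical helper.
# hpl_required_procs "PS" "QS": given the values of the Ps and Qs lines,
# print the largest P*Q, i.e. the number of processes HPL asks for.
hpl_required_procs() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { n = split($0, p, " ") }
        NR == 2 { split($0, q, " ")
                  for (i = 1; i <= n; i++)
                      if (p[i] * q[i] > max) max = p[i] * q[i]
                  print max + 0 }'
}

hpl_required_procs "1 2" "6 8"   # original grids 1x6 and 2x8: prints 16
hpl_required_procs "1 1" "1 1"   # after the change above: prints 1
```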

Run like this, it kept going for more than 18 hours. So I stopped it midway and reduced each of the problem sizes on line 6 to one tenth.

300 600 1000 Ns
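
Cutting N by a factor of 10 helps a lot because HPL solves a dense N x N system of doubles: memory grows with N squared (8 bytes per element) and the leading term of the operation count grows with N cubed. The sketch below only illustrates that scaling; the helper names are made up, and real run time also depends on NB, the BLAS library, and so on.

```shell
# Sketch only: hypothetical helpers illustrating how HPL cost scales with N.

# hpl_matrix_mb N: approximate size in MB (10^6 bytes) of an N x N matrix
# of 8-byte doubles.
hpl_matrix_mb() {
    echo $(( $1 * $1 * 8 / 1000000 ))
}

# hpl_work_ratio N_OLD N_NEW: integer ratio of the N^3 terms, i.e. roughly
# how many times less arithmetic the smaller problem needs.
hpl_work_ratio() {
    echo $(( ($1 * $1 * $1) / ($2 * $2 * $2) ))
}

hpl_matrix_mb 10000          # prints 800
hpl_matrix_mb 1000           # prints 8
hpl_work_ratio 10000 1000    # prints 1000
```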

Now everything is ready to run on a single node. Create hpccinf.txt as follows and run it.

u0017649@sys-90393:~/hpcc-1.5.0$ cat hpccinf.txt
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
300 600 1000 Ns
5 # of NBs
80 100 120 140 160 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 1 Ps
1 1 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
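
HPL runs every combination of the swept parameters, so the number of HPL tests is the product of the counts in this file: 3 Ns, 5 NBs, 2 process grids, 3 PFACTs, 4 NBMINs, 3 NDIVs and 3 RFACTs (BCAST and DEPTH each have a single value). A sketch of the arithmetic, with hpl_test_count as a hypothetical helper name:

```shell
# Sketch only: hpl_test_count is a hypothetical helper.
# hpl_test_count COUNT...: multiply the number of values given for each
# swept parameter in the input file.
hpl_test_count() {
    total=1
    for n in "$@"; do
        total=$(( total * n ))
    done
    echo "$total"
}

# Ns, NBs, grids, PFACTs, NBMINs, NDIVs, RFACTs from the file above:
hpl_test_count 3 5 2 3 4 3 3   # prints 3240
```

This is the same 3240 that HPL reports as the number of finished tests at the end of the run.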

To run it, simply execute hpcc. With the input data above it takes about 20 minutes in a 1-core POWER8 environment.

u0017649@sys-90393:~/hpcc-1.5.0$ time ./hpcc

The results accumulate in a file named hpccoutf.txt, which grows to about 1.6 MB; the end of it looks like this.

...
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11R3R8 1000 160 1 1 0.08 8.655e+00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0055214 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11R4R8 1000 160 1 1 0.07 8.982e+00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0059963 ...... PASSED
================================================================================

Finished 3240 tests with the following results:
3240 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
Current time (1513215290) is Wed Dec 13 20:34:50 2017

End of HPL section.
Begin of Summary section.
VersionMajor=1
VersionMinor=5
VersionMicro=0
VersionRelease=f
LANG=C
Success=1
sizeof_char=1
sizeof_short=2
sizeof_int=4
sizeof_long=8
sizeof_void_ptr=8
sizeof_size_t=8
sizeof_float=4
sizeof_double=8
sizeof_s64Int=8
sizeof_u64Int=8
sizeof_struct_double_double=16
CommWorldProcs=1
MPI_Wtick=1.000000e-06
HPL_Tflops=0.00934813
HPL_time=0.071476
HPL_eps=1.11022e-16
HPL_RnormI=1.71035e-12
HPL_Anorm1=263.865
HPL_AnormI=262.773
HPL_Xnorm1=2619.63
HPL_XnormI=11.3513
HPL_BnormI=0.499776
HPL_N=1000
HPL_NB=80
HPL_nprow=1
HPL_npcol=1
HPL_depth=1
HPL_nbdiv=4
HPL_nbmin=4
HPL_cpfact=R
HPL_crfact=R
HPL_ctop=1
HPL_order=R
HPL_dMACH_EPS=1.110223e-16
HPL_dMACH_SFMIN=2.225074e-308
HPL_dMACH_BASE=2.000000e+00
HPL_dMACH_PREC=2.220446e-16
HPL_dMACH_MLEN=5.300000e+01
HPL_dMACH_RND=1.000000e+00
HPL_dMACH_EMIN=-1.021000e+03
HPL_dMACH_RMIN=2.225074e-308
HPL_dMACH_EMAX=1.024000e+03
HPL_dMACH_RMAX=1.797693e+308
HPL_sMACH_EPS=5.960464e-08
HPL_sMACH_SFMIN=1.175494e-38
HPL_sMACH_BASE=2.000000e+00
HPL_sMACH_PREC=1.192093e-07
HPL_sMACH_MLEN=2.400000e+01
HPL_sMACH_RND=1.000000e+00
HPL_sMACH_EMIN=-1.250000e+02
HPL_sMACH_RMIN=1.175494e-38
HPL_sMACH_EMAX=1.280000e+02
HPL_sMACH_RMAX=3.402823e+38
dweps=1.110223e-16
sweps=5.960464e-08
HPLMaxProcs=1
HPLMinProcs=1
DGEMM_N=576
StarDGEMM_Gflops=12.5581
SingleDGEMM_Gflops=11.5206
PTRANS_GBs=0.555537
PTRANS_time=0.000324011
PTRANS_residual=0
PTRANS_n=150
PTRANS_nb=120
PTRANS_nprow=1
PTRANS_npcol=1
MPIRandomAccess_LCG_N=524288
MPIRandomAccess_LCG_time=0.719245
MPIRandomAccess_LCG_CheckTime=0.076761
MPIRandomAccess_LCG_Errors=0
MPIRandomAccess_LCG_ErrorsFraction=0
MPIRandomAccess_LCG_ExeUpdates=2097152
MPIRandomAccess_LCG_GUPs=0.00291577
MPIRandomAccess_LCG_TimeBound=60
MPIRandomAccess_LCG_Algorithm=0
MPIRandomAccess_N=524288
MPIRandomAccess_time=0.751529
MPIRandomAccess_CheckTime=0.0741832
MPIRandomAccess_Errors=0
MPIRandomAccess_ErrorsFraction=0
MPIRandomAccess_ExeUpdates=2097152
MPIRandomAccess_GUPs=0.00279051
MPIRandomAccess_TimeBound=60
MPIRandomAccess_Algorithm=0
RandomAccess_LCG_N=524288
StarRandomAccess_LCG_GUPs=0.0547931
SingleRandomAccess_LCG_GUPs=0.0547702
RandomAccess_N=524288
StarRandomAccess_GUPs=0.0442224
SingleRandomAccess_GUPs=0.044326
STREAM_VectorSize=333333
STREAM_Threads=1
StarSTREAM_Copy=2.13593
StarSTREAM_Scale=2.09571
StarSTREAM_Add=3.23884
StarSTREAM_Triad=3.26659
SingleSTREAM_Copy=2.13593
SingleSTREAM_Scale=2.11213
SingleSTREAM_Add=3.23884
SingleSTREAM_Triad=3.26659
FFT_N=131072
StarFFT_Gflops=0.488238
SingleFFT_Gflops=0.488024
MPIFFT_N=65536
MPIFFT_Gflops=0.305709
MPIFFT_maxErr=1.23075e-15
MPIFFT_Procs=1
MaxPingPongLatency_usec=-1
RandomlyOrderedRingLatency_usec=-1
MinPingPongBandwidth_GBytes=-1
NaturallyOrderedRingBandwidth_GBytes=-1
RandomlyOrderedRingBandwidth_GBytes=-1
MinPingPongLatency_usec=-1
AvgPingPongLatency_usec=-1
MaxPingPongBandwidth_GBytes=-1
AvgPingPongBandwidth_GBytes=-1
NaturallyOrderedRingLatency_usec=-1
FFTEnblk=16
FFTEnp=8
FFTEl2size=1048576
M_OPENMP=-1
omp_get_num_threads=0
omp_get_max_threads=0
omp_get_num_procs=0
MemProc=-1
MemSpec=-1
MemVal=-1
MPIFFT_time0=0
MPIFFT_time1=0.00124693
MPIFFT_time2=0.00407505
MPIFFT_time3=0.000625134
MPIFFT_time4=0.0091598
MPIFFT_time5=0.00154018
MPIFFT_time6=0
CPS_HPCC_FFT_235=0
CPS_HPCC_FFTW_ESTIMATE=0
CPS_HPCC_MEMALLCTR=0
CPS_HPL_USE_GETPROCESSTIMES=0
CPS_RA_SANDIA_NOPT=0
CPS_RA_SANDIA_OPT2=0
CPS_USING_FFTW=0
End of Summary section.
########################################################################
End of HPC Challenge tests.
Current time (1513215290) is Wed Dec 13 20:34:50 2017

########################################################################

1184.15user 2.77system 19:47.13elapsed 99%CPU (0avgtext+0avgdata 34624maxresident)k
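
As a sanity check, the HPL figure in the summary can be reproduced from HPL_N and HPL_time. As far as I know, HPL counts (2/3)N^3 + (3/2)N^2 floating point operations for a problem of size N; treat that formula and the helper name hpl_gflops as my assumptions.

```shell
# Sketch only: hpl_gflops is a hypothetical helper, and the operation count
# (2/3)N^3 + (3/2)N^2 is my understanding of what HPL uses.
# hpl_gflops N SECONDS: print the Gflop/s rate for one HPL run.
hpl_gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN {
        flops = (2.0 / 3.0) * n * n * n + 1.5 * n * n
        printf "%.3f\n", flops / t / 1e9
    }'
}

# HPL_N=1000 and HPL_time=0.071476 from the summary above:
hpl_gflops 1000 0.071476   # prints 9.348
```

That agrees with HPL_Tflops=0.00934813 in the summary, which is about 9.348 Gflop/s.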

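Now let's see how to run on multiple nodes. This is also very simple. First create a file containing the node names, as follows. If the nodes have several network interfaces, it is best to list the names that belong to a fast one, such as 10GbE or Infiniband.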

u0017649@sys-90393:~/hpcc-1.5.0$ cat nodes.rf
sys-90393
sys-90505
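
An Open MPI hostfile can also state how many processes each node should take, using slots=N after the host name. I did not need that here since each VM has one core, so the following is only a sketch; make_hostfile is a hypothetical helper name.

```shell
# Sketch only: make_hostfile is a hypothetical helper.
# make_hostfile SLOTS HOST...: print one Open MPI hostfile line per host,
# giving every host the same number of slots.
make_hostfile() {
    slots=$1
    shift
    for host in "$@"; do
        printf '%s slots=%s\n' "$host" "$slots"
    done
}

# Possible usage for the two 1-core VMs in this post (not run here):
#   make_hostfile 1 sys-90393 sys-90505 > nodes.rf
```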

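Next, modify the INPUT data file hpccinf.txt slightly. Unlike the run above, two machines are used now, so it has to run as 2 processes. I therefore change lines 11 and 12, the P x Q information, to 1 1 and 1 2 as shown below, which gives grids of 1 x 1 and 1 x 2. Also, when I ran it as is, it had not finished after 40 minutes, let alone 20, perhaps because of MPI overhead. So I reduced the problem sizes on line 6 by another factor of 10.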

u0017649@sys-90393:~/hpcc-1.5.0$ cat hpccinf.txt
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
30 60 100 Ns
5 # of NBs
80 100 120 140 160 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 1 Ps
1 2 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)

After that, run hpcc with mpirun as follows.

u0017649@sys-90393:~/hpcc-1.5.0$ mpirun -x PATH -x LD_LIBRARY_PATH -np 2 -hostfile nodes.rf ./hpcc | tee HPCC.out

As soon as it starts, you can see both nodes running at 100% CPU.

Tuesday, December 12, 2017

Training alexnet on the ILSVRC2012 dataset using Caffe

First, to use the caffe-nv included in PowerAI as the working environment, source the following script, which sets PATH and other environment variables.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ source /opt/DL/caffe-nv/bin/caffe-activate

Check that caffe resolves to caffe-nv as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ which caffe
/opt/DL/caffe-nv/bin/caffe

Copy the examples and data under PowerAI's caffe-nv over to the GPFS file system.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cp -r /opt/DL/caffe-nv/examples examples
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cp -r /opt/DL/caffe-nv/data data

From there, run get_ilsvrc_aux.sh as shown below to download the label files and other auxiliary files needed to build the ilsvrc2012 dataset.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cd data/ilsvrc12
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/data/ilsvrc12$ ./get_ilsvrc_aux.sh

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/data/ilsvrc12$ ls -ltr
total 37888
-rw-r----- 1 b7p286za IBM1 3200000 Feb 25 2014 test.txt
-rw-r----- 1 b7p286za IBM1 10000 Feb 25 2014 synsets.txt
-rw-r----- 1 b7p286za IBM1 786446 Feb 25 2014 imagenet_mean.binaryproto
-rw-r----- 1 b7p286za IBM1 1644500 Feb 25 2014 val.txt
-rw-r----- 1 b7p286za IBM1 43829433 Feb 25 2014 train.txt
-rw-r----- 1 b7p286za IBM1 31675 Apr 8 2014 synset_words.txt
-rw-r----- 1 b7p286za IBM1 3787 Jun 8 2014 det_synset_words.txt
-rw-r----- 1 b7p286za IBM1 14931117 Jul 11 2014 imagenet.bet.pickle
-rwxr-x--- 1 b7p286za IBM1 585 Dec 12 02:12 get_ilsvrc_aux.sh

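Now download the imagenet data, that is, ILSVRC2012. For the training dataset, you can reuse the raw-data from the earlier post on tensorflow resnet training. However, there the validation dataset was also spread across directories named after the labels, whereas this alexnet setup needs all validation images extracted together into a single directory named val. So extract only val again as follows.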

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/data/ilsvrc12$ cd ../..

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ mkdir raw-data/val && cd raw-data/val

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/val$ tar -xf ../../ILSVRC2012_img_val.tar

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/val$ cd ../..

Now the JPEG files under train and val in raw-data must be converted to LMDB format. Edit the create_imagenet.sh script as follows and use it.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi ./examples/imagenet/create_imagenet.sh
...
export CAFFE_BIN=/opt/DL/caffe-nv/bin (added)
...
TRAIN_DATA_ROOT=/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/train/
VAL_DATA_ROOT=/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/val/
...
#RESIZE=false
RESIZE=true

After editing, run it as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ time ./examples/imagenet/create_imagenet.sh

This step also converts more than 200GB of data into LMDB format, so it takes around 6 to 7 hours depending on the storage. When the script above finishes, the datasets converted to LMDB format appear in examples/imagenet/ilsvrc12_train_lmdb and examples/imagenet/ilsvrc12_val_lmdb.
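To get a feel for why this conversion is so heavy, here is a rough size estimate in Python. It rests on assumptions that are not taken from the run above: that RESIZE=true resizes every image to 256 x 256, that pixels are stored uncompressed at 3 bytes each, and that the training set holds 1,281,167 images. The helper name estimate_lmdb_bytes is hypothetical.

```python
def estimate_lmdb_bytes(num_images, height=256, width=256, channels=3):
    """Rough payload size: one byte per channel per pixel, ignoring LMDB overhead."""
    return num_images * height * width * channels


# Assumed ILSVRC2012 training-set size; not a number printed by the script.
train_bytes = estimate_lmdb_bytes(1281167)
print("approx. %.1f GB for the training LMDB" % (train_bytes / 1e9))
```

Under these assumptions the training LMDB alone comes to roughly 250 GB, so the run time is dominated by storage throughput.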

Now run make_imagenet_mean.sh to compute the mean of the whole imagenet data from the generated LMDB. Here too, define CAFFE_BIN at the very top of the script as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi ./examples/imagenet/make_imagenet_mean.sh
source /opt/DL/caffe-nv/bin/caffe-activate
export CAFFE_BIN=/opt/DL/caffe-nv/bin
...

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ time ./examples/imagenet/make_imagenet_mean.sh

Next, edit solver.prototxt. First copy the bvlc_alexnet directory in /opt/DL/caffe-nv/models over to the GPFS file system.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cp -r /opt/DL/caffe-nv/models/bvlc_alexnet .

Then edit the directory names, max_iter, and so on in solver.prototxt as appropriate, as shown below.
Since batch_size will be set to 2048 later, setting max_iter to 1250 completes roughly 1250 x 2048 / 1280000 = 2 epochs of training.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi bvlc_alexnet/solver.prototxt
#net: "models/bvlc_alexnet/train_val.prototxt"
net: "bvlc_alexnet/train_val.prototxt"
...
#display: 20
display: 500
#max_iter: 100000
max_iter: 1250
...
#snapshot_prefix: "models/bvlc_alexnet/caffe_alexnet_train"
snapshot_prefix: "bvlc_alexnet/caffe_alexnet_train"
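The relationship between max_iter, batch_size and the number of epochs is simple arithmetic, sketched here in Python. The function name epochs is hypothetical, and 1,280,000 is the rounded training-set size used in the text.

```python
def epochs(max_iter, batch_size, num_images):
    """Number of passes over the dataset made by max_iter iterations."""
    return float(max_iter) * batch_size / num_images


# 1,280,000 is the rounded ImageNet training-set size used in the text.
n = epochs(max_iter=1250, batch_size=2048, num_images=1280000)
print("%.1f epochs" % n)
```

Whether batch_size is counted per GPU when training with -gpu all depends on the Caffe build, so treat the result as a per-solver figure.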

Next, edit bvlc_alexnet/train_val.prototxt if needed to increase or decrease the batch_size of the train data, and adjust the various paths as appropriate.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi bvlc_alexnet/train_val.prototxt
...
source: "examples/imagenet/ilsvrc12_train_lmdb"
# batch_size: 1024
batch_size: 2048
...

Now create a train_alexnet.sh like the following and run it.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi ./examples/imagenet/train_alexnet.sh
source /opt/DL/caffe-nv/bin/caffe-activate
export CAFFE_BIN=/opt/DL/caffe-nv/bin
set -e
$CAFFE_BIN/caffe train -gpu all --solver=bvlc_alexnet/solver.prototxt

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ nohup time ./examples/imagenet/train_alexnet.sh &

You can see the resulting log in nohup.out. Running the 1250 iterations configured above takes only about 12 minutes.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ grep iter nohup.out
test_iter: 1000
max_iter: 1250
I1212 14:15:42.151552 82131 solver.cpp:242] Iteration 0 (0 iter/s, 24.031s/500 iter), loss = 6.91103
I1212 14:22:10.507652 82131 solver.cpp:242] Iteration 500 (1.28749 iter/s, 388.352s/500 iter), loss = 6.37464
I1212 14:26:19.506183 82131 solver.cpp:242] Iteration 1000 (2.00806 iter/s, 248.996s/500 iter), loss = 5.34417
I1212 14:27:50.453514 82131 solver.cpp:479] Snapshotting to binary proto file bvlc_alexnet/caffe_alexnet_train_iter_1250.caffemodel
I1212 14:27:51.540899 82131 sgd_solver.cpp:273] Snapshotting solver state to binary proto file bvlc_alexnet/caffe_alexnet_train_iter_1250.solverstate
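The wall-clock time of the run can be read straight off the log timestamps. Here is a small Python sketch that does so for the first and last lines shown above. It assumes the run starts and ends on the same day, and the helper name timestamps is hypothetical.

```python
import re
from datetime import datetime

# First and last log lines copied from the grep output above.
LOG = """\
I1212 14:15:42.151552 82131 solver.cpp:242] Iteration 0 (0 iter/s, 24.031s/500 iter), loss = 6.91103
I1212 14:27:51.540899 82131 sgd_solver.cpp:273] Snapshotting solver state to binary proto file bvlc_alexnet/caffe_alexnet_train_iter_1250.solverstate
"""


def timestamps(log_text):
    """Extract the HH:MM:SS.ffffff part of each log line as datetime objects."""
    found = re.findall(r"^I\d{4} (\d{2}:\d{2}:\d{2}\.\d{6}) ", log_text, re.MULTILINE)
    return [datetime.strptime(t, "%H:%M:%S.%f") for t in found]


stamps = timestamps(LOG)
elapsed = (stamps[-1] - stamps[0]).total_seconds()
print("elapsed: %.1f s (%.1f min)" % (elapsed, elapsed / 60))
```

This gives a little over 12 minutes between iteration 0 and the final snapshot.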

Training resnet101 on the ILSVRC2012 dataset with Tensorflow

First, install anaconda2 as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ wget https://repo.continuum.io/archive/Anaconda2-5.0.0-Linux-ppc64le.sh

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ chmod a+x ./Anaconda2-5.0.0-Linux-ppc64le.sh

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ ./Anaconda2-5.0.0-Linux-ppc64le.sh
--> Here the install directory is /gpfs/gpfs_gl4_16mb/b7p286za/anaconda2, under the user home directory, but you can install it elsewhere depending on your environment.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ export PATH=/gpfs/gpfs_gl4_16mb/b7p286za/anaconda2/bin:$PATH

Now check that python points to the python that ships with anaconda rather than the OS default python.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ which python
/gpfs/gpfs_gl4_16mb/b7p286za/anaconda2/bin/python

Now install tensorflow 1.2.1 with the conda install command below.

If you really need tensorflow 1.3, you have to build it yourself by following this URL (http://hwengineer.blogspot.kr/2017/10/minsky-tensorflow-r13-source-build.html). It builds and runs fine on ppc64le too. That said, tensorflow 1.2.1 is sufficient, so there is no real need to build 1.3.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ conda install tensorflow tensorflow-gpu

Next, clone the following git repository, which contains the resnet model and other pieces for the benchmark. It is the content of the original https://github.com/tensorflow/models.git with a few scripts added. The original scripts start by downloading the imagenet training dataset, which takes far too long, so scripts were added that convert an already downloaded dataset to TFRecord, among other things. These scripts were written by IBM's Boran Lee (이보란).

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ git clone https://github.com/brlee08/models.git

The ILSVRC2012 imagenet datasets used here are the following, and you need to have downloaded them in advance.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ ls -l ILSVRC2012*
-rw-r----- 1 b7p286za IBM1 20861852 Aug 12 2012 ILSVRC2012_bbox_train_v2.tar.gz
-rw-r----- 1 b7p286za IBM1 147897477120 Nov 5 07:02 ILSVRC2012_img_train.tar
-rw-r----- 1 b7p286za IBM1 6744924160 Nov 5 07:03 ILSVRC2012_img_val.tar

Extract them into the appropriate locations as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ mkdir -p raw-data/bounding_boxes

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ mv ILSVRC2012_bbox_train_v2.tar.gz raw-data/bounding_boxes/annotations.tar.gz
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ mv ILSVRC2012_img_train.tar raw-data/
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ mv ILSVRC2012_img_val.tar raw-data/

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cd models/research/inception/inception/data/

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception/data$ ./T2_ibm_uncompress_imagenet.sh /gpfs/gpfs_gl4_16mb/b7p286za/raw-data/ /gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception/data/imagenet_lsvrc_2015_synsets.txt

--> Here you must put a / at the end of the first directory name. (Otherwise you get an error.) This script extracts the raw images (JPEG) under the first directory and distributes them into sub-directories named after the labels. The imagenet_lsvrc_2015_synsets.txt file given second holds the label names of this ILSVRC2012 data.

Once the script above has finished, convert these JPEG files to TFRecord format as follows. To do so, edit the parts below in T2_ibm_preprocess.sh under models/research/inception/inception/data.

#source /opt/DL/tensorflow/bin/tensorflow-activate (comment out the tensorflow-activate line at the very top, so that the TF 1.2.1 installed with conda install is used instead of the TF 1.0 in PowerAI)
WORK_DIR="<path where the models directory is located>/models/research/inception/inception"
python <path where the models directory is located>/models/research/inception/inception/data/build_imagenet_data.py

Here I changed it as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception/data$ vi ./T2_ibm_preprocess.sh
#source /opt/DL/tensorflow/bin/tensorflow-activate
...
WORK_DIR="/gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception"
...
python /gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception/data/build_imagenet_data.py \
...

When the edits are done, run it as follows. A sub-directory named ilsvrc_tf is created under the directory given after T2_ibm_preprocess.sh, and the TFRecord files are generated there.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception/data$ time ./T2_ibm_preprocess.sh /gpfs/gpfs_gl4_16mb/b7p286za/

The script above processes more than 200GB of files, so it takes quite a long time, about 4 hours. When it completes, TFRecord files such as train-xxxx and validation-xxxx are created as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/ilsvrc_tf$ ls -l | more
total 62635520
-rw-r----- 1 b7p286za IBM1 149402267 Dec 12 00:17 train-00000-of-01024
-rw-r----- 1 b7p286za IBM1 150240608 Dec 12 00:19 train-00001-of-01024
-rw-r----- 1 b7p286za IBM1 141760185 Dec 12 00:20 train-00002-of-01024
-rw-r----- 1 b7p286za IBM1 152134069 Dec 12 00:22 train-00003-of-01024
-rw-r----- 1 b7p286za IBM1 141508613 Dec 12 00:24 train-00004-of-01024
-rw-r----- 1 b7p286za IBM1 148320681 Dec 12 00:25 train-00005-of-01024
-rw-r----- 1 b7p286za IBM1 146087263 Dec 12 00:27 train-00006-of-01024
...
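The file names follow a fixed pattern: a split name, a zero-padded shard index, and the total shard count. Here is a small Python sketch that reproduces the names seen above, assuming 1024 training shards as in the listing. shard_name is a hypothetical helper, not a function from the repository.

```python
def shard_name(split, index, num_shards):
    """Format a shard file name like train-00000-of-01024."""
    return "%s-%05d-of-%05d" % (split, index, num_shards)


# 1024 training shards, matching the "-of-01024" suffix in the listing above.
names = [shard_name("train", i, 1024) for i in range(1024)]
print("%s ... %s" % (names[0], names[-1]))
```

A list like this is handy for checking that no shard is missing after the 4-hour conversion.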

Now clone the git repo containing the benchmark scripts as follows. Again, these scripts were written by IBM's Boran Lee (이보란).

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ git clone https://github.com/brlee08/benchmark.git

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cd benchmark/tensorflow

Of these, we run bench_ibm_single.sh, but first edit parts of it as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/benchmark/tensorflow$ vi bench_ibm_single.sh
...
DATA_DIR=/gpfs/gpfs_gl4_16mb/b7p286za/ilsvrc_tf (directory where the TFRecord files are located)
LOG_DIR=/gpfs/gpfs_gl4_16mb/b7p286za/benchmark/tensorflow/output_single (directory for the logs)
...
TRAIN_DIR=/gpfs/gpfs_gl4_16mb/b7p286za/benchmark/tensorflow/train_log (directory for the tensorboard logs)
...
NUM_EPOCHS=10
NUM_GPU=4
INPUT_BATCH=64
INPUT_MODEL="resnet101"
...
#TRAIN_LOG_DIR="${TRAIN_DIR}/googlenet-10e-128b-4G" (this is not used, so comment it out.)
...
#source /opt/DL/tensorflow/bin/tensorflow-activate (again commented out so that the TF 1.2.1 installed with conda install is used instead of the TF 1.0 in PowerAI)
export PATH=/gpfs/gpfs_gl4_16mb/b7p286za/anaconda2/bin:$PATH
export PYTHONPATH=/gpfs/gpfs_gl4_16mb/b7p286za/anaconda2/lib/python2.7/site-packages (the original source uses anaconda3, but here PYTHONPATH must be set to the site-packages of anaconda2)
...
# --data_name=imagenet --train_dir=${TRAIN_LOG_DIR} --data_dir=${DATA_DIR} --variable_update=${VARIABLE_UPDATE} \

--data_name=imagenet --train_dir=${TRAIN_DIR} --data_dir=${DATA_DIR} --variable_update=${VARIABLE_UPDATE} \

(The original had a typo. --train_dir=${TRAIN_LOG_DIR} must be changed to --train_dir=${TRAIN_DIR}.)

Now run the resnet training as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/benchmark/tensorflow$ nohup time /gpfs/gpfs_gl4_16mb/b7p286za/benchmark/tensorflow/bench_ibm_single.sh &

At first it takes about 10 minutes for tensorflow to start up, with a few warning messages, so don't be alarmed. You get results roughly like the following.

50010 images/sec: 485.7 +/- 0.1 (jitter = 3.8) 4.943
50020 images/sec: 485.7 +/- 0.1 (jitter = 3.8) 4.716
50030 images/sec: 485.7 +/- 0.1 (jitter = 3.8) 4.847
50040 images/sec: 485.6 +/- 0.1 (jitter = 3.8) 4.639
----------------------------------------------------------------
total images/sec: 485.42
----------------------------------------------------------------
Training Finish - 2017-12-12 13:59:14
Elapsed Time - 02:31:31
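The step counter in the log ends just past 50,000, which is consistent with 10 epochs if INPUT_BATCH=64 is a per-GPU batch across NUM_GPU=4. The Python sketch below does that arithmetic. The per-GPU interpretation and the 1,281,167 training-image count are assumptions rather than values printed by the script, and total_steps is a hypothetical helper.

```python
def total_steps(num_epochs, num_images, batch_per_gpu, num_gpus):
    """Steps needed for num_epochs passes when every step consumes one batch per GPU."""
    return (num_epochs * num_images) // (batch_per_gpu * num_gpus)


# NUM_EPOCHS, INPUT_BATCH and NUM_GPU come from bench_ibm_single.sh above;
# the image count is an assumed ILSVRC2012 training-set size.
steps = total_steps(num_epochs=10, num_images=1281167, batch_per_gpu=64, num_gpus=4)
print(steps)
```

The result lands within a few steps of the last counter value shown in the log, which is printed only every 10 steps.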

Wednesday, November 22, 2017

Building and configuring hadoop on ppc64le

* I uploaded the hadoop-2.7.4.tar.gz built below to Google Drive so that you can download it from the link below.

https://drive.google.com/open?id=1W0QYAD5DkSeY_vBHRHmu_iril4t9svJz

Compiling hadoop on the POWER (ppc64le) chip is very simple. Just follow https://github.com/apache/hadoop/blob/trunk/BUILDING.txt. The only addition is installing protobuf 2.5 separately, as shown below, because of a protobuf version mismatch.

First install the following packages that are needed by default on Ubuntu.

u0017649@sys-90043:~$ sudo apt-get install software-properties-common

u0017649@sys-90043:~$ sudo apt-get -y install maven build-essential autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev protobuf-compiler snappy libsnappy-dev

u0017649@sys-90043:~$ sudo apt-get install libjansson-dev bzip2 libbz2-dev fuse libfuse-dev zstd

For protoc, download the protobuf 2.5.0 source and install it as follows. The version included in the OS and installable with apt-get is 2.6.1, but oddly enough hadoop insists on exactly 2.5.0.

u0017649@sys-90043:~$ git clone --recursive https://github.com/ibmsoe/Protobuf.git
u0017649@sys-90043:~$ cd Protobuf
u0017649@sys-90043:~/Protobuf$ ./configure
u0017649@sys-90043:~/Protobuf$ make
u0017649@sys-90043:~/Protobuf$ sudo make install
u0017649@sys-90043:~/Protobuf$ which protoc
/usr/local/bin/protoc

* For reference, if you do not install protobuf separately as above, you get the following error.

[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.1.0-SNAPSHOT:protoc (compile-protoc) on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: protoc version is 'libprotoc 2.6.1', expected version is '2.5.0' -> [Help 1]
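The error boils down to a strict version comparison against the string that protoc --version prints. Here is a minimal Python sketch of that kind of check. It is only an illustration, not the actual hadoop-maven-plugins code, and parse_protoc_version and REQUIRED are hypothetical names.

```python
import re

# The version the hadoop build expects, per the error message above.
REQUIRED = (2, 5, 0)


def parse_protoc_version(text):
    """Turn a string like 'libprotoc 2.6.1' into a tuple such as (2, 6, 1)."""
    match = re.search(r"libprotoc (\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("unrecognized protoc version string: %r" % text)
    return tuple(int(part) for part in match.groups())


for reported in ("libprotoc 2.6.1", "libprotoc 2.5.0"):
    ok = parse_protoc_version(reported) == REQUIRED
    print("%s -> %s" % (reported, "ok" if ok else "mismatch"))
```

Feeding it the output of the protoc found first on PATH is a quick way to confirm that /usr/local/bin/protoc is the one being picked up.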

Set the environment variables as well.

u0017649@sys-90043:~$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-ppc64el
u0017649@sys-90043:~$ export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
u0017649@sys-90043:~$ export MAVEN_OPTS="-Xmx2048m"

Then download the hadoop-2.7.4 source as follows. The latest version is currently 3.0, but that does not seem to be a stable release yet, so I will use 2.7.4, the version included in recent HortonWorks releases.

u0017649@sys-90043:~$ wget http://apache.tt.co.kr/hadoop/common/hadoop-2.7.4/hadoop-2.7.4-src.tar.gz
u0017649@sys-90043:~$ tar -zxf hadoop-2.7.4-src.tar.gz
u0017649@sys-90043:~$ cd hadoop-2.7.4-src

The build itself is done with maven. It takes some time but is relatively very simple. Run it as follows and the built binaries are bundled into a tar.gz in the hadoop-dist/target directory.

u0017649@sys-90043:~/hadoop-2.7.4-src$ mvn package -Pdist -DskipTests -Dtar
...
main:
[exec] $ tar cf hadoop-2.7.4.tar hadoop-2.7.4
[exec] $ gzip -f hadoop-2.7.4.tar
[exec]
[exec] Hadoop dist tar available at: /home/u0017649/hadoop-2.7.4-src/hadoop-dist/target/hadoop-2.7.4.tar.gz
[exec]
[INFO] Executed tasks
[INFO]
[INFO] --- maven-javadoc-plugin:2.8.1:jar (module-javadocs) @ hadoop-dist ---
[INFO] Building jar: /home/u0017649/hadoop-2.7.4-src/hadoop-dist/target/hadoop-dist-2.7.4-javadoc.jar
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Main ................................. SUCCESS [ 4.780 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [ 2.711 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [ 1.633 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [ 2.645 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [ 0.386 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [ 2.546 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [ 6.019 s]
[INFO] Apache Hadoop MiniKDC .............................. SUCCESS [ 11.630 s]
[INFO] Apache Hadoop Auth ................................. SUCCESS [ 12.236 s]
[INFO] Apache Hadoop Auth Examples ........................ SUCCESS [ 9.364 s]
[INFO] Apache Hadoop Common ............................... SUCCESS [02:21 min]
[INFO] Apache Hadoop NFS .................................. SUCCESS [ 11.743 s]
[INFO] Apache Hadoop KMS .................................. SUCCESS [ 16.980 s]
[INFO] Apache Hadoop Common Project ....................... SUCCESS [ 3.316 s]
[INFO] Apache Hadoop HDFS ................................. SUCCESS [02:42 min]
[INFO] Apache Hadoop HttpFS ............................... SUCCESS [ 34.161 s]
[INFO] Apache Hadoop HDFS BookKeeper Journal .............. SUCCESS [ 13.819 s]
[INFO] Apache Hadoop HDFS-NFS ............................. SUCCESS [ 5.306 s]
[INFO] Apache Hadoop HDFS Project ......................... SUCCESS [ 0.080 s]
[INFO] hadoop-yarn ........................................ SUCCESS [ 0.073 s]
[INFO] hadoop-yarn-api .................................... SUCCESS [ 39.900 s]
[INFO] hadoop-yarn-common ................................. SUCCESS [ 41.698 s]
[INFO] hadoop-yarn-server ................................. SUCCESS [ 0.160 s]
[INFO] hadoop-yarn-server-common .......................... SUCCESS [ 13.859 s]
[INFO] hadoop-yarn-server-nodemanager ..................... SUCCESS [ 16.781 s]
[INFO] hadoop-yarn-server-web-proxy ....................... SUCCESS [ 5.143 s]
[INFO] hadoop-yarn-server-applicationhistoryservice ....... SUCCESS [ 10.619 s]
[INFO] hadoop-yarn-server-resourcemanager ................. SUCCESS [ 25.832 s]
[INFO] hadoop-yarn-server-tests ........................... SUCCESS [ 6.436 s]
[INFO] hadoop-yarn-client ................................. SUCCESS [ 9.209 s]
[INFO] hadoop-yarn-server-sharedcachemanager .............. SUCCESS [ 4.691 s]
[INFO] hadoop-yarn-applications ........................... SUCCESS [ 0.052 s]
[INFO] hadoop-yarn-applications-distributedshell .......... SUCCESS [ 4.187 s]
[INFO] hadoop-yarn-applications-unmanaged-am-launcher ..... SUCCESS [ 2.589 s]
[INFO] hadoop-yarn-site ................................... SUCCESS [ 0.052 s]
[INFO] hadoop-yarn-registry ............................... SUCCESS [ 8.977 s]
[INFO] hadoop-yarn-project ................................ SUCCESS [ 4.737 s]
[INFO] hadoop-mapreduce-client ............................ SUCCESS [ 0.271 s]
[INFO] hadoop-mapreduce-client-core ....................... SUCCESS [ 28.766 s]
[INFO] hadoop-mapreduce-client-common ..................... SUCCESS [ 18.916 s]
[INFO] hadoop-mapreduce-client-shuffle .................... SUCCESS [ 6.326 s]
[INFO] hadoop-mapreduce-client-app ........................ SUCCESS [ 12.547 s]
[INFO] hadoop-mapreduce-client-hs ......................... SUCCESS [ 8.090 s]
[INFO] hadoop-mapreduce-client-jobclient .................. SUCCESS [ 10.544 s]
[INFO] hadoop-mapreduce-client-hs-plugins ................. SUCCESS [ 2.727 s]
[INFO] Apache Hadoop MapReduce Examples ................... SUCCESS [ 7.638 s]
[INFO] hadoop-mapreduce ................................... SUCCESS [ 3.216 s]
[INFO] Apache Hadoop MapReduce Streaming .................. SUCCESS [ 6.935 s]
[INFO] Apache Hadoop Distributed Copy ..................... SUCCESS [ 15.235 s]
[INFO] Apache Hadoop Archives ............................. SUCCESS [ 4.425 s]
[INFO] Apache Hadoop Rumen ................................ SUCCESS [ 7.658 s]
[INFO] Apache Hadoop Gridmix .............................. SUCCESS [ 5.281 s]
[INFO] Apache Hadoop Data Join ............................ SUCCESS [ 3.525 s]
[INFO] Apache Hadoop Ant Tasks ............................ SUCCESS [ 2.382 s]
[INFO] Apache Hadoop Extras ............................... SUCCESS [ 4.387 s]
[INFO] Apache Hadoop Pipes ................................ SUCCESS [ 0.031 s]
[INFO] Apache Hadoop OpenStack support .................... SUCCESS [ 6.296 s]
[INFO] Apache Hadoop Amazon Web Services support .......... SUCCESS [ 14.099 s]
[INFO] Apache Hadoop Azure support ........................ SUCCESS [ 6.764 s]
[INFO] Apache Hadoop Client ............................... SUCCESS [ 8.594 s]
[INFO] Apache Hadoop Mini-Cluster ......................... SUCCESS [ 1.873 s]
[INFO] Apache Hadoop Scheduler Load Simulator ............. SUCCESS [ 8.096 s]
[INFO] Apache Hadoop Tools Dist ........................... SUCCESS [ 8.972 s]
[INFO] Apache Hadoop Tools ................................ SUCCESS [ 0.039 s]
[INFO] Apache Hadoop Distribution ......................... SUCCESS [01:00 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 15:09 min
[INFO] Finished at: 2017-11-21T22:31:05-05:00
[INFO] Final Memory: 205M/808M
[INFO] ------------------------------------------------------------------------

Now let's use the tar.gz we just built to do a basic hadoop configuration (albeit on a single machine).

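For example, if you are using a virtual machine on Nimbix cloud, which offers a docker-based cloud service on Minsky servers, you must install hadoop in /data, the only persistent storage. Anything installed in other directories is wiped and reset when the Nimbix instance is rebooted. This is because a Nimbix instance is not a real virtual machine but a docker instance.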

As shown below, simply extracting the hadoop tar.gz file under /data completes the installation.

u0017649@sys-90043:~$ cd /data
u0017649@sys-90043:/data$ tar -zxf /home/u0017649/hadoop-2.7.4-src/hadoop-dist/target/hadoop-2.7.4.tar.gz
u0017649@sys-90043:/data$ cd hadoop-2.7.4

Now set the basic environment variables. JAVA_HOME must also be set properly, as above.

u0017649@sys-90043:/data/hadoop-2.7.4$ export HADOOP_INSTALL=/data/hadoop-2.7.4
u0017649@sys-90043:/data/hadoop-2.7.4$ export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin

First check that the hadoop binary works properly.

u0017649@sys-90043:/data/hadoop-2.7.4$ hadoop version
Hadoop 2.7.4
Subversion Unknown -r Unknown
Compiled by u0017649 on 2017-11-22T03:17Z
Compiled with protoc 2.5.0
From source with checksum 50b0468318b4ce9bd24dc467b7ce1148
This command was run using /data/hadoop-2.7.4/share/hadoop/common/hadoop-common-2.7.4.jar

Then go into the configuration directory and make the basic settings as follows.

u0017649@sys-90043:/data/hadoop-2.7.4$ cd etc/hadoop

u0017649@sys-90043:/data/hadoop-2.7.4/etc/hadoop$ vi hadoop-env.sh
...
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-ppc64el
...

u0017649@sys-90043:/data/hadoop-2.7.4/etc/hadoop$ vi core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop-2.7.4/hadoop-${user.name}</value>
</property>
</configuration>

u0017649@sys-90043:/data/hadoop-2.7.4/etc/hadoop$ vi mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>

<property>
<name>mapred.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
</property>

<property>
<name>mapred.system.dir</name>
<value>${hadoop.tmp.dir}/mapred/system</value>
</property>
</configuration>
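The ${...} references in these values are expanded from other properties and from system properties such as user.name. Here is a minimal Python sketch of that substitution, written only to show how the paths above resolve. It is not Hadoop's own implementation, and the expand helper and its max_depth limit are hypothetical.

```python
import re


def expand(value, props, max_depth=20):
    """Repeatedly replace ${name} with props[name] until nothing changes."""
    pattern = re.compile(r"\$\{([^}]+)\}")
    for _ in range(max_depth):
        new = pattern.sub(lambda m: props.get(m.group(1), m.group(0)), value)
        if new == value:
            break
        value = new
    return value


# Property values copied from core-site.xml and mapred-site.xml above;
# user.name stands in for the Java system property of the same name.
props = {
    "user.name": "u0017649",
    "hadoop.tmp.dir": "/data/hadoop-2.7.4/hadoop-${user.name}",
    "mapred.local.dir": "${hadoop.tmp.dir}/mapred/local",
}
resolved = expand(props["mapred.local.dir"], props)
print(resolved)
```

With the user u0017649, hadoop.tmp.dir resolves to a directory under /data/hadoop-2.7.4, which keeps the HDFS data on the persistent storage.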

In the slaves file, write localhost, that is, this machine itself. This makes the machine both the namenode and a datanode.

u0017649@sys-90043:/data/hadoop-2.7.4/etc/hadoop$ cat slaves
localhost

Now we will start hadoop, but first ssh-keygen and ssh-copy-id must have been run so that "ssh localhost" and "ssh 0.0.0.0" work without a password, even to localhost itself.

Now format the namenode.

u0017649@sys-90043:~$ hadoop namenode -format
...
17/11/21 23:53:38 INFO namenode.FSImageFormatProtobuf: Image file /data/hadoop-2.7.4/hadoop-u0017649/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 325 bytes saved in 0 seconds.
17/11/21 23:53:38 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
17/11/21 23:53:38 INFO util.ExitUtil: Exiting with status 0
17/11/21 23:53:38 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at sys-90043/172.29.160.241
************************************************************/

Then start hadoop and yarn.

u0017649@sys-90043:~$ start-all.sh
...
node-sys-90043.out
starting yarn daemons
starting resourcemanager, logging to /data/hadoop-2.7.4/logs/yarn-u0017649-resourcemanager-sys-90043.out
localhost: starting nodemanager, logging to /data/hadoop-2.7.4/logs/yarn-u0017649-nodemanager-sys-90043.out

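Try some basic hdfs commands as follows. You can see that they work fine. (The step that uploaded the hosts file into input, for example with hadoop fs -put, is not shown in the listing below.)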

u0017649@sys-90043:~$ hadoop fs -df
Filesystem Size Used Available Use%
hdfs://localhost:9000 36849713152 24576 5312647168 0%

u0017649@sys-90043:~$ hadoop fs -mkdir -p /user/u0017649
u0017649@sys-90043:~$ hadoop fs -mkdir input

u0017649@sys-90043:~$ hadoop fs -ls -R
drwxr-xr-x - u0017649 supergroup 0 2017-11-21 23:58 input
-rw-r--r-- 3 u0017649 supergroup 258 2017-11-21 23:58 input/hosts

u0017649@sys-90043:~$ hadoop fs -text input/hosts
127.0.0.1 localhost
127.0.1.1 ubuntu1604-dr-01.dal-ebis.ihost.com ubuntu1604-dr-01
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.29.160.241 sys-90043

Monday, November 20, 2017

Building JCuda 0.8.0 (for CUDA 8) on a Minsky server (ppc64le)

To build JCuda, just follow the instructions on the github page below.

https://github.com/jcuda/jcuda-main/blob/master/BUILDING.md

First, git clone the 9 projects as follows.

u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcuda-main.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcuda-common.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcuda.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcublas.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcufft.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcusparse.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcurand.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jcusolver.git
u0017649@sys-90043:~/jcuda$ git clone https://github.com/jcuda/jnvgraph.git

Go into each directory and check out version-0.8.0, the version that matches CUDA 8.

u0017649@sys-90043:~/jcuda$ ls
jcublas jcuda jcuda-common jcuda-main jcufft jcurand jcusolver jcusparse jnvgraph

u0017649@sys-90043:~/jcuda$ cd jcublas
u0017649@sys-90043:~/jcuda/jcublas$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcublas$ cd ../jcuda
u0017649@sys-90043:~/jcuda/jcuda$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcuda$ cd ../jcuda-common
u0017649@sys-90043:~/jcuda/jcuda-common$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcuda-common$ cd ../jcuda-main
u0017649@sys-90043:~/jcuda/jcuda-main$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcuda-main$ cd ../jcufft
u0017649@sys-90043:~/jcuda/jcufft$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcufft$ cd ../jcurand
u0017649@sys-90043:~/jcuda/jcurand$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcurand$ cd ../jcusolver
u0017649@sys-90043:~/jcuda/jcusolver$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcusolver$ cd ../jcusparse
u0017649@sys-90043:~/jcuda/jcusparse$ git checkout tags/version-0.8.0

u0017649@sys-90043:~/jcuda/jcusparse$ cd ../jnvgraph
u0017649@sys-90043:~/jcuda/jnvgraph$ git checkout tags/version-0.8.0
u0017649@sys-90043:~/jcuda/jnvgraph$ cd ..
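Typing the same checkout nine times is tedious, so the commands can also be generated in a loop. The Python sketch below only builds the command lines, using git's -C option to name the repository directory. checkout_commands is a hypothetical helper, and the repository list is the one cloned above.

```python
# The nine repositories cloned under ~/jcuda above.
REPOS = [
    "jcuda-main", "jcuda-common", "jcuda", "jcublas", "jcufft",
    "jcusparse", "jcurand", "jcusolver", "jnvgraph",
]


def checkout_commands(repos, tag="version-0.8.0"):
    """Build one 'git -C <repo> checkout tags/<tag>' command per repository."""
    return [["git", "-C", repo, "checkout", "tags/" + tag] for repo in repos]


commands = checkout_commands(REPOS)
for cmd in commands:
    print(" ".join(cmd))
```

To actually run them, each list can be passed to subprocess.check_call from inside the ~/jcuda directory.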

I built this on Ubuntu 16.04 ppc64le. The build looks for GL/gl.h, so libmesa-dev must be installed beforehand as follows.

u0017649@sys-90043:~/jcuda$ sudo apt-get install libmesa-dev

Set the basic environment variables as follows.

u0017649@sys-90043:~/jcuda$ export LD_LIBRARY_PATH=/usr/local/cuda-8.0/targets/ppc64le-linux/lib:$LD_LIBRARY_PATH
u0017649@sys-90043:~/jcuda$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-ppc64el

Now run cmake. To tell cmake about CUDA_nvrtc_LIBRARY, add a -D option as follows.

u0017649@sys-90043:~/jcuda$ cmake ./jcuda-main -DCUDA_nvrtc_LIBRARY="/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libnvrtc.so"
...
-- Found CUDA: /usr/local/cuda/bin/nvcc
-- Found JNI: /usr/lib/jvm/java-8-openjdk-ppc64el/jre/lib/ppc64le/libjawt.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/u0017649/jcuda

Next, run make all.

u0017649@sys-90043:~/jcuda$ make all
...
/home/u0017649/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp:292:26: warning: deleting ‘void*’ is undefined [-Wdelete-incomplete]
delete nativeObject->nativeTopologyData;
^
[100%] Linking CXX shared library ../../nativeLibraries/linux/ppc_64/lib/libJNvgraph-0.8.0-linux-ppc_64.so
[100%] Built target JNvgraph

When that finishes, go into the jcuda-main directory and run clean install with mvn. Note that I skipped all the maven tests here. CUDA 8.0 is installed on this machine, but there is no actual GPU, so running the tests fails with an error saying no cuda device can be found. In an environment with a GPU installed, drop the "-Dmaven.test.skip=true" option and just run "mvn clean install".

u0017649@sys-90043:~/jcuda$ cd jcuda-main

u0017649@sys-90043:~/jcuda/jcuda-main$ mvn -Dmaven.test.skip=true clean install
...
[INFO] Configured Artifact: org.jcuda:jnvgraph-natives:linux-ppc_64:0.8.0:jar
[INFO] Copying jcuda-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcuda-0.8.0.jar
[INFO] Copying jcuda-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcuda-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jcublas-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcublas-0.8.0.jar
[INFO] Copying jcublas-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcublas-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jcufft-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcufft-0.8.0.jar
[INFO] Copying jcufft-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcufft-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jcusparse-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcusparse-0.8.0.jar
[INFO] Copying jcusparse-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcusparse-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jcurand-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcurand-0.8.0.jar
[INFO] Copying jcurand-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcurand-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jcusolver-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jcusolver-0.8.0.jar
[INFO] Copying jcusolver-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jcusolver-natives-0.8.0-linux-ppc_64.jar
[INFO] Copying jnvgraph-0.8.0.jar to /home/u0017649/jcuda/jcuda-main/target/jnvgraph-0.8.0.jar
[INFO] Copying jnvgraph-natives-0.8.0-linux-ppc_64.jar to /home/u0017649/jcuda/jcuda-main/target/jnvgraph-natives-0.8.0-linux-ppc_64.jar
[INFO]
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ jcuda-main ---
[INFO] Installing /home/u0017649/jcuda/jcuda-main/pom.xml to /home/u0017649/.m2/repository/org/jcuda/jcuda-main/0.8.0/jcuda-main-0.8.0.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] JCuda .............................................. SUCCESS [ 2.596 s]
[INFO] jcuda-natives ...................................... SUCCESS [ 0.596 s]
[INFO] jcuda .............................................. SUCCESS [ 13.244 s]
[INFO] jcublas-natives .................................... SUCCESS [ 0.120 s]
[INFO] jcublas ............................................ SUCCESS [ 6.343 s]
[INFO] jcufft-natives ..................................... SUCCESS [ 0.029 s]
[INFO] jcufft ............................................. SUCCESS [ 2.843 s]
[INFO] jcurand-natives .................................... SUCCESS [ 0.036 s]
[INFO] jcurand ............................................ SUCCESS [ 2.428 s]
[INFO] jcusparse-natives .................................. SUCCESS [ 0.085 s]
[INFO] jcusparse .......................................... SUCCESS [ 7.853 s]
[INFO] jcusolver-natives .................................. SUCCESS [ 0.066 s]
[INFO] jcusolver .......................................... SUCCESS [ 4.158 s]
[INFO] jnvgraph-natives ................................... SUCCESS [ 0.057 s]
[INFO] jnvgraph ........................................... SUCCESS [ 2.932 s]
[INFO] jcuda-main ......................................... SUCCESS [ 1.689 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 45.413 s
[INFO] Finished at: 2017-11-20T01:41:41-05:00
[INFO] Final Memory: 53M/421M
[INFO] ------------------------------------------------------------------------

As shown above, the build completes successfully, and it produces 14 jar files in the jcuda-main/target directory.

u0017649@sys-90043:~/jcuda/jcuda-main$ cd target
u0017649@sys-90043:~/jcuda/jcuda-main/target$ ls -ltr
total 1680
-rw-rw-r-- 1 u0017649 u0017649 318740 Nov 20 01:41 jcuda-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 149350 Nov 20 01:41 jcuda-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 297989 Nov 20 01:41 jcublas-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 30881 Nov 20 01:41 jcublas-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 196292 Nov 20 01:41 jnvgraph-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 11081 Nov 20 01:41 jnvgraph-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 307435 Nov 20 01:41 jcusparse-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 38335 Nov 20 01:41 jcusparse-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 248028 Nov 20 01:41 jcusolver-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 22736 Nov 20 01:41 jcusolver-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 26684 Nov 20 01:41 jcurand-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 8372 Nov 20 01:41 jcurand-0.8.0.jar
-rw-rw-r-- 1 u0017649 u0017649 27414 Nov 20 01:41 jcufft-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- 1 u0017649 u0017649 11052 Nov 20 01:41 jcufft-0.8.0.jar

Now move these 14 jar files into a suitable directory and bundle them with tar. I created a directory named JCuda-All-0.8.0-bin-linux-ppc64le under /tmp, moved the jar files into it, and then archived that directory with tar as follows.

u0017649@sys-90043:/tmp$ tar -zcvf JCuda-All-0.8.0-bin-linux-ppc64le.tgz JCuda-All-0.8.0-bin-linux-ppc64le

The contents are as follows.

u0017649@sys-90043:/tmp$ tar -ztvf JCuda-All-0.8.0-bin-linux-ppc64le.tgz
drwxrwxr-x u0017649/u0017649 0 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/
-rw-rw-r-- u0017649/u0017649 30881 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcublas-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 11052 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcufft-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 196292 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jnvgraph-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 149350 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcuda-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 307435 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcusparse-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 248028 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcusolver-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 11081 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jnvgraph-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 27414 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcufft-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 297989 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcublas-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 38335 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcusparse-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 8372 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcurand-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 26684 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcurand-natives-0.8.0-linux-ppc_64.jar
-rw-rw-r-- u0017649/u0017649 22736 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcusolver-0.8.0.jar
-rw-rw-r-- u0017649/u0017649 318740 2017-11-20 01:46 JCuda-All-0.8.0-bin-linux-ppc64le/jcuda-natives-0.8.0-linux-ppc_64.jar
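The collect-and-archive step above can also be scripted. The following is a minimal Python sketch, not the method used in this post: the function name bundle_jars and its arguments are hypothetical, and it copies the jars instead of moving them.

```python
import shutil
import tarfile
from pathlib import Path


def bundle_jars(target_dir, work_dir, bundle_name):
    """Copy every *.jar in target_dir into work_dir/bundle_name and tar it up.

    Returns the path of the .tgz that was written. All names here are
    illustrative; adjust them to your own layout.
    """
    bundle_dir = Path(work_dir) / bundle_name
    bundle_dir.mkdir(parents=True, exist_ok=True)
    for jar in sorted(Path(target_dir).glob("*.jar")):
        shutil.copy2(str(jar), str(bundle_dir / jar.name))

    archive = Path(work_dir) / (bundle_name + ".tgz")
    # "w:gz" writes a gzip-compressed tar, comparable to `tar -zcvf`.
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps only the top-level directory name inside the archive.
        tar.add(str(bundle_dir), arcname=bundle_name)
    return archive
```

For the layout in this post, a call would look like bundle_jars("jcuda-main/target", "/tmp", "JCuda-All-0.8.0-bin-linux-ppc64le").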

The JCuda-All-0.8.0-bin-linux-ppc64le.tgz file can be downloaded from the link below.

https://drive.google.com/open?id=1CnlvJARkRWPDTbynUUlNBL_TQbu1-xbn

* For reference, I downloaded the x86 binary from jcuda.org and it also contains 14 jar files, the same as mine. So the build most likely came out correctly.

** For reference, the -DCUDA_nvrtc_LIBRARY="/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libnvrtc.so" option is passed to cmake further above in order to avoid the error below.

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDA_nvrtc_LIBRARY
linked by target "JNvrtc" in directory /home/u0017649/jcuda/jcuda/JNvrtcJNI


Friday, November 10, 2017

Testing tensorflow 1.3, caffe2, and pytorch with nvidia-docker

This post shows how to test tensorflow 1.3, caffe2, and pytorch using nvidia-docker.

1) tensorflow v1.3

Start the tensorflow 1.3 docker image as follows.

root@minsky:~# nvidia-docker run -ti --rm -v /data:/data bsyu/tf1.3-ppc64le:v0.1 bash

First, check the various PATH environment variables.

root@67c0e6901bb2:/# env | grep PATH
LIBRARY_PATH=/usr/local/cuda/lib64/stubs:
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PATH=/opt/anaconda3/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PYTHONPATH=/opt/anaconda3/lib/python3.6/site-packages

Move to the directory that contains the cifar10 example code.

root@67c0e6901bb2:/# cd /data/imsi/tensorflow/models/tutorials/image/cifar10

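Part of the cifar10_multi_gpu_train.py code needs to be edited before running it. (Settings such as --train_dir ought to be adjustable through command-line parameters, but in practice it seems the source has to be edited directly for the script to run properly.) Passing --num_gpus on the command line, for example, fails as follows.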

root@67c0e6901bb2:/data/imsi/tensorflow/models/tutorials/image/cifar10# time python cifar10_multi_gpu_train.py --batch_size 512 --num_gpus 2
usage: cifar10_multi_gpu_train.py [-h] [--batch_size BATCH_SIZE]
[--data_dir DATA_DIR] [--use_fp16 USE_FP16]
cifar10_multi_gpu_train.py: error: unrecognized arguments: --num_gpus 2

To avoid the error above, edit the code directly as shown below.

root@67c0e6901bb2:/data/imsi/tensorflow/models/tutorials/image/cifar10# vi cifar10_multi_gpu_train.py
...
#parser.add_argument('--train_dir', type=str, default='/tmp/cifar10_train',
parser.add_argument('--train_dir', type=str, default='/data/imsi/test/tf1.3',
help='Directory where to write event logs and checkpoint.')

#parser.add_argument('--max_steps', type=int, default=1000000,
parser.add_argument('--max_steps', type=int, default=10000,
help='Number of batches to run.')

#parser.add_argument('--num_gpus', type=int, default=1,
parser.add_argument('--num_gpus', type=int, default=4,
help='How many GPUs to use.')
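The "unrecognized arguments" error above means that the parser which consumed the command line did not know about --num_gpus at that moment. One possible cause (an assumption, not verified against this version of the tutorial) is that an imported module parses the command line before the training script registers its own options. The sketch below reproduces that behavior with a stand-in parser; the argv list is made up for illustration.

```python
import argparse

# Hypothetical stand-in for a parser that is shared between two modules.
parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int, default=128)

argv = ['--batch_size', '512', '--num_gpus', '2']

# parse_args(argv) would exit here with "unrecognized arguments: --num_gpus 2".
# parse_known_args() hands the unknown options back instead of exiting.
early_flags, leftover = parser.parse_known_args(argv)
print(early_flags.batch_size, leftover)   # 512 ['--num_gpus', '2']

# Once the later option is registered, the same argv parses completely.
parser.add_argument('--num_gpus', type=int, default=1)
flags = parser.parse_args(argv)
print(flags.batch_size, flags.num_gpus)   # 512 2
```

Editing the default values in the source, as done above, sidesteps the question entirely, which is why it works regardless of the actual cause.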

Now run it as follows. The batch_size is set to 512 here, but a larger value would probably work as well.

root@67c0e6901bb2:/data/imsi/tensorflow/models/tutorials/image/cifar10# time python cifar10_multi_gpu_train.py --batch_size 512
>> Downloading cifar-10-binary.tar.gz 6.1%
...
2017-11-10 01:20:23.628755: step 9440, loss = 0.63 (15074.6 examples/sec; 0.034 sec/batch)
2017-11-10 01:20:25.052011: step 9450, loss = 0.64 (14615.4 examples/sec; 0.035 sec/batch)
2017-11-10 01:20:26.489564: step 9460, loss = 0.55 (14872.0 examples/sec; 0.034 sec/batch)
2017-11-10 01:20:27.860303: step 9470, loss = 0.61 (14515.9 examples/sec; 0.035 sec/batch)
2017-11-10 01:20:29.289386: step 9480, loss = 0.54 (13690.6 examples/sec; 0.037 sec/batch)
2017-11-10 01:20:30.799570: step 9490, loss = 0.69 (15940.8 examples/sec; 0.032 sec/batch)
2017-11-10 01:20:32.239056: step 9500, loss = 0.54 (12581.4 examples/sec; 0.041 sec/batch)
2017-11-10 01:20:34.219832: step 9510, loss = 0.60 (14077.9 examples/sec; 0.036 sec/batch)
...

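Next, run a docker container that is given only the 8 cores of one chip, which is half of the machine's 2 chips and 16 cores, and only 2 of the 4 GPUs. When --cpuset-cpus is used to control CPU resources, the CPU numbers are given in groups of two, as in the command below. This is because IBM POWER8 supports up to 8 SMT threads (hyperthreads) per core, so 8 logical CPU numbers are assigned to each core. On this system SMT is currently set to 2 rather than 8 to optimize deep learning performance.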

root@minsky:~# NV_GPU=0,1 nvidia-docker run -ti --rm --cpuset-cpus="0,1,8,9,16,17,24,25,32,33,40,41,48,49" -v /data:/data bsyu/tf1.3-ppc64le:v0.1 bash
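The CPU list for --cpuset-cpus can be generated rather than typed by hand. This is a small sketch with a hypothetical helper name; it assumes each core owns 8 consecutive logical CPU numbers and that only the first 2 of them are online (SMT=2), as described above. Under that assumption, seven cores give exactly the list used in the command above, and an eighth core would add 56,57.

```python
def cpuset_for_cores(first_core, num_cores, threads_per_core=2, stride=8):
    """Build a --cpuset-cpus string from the first `threads_per_core` logical
    CPUs of each core, assuming each core owns `stride` consecutive numbers."""
    cpus = []
    for core in range(first_core, first_core + num_cores):
        base = core * stride
        cpus.extend(range(base, base + threads_per_core))
    return ",".join(str(c) for c in cpus)


print(cpuset_for_cores(0, 7))  # 0,1,8,9,16,17,24,25,32,33,40,41,48,49
print(cpuset_for_cores(0, 8))  # same list followed by 56,57
```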

root@3b2c2614811d:~# nvidia-smi

Fri Nov 10 02:24:14 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.119 Driver Version: 361.119 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... On | 0002:01:00.0 Off | 0 |
| N/A 38C P0 30W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... On | 0003:01:00.0 Off | 0 |
| N/A 40C P0 33W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

root@3b2c2614811d:/# cd /data/imsi/tensorflow/models/tutorials/image/cifar10

Since there are now 2 GPUs instead of 4, edit cifar10_multi_gpu_train.py accordingly as shown below.

root@3b2c2614811d:/data/imsi/tensorflow/models/tutorials/image/cifar10# vi cifar10_multi_gpu_train.py
...
#parser.add_argument('--num_gpus', type=int, default=1,
parser.add_argument('--num_gpus', type=int, default=2,
help='How many GPUs to use.')

It runs without problems.

root@3b2c2614811d:/data/imsi/tensorflow/models/tutorials/image/cifar10# time python cifar10_multi_gpu_train.py --batch_size 512
>> Downloading cifar-10-binary.tar.gz 1.7%
...
2017-11-10 02:35:50.040462: step 120, loss = 4.07 (15941.4 examples/sec; 0.032 sec/batch)
2017-11-10 02:35:50.587970: step 130, loss = 4.14 (19490.7 examples/sec; 0.026 sec/batch)
2017-11-10 02:35:51.119347: step 140, loss = 3.91 (18319.8 examples/sec; 0.028 sec/batch)
2017-11-10 02:35:51.655916: step 150, loss = 3.87 (20087.1 examples/sec; 0.025 sec/batch)
2017-11-10 02:35:52.181703: step 160, loss = 3.90 (19215.5 examples/sec; 0.027 sec/batch)
2017-11-10 02:35:52.721608: step 170, loss = 3.82 (17780.1 examples/sec; 0.029 sec/batch)
2017-11-10 02:35:53.245088: step 180, loss = 3.92 (18888.4 examples/sec; 0.027 sec/batch)
2017-11-10 02:35:53.777146: step 190, loss = 3.80 (19103.7 examples/sec; 0.027 sec/batch)
2017-11-10 02:35:54.308063: step 200, loss = 3.76 (18554.2 examples/sec; 0.028 sec/batch)
...
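To compare runs like the ones above, the examples/sec figures can be averaged from a saved log. The following is a rough sketch; the regular expression is written against the log format shown in this post, and the function name is hypothetical.

```python
import re

# Matches e.g. "step 120, loss = 4.07 (15941.4 examples/sec; 0.032 sec/batch)"
LINE_RE = re.compile(r"step (\d+), loss = ([\d.]+) \(([\d.]+) examples/sec")


def mean_examples_per_sec(lines):
    """Average the examples/sec figure over all matching log lines."""
    rates = [float(m.group(3)) for m in map(LINE_RE.search, lines) if m]
    if not rates:
        raise ValueError("no matching log lines")
    return sum(rates) / len(rates)


# Two lines copied from the 2-GPU run above.
sample = [
    "2017-11-10 02:35:50.040462: step 120, loss = 4.07 (15941.4 examples/sec; 0.032 sec/batch)",
    "2017-11-10 02:35:50.587970: step 130, loss = 4.14 (19490.7 examples/sec; 0.026 sec/batch)",
]
print(round(mean_examples_per_sec(sample), 2))
```

The first steps of a run include warm-up, so for a fair comparison it is probably better to skip them and average over the same step range in each log.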

2) caffe2

Here the docker container is started with only 2 GPUs and 8 CPU cores from the beginning.

root@minsky:~# NV_GPU=0,1 nvidia-docker run -ti --rm --cpuset-cpus="0,1,8,9,16,17,24,25,32,33,40,41,48,49" -v /data:/data bsyu/caffe2-ppc64le:v0.3 bash

As you can see, only 2 GPUs show up.

root@dc853a5495a0:/# nvidia-smi

Fri Nov 10 07:22:21 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.119 Driver Version: 361.119 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... On | 0002:01:00.0 Off | 0 |
| N/A 32C P0 29W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... On | 0003:01:00.0 Off | 0 |
| N/A 35C P0 32W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Check the environment variables. In this image caffe2 is installed in /opt/caffe2, so LD_LIBRARY_PATH and PYTHONPATH are set to match.

root@dc853a5495a0:/# env | grep PATH
LIBRARY_PATH=/usr/local/cuda/lib64/stubs:
LD_LIBRARY_PATH=/opt/caffe2/lib:/opt/DL/nccl/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PATH=/opt/caffe2/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PYTHONPATH=/opt/caffe2

caffe2 is tested with resnet50_trainer.py, shown further below. Before that, to work around the lmdb creation problem reported at https://github.com/caffe2/caffe2/issues/517 , edit part of the code as suggested at that URL.

root@dc853a5495a0:/# cd /data/imsi/caffe2/caffe2/python/examples
root@dc853a5495a0:/data/imsi/caffe2/caffe2/python/examples# vi lmdb_create_example.py
...
flatten_img = img_data.reshape(np.prod(img_data.shape))
# img_tensor.float_data.extend(flatten_img)
img_tensor.float_data.extend(flatten_img.flat)

Then create the lmdb as follows. It had already been run once here, so a repeated run finishes very quickly.

root@dc853a5495a0:/data/imsi/caffe2/caffe2/python/examples# python lmdb_create_example.py --output_file /data/imsi/test/caffe2/lmdb
>>> Write database...
Inserted 0 rows
Inserted 16 rows
Inserted 32 rows
Inserted 48 rows
Inserted 64 rows
Inserted 80 rows
Inserted 96 rows
Inserted 112 rows
Checksum/write: 1744827
>>> Read database...
Checksum/read: 1744827

Next, run the training as follows. Only 2 GPUs are visible in this environment, so --gpus must be given as 0,1 rather than 0,1,2,3.

root@dc853a5495a0:/data/imsi/caffe2/caffe2/python/examples# time python resnet50_trainer.py --train_data /data/imsi/test/caffe2/lmdb --gpus 0,1 --batch_size 128 --num_epochs 1

Running it prints 'not a valid file' warning messages like the ones below, but searching github and elsewhere shows that these messages can be ignored.

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
INFO:resnet50_trainer:Running on GPUs: [0, 1]
INFO:resnet50_trainer:Using epoch size: 1499904
INFO:data_parallel_model:Parallelizing model for devices: [0, 1]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Model for GPU : 1
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() -----
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.252535104752 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.253523111343 secs
INFO:resnet50_trainer:Starting epoch 0/1
INFO:resnet50_trainer:Finished iteration 1/11718 of epoch 0 (27.70 images/sec)
INFO:resnet50_trainer:Training loss: 7.39205980301, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 2/11718 of epoch 0 (378.51 images/sec)
INFO:resnet50_trainer:Training loss: 0.0, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 3/11718 of epoch 0 (387.87 images/sec)
INFO:resnet50_trainer:Training loss: 0.0, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 4/11718 of epoch 0 (383.28 images/sec)
INFO:resnet50_trainer:Training loss: 0.0, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 5/11718 of epoch 0 (381.71 images/sec)
...
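The iteration count in the log can be checked by hand: 1499904 images per epoch at a batch size of 128 is exactly 11718 iterations. The sketch below also shows how an epoch size of 1499904 would arise if a requested size of 1500000 were rounded down to a whole number of batches. That requested value is an assumption and does not appear in the log.

```python
batch_size = 128             # from --batch_size 128
logged_epoch_size = 1499904  # from "Using epoch size: 1499904"

iterations = logged_epoch_size // batch_size
print(iterations)  # 11718, matching "iteration 1/11718"

requested_epoch_size = 1500000  # assumption, not shown in the log
rounded = (requested_epoch_size // batch_size) * batch_size
print(rounded)  # 1499904
```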

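Note, however, that accuracy is reported as 1.0 almost from the start, as seen above. This resnet50_trainer.py problem has been discussed on the caffe2 github at the link below, but there is no clear solution yet. It does not get in the way of measuring relative system performance.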

https://github.com/caffe2/caffe2/issues/810

3) pytorch

This time the test uses the pytorch image.

root@8ccd72116fee:~# env | grep PATH
LIBRARY_PATH=/usr/local/cuda/lib64/stubs:
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PATH=/opt/anaconda3/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

First, start the docker image as shown below. Note that the --ipc=host option is used here. The reason is to avoid the hang described at https://discuss.pytorch.org/t/imagenet-example-is-crashing/1363/2 .

root@minsky:~# nvidia-docker run -ti --rm --ipc=host -v /data:/data bsyu/pytorch-ppc64le:v0.1 bash

Run mnist, the simplest example, as shown below. Running 10 epochs takes roughly 1 minute 30 seconds.

root@8ccd72116fee:/data/imsi/examples/mnist# time python main.py --batch-size 512 --epochs 10
...
Train Epoch: 9 [25600/60000 (42%)] Loss: 0.434816
Train Epoch: 9 [30720/60000 (51%)] Loss: 0.417652
Train Epoch: 9 [35840/60000 (59%)] Loss: 0.503125
Train Epoch: 9 [40960/60000 (68%)] Loss: 0.477776
Train Epoch: 9 [46080/60000 (76%)] Loss: 0.346416
Train Epoch: 9 [51200/60000 (85%)] Loss: 0.361492
Train Epoch: 9 [56320/60000 (93%)] Loss: 0.383941

Test set: Average loss: 0.1722, Accuracy: 9470/10000 (95%)

Train Epoch: 10 [0/60000 (0%)] Loss: 0.369119
Train Epoch: 10 [5120/60000 (8%)] Loss: 0.377726
Train Epoch: 10 [10240/60000 (17%)] Loss: 0.402854
Train Epoch: 10 [15360/60000 (25%)] Loss: 0.349409
Train Epoch: 10 [20480/60000 (34%)] Loss: 0.295271
...
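The bracketed progress figures in this log can be reproduced with a little arithmetic. The sketch assumes the percentage is computed from the batch index over the number of batches per epoch, not from the sample counts. That is an inference from the numbers, not something shown above.

```python
import math

dataset_size = 60000   # from the log
batch_size = 512       # from --batch-size 512

batches_per_epoch = math.ceil(dataset_size / batch_size)
print(batches_per_epoch)  # 118


def progress(batch_idx):
    """Format the progress field the way it appears in the log."""
    samples_seen = batch_idx * batch_size
    percent = 100.0 * batch_idx / batches_per_epoch
    return "[{}/{} ({:.0f}%)]".format(samples_seen, dataset_size, percent)


print(progress(50))   # [25600/60000 (42%)]
print(progress(110))  # [56320/60000 (93%)]
```

This explains why 25600 of 60000 samples is shown as 42% and not 43%: 50 of 118 batches is about 42.4%.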

This example uses only a single GPU, though. To use multiple GPUs, the imagenet example below has to be run, and that requires downloading and unpacking the ilsvrc2012 dataset. As shown below, I unpacked that data as JPEG files into /data/imagenet_dir/train and /data/imagenet_dir/val.

root@minsky:/data/imagenet_dir/train# while read SYNSET; do
> mkdir -p ${SYNSET}
> tar xf ../../ILSVRC2012_img_train.tar "${SYNSET}.tar"
> tar xf "${SYNSET}.tar" -C "${SYNSET}"
> rm -f "${SYNSET}.tar"
> done < /opt/DL/caffe-nv/data/ilsvrc12/synsets.txt

root@minsky:/data/imagenet_dir/train# ls -1 | wc -l
1000
root@minsky:/data/imagenet_dir/train# du -sm .
142657 .
root@minsky:/data/imagenet_dir/train# find . | wc -l
1282168

root@minsky:/data/imagenet_dir/val# ls -1 | wc -l
50000

Running main.py in this state produces the error below. The reason is that main.py expects the JPEG files under the val directory to be placed in per-label subdirectories as well.

RuntimeError: Found 0 images in subfolders of: /data/imagenet_dir/val
Supported image extensions are: .jpg,.JPG,.jpeg,.JPEG,.png,.PNG,.ppm,.PPM,.bmp,.BMP

So the JPEG files have to be redistributed into per-label directories using preprocess_imagenet_validation_data.py from the inception directory, as shown below.

root@minsky:/data/models/research/inception/inception/data# python preprocess_imagenet_validation_data.py /data/imagenet_dir/val imagenet_2012_validation_synset_labels.txt
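For readers without that script at hand, the idea can be sketched in a few lines of Python. This is not the inception script itself. The function name is hypothetical, and it assumes that line i of the labels file holds the synset of the i-th validation image and that the images are named ILSVRC2012_val_00000001.JPEG and so on.

```python
import os


def sort_validation_images(val_dir, labels_file):
    """Move each validation JPEG into a subdirectory named after its synset.

    Returns the number of files that were moved.
    """
    with open(labels_file) as f:
        labels = [line.strip() for line in f if line.strip()]

    moved = 0
    for index, synset in enumerate(labels, start=1):
        name = "ILSVRC2012_val_%08d.JPEG" % index
        src = os.path.join(val_dir, name)
        if not os.path.exists(src):
            continue  # already moved on an earlier run, or missing
        dest_dir = os.path.join(val_dir, synset)
        os.makedirs(dest_dir, exist_ok=True)
        os.rename(src, os.path.join(dest_dir, name))
        moved += 1
    return moved
```

Because files that are no longer in the top-level directory are skipped, running the function a second time is harmless.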

Looking again, the files have now been redistributed by label.

root@minsky:/data/imagenet_dir/val# ls | head -n 3
n01440764
n01443537
n01484850

root@minsky:/data/imagenet_dir/val# ls | wc -l
1000
root@minsky:/data/imagenet_dir/val# find . | wc -l
51001
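The find counts are consistent with this layout: find prints the directory itself, each class directory, and each file. The training image count used below (1281167) is the commonly cited ILSVRC2012 size. It is not printed anywhere above, so treat it as an assumption.

```python
num_classes = 1000
train_images = 1281167  # assumption: standard ILSVRC2012 training set size
val_images = 50000      # matches the earlier `ls -1 | wc -l` in val

# find prints ".", every class directory, and every image file.
train_entries = 1 + num_classes + train_images
val_entries = 1 + num_classes + val_images
print(train_entries)  # 1282168
print(val_entries)    # 51001
```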

Now main.py can be run as follows.

root@8ccd72116fee:~# cd /data/imsi/examples/imagenet

root@8ccd72116fee:/data/imsi/examples/imagenet# time python main.py -a resnet18 --epochs 1 /data/imagenet_dir
=> creating model 'resnet18'
Epoch: [0][0/5005] Time 11.237 (11.237) Data 2.330 (2.330) Loss 7.0071 (7.0071) Prec@1 0.391 (0.391) Prec@5 0.391 (0.391)
Epoch: [0][10/5005] Time 0.139 (1.239) Data 0.069 (0.340) Loss 7.1214 (7.0515) Prec@1 0.000 (0.284) Prec@5 0.000 (1.065)
Epoch: [0][20/5005] Time 0.119 (0.854) Data 0.056 (0.342) Loss 7.1925 (7.0798) Prec@1 0.000 (0.260) Prec@5 0.781 (0.930)
...
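The 5005 in [0/5005] is the number of batches per epoch. It matches a batch size of 256 over 1281167 training images. Both figures are assumptions here, since neither the batch size nor the image count is printed above. The first Prec@1 value of 0.391 is also what one correct prediction out of 256 would give.

```python
import math

train_images = 1281167  # assumption: standard ILSVRC2012 training set size
batch_size = 256        # assumption: not given on the command line above

batches = math.ceil(train_images / batch_size)
one_hit_percent = round(100.0 / batch_size, 3)
print(batches)          # 5005
print(one_hit_percent)  # 0.391
```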

* The docker images used above were backed up as follows.

root@minsky:/data/docker_save# docker save --output caffe2-ppc64le.v0.3.tar bsyu/caffe2-ppc64le:v0.3
root@minsky:/data/docker_save# docker save --output pytorch-ppc64le.v0.1.tar bsyu/pytorch-ppc64le:v0.1
root@minsky:/data/docker_save# docker save --output tf1.3-ppc64le.v0.1.tar bsyu/tf1.3-ppc64le:v0.1
root@minsky:/data/docker_save# docker save --output cudnn6-conda2-ppc64le.v0.1.tar bsyu/cudnn6-conda2-ppc64le:v0.1
root@minsky:/data/docker_save# docker save --output cudnn6-conda3-ppc64le.v0.1.tar bsyu/cudnn6-conda3-ppc64le:v0.1

root@minsky:/data/docker_save# ls -l
total 28023280
-rw------- 1 root root 4713168896 Nov 10 16:48 caffe2-ppc64le.v0.3.tar
-rw------- 1 root root 4218520064 Nov 10 17:10 cudnn6-conda2-ppc64le.v0.1.tar
-rw------- 1 root root 5272141312 Nov 10 17:11 cudnn6-conda3-ppc64le.v0.1.tar
-rw------- 1 root root 6921727488 Nov 10 16:51 pytorch-ppc64le.v0.1.tar
-rw------- 1 root root 7570257920 Nov 10 16:55 tf1.3-ppc64le.v0.1.tar

In an emergency, these images can be restored with the docker load command.

(Example) docker load --input caffe2-ppc64le.v0.3.tar
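When there are many images, the save and load command lines can be derived from the image names. The helpers below are a hypothetical sketch that only builds the argument lists following the file naming used above. They do not call docker, and the "latest" fallback for untagged images is an assumption.

```python
def backup_filename(image):
    """Map 'repo/name:tag' to 'name.tag.tar', as in the listing above."""
    name_and_tag = image.rsplit("/", 1)[-1]
    name, _, tag = name_and_tag.partition(":")
    return "{}.{}.tar".format(name, tag or "latest")


def save_command(image):
    """Argument list for backing up one image."""
    return ["docker", "save", "--output", backup_filename(image), image]


def load_command(image):
    """Argument list for restoring the backup of one image."""
    return ["docker", "load", "--input", backup_filename(image)]


print(" ".join(save_command("bsyu/caffe2-ppc64le:v0.3")))
print(" ".join(load_command("bsyu/caffe2-ppc64le:v0.3")))
```

The resulting lists can be passed to subprocess.run if you want to execute them from a script.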