Deep Learning for HW Engineers: December 2017

Thursday, December 28, 2017

How to set up crashdump on Ubuntu 16.04 ppc64le

Details on configuring crashdump are in the Ubuntu Server Guide: https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html

First, search /proc/cmdline and the dmesg output for the word "crash", as shown below. If nothing comes back, crashdump is not enabled.

u0017649@sys-90592:~$ cat /proc/cmdline | grep crash

u0017649@sys-90592:~$ dmesg | grep -i crash

To enable crashdump, install linux-crashdump. During installation you are asked whether to enable kexec-tools and kdump-tools, as shown below; answer Yes to both.

u0017649@sys-90592:~$ sudo apt install linux-crashdump

┌────────────────────────┤ Configuring kexec-tools ├────────────────────────┐
│ │
│ If you choose this option, a system reboot will trigger a restart into a │
│ kernel loaded by kexec instead of going through the full system boot │
│ loader process. │
│ │
│ Should kexec-tools handle reboots? │
│ │
│ <Yes> <No> │
│ │
└───────────────────────────────────────────────────────────────────────────┘

┌────────────────────────┤ Configuring kdump-tools ├────────────────────────┐
│ │
│ If you choose this option, the kdump-tools mechanism will be enabled. A │
│ reboot is still required in order to enable the crashkernel kernel │
│ parameter. │
│ │
│ Should kdump-tools be enabled by default? │
│ │
│ <Yes> <No> │
│ │
└───────────────────────────────────────────────────────────────────────────┘

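This is probably not the case in a customer's Minsky environment, but if the file below has USE_GRUB_CONFIG=false, as it does in the PDP (Power Development Platform) cloud environment I use, change it to true. Otherwise the crashdump setting does not take effect.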

u0017649@sys-90592:~$ sudo vi /etc/default/kexec
...
#USE_GRUB_CONFIG=false
USE_GRUB_CONFIG=true
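
If you would rather script this change than edit the file in vi, here is a minimal Python sketch. The helper name set_option is my own (hypothetical), and it assumes the file consists of plain KEY=value lines like the ones shown above. It works on the file contents as a string, so you can check the result before writing anything back as root.

```python
import re


def set_option(text: str, key: str, value: str) -> str:
    """Return `text` with an active `key=value` line.

    An existing active assignment is rewritten. Commented-out lines are
    left untouched. If no active assignment exists, one is appended.
    """
    pattern = re.compile(rf"^{re.escape(key)}=.*$", re.MULTILINE)
    new_line = f"{key}={value}"
    if pattern.search(text):
        # Use a function as the replacement so `value` is taken literally.
        return pattern.sub(lambda match: new_line, text)
    if text and not text.endswith("\n"):
        text += "\n"
    return text + new_line + "\n"


# Sample content modelled on the setting discussed above.
sample = "USE_GRUB_CONFIG=false\n"
print(set_option(sample, "USE_GRUB_CONFIG", "true"), end="")
```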

Now reboot the system.

u0017649@sys-90592:~$ sudo shutdown -r now

Once the system is back up, confirm that there is nothing under /var/crash.

u0017649@sys-90592:~$ ls /var/crash

Also confirm that the crashkernel entries, which were absent earlier, now appear in /proc/cmdline and dmesg.

u0017649@sys-90592:~$ cat /proc/cmdline | grep crash
root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet crashkernel=384M-2G:128M,2G-:256M

u0017649@sys-90592:~$ dmesg | grep -i crash
[ 0.000000] Reserving 256MB of memory at 128MB for crashkernel (System RAM: 4096MB)
[ 0.000000] Kernel command line: root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet crashkernel=384M-2G:128M,2G-:256M
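
The crashkernel value above uses the kernel's ranged syntax: each comma-separated entry is a RAM range followed by the amount of memory to reserve. The sketch below works through that arithmetic for this system. The function names are hypothetical, only the ranged form is handled, and treating the upper bound of each range as exclusive is my reading of the kernel documentation.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(token: str) -> int:
    """Convert a size such as '384M' or '2G' into bytes."""
    token = token.strip()
    if token and token[-1].upper() in UNITS:
        return int(token[:-1]) * UNITS[token[-1].upper()]
    return int(token)


def crashkernel_reservation(spec: str, system_ram: int) -> int:
    """Return the bytes reserved by a ranged crashkernel= value.

    `spec` is the text after 'crashkernel='. Returns 0 if no range matches.
    """
    spec = spec.split("@", 1)[0]  # drop an optional @offset suffix
    for entry in spec.split(","):
        ram_range, size = entry.split(":")
        start, _, end = ram_range.partition("-")
        low = parse_size(start)
        high = parse_size(end) if end else None  # an open-ended range has no end
        if system_ram >= low and (high is None or system_ram < high):
            return parse_size(size)
    return 0


spec = "384M-2G:128M,2G-:256M"   # value taken from /proc/cmdline above
ram = 4096 * UNITS["M"]          # "System RAM: 4096MB" in the dmesg output
print(crashkernel_reservation(spec, ram) // UNITS["M"], "MB")
```

With 4096MB of RAM the 2G- entry matches, which agrees with the 256MB that dmesg reports above.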

Now check whether crashdump is ready with the kdump-config show command.

u0017649@sys-90592:~$ sudo kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.4.0-104-generic
kdump initrd:
/var/lib/kdump/initrd.img: broken symbolic link to /var/lib/kdump/initrd.img-4.4.0-104-generic
current state: Not ready to kdump

Here the state is Not ready. The reason is that in my PDP (Power Development Platform) cloud environment, /var/lib/kdump contains initrd.img-4.4.0-98-generic instead of the /var/lib/kdump/initrd.img-4.4.0-104-generic that the symbolic link points to. A customer's Minsky server will probably not have this problem. Since it is specific to PDP, I will simply work around it by copying /var/lib/kdump/initrd.img-4.4.0-98-generic to the name /var/lib/kdump/initrd.img-4.4.0-104-generic.

u0017649@sys-90592:~$ sudo cp /var/lib/kdump/initrd.img-4.4.0-98-generic /var/lib/kdump/initrd.img-4.4.0-104-generic

Now run kdump-config load and then kdump-config show again. The state changes to ready to kdump.

u0017649@sys-90592:~$ sudo kdump-config load
Modified cmdline:root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service elfcorehdr=156864K
* loaded kdump kernel

u0017649@sys-90592:~$ sudo kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.4.0-104-generic
kdump initrd:
/var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.4.0-104-generic
current state: ready to kdump

kexec command:
/sbin/kexec -p --command-line="root=UUID=2a159e60-be84-4802-9bf1-bdbcf457a39e ro splash quiet irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
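
When there are many servers to check, the state can be pulled out of the kdump-config show output with a script. The following is a rough sketch that only looks for the two patterns seen in the output above (the current state line and a broken symbolic link line). The function name kdump_status is hypothetical, and other versions of kdump-tools may print something different.

```python
def kdump_status(show_output: str) -> dict:
    """Summarise `kdump-config show` output: readiness and broken symlinks."""
    state = None
    broken_links = []
    for line in show_output.splitlines():
        line = line.strip()
        if line.startswith("current state:"):
            state = line.split(":", 1)[1].strip()
        elif ": broken symbolic link to " in line:
            # The text before the first colon is the path of the link itself.
            broken_links.append(line.split(":", 1)[0])
    ready = state is not None and state.lower() == "ready to kdump"
    return {"ready": ready, "state": state, "broken_links": broken_links}


# Lines copied from the first `kdump-config show` output above.
not_ready = """kdump initrd:
/var/lib/kdump/initrd.img: broken symbolic link to /var/lib/kdump/initrd.img-4.4.0-104-generic
current state: Not ready to kdump
"""
print(kdump_status(not_ready))
```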

crashdump can be tested as follows. Note that this test crashes the system, and the reboot may not complete cleanly, so be sure to secure console access through IPMI, an RGB monitor, or similar before you try it.

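Log in as root with "su -" rather than using sudo, then issue the following command. The system crashes and writes a dump to the /var/crash directory. (If you prefix the same command with sudo, the redirection into /proc/sysrq-trigger is still performed by your own unprivileged shell, so the write is denied.)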

root@sys-90592:~# echo c > /proc/sysrq-trigger

After the reboot finishes, you will find a crash file like the following.

# ls /var/crash
linux-image-3.0.0-12-server.0.crash

According to the official Ubuntu documentation, this file can be analyzed by following https://www.dedoimedo.com/computers/crash-analyze.html , but I have no experience with it, and it does not look easy.

Separately, if you need to collect system information to send to your technical support provider, sosreport is a good choice. It originated on Red Hat but is also available on Ubuntu. First install sosreport as follows.

u0017649@sys-90528:~$ sudo apt-get install sosreport

Then simply run sosreport (it must be run with root privileges). The result is created under /tmp in tar.xz format.

u0017649@sys-90528:~$ sudo sosreport
...
Please enter your first initial and last name [sys-90528]:
Please enter the case id that you are generating this report for []:
Setting up archive ...
Setting up plugins ...
Running plugins. Please wait ...

Running 23/61: kernel...
Creating compressed archive...

Your sosreport has been generated and saved in:
/tmp/sosreport-sys-90528-20171227231646.tar.xz

The checksum is: 939ce2d2b6a254fbb576a2f6728b5678

Please send this file to your support representative.

u0017649@sys-90528:~$ ls -l /tmp/sosreport-sys-90528-20171227231646.tar.xz
-rw------- 1 root root 5231340 Dec 27 23:17 /tmp/sosreport-sys-90528-20171227231646.tar.xz
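
Before sending the archive, you may want to confirm it against the checksum that sosreport printed. The checksum above is 32 hexadecimal digits long, which suggests MD5; that is an assumption on my part, not something the output states. Under that assumption, a minimal Python sketch looks like this (md5_of_file is a hypothetical helper name):

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the archive is readable only by root, run it as root and compare the result of md5_of_file("/tmp/sosreport-sys-90528-20171227231646.tar.xz") with the checksum shown above.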

The contents of that file are nothing special; it simply gathers the system's main config files.

u0017649@sys-90528:~$ sudo tar -tf /tmp/sosreport-sys-90528-20171227231646.tar.xz | more

...

sosreport-sys-90528-20171227231646/lib/udev/rules.d/60-cdrom_id.rules
sosreport-sys-90528-20171227231646/lib/udev/rules.d/80-docker-ce.rules
sosreport-sys-90528-20171227231646/lib/nvidia-361/
sosreport-sys-90528-20171227231646/lib/nvidia-361/modprobe.conf
sosreport-sys-90528-20171227231646/lib/modules/
sosreport-sys-90528-20171227231646/lib/modules/4.4.0-98-generic/
sosreport-sys-90528-20171227231646/lib/modules/4.4.0-98-generic/modules.dep
...

sosreport-sys-90528-20171227231646/etc/rcS.d/S03udev
sosreport-sys-90528-20171227231646/etc/rcS.d/S02plymouth-log
sosreport-sys-90528-20171227231646/etc/rcS.d/S08checkfs.sh

Monday, December 18, 2017

An open source anti-virus SW available on ppc64le: ClamAV

There is anti-virus SW that also runs on ppc64le (IBM POWER8), the architecture of Minsky: Clam AntiVirus (ClamAV).

ClamAV is open source anti-virus SW. Its home page is below, and the source can be downloaded there as well.

http://www.clamav.net/

Building it on ppc64le is also very simple: ./configure && make && sudo make install is all it takes.

On Ubuntu, however, which is what deep learning work mostly uses, it is already included in the OS's standard apt repository, so it is easy to install and use.

Install it with the apt-get install command as follows.

u0017649@sys-89983:~$ sudo apt-get install clamav clamav-daemon clamav-freshclam clamav-base libclamav-dev clamav-testfiles

Start clamav-daemon as follows.

u0017649@sys-89983:~$ sudo systemctl start clamav-daemon.service

u0017649@sys-89983:~$ sudo systemctl status clamav-daemon.service
● clamav-daemon.service - Clam AntiVirus userspace daemon
Loaded: loaded (/lib/systemd/system/clamav-daemon.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2017-12-17 21:07:57 EST; 4s ago
Docs: man:clamd(8)
man:clamd.conf(5)
http://www.clamav.net/lang/en/doc/
Main PID: 9462 (clamd)
Tasks: 1
Memory: 234.0M
CPU: 4.121s
CGroup: /system.slice/clamav-daemon.service
└─9462 /usr/sbin/clamd --foreground=true

To scan the contents of the directory /home/u0017649/hpcc-1.5.0 and sound a bell if any virus-infected file is found, use the following command. Here -r scans recursively and -i prints only infected files.

u0017649@sys-89983:~$ clamscan -r --bell -i /home/u0017649/hpcc-1.5.0

----------- SCAN SUMMARY -----------
Known viruses: 6366898
Engine version: 0.99.2
Scanned directories: 34
Scanned files: 737
Infected files: 0
Data scanned: 9.64 MB
Data read: 6.11 MB (ratio 1.58:1)
Time: 18.255 sec (0 m 18 s)

If you also want infected files removed automatically, use the --remove option as shown below. Since it is really hard to get hold of a virus-infected file on the ppc64le architecture, I will download the clamav source code from the clamav home page mentioned above and scan that source instead. It appears to contain some files included for testing.

u0017649@sys-89983:~$ tar -zxf clamav-0.99.2.tar.gz

u0017649@sys-89983:~$ clamscan -r --remove /home/u0017649/clamav-0.99.2 > clamscan.out

As shown below, four files were reported as infected and removed.

u0017649@sys-89983:~$ grep -i removed clamscan.out
/home/u0017649/clamav-0.99.2/test/.split/split.clam_IScab_int.exeaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clam.isoaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clam_IScab_ext.exeaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clamjol.isoaa: Removed.

u0017649@sys-89983:~$ tail clamscan.out

----------- SCAN SUMMARY -----------
Known viruses: 6366898
Engine version: 0.99.2
Scanned directories: 227
Scanned files: 3231
Infected files: 4
Data scanned: 93.26 MB
Data read: 50.67 MB (ratio 1.84:1)
Time: 28.488 sec (0 m 28 s)
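
If you save the clamscan output to a file as above, it is easy to cross-check the removed files against the summary. Here is a small sketch that relies only on the two line formats visible in this output (lines ending in ": Removed." and the "Infected files:" line). The name parse_clamscan is hypothetical.

```python
def parse_clamscan(output: str) -> dict:
    """Collect removed paths and the 'Infected files' count from clamscan output."""
    suffix = ": Removed."
    removed = []
    infected = None
    for line in output.splitlines():
        line = line.strip()
        if line.endswith(suffix):
            removed.append(line[: -len(suffix)])
        elif line.startswith("Infected files:"):
            infected = int(line.split(":", 1)[1])
    return {"removed": removed, "infected": infected}


# Lines copied from clamscan.out above.
sample = """/home/u0017649/clamav-0.99.2/test/.split/split.clam_IScab_int.exeaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clam.isoaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clam_IScab_ext.exeaa: Removed.
/home/u0017649/clamav-0.99.2/test/.split/split.clamjol.isoaa: Removed.
Infected files: 4
"""
result = parse_clamscan(sample)
print(len(result["removed"]), result["infected"])
```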

For anti-virus SW it is important that the virus list keeps getting updated. That is the job of freshclam. It runs automatically once installed, and its log can be checked as follows.

u0017649@sys-89983:~$ sudo tail -f /var/log/clamav/freshclam.log
Sun Dec 17 20:57:04 2017 -> ClamAV update process started at Sun Dec 17 20:57:04 2017
Sun Dec 17 20:58:02 2017 -> Downloading main.cvd [100%]
Sun Dec 17 20:58:13 2017 -> main.cvd updated (version: 58, sigs: 4566249, f-level: 60, builder: sigmgr)
Sun Dec 17 20:58:35 2017 -> Downloading daily.cvd [100%]
Sun Dec 17 20:58:39 2017 -> daily.cvd updated (version: 24138, sigs: 1806393, f-level: 63, builder: neo)
Sun Dec 17 20:58:40 2017 -> Downloading bytecode.cvd [100%]
Sun Dec 17 20:58:40 2017 -> bytecode.cvd updated (version: 319, sigs: 75, f-level: 63, builder: neo)
Sun Dec 17 20:58:44 2017 -> Database updated (6372717 signatures) from db.local.clamav.net (IP: 157.131.0.17)
Sun Dec 17 20:58:44 2017 -> WARNING: Clamd was NOT notified: Can't connect to clamd through /var/run/clamav/clamd.ctl: No such file or directory
Sun Dec 17 20:58:44 2017 -> --------------------------------------

The log above shows that clamd was not notified because the clamd.ctl file did not exist. That file is created automatically when clamav-daemon first starts; the problem appears to have occurred because freshclam ran before I started clamav-daemon with the 'systemctl start clamav-daemon.service' command above. Now that clamav-daemon is running, stopping and restarting freshclam as follows resolves it.

u0017649@sys-89983:~$ ps -ef | grep freshclam
clamav 8894 1 1 20:57 ? 00:00:10 /usr/bin/freshclam -d --foreground=true
u0017649 9473 31958 0 21:08 pts/0 00:00:00 grep --color=auto freshclam

u0017649@sys-89983:~$ sudo kill -9 8894

u0017649@sys-89983:~$ sudo /usr/bin/freshclam -d --foreground=false

Above, freshclam was restarted as a background daemon. Let's look at the log again.

u0017649@sys-89983:~$ sudo tail -f /var/log/clamav/freshclam.log
Sun Dec 17 20:58:44 2017 -> Database updated (6372717 signatures) from db.local.clamav.net (IP: 157.131.0.17)
Sun Dec 17 20:58:44 2017 -> WARNING: Clamd was NOT notified: Can't connect to clamd through /var/run/clamav/clamd.ctl: No such file or directory
Sun Dec 17 20:58:44 2017 -> --------------------------------------
Sun Dec 17 21:09:37 2017 -> --------------------------------------
Sun Dec 17 21:09:37 2017 -> freshclam daemon 0.99.2 (OS: linux-gnu, ARCH: ppc, CPU: powerpc64le)
Sun Dec 17 21:09:37 2017 -> ClamAV update process started at Sun Dec 17 21:09:37 2017
Sun Dec 17 21:09:37 2017 -> main.cvd is up to date (version: 58, sigs: 4566249, f-level: 60, builder: sigmgr)
Sun Dec 17 21:09:37 2017 -> daily.cvd is up to date (version: 24138, sigs: 1806393, f-level: 63, builder: neo)
Sun Dec 17 21:09:37 2017 -> bytecode.cvd is up to date (version: 319, sigs: 75, f-level: 63, builder: neo)
Sun Dec 17 21:09:37 2017 -> --------------------------------------

You can see that the update now completes without errors.
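
Since the earlier warning came from the missing socket file, it can help to check that /var/run/clamav/clamd.ctl exists and really is a socket before restarting freshclam. A minimal sketch follows; is_unix_socket is a hypothetical helper, and the path is the one named in the log above.

```python
import os
import stat


def is_unix_socket(path: str) -> bool:
    """Return True if `path` exists and is a UNIX domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Missing file, permission problem, and so on.
        return False


# Path taken from the freshclam warning above.
print(is_unix_socket("/var/run/clamav/clamd.ctl"))
```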

The clamconf command checks the various clamav config files. Its output is shown below.

u0017649@sys-89983:~$ clamconf
Checking configuration files in /etc/clamav

Config file: clamd.conf
-----------------------
LogFile = "/var/log/clamav/clamav.log"
StatsHostID = "auto"
StatsEnabled disabled
StatsPEDisabled = "yes"
StatsTimeout = "10"
LogFileUnlock disabled
LogFileMaxSize = "4294967295"
LogTime = "yes"
LogClean disabled
LogSyslog disabled
LogFacility = "LOG_LOCAL6"
LogVerbose disabled
LogRotate = "yes"
ExtendedDetectionInfo = "yes"
PidFile disabled
TemporaryDirectory disabled
DatabaseDirectory = "/var/lib/clamav"
OfficialDatabaseOnly disabled
LocalSocket = "/var/run/clamav/clamd.ctl"
LocalSocketGroup = "clamav"
LocalSocketMode = "666"
FixStaleSocket = "yes"
TCPSocket disabled
TCPAddr disabled
MaxConnectionQueueLength = "15"
StreamMaxLength = "26214400"
StreamMinPort = "1024"
StreamMaxPort = "2048"
MaxThreads = "12"
ReadTimeout = "180"
CommandReadTimeout = "5"
SendBufTimeout = "200"
MaxQueue = "100"
IdleTimeout = "30"
ExcludePath disabled
MaxDirectoryRecursion = "15"
FollowDirectorySymlinks disabled
FollowFileSymlinks disabled
CrossFilesystems = "yes"
SelfCheck = "3600"
DisableCache disabled
VirusEvent disabled
ExitOnOOM disabled
AllowAllMatchScan = "yes"
Foreground disabled
Debug disabled
LeaveTemporaryFiles disabled
User = "clamav"
AllowSupplementaryGroups disabled
Bytecode = "yes"
BytecodeSecurity = "TrustSigned"
BytecodeTimeout = "60000"
BytecodeUnsigned disabled
BytecodeMode = "Auto"
DetectPUA disabled
ExcludePUA disabled
IncludePUA disabled
AlgorithmicDetection = "yes"
ScanPE = "yes"
ScanELF = "yes"
DetectBrokenExecutables disabled
ScanMail = "yes"
ScanPartialMessages disabled
PhishingSignatures = "yes"
PhishingScanURLs = "yes"
PhishingAlwaysBlockCloak disabled
PhishingAlwaysBlockSSLMismatch disabled
PartitionIntersection disabled
HeuristicScanPrecedence disabled
StructuredDataDetection disabled
StructuredMinCreditCardCount = "3"
StructuredMinSSNCount = "3"
StructuredSSNFormatNormal = "yes"
StructuredSSNFormatStripped disabled
ScanHTML = "yes"
ScanOLE2 = "yes"
OLE2BlockMacros disabled
ScanPDF = "yes"
ScanSWF = "yes"
ScanXMLDOCS = "yes"
ScanHWP3 = "yes"
ScanArchive = "yes"
ArchiveBlockEncrypted disabled
ForceToDisk disabled
MaxScanSize = "104857600"
MaxFileSize = "26214400"
MaxRecursion = "16"
MaxFiles = "10000"
MaxEmbeddedPE = "10485760"
MaxHTMLNormalize = "10485760"
MaxHTMLNoTags = "2097152"
MaxScriptNormalize = "5242880"
MaxZipTypeRcg = "1048576"
MaxPartitions = "50"
MaxIconsPE = "100"
MaxRecHWP3 = "16"
PCREMatchLimit = "10000"
PCRERecMatchLimit = "5000"
PCREMaxFileSize = "26214400"
ScanOnAccess disabled
OnAccessMountPath disabled
OnAccessIncludePath disabled
OnAccessExcludePath disabled
OnAccessExcludeUID disabled
OnAccessMaxFileSize = "5242880"
OnAccessDisableDDD disabled
OnAccessPrevention disabled
OnAccessExtraScanning disabled
DevACOnly disabled
DevACDepth disabled
DevPerformance disabled
DevLiblog disabled
DisableCertCheck disabled

Config file: freshclam.conf
---------------------------
StatsHostID disabled
StatsEnabled disabled
StatsTimeout disabled
LogFileMaxSize = "4294967295"
LogTime = "yes"
LogSyslog disabled
LogFacility = "LOG_LOCAL6"
LogVerbose disabled
LogRotate = "yes"
PidFile disabled
DatabaseDirectory = "/var/lib/clamav"
Foreground disabled
Debug disabled
AllowSupplementaryGroups disabled
UpdateLogFile = "/var/log/clamav/freshclam.log"
DatabaseOwner = "clamav"
Checks = "24"
DNSDatabaseInfo = "current.cvd.clamav.net"
DatabaseMirror = "db.local.clamav.net", "database.clamav.net"
PrivateMirror disabled
MaxAttempts = "5"
ScriptedUpdates = "yes"
TestDatabases = "yes"
CompressLocalDatabase disabled
ExtraDatabase disabled
DatabaseCustomURL disabled
HTTPProxyServer disabled
HTTPProxyPort disabled
HTTPProxyUsername disabled
HTTPProxyPassword disabled
HTTPUserAgent disabled
NotifyClamd = "/etc/clamav/clamd.conf"
OnUpdateExecute disabled
OnErrorExecute disabled
OnOutdatedExecute disabled
LocalIPAddress disabled
ConnectTimeout = "30"
ReceiveTimeout = "30"
SubmitDetectionStats disabled
DetectionStatsCountry disabled
DetectionStatsHostID disabled
SafeBrowsing disabled
Bytecode = "yes"

clamav-milter.conf not found

Software settings
-----------------
Version: 0.99.2
Optional features supported: MEMPOOL IPv6 FRESHCLAM_DNS_FIX AUTOIT_EA06 BZIP2 LIBXML2 PCRE ICONV JSON

Database information
--------------------
Database directory: /var/lib/clamav
bytecode.cvd: version 319, sigs: 75, built on Wed Dec 6 21:17:11 2017
main.cvd: version 58, sigs: 4566249, built on Wed Jun 7 17:38:10 2017
daily.cvd: version 24138, sigs: 1806393, built on Sun Dec 17 16:10:39 2017
Total number of signatures: 6372717

Platform information
--------------------
uname: Linux 4.4.0-103-generic #126-Ubuntu SMP Mon Dec 4 16:22:09 UTC 2017 ppc64le
OS: linux-gnu, ARCH: ppc, CPU: powerpc64le
Full OS version: Ubuntu 16.04.2 LTS
zlib version: 1.2.8 (1.2.8), compile flags: a9
platform id: 0x0a3152520800000000050400

Build information
-----------------
GNU C: 5.4.0 20160609 (5.4.0)
CPPFLAGS: -Wdate-time -D_FORTIFY_SOURCE=2
CFLAGS: -g -O3 -fstack-protector-strong -Wformat -Werror=format-security -Wall -D_FILE_OFFSET_BITS=64 -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE
CXXFLAGS:
LDFLAGS: -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,--as-needed
Configure: '--build=powerpc64le-linux-gnu' '--prefix=/usr' '--includedir=/usr/include' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libexecdir=/usr/lib/clamav' '--disable-maintainer-mode' '--disable-dependency-tracking' 'CFLAGS=-g -O3 -fstack-protector-strong -Wformat -Werror=format-security -Wall -D_FILE_OFFSET_BITS=64' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O3 -fstack-protector-strong -Wformat -Werror=format-security -Wall -D_FILE_OFFSET_BITS=64' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,--as-needed' '--with-dbdir=/var/lib/clamav' '--sysconfdir=/etc/clamav' '--disable-clamav' '--disable-unrar' '--enable-milter' '--enable-dns-fix' '--with-libjson' '--with-gnu-ld' '--with-systemdsystemunitdir=/lib/systemd/system' 'build_alias=powerpc64le-linux-gnu'
sizeof(void*) = 8
Engine flevel: 82, dconf: 82

Thursday, December 14, 2017

How to run HPCC on a ppc64le architecture cluster

HPCC is a suite of seven HPC codes, including HPL (High Performance LINPACK), that provides a simple way to measure supercomputer performance. Its home page is below.

http://icl.cs.utk.edu/hpcc/software/index.html

The information there alone is not enough to compile and run it easily, but the HPL how-to at the site below makes things somewhat clearer.

http://www.crc.nd.edu/~rich/CRC_Summer_Scholars_2014/HPL-HowTo.pdf

Understanding what the tests actually do takes some mathematical background, but a system engineer can get them running without it. Below I have organized, step by step, how to run it on the ppc64le architecture, that is, an IBM POWER8 processor environment. In fact, the procedure on ppc64le is not particularly different from the one on x86.

Here I used two 1-core ppc64le Ubuntu 16.04 virtual machines in the PDP (Power Development Platform) environment.
(* Power Development Platform: apply at https://www-356.ibm.com/partnerworld/wps/servlet/ContentHandler/stg_com_sys_power-development-platform to get a 1-core Linux on POWER environment free for two weeks. After two weeks you can reapply, again for free. Up to 5 VMs can be requested at once.)

openmpi and BLAS must be installed beforehand, as shown below. They are easily installed with the apt-get install libopenmpi-dev libblas-dev command.

u0017649@sys-90393:~/hpcc-1.5.0$ dpkg -l | grep openmpi
ii libopenmpi-dev 1.10.2-8ubuntu1 ppc64el high performance message passing library -- header files
ii libopenmpi1.10 1.10.2-8ubuntu1 ppc64el high performance message passing library -- shared library
ii openmpi-bin 1.10.2-8ubuntu1 ppc64el high performance message passing library -- binaries
ii openmpi-common 1.10.2-8ubuntu1 all high performance message passing library -- common files

u0017649@sys-90393:~/hpcc-1.5.0$ dpkg -l | grep blas
ii libblas-common 3.6.0-2ubuntu2 ppc64el Dependency package for all BLAS implementations
ii libblas-dev 3.6.0-2ubuntu2 ppc64el Basic Linear Algebra Subroutines 3, static library
ii libblas3 3.6.0-2ubuntu2 ppc64el Basic Linear Algebra Reference implementations, shared library

First download the source and extract the tar file.

u0017649@sys-90393:~$ wget http://icl.cs.utk.edu/projectsfiles/hpcc/download/hpcc-1.5.0.tar.gz

u0017649@sys-90393:~$ tar -zxf hpcc-1.5.0.tar.gz
u0017649@sys-90393:~$ cd hpcc-1.5.0

Next, run make_generic in the hpl/setup directory to generate Make.UNKNOWN. This produces a Makefile with values that roughly suit this environment.

u0017649@sys-90393:~/hpcc-1.5.0$ cd hpl/setup

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ sh make_generic

The generated Make.UNKNOWN contains the following.

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ grep -v \# Make.UNKNOWN
SHELL = /bin/sh
CD = cd
CP = cp
LN_S = ln -s
MKDIR = mkdir
RM = /bin/rm -f
TOUCH = touch
ARCH = $(arch)
TOPdir = ../../..
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
HPLlib = $(LIBdir)/libhpl.a
MPdir =
MPinc =
MPlib =
LAdir =
LAinc =
LAlib = -lblas
F2CDEFS = -DAdd_ -DF77_INTEGER=int -DStringSunStyle
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib) -lm
HPL_OPTS =
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CC = mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS)
LINKER = mpif77
LINKFLAGS =
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo

Now copy this Make.UNKNOWN to the parent hpl directory under the name Make.Linux.

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ cp Make.UNKNOWN ../Make.Linux

Then go up to the top directory, hpcc-1.5.0, and run make arch=Linux. This can be a little confusing: note that make is run not in the hpl directory where Make.Linux was copied, but in the hpcc-1.5.0 directory above it. mpicc then runs as shown below and builds all seven HPC codes.

u0017649@sys-90393:~/hpcc-1.5.0/hpl/setup$ cd ../..

u0017649@sys-90393:~/hpcc-1.5.0$ make arch=Linux
...
mpicc -o ../../../../FFT/wrapfftw.o -c ../../../../FFT/wrapfftw.c -I../../../../include -DAdd_ -DF77_INTEGER=int -DStringSunStyle -I../../../include -I../../../include/Linux
mpicc -o ../../../../FFT/wrapmpifftw.o -c ../../../../FFT/wrapmpifftw.c -I../../../../include -DAdd_ -DF77_INTEGER=int -DStringSunStyle -I../../../include -I../../../include/Linux
...
ar: creating ../../../lib/Linux/libhpl.a
echo ../../../lib/Linux/libhpl.a
../../../lib/Linux/libhpl.a
mpif77 -o ../../../../hpcc ../../../lib/Linux/libhpl.a -lblas -lm
make[1]: Leaving directory '/home/u0017649/hpcc-1.5.0/hpl/lib/arch/build'

As a result, an executable named hpcc is created in the hpcc-1.5.0 directory.

u0017649@sys-90393:~/hpcc-1.5.0$ file hpcc
hpcc: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), dynamically linked, interpreter /opt/at10.0/lib64/ld64.so.2, for GNU/Linux 4.4.0, BuildID[sha1]=b47fb43c4d96819e25da7469049a780f8251458b, not stripped

Running this hpcc file runs all seven HPC codes in sequence. Before doing so, set LD_LIBRARY_PATH as follows.

u0017649@sys-90393:~/hpcc-1.5.0$ export LD_LIBRARY_PATH=/usr/lib:/usr/lib/powerpc64le-linux-gnu:$LD_LIBRARY_PATH

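You also need to create the INPUT data file. You could copy the provided _hpccinf.txt to hpccinf.txt and use it as is, but here I will follow http://www.netlib.org/benchmark/hpl/tuning.html instead. The INPUT data file has a fixed format in which each line number carries a specific piece of information, so it must be entered exactly in that format; the meaning of each line is explained in the tuning.html page mentioned above. Note, however, that with a P x Q of 2 x 8 as shown there, a total of 16 processors is required to run it.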

u0017649@sys-90393:~/hpcc-1.5.0$ vi hpccinf.txt
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
3000 6000 10000 Ns
5 # of NBs
80 100 120 140 160 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 2 Ps
6 8 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)

Running it as is fails with an error saying that at least 16 processes are needed, as shown below, because the PDP environment I am using has only 1 core.

u0017649@sys-90393:~/hpcc-1.5.0$ ./hpcc
HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 16 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 16 processes for these tests <<<

So I will change lines 11 and 12 above, the P x Q information, to 1 as shown below.

1 1 Ps
1 1 Qs
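
As the error message suggests, HPL needs as many processes as the largest P x Q among the process grids. The sketch below reads lines 10 to 12 of an input file in the format above and computes that number. It is only an illustration under the assumption that the file follows this fixed layout; required_processes is a hypothetical name.

```python
def required_processes(hpccinf_text: str) -> int:
    """Return the largest P*Q over the process grids in an HPL-style input file."""
    lines = hpccinf_text.splitlines()
    grid_count = int(lines[9].split()[0])                        # line 10: number of grids
    ps = [int(tok) for tok in lines[10].split()[:grid_count]]    # line 11: Ps
    qs = [int(tok) for tok in lines[11].split()[:grid_count]]    # line 12: Qs
    return max(p * q for p, q in zip(ps, qs))


# First 12 lines of the original input file shown earlier.
original = """HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
3000 6000 10000 Ns
5 # of NBs
80 100 120 140 160 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 2 Ps
6 8 Qs
"""
print(required_processes(original))
```

With Ps of 1 2 and Qs of 6 8 the grids are 1 x 6 and 2 x 8, so 16 processes are needed; after the change to 1 1 and 1 1, a single process is enough.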

Run as is, this kept going for more than 18 hours. So I stopped it partway and reduced each of the problem sizes on line 6 to 1/10.

300 600 1000 Ns

Now everything is ready for a single-node run. Create hpccinf.txt as follows and run it.

u0017649@sys-90393:~/hpcc-1.5.0$ cat hpccinf.txt
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
300 600 1000 Ns
5 # of NBs
80 100 120 140 160 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 1 Ps
1 1 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
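
One reason the run takes as long as it does, as far as I can tell, is that HPL runs one test for every combination of the values listed in the input file. Assuming that is how the combinations are formed, the count for the file above can be worked out as follows. The dictionary simply restates the counts from the file, and hpl_test_count is a hypothetical name.

```python
from math import prod


def hpl_test_count(counts: dict) -> int:
    """Multiply the number of values given for each HPL parameter."""
    return prod(counts.values())


# Number of values listed for each parameter in hpccinf.txt above.
counts = {
    "Ns": 3,
    "NBs": 5,
    "process grids": 2,
    "PFACTs": 3,
    "NBMINs": 4,
    "NDIVs": 3,
    "RFACTs": 3,
    "BCASTs": 1,
    "DEPTHs": 1,
}
print(hpl_test_count(counts))
```

The product is 3240, which agrees with the "Finished 3240 tests" line in the output further below.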

Running it is just a matter of executing hpcc. With the input data above, it takes about 20 minutes in a 1-core POWER8 environment.

u0017649@sys-90393:~/hpcc-1.5.0$ time ./hpcc

The results accumulate in a file named hpccoutf.txt, about 1.6MB in size; the end of the file looks like this.

...
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11R3R8 1000 160 1 1 0.08 8.655e+00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0055214 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11R4R8 1000 160 1 1 0.07 8.982e+00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0059963 ...... PASSED
================================================================================

Finished 3240 tests with the following results:
3240 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
Current time (1513215290) is Wed Dec 13 20:34:50 2017

End of HPL section.
Begin of Summary section.
VersionMajor=1
VersionMinor=5
VersionMicro=0
VersionRelease=f
LANG=C
Success=1
sizeof_char=1
sizeof_short=2
sizeof_int=4
sizeof_long=8
sizeof_void_ptr=8
sizeof_size_t=8
sizeof_float=4
sizeof_double=8
sizeof_s64Int=8
sizeof_u64Int=8
sizeof_struct_double_double=16
CommWorldProcs=1
MPI_Wtick=1.000000e-06
HPL_Tflops=0.00934813
HPL_time=0.071476
HPL_eps=1.11022e-16
HPL_RnormI=1.71035e-12
HPL_Anorm1=263.865
HPL_AnormI=262.773
HPL_Xnorm1=2619.63
HPL_XnormI=11.3513
HPL_BnormI=0.499776
HPL_N=1000
HPL_NB=80
HPL_nprow=1
HPL_npcol=1
HPL_depth=1
HPL_nbdiv=4
HPL_nbmin=4
HPL_cpfact=R
HPL_crfact=R
HPL_ctop=1
HPL_order=R
HPL_dMACH_EPS=1.110223e-16
HPL_dMACH_SFMIN=2.225074e-308
HPL_dMACH_BASE=2.000000e+00
HPL_dMACH_PREC=2.220446e-16
HPL_dMACH_MLEN=5.300000e+01
HPL_dMACH_RND=1.000000e+00
HPL_dMACH_EMIN=-1.021000e+03
HPL_dMACH_RMIN=2.225074e-308
HPL_dMACH_EMAX=1.024000e+03
HPL_dMACH_RMAX=1.797693e+308
HPL_sMACH_EPS=5.960464e-08
HPL_sMACH_SFMIN=1.175494e-38
HPL_sMACH_BASE=2.000000e+00
HPL_sMACH_PREC=1.192093e-07
HPL_sMACH_MLEN=2.400000e+01
HPL_sMACH_RND=1.000000e+00
HPL_sMACH_EMIN=-1.250000e+02
HPL_sMACH_RMIN=1.175494e-38
HPL_sMACH_EMAX=1.280000e+02
HPL_sMACH_RMAX=3.402823e+38
dweps=1.110223e-16
sweps=5.960464e-08
HPLMaxProcs=1
HPLMinProcs=1
DGEMM_N=576
StarDGEMM_Gflops=12.5581
SingleDGEMM_Gflops=11.5206
PTRANS_GBs=0.555537
PTRANS_time=0.000324011
PTRANS_residual=0
PTRANS_n=150
PTRANS_nb=120
PTRANS_nprow=1
PTRANS_npcol=1
MPIRandomAccess_LCG_N=524288
MPIRandomAccess_LCG_time=0.719245
MPIRandomAccess_LCG_CheckTime=0.076761
MPIRandomAccess_LCG_Errors=0
MPIRandomAccess_LCG_ErrorsFraction=0
MPIRandomAccess_LCG_ExeUpdates=2097152
MPIRandomAccess_LCG_GUPs=0.00291577
MPIRandomAccess_LCG_TimeBound=60
MPIRandomAccess_LCG_Algorithm=0
MPIRandomAccess_N=524288
MPIRandomAccess_time=0.751529
MPIRandomAccess_CheckTime=0.0741832
MPIRandomAccess_Errors=0
MPIRandomAccess_ErrorsFraction=0
MPIRandomAccess_ExeUpdates=2097152
MPIRandomAccess_GUPs=0.00279051
MPIRandomAccess_TimeBound=60
MPIRandomAccess_Algorithm=0
RandomAccess_LCG_N=524288
StarRandomAccess_LCG_GUPs=0.0547931
SingleRandomAccess_LCG_GUPs=0.0547702
RandomAccess_N=524288
StarRandomAccess_GUPs=0.0442224
SingleRandomAccess_GUPs=0.044326
STREAM_VectorSize=333333
STREAM_Threads=1
StarSTREAM_Copy=2.13593
StarSTREAM_Scale=2.09571
StarSTREAM_Add=3.23884
StarSTREAM_Triad=3.26659
SingleSTREAM_Copy=2.13593
SingleSTREAM_Scale=2.11213
SingleSTREAM_Add=3.23884
SingleSTREAM_Triad=3.26659
FFT_N=131072
StarFFT_Gflops=0.488238
SingleFFT_Gflops=0.488024
MPIFFT_N=65536
MPIFFT_Gflops=0.305709
MPIFFT_maxErr=1.23075e-15
MPIFFT_Procs=1
MaxPingPongLatency_usec=-1
RandomlyOrderedRingLatency_usec=-1
MinPingPongBandwidth_GBytes=-1
NaturallyOrderedRingBandwidth_GBytes=-1
RandomlyOrderedRingBandwidth_GBytes=-1
MinPingPongLatency_usec=-1
AvgPingPongLatency_usec=-1
MaxPingPongBandwidth_GBytes=-1
AvgPingPongBandwidth_GBytes=-1
NaturallyOrderedRingLatency_usec=-1
FFTEnblk=16
FFTEnp=8
FFTEl2size=1048576
M_OPENMP=-1
omp_get_num_threads=0
omp_get_max_threads=0
omp_get_num_procs=0
MemProc=-1
MemSpec=-1
MemVal=-1
MPIFFT_time0=0
MPIFFT_time1=0.00124693
MPIFFT_time2=0.00407505
MPIFFT_time3=0.000625134
MPIFFT_time4=0.0091598
MPIFFT_time5=0.00154018
MPIFFT_time6=0
CPS_HPCC_FFT_235=0
CPS_HPCC_FFTW_ESTIMATE=0
CPS_HPCC_MEMALLCTR=0
CPS_HPL_USE_GETPROCESSTIMES=0
CPS_RA_SANDIA_NOPT=0
CPS_RA_SANDIA_OPT2=0
CPS_USING_FFTW=0
End of Summary section.
########################################################################
End of HPC Challenge tests.
Current time (1513215290) is Wed Dec 13 20:34:50 2017

########################################################################

1184.15user 2.77system 19:47.13elapsed 99%CPU (0avgtext+0avgdata 34624maxresident)k
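
The Summary section is a list of key=value lines, so it is easy to process with a script. The sketch below extracts the values and, as a sanity check, recomputes HPL_Tflops from HPL_N and HPL_time. The operation count (2/3)N^3 + (3/2)N^2 is an assumption on my part; it is the formula I believe HPL uses, and it reproduces the figure reported above. Both function names are hypothetical.

```python
def parse_summary(text: str) -> dict:
    """Extract key=value pairs between the Summary section markers."""
    values = {}
    inside = False
    for line in text.splitlines():
        line = line.strip()
        if line == "Begin of Summary section.":
            inside = True
        elif line == "End of Summary section.":
            break
        elif inside and "=" in line:
            key, _, value = line.partition("=")
            values[key] = value
    return values


def hpl_tflops(n: int, seconds: float) -> float:
    """Tflop/s from the assumed operation count (2/3)N^3 + (3/2)N^2."""
    return ((2.0 / 3.0) * n**3 + 1.5 * n**2) / seconds / 1e12


# A few lines copied from the Summary section above.
sample = """Begin of Summary section.
HPL_Tflops=0.00934813
HPL_time=0.071476
HPL_N=1000
End of Summary section.
"""
summary = parse_summary(sample)
estimate = hpl_tflops(int(summary["HPL_N"]), float(summary["HPL_time"]))
print(f"reported={summary['HPL_Tflops']} recomputed={estimate:.8f}")
```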

Now let's see how to run on multiple nodes. This is also very simple. First create a file containing the node names, as follows. If the nodes have several network interfaces, it is best to list the high-speed one, such as 10GbE or InfiniBand.

u0017649@sys-90393:~/hpcc-1.5.0$ cat nodes.rf
sys-90393
sys-90505

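Next, modify the INPUT data file hpccinf.txt slightly. Unlike the one used above, two machines are used now, so it has to run with 2 processes. I will therefore change lines 11 and 12, the P x Q information, to 1 1 and 1 2 as shown below. Also, when run this way, perhaps because of MPI overhead, it had not finished after 40 minutes, let alone 20. So I reduced the problem sizes on line 6 by another factor of 10.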

u0017649@sys-90393:~/hpcc-1.5.0$ cat hpccinf.txt
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
30 60 100 Ns
5 # of NBs
80 100 120 140 160 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 1 Ps
1 2 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)

After that, run hpcc with mpirun as follows.

u0017649@sys-90393:~/hpcc-1.5.0$ mpirun -x PATH -x LD_LIBRARY_PATH -np 2 -hostfile nodes.rf ./hpcc | tee HPCC.out

As soon as it starts, you can see it running at 100% CPU on both nodes.

Tuesday, December 12, 2017

Training AlexNet on the ILSVRC2012 dataset with Caffe

First, to use the caffe-nv that ships with PowerAI as the working environment, source the following script, which sets PATH and the other required environment variables.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ source /opt/DL/caffe-nv/bin/caffe-activate

Check that caffe now resolves to the caffe-nv binary.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ which caffe
/opt/DL/caffe-nv/bin/caffe

Copy the examples and data directories under the PowerAI caffe-nv installation to the GPFS file system.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cp -r /opt/DL/caffe-nv/examples examples
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cp -r /opt/DL/caffe-nv/data data

From there, run get_ilsvrc_aux.sh as shown below to download the label files and other auxiliary data needed to build the ILSVRC2012 dataset.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cd data/ilsvrc12
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/data/ilsvrc12$ ./get_ilsvrc_aux.sh

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/data/ilsvrc12$ ls -ltr
total 37888
-rw-r----- 1 b7p286za IBM1 3200000 Feb 25 2014 test.txt
-rw-r----- 1 b7p286za IBM1 10000 Feb 25 2014 synsets.txt
-rw-r----- 1 b7p286za IBM1 786446 Feb 25 2014 imagenet_mean.binaryproto
-rw-r----- 1 b7p286za IBM1 1644500 Feb 25 2014 val.txt
-rw-r----- 1 b7p286za IBM1 43829433 Feb 25 2014 train.txt
-rw-r----- 1 b7p286za IBM1 31675 Apr 8 2014 synset_words.txt
-rw-r----- 1 b7p286za IBM1 3787 Jun 8 2014 det_synset_words.txt
-rw-r----- 1 b7p286za IBM1 14931117 Jul 11 2014 imagenet.bet.pickle
-rwxr-x--- 1 b7p286za IBM1 585 Dec 12 02:12 get_ilsvrc_aux.sh

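Now obtain the ImageNet data, that is, ILSVRC2012. For the training set you can reuse the raw-data prepared in the earlier post on TensorFlow ResNet training. There, however, the validation images were also distributed into one directory per label, whereas this AlexNet example expects them all extracted into a single directory named val. So extract only the validation set again, as follows.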

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/data/ilsvrc12$ cd ../..

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ mkdir raw-data/val && cd raw-data/val

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/val$ tar -xf ../../ILSVRC2012_img_val.tar

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/val$ cd ../..

The JPEG files under raw-data/train and raw-data/val now have to be converted to LMDB format. Edit the create_imagenet.sh script as follows and use it.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi ./examples/imagenet/create_imagenet.sh
...
export CAFFE_BIN=/opt/DL/caffe-nv/bin (added)
...
TRAIN_DATA_ROOT=/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/train/
VAL_DATA_ROOT=/gpfs/gpfs_gl4_16mb/b7p286za/raw-data/val/
...
#RESIZE=false
RESIZE=true

After making these changes, run the script as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ time ./examples/imagenet/create_imagenet.sh

This step also converts more than 200 GB of data to LMDB format, so it takes roughly 6 to 7 hours depending on storage performance. When the script finishes, the converted datasets are in examples/imagenet/ilsvrc12_train_lmdb and examples/imagenet/ilsvrc12_val_lmdb.

Next, run make_imagenet_mean.sh to compute the mean of the whole ImageNet dataset from the generated LMDB. Here too, define CAFFE_BIN at the top of the script as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi ./examples/imagenet/make_imagenet_mean.sh
source /opt/DL/caffe-nv/bin/caffe-activate
export CAFFE_BIN=/opt/DL/caffe-nv/bin
...

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ time ./examples/imagenet/make_imagenet_mean.sh

Next, modify solver.prototxt. First copy the bvlc_alexnet directory under /opt/DL/caffe-nv/models to the GPFS file system.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cp -r /opt/DL/caffe-nv/models/bvlc_alexnet .

Then adjust the directory names, max_iter and other settings in solver.prototxt as appropriate.
Because batch_size will later be set to 2048, a max_iter of 1250 gives roughly 1250 x 2048 / 1280000 = 2 epochs of training.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi bvlc_alexnet/solver.prototxt
#net: "models/bvlc_alexnet/train_val.prototxt"
net: "bvlc_alexnet/train_val.prototxt"
...
#display: 20
display: 500
#max_iter: 100000
max_iter: 1250
...
#snapshot_prefix: "models/bvlc_alexnet/caffe_alexnet_train"
snapshot_prefix: "bvlc_alexnet/caffe_alexnet_train"
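The relationship between max_iter, batch_size and the number of epochs can be checked with a small Python sketch. It uses the rounded training-set size of 1,280,000 images from the text and assumes batch_size is the total number of images consumed per iteration; if the framework applies batch_size per GPU when training on several GPUs, the effective value is correspondingly larger. The helper names are illustrative.

```python
import math


def epochs_for(max_iter, batch_size, num_images):
    """Number of passes over the dataset made by max_iter iterations."""
    return max_iter * batch_size / num_images


def iterations_for(target_epochs, batch_size, num_images):
    """Smallest max_iter that covers at least target_epochs epochs."""
    return math.ceil(target_epochs * num_images / batch_size)


num_train_images = 1280000  # rounded size used in the text
batch_size = 2048           # assumed to be the total batch per iteration

print(epochs_for(1250, batch_size, num_train_images))    # 2.0
print(iterations_for(20, batch_size, num_train_images))  # 12500
```

Under these assumptions, max_iter: 1250 corresponds to 2 epochs, and a 20-epoch run would need max_iter: 12500.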

Next, edit bvlc_alexnet/train_val.prototxt if needed to raise or lower the batch_size of the training data, and change the various paths as appropriate.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi bvlc_alexnet/train_val.prototxt
...
source: "examples/imagenet/ilsvrc12_train_lmdb"
# batch_size: 1024
batch_size: 2048
...

Now create the following train_alexnet.sh and run it.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ vi ./examples/imagenet/train_alexnet.sh
source /opt/DL/caffe-nv/bin/caffe-activate
export CAFFE_BIN=/opt/DL/caffe-nv/bin
set -e
$CAFFE_BIN/caffe train -gpu all --solver=bvlc_alexnet/solver.prototxt

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ nohup time ./examples/imagenet/train_alexnet.sh &

The log output is written to nohup.out. The 1,250 iterations configured above take only about 12 minutes.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ grep iter nohup.out
test_iter: 1000
max_iter: 1250
I1212 14:15:42.151552 82131 solver.cpp:242] Iteration 0 (0 iter/s, 24.031s/500 iter), loss = 6.91103
I1212 14:22:10.507652 82131 solver.cpp:242] Iteration 500 (1.28749 iter/s, 388.352s/500 iter), loss = 6.37464
I1212 14:26:19.506183 82131 solver.cpp:242] Iteration 1000 (2.00806 iter/s, 248.996s/500 iter), loss = 5.34417
I1212 14:27:50.453514 82131 solver.cpp:479] Snapshotting to binary proto file bvlc_alexnet/caffe_alexnet_train_iter_1250.caffemodel
I1212 14:27:51.540899 82131 sgd_solver.cpp:273] Snapshotting solver state to binary proto file bvlc_alexnet/caffe_alexnet_train_iter_1250.solverstate
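To follow the loss and iteration rate without reading the raw log, the Iteration lines can be parsed with a short Python sketch such as the one below. The regular expression is written against the three log lines shown above and is only assumed to match other Caffe versions; the function name is illustrative.

```python
import re

# The three "Iteration" lines copied from the nohup.out excerpt above.
LOG_LINES = [
    "I1212 14:15:42.151552 82131 solver.cpp:242] Iteration 0 (0 iter/s, 24.031s/500 iter), loss = 6.91103",
    "I1212 14:22:10.507652 82131 solver.cpp:242] Iteration 500 (1.28749 iter/s, 388.352s/500 iter), loss = 6.37464",
    "I1212 14:26:19.506183 82131 solver.cpp:242] Iteration 1000 (2.00806 iter/s, 248.996s/500 iter), loss = 5.34417",
]

PATTERN = re.compile(
    r"Iteration (\d+) \(([\d.]+) iter/s, [\d.]+s/\d+ iter\), loss = ([\d.]+)"
)


def parse_iterations(lines):
    """Return (iteration, iter_per_sec, loss) for every matching line."""
    records = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            records.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return records


for iteration, rate, loss in parse_iterations(LOG_LINES):
    print(f"iter {iteration:5d}  {rate:.3f} iter/s  loss {loss:.3f}")
```

To process the real log, pass an open file instead of the list, for example parse_iterations(open("nohup.out")).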

Training ResNet-101 on the ILSVRC2012 dataset with TensorFlow

First, install Anaconda2 as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ wget https://repo.continuum.io/archive/Anaconda2-5.0.0-Linux-ppc64le.sh

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ chmod a+x ./Anaconda2-5.0.0-Linux-ppc64le.sh

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ ./Anaconda2-5.0.0-Linux-ppc64le.sh
--> Here the installation directory is /gpfs/gpfs_gl4_16mb/b7p286za/anaconda2, under the user's home directory, but you can install elsewhere depending on your environment.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ export PATH=/gpfs/gpfs_gl4_16mb/b7p286za/anaconda2/bin:$PATH

Check that python now resolves to the one bundled with Anaconda rather than the OS default.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ which python
/gpfs/gpfs_gl4_16mb/b7p286za/anaconda2/bin/python

Now install TensorFlow 1.2.1 with the following command.

If you really need TensorFlow 1.3, you have to build it yourself by following this URL (http://hwengineer.blogspot.kr/2017/10/minsky-tensorflow-r13-source-build.html). It builds and runs fine on ppc64le. However, TensorFlow 1.2.1 is sufficient here, so there is no need to go as far as building 1.3.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ conda install tensorflow tensorflow-gpu

Next, clone the following git repository, which contains the ResNet model and other code used for the benchmark. It is based on the contents of https://github.com/tensorflow/models.git with a few scripts added. The original scripts start by downloading the ImageNet training dataset, which takes far too long, so scripts were added that take an already downloaded dataset and convert it to TFRecord format. These scripts were written by Boran Lee (이보란) of IBM.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ git clone https://github.com/brlee08/models.git

The ILSVRC2012 ImageNet dataset files used here are the following; download them in advance.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ ls -l ILSVRC2012*
-rw-r----- 1 b7p286za IBM1 20861852 Aug 12 2012 ILSVRC2012_bbox_train_v2.tar.gz
-rw-r----- 1 b7p286za IBM1 147897477120 Nov 5 07:02 ILSVRC2012_img_train.tar
-rw-r----- 1 b7p286za IBM1 6744924160 Nov 5 07:03 ILSVRC2012_img_val.tar

Move them to the appropriate locations and extract them as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ mkdir -p raw-data/bounding_boxes

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ mv ILSVRC2012_bbox_train_v2.tar.gz raw-data/bounding_boxes/annotations.tar.gz
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ mv ILSVRC2012_img_train.tar raw-data/
b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ mv ILSVRC2012_img_val.tar raw-data/

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cd models/research/inception/inception/data/

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception/data$ ./T2_ibm_uncompress_imagenet.sh /gpfs/gpfs_gl4_16mb/b7p286za/raw-data/ /gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception/data/imagenet_lsvrc_2015_synsets.txt

--> The first directory name must end with a / (otherwise the script fails with an error). The script extracts the raw JPEG images under that directory and distributes them into sub-directories named after their labels. The second argument, imagenet_lsvrc_2015_synsets.txt, contains the label names of the ILSVRC2012 data.
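The script itself is not shown here, so the following Python sketch only illustrates the general pitfall behind a trailing-slash requirement and is not a description of what T2_ibm_uncompress_imagenet.sh actually does. It assumes a script that builds paths by plain string concatenation; the helper name is hypothetical.

```python
import os


def ensure_trailing_slash(path):
    """Return path with exactly one trailing separator appended if missing."""
    return os.path.join(path, "")


base = "/gpfs/gpfs_gl4_16mb/b7p286za/raw-data"

# Plain concatenation silently produces a different directory name.
print(base + "train")

# Normalizing the argument first gives the intended path.
print(ensure_trailing_slash(base) + "train")
```

Normalizing the argument this way makes a wrapper tolerant of both spellings of the directory name.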

When the script above has finished, convert the JPEG files to TFRecord format. To do so, modify the following parts of T2_ibm_preprocess.sh under models/research/inception/inception/data.

#source /opt/DL/tensorflow/bin/tensorflow-activate (comment out the tensorflow-activate line at the top, so that the TF 1.2.1 installed with conda install is used instead of the TF 1.0 in PowerAI)
WORK_DIR="<path containing the models directory>/models/research/inception/inception"
python <path containing the models directory>/models/research/inception/inception/data/build_imagenet_data.py

Here it was changed as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception/data$ vi ./T2_ibm_preprocess.sh
#source /opt/DL/tensorflow/bin/tensorflow-activate
...
WORK_DIR="/gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception"
...
python /gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception/data/build_imagenet_data.py \
...

After editing, run it as follows. A sub-directory named ilsvrc_tf is created under the directory given as the argument to T2_ibm_preprocess.sh, and the TFRecord files are generated there.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/models/research/inception/inception/data$ time ./T2_ibm_preprocess.sh /gpfs/gpfs_gl4_16mb/b7p286za/

The script processes more than 200 GB of files, so it takes quite a long time, about 4 hours. When it finishes, TFRecord files named train-xxxx and validation-xxxx have been created, as shown below.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/ilsvrc_tf$ ls -l | more
total 62635520
-rw-r----- 1 b7p286za IBM1 149402267 Dec 12 00:17 train-00000-of-01024
-rw-r----- 1 b7p286za IBM1 150240608 Dec 12 00:19 train-00001-of-01024
-rw-r----- 1 b7p286za IBM1 141760185 Dec 12 00:20 train-00002-of-01024
-rw-r----- 1 b7p286za IBM1 152134069 Dec 12 00:22 train-00003-of-01024
-rw-r----- 1 b7p286za IBM1 141508613 Dec 12 00:24 train-00004-of-01024
-rw-r----- 1 b7p286za IBM1 148320681 Dec 12 00:25 train-00005-of-01024
-rw-r----- 1 b7p286za IBM1 146087263 Dec 12 00:27 train-00006-of-01024
...
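Since the conversion runs for hours, it can be worth confirming that no shard is missing before training. The Python sketch below rebuilds the expected file names from the train-00000-of-01024 pattern visible in the listing and reports any that are absent. The function names are illustrative, and the number of validation shards is not shown above, so only the training shards are checked.

```python
import os


def expected_shards(prefix, total):
    """File names following the <prefix>-NNNNN-of-NNNNN pattern."""
    return [f"{prefix}-{i:05d}-of-{total:05d}" for i in range(total)]


def missing_shards(directory, prefix, total):
    """Expected shard names that are not present in directory."""
    present = set(os.listdir(directory))
    return [name for name in expected_shards(prefix, total) if name not in present]


data_dir = "/gpfs/gpfs_gl4_16mb/b7p286za/ilsvrc_tf"  # output directory used above
if os.path.isdir(data_dir):
    missing = missing_shards(data_dir, "train", 1024)
    print(f"{len(missing)} of 1024 train shards missing")
```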

Now clone the git repository containing the benchmark scripts as follows. These scripts were also written by Boran Lee (이보란) of IBM.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ git clone https://github.com/brlee08/benchmark.git

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za$ cd benchmark/tensorflow

Use bench_ibm_single.sh from this directory to run the benchmark, but first modify parts of it as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/benchmark/tensorflow$ vi bench_ibm_single.sh
...
DATA_DIR=/gpfs/gpfs_gl4_16mb/b7p286za/ilsvrc_tf (directory containing the TFRecord files)
LOG_DIR=/gpfs/gpfs_gl4_16mb/b7p286za/benchmark/tensorflow/output_single (directory for logs)
...
TRAIN_DIR=/gpfs/gpfs_gl4_16mb/b7p286za/benchmark/tensorflow/train_log (directory for TensorBoard logs)
...
NUM_EPOCHS=10
NUM_GPU=4
INPUT_BATCH=64
INPUT_MODEL="resnet101"
...
#TRAIN_LOG_DIR="${TRAIN_DIR}/googlenet-10e-128b-4G" (not used, so comment it out)
...
#source /opt/DL/tensorflow/bin/tensorflow-activate (again commented out so that the TF 1.2.1 installed with conda install is used instead of the TF 1.0 in PowerAI)
export PATH=/gpfs/gpfs_gl4_16mb/b7p286za/anaconda2/bin:$PATH
export PYTHONPATH=/gpfs/gpfs_gl4_16mb/b7p286za/anaconda2/lib/python2.7/site-packages (the original script uses anaconda3, but here PYTHONPATH must point to the anaconda2 site-packages)
...
# --data_name=imagenet --train_dir=${TRAIN_LOG_DIR} --data_dir=${DATA_DIR} --variable_update=${VARIABLE_UPDATE} \

--data_name=imagenet --train_dir=${TRAIN_DIR} --data_dir=${DATA_DIR} --variable_update=${VARIABLE_UPDATE} \

(The original script had a typo: --train_dir=${TRAIN_LOG_DIR} must be changed to --train_dir=${TRAIN_DIR}.)

Now run the ResNet training as follows.

b7p286za@p10login1:/gpfs/gpfs_gl4_16mb/b7p286za/benchmark/tensorflow$ nohup time /gpfs/gpfs_gl4_16mb/b7p286za/benchmark/tensorflow/bench_ibm_single.sh &

At first TensorFlow takes about 10 minutes to start up, printing a few warning messages, so do not be alarmed. The output looks roughly like this.

50010 images/sec: 485.7 +/- 0.1 (jitter = 3.8) 4.943
50020 images/sec: 485.7 +/- 0.1 (jitter = 3.8) 4.716
50030 images/sec: 485.7 +/- 0.1 (jitter = 3.8) 4.847
50040 images/sec: 485.6 +/- 0.1 (jitter = 3.8) 4.639
----------------------------------------------------------------
total images/sec: 485.42
----------------------------------------------------------------
Training Finish - 2017-12-12 13:59:14
Elapsed Time - 02:31:31
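The step counter in the output above can be cross-checked against the settings in bench_ibm_single.sh (NUM_EPOCHS=10, NUM_GPU=4, INPUT_BATCH=64). The Python sketch below assumes that INPUT_BATCH is a per-GPU batch size and uses the 1,281,167 images of the standard ILSVRC2012 training set; neither is stated in the script excerpt, and the function name is illustrative.

```python
import math


def expected_steps(num_epochs, num_images, batch_per_gpu, num_gpus):
    """Training steps needed when every step consumes one global batch."""
    global_batch = batch_per_gpu * num_gpus
    return math.ceil(num_epochs * num_images / global_batch)


# NUM_EPOCHS, INPUT_BATCH and NUM_GPU as set in bench_ibm_single.sh above.
# 1281167 is the ILSVRC2012 training-set size (an assumption, see lead-in).
steps = expected_steps(num_epochs=10, num_images=1281167,
                       batch_per_gpu=64, num_gpus=4)
print(steps)  # 50046
```

This agrees with the last logged step of 50040, given that the log prints one line every 10 steps.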