Deep Learning for HW Engineers: 2020

Tuesday, December 8, 2020

Building an Ubuntu ppc64le docker image with IBM PowerAI (Watson ML Community Edition) installed

First, follow the link below to build an Ubuntu-based docker image in a ppc64le (IBM POWER9) nvidia-docker2 environment.

http://hwengineer.blogspot.com/2019/05/ppc64le-ibm-power9-nvidia-docker2.html

For reference, CUDA can be installed on ppc64le (IBM POWER9) as shown below, following the instructions on the NVIDIA CUDA download page.

# wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/ppc64el/cuda-repo-ubuntu1804_10.1.105-1_ppc64el.deb

# dpkg -i cuda-repo-ubuntu1804_10.1.105-1_ppc64el.deb

# apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/ppc64el/7fa2af80.pub

# apt-get update

# apt-get install cuda

Alternatively, you can pull a docker image that I have already built, as follows.

# docker pull bsyu/ubuntu18.04_cuda10-1_ppc64le:v0.1

Check the docker image you just pulled.

root@unigpu:/files/docker# docker images

REPOSITORY TAG IMAGE ID CREATED SIZE

bsyu/ubuntu18.04_cuda10-1_ppc64le v0.1 ef8dd4d654e7 2 hours ago 6.33GB

ubuntu 18.04 ecc8dc2e4170 4 weeks ago 106MB

Start a container from this image as follows.

root@unigpu:/files/docker# docker run --runtime=nvidia -ti --rm bsyu/ubuntu18.04_cuda10-1_ppc64le:v0.1 bash

Now install IBM PowerAI (IBM Watson ML Community Edition) inside that container as follows. The installation works by registering the conda channel that IBM provides and then running conda install against it.

# conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/

# conda create --name wmlce_env python=3.6

# conda activate wmlce_env

# apt-get install openssh-server

# conda install powerai

Running conda install powerai as above installs not only tensorflow but also caffe, pytorch, and the rest of the bundle. If you only want, say, TensorFlow 1.14, run conda install tensorflow=1.14 instead.

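Installing the full powerai package can take 1 to 2 hours depending on network conditions. Once the installation is complete, find the running container from the host and save it as a docker image with docker commit, as follows.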

# docker ps -a

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

45fb663f025c bsyu/ubuntu18.04_cuda10-1_ppc64le:v0.1 "bash" 42 seconds ago Up 39 seconds elastic_gates

Then commit it under a new tag such as v0.2.

[root@ac922 docker]# docker commit 45fb663f025c bsyu/ubuntu18.04_cuda10-1_ppc64le:v0.2

Below is an example of using docker images built this way. The images I built are uploaded at https://hub.docker.com/u/bsyu .

root@unigpu:~# docker run --runtime=nvidia -ti --rm bsyu/ubuntu18.04_cuda10-1_tf1.15_pytorch1.2_ppc64le:latest

(wmlce_env) root@bdbf11e90094:/# python

Python 3.6.10 |Anaconda, Inc.| (default, Mar 26 2020, 00:22:27)

[GCC 7.3.0] on linux

Type "help", "copyright", "credits" or "license" for more information.

>>> import torch

>>> import tensorflow as tf

2020-05-29 04:53:09.637141: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1

>>> sess=tf.Session()

2020-05-29 04:53:20.795027: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1

2020-05-29 04:53:20.825918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:

pciBusID: 0004:04:00.0

2020-05-29 04:53:20.827162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties:

pciBusID: 0004:05:00.0

2020-05-29 04:53:20.828421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties:

pciBusID: 0035:03:00.0

2020-05-29 04:53:20.829655: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties:

pciBusID: 0035:04:00.0
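If you want to check GPU detection from a script rather than by eye, one option is to count the "Found device" lines in the TensorFlow log. The snippet below is only a minimal sketch: count_found_gpus is a hypothetical helper name of mine, and it assumes each detected GPU is announced by exactly one "Found device N with properties:" line, as in the session above.

```python
import re

# Assumption: every detected GPU is announced by one
# "Found device N with properties:" line, as in the log above.
FOUND_DEVICE = re.compile(r"Found device (\d+) with properties:")


def count_found_gpus(log_text):
    """Return how many distinct GPU indices appear in a TensorFlow log."""
    return len({int(m.group(1)) for m in FOUND_DEVICE.finditer(log_text)})


# Abbreviated copy of the log lines from the session above.
sample_log = """\
gpu_device.cc:1639] Found device 0 with properties:
pciBusID: 0004:04:00.0
gpu_device.cc:1639] Found device 1 with properties:
pciBusID: 0004:05:00.0
gpu_device.cc:1639] Found device 2 with properties:
pciBusID: 0035:03:00.0
gpu_device.cc:1639] Found device 3 with properties:
pciBusID: 0035:04:00.0
"""

print(count_found_gpus(sample_log))
```

On a 4-GPU server like the one used here, anything other than 4 would suggest that the container is not seeing all of the GPUs.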

Tuesday, October 27, 2020

Configuring Spectrum Scale (GPFS) v5 on IBM POWER9 (ppc64le) Redhat 7

The setup here registers the gw server (2.1.1.5) as the single GPFS server and two servers, tac1 and tac2 (2.1.1.3 and 2.1.1.4 respectively), as GPFS client nodes. In other words, the physical GPFS disks are attached directly to the gw server, and tac1 and tac2 access the GPFS filesystems over the network, served by gw as NSDs (Network Shared Disks).

First, disable firewalld on all servers. Also set up passwordless ssh between all the servers in advance.

[root@gw ~]# systemctl stop firewalld

[root@gw ~]# systemctl disable firewalld

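Here we install with the GPFS (new name: Spectrum Scale) installer. Starting with GPFS v5, the install toolkit lets you run the installation on a single node and automatically installs onto the other cluster nodes, which is convenient (in the 5.0.x releases the toolkit drives this with Chef). Running the install file starts a self-extraction that creates the files.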

[root@gw SW]# ./Spectrum_Scale_Advanced-5.0.4.0-ppc64LE-Linux-install

Extracting Product RPMs to /usr/lpp/mmfs/5.0.4.0 ...

tail -n +641 ./Spectrum_Scale_Advanced-5.0.4.0-ppc64LE-Linux-install | tar -C /usr/lpp/mmfs/5.0.4.0 --wildcards -xvz installer gpfs_rpms/rhel/rhel7 hdfs_debs/ubuntu16/hdfs_3.1.0.x hdfs_rpms/rhel7/hdfs_2.7.3.x hdfs_rpms/rhel7/hdfs_3.0.0.x hdfs_rpms/rhel7/hdfs_3.1.0.x zimon_debs/ubuntu/ubuntu16 ganesha_rpms/rhel7 ganesha_rpms/rhel8 gpfs_debs/ubuntu16 gpfs_rpms/sles12 object_rpms/rhel7 smb_rpms/rhel7 smb_rpms/rhel8 tools/repo zimon_debs/ubuntu16 zimon_rpms/rhel7 zimon_rpms/rhel8 zimon_rpms/sles12 zimon_rpms/sles15 gpfs_debs gpfs_rpms manifest 1> /dev/null

- installer

- gpfs_rpms/rhel/rhel7

- hdfs_debs/ubuntu16/hdfs_3.1.0.x

- hdfs_rpms/rhel7/hdfs_2.7.3.x

...

- gpfs_debs

- gpfs_rpms

- manifest

Removing License Acceptance Process Tool from /usr/lpp/mmfs/5.0.4.0 ...

rm -rf /usr/lpp/mmfs/5.0.4.0/LAP_HOME /usr/lpp/mmfs/5.0.4.0/LA_HOME

Removing JRE from /usr/lpp/mmfs/5.0.4.0 ...

rm -rf /usr/lpp/mmfs/5.0.4.0/ibm-java*tgz

==================================================================

Product packages successfully extracted to /usr/lpp/mmfs/5.0.4.0

Cluster installation and protocol deployment

To install a cluster or deploy protocols with the Spectrum Scale Install Toolkit: /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale -h

To install a cluster manually: Use the gpfs packages located within /usr/lpp/mmfs/5.0.4.0/gpfs_<rpms/debs>

To upgrade an existing cluster using the Spectrum Scale Install Toolkit:

1) Copy your old clusterdefinition.txt file to the new /usr/lpp/mmfs/5.0.4.0/installer/configuration/ location

2) Review and update the config: /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale config update

3) (Optional) Update the toolkit to reflect the current cluster config:

/usr/lpp/mmfs/5.0.4.0/installer/spectrumscale config populate -N <node>

4) Run the upgrade: /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale upgrade -h

To add nodes to an existing cluster using the Spectrum Scale Install Toolkit:

1) Add nodes to the clusterdefinition.txt file: /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale node add -h

2) Install GPFS on the new nodes: /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale install -h

3) Deploy protocols on the new nodes: /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale deploy -h

To add NSDs or file systems to an existing cluster using the Spectrum Scale Install Toolkit:

1) Add nsds and/or filesystems with: /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale nsd add -h

2) Install the NSDs: /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale install -h

3) Deploy the new file system: /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale deploy -h

To update the toolkit to reflect the current cluster config examples:

/usr/lpp/mmfs/5.0.4.0/installer/spectrumscale config populate -N <node>

1) Manual updates outside of the install toolkit

2) Sync the current cluster state to the install toolkit prior to upgrade

3) Switching from a manually managed cluster to the install toolkit

==================================================================================

To get up and running quickly, visit our wiki for an IBM Spectrum Scale Protocols Quick Overview:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Protocols%20Quick%20Overview%20for%20IBM%20Spectrum%20Scale

===================================================================================

First, designate the gw server, that is 2.1.1.5, as the installer node with the spectrumscale command as shown below.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale setup -s 2.1.1.5

Next, designate the gw server as a manager node, an NSD server, and an admin node.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale node add 2.1.1.5 -m -n -a

[ INFO ] Adding node gw as a GPFS node.

[ INFO ] Adding node gw as a manager node.

[ INFO ] Adding node gw as an NSD server.

[ INFO ] Configuration updated.

[ INFO ] Tip :If all node designations are complete, add NSDs to your cluster definition and define required filessytems:./spectrumscale nsd add <device> -p <primary node> -s <secondary node> -fs <file system>

[ INFO ] Setting gw as an admin node.

[ INFO ] Configuration updated.

[ INFO ] Tip : Designate protocol or nsd nodes in your environment to use during install:./spectrumscale node add <node> -p -n

Register each node as a quorum node.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale node add 2.1.1.3 -q

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale node add 2.1.1.4 -q

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale node add 2.1.1.5 -q

[ INFO ] Adding node gwp as a quorum node.
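When there are more than a handful of nodes, it can help to generate these command lines from a small table instead of typing them one by one. The following is a rough sketch under my own assumptions: node_add_command and the role names in ROLE_FLAGS are hypothetical, and only the flags already used above (-m, -n, -a, -q) are covered. It only builds the command strings; it does not run anything.

```python
# Path of the install toolkit as extracted earlier in this post.
SPECTRUMSCALE = "/usr/lpp/mmfs/5.0.4.0/installer/spectrumscale"

# Role names are my own labels; the flags are the ones used in this post.
ROLE_FLAGS = {"manager": "-m", "nsd": "-n", "admin": "-a", "quorum": "-q"}


def node_add_command(ip, roles):
    """Build one `spectrumscale node add` command line for the given roles."""
    flags = [ROLE_FLAGS[role] for role in roles]
    return " ".join([SPECTRUMSCALE, "node", "add", ip] + flags)


# Same order as the commands shown above.
plan = [
    ("2.1.1.5", ["manager", "nsd", "admin"]),
    ("2.1.1.3", ["quorum"]),
    ("2.1.1.4", ["quorum"]),
    ("2.1.1.5", ["quorum"]),
]

commands = [node_add_command(ip, roles) for ip, roles in plan]
for command in commands:
    print(command)
```

Keeping the role table in one place also makes it easier to review who is a quorum node before running the install.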

Check the node list.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale node list

[ INFO ] List of nodes in current configuration:

[ INFO ] [Installer Node]

[ INFO ] 2.1.1.5

[ INFO ]

[ INFO ] [Cluster Details]

[ INFO ] No cluster name configured

[ INFO ] Setup Type: Spectrum Scale

[ INFO ]

[ INFO ] [Extended Features]

[ INFO ] File Audit logging : Disabled

[ INFO ] Watch folder : Disabled

[ INFO ] Management GUI : Disabled

[ INFO ] Performance Monitoring : Enabled

[ INFO ] Callhome : Enabled

[ INFO ]

[ INFO ] GPFS Admin Quorum Manager NSD Protocol Callhome OS Arch

[ INFO ] Node Node Node Node Server Node Server

[ INFO ] gw X X X X rhel7 ppc64le

[ INFO ] tac1p X rhel7 ppc64le

[ INFO ] tac2p X rhel7 ppc64le

[ INFO ]

[ INFO ] [Export IP address]

[ INFO ] No export IP addresses configured

Register sdc and sdd as NSDs.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale nsd add /dev/sdc -p 2.1.1.5 --name data_nsd -fs data

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale nsd add /dev/sdd -p 2.1.1.5 --name backup_nsd -fs backup

Check the NSDs.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale nsd list

[ INFO ] Name FS Size(GB) Usage FG Pool Device Servers

[ INFO ] data_nsd data 400 Default 1 Default /dev/sdc [gwp]

[ INFO ] backup_nsd backup 400 Default 1 Default /dev/sdd [gwp]

Check the filesystems.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale filesystem list

[ INFO ] Name BlockSize Mountpoint NSDs Assigned Default Data Replicas Max Data Replicas Default Metadata Replicas Max Metadata Replicas

[ INFO ] data Default (4M)/ibm/data 1 1 2 1 2

[ INFO ] backup Default (4M)/ibm/backup 1 1 2 1 2

[ INFO ]

Set the GPFS cluster name.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale config gpfs -c tac_gpfs

[ INFO ] Setting GPFS cluster name to tac_gpfs

Specify ssh and scp for communication with the other nodes.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale config gpfs -r /usr/bin/ssh

[ INFO ] Setting Remote shell command to /usr/bin/ssh

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale config gpfs -rc /usr/bin/scp

[ INFO ] Setting Remote file copy command to /usr/bin/scp

Check the settings.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale config gpfs --list

[ INFO ] Current settings are as follows:

[ INFO ] GPFS cluster name is tac_gpfs.

[ INFO ] GPFS profile is default.

[ INFO ] Remote shell command is /usr/bin/ssh.

[ INFO ] Remote file copy command is /usr/bin/scp.

[ WARN ] No value for GPFS Daemon communication port range in clusterdefinition file.

By default, the GPFS server has a callhome feature that contacts IBM when a failure occurs. These nodes are not connected to the Internet, so disable it.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale callhome disable

[ INFO ] Disabling the callhome.

[ INFO ] Configuration updated.

We are now ready to install. Run a precheck before installing.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale install -pr

[ INFO ] Logging to file: /usr/lpp/mmfs/5.0.4.0/installer/logs/INSTALL-PRECHECK-23-10-2020_21:13:23.log

[ INFO ] Validating configuration

...

[ INFO ] The install toolkit will not configure call home as it is disabled. To enable call home, use the following CLI command: ./spectrumscale callhome enable

[ INFO ] Pre-check successful for install.

[ INFO ] Tip : ./spectrumscale install

If there are no problems, run the install. This installs GPFS not only on gw but also on tac1 and tac2.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale install

...

[ INFO ] GPFS active on all nodes

[ INFO ] GPFS ACTIVE

[ INFO ] Checking state of NSDs

[ INFO ] NSDs ACTIVE

[ INFO ] Checking state of Performance Monitoring

[ INFO ] Running Performance Monitoring post-install checks

[ INFO ] pmcollector running on all nodes

[ INFO ] pmsensors running on all nodes

[ INFO ] Performance Monitoring ACTIVE

[ INFO ] SUCCESS

[ INFO ] All services running

[ INFO ] StanzaFile and NodeDesc file for NSD, filesystem, and cluster setup have been saved to /usr/lpp/mmfs folder on node: gwp

[ INFO ] Installation successful. 3 GPFS nodes active in cluster tac_gpfs.tac1p. Completed in 2 minutes 52 seconds.

[ INFO ] Tip :If all node designations and any required protocol configurations are complete, proceed to check the deploy configuration:./spectrumscale deploy --precheck

For reference, if you get an error like the one below at this point, it is because the disk was previously used as a GPFS NSD.

[ FATAL ] gwp failure whilst: Creating NSDs (SS16)

[ WARN ] SUGGESTED ACTION(S):

[ WARN ] Review your NSD device configuration in configuration/clusterdefinition.txt

[ WARN ] Ensure all disks are not damaged and can be written to.

[ FATAL ] FAILURE REASON(s) for gwp:

[ FATAL ] gwp ---- Begin output of /usr/lpp/mmfs/bin/mmcrnsd -F /usr/lpp/mmfs/StanzaFile ----

[ FATAL ] gwp STDOUT: mmcrnsd: Processing disk sdc

[ FATAL ] gwp mmcrnsd: Processing disk sdd

[ FATAL ] gwp STDERR: mmcrnsd: Disk device sdc refers to an existing NSD

[ FATAL ] gwp mmcrnsd: Disk device sdd refers to an existing NSD

[ FATAL ] gwp mmcrnsd: Command failed. Examine previous error messages to determine cause.

[ FATAL ] gwp ---- End output of /usr/lpp/mmfs/bin/mmcrnsd -F /usr/lpp/mmfs/StanzaFile ----

[ INFO ] Detailed error log: /usr/lpp/mmfs/5.0.4.0/installer/logs/INSTALL-23-10-2020_21:20:05.log

[ FATAL ] Installation failed on one or more nodes. Check the log for more details.

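This can be fixed by overwriting a small portion at the start of the disk, as follows. Note that this destroys whatever is on the disk, so double-check the device names first.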

[root@gw SW]# dd if=/dev/zero of=/dev/sdc bs=1M count=100

100+0 records in

100+0 records out

104857600 bytes (105 MB) copied, 0.0736579 s, 1.4 GB/s

[root@gw SW]# dd if=/dev/zero of=/dev/sdd bs=1M count=100

100+0 records in

100+0 records out

104857600 bytes (105 MB) copied, 0.0737598 s, 1.4 GB/s

Now check the state of each node.

[root@gw SW]# mmgetstate -a

Node number Node name GPFS state

-------------------------------------------

1 tac1p active

2 tac2p active

3 gwp active
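For a scripted health check, the mmgetstate -a output can be turned into a small dictionary. This is only a sketch: parse_mmgetstate is a hypothetical helper, and it assumes the three-column layout shown above (node number, node name, GPFS state), fed in as captured text rather than by calling the command.

```python
def parse_mmgetstate(output):
    """Map node name -> GPFS state from `mmgetstate -a` text output."""
    states = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows look like "<node number> <node name> <state>";
        # the header and the dashed separator do not match this shape.
        if len(fields) == 3 and fields[0].isdigit():
            states[fields[1]] = fields[2]
    return states


# Output copied from the transcript above.
sample = """\
 Node number  Node name        GPFS state
-------------------------------------------
       1      tac1p            active
       2      tac2p            active
       3      gwp              active
"""

states = parse_mmgetstate(sample)
inactive = [node for node, state in states.items() if state != "active"]
print(states)
print(inactive)
```

An empty inactive list means every node reported the active state.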

Check the NSD state. Notice that the NSDs are listed as (free disk), meaning they are not yet assigned to any GPFS filesystem.

[root@gw SW]# mmlsnsd

File system Disk name NSD servers

---------------------------------------------------------------------------

(free disk) backup_nsd gwp

(free disk) data_nsd gwp

Looking at the GPFS filesystem state again with the spectrumscale filesystem list command, the definitions are there. However, the mount points are set to the defaults such as /ibm/data, which is not what we want.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale filesystem list

[ INFO ] Name BlockSize Mountpoint NSDs Assigned Default Data Replicas Max Data Replicas Default Metadata Replicas Max Metadata Replicas

[ INFO ] data Default (4M)/ibm/data 1 1 2 1 2

[ INFO ] backup Default (4M)/ibm/backup 1 1 2 1 2

[ INFO ]

Change the mount points to the intended ones.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale filesystem modify data -m /data

[ INFO ] The data filesystem will be mounted at /data on all nodes.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale filesystem modify backup -m /backup

[ INFO ] The backup filesystem will be mounted at /backup on all nodes.

Check again. The filesystems still do not get mounted, however.

[root@gw SW]# /usr/lpp/mmfs/5.0.4.0/installer/spectrumscale filesystem list

[ INFO ] Name BlockSize Mountpoint NSDs Assigned Default Data Replicas Max Data Replicas Default Metadata Replicas Max Metadata Replicas

[ INFO ] data Default (4M)/data 1 1 2 1 2

[ INFO ] backup Default (4M)/backup 1 1 2 1 2

[ INFO ]

To fix this, we will set up the GPFS filesystems the old way, that is, with the mmcrnsd and mmcrfs commands. First, create the disk descriptor files as shown below.

[root@gw ~]# vi /home/SW/gpfs/disk.desc1

/dev/sdc:gwp::dataAndMetadata:1:nsd_data

[root@gw ~]# vi /home/SW/gpfs/disk.desc2

/dev/sdd:gwp::dataAndMetadata:1:nsd_backup
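The descriptor lines are colon-separated, and it is easy to miscount the fields by hand. Here is a small sketch that splits one into named parts. The field names are my own reading of the old-style descriptor as used in this post (device, NSD server list, an unused field, disk usage, failure group, desired NSD name), so treat them as assumptions; parse_descriptor and DiskDescriptor are hypothetical names.

```python
from typing import NamedTuple


class DiskDescriptor(NamedTuple):
    # Field names are assumptions based on how the lines above are used.
    device: str
    servers: str
    unused: str
    usage: str
    failure_group: int
    nsd_name: str


def parse_descriptor(line):
    """Split one old-style disk descriptor line into named fields."""
    parts = line.strip().split(":")
    if len(parts) != 6:
        raise ValueError("expected 6 colon-separated fields, got %d" % len(parts))
    device, servers, unused, usage, failure_group, nsd_name = parts
    return DiskDescriptor(device, servers, unused, usage, int(failure_group), nsd_name)


desc = parse_descriptor("/dev/sdc:gwp::dataAndMetadata:1:nsd_data")
print(desc)
```

A quick check like this catches a missing or extra colon before mmcrnsd ever sees the file.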

Then overwrite sdc and sdd with dd as shown below to erase the previous NSD format.

[root@gw ~]# dd if=/dev/zero of=/dev/sdc bs=1M count=100

100+0 records in

100+0 records out

104857600 bytes (105 MB) copied, 0.0130229 s, 8.1 GB/s

[root@gw ~]# dd if=/dev/zero of=/dev/sdd bs=1M count=100

100+0 records in

100+0 records out

104857600 bytes (105 MB) copied, 0.0128207 s, 8.2 GB/s

Run the mmcrnsd and mmcrfs commands to create the NSDs and the GPFS filesystems.

[root@gw ~]# mmcrnsd -F /home/SW/gpfs/disk.desc1

[root@gw ~]# mmcrnsd -F /home/SW/gpfs/disk.desc2

[root@gw ~]# mmcrfs /data /dev/nsd_data -F /home/SW/gpfs/disk.desc1

The following disks of nsd_data will be formatted on node gw:

nsd_data: size 409600 MB

Formatting file system ...

Disks up to size 3.18 TB can be added to storage pool system.

Creating Inode File

Creating Allocation Maps

Creating Log Files

Clearing Inode Allocation Map

Clearing Block Allocation Map

Formatting Allocation Map for storage pool system

Completed creation of file system /dev/nsd_data.

mmcrfs: Propagating the cluster configuration data to all

affected nodes. This is an asynchronous process.

[root@gw ~]# mmcrfs /backup /dev/nsd_backup -F /home/SW/gpfs/disk.desc2

The following disks of nsd_backup will be formatted on node gw:

nsd_backup: size 409600 MB

Formatting file system ...

Disks up to size 3.18 TB can be added to storage pool system.

Creating Inode File

Creating Allocation Maps

Creating Log Files

Clearing Inode Allocation Map

Clearing Block Allocation Map

Formatting Allocation Map for storage pool system

Completed creation of file system /dev/nsd_backup.

mmcrfs: Propagating the cluster configuration data to all

affected nodes. This is an asynchronous process.

Now mount them on all nodes.

[root@gw ~]# mmmount all -a

Sat Oct 24 09:45:43 KST 2020: mmmount: Mounting file systems ...

[root@gw ~]# df -h

Filesystem Size Used Avail Use% Mounted on

devtmpfs 1.7G 0 1.7G 0% /dev

tmpfs 1.8G 18M 1.8G 1% /dev/shm

tmpfs 1.8G 96M 1.7G 6% /run

tmpfs 1.8G 0 1.8G 0% /sys/fs/cgroup

/dev/sda5 50G 5.4G 45G 11% /

/dev/sda6 345G 8.6G 337G 3% /home

/dev/sda2 1014M 178M 837M 18% /boot

tmpfs 355M 0 355M 0% /run/user/0

/dev/sr0 3.4G 3.4G 0 100% /home/cdrom

nsd_backup 400G 2.8G 398G 1% /backup

nsd_data 400G 2.8G 398G 1% /data

For testing, copy the /etc/hosts file into /data.

[root@gw ~]# cp /etc/hosts /data

[root@gw ~]# ls -l /data

total 1

-rw-r--r--. 1 root root 298 Oct 24 09:49 hosts

Check that the filesystems are mounted on the client nodes as well, and that the hosts file copied earlier is there.

[root@gw ~]# ssh tac1

Last login: Sat Oct 24 09:33:46 2020 from gwp

[root@tac1 ~]# df -h

Filesystem Size Used Avail Use% Mounted on

devtmpfs 28G 0 28G 0% /dev

tmpfs 28G 0 28G 0% /dev/shm

tmpfs 28G 15M 28G 1% /run

tmpfs 28G 0 28G 0% /sys/fs/cgroup

/dev/sda5 50G 3.3G 47G 7% /

/dev/sda6 321G 2.8G 319G 1% /home

/dev/sda2 1014M 178M 837M 18% /boot

tmpfs 5.5G 0 5.5G 0% /run/user/0

nsd_data 400G 2.8G 398G 1% /data

nsd_backup 400G 2.8G 398G 1% /backup

[root@tac1 ~]# ls -l /data

total 1

-rw-r--r--. 1 root root 298 Oct 24 09:49 hosts

[root@gw ~]# ssh tac2

Last login: Sat Oct 24 09:33:46 2020 from gwp

[root@tac2 ~]# df -h

Filesystem Size Used Avail Use% Mounted on

devtmpfs 28G 0 28G 0% /dev

tmpfs 28G 0 28G 0% /dev/shm

tmpfs 28G 15M 28G 1% /run

tmpfs 28G 0 28G 0% /sys/fs/cgroup

/dev/sda5 50G 3.2G 47G 7% /

/dev/sda6 321G 3.3G 318G 2% /home

/dev/sda2 1014M 178M 837M 18% /boot

tmpfs 5.5G 0 5.5G 0% /run/user/0

nsd_backup 400G 2.8G 398G 1% /backup

nsd_data 400G 2.8G 398G 1% /data

[root@tac2 ~]# ls -l /data

total 1

-rw-r--r--. 1 root root 298 Oct 24 09:49 hosts
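The same mount check can be done on captured df -h output instead of reading it by eye on every node. A minimal sketch follows, assuming the standard six-column df layout with no spaces in the mount points; mounted_paths is a hypothetical helper name.

```python
def mounted_paths(df_output):
    """Collect the 'Mounted on' column from `df -h` text output."""
    paths = set()
    for line in df_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 6:
            paths.add(fields[-1])
    return paths


# A few rows copied from the tac1 output above.
sample_df = """\
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda5        50G  3.3G   47G   7% /
/dev/sda6       321G  2.8G  319G   1% /home
nsd_data        400G  2.8G  398G   1% /data
nsd_backup      400G  2.8G  398G   1% /backup
"""

required = {"/data", "/backup"}
missing = required - mounted_paths(sample_df)
print(missing)
```

An empty set means both GPFS filesystems are mounted on that node.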

For reference, you can check which disks are GPFS NSDs with the fdisk command as shown below. In the fdisk -l output, a partition shown as IBM General Par GPFS is a GPFS NSD.

[root@tac1 ~]# fdisk -l | grep sd

WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sda: 429.5 GB, 429496729600 bytes, 838860800 sectors

/dev/sda1 * 2048 10239 4096 41 PPC PReP Boot

/dev/sda2 10240 2107391 1048576 83 Linux

/dev/sda3 2107392 60829695 29361152 82 Linux swap / Solaris

/dev/sda4 60829696 838860799 389015552 5 Extended

/dev/sda5 60831744 165689343 52428800 83 Linux

/dev/sda6 165691392 838860799 336584704 83 Linux

Disk /dev/sdb: 429.5 GB, 429496729600 bytes, 838860800 sectors

Disk /dev/sdc: 429.5 GB, 429496729600 bytes, 838860800 sectors

Disk /dev/sdd: 429.5 GB, 429496729600 bytes, 838860800 sectors

Disk /dev/sde: 429.5 GB, 429496729600 bytes, 838860800 sectors

Disk /dev/sdf: 429.5 GB, 429496729600 bytes, 838860800 sectors

Disk /dev/sdg: 429.5 GB, 429496729600 bytes, 838860800 sectors

[root@tac1 ~]# fdisk -l /dev/sdb

WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sdb: 429.5 GB, 429496729600 bytes, 838860800 sectors

Units = sectors of 1 * 512 = 512 bytes

Sector size (logical/physical): 512 bytes / 512 bytes

I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk label type: gpt

Disk identifier: 236CE033-C570-41CC-8D2E-E20E6F494C38

# Start End Size Type Name

1 48 838860751 400G IBM General Par GPFS:

[root@tac1 ~]# fdisk -l /dev/sdc

WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sdc: 429.5 GB, 429496729600 bytes, 838860800 sectors

Units = sectors of 1 * 512 = 512 bytes

Sector size (logical/physical): 512 bytes / 512 bytes

I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk label type: gpt

Disk identifier: 507A299C-8E96-49E2-8C25-9D051BC9B935

# Start End Size Type Name

1 48 838860751 400G IBM General Par GPFS:

An ordinary disk shows up plainly, as below.

[root@tac1 ~]# fdisk -l /dev/sdd

Disk /dev/sdd: 429.5 GB, 429496729600 bytes, 838860800 sectors

Units = sectors of 1 * 512 = 512 bytes

Sector size (logical/physical): 512 bytes / 512 bytes

I/O size (minimum/optimal): 512 bytes / 512 bytes

How to configure a Redhat HA Cluster on IBM POWER9 Redhat 7

This is how to install a Redhat HA cluster on Redhat 7 running on IBM POWER9 (ppc64le).

First, stop firewalld.

[root@ha1 ~]# systemctl stop firewalld

[root@ha1 ~]# systemctl disable firewalld

Install the packages below. They are not on the Redhat OS DVD but in a separate yum repository. For ppc64le, they are in the yum repo named rhel-ha-for-rhel-7-server-for-power-le-rpms.

[root@ha1 ~]# yum install pcs fence-agents-all

The installation automatically creates a user named hacluster, and you need to set a password for it.

[root@ha1 ~]# passwd hacluster

Then start the pcsd daemon. Also enable it so that it starts automatically after a reboot.

[root@ha1 ~]# systemctl start pcsd.service

[root@ha1 ~]# systemctl enable pcsd.service

Authenticate the nodes that will join the cluster, as shown below.

[root@ha1 ~]# pcs cluster auth ha1 ha2

Username: hacluster

Password:

ha1: Authorized

ha2: Authorized

Create a simple corosync.conf file as shown below. The ha1 and ha2 node names must of course resolve to IP addresses registered in /etc/hosts.

[root@ha1 ~]# vi /etc/corosync/corosync.conf

totem {

version: 2

secauth: off

cluster_name: tibero_cluster

transport: udpu

}

nodelist {

node {

ring0_addr: ha1

nodeid: 1

}

node {

ring0_addr: ha2

nodeid: 2

}

}

quorum {

provider: corosync_votequorum

two_node: 1

}

logging {

to_syslog: yes

}
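Because this file is all nested braces, it is easy to lose one when editing by hand. As a rough sketch, the same layout can be rendered from a node list, which makes the nesting explicit (the node blocks sit inside nodelist) and allows a simple brace-balance check. render_corosync_conf is a hypothetical helper; it reproduces only the settings used in this post.

```python
def render_corosync_conf(cluster_name, nodes):
    """Render a minimal corosync.conf with the settings used in this post."""
    lines = [
        "totem {",
        "    version: 2",
        "    secauth: off",
        "    cluster_name: %s" % cluster_name,
        "    transport: udpu",
        "}",
        "",
        "nodelist {",
    ]
    # One node block per cluster member, numbered from 1.
    for node_id, addr in enumerate(nodes, start=1):
        lines += [
            "    node {",
            "        ring0_addr: %s" % addr,
            "        nodeid: %d" % node_id,
            "    }",
        ]
    lines += ["}", "", "quorum {", "    provider: corosync_votequorum"]
    if len(nodes) == 2:
        # two_node is only meaningful for a two-node cluster.
        lines.append("    two_node: 1")
    lines += ["}", "", "logging {", "    to_syslog: yes", "}"]
    return "\n".join(lines) + "\n"


conf = render_corosync_conf("tibero_cluster", ["ha1", "ha2"])
# Every opening brace should have a matching closing brace.
assert conf.count("{") == conf.count("}")
print(conf)
```

The brace count is a crude check, but it is enough to catch a block that was never closed.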

Enable the cluster on all nodes.

[root@ha1 ~]# pcs cluster enable --all

ha1: Cluster Enabled

ha2: Cluster Enabled

Start the cluster as follows.

[root@ha1 ~]# pcs cluster start --all

ha1: Starting Cluster (corosync)...

ha2: Starting Cluster (corosync)...

ha1: Starting Cluster (pacemaker)...

ha2: Starting Cluster (pacemaker)...

Check the status.

[root@ha1 ~]# pcs cluster status

Cluster Status:

Stack: unknown

Current DC: NONE

Last updated: Mon Oct 26 10:14:59 2020

Last change: Mon Oct 26 10:14:55 2020 by hacluster via crmd on ha1

2 nodes configured

0 resource instances configured

PCSD Status:

ha1: Online

ha2: Online

At this point, ha2 does not have a corosync.conf file.

[root@ha2 ~]# ls -l /etc/corosync/

total 12

-rw-r--r--. 1 root root 2881 Jun 5 23:10 corosync.conf.example

-rw-r--r--. 1 root root 767 Jun 5 23:10 corosync.conf.example.udpu

-rw-r--r--. 1 root root 3278 Jun 5 23:10 corosync.xml.example

drwxr-xr-x. 2 root root 6 Jun 5 23:10 uidgid.d

To have it created on ha2, sync the cluster.

[root@ha1 ~]# pcs cluster sync

ha1: Succeeded

ha2: Succeeded

You can confirm that it has been created.

[root@ha2 ~]# ls -l /etc/corosync/

total 16

-rw-r--r--. 1 root root 295 Oct 26 10:17 corosync.conf

-rw-r--r--. 1 root root 2881 Jun 5 23:10 corosync.conf.example

-rw-r--r--. 1 root root 767 Jun 5 23:10 corosync.conf.example.udpu

-rw-r--r--. 1 root root 3278 Jun 5 23:10 corosync.xml.example

drwxr-xr-x. 2 root root 6 Jun 5 23:10 uidgid.d

Now check the cluster resources. Naturally, nothing is defined yet.

[root@ha1 ~]# pcs resource show

NO resources configured

Register the cluster virtual IP that will fail over between the two nodes under the resource ID VirtualIP, as shown below. For reference, ha1 is 10.1.1.1 and ha2 is 10.1.1.2, both assigned on eth1.

[root@ha1 ~]# pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=10.1.1.11 cidr_netmask=24 nic=eth1 op monitor interval=30s

[root@ha1 ~]# pcs resource enable VirtualIP

Now look at the resources again.

[root@ha1 ~]# pcs resource show

VirtualIP (ocf::heartbeat:IPaddr2): Stopped

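VirtualIP is still in the Stopped state. This is because STONITH is enabled by default but no fence device has been configured yet, so pacemaker will not start resources. STONITH is a fencing mechanism that protects against split-brain; for now, we will disable it.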

[root@ha1 ~]# pcs property set stonith-enabled=false

Run a verify. If it prints no messages, it passed.

[root@ha1 ~]# crm_verify -L

Now if you look at the status again, you can see that VirtualIP is up.

[root@ha1 ~]# pcs status

Cluster name: tibero_cluster

Stack: corosync

Current DC: ha1 (version 1.1.23-1.el7-9acf116022) - partition with quorum

Last updated: Mon Oct 26 11:18:31 2020

Last change: Mon Oct 26 11:17:31 2020 by root via cibadmin on ha1

2 nodes configured

1 resource instance configured

Online: [ ha1 ha2 ]

Full list of resources:

VirtualIP (ocf::heartbeat:IPaddr2): Started ha1

Daemon Status:

corosync: active/enabled

pacemaker: active/enabled

pcsd: active/enabled

Looking at the IP addresses, you can also see that 10.1.1.11 has been added to eth1.

[root@ha1 ~]# ip a | grep 10.1.1

inet 10.1.1.1/24 brd 10.1.1.255 scope global noprefixroute eth1

inet 10.1.1.11/24 brd 10.1.1.255 scope global secondary eth1

Pinging 10.1.1.11 (havip) from another node works fine.

[root@gw ~]# ping havip

PING havip (10.1.1.11) 56(84) bytes of data.

64 bytes from havip (10.1.1.11): icmp_seq=1 ttl=64 time=0.112 ms

64 bytes from havip (10.1.1.11): icmp_seq=2 ttl=64 time=0.040 ms

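To control what happens when a failed node comes back after this resource has failed over, set resource-stickiness as shown below. A positive stickiness makes the resource prefer to stay on the node where it is currently running, so it does not automatically fail back to the original node.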

[root@ha1 ~]# pcs resource defaults resource-stickiness=100

Warning: Defaults do not apply to resources which override them with their own defined values

[root@ha1 ~]# pcs resource defaults

resource-stickiness=100
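Resource stickiness is easiest to reason about as a score. The toy model below is my own simplification, not Pacemaker's actual scheduler, and choose_node is a hypothetical helper: it adds the stickiness value to the score of the node where the resource currently runs and then picks the highest-scoring node. Ties are deliberately left out of the examples.

```python
def choose_node(current, location_scores, stickiness):
    """Toy placement: highest score wins; the current node gets a stickiness bonus."""
    scores = {
        node: score + (stickiness if node == current else 0)
        for node, score in location_scores.items()
    }
    return max(scores, key=scores.get)


# Situation: the resource failed over to ha2, and ha1 has just come back.

# No location preference, stickiness 100: the resource stays on ha2.
stays = choose_node("ha2", {"ha1": 0, "ha2": 0}, stickiness=100)

# A location preference of 50 for ha1 and no stickiness: it moves back to ha1.
moves_back = choose_node("ha2", {"ha1": 50, "ha2": 0}, stickiness=0)

# A preference for ha1 (200) that outweighs stickiness (100): it moves back too.
outweighed = choose_node("ha2", {"ha1": 200, "ha2": 0}, stickiness=100)

print(stays, moves_back, outweighed)
```

In this model, failback only happens when something (for example a location constraint) scores the original node higher than the stickiness bonus.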

Now do the LVM work using the sdb disk. The goal here is for /data and /backup to be mounted on ha1 together with VirtualIP, and to fail over to ha2 when needed.

[root@ha1 ~]# pvcreate /dev/sdb

[root@ha1 ~]# vgcreate datavg /dev/sdb

[root@ha1 ~]# lvcreate -L210000 -n datalv datavg

[root@ha1 ~]# lvcreate -L150000 -n backuplv datavg

[root@ha1 ~]# vgs

VG #PV #LV #SN Attr VSize VFree

datavg 1 2 0 wz--n- <400.00g 48.43g

[root@ha1 ~]# lvs

LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert

backuplv datavg -wi-a----- 146.48g

datalv datavg -wi-a----- <205.08g

[root@ha1 ~]# mkfs.ext4 /dev/datavg/datalv

[root@ha1 ~]# mkfs.ext4 /dev/datavg/backuplv

[root@ha1 ~]# mkdir /data

[root@ha1 ~]# mkdir /backup

[root@ha1 ~]# ssh ha2 mkdir /data

[root@ha1 ~]# ssh ha2 mkdir /backup
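The lvcreate -L values above have no unit suffix, so LVM reads them as megabytes (MiB), while lvs reports GiB. The small arithmetic sketch below shows how 210000 and 150000 line up with the sizes in the lvs output; mib_to_gib is just an illustrative helper name.

```python
def mib_to_gib(mib):
    """Convert a size in MiB to GiB."""
    return mib / 1024


datalv_gib = "%.2f" % mib_to_gib(210000)
backuplv_gib = "%.2f" % mib_to_gib(150000)
print(datalv_gib)    # 205.08
print(backuplv_gib)  # 146.48
```

If you would rather think in GiB from the start, lvcreate also accepts a unit suffix such as -L 200G.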

Now configure this VG and these filesystems so that they are activated and mounted on only one node at a time, and only by the HA cluster (pacemaker) rather than by the OS.

[root@ha1 ~]# grep use_lvmetad /etc/lvm/lvm.conf

use_lvmetad = 1

[root@ha1 ~]# lvmconf --enable-halvm --services --startstopservices

[root@ha1 ~]# grep use_lvmetad /etc/lvm/lvm.conf

use_lvmetad = 0

[root@ha1 ~]# vgs --noheadings -o vg_name

datavg

Then register this VG and the LVs' filesystems with pcs.

[root@ha1 ~]# pcs resource create tibero_vg LVM volgrpname=datavg exclusive=true --group tiberogroup

Assumed agent name 'ocf:heartbeat:LVM' (deduced from 'LVM')

[root@ha1 ~]# pcs resource create tibero_data Filesystem device="/dev/datavg/datalv" directory="/data" fstype="ext4" --group tiberogroup

Assumed agent name 'ocf:heartbeat:Filesystem' (deduced from 'Filesystem')

[root@ha1 ~]# pcs resource create tibero_backup Filesystem device="/dev/datavg/backuplv" directory="/backup" fstype="ext4" --group tiberogroup

Assumed agent name 'ocf:heartbeat:Filesystem' (deduced from 'Filesystem')

[root@ha1 ~]# pcs resource update VirtualIP --group tiberogroup

Then add a colocation constraint so that these filesystems always run on the same node as VirtualIP.

[root@ha1 ~]# pcs constraint colocation add tiberogroup with VirtualIP INFINITY

The steps below are performed on both nodes. ha2 must also be rebooted for datavg and its LVs to be recognized there.

[root@ha1 ~]# vi /etc/lvm/lvm.conf

...

volume_list = [ ]

...

[root@ha1 ~]# dracut -H -f /boot/initramfs-$(uname -r).img $(uname -r)

[root@ha1 ~]# shutdown -r now

After the reboot, you can see that VirtualIP and the /data and /backup filesystems are all up on ha1.

[root@ha1 ~]# pcs status

Cluster name: tibero_cluster

Stack: corosync

Current DC: ha1 (version 1.1.23-1.el7-9acf116022) - partition with quorum

Last updated: Mon Oct 26 13:44:36 2020

Last change: Mon Oct 26 13:44:14 2020 by root via cibadmin on ha1

2 nodes configured

4 resource instances configured

Online: [ ha1 ha2 ]

Full list of resources:

VirtualIP (ocf::heartbeat:IPaddr2): Started ha1

Resource Group: tiberogroup

tibero_vg (ocf::heartbeat:LVM): Started ha1

tibero_data (ocf::heartbeat:Filesystem): Started ha1

tibero_backup (ocf::heartbeat:Filesystem): Started ha1

Daemon Status:

corosync: active/enabled

pacemaker: active/enabled

pcsd: active/enabled

If you kill the ha1 node, you can soon confirm that VirtualIP and the filesystems have automatically failed over to ha2.

[root@ha1 ~]# halt -f

Halting.

------------

[root@ha2 ~]# df -h

Filesystem Size Used Avail Use% Mounted on

devtmpfs 28G 0 28G 0% /dev

tmpfs 28G 58M 28G 1% /dev/shm

tmpfs 28G 14M 28G 1% /run

tmpfs 28G 0 28G 0% /sys/fs/cgroup

/dev/sda5 50G 2.6G 48G 6% /

/dev/sda6 321G 8.5G 313G 3% /home

/dev/sda2 1014M 231M 784M 23% /boot

tmpfs 5.5G 0 5.5G 0% /run/user/0

/dev/mapper/datavg-datalv 202G 61M 192G 1% /data

/dev/mapper/datavg-backuplv 145G 61M 137G 1% /backup

[root@ha2 ~]# ip a | grep 10.1.1

inet 10.1.1.2/24 brd 10.1.1.255 scope global noprefixroute eth1

inet 10.1.1.11/24 brd 10.1.1.255 scope global secondary eth1

With ha1 down, the status looks like this.

[root@ha2 ~]# pcs status

Cluster name: tibero_cluster

Stack: corosync

Current DC: ha2 (version 1.1.23-1.el7-9acf116022) - partition with quorum

Last updated: Mon Oct 26 13:48:49 2020

Last change: Mon Oct 26 13:47:18 2020 by root via cibadmin on ha1

2 nodes configured

4 resource instances configured

Online: [ ha2 ]

OFFLINE: [ ha1 ]

Full list of resources:

VirtualIP (ocf::heartbeat:IPaddr2): Started ha2

Resource Group: tiberogroup

tibero_vg (ocf::heartbeat:LVM): Started ha2

tibero_data (ocf::heartbeat:Filesystem): Started ha2

tibero_backup (ocf::heartbeat:Filesystem): Started ha2

Daemon Status:

corosync: active/enabled

pacemaker: active/enabled

pcsd: active/enabled
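To confirm a failover from a script, the resource lines of pcs status can be parsed into a name-to-node map. This is a sketch under the assumption that resource lines look like the ones above ("name (ocf::provider:agent): State node"); resource_placement is a hypothetical helper, and it works on captured text only.

```python
import re

# Matches lines such as "tibero_vg (ocf::heartbeat:LVM): Started ha2".
# The node is optional because a stopped resource has none.
RESOURCE_LINE = re.compile(
    r"^\s*(\S+)\s+\(ocf::[^)]+\):\s+(\S+)(?:\s+(\S+))?\s*$"
)


def resource_placement(status_text):
    """Map resource name -> (state, node) from `pcs status` text output."""
    placement = {}
    for line in status_text.splitlines():
        match = RESOURCE_LINE.match(line)
        if match:
            name, state, node = match.groups()
            placement[name] = (state, node)
    return placement


# Resource section copied from the status output above.
sample_status = """\
Full list of resources:
 VirtualIP (ocf::heartbeat:IPaddr2): Started ha2
 Resource Group: tiberogroup
     tibero_vg (ocf::heartbeat:LVM): Started ha2
     tibero_data (ocf::heartbeat:Filesystem): Started ha2
     tibero_backup (ocf::heartbeat:Filesystem): Started ha2
"""

placement = resource_placement(sample_status)
all_on_ha2 = all(p == ("Started", "ha2") for p in placement.values())
print(placement)
print(all_on_ha2)
```

The "Resource Group:" line is skipped on purpose, since it names a group rather than a resource.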

Stopping the pcs cluster in this state brings down VirtualIP and all the filesystems.

[root@ha2 ~]# pcs cluster stop --force

Stopping Cluster (pacemaker)...

Stopping Cluster (corosync)...

[root@ha2 ~]# df -h

Filesystem Size Used Avail Use% Mounted on

devtmpfs 28G 0 28G 0% /dev

tmpfs 28G 0 28G 0% /dev/shm

tmpfs 28G 14M 28G 1% /run

tmpfs 28G 0 28G 0% /sys/fs/cgroup

/dev/sda5 50G 2.6G 48G 6% /

/dev/sda6 321G 8.5G 313G 3% /home

/dev/sda2 1014M 231M 784M 23% /boot

tmpfs 5.5G 0 5.5G 0% /run/user/0

[root@ha2 ~]# ip a | grep 10.1.1

inet 10.1.1.2/24 brd 10.1.1.255 scope global noprefixroute eth1

GPU performance benchmark test using tf_cnn_benchmarks

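The natural way to measure the performance of a GPU server for deep learning is to run a benchmark test with tensorflow, the framework most widely used for deep learning. There is no established standard saying that a given GPU should reach a given number of images/sec, because the result varies enormously depending on the type of neural network used, minor parameters, and the dataset. Obtaining a labeled image dataset for testing is not easy either.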

There is, however, a way around these difficulties when benchmarking tensorflow on GPUs: tf_cnn_benchmarks, published on the Tensorflow github. It is no longer being updated, but it is still widely used because of the following advantages.

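1) Simple to use

2) The number of GPUs can be varied between tests

3) It can also run on CPUs only, so GPU performance can be compared against CPU

4) No separate test image dataset is needed; the python code synthesizes one

5) Performance is reported simply and clearly in images/sec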

To run it on an IBM POWER server with POWER9 processors, first download and install the anaconda package as follows.

[cecuser@p1226-kvm1 ~]$ wget https://repo.anaconda.com/archive/Anaconda3-2020.07-Linux-ppc64le.sh

[cecuser@p1226-kvm1 ~]$ chmod a+x Anaconda3-2020.07-Linux-ppc64le.sh

[cecuser@p1226-kvm1 ~]$ ./Anaconda3-2020.07-Linux-ppc64le.sh

Do you accept the license terms? [yes|no]

[no] >>> yes

[/home/cecuser/anaconda3] >>>

Do you wish the installer to initialize Anaconda3

by running conda init? [yes|no]

[no] >>> yes

Once the Anaconda installation finishes, sourcing your profile starts the conda environment.

[cecuser@p1226-kvm1 ~]$ . ~/.bashrc

(base) [cecuser@p1226-kvm1 ~]$

Next, register the conda channel for IBM OPEN-CE (formerly IBM PowerAI or Watson Machine Learning Community Edition), which makes it easy to install deep learning related open source conda packages in an IBM ppc64le environment.

(base) [cecuser@p1226-kvm1 ~]$ conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/

This tf_cnn_benchmarks run uses python 3.6, so create a conda virtual environment for python 3.6 as follows. Here it is named wmlce_env.

(base) [cecuser@p1226-kvm1 ~]$ conda create --name wmlce_env python=3.6

Activate the wmlce_env environment you just created.

(base) [cecuser@p1226-kvm1 ~]$ conda activate wmlce_env

You can see that the front of the prompt has changed to (wmlce_env). Now install OPEN-CE by installing the package under its former name, powerai. Depending on your network this can take quite a while (30 minutes to a few hours).

(wmlce_env) [cecuser@p1226-kvm1 ~]$ conda install powerai

After that, clone benchmarks, a repository under the tensorflow github organization, as follows.

(wmlce_env) [cecuser@p1226-kvm1 ~]$ git clone https://github.com/tensorflow/benchmarks

Then move into tf_cnn_benchmarks.

(wmlce_env) [cecuser@p1226-kvm1 ~]$ cd benchmarks/scripts/tf_cnn_benchmarks

Run tf_cnn_benchmarks.py in this directory, passing parameters like the following.

First, a test using one GPU. The test below used an NVIDIA P100 GPU, but these are not official figures and other environments may give different numbers, so use them for reference only.

(wmlce_env) [cecuser@p1226-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server --data_format=NHWC --num_batches=640

Done warm up

Step Img/sec total_loss

1 images/sec: 224.9 +/- 0.0 (jitter = 0.0) 8.108

10 images/sec: 225.9 +/- 0.2 (jitter = 0.6) 8.122

20 images/sec: 226.0 +/- 0.2 (jitter = 0.4) 7.983

...

620 images/sec: 221.4 +/- 0.6 (jitter = 0.7) 7.772

630 images/sec: 221.4 +/- 0.6 (jitter = 0.7) 7.676

640 images/sec: 221.5 +/- 0.6 (jitter = 0.7) 7.779

----------------------------------------------------------------

total images/sec: 221.40

To use 2 GPUs, pass --num_gpus=2 instead of --num_gpus=1 as below.

(wmlce_env) [cecuser@p1226-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=32 --model=resnet50 --variable_update=parameter_server --data_format=NHWC --num_batches=640

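The most intuitive way to check whether the GPU is performing properly is to run the same test on the CPU. Passing --device=CPU as below runs it on the CPU only. The result again depends closely on CPU performance and the number of CPUs, so use the numbers below for reference only.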

(wmlce_env) [cecuser@p1226-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --device=CPU --batch_size=32 --model=resnet50 --variable_update=parameter_server --data_format=NHWC --num_batches=640

Done warm up

Step Img/sec total_loss

1 images/sec: 3.0 +/- 0.0 (jitter = 0.0) 8.108

10 images/sec: 3.0 +/- 0.0 (jitter = 0.0) 8.122

20 images/sec: 2.9 +/- 0.0 (jitter = 0.0) 7.983

...

620 images/sec: 2.9 +/- 0.0 (jitter = 0.0) 7.750

630 images/sec: 2.9 +/- 0.0 (jitter = 0.0) 7.667

640 images/sec: 2.9 +/- 0.0 (jitter = 0.0) 7.805

----------------------------------------------------------------

total images/sec: 2.91

----------------------------------------------------------------
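
The GPU-to-CPU comparison described above comes down to dividing the two "total images/sec" figures. A minimal sketch follows; the function name total_images_per_sec is hypothetical, and the two input strings are simply the summary lines from the runs shown above.

```python
import re

def total_images_per_sec(log_text):
    """Extract the final throughput figure from tf_cnn_benchmarks output."""
    m = re.search(r"total images/sec:\s*([0-9.]+)", log_text)
    if m is None:
        raise ValueError("no 'total images/sec' line found")
    return float(m.group(1))

gpu_log = "total images/sec: 221.40"  # 1x P100 run from this post
cpu_log = "total images/sec: 2.91"    # --device=CPU run from this post
speedup = total_images_per_sec(gpu_log) / total_images_per_sec(cpu_log)
print(f"GPU is about {speedup:.1f}x faster than CPU in this environment")
```

As with the raw figures, the ratio only describes this particular virtualized environment.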

For reference, nvidia-smi shows the GPU utilization during the GPU test above as follows.

Tue Oct 20 04:23:57 2020

+-----------------------------------------------------------------------------+

| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |

|-------------------------------+----------------------+----------------------+

| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |

| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |

|===============================+======================+======================|

| 0 Tesla P100-SXM2... On | 00000001:00:01.0 Off | 0 |

| N/A 51C P0 240W / 300W | 15677MiB / 16280MiB | 96% Default |

+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+

| Processes: GPU Memory |

| GPU PID Type Process name Usage |

|=============================================================================|

| 0 16243 C python 15667MiB |

+-----------------------------------------------------------------------------+
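
If you want to log these figures during a run instead of reading the table, nvidia-smi can print them as CSV through its --query-gpu option. The sketch below is an illustration only: parse_gpu_line and query_gpus are hypothetical helper names, and the sample line reuses the utilization and memory values from the table above.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"

def parse_gpu_line(line):
    """Parse one CSV row such as '96, 15677, 16280' (percent, MiB, MiB)."""
    util, used, total = (int(x.strip()) for x in line.split(","))
    return {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}

def query_gpus():
    """Run nvidia-smi and return one dict per GPU (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [parse_gpu_line(l) for l in out.splitlines() if l.strip()]

# Offline example using the values from the table above
row = parse_gpu_line("96, 15677, 16280")
mem_pct = 100 * row["mem_used_mib"] / row["mem_total_mib"]
print(row, f"memory in use: {mem_pct:.1f}%")
```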

Monday, July 6, 2020

How to reset a forgotten root passwd on Ubuntu 18.04 (ppc64le, IBM POWER9)

First, when the petit-boot menu appears during system boot, select the 'Exit to shell' entry on the bottom line.

Here, 'fdisk -l' shows which disks exist and which disk partition holds the OS. In my case there were two disks, sda and sdb, and on sda three devices were visible: dm-0, dm-1, and dm-2. Judging by size, dm-0 was the PReP partition and dm-2 the SWAP partition, so dm-1 was presumably the OS partition. Mount it on /mnt.

# mount /dev/dm-1 /mnt

Looking inside /mnt, you should see an installed OS, with etc, lib, usr, var, and so on. Now use the chroot command to make /mnt the root (/).

# chroot /mnt

The OS image in dm-1 is now your /. Change the passwd.

# passwd

Then exit with Ctrl-D and boot normally.

Friday, June 12, 2020

A demo of Tensorflow Large Model Support (LMS) using tf_cnn_benchmarks.py

This is a way to demo large model support (LMS) using the Tensorflow 1.15 included in IBM Watson Machine Learning Community Edition (WML-CE) 1.6.2.

** Please note that the environment below uses a virtualized P100 GPU provided by the IBM CECC cloud, so the training performance itself is lower than it would otherwise be.

The simplest approach is to use the tf_cnn_benchmarks suite included in WML-CE.

First, copy the tf_cnn_benchmarks suite to a directory of your choice with the command below. (This step is optional; you can also just go into the original directory directly.)

(wmlce_162) [cecuser@p1290-kvm1 ~]$ ./anaconda3/envs/wmlce_162/bin/install_tf_cnn_benchmarks .

(wmlce_162) [cecuser@p1290-kvm1 ~]$ cd tf_cnn_benchmarks

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ ls
all_reduce_benchmark.py cnn_util.py mlperf_test.py test_data
all_reduce_benchmark_test.py cnn_util_test.py models test_util.py
allreduce.py coco_metric.py platforms tf_cnn_benchmarks.py
allreduce_test.py constants.py preprocessing.py variable_mgr.py
batch_allreduce.py convnet_builder.py __pycache__ variable_mgr_util.py
benchmark_cnn_distributed_test.py datasets.py README.md variable_mgr_util_test.py
benchmark_cnn_distributed_test_runner.py flags.py run_tests.py
benchmark_cnn.py leading_indicators_test.py ssd_constants.py
benchmark_cnn_test.py mlperf.py ssd_dataloader.py

However, part of the source code in benchmark_cnn.py needs to be modified, because of a bug related to the LMS parameter lms_swapout_threshold. Leaving the original value of -1 should enable auto-tuning, but that fails with an error due to compatibility problems with the TF version and the like, so for now just change it to 1.

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ vi benchmark_cnn.py

...

#flags.DEFINE_integer('lms_swapout_threshold', -1,

flags.DEFINE_integer('lms_swapout_threshold', 1,

...
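
If you would rather not edit the file by hand in vi, the same one-line change can be scripted. This is only a sketch: patch_swapout_default is a hypothetical helper, and the pattern assumes the definition looks exactly like the line shown above.

```python
import re

# Matches the opening of the flag definition shown above, capturing the
# text before and after the default value of -1.
PATTERN = re.compile(
    r"(flags\.DEFINE_integer\(\s*'lms_swapout_threshold',\s*)-1(\s*,)"
)

def patch_swapout_default(source, new_default=1):
    """Return source with the lms_swapout_threshold default replaced."""
    patched, count = PATTERN.subn(rf"\g<1>{new_default}\g<2>", source)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return patched

before = "flags.DEFINE_integer('lms_swapout_threshold', -1,\n"
after = patch_swapout_default(before)
print(after, end="")
```

To apply it, you would read benchmark_cnn.py, pass its text through the function, and write the result back, ideally after keeping a backup copy.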

Then run tf_cnn_benchmarks.py as below, here with batch_size set to 150. You can see that it runs fine.

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=150 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10
...
I0612 04:14:43.269922 140736229690240 session_manager.py:502] Done running local_init_op.
Running warm up
2020-06-12 04:14:45.534222: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-12 04:14:45.728889: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step Img/sec total_loss
1 images/sec: 255.8 +/- 0.0 (jitter = 0.0) 7.820
10 images/sec: 255.8 +/- 0.1 (jitter = 0.4) 8.082
20 images/sec: 255.7 +/- 0.1 (jitter = 0.4) 7.856
30 images/sec: 255.6 +/- 0.1 (jitter = 0.3) 7.832
40 images/sec: 255.5 +/- 0.0 (jitter = 0.3) 7.879
50 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.701
60 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.918
70 images/sec: 255.5 +/- 0.0 (jitter = 0.2) 7.845
80 images/sec: 255.4 +/- 0.0 (jitter = 0.2) 7.750
90 images/sec: 255.3 +/- 0.0 (jitter = 0.2) 7.806
100 images/sec: 255.3 +/- 0.0 (jitter = 0.3) 7.856
----------------------------------------------------------------
total images/sec: 255.22
----------------------------------------------------------------

Observing the OS with the nmon tool at this point, host RAM usage is only about 5GB, and for tf_cnn_benchmarks the process size (the Size column) is only about 32GB and the res set size about 2.1GB.

x Memory and Swap qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqx
x PageSize:64KB RAM-Memory Swap-Space High-Memory Low-Memory x
x Total (MB) 63392.2 4094.5 - not in use - not in use x
x Free (MB) 58734.8 3518.5 x
x Free Percent 92.7% 85.9% x

x Top Processes Procs=365-mode=3-1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Argsqqqqqx
x PID %CPU Size Res Res Res Res Shared Faults Command x
x Used KB Set Text Data Lib KB Min Maj x
x 13671 42.1 32142m 2153m 3200 2866m 0 633216 12 0 tf_cnn_benchmar x

Now let's raise batch_size to 200. The P100 GPU's memory, only 16GB, fills up and an Out-Of-Memory (OOM) error occurs.

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=200 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10
...
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[200,2048,7,7] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node tower_0/v/cg/resnet_v113/conv46/conv2d/Conv2D (defined at /home/cecuser/anaconda3/envs/wmlce_162/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[main_fetch_group/_566]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[200,2048,7,7] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node tower_0/v/cg/resnet_v113/conv46/conv2d/Conv2D (defined at /home/cecuser/anaconda3/envs/wmlce_162/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
...
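
It helps to see how small the failing allocation is. The error names a tensor of shape [200,2048,7,7] and type float; assuming 4 bytes per element (32-bit float), its size can be worked out as below. The helper name tensor_bytes is hypothetical.

```python
from math import prod

def tensor_bytes(shape, bytes_per_element=4):
    """Size of a dense tensor, assuming 4-byte (float32) elements by default."""
    return prod(shape) * bytes_per_element

shape = (200, 2048, 7, 7)  # shape reported in the OOM message above
size = tensor_bytes(shape)
print(f"{size} bytes = {size / 2**20:.1f} MiB")
```

A single tensor of well under 100 MiB failing to allocate indicates that the rest of the 16GB was already taken by other activations and weights at this batch size, not that this one tensor is unusually large.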

With the same batch_size, however, applying LMS with the --lms=True option lets the run complete without error (although much more slowly).

(wmlce_162) [cecuser@p1290-kvm1 tf_cnn_benchmarks]$ python tf_cnn_benchmarks.py --batch_size=200 --num_batches=100 --model=resnet50 --num_gpus=1 --display_every=10 --lms=True
....
I0612 04:27:06.439511 140735558208384 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: -0.09 GiB (memory ratio: 0.9)
I0612 04:27:06.439677 140735558208384 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
I0612 04:27:06.440271 140735558208384 lms.py:1275] [LMS][0] LMS will use the latest parameter set found by Simulator for the best performance. However, if you encounter an out-of-memory error, please manually use the previous parameter set found by Simulator.
I0612 04:27:06.440359 140735558208384 lms.py:1275] [LMS][0] sync_mode: 3 (Synchronous memory copy between host and device)
I0612 04:27:06.440439 140735558208384 lms.py:1275] [LMS][0] swapout_threshold: 1
I0612 04:27:06.440520 140735558208384 lms.py:1275] [LMS][0] swapin_ahead: -1 (ignored since sync_mode is 3)
I0612 04:27:06.440600 140735558208384 lms.py:1275] [LMS][0] swapin_groupby: -1 (ignored since sync_mode is 3)
I0612 04:27:06.869183 140735558208384 lms.py:1275] [LMS][0] Added 425 operations to the model (180 swap-out operations (20.33 GiB) and 245 swap-in operations (31.36 GiB))
I0612 04:27:06.869335 140735558208384 lms.py:1275] [LMS][0] Editing model for LMS, took: 799.3814945220947 ms
...
Running warm up
2020-06-12 04:27:15.098435: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-12 04:27:15.371592: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step Img/sec total_loss
1 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.988
10 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.900
20 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.914
30 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 8.043
40 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.880
50 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.903
60 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.889
70 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.770
80 images/sec: 21.1 +/- 0.0 (jitter = 0.0) 7.906
90 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.813
100 images/sec: 21.1 +/- 0.0 (jitter = 0.1) 7.824
----------------------------------------------------------------
total images/sec: 21.13
----------------------------------------------------------------

Observing the host OS now, host RAM usage has risen sharply to about 22GB, and for tf_cnn_benchmarks the process size (the Size column) has grown to about 50GB and the res set size to about 19GB.

x Memory and Swap qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqx
x PageSize:64KB RAM-Memory Swap-Space High-Memory Low-Memory x
x Total (MB) 63392.2 4094.5 - not in use - not in use x
x Free (MB) 41505.6 3490.5 x
x Free Percent 65.5% 85.2% x

x Top Processes Procs=365-mode=3-1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Argsqqqqqx
x PID %CPU Size Res Res Res Res Shared Faults Command x
x Used KB Set Text Data Lib KB Min Maj x
x 13427 10.4 49577m19322m 3200 3399m 0 176710 0 0 tf_cnn_benchmar x
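
The "about 5GB" and "about 22GB" figures come from subtracting Free from Total in the nmon Memory panel, and the cost of LMS is the ratio of the two throughput figures. A small sketch of that arithmetic, using only numbers from the outputs above (the function name used_mb is hypothetical):

```python
def used_mb(total_mb, free_mb):
    """Host RAM in use, from the Total and Free rows of the nmon panel."""
    return total_mb - free_mb

without_lms = used_mb(63392.2, 58734.8)  # batch_size=150, no LMS
with_lms = used_mb(63392.2, 41505.6)     # batch_size=200, --lms=True
slowdown = 255.22 / 21.13                # total images/sec without vs with LMS

print(f"host RAM used: {without_lms / 1024:.1f} GiB -> {with_lms / 1024:.1f} GiB")
print(f"throughput is about {slowdown:.1f}x lower with LMS in this run")
```

Note that the two runs use different batch sizes, so the throughput ratio is an indication of the swapping overhead here, not a controlled comparison.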

Wednesday, April 8, 2020

LSF Q&A: job suspend - resume, and the ports to open in a firewalled environment

Q1. Is it possible to suspend - resume a job that uses GPUs?

: I have heard that in LSF, when a much more urgent and important job B arrives but a running job A is using all the resources so that none are available, you can temporarily suspend the RUNNING job A, run the urgent job B on the freed resources, and then resume the suspended job A once B finishes. That presumably describes jobs running on CPUs; does suspend-resume also work well for jobs running on GPUs?

A1. It is possible, but for deep learning training jobs such as tensorflow it has little practical value.

: As you already know, the bkill command literally kills an LSF job that has been submitted and is running, whereas bstop does not terminate the job but suspends it. (More precisely, bkill can also suspend a job, depending on the options used.) A job suspended this way can later be resumed with the bresume command (or bkill -s CONT).
Note, however, that a suspended job releases only CPU resources; the memory held by the existing process is not freed. So to suspend a job with bstop and resume it later with bresume, the server needs enough free memory to hold the suspended job alongside whatever runs in the meantime. HPC cluster nodes usually have large amounts of RAM, so pausing and resuming CPU jobs with bstop - bresume works well.

On GPUs the story is more complicated. A deep learning application running on a GPU, for example python using tensorflow, in most cases claims all of the GPU memory on the device for efficient training. If you send bstop to such a tensorflow python process, the process is not killed but suspended, and the memory tensorflow was using remains fully allocated. That is, if you check GPU usage with nvidia-smi or similar, GPU compute is at 0%, but the tensorflow python process still holds the GPU and, in particular, GPU memory usage still shows nearly 100%.

Suspend - resume is still possible in that case, but the whole reason for suspending a running job is to run another, more urgent job, and in that state the other job cannot run, because it will find no available GPU memory. Of course, for applications such as caffe or h2o that do not take all GPU memory but occupy only as much as they need, you can suspend them for a while with bstop - bresume and run a more important job, just as on CPUs.
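
The point that a suspended process stays alive and keeps what it holds can be seen outside LSF with plain signals. The sketch below assumes suspension is signal-based, as the bkill -s CONT form mentioned above suggests; it does not use LSF at all, only a stand-in child process stopped with SIGSTOP and continued with SIGCONT on Linux.

```python
import signal
import subprocess
import sys
import time

# Stand-in for a running job: a child process that sleeps briefly and exits.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.5)"])

child.send_signal(signal.SIGSTOP)   # suspend: the process stops being scheduled
time.sleep(1.0)                     # longer than the child's own sleep
# The process has not exited; it still exists and still owns its memory.
still_alive_while_stopped = child.poll() is None

child.send_signal(signal.SIGCONT)   # resume: the process continues where it was
exit_code = child.wait(timeout=30)
print("alive while stopped:", still_alive_while_stopped, "exit code:", exit_code)
```

A stopped process uses no CPU time, but nothing it has allocated is returned, which is the same reason the GPU memory stays occupied in the tensorflow case.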

Q2. Which network ports does LSF use for normal operation?

: I tried to configure and use LSF in a network environment with a firewall, and it does not work well. The network ports LSF needs are probably blocked by the firewall; besides port 22 for ssh, which ports need to be opened?

A2. You can check the basic network ports needed to operate LSF as follows.

$ cd $EGO_TOP/conf

$ grep PORT lsf.conf
LSF_LIM_PORT=7869
LSF_RES_PORT=6878
LSB_MBD_PORT=6881
LSB_SBD_PORT=6882
LSB_QUERY_PORT=6891

Note that LSF_LIM_PORT=7869 must be opened for both tcp and udp; all the others are tcp only. See the URLs below for details.

https://www.ibm.com/support/pages/configure-firewall-ports-lsf-master-platform-rtm-monitoring
https://www.ibm.com/support/pages/purpose-different-lsf-ports
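
The rule above (LIM on both tcp and udp, everything else tcp) can be turned into a list of port/protocol pairs to hand to whoever manages the firewall. A minimal sketch, where firewall_rules is a hypothetical helper and the sample text is the grep output shown above:

```python
def firewall_rules(lsf_conf_text):
    """Return (parameter, port, protocol) tuples for the *_PORT entries."""
    rules = []
    for line in lsf_conf_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if not key.endswith("_PORT"):
            continue
        port = int(value)
        rules.append((key, port, "tcp"))
        if key == "LSF_LIM_PORT":
            # Per the note above, LIM needs udp in addition to tcp
            rules.append((key, port, "udp"))
    return rules

sample = """LSF_LIM_PORT=7869
LSF_RES_PORT=6878
LSB_MBD_PORT=6881
LSB_SBD_PORT=6882
LSB_QUERY_PORT=6891"""
rules = firewall_rules(sample)
for name, port, proto in rules:
    print(f"{port}/{proto}  # {name}")
```

If your lsf.conf defines other *_PORT parameters, such as LSF_DATA_PORT for LSF Data Manager, they would be picked up by the same loop and treated as tcp, which you should confirm against the IBM pages linked above.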

Tuesday, April 7, 2020

Why du and df can report different usage

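The filesystem usage measured by du (disk usage) and df (disk free) often differs. This is because du adds up the sizes of the files it finds while walking the directory tree (via stat/fstat), whereas df uses statfs. In other words, df looks at filesystem-level metadata, so even after a file is deleted, as long as some process still holds it open in any way, the file still appears to occupy disk space.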

The most typical case is a file being read with something like tail -f, as below.

First, confirm that / has 12G used and that there really are 12G of files.

bash-4.2# cd /

bash-4.2# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda2 94G 12G 78G 14% /

bash-4.2# du -sh .
12G .

Next, create a 1G file.

bash-4.2# dd if=/dev/zero of=/home/cecuser/111 bs=1024k count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.324121 s, 3.2 GB/s

Then check the sizes again with df and du.

bash-4.2# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda2 94G 13G 77G 15% /

bash-4.2# du -sh .
13G .

Now run tail -f in the background on the new file 111 so that it is held open for reading.

bash-4.2# tail -f /home/cecuser/111 &
[1] 8624

Then delete 111.

bash-4.2# rm /home/cecuser/111

The tail -f started in the background is still alive, as you can see. In other words, the file 111 is still open.

bash-4.2# jobs

[1]+ Running tail -f /home/cecuser/111 &

Now check the filesystem usage with df and du.

bash-4.2# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda2 94G 13G 77G 15% /

bash-4.2# du -sh .
12G .

df still reports 13G in use, while du has already dropped back to 12G because it no longer finds the file in the directory tree.

Now let's kill the tail -f we left running.

bash-4.2# jobs
[1]+ Running tail -f /home/cecuser/111 &

bash-4.2# kill %1
bash-4.2#
[1]+ Terminated tail -f /home/cecuser/111

Now df also shows 12G, because the space is actually released once the last process holding the file closes it.

bash-4.2# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/vda2 94G 12G 78G 14% /

bash-4.2# du -sh .
12G .
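
The same effect can be reproduced in a few lines without dd and tail. The sketch below, written for Linux, creates a temporary file, deletes its name while keeping the descriptor open, and then inspects it with fstat. The helper fs_used_bytes is a hypothetical name showing the kind of filesystem-level figure df works from.

```python
import os
import tempfile

def fs_used_bytes(path):
    """Filesystem-level usage as df sees it, via statvfs (not a tree walk)."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize

size = 1024 * 1024
fd, path = tempfile.mkstemp()      # create a file and keep it open
os.write(fd, b"\0" * size)
os.unlink(path)                    # remove the name, like `rm` in the steps above

info = os.fstat(fd)                # the open descriptor still refers to the file
path_exists = os.path.exists(path)
print("name exists:", path_exists)
print("links:", info.st_nlink, "size still held:", info.st_size)
print("filesystem used bytes:", fs_used_bytes(tempfile.gettempdir()))

os.close(fd)                       # only now can the blocks be released
```

A tree walk such as du cannot see this file any more because it has no name, while the filesystem still accounts for its blocks until the descriptor is closed.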

Monday, April 6, 2020

Using LSF Data Manager, with a simple example

One of the good things about LSF HPC Suite is that it includes LSF Data Manager. If you install with ansible-playbook as described in the previous posting, LSF Data Manager is installed automatically.

The configuration files related to LSF Data Manager are shown below. By default nothing needs to be changed, with one exception: in the lsf.datamanager.<cluster_name> file, STAGING_AREA must be changed to an NFS or GPFS location that every server in the cluster can access.

[cecuser@p628-kvm1 ~]$ sudo vi /opt/ibm/lsfsuite/lsf/conf/lsbatch/myCluster/configdir/lsb.queues
...
Begin Queue
QUEUE_NAME = dataq
PRIORITY = 33
DATA_TRANSFER = Y
HOSTS = p628-kvm1
End Queue

[cecuser@p628-kvm1 ~]$ sudo vi /opt/ibm/lsfsuite/lsf/conf/lsf.conf
...
LSB_RESOURCE_ENFORCE="memory cpu gpu"
LSF_DATA_PORT=9998
LSF_DATA_HOSTS="p628-kvm1"
LSF_SERVER_HOSTS="p628-kvm1"
LSF_GPU_AUTOCONFIG=Y
LSB_GPU_NEW_SYNTAX=extend

[cecuser@p628-kvm1 ~]$ sudo vi /opt/ibm/lsfsuite/lsf/conf/lsf.datamanager.myCluster
...
Begin Parameters
ADMINS = lsfadmin
#STAGING_AREA = /opt/ibm/lsfsuite/lsf/work/myCluster/staging
STAGING_AREA = /home/staging
# CACHE_INPUT_GRACE_PERIOD = 1440
# CACHE_OUTPUT_GRACE_PERIOD = 180
# CACHE_PERMISSIONS = user
# QUERY_NTHREADS = 4
# REMOTE_CACHE_REFRESH_INTERVAL = 15
End Parameters

The /home/staging specified above is an NFS-exported directory, mounted under the same name on the p628-kvm2 and p628-kvm3 servers, as shown below.

[cecuser@p628-kvm1 ~]$ cat /etc/exports
/home/staging p628-kvm2(rw,sync,no_subtree_check) p628-kvm3(rw,sync,no_subtree_check)

[cecuser@p628-kvm2 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda2 94G 9.2G 81G 11% /
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 32M 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
tmpfs 3.2G 0 3.2G 0% /run/user/1000
p628-kvm1:/home/staging 94G 15G 75G 17% /home/staging
tmpfs 3.2G 0 3.2G 0% /run/user/0

After changing lsf.datamanager.myCluster as above, you must run bdata admin reconfig as follows for the change to take effect. Some options also require restarting lsf itself, so it is a good idea to run lsadmin limrestart and badmin mbdrestart as well.

[cecuser@p628-kvm1 ~]$ su
Password:

bash-4.2# bdata admin reconfig

Checking configuration files ...
No errors found.

LSF data manager daemon (dmd) on <p628-kvm1> is reconfiguring ... initiated.

bash-4.2# lsadmin limrestart

bash-4.2# badmin mbdrestart

Checking the configuration now shows that STAGING_AREA has changed.

[cecuser@p628-kvm1 ~]$ bdata showconf
LSF data management configuration at Thu Apr 2 02:30:30 2020
ADMINS = lsfadmin
CACHE_ACCESS_CONTROL = N
CACHE_ACCESSIBLE_FILES = N
CACHE_INPUT_GRACE_PERIOD = 1440 (minutes)
CACHE_OUTPUT_GRACE_PERIOD = 180 (minutes)
CACHE_PERMISSIONS = user
CACHE_REFRESH_INTERVAL = 15 (minutes)
FILE_TRANSFER_CMD = /usr/bin/scp
LSB_TIME_DMD = 0
LSF_DATA_BSUB_CHKSUM = N
LSF_DATA_HOSTS = p628-kvm1
LSF_DATA_PORT = 9998
LSF_LOGDIR = /opt/ibm/lsflogs
LSF_LOG_MASK = LOG_WARNING
PERMISSION_MASK = 000
QUERY_NTHREADS = 4
REMOTE_CACHE_REFRESH_INTERVAL = 15 (seconds)
STAGING_AREA = /home/staging

[cecuser@p628-kvm1 ~]$ bdata connections
LOCAL: (LSF_DATA_PORT=9998)
LSF_DATA_HOSTS
[*]p628-kvm1

CLUSTER MASTER STATUS
myCluster p628-kvm1 ok

Now let's do a simple test of LSF Data Manager. First, create a data file.

[cecuser@p628-kvm1 ~]$ echo "beef jerky" > /tmp/test1.txt

Then write the job script to run. Here it first stages in /tmp/test1.txt from the p628-kvm1 server and then cats it.

[cecuser@p628-kvm1 ~]$ cat job1.sh
bstage in -src p628-kvm1:/tmp/test1.txt
cat test1.txt

Let's have this script run on the p628-kvm2 server. Clearly, p628-kvm2 does not yet have either the job1.sh script or the test1.txt data file.

[cecuser@p628-kvm2 ~]$ pwd
/home/cecuser

[cecuser@p628-kvm2 ~]$ ls
support-scripts

Now use the -m option as follows so that p628-kvm2 reads the p628-kvm1 server's /tmp/test1.txt with cat.

[cecuser@p628-kvm1 ~]$ bsub -I -m p628-kvm2 -data "p628-kvm1:/tmp/test1.txt" < job1.sh
Job <465> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on p628-kvm2.cecc.ihost.com>>
beef jerky

It works, as shown above. If you look at p628-kvm2 afterwards, test1.txt has actually been copied into the current working directory (CWD) on that server.

[cecuser@p628-kvm2 ~]$ ls -l
total 8
drwxr-xr-x. 2 cecuser cecuser 4096 Apr 5 03:15 support-scripts
-rw-r--r--. 1 cecuser cecuser 11 Apr 6 01:07 test1.txt

Next, let's look at using LSF Data Manager to treat the fast NVMe flash drive installed in each server like a burst buffer.

First, assume that each server has a fast NVMe flash drive mounted at /nvme, an ordinary directory that is not shared.

[cecuser@p628-kvm1 ~]$ ls -l /nvme
total 0

[cecuser@p628-kvm2 ~]$ ls -l /nvme
total 0

[cecuser@p628-kvm3 ~]$ ls -l /nvme
total 0

And assume the data source is under /tmp/data1 on the p628-kvm1 server.

[cecuser@p628-kvm1 ~]$ ls -l /tmp/data1
total 12
-rw-rw-r--. 1 cecuser cecuser 6 Apr 6 02:08 1.txt
-rw-rw-r--. 1 cecuser cecuser 8 Apr 6 02:08 2.txt
-rw-rw-r--. 1 cecuser cecuser 6 Apr 6 02:09 3.txt

[cecuser@p628-kvm1 ~]$ cat /tmp/data1/*.txt
donut
biscuit
bagel

Let's use the Linux paste command to merge these files and print the result to the screen. Before merging, though, the files are first staged in to /nvme, the fast storage, on one of the three servers p628-kvm1~3, and paste is run there.

The key here is the -cwd (current working directory) option of bsub, which lets you set /nvme on each server as the job's working directory.

So it goes as follows. First write the script to run.

[cecuser@p628-kvm1 ~]$ cat paste.sh
bstage in -src p628-kvm1:/tmp/data1/*
cd data1
paste 1.txt 2.txt 3.txt

Then submit it with -cwd /nvme so that all files in /tmp/data1/ on p628-kvm1 are brought into /nvme on the execution host. At this point we do not yet know which server will be the execution host.

[cecuser@p628-kvm1 ~]$ bsub -I -cwd /nvme -data "p628-kvm1:/tmp/data1/*" < paste.sh
Job <493> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on p628-kvm3.cecc.ihost.com>>
donut biscuit bagel

As shown above, it ran on p628-kvm3. Looking at /nvme on p628-kvm3 confirms that the files were copied there.

[cecuser@p628-kvm3 ~]$ ls -l /nvme/data1/
total 12
-rw-r--r--. 1 cecuser cecuser 6 Apr 6 02:33 1.txt
-rw-r--r--. 1 cecuser cecuser 8 Apr 6 02:33 2.txt
-rw-r--r--. 1 cecuser cecuser 6 Apr 6 02:33 3.txt
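
If you submit such jobs from a script, building the bsub argument list explicitly avoids shell quoting surprises with the * in the data requirement. The sketch below only assembles the arguments already used in this post (-I, -m, -cwd, -data); bsub_args is a hypothetical helper and nothing is submitted.

```python
def bsub_args(data_spec, cwd=None, host=None, interactive=True):
    """Assemble a bsub argument list using only options shown in this post."""
    args = ["bsub"]
    if interactive:
        args.append("-I")
    if host:
        args += ["-m", host]       # pin the execution host
    if cwd:
        args += ["-cwd", cwd]      # working directory on the execution host
    args += ["-data", data_spec]   # data requirement, passed without globbing
    return args

args = bsub_args("p628-kvm1:/tmp/data1/*", cwd="/nvme")
print(args)
```

On a machine with LSF installed, the list could be passed to subprocess.run with the job script opened as stdin, which mirrors the `< paste.sh` redirection above.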

Files cached in the staging area this way are kept for 24 hours by default, so when they are requested again they can be reused right away without staging them in again.

[cecuser@p628-kvm1 ~]$ bdata cache p628-kvm1:/tmp/data1/*
--------------------------------------------------------------------------------
INPUT:
p628-kvm1:/tmp/data1/*

HASH STATUS REF_JOB XFER_JOB GRACE
e9e6f1* TRANSFERRED - 484@myCluster 23hr29min
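
The 24 hours correspond to CACHE_INPUT_GRACE_PERIOD = 1440 (minutes) in the bdata showconf output. Assuming the GRACE column shows the time remaining before the cached copy may be cleaned up (an interpretation of this output, not something stated in it), the time since the transfer can be worked out as follows. The function name grace_minutes is hypothetical.

```python
import re

def grace_minutes(text):
    """Convert a GRACE value such as '23hr29min' to minutes."""
    m = re.fullmatch(r"(?:(\d+)hr)?(?:(\d+)min)?", text)
    if not text or m is None:
        raise ValueError(f"unrecognized grace value: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in m.groups())
    return hours * 60 + minutes

CACHE_INPUT_GRACE_PERIOD = 1440          # minutes, from bdata showconf
remaining = grace_minutes("23hr29min")   # value from the bdata cache output
elapsed = CACHE_INPUT_GRACE_PERIOD - remaining
print(f"{remaining} min remaining, about {elapsed} min since the grace period started")
```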

Now let's modify paste.sh slightly to make paste_back.sh. This one saves the result to a file and stages it out to a specific location with the -dst option. Here we send it to /tmp/data1 on p628-kvm1, where the data source came from.

[cecuser@p628-kvm1 ~]$ cat paste_back.sh
bstage in -src p628-kvm1:/tmp/data1/*
cd data1
paste 1.txt 2.txt 3.txt > result.txt
bstage out -src result.txt -dst p628-kvm1:/tmp/data1/result.txt

Running it gives the following.

[cecuser@p628-kvm1 ~]$ bsub -I -cwd /nvme -data "p628-kvm1:/tmp/data1/*" < paste_back.sh
Job <495> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on p628-kvm3.cecc.ihost.com>>

Looking at the location given with -dst, result.txt has indeed arrived.

[cecuser@p628-kvm1 ~]$ ls -l /tmp/data1/
total 16
-rw-rw-r--. 1 cecuser cecuser 6 Apr 6 02:08 1.txt
-rw-rw-r--. 1 cecuser cecuser 8 Apr 6 02:08 2.txt
-rw-rw-r--. 1 cecuser cecuser 6 Apr 6 02:09 3.txt
-rw-rw-r--. 1 cecuser cecuser 20 Apr 6 03:33 result.txt
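
As a sanity check on the 20-byte size of result.txt, the merge can be imitated in a few lines: paste joins the corresponding lines of its inputs with tab characters. The function below is a simplified stand-in, not the real command; in particular it stops at the shortest input, whereas paste pads missing fields.

```python
def paste(*columns):
    """Join corresponding lines of each input with tabs (simplified paste)."""
    rows = zip(*(col.splitlines() for col in columns))
    return "".join("\t".join(row) + "\n" for row in rows)

result = paste("donut\n", "biscuit\n", "bagel\n")
size = len(result.encode())
print(repr(result), size, "bytes")
```

The three words plus two tabs and a trailing newline give the 20 bytes reported by ls -l above.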

The files are of course also in /nvme on p628-kvm3. Note that the 1~3.txt files used earlier are unchanged, with their original timestamps. Because they are the same files as before, they were reused rather than copied again, saving I/O time.

[cecuser@p628-kvm3 ~]$ ls -l /nvme/data1/
total 16
-rw-r--r--. 1 cecuser cecuser 6 Apr 6 02:33 1.txt
-rw-r--r--. 1 cecuser cecuser 8 Apr 6 02:33 2.txt
-rw-r--r--. 1 cecuser cecuser 6 Apr 6 02:33 3.txt
-rw-rw-r--. 1 cecuser cecuser 20 Apr 6 03:33 result.txt

What happens if we slightly change the contents of just one of these files?

[cecuser@p628-kvm1 ~]$ echo "baguette" > /tmp/data1/1.txt

Let's run it again, this time using the -m option to make sure it runs on p628-kvm3 again.

[cecuser@p628-kvm1 ~]$ bsub -I -m p628-kvm3 -cwd /nvme -data "p628-kvm1:/tmp/data1/*" < paste_back.sh
Job <504> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on p628-kvm3.cecc.ihost.com>>

[cecuser@p628-kvm1 ~]$ cat /tmp/data1/result.txt
baguette biscuit bagel

Yes, it has changed. On p628-kvm3, you can see that the timestamps of all three files changed, not only 1.txt but also 2.txt and 3.txt.

[cecuser@p628-kvm3 ~]$ ls -l /nvme/data1/
total 16
-rw-r--r--. 1 cecuser cecuser 9 Apr 6 03:41 1.txt
-rw-r--r--. 1 cecuser cecuser 8 Apr 6 03:41 2.txt
-rw-r--r--. 1 cecuser cecuser 6 Apr 6 03:41 3.txt
-rw-r--r--. 1 cecuser cecuser 23 Apr 6 03:41 result.txt

[cecuser@p628-kvm3 ~]$ cat /nvme/data1/result.txt
baguette biscuit bagel

Job id 504 above can of course be examined in more detail with bhist. Note that a job involving data stage in/out actually gets one additional job id. Besides 504, job id 505 is created automatically, and it goes not to the interactive queue we used but to dataq, a special queue used only by LSF Data Manager.

[cecuser@p628-kvm1 ~]$ bhist -l 504

Job <504>, User <cecuser>, Project <default>, Interactive mode, Command <bstage
in -src p628-kvm1:/tmp/data1/*;cd data1;paste 1.txt 2.txt
3.txt > result.txt;bstage out -src result.txt -dst p628-k
vm1:/tmp/data1/result.txt>
Mon Apr 6 03:41:35: Submitted from host <p628-kvm1>, to Queue <interactive>, C
WD <$HOME>, Specified CWD </nvme>, Specified Hosts <p628-k
vm3>, Data Requirement Requested;
Mon Apr 6 03:41:49: Dispatched 1 Task(s) on Host(s) <p628-kvm3.cecc.ihost.com>
, Allocated 1 Slot(s) on Host(s) <p628-kvm3.cecc.ihost.com
>, Effective RES_REQ <select[type == any] order[r15s:pg] >
;
Mon Apr 6 03:41:49: Starting (Pid 1611);
Mon Apr 6 03:41:55: Done successfully. The CPU time used is 0.1 seconds;
Mon Apr 6 03:41:56: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 3 Mbytes; AVG MEM: 3 Mbytes

Summary of time in seconds spent in various states by Mon Apr 6 03:41:56
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
14 0 6 0 0 0 20

[cecuser@p628-kvm1 ~]$ bhist -l 505

Job <505>, User <cecuser>, Project <default>, Command </opt/ibm/lsfsuite/lsf/10
.1/linux3.10-glibc2.17-ppc64le/etc/dm_stagein_transfer.sh>
, Job Description <4#user#7#cecuser#9#p628-kvm1#12#/tmp/da
ta1/*#32#3e53101b571911387ba9e9f222243bf9#>
Mon Apr 6 03:41:41: Submitted from host <p628-kvm1>, to Queue <dataq>, CWD <$H
OME>, Specified CWD </home/staging/stgin/user/cecuser/p628
-kvm1/tmp/data1/*/3e53101b571911387ba9e9f222243bf9>, Outpu
t File </dev/null>, Specified Clusters <myCluster>;
Mon Apr 6 03:41:41: Dispatched 1 Task(s) on Host(s) <p628-kvm1>, Allocated 1 S
lot(s) on Host(s) <p628-kvm1>, Effective RES_REQ <select[t
ype == local] order[r15s:pg] >;
Mon Apr 6 03:41:41: Starting (Pid 26673);
Mon Apr 6 03:41:41: Running with execution home </home/cecuser>, Execution CWD
</home/staging/stgin/user/cecuser/p628-kvm1/tmp/data1/*/3
e53101b571911387ba9e9f222243bf9>, Execution Pid <26673>;
Mon Apr 6 03:41:44: Done successfully. The CPU time used is 0.3 seconds;
Mon Apr 6 03:41:44: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 10 Mbytes; AVG MEM: 6 Mbytes

Summary of time in seconds spent in various states by Mon Apr 6 03:41:44
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
0 0 3 0 0 0 3
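
The Job Description of job 505 appears to be a sequence of length-prefixed fields separated by #, and the same fields reappear in the Specified CWD path under the staging area. This is an inference from this single output, not a documented format, so treat the sketch below as illustrative only; decode_fields is a hypothetical helper, and the wrapped bhist lines have been joined by hand.

```python
def decode_fields(encoded):
    """Decode 'len#value#len#value#...' into a list of values."""
    fields = []
    pos = 0
    while pos < len(encoded):
        sep = encoded.index("#", pos)
        length = int(encoded[pos:sep])
        start = sep + 1
        fields.append(encoded[start:start + length])
        pos = start + length + 1   # skip the '#' that ends the field
    return fields

description = ("4#user#7#cecuser#9#p628-kvm1#12#/tmp/data1/*"
               "#32#3e53101b571911387ba9e9f222243bf9#")
kind, user, host, src, digest = decode_fields(description)

# Rebuild the staging path seen in the bhist output (layout inferred, see above)
staging_cwd = f"/home/staging/stgin/{kind}/{user}/{host}{src}/{digest}"
print(staging_cwd)
```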

The dataq used here is a special queue that only the lsfadmin user can use; ordinary users cannot submit jobs to it.

[cecuser@p628-kvm1 ~]$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
admin 50 Open:Active - - - - 0 0 0 0
owners 43 Open:Active - - - - 0 0 0 0
priority 43 Open:Active - - - - 0 0 0 0
night 40 Open:Active - - - - 0 0 0 0
short 35 Open:Active - - - - 0 0 0 0
dataq 33 Open:Active - - - - 0 0 0 0
normal 30 Open:Active - - - - 0 0 0 0
interactive 30 Open:Active - - - - 0 0 0 0
idle 20 Open:Active - - - - 0 0 0 0

The only hosts used for dataq are the servers registered as HOSTS in the lsb.queues file, that is, the servers that handle reads and writes to the STAGING_AREA.

[cecuser@p628-kvm1 ~]$ bqueues -l dataq

QUEUE: dataq
-- No description provided.

PARAMETERS/STATISTICS
PRIO NICE STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SSUSP USUSP RSV PJOBS
33 0 Open:Active - - - - 0 0 0 0 0 0 0
Interval for a host to accept two jobs is 0 seconds

SCHEDULING PARAMETERS
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -

USERS: all
HOSTS: p628-kvm1
DATA_TRANSFER: Y

==============================

Above, STAGING_AREA was on a shared filesystem such as NFS or GPFS (Spectrum Scale), but sometimes no shared filesystem is available. LSF DM can still be used in that case by setting STAGING_AREA as follows.

[cecuser@p628-kvm1 ~]$ grep STAGING $EGO_TOP/conf/lsf.datamanager.myCluster
# STAGING_AREA = [host_name:]abs_file_path
STAGING_AREA = p628-kvm1:/opt/ibm/lsfsuite/lsf/work/myCluster/staging

That is, prefix the STAGING_AREA directory with the hostname of the DM master. In a large cluster there may be I/O nodes that handle I/O in addition to the DM master; in that case STAGING_AREA must be mounted through a shared filesystem such as NFS or GPFS between the DM master and the I/O nodes (though not necessarily on the execution hosts).

Also, with a non-shared STAGING_AREA like this, the execution hosts copy files from the DM master with scp, so passwordless ssh access must be set up in advance for the user concerned.

[cecuser@p628-kvm1 ~]$ bdata showconf | grep STAGING
STAGING_AREA = p628-kvm1:/opt/ibm/lsfsuite/lsf/work/myCluster/staging

[cecuser@p628-kvm1 ~]$ cat job1.sh
bstage in -src p628-kvm1:/tmp/test1.txt
cat test1.txt

[cecuser@p628-kvm1 ~]$ bsub -I -m p628-kvm2 -data "p628-kvm1:/tmp/test1.txt" < job1.sh

Job <585> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on p628-kvm2.cecc.ihost.com>>
beef jerky

After the run, checking on p628-kvm2 shows that test1.txt has been copied into the current working directory.

[cecuser@p628-kvm2 ~]$ ls -ltr
total 4
-rw-r--r--. 1 cecuser cecuser 11 Apr 9 00:44 test1.txt