Deep Learning for HW Engineers: python3


Thursday, May 16, 2019

Building a minimal docker image with CUDA 9.0 + Python 3.5.3 + Tensorflow 1.12 + Ubuntu 18.04

To keep the Docker image as small as possible, the first rule is not to use anaconda: installing Anaconda alone takes about 4GB. In addition, anaconda's python 3.5.2 was built with gcc 4.8, whereas the libraries shipped with Ubuntu 18.04 were built with gcc 7.x, so CXX_ABI problems are likely.

For this reason, I did not install python through anaconda and simply built it from source. I originally intended to use Python 3.5.2, but python 3.5.2 hits the pip-related error described in the link below. That error was fixed in python 3.5.3, so I had no choice but to install python 3.5.3.

https://stackoverflow.com/questions/50126814/ignoring-ensurepip-failure-pip-requires-ssl-tls-error-in-ubuntu-18-04

First, run the docker image built previously on CUDA 9.0 and Ubuntu 18.04.

ibm@uniac922:~/files$ sudo docker run --runtime=nvidia -ti --rm -v ~/files:/mnt bsyu/ubuntu18.04_cuda9-0_ppc64le:v0.3

root@f47afeb949b5:/# cd /mnt

Download the Python 3.5.3 source.

root@75ecf9980173:/mnt# wget https://www.python.org/ftp/python/3.5.3/Python-3.5.3.tgz

root@75ecf9980173:/mnt# ls -l Python-3.5.3.tgz
-rw-rw-r-- 1 1003 1003 20656090 Jan 17 2017 Python-3.5.3.tgz

root@75ecf9980173:/mnt# tar -zxf Python-3.5.3.tgz

root@75ecf9980173:/mnt# cd Python-3.5.3

Building Python is simple: configure, make, then make install.

root@75ecf9980173:/mnt/Python-3.5.3# ./configure

root@75ecf9980173:/mnt/Python-3.5.3# make -j 32

root@75ecf9980173:/mnt/Python-3.5.3# make install

The built python is installed as /usr/local/bin/python3, so /usr/local/bin must be added to PATH. pip is likewise installed as /usr/local/bin/pip3.

root@75ecf9980173:/mnt/Python-3.5.3# export PATH=/usr/local/bin:$PATH

root@75ecf9980173:/mnt/Python-3.5.3# echo 'export PATH=/usr/local/bin:$PATH' >> ~/.bashrc

root@75ecf9980173:/mnt/Python-3.5.3# which python3
/usr/local/bin/python3

root@75ecf9980173:/mnt/Python-3.5.3# which pip3
/usr/local/bin/pip3

root@75ecf9980173:/mnt/Python-3.5.3# cd ..
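
Before installing anything with pip3, it can be worth confirming that the freshly built interpreter contains the optional modules pip depends on, since the pip error linked above is about a missing SSL/TLS module. The following is a minimal sketch rather than part of the original procedure; the helper name missing_modules and the list of modules it checks are assumptions chosen for illustration.

```python
import importlib
import sys


def missing_modules(names=("ssl", "zlib", "ctypes")):
    """Return the names from `names` that this interpreter cannot import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Show which interpreter is running, then any optional modules it lacks.
print(sys.version.split()[0], sys.executable)
print("missing modules:", missing_modules())
```

Run it with /usr/local/bin/python3. An empty list suggests the build is complete; if ssl shows up, the OpenSSL development headers were most likely unavailable when configure ran.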

Now use pip3 to install the tensorflow wheel file that was built previously in the anaconda python 3.5.2 environment. Prerequisite packages such as numpy and keras-applications are installed automatically along with it.

root@75ecf9980173:/mnt# pip3 install tensorflow_pkg3/tensorflow-1.12.0-cp35-cp35m-linux_ppc64le.whl

Now start python3, import tensorflow, and test whether it picks up the GPUs correctly. As shown below, it works.

root@6afae4bd06e3:/mnt# python3
Python 3.5.3 (default, May 16 2019, 11:02:36)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf

>>> sess=tf.Session()
2019-05-16 11:48:42.110194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
totalMemory: 15.75GiB freeMemory: 15.45GiB
2019-05-16 11:48:42.260439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
...
2019-05-16 11:48:44.235074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14943 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)

Next, to reduce the size of the docker image, delete unnecessary files and packages as follows.

root@6afae4bd06e3:~# rm -rf /var/lib/apt/lists/* ~/.cache/*

root@6afae4bd06e3:~# apt remove curl fontconfig fontconfig-config fonts-dejavu-core fonts-dejavu-extra git git-man keyboard-configuration less lsb-release make manpages manpages-dev wget cuda-cublas-dev-9-0 cuda-npp-dev-9-0

Open another ssh session and, from the parent OS, commit the docker container above as an image.

ibm@uniac922:~$ sudo docker commit 6afae4bd06e3 bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le:v0.2
sha256:93a146481e72fbc8714ddbb1adfb596c7f20a094bf5788fbd856fc5e2a489b71

The image size turns out to be almost 7GB.

ibm@uniac922:~$ sudo docker images | grep py353
bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le v0.2 93a146481e72 3 seconds ago 6.99GB

To shrink it, export the container as follows.

ibm@uniac922:~$ sudo docker export 6afae4bd06e3 > ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le_v0.2.tar

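Because docker export writes out only the container's current filesystem, the files deleted above no longer take up space, and the exported tar file is only about 4.2GB.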

ibm@uniac922:~$ ls -l ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le_v0.2.tar
-rw-rw-r-- 1 ibm ibm 4204572160 May 16 11:58 ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le_v0.2.tar

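Now import it. An export/import round trip does not carry over image metadata such as environment variables, so do not forget the --change options; they are needed for the nvidia driver volume and related settings to be picked up correctly.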

ibm@uniac922:~$ cat ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le_v0.2.tar | sudo docker import --change "ENV NVIDIA_VISIBLE_DEVICES=all" --change "ENV NVIDIA_DRIVER_CAPABILITIES compute,utility" --change "ENV LD_LIBRARY_PATH /usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/cuda/lib64:/usr/lib:/usr/lib64:/lib:/lib64:/usr/local/lib:/usr/local/lib64" - bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le:v0.3
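
Since the three --change options are easy to mistype, the command line can also be assembled programmatically. This is only a sketch: the function name docker_import_cmd is hypothetical, the tarball path is passed as an argument instead of being piped through stdin, and the values are copied from the command above.

```python
import shlex


def docker_import_cmd(tarball, image, env):
    """Build the argument list for `docker import` with one ENV change per variable."""
    cmd = ["docker", "import"]
    for key in sorted(env):
        cmd += ["--change", "ENV {}={}".format(key, env[key])]
    return cmd + [tarball, image]


env = {
    "NVIDIA_VISIBLE_DEVICES": "all",
    "NVIDIA_DRIVER_CAPABILITIES": "compute,utility",
    "LD_LIBRARY_PATH": "/usr/local/nvidia/lib64:/usr/local/nvidia/lib:"
                       "/usr/local/cuda/lib64:/usr/lib:/usr/lib64:/lib:/lib64:"
                       "/usr/local/lib:/usr/local/lib64",
}
cmd = docker_import_cmd(
    "ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le_v0.2.tar",
    "bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le:v0.3",
    env,
)
# Print a copy-pasteable command; nothing is executed here.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run it, the list can be handed to subprocess.check_call, with sudo prepended if the docker daemon requires it.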

The imported docker image is about 4.2GB, nearly 3GB smaller than the original.

ibm@uniac922:~$ sudo docker images | grep py353
bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le v0.3 be956177f42a 6 seconds ago 4.16GB
bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le v0.2 93a146481e72 10 minutes ago 6.99GB
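
The saving can be worked out directly from the two sizes reported by docker images. The small calculation below is purely illustrative; the function name size_reduction is an assumption.

```python
def size_reduction(before_gb, after_gb):
    """Return (absolute saving in GB, saving as a percentage of the original)."""
    saved = before_gb - after_gb
    return saved, 100.0 * saved / before_gb


# Sizes of the v0.2 and v0.3 images as listed by `docker images`.
saved, percent = size_reduction(6.99, 4.16)
print("saved %.2f GB (%.1f%%)" % (saved, percent))
```

For the v0.2 and v0.3 images this comes to 2.83GB, roughly 40% of the original size.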

Check that tensorflow works in this new image. As shown below, it does.

ibm@uniac922:~$ sudo docker run --runtime=nvidia -ti --rm -v ~/files:/mnt bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le:v0.3 bash

root@a8d890e2a25a:/# python3
Python 3.5.3 (default, May 16 2019, 11:02:36)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf

>>> sess=tf.Session()
2019-05-16 03:07:31.017607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
...
2019-05-16 03:07:33.042540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14938 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0)

Now tag this v0.3 image as latest and push it to docker hub.

ibm@uniac922:~$ sudo docker images | grep py353
bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le latest be956177f42a 2 minutes ago 4.16GB
bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le v0.3 be956177f42a 2 minutes ago 4.16GB

ibm@uniac922:~$ sudo docker push bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le:v0.3

To use it, pull it with the command below.

$ sudo docker pull bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le:v0.3

or

$ sudo docker pull bsyu/ubuntu18.04_cuda9-0_py353_tf1.12_ppc64le:latest

Thursday, August 9, 2018

Installing CUDA 9.2 and PowerAI 5.2 on AC922

Download the cuda 9.2 rpm files from the NVIDIA website. As of 2018.8.9, the versions below are the latest.

# wget https://developer.nvidia.com/compute/cuda/9.2/Prod2/local_installers/cuda-repo-rhel7-9-2-local-9.2.148-1.ppc64le

# wget https://developer.nvidia.com/compute/cuda/9.2/Prod2/patches/1/cuda-repo-rhel7-9-2-148-local-patch-1-1.0-1.ppc64le

Install the downloaded files with rpm. This creates a local yum repository for CUDA.

# rpm -Uvh cuda-repo-rhel7-9-2-local-9.2.148-1.ppc64le cuda-repo-rhel7-9-2-148-local-patch-1-1.0-1.ppc64le

Now install cuda from the newly created CUDA local yum repository. There is no need to worry about versions; the following installs the latest one.

# yum install cuda

When the installation finishes, download the NCCL and CUDNN libraries, also from the NVIDIA website, and install them as shown below. It is less an installation than unpacking the tar balls under /usr/local.

https://developer.nvidia.com/nccl/nccl2-download-survey, https://developer.nvidia.com/rdp/cudnn-download

# tar -xvf nccl_2.2.13-1+cuda9.2_ppc64le.solitairetheme8 -C /usr/local

# tar -xvf cudnn-9.2-linux-ppc64le-v7.2.1.38.solitairetheme8 -C /usr/local

Once this installation is complete, a few files need to be modified as follows.

# vi /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# dracut --force

# vi /usr/lib/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target

# systemctl enable nvidia-persistenced

# nvidia-smi -pm 1

# vi /lib/udev/rules.d/40-redhat.rules (comment out the line below with #)
...
#SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"
...
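
Editing the rules file by hand works, but the same change can be scripted, which helps when several nodes need it. The following sketch only transforms text; the function name comment_out_memory_rule and the matching criteria are assumptions for illustration, and writing the result back to /lib/udev/rules.d/40-redhat.rules would need root privileges and, sensibly, a backup copy first.

```python
def comment_out_memory_rule(text):
    """Prefix the memory hot-add udev rule with '#', leaving other lines untouched."""
    result = []
    for line in text.splitlines(True):  # True keeps the line endings
        stripped = line.lstrip()
        if stripped.startswith('SUBSYSTEM=="memory"') and 'ACTION=="add"' in stripped:
            line = "#" + line
        result.append(line)
    return "".join(result)


# Sample input modelled on the rule shown above (not the full rules file).
sample = (
    '# Memory hotadd request\n'
    'SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", '
    'RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"\n'
)
print(comment_out_memory_rule(sample), end="")
```

Because a line that already starts with # no longer matches, running the function twice leaves the text unchanged.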

Now reboot.

After the reboot, install the PowerAI 5.2 rpm. Installing it creates a local repository named mldl.

[root@3eb2adfbf7f2 test]# rpm -Uvh mldl-repo-local-5.2.0-201806110545.c2f9a0f.ppc64le.rpm
Preparing... ################################# [100%]
Updating / installing...
1:mldl-repo-local-5.2.0-20180611054################################# [100%]

Depending on the environment, you may also need to edit the redhat-related repos conf file as follows.

[root@3eb2adfbf7f2 test]# vi /etc/yum/pluginconf.d/search-disabled-repos.conf
...
#notify_only=1
notify_only=0

Anaconda2 5.1 and Anaconda3 5.1 must be installed beforehand. (5.2 is the latest Anaconda, and installing it instead does not seem to cause problems.) If you plan to use Python2, install power-mldl with yum from the Anaconda2 environment as follows.

# which python
/opt/anaconda2/bin/python

# yum install power-mldl

When the installation finishes, accept the license.

# IBM_POWERAI_LICENSE_ACCEPT=yes /opt/DL/license/bin/accept-powerai-license.sh

To use a framework such as caffe or tensorflow, first set up its environment variables as follows.

# source /opt/DL/tensorflow/bin/tensorflow-activate
Missing dependencies
Run "/opt/DL/tensorflow/bin/install_dependencies" to resolve this problem.

For tensorflow in particular, as the message above indicates, you must run install_dependencies to install additional python packages from the internet.

[root@3eb2adfbf7f2 test]# /opt/DL/tensorflow/bin/install_dependencies
Fetching package metadata ...............

Now run tensorflow-activate again, and tensorflow is ready to use.

PowerAI defaults to python2, but you may want to use tensorflow or pytorch with python3.

In that case, first confirm that the default environment is Anaconda3, then install with the -py3 suffix as shown below.

# which python
/opt/anaconda3/bin/python

# yum install power-mldl-py3

# rpm -qa | grep tensorflow
tensorflow-py3-1.8.0-31721.7987738.ppc64le
tensorflow-performance-models-5.2.0-383.668a313.ppc64le
tensorflow-1.8.0-31721.7987738.ppc64le

As in the Python2 environment, accept the license and run install_dependencies here as well.

# source /opt/DL/tensorflow/bin/tensorflow-activate
Missing dependencies
Run "/opt/DL/tensorflow/bin/install_dependencies" to resolve this problem.

# /opt/DL/tensorflow/bin/install_dependencies

For reference, it installs or upgrades the python packages listed below. Some of them have to be fetched from the internet.

The following NEW packages will be INSTALLED:

absl-py: 0.1.10-py36_0 file://opt/DL/conda-pkgs
astor: 0.6.2-py_0 file://opt/DL/conda-pkgs
blas: 1.0-openblas
blosc: 1.14.3-hdbcaa40_0
bzip2: 1.0.6-h14c3975_5
ca-certificates: 2018.03.07-hf82bc7d_0
gast: 0.2.0-py36_0 file://opt/DL/conda-pkgs
glib: 2.53.6-h000015b_2
gmp: 6.1.2-h7f7056e_2
gmpy2: 2.0.8-py36h10f8cd9_2
grpcio: 1.10.0-py36hf484d3e_0 file://opt/DL/conda-pkgs
icu: 58.2-h64fc554_1
kiwisolver: 1.0.1-py36hf484d3e_0
libedit: 3.1.20170329-h6b74fdf_2
libgcc-ng: 7.2.0-h7cc24e2_2
libgfortran-ng: 7.2.0-h9f7466a_2
libopenblas: 0.2.20-h9ac9557_7
libprotobuf: 3.5.0-hf484d3e_0 file://opt/DL/conda-pkgs
libstdcxx-ng: 7.2.0-h7a57d05_2
libxcb: 1.13-h1bed415_0
lzo: 2.10-h0dabc4d_2
mpc: 1.1.0-h10f8cd9_1
mpfr: 4.0.1-hdf1c602_3
ncurses: 6.1-hf484d3e_0
openblas-devel: 0.2.20-7
powerai-tensorflow-prereqs: 1.8.0_31721.7987738-py36_0 file:///opt/DL/tensorflow/conda-pkgs
protobuf: 3.5.0-py36_0 file://opt/DL/conda-pkgs
readline: 7.0-h1bed415_4
snappy: 1.1.7-h1532aa0_3
termcolor: 1.1.0-py36_0 file://opt/DL/conda-pkgs
toposort: 1.5-py36_0 file://opt/DL/conda-pkgs
typing: 3.6.4-py36_0

The following packages will be UPDATED:

anaconda: 5.0.0-py36h39d2194_0 c3i_test --> custom-py36_0
cairo: 1.14.8-0 --> 1.14.10-h77bcde2_6
conda: 4.3.27-py36_0 --> 4.5.9-py36_0
conda-env: 2.6.0-0 --> 2.6.0-1
expat: 2.1.0-0 --> 2.2.5-hbd03837_0
fontconfig: 2.12.1-3 --> 2.12.6-h49f89f6_0
freetype: 2.5.5-2 --> 2.8-hadd163a_1
h5py: 2.7.0-np113py36_1 --> 2.8.0-py36h8d01980_0
hdf5: 1.8.17-2 --> 1.10.2-hba1933b_1
libpng: 1.6.30-2 --> 1.6.32-h288d48a_4
libtiff: 4.0.6-3 --> 4.0.9-he85c1e1_1
matplotlib: 2.0.2-np113py36_0 --> 2.2.2-py36hbc4b006_0
numpy: 1.13.1-py36_1 --> 1.13.3-py36h7cdd4dd_0
openblas: 0.2.19-0 --> 0.2.20-7
openssl: 1.0.2l-0 --> 1.0.2o-h14c3975_1
pillow: 4.2.1-py36_1 --> 5.0.0-py36h3deb7b8_0
pycosat: 0.6.2-py36_0 --> 0.6.3-py36h14c3975_0
pytables: 3.4.2-np113py36_0 --> 3.4.4-py36ha205bf6_0
python: 3.6.2-0 --> 3.6.5-hc3d631a_2
pyyaml: 3.12-py36_0 --> 3.13-py36h14c3975_0
ruamel_yaml: 0.11.14-py36_1 --> 0.15.46-py36h14c3975_0
scikit-learn: 0.19.0-np113py36_1 --> 0.19.1-py36h6cfcb94_0
scipy: 0.19.1-np113py36_1 --> 1.1.0-py36h9c1e066_0
sqlite: 3.13.0-0 --> 3.24.0-h84994c4_0
tk: 8.5.18-0 --> 8.6.7-hb4a6f0b_3
yaml: 0.1.6-0 --> 0.1.7-h1bed415_2

Now run the following again, and tensorflow is ready to use in the python3 environment.

# source /opt/DL/tensorflow/bin/tensorflow-activate