
Monday, August 6, 2018

How to use the h2o4gpu bundled with H2O Driverless AI from Python and R

Installing H2O Driverless AI on an AC922, a ppc64le-architecture server, is very simple. You can just follow the manual below; here is a quick summary for those who would rather not read through the whole manual.

http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/UsingDriverlessAI.pdf

First, download the Driverless AI rpm package:

[root@ING data]# wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/rel-1.2.2-6/ppc64le-centos7/dai-1.2.2-1.ppc64le.rpm

Install it with the rpm command:

[root@ING data]# rpm -Uvh dai-1.2.2-1.ppc64le.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:dai-1.2.2-1                      ################################# [100%]
User configuration file /etc/dai/User.conf already exists.
Group configuration file /etc/dai/Group.conf already exists.
Configured user in /etc/dai/User.conf is 'dai'.
Configured group in /etc/dai/Group.conf is 'dai'.
Group 'dai' already exists.
User 'dai' already exists.
Adding systemd configuration files in /etc/systemd/system...
Created symlink from /etc/systemd/system/dai.service.wants/dai-dai.service to /usr/lib/systemd/system/dai-dai.service.
Created symlink from /etc/systemd/system/dai.service.wants/dai-h2o.service to /usr/lib/systemd/system/dai-h2o.service.
Created symlink from /etc/systemd/system/dai.service.wants/dai-procsy.service to /usr/lib/systemd/system/dai-procsy.service.
Calling 'systemctl enable dai'...
Created symlink from /etc/systemd/system/multi-user.target.wants/dai.service to /usr/lib/systemd/system/dai.service.
Installation complete.

Starting DAI is equally simple. Just start dai.service as shown below, and the main process, the auxiliary h2o process, and the dai-procsy proxy process all come up automatically.

[root@ING ~]# systemctl start dai

[root@ING ~]# systemctl status dai-dai
● dai-dai.service - Driverless AI (Main Application Process)
   Loaded: loaded (/usr/lib/systemd/system/dai-dai.service; enabled; vendor preset: disabled)

[root@ING ~]# systemctl status dai-h2o
● dai-h2o.service - Driverless AI (H2O Process)
   Loaded: loaded (/usr/lib/systemd/system/dai-h2o.service; enabled; vendor preset: disabled)

[root@ING ~]# systemctl status dai-procsy
● dai-procsy.service - Driverless AI (Procsy Process)
   Loaded: loaded (/usr/lib/systemd/system/dai-procsy.service; enabled; vendor preset: disabled)

Then create a user named test1 and add that user to the group dai, DAI's default user/group:

[root@ING ~]# usermod -a -G dai test1
[root@ING ~]# cat /etc/group | grep dai
dai:x:980:test1
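As a quick sanity check, group membership can also be verified from Python with the standard-library grp module. A small sketch; the dai group name comes from the output above:

```python
import grp

def in_group(user, group):
    """Return True if `user` is listed as a supplementary member of `group`."""
    try:
        return user in grp.getgrnam(group).gr_mem
    except KeyError:          # group does not exist on this system
        return False

# On the machine above, in_group("test1", "dai") should now return True.
```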

Prepare t1.py and t1.R as follows. These scripts check whether h2o4gpu can be used from Python and R, respectively.

[test1@ING ~]$ cat t1.py
import h2o4gpu
import numpy as np
X = np.array([[1.,1.], [1.,4.], [1.,0.]])
model = h2o4gpu.KMeans(n_clusters=2,random_state=1234).fit(X)
model.cluster_centers_

[test1@ING ~]$ cat t1.R
library(reticulate)
library(h2o4gpu)
use_python("/opt/h2oai/dai/python/bin/python")
x <- iris[1:4]
y <- as.integer(iris$Species)
model <- h2o4gpu.random_forest_classifier() %>% fit(x, y)
pred <- model %>% predict(x)
library(Metrics)
ce(actual = y, predicted = pred)

Now, here is how to use the h2o4gpu bundled with DAI as the test1 user. In one line: set PATH and the other environment variables so that you use the python binary and PYTHONPATH provided by DAI.

[test1@ING ~]$ export PATH=/opt/h2oai/dai/python/bin:$PATH
[test1@ING ~]$ export LD_LIBRARY_PATH=/opt/h2oai/dai/python/lib:/opt/h2oai/dai/lib:$LD_LIBRARY_PATH
[test1@ING ~]$ export PYTHONPATH=/opt/h2oai/dai/cuda-9.2/lib/python3.6/site-packages
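The same environment setup can be scripted from Python as well. Here is a minimal sketch that prepends directories to colon-separated variables the way the exports above do; the /opt/h2oai/dai paths are the ones from this particular install:

```python
import os

def prepend_path(var, *dirs):
    """Prepend directories to a colon-separated environment variable."""
    old = os.environ.get(var, "")
    os.environ[var] = ":".join(list(dirs) + ([old] if old else []))
    return os.environ[var]

prepend_path("PATH", "/opt/h2oai/dai/python/bin")
prepend_path("LD_LIBRARY_PATH", "/opt/h2oai/dai/python/lib", "/opt/h2oai/dai/lib")
os.environ["PYTHONPATH"] = "/opt/h2oai/dai/cuda-9.2/lib/python3.6/site-packages"
```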

[test1@ING ~]$ pip list | grep h2o
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
h2o (3.20.0.2)
h2o4gpu (0.2.0.9999+master.eb6295c)
h2oai (1.2.2)
h2oai-client (1.2.2)
h2oaicore (1.2.2)

The python provided by DAI is version 3.6, the same as the one shipped with a standard Anaconda. The tensorflow 1.8 that I built from source can also be installed with pip and used in the same way.

[test1@ING ~]$ pip install /tmp/tensorflow-1.8.0-cp36-cp36m-linux_ppc64le.whl

[test1@ING ~]$ pip list | grep tensorflow
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
tensorflow (1.8.0)

[test1@ING ~]$ which python
/opt/h2oai/dai/python/bin/python

Now let's run the t1.py shown above in python. Here I simply copied and pasted it line by line.

[test1@ING ~]$ python
Python 3.6.4 (default, Jun 30 2018, 13:42:46)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import h2o4gpu
>>> import numpy as np
>>> X = np.array([[1.,1.], [1.,4.], [1.,0.]])
>>> model = h2o4gpu.KMeans(n_clusters=2,random_state=1234).fit(X)
>>> model.cluster_centers_
array([[1. , 0.5],
       [1. , 4. ]])
>>>
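The centers returned above can be sanity-checked by hand: a k-means solution is a fixed point where each center equals the mean of the points assigned to it. A small pure-Python check (no h2o4gpu needed):

```python
# Points from t1.py and the centers reported by h2o4gpu.KMeans above
X = [(1.0, 1.0), (1.0, 4.0), (1.0, 0.0)]
centers = [(1.0, 0.5), (1.0, 4.0)]

def nearest(p):
    """Index of the closest center by squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))

labels = [nearest(p) for p in X]
recomputed = []
for k in range(len(centers)):
    members = [p for p, lab in zip(X, labels) if lab == k]
    recomputed.append(tuple(sum(c) / len(members) for c in zip(*members)))
# recomputed == centers confirms the fixed point
```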

Next, let's run the t1.R shown above in R. Again, I simply copied and pasted it line by line.

[test1@ING ~]$ R

R version 3.4.1 (2017-06-30) -- "Single Candle"

> library(reticulate)
> library(h2o4gpu)

Attaching package: ‘h2o4gpu’

The following object is masked from ‘package:base’:

    transform

> use_python("/opt/h2oai/dai/python/bin/python")
> x <- iris[1:4]
> y <- as.integer(iris$Species)
> model <- h2o4gpu.random_forest_classifier() %>% fit(x, y)
> pred <- model %>% predict(x)
/opt/h2oai/dai/python/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
  if diff:
> library(Metrics)
> ce(actual = y, predicted = pred)
[1] 0.02666667
>
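The ce() value above is just the misclassification rate; 0.02666667 corresponds to 4 errors out of the 150 iris rows. A tiny Python equivalent of R's Metrics::ce (a hypothetical re-implementation for illustration, not the package itself):

```python
def ce(actual, predicted):
    """Classification error: fraction of positions where the labels disagree."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

# 4 mistakes out of 150 labels reproduces the rate seen above
rate = ce([1] * 150, [1] * 146 + [2] * 4)   # 4 / 150, about 0.0267
```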

Let's also check whether tensorflow properly picks up the GPUs. Of course it works fine.

[test1@ING ~]$ python
Python 3.6.4 (default, Jun 30 2018, 13:42:46)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf
/opt/h2oai/dai/python/lib/python3.6/site-packages/h5py-2.7.1-py3.6-linux-ppc64le.egg/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

>>> sess=tf.Session()
2018-08-06 10:21:44.054091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-06 10:21:44.568612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-06 10:21:45.033482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-06 10:21:45.461878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-06 10:21:45.462104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2018-08-06 10:21:47.427824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 10:21:47.427960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3
2018-08-06 10:21:47.427989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y Y Y
2018-08-06 10:21:47.428012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N Y Y
2018-08-06 10:21:47.428034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   Y Y N Y
2018-08-06 10:21:47.428055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   Y Y Y N
2018-08-06 10:21:47.431083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14857 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2018-08-06 10:21:47.987194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14857 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0004:05:00.0, compute capability: 7.0)
2018-08-06 10:21:48.813286: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14856 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0035:03:00.0, compute capability: 7.0)
2018-08-06 10:21:49.397252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14861 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)
>>>

Thursday, July 26, 2018

Building a pyarrow wheel file on ppc64le

To build h2o and h2o4gpu from open source, you need xgboost4j_gpu.so, and building that in turn requires pyarrow. However, if you try to install pyarrow on the ppc64le architecture, you will see an error like the following:

$ pip install pyarrow
Collecting pyarrow
Using cached https://files.pythonhosted.org/packages/be/2d/11751c477e4e7f4bb07ac7584aafabe0d0608c170e4bff67246d695ebdbe/pyarrow-0.9.0.tar.gz
...
[ 66%] Building CXX object CMakeFiles/lib.dir/lib.cxx.o
/tmp/pip-install-kil31a/pyarrow/build/temp.linux-ppc64le-2.7/lib.cxx:592:35: fatal error: arrow/python/platform.h: No such file or directory
#include "arrow/python/platform.h"
^
compilation terminated.
make[2]: *** [CMakeFiles/lib.dir/lib.cxx.o] Error 1
make[1]: *** [CMakeFiles/lib.dir/all] Error 2
make: *** [all] Error 2
error: command 'make' failed with exit status 2

I recently solved this problem with help from the arrow community:

https://github.com/apache/arrow/issues/2281

More briefly, the procedure can be summarized as follows.

First, on Red Hat, install the prerequisite packages:

[dhkim@ING ~]$ sudo yum install jemalloc jemalloc-devel boost boost-devel flex flex-devel bison bison-devel

[dhkim@ING ~]$ mkdir imsi
[dhkim@ING ~]$ cd imsi

The key to resolving this error is to first build arrow and parquet-cpp from source:

[dhkim@ING imsi]$ git clone https://github.com/apache/arrow.git

[dhkim@ING imsi]$ git clone https://github.com/apache/parquet-cpp.git

[dhkim@ING imsi]$ which python
~/anaconda2/bin/python

I use anaconda2 here, but the build works the same way with anaconda3. First, install the following packages with the conda command:

[dhkim@ING imsi]$ conda install numpy six setuptools cython pandas pytest cmake flatbuffers rapidjson boost-cpp thrift snappy zlib gflags brotli lz4-c zstd -c conda-forge

Here we will install arrow and parquet-cpp into a dist directory under the working directory:

[dhkim@ING imsi]$ mkdir dist
[dhkim@ING imsi]$ export ARROW_BUILD_TYPE=release
[dhkim@ING imsi]$ export ARROW_HOME=$(pwd)/dist
[dhkim@ING imsi]$ export PARQUET_HOME=$(pwd)/dist

[dhkim@ING imsi]$ mkdir arrow/cpp/build && cd arrow/cpp/build

[dhkim@ING build]$ cmake3 -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE -DCMAKE_INSTALL_PREFIX=$ARROW_HOME  -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=OFF  -DARROW_PARQUET=ON ..

[dhkim@ING build]$ make -j 8

[dhkim@ING build]$ make install
...
-- Installing: /home/dhkim/imsi/dist/include/arrow/python/platform.h
-- Installing: /home/dhkim/imsi/dist/include/arrow/python/pyarrow.h
-- Installing: /home/dhkim/imsi/dist/include/arrow/python/type_traits.h
-- Installing: /home/dhkim/imsi/dist/lib64/pkgconfig/arrow-python.pc

[dhkim@ING build]$ cd ~/imsi/arrow/python

[dhkim@ING python]$ MAKEFLAGS=-j8 ARROW_HOME=/home/dhkim/imsi/dist PARQUET_HOME=/home/dhkim/imsi/dist python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet --inplace

[dhkim@ING python]$ export LD_LIBRARY_PATH=/home/dhkim/imsi/dist/lib64:$LD_LIBRARY_PATH

Now we are ready to build pyarrow. However, due to a minor bug on the arrow side, you first need to create soft links named *.so. with a trailing dot, as follows:

[dhkim@ING python]$ ln -s /home/dhkim/imsi/dist/lib64/libarrow_python.so.11.0.0 /home/dhkim/imsi/dist/lib64/libarrow_python.so.
[dhkim@ING python]$ ln -s /home/dhkim/imsi/dist/lib64/libarrow.so.11.0.0 /home/dhkim/imsi/dist/lib64/libarrow.so.
[dhkim@ING python]$ ln -s /home/dhkim/imsi/dist/lib64/libparquet.so.1.4.1 /home/dhkim/imsi/dist/lib64/libparquet.so.
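If you need to recreate these trailing-dot links elsewhere, they can be scripted. Here is a sketch that demonstrates the naming in a temporary directory; the version numbers 11.0.0 and 1.4.1 are the ones from this particular build:

```python
import os
import tempfile

libdir = tempfile.mkdtemp()
for lib, ver in [("libarrow_python", "11.0.0"),
                 ("libarrow", "11.0.0"),
                 ("libparquet", "1.4.1")]:
    target = os.path.join(libdir, "%s.so.%s" % (lib, ver))
    open(target, "w").close()                           # stand-in for the real .so
    # note the trailing dot in the link name, required by the bug described above
    os.symlink(target, os.path.join(libdir, lib + ".so."))
```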

Now build the wheel file:

[dhkim@ING python]$ python setup.py build_ext --build-type=release --with-parquet --bundle-arrow-cpp bdist_wheel

The wheel is created under the dist directory:

[dhkim@ING python]$ ls -l dist/pyarrow-0.10.1.dev687+g18a61f6-cp36-cp36m-linux_ppc64le.whl
-rw-rw-r-- 1 dhkim dhkim 7195829 Jul 26 16:03 dist/pyarrow-0.10.1.dev687+g18a61f6-cp36-cp36m-linux_ppc64le.whl

You can install it with pip and confirm that the import works:

[dhkim@ING python]$ pip install dist/pyarrow-0.10.1.dev687+g18a61f6-cp36-cp36m-linux_ppc64le.whl

[dhkim@ING python]$ pip list | grep pyarrow
pyarrow                           0.10.1.dev687+g18a61f6

[dhkim@ING python]$ python
Python 2.7.15 |Anaconda, Inc.| (default, May  1 2018, 23:32:32)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>>

So that you don't have to go through this process yourself, I have uploaded the pyarrow whl files for python2.7 and python3.6 to Google Drive below.


wheel for python3.6


For those who got errors like "ImportError: libarrow.so.10: cannot open shared object file: No such file or directory" from the wheel file I uploaded here...

What we need is just perseverance.


1.  First, install the pyarrow*.whl from this blog, and then...

2.  Make soft links as needed.  My wheel file contains awkward names like "libarrow.so." due to the bug described in https://github.com/apache/arrow/issues/2281 .

[u0017649@sys-96013 pyarrow]$ ln -s /home/u0017649/anaconda3/lib/python3.6/site-packages/pyarrow/libarrow.so. /home/u0017649/anaconda3/lib/python3.6/site-packages/pyarrow/libarrow.so.10

[u0017649@sys-96013 ~]$ ln -s /home/u0017649/anaconda3/lib/python3.6/site-packages/pyarrow/libarrow_python.so. /home/u0017649/anaconda3/lib/python3.6/site-packages/pyarrow/libarrow_python.so.10

[u0017649@sys-96013 ~]$ ln -s /home/u0017649/anaconda3/lib/python3.6/site-packages/pyarrow/libplasma.so. /home/u0017649/anaconda3/lib/python3.6/site-packages/pyarrow/libplasma.so.10

3.  You might still get more errors like the ones below.  These can be addressed by installing OS packages.

ImportError: libboost_system-mt.so.1.53.0: cannot open shared object file: No such file or directory
ImportError: libboost_filesystem-mt.so.1.53.0: cannot open shared object file: No such file or directory

[u0017649@sys-96013 ~]$ sudo yum install boost-system

[u0017649@sys-96013 ~]$ sudo yum install boost-filesystem

[u0017649@sys-96013 ~]$ sudo yum install boost-regex

4.  You may or may not get the following weird error.  It can be addressed by upgrading numpy; please refer to
https://issues.apache.org/jira/browse/ARROW-3141 .

>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/u0017649/anaconda3/lib/python3.6/site-packages/pyarrow/__init__.py", line 50, in <module>
    import pyarrow.compat as compat
AttributeError: module 'pyarrow' has no attribute 'compat'


[u0017649@sys-96013 ~]$ pip install numpy --upgrade
Collecting numpy
  Downloading https://files.pythonhosted.org/packages/2d/80/1809de155bad674b494248bcfca0e49eb4c5d8bee58f26fe7a0dd45029e2/numpy-1.15.4.zip (4.5MB)
    100% |████████████████████████████████| 4.5MB 271kB/s
Building wheels for collected packages: numpy
  Running setup.py bdist_wheel for numpy ... done
  Stored in directory: /home/u0017649/.cache/pip/wheels/13/6b/70/4b5d7861227307f91716c31698240e08c6ec5486d9ee82a97b
Successfully built numpy
Installing collected packages: numpy
  Found existing installation: numpy 1.13.1
    Uninstalling numpy-1.13.1:
      Successfully uninstalled numpy-1.13.1
Successfully installed numpy-1.15.4


5.  And finally, Voila !

[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>>