Tuesday, March 13, 2018

Building BVLC Caffe on an AC922 and running CIFAR-10


This walkthrough assumes an environment where Anaconda2 is already installed (for pycaffe). First, install the packages below.

[user1@ac922 ~]$ sudo yum install git gcc gcc-c++ python-devel python-enum34 numpy cmake automake snappy.ppc64le boost-python.ppc64le libgfortran4.ppc64le gtk+.ppc64le gtk+-devel.ppc64le gtk2.ppc64le gtk3.ppc64le gstreamer.ppc64le gstreamer-tools.ppc64le libdc1394.ppc64le libdc1394-tools.ppc64le
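Before moving on, it can help to confirm that the basic build tools really are on the PATH. The following is only a minimal sketch: missing_tools is a hypothetical helper name, and the tool list is an assumption about what the later steps use.

```shell
# missing_tools: print each named command that cannot be found on PATH.
# (Hypothetical helper, not part of any package installed above.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example: check the commands used in the following steps.
# Anything printed here still needs to be installed.
missing_tools git gcc g++ cmake make wget unzip
```

An empty result means every listed command was found.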

1. Install HDF5.

[user1@ac922 ~]$ wget https://support.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.10.1.tar
[user1@ac922 ~]$ tar -xf hdf5-1.10.1.tar
[user1@ac922 ~]$ cd hdf5-1.10.1
[user1@ac922 hdf5-1.10.1]$ ./configure --prefix=/usr/local --enable-fortran --enable-cxx --build=powerpc64le-linux-gnu
[user1@ac922 hdf5-1.10.1]$ make && sudo make install
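The builds in this and the later steps can take a while when run on a single core. As a sketch, assuming GNU make and the coreutils nproc command are available, the job count can be derived from the number of CPUs. JOBS is just an illustrative variable name.

```shell
# Use one make job per available CPU; fall back to 1 if nproc is missing.
JOBS=$(nproc 2>/dev/null || echo 1)
echo "building with $JOBS parallel jobs"

# Then, inside a source tree such as hdf5-1.10.1:
#   make -j"$JOBS" && sudo make install
```

The same -j option can be passed to the make invocations in the other steps.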

2. Install boost.

[user1@ac922 ~]$ wget https://dl.bintray.com/boostorg/release/1.66.0/source/boost_1_66_0.tar.gz
[user1@ac922 ~]$ tar -zxf boost_1_66_0.tar.gz
[user1@ac922 ~]$ cd boost_1_66_0
[user1@ac922 boost_1_66_0]$ ./bootstrap.sh --prefix=/usr/local
[user1@ac922 boost_1_66_0]$ ./b2
[user1@ac922 boost_1_66_0]$ sudo ./b2 install

3. Install gflags.

[user1@ac922 ~]$ wget https://github.com/schuhschuh/gflags/archive/master.zip
[user1@ac922 ~]$ unzip master.zip && cd gflags-master
[user1@ac922 gflags-master]$ mkdir build && cd build
[user1@ac922 build]$ cmake .. -DBUILD_SHARED_LIBS=ON -DBUILD_STATIC_LIBS=ON -DBUILD_gflags_LIB=ON
[user1@ac922 build]$ make && sudo make install

4. Install glog.

[user1@ac922 ~]$ wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/google-glog/glog-0.3.3.tar.gz
[user1@ac922 ~]$ tar zxvf glog-0.3.3.tar.gz
[user1@ac922 ~]$ cd glog-0.3.3
[user1@ac922 glog-0.3.3]$ ./configure --build=powerpc64le-redhat-linux-gnu
[user1@ac922 glog-0.3.3]$ make && sudo make install


5. Install LMDB.

[user1@ac922 ~]$ git clone https://github.com/LMDB/lmdb
[user1@ac922 ~]$ cd lmdb/libraries/liblmdb
[user1@ac922 liblmdb]$ make && sudo make install

6. Install LevelDB.

[user1@ac922 files]$ wget https://rpmfind.net/linux/epel/7/ppc64le/Packages/l/leveldb-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ wget https://www.rpmfind.net/linux/epel/7/ppc64le/Packages/l/leveldb-devel-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ sudo rpm -Uvh leveldb-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ sudo rpm -Uvh leveldb-devel-1.12.0-11.el7.ppc64le.rpm

7. Install OpenBLAS.

[user1@ac922 ~]$ git clone https://github.com/xianyi/OpenBLAS.git
[user1@ac922 ~]$ cd OpenBLAS
[user1@ac922 OpenBLAS]$ git checkout power8
[user1@ac922 OpenBLAS]$ make TARGET=POWER8 LDFLAGS="-fopenmp"
[user1@ac922 OpenBLAS]$ sudo make TARGET=POWER8 LDFLAGS="-fopenmp" install
...
Copying the static library to /opt/OpenBLAS/lib
Copying the shared library to /opt/OpenBLAS/lib
Generating OpenBLASConfig.cmake in /opt/OpenBLAS/lib/cmake/openblas
Generating OpenBLASConfigVersion.cmake in /opt/OpenBLAS/lib/cmake/openblas
Install OK!
make[1]: Leaving directory `/home/user1/OpenBLAS'


8. Install OpenCV.

[user1@ac922 ~]$ git clone --recursive https://github.com/opencv/opencv.git
[user1@ac922 ~]$ git clone --recursive https://github.com/opencv/opencv_contrib.git
[user1@ac922 ~]$ cd opencv
[user1@ac922 opencv]$ git checkout tags/3.4.1
[user1@ac922 opencv]$ mkdir build && cd build
[user1@ac922 build]$ which protoc
~/anaconda2/bin/protoc
[user1@ac922 build]$ export PROTOBUF_PROTOC_EXECUTABLE="$HOME/anaconda2/bin/protoc"
[user1@ac922 build]$ cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local -DOPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules  -D WITH_EIGEN=OFF -DBUILD_LIBPROTOBUF_FROM_SOURCES=ON  ..
[user1@ac922 build]$ make && sudo make install
...
-- Installing: /usr/local/bin/opencv_visualisation
-- Set runtime path of "/usr/local/bin/opencv_visualisation" to "/usr/local/lib64:/usr/local/cuda/lib64"
-- Installing: /usr/local/bin/opencv_interactive-calibration
-- Set runtime path of "/usr/local/bin/opencv_interactive-calibration" to "/usr/local/lib64:/usr/local/cuda/lib64"
-- Installing: /usr/local/bin/opencv_version
-- Set runtime path of "/usr/local/bin/opencv_version" to "/usr/local/lib64:/usr/local/cuda/lib64"

9. Build NCCL.

[user1@ac922 ~]$ git clone https://github.com/NVIDIA/nccl
[user1@ac922 ~]$ cd nccl
[user1@ac922 nccl]$ make
[user1@ac922 nccl]$ sudo make install

10. Now Caffe itself can finally be built.

[user1@ac922 ~]$ git clone https://github.com/BVLC/caffe.git
[user1@ac922 ~]$ cd caffe
[user1@ac922 caffe]$ cp Makefile.config.example Makefile.config
[user1@ac922 caffe]$ vi Makefile.config
...
# USE_CUDNN := 1
USE_CUDNN := 1
...
# OPENCV_VERSION := 3
OPENCV_VERSION := 3
...
#CUDA_ARCH := -gencode arch=compute_20,code=sm_20 \
                -gencode arch=compute_20,code=sm_21 \
                -gencode arch=compute_30,code=sm_30 \
                -gencode arch=compute_35,code=sm_35 \
                -gencode arch=compute_50,code=sm_50 \
                -gencode arch=compute_52,code=sm_52 \
                -gencode arch=compute_60,code=sm_60 \
                -gencode arch=compute_61,code=sm_61 \
                -gencode arch=compute_61,code=compute_61
CUDA_ARCH := -gencode arch=compute_60,code=sm_60 \
                -gencode arch=compute_61,code=sm_61 \
                -gencode arch=compute_61,code=compute_61
...
# BLAS := atlas
BLAS := open
...
#PYTHON_INCLUDE := /usr/include/python2.7 \
                /usr/lib/python2.7/dist-packages/numpy/core/include
ANACONDA_HOME := $(HOME)/anaconda2
PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
                 $(ANACONDA_HOME)/include/python2.7 \
                 $(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include
...
# PYTHON_LIB := /usr/lib
PYTHON_LIB := $(ANACONDA_HOME)/lib
...
# WITH_PYTHON_LAYER := 1
WITH_PYTHON_LAYER := 1
...
# USE_NCCL := 1
USE_NCCL := 1

# Added to prevent the error "~/anaconda2/lib/libpng16.so.16: undefined reference to `inflateValidate@ZLIB_1.2.9'"
LINKFLAGS := -Wl,-rpath,$(HOME)/anaconda2/lib
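The simple on/off switches in the listing above could also be applied without opening an editor. This is a rough sketch, assuming the lines in Makefile.config look exactly like "# NAME := value". enable_option is a hypothetical helper and not something Caffe provides. Multi-line settings such as CUDA_ARCH and PYTHON_INCLUDE still need to be edited by hand.

```shell
# enable_option FILE NAME: uncomment a "# NAME := value" line in FILE.
# (Hypothetical helper, not part of Caffe; a FILE.bak backup is left behind.)
enable_option() {
    sed -i.bak "s/^# *\(${2} := .*\)\$/\1/" "$1"
}

# Example: run from the caffe directory after copying Makefile.config.example.
if [ -f Makefile.config ]; then
    for opt in USE_CUDNN OPENCV_VERSION WITH_PYTHON_LAYER USE_NCCL; do
        enable_option Makefile.config "$opt"
    done
fi
```

Lines that are already uncommented are left unchanged, so the loop can be run more than once.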


At this point, create the symbolic links below; otherwise the build fails with errors such as "cannot find -lsnappy".

[user1@ac922 caffe]$ sudo ln -s /usr/lib64/libsnappy.so.1 /usr/lib64/libsnappy.so
[user1@ac922 caffe]$ sudo ln -s /usr/lib64/libboost_python.so.1.53.0 /usr/lib64/libboost_python.so
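Running ln -s a second time fails if the link is already there. A small guard avoids that when the steps are repeated. This is only a sketch: link_if_missing and the SUDO variable are hypothetical names introduced for illustration.

```shell
# link_if_missing TARGET LINK: create LINK -> TARGET unless LINK already exists.
# Set SUDO=sudo when the link lives in a system directory.
link_if_missing() {
    target=$1
    link=$2
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "exists: $link"
    else
        ${SUDO:-} ln -s "$target" "$link" && echo "linked: $link -> $target"
    fi
}

# Example, using the same paths as the commands above:
#   SUDO=sudo link_if_missing /usr/lib64/libsnappy.so.1 /usr/lib64/libsnappy.so
#   SUDO=sudo link_if_missing /usr/lib64/libboost_python.so.1.53.0 /usr/lib64/libboost_python.so
```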

[user1@ac922 caffe]$ make all
...
CXX/LD -o .build_release/examples/cpp_classification/classification.bin
CXX examples/mnist/convert_mnist_data.cpp
CXX/LD -o .build_release/examples/mnist/convert_mnist_data.bin
CXX examples/siamese/convert_mnist_siamese_data.cpp
CXX/LD -o .build_release/examples/siamese/convert_mnist_siamese_data.bin

[user1@ac922 caffe]$ sudo mkdir /opt/caffe
[user1@ac922 caffe]$ sudo cp -r build/* /opt/caffe

11. Now run the CIFAR-10 example.

[user1@ac922 caffe]$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib64:/usr/lib:/usr/lib64
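If LD_LIBRARY_PATH starts out empty, the export above produces a value with a leading colon, and the dynamic linker treats that empty entry as the current directory. A slightly more defensive sketch of the same setting uses standard shell parameter expansion. EXTRA_LIBS is just an illustrative variable name.

```shell
# Append directories without leaving an empty (leading) entry when
# LD_LIBRARY_PATH was previously unset or empty.
EXTRA_LIBS=/usr/local/lib:/usr/local/lib64:/usr/lib:/usr/lib64
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$EXTRA_LIBS"
echo "$LD_LIBRARY_PATH"
```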

[user1@ac922 caffe]$ export CAFFE_HOME=/home/user1/caffe/build/tools

[user1@ac922 caffe]$ cd data/cifar10/

[user1@ac922 cifar10]$ ./get_cifar10.sh
Downloading...
--2018-03-13 14:13:14--  http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
Resolving www.cs.toronto.edu (www.cs.toronto.edu)... 128.100.3.30
Connecting to www.cs.toronto.edu (www.cs.toronto.edu)|128.100.3.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170052171 (162M) [application/x-gzip]
Saving to: ‘cifar-10-binary.tar.gz’

100%[==================================================================>] 170,052,171 11.0MB/s   in 15s

2018-03-13 14:13:29 (11.0 MB/s) - ‘cifar-10-binary.tar.gz’ saved [170052171/170052171]

Unzipping...
Done.

[user1@ac922 cifar10]$ ls -la
total 180084
drwxrwxr-x. 2 user1 user1      213 Mar 13 14:13 .
drwxrwxr-x. 5 user1 user1       50 Mar 13 10:50 ..
-rw-r--r--. 1 user1 user1       61 Jun  5  2009 batches.meta.txt
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 data_batch_1.bin
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 data_batch_2.bin
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 data_batch_3.bin
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 data_batch_4.bin
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 data_batch_5.bin
-rwxrwxr-x. 1 user1 user1      506 Mar 13 10:50 get_cifar10.sh
-rw-r--r--. 1 user1 user1       88 Jun  5  2009 readme.html
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 test_batch.bin

[user1@ac922 caffe]$ ./examples/cifar10/create_cifar10.sh

[user1@ac922 caffe]$ ls -l examples/cifar10/*_lmdb
examples/cifar10/cifar10_test_lmdb:
total 35656
-rw-rw-r--. 1 user1 user1 36503552 Mar 13 14:14 data.mdb
-rw-rw-r--. 1 user1 user1     8192 Mar 13 14:14 lock.mdb

examples/cifar10/cifar10_train_lmdb:
total 177992
-rw-rw-r--. 1 user1 user1 182255616 Mar 13 14:14 data.mdb
-rw-rw-r--. 1 user1 user1      8192 Mar 13 14:14 lock.mdb
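The listing above can also be checked from a script before starting a long training run. A minimal sketch, where check_lmdb is a hypothetical helper that only tests whether data.mdb exists and is non-empty:

```shell
# check_lmdb DIR: succeed only if DIR holds a non-empty data.mdb file.
check_lmdb() {
    if [ -s "$1/data.mdb" ]; then
        echo "ok: $1"
    else
        echo "missing or empty: $1" >&2
        return 1
    fi
}

# Example (run from the caffe directory):
for db in examples/cifar10/cifar10_train_lmdb examples/cifar10/cifar10_test_lmdb; do
    check_lmdb "$db" || echo "re-run ./examples/cifar10/create_cifar10.sh"
done
```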

[user1@ac922 caffe]$ vi ./examples/cifar10/train_full.sh
#!/usr/bin/env sh
set -e

TOOLS=./build/tools

$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_full_solver.prototxt $@

# reduce learning rate by factor of 10
$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_full_solver_lr1.prototxt \
    --snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate.h5 $@
#    --snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate $@ 

# reduce learning rate by factor of 10
$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_full_solver_lr2.prototxt \
    --snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate.h5 $@
#    --snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate $@

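# For some reason the snapshot is written as cifar10_full_iter_60000.solverstate.h5 rather than cifar10_full_iter_60000.solverstate (the training log reports snapshotting to an HDF5 file), so the file names above were changed to match.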
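Instead of hard-coding one of the two file names, the script could pick whichever snapshot file actually exists. This is only a sketch of that idea: find_snapshot is a hypothetical helper, not part of Caffe, and the usage lines are an assumption about how train_full.sh might call it.

```shell
# find_snapshot PREFIX ITER: print the solverstate file that exists for ITER,
# preferring the HDF5 name over the plain one.
find_snapshot() {
    base="${1}_iter_${2}.solverstate"
    if [ -f "$base.h5" ]; then
        echo "$base.h5"
    elif [ -f "$base" ]; then
        echo "$base"
    else
        echo "no snapshot found for $base" >&2
        return 1
    fi
}

# Possible use inside train_full.sh:
#   SNAP=$(find_snapshot examples/cifar10/cifar10_full 60000)
#   $TOOLS/caffe train \
#       --solver=examples/cifar10/cifar10_full_solver_lr1.prototxt \
#       --snapshot="$SNAP" "$@"
```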

[user1@ac922 caffe]$ time ./examples/cifar10/train_full.sh
I0313 14:15:55.463438 114263 caffe.cpp:204] Using GPUs 0
I0313 14:15:55.529319 114263 caffe.cpp:209] GPU 0: Tesla V100-SXM2-16GB
...
I0313 14:34:16.333791 126407 solver.cpp:239] Iteration 69800 (177.976 iter/s, 1.12375s/200 iters), loss = 0.332006
I0313 14:34:16.333875 126407 solver.cpp:258]     Train net output #0: loss = 0.332006 (* 1 = 0.332006 loss)
I0313 14:34:16.333892 126407 sgd_solver.cpp:112] Iteration 69800, lr = 1e-05
I0313 14:34:17.436130 126413 data_layer.cpp:73] Restarting data prefetching from start.
I0313 14:34:17.453459 126407 solver.cpp:478] Snapshotting to HDF5 file examples/cifar10/cifar10_full_iter_70000.caffemodel.h5
I0313 14:34:17.458664 126407 sgd_solver.cpp:290] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_full_iter_70000.solverstate.h5
I0313 14:34:17.461360 126407 solver.cpp:331] Iteration 70000, loss = 0.294117
I0313 14:34:17.461383 126407 solver.cpp:351] Iteration 70000, Testing net (#0)
I0313 14:34:17.610864 126424 data_layer.cpp:73] Restarting data prefetching from start.
I0313 14:34:17.612763 126407 solver.cpp:418]     Test net output #0: accuracy = 0.8169
I0313 14:34:17.612794 126407 solver.cpp:418]     Test net output #1: loss = 0.533315 (* 1 = 0.533315 loss)
I0313 14:34:17.612810 126407 solver.cpp:336] Optimization Done.
I0313 14:34:17.612821 126407 caffe.cpp:250] Optimization Done.

real    6m51.615s
user    7m30.483s
sys     1m5.158s
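To pull the final test accuracy out of a long log like the one above, the output can be saved to a file and filtered. A minimal sketch, assuming the log lines have the same format as shown here. final_accuracy and the file name train_full.log are hypothetical.

```shell
# final_accuracy LOGFILE: print the last reported test accuracy in a Caffe log.
final_accuracy() {
    awk '/Test net output #0: accuracy =/ { acc = $NF }
         END { if (acc != "") print acc }' "$1"
}

# Caffe writes its log to stderr, so capture both streams when training:
#   ./examples/cifar10/train_full.sh 2>&1 | tee train_full.log
#   final_accuracy train_full.log
```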
