레이블이 NCCL인 게시물을 표시합니다. 모든 게시물 표시
레이블이 NCCL인 게시물을 표시합니다. 모든 게시물 표시

2018년 3월 13일 화요일

AC922에서 bvlc caffe 빌드하고 cifar10 돌려보기


(Pycaffe를 위해) Anaconda2가 설치된 환경을 가정하겠습니다.  먼저 아래 package들부터 설치합니다.

[user1@ac922 ~]$ sudo yum install git gcc gcc-c++ python-devel python-enum34 numpy cmake automake snappy.ppc64le boost-python.ppc64le libgfortran4.ppc64le gtk+.ppc64le gtk+-devel.ppc64le gtk2.ppc64le gtk3.ppc64le gstreamer.ppc64le gstreamer-tools.ppc64le libdc1394.ppc64le libdc1394-tools.ppc64le

1. HDF5를 설치합니다.

[user1@ac922 ~]$ wget https://support.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.10.1.tar
[user1@ac922 ~]$ tar -xf hdf5-1.10.1.tar
[user1@ac922 ~]$ cd hdf5-1.10.1
[user1@ac922 hdf5-1.10.1]$ ./configure --prefix=/usr/local --enable-fortran --enable-cxx --build=powerpc64le-linux-gnu
[user1@ac922 hdf5-1.10.1]$ make && sudo make install

2. boost를 설치합니다.

[user1@ac922 ~]$ wget https://dl.bintray.com/boostorg/release/1.66.0/source/boost_1_66_0.tar.gz
[user1@ac922 ~]$ tar -zxf boost_1_66_0.tar.gz
[user1@ac922 ~]$ cd boost_1_66_0
[user1@ac922 boost_1_66_0]$ ./bootstrap.sh --prefix=/usr/local
[user1@ac922 boost_1_66_0]$ ./b2
[user1@ac922 boost_1_66_0]$ sudo ./b2 install

3.  GFLAGS를 설치합니다.

[user1@ac922 ~]$ wget https://github.com/schuhschuh/gflags/archive/master.zip
[user1@ac922 ~]$ unzip master.zip && cd gflags-master
[user1@ac922 gflags-master]$ mkdir build && cd build
[user1@ac922 build]$ cmake .. -DBUILD_SHARED_LIBS=ON -DBUILD_STATIC_LIBS=ON -DBUILD_gflags_LIB=ON
[user1@ac922 build]$ make && sudo make install

4.  GLOG를 설치합니다.

[user1@ac922 ~]$ wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/google-glog/glog-0.3.3.tar.gz
[user1@ac922 ~]$ tar zxvf glog-0.3.3.tar.gz
[user1@ac922 ~]$ cd glog-0.3.3
[user1@ac922 glog-0.3.3]$ ./configure --build=powerpc64le-redhat-linux-gnu
[user1@ac922 glog-0.3.3]$ make && sudo make install


5.  LMDB를 설치합니다.

[user1@ac922 ~]$ git clone https://github.com/LMDB/lmdb
[[user1@ac922 ~]$ cd lmdb/libraries/liblmdb
[user1@ac922 liblmdb]$ make && sudo make install

6.  LEVELDB를 설치합니다.

[user1@ac922 files]$ wget https://rpmfind.net/linux/epel/7/ppc64le/Packages/l/leveldb-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ wget https://www.rpmfind.net/linux/epel/7/ppc64le/Packages/l/leveldb-devel-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ sudo rpm -Uvh leveldb-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ sudo rpm -Uvh leveldb-devel-1.12.0-11.el7.ppc64le.rpm

7.  OpenBLAS를 설치합니다.

[user1@ac922 ~]$ git clone https://github.com/xianyi/OpenBLAS.git
[user1@ac922 ~]$ cd OpenBLAS
[user1@ac922 OpenBLAS]$ git checkout power8
[user1@ac922 OpenBLAS]$ make TARGET=POWER8 LDFLAGS="-fopenmp"
[user1@ac922 OpenBLAS]$ sudo make TARGET=POWER8 LDFLAGS="-fopenmp" install
...
Copying the static library to /opt/OpenBLAS/lib
Copying the shared library to /opt/OpenBLAS/lib
Generating OpenBLASConfig.cmake in /opt/OpenBLAS/lib/cmake/openblas
Generating OpenBLASConfigVersion.cmake in /opt/OpenBLAS/lib/cmake/openblas
Install OK!
make[1]: Leaving directory `/home/user1/OpenBLAS'


8.  OpenCV를 설치합니다.   

[user1@ac922 ~]$ git clone --recursive https://github.com/opencv/opencv.git
[user1@ac922 ~]$ git clone --recursive https://github.com/opencv/opencv_contrib.git
[user1@ac922 ~]$ cd opencv
[user1@ac922 opencv]$ git checkout tags/3.4.1
[user1@ac922 opencv]$ mkdir build && cd build
[user1@ac922 build]$ which protoc
~/anaconda2/bin/protoc
[user1@ac922 build]$ export PROTOBUF_PROTOC_EXECUTABLE="~/anaconda2/bin/protoc"
[user1@ac922 build]$ cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local -DOPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules  -D WITH_EIGEN=OFF -DBUILD_LIBPROTOBUF_FROM_SOURCES=ON  ..
[user1@ac922 build]$ make && sudo make install
...
-- Installing: /usr/local/bin/opencv_visualisation
-- Set runtime path of "/usr/local/bin/opencv_visualisation" to "/usr/local/lib64:/usr/local/cuda/lib64"
-- Installing: /usr/local/bin/opencv_interactive-calibration
-- Set runtime path of "/usr/local/bin/opencv_interactive-calibration" to "/usr/local/lib64:/usr/local/cuda/lib64"
-- Installing: /usr/local/bin/opencv_version
-- Set runtime path of "/usr/local/bin/opencv_version" to "/usr/local/lib64:/usr/local/cuda/lib64"

9.  NCCL을 빌드합니다.

[user1@ac922 ~]$ git clone https://github.com/NVIDIA/nccl
[user1@ac922 ~]$ cd nccl
[user1@ac922 nccl]$ make
[user1@ac922 nccl]$ sudo make install

10.  이제 비로소 caffe를 빌드할 수 있습니다.

[user1@ac922 ~]$ git clone https://github.com/BVLC/caffe.git
[user1@ac922 ~]$ cd caffe
[user1@ac922 caffe]$ cp Makefile.config.example Makefile.config
[user1@ac922 caffe]$ vi Makefile.config
...
# USE_CUDNN := 1
USE_CUDNN := 1
...
# OPENCV_VERSION := 3
OPENCV_VERSION := 3
...
#CUDA_ARCH := -gencode arch=compute_20,code=sm_20 \
                -gencode arch=compute_20,code=sm_21 \
                -gencode arch=compute_30,code=sm_30 \
                -gencode arch=compute_35,code=sm_35 \
                -gencode arch=compute_50,code=sm_50 \
                -gencode arch=compute_52,code=sm_52 \
                -gencode arch=compute_60,code=sm_60 \
                -gencode arch=compute_61,code=sm_61 \
                -gencode arch=compute_61,code=compute_61
CUDA_ARCH := -gencode arch=compute_60,code=sm_60 \
                -gencode arch=compute_61,code=sm_61 \
                -gencode arch=compute_61,code=compute_61
...
# BLAS := atlas
BLAS := open
...
#PYTHON_INCLUDE := /usr/include/python2.7 \
                /usr/lib/python2.7/dist-packages/numpy/core/include
ANACONDA_HOME := $(HOME)/anaconda2
PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
                 $(ANACONDA_HOME)/include/python2.7 \
                 $(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include
...
# PYTHON_LIB := /usr/lib
PYTHON_LIB := $(ANACONDA_HOME)/lib
...
# WITH_PYTHON_LAYER := 1
WITH_PYTHON_LAYER := 1
...
# USE_NCCL := 1
USE_NCCL := 1

LINKFLAGS := -Wl,-rpath,$(HOME)/anaconda2/lib   # added to prevent "~/anaconda2/lib/libpng16.so.16 undefined reference to `inflateValidate@ZLIB_1.2.9" error


여기서 아래와 같이 soft link를 걸어주어야  cannot find -lsnappy 등의 error를 피할 수 있습니다.

[user1@ac922 caffe]$ sudo ln -s /usr/lib64/libsnappy.so.1 /usr/lib64/libsnappy.so
[user1@ac922 caffe]$ sudo ln -s /usr/lib64/libboost_python.so.1.53.0 /usr/lib64/libboost_python.so

[user1@ac922 caffe]$ make all
...
CXX/LD -o .build_release/examples/cpp_classification/classification.bin
CXX examples/mnist/convert_mnist_data.cpp
CXX/LD -o .build_release/examples/mnist/convert_mnist_data.bin
CXX examples/siamese/convert_mnist_siamese_data.cpp
CXX/LD -o .build_release/examples/siamese/convert_mnist_siamese_data.bin

[user1@ac922 caffe]$ sudo mkdir /opt/caffe
[user1@ac922 caffe]$ sudo cp -r build/* /opt/caffe

11.  이제 cifar10을 수행해봅니다.

[user1@ac922 caffe]$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib64:/usr/lib:/usr/lib64

[user1@ac922 caffe]$ export CAFFE_HOME=/home/user1/caffe/build/tools

[user1@ac922 caffe]$ cd data/cifar10/

[user1@ac922 cifar10]$ ./get_cifar10.sh
Downloading...
--2018-03-13 14:13:14--  http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
Resolving www.cs.toronto.edu (www.cs.toronto.edu)... 128.100.3.30
Connecting to www.cs.toronto.edu (www.cs.toronto.edu)|128.100.3.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170052171 (162M) [application/x-gzip]
Saving to: ‘cifar-10-binary.tar.gz’

100%[==================================================================>] 170,052,171 11.0MB/s   in 15s

2018-03-13 14:13:29 (11.0 MB/s) - ‘cifar-10-binary.tar.gz’ saved [170052171/170052171]

Unzipping...
Done.

[user1@ac922 cifar10]$ ls -la
total 180084
drwxrwxr-x. 2 user1 user1      213 Mar 13 14:13 .
drwxrwxr-x. 5 user1 user1       50 Mar 13 10:50 ..
-rw-r--r--. 1 user1 user1       61 Jun  5  2009 batches.meta.txt
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 data_batch_1.bin
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 data_batch_2.bin
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 data_batch_3.bin
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 data_batch_4.bin
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 data_batch_5.bin
-rwxrwxr-x. 1 user1 user1      506 Mar 13 10:50 get_cifar10.sh
-rw-r--r--. 1 user1 user1       88 Jun  5  2009 readme.html
-rw-r--r--. 1 user1 user1 30730000 Jun  5  2009 test_batch.bin

[user1@ac922 caffe]$ ./examples/cifar10/create_cifar10.sh

[user1@ac922 caffe]$ ls -l examples/cifar10/*_lmdb
examples/cifar10/cifar10_test_lmdb:
total 35656
-rw-rw-r--. 1 user1 user1 36503552 Mar 13 14:14 data.mdb
-rw-rw-r--. 1 user1 user1     8192 Mar 13 14:14 lock.mdb

examples/cifar10/cifar10_train_lmdb:
total 177992
-rw-rw-r--. 1 user1 user1 182255616 Mar 13 14:14 data.mdb
-rw-rw-r--. 1 user1 user1      8192 Mar 13 14:14 lock.mdb

[user1@ac922 caffe]$ vi ./examples/cifar10/train_full.sh
#!/usr/bin/env sh
set -e

TOOLS=./build/tools

$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_full_solver.prototxt $@

# reduce learning rate by factor of 10
$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_full_solver_lr1.prototxt \
    --snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate.h5 $@
#    --snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate $@ 

# reduce learning rate by factor of 10
$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_full_solver_lr2.prototxt \
    --snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate.h5 $@
#    --snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate $@

# 무슨 이유에선지 cifar10_full_iter_60000.solverstate 대신 cifar10_full_iter_60000.solverstate.h5 이라는 파일이 생기므로 그에 따라 파일 이름 변경

[user1@ac922 caffe]$ time ./examples/cifar10/train_full.sh
I0313 14:15:55.463438 114263 caffe.cpp:204] Using GPUs 0
I0313 14:15:55.529319 114263 caffe.cpp:209] GPU 0: Tesla V100-SXM2-16GB
...
I0313 14:34:16.333791 126407 solver.cpp:239] Iteration 69800 (177.976 iter/s, 1.12375s/200 iters), loss = 0.332006
I0313 14:34:16.333875 126407 solver.cpp:258]     Train net output #0: loss = 0.332006 (* 1 = 0.332006 loss)
I0313 14:34:16.333892 126407 sgd_solver.cpp:112] Iteration 69800, lr = 1e-05
I0313 14:34:17.436130 126413 data_layer.cpp:73] Restarting data prefetching from start.
I0313 14:34:17.453459 126407 solver.cpp:478] Snapshotting to HDF5 file examples/cifar10/cifar10_full_iter_70000.caffemodel.h5
I0313 14:34:17.458664 126407 sgd_solver.cpp:290] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_full_iter_70000.solverstate.h5
I0313 14:34:17.461360 126407 solver.cpp:331] Iteration 70000, loss = 0.294117
I0313 14:34:17.461383 126407 solver.cpp:351] Iteration 70000, Testing net (#0)
I0313 14:34:17.610864 126424 data_layer.cpp:73] Restarting data prefetching from start.
I0313 14:34:17.612763 126407 solver.cpp:418]     Test net output #0: accuracy = 0.8169
I0313 14:34:17.612794 126407 solver.cpp:418]     Test net output #1: loss = 0.533315 (* 1 = 0.533315 loss)
I0313 14:34:17.612810 126407 solver.cpp:336] Optimization Done.
I0313 14:34:17.612821 126407 caffe.cpp:250] Optimization Done.

real    6m51.615s
user    7m30.483s
sys     1m5.158s