HW 엔지니어를 위한 Deep Learning

2018년 3월 23일 금요일

V100에 추가된 Tensor Core라는 것은 무엇인가 ?

Tesla V100 GPU에 새로 추가된 Tensor Core라는 것은 기존의 Tesla P100 GPU에는 없던 것입니다. 이 Tensor Core라는 것은 무엇이고 어떤 일에 사용할 수 있으며, 그로 인한 성능 향상은 어느 정도일까요 ?

1. Tensor Core라는 것은 무엇인가 ?

NVIDIA의 사전적 설명에 따르면, Tensor Core는 matrix-multiply-and-accumulate (행렬 곱셈+덧셈) 장치로서 Tensor Core 1개당 하나의 사이클에서 64개의 floating point FMA mixed-precision operation을 수행할 수 있습니다. FMA란 fused multiply–add로서, 곱셈과 덧셈을 하나의 operation에서 한꺼번에 수행하는 것입니다. V100 GPU에는 SM당 8개씩의 Tensor Core가 장착되어 있으므로, 하나의 SM에서는 64 x 8 x 2(곱셈 + 덧셈) = 1024번의 floating point 연산을 하나의 사이클에서 수행가능합니다. V100에는 80개의 SM이 장착되어 있으므로 한 사이클에 80 * 1024 = 8196번의 floating point 연산이 가능합니다.

Tensor Core의 특징은 저 위에서 말한 대로, deep learning에서 많이 사용되는 D = A * B + C와 같은 행렬의 곱셈과 덧셈을 한번에 수행해버린다는 점입니다. 그 외에도 매우 중요한 점이 있습니다. 바로 mixed precision 연산이 매우 쉽게, 그것도 한 사이클에서 가능하다는 것입니다. 즉, D = A * B + C의 연산에서, A와 B는 FP16(floating point 16-bit, half-precision)으로, C와 D는 FP32(floating point 32-bit, single-precision)으로 한번에 처리하는 것이 가능합니다.

2. Tensor Core는 어떤 일에 사용되나 ?

한줄 요약하면 deep learning에서 빠르면서도 높은 정확도(accuracy)의 모델을 만들 때 매우 유용하게 사용됩니다.

Deep learning training phase에서 하는 일은 수십~수백만장의 이미지에서 공통되는 feature를 추출하는 것이므로, input data에 굳이 FP32를 사용할 필요가 없고 FP16 정도면 충분한 경우가 많습니다. 그럴 경우, P100 Pascal GPU부터는 FP32 장치 하나로 FP16 연산 2개를 수행할 수 있으므로 성능도 2배로 높일 수 있고, 특히 GPU 메모리를 절약할 수 있습니다. 그러나 deep learning의 목표는 빠른 training이 아니라 더 높은 정확도(accuracy)입니다. 더 높은 accuracy를 위해서는, 그렇게 FP16의 행렬 data를 곱한 결과를 합산할 때 FP32로 해야 합니다. 즉, deep learning에서는 이런 mixed precision의 곱셈과 덧셈이 매우 자주 사용됩니다. 이런 경우에 Tensor Core를 사용하는 것이 특히 좋습니다.

3. Tensor Core는 어떻게 사용하면 되는가 ?

물론 기존의 application을 그대로 V100에서 수행한다고 해서 Tensor Core를 자동으로 쓰는 것은 아닙니다. SW가 Tensor Core를 사용하도록 개발되어야 하지요. 2018년 3월인 지금 현재로서는 CUDA library 중 cuBLAS와 cuDNN이 Tensor Core를 지원합니다. cuBLAS와 cuDNN은 모두 caffe와 tensorflow 등 주요 deep learning framework에서 모두 사용하는 library입니다. cuBLAS는 GEMM(행렬끼리의 곱셈) 연산을 수행할 때, 그리고 cuDNN은 CNN 및 RNN을 가속하는데 Tensor Core를 사용합니다.

그렇다고 cuBLAS 및 cuDNN을 쓰는 모든 application이 Tensor Core에 의한 성능 가속 효과를 보는 것은 아닙니다. 가령 다음과 같이 cuBLAS를 사용할 때 Tensor Core를 쓰도록 application을 수정해야 합니다.

// Set the math mode to allow cuBLAS to use Tensor Cores:
cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
...
// Invoke the GEMM, ensuring k, lda, ldb, and ldc are all multiples of 8,
// and m is a multiple of 4:
cublasStat = cublasGemmEx(handle, transa, transb, m, n, k, alpha,
A, CUDA_R_16F, lda,
B, CUDA_R_16F, ldb,
beta, C, CUDA_R_16F, ldc, CUDA_R_32F, algo);

cuBLAS에서 GEMM 연산을 한다고 해도, 위에서처럼 math mode가 CUBLAS_TENSOR_OP_MATH로 맞춰져 있어야 하고, k, lda, ldb, ldc 등이 모두 8의 배수여야 합니다. 이런 조건에 맞지 않으면 GEMM 연산은 Tensor Core를 이용하지 않고 기존 CUDA core를 이용합니다.

현재는 tensorflow, pytorch, caffe2 등의 framework이 Tensor Core의 혜택을 보도록 적용이 되어 있습니다.

4. Tensor Core로 인한 성능 향상은 어느 정도인가 ?

GPU에는 single-precision(FP32) 연산장치와 double-precision(FP16) 연산장치가 각각 따로 있으며, V100에는 그 비율이 2:1로 되어 있습니다. 그래서 V100의 FP32 성능은 15.7 TFLOPS인데, FP64 성능은 7.8 TFLOPS라고 하는 것입니다. 여태까지는 보통 deep learning에서는 (더 높은 정확도를 위해서) FP32를 사용하고, 과학기술 연산이나 금융 연산에서는 FP64를 사용합니다. 그런데 NVIDIA의 자료에 V100의 "Deep Learning" 성능은 125 TFLOPS라고 되어 있습니다. 이건 실제 deep learning 성능이라기보다는, tensor operation 성능을 말하는 것입니다. 위에서 V100은 한 사이클에 8196번의 floating point 연산이 가능하다고 했고 V100의 GPU boost clock speed가 1.53GHz이므로 8196 * 1530 million = 125 TFLOPS가 나오는 것입니다.

NVIDIA에서 내놓은 자료에 따르면 단순 GEMM 연산의 경우 V100에서 Tensor Core를 이용하면 기존 P100 대비 4~9배의 성능을 냅니다.

역시 NVIDIA에서 내놓은 자료에 따르면 cuDNN convolution 속도가 V100에서 Tensor Core를 이용하면 기존 P100 대비 4~5배 빨라집니다.

그러나 실제로 tensorflow를 이용하여 training을 해보면 언제나 그렇게 빨라지는 것은 아닙니다. 일단 deep learning 전체 과정이 모두 GEMM으로 이루어진 것도 아니요 cuDNN convolution으로만 된 것도 아니기 때문입니다. 가령 아래의 tensorflow github에 올라온 issue를 보십시요.

Tensor Core support for NVIDIA Volta architecture #15897

이 URL의 내용을 요약하면 "Volta GPU에서 FP16으로 tensorflow를 수행할 때, NVIDIA의 광고에 따르면 tensor core 덕분에 4배~8배의 성능이 나와야 하지만 실제로는 2배의 성능만 나오더라, 어떻게 된 것인가 ?"라는 이슈 제기에 대해, tensorflow 개발측이 "tensorflow는 어떤 내부 알고리즘을 사용하는 것이 가장 좋은 성능을 낼지 찾아내는 자동 튜닝 장치를 갖추고 있고, 여기에 이미 tensor core를 사용하도록 적용이 되어 있다, 그러나 자동 튜닝 결과가 언제나 항상 tensor core를 사용하도록 나오는 것은 아니다"라고 답한 것입니다. 특히 이 질문 속 트레이닝의 경우 NCHW에서 NHWC로의 이미지 포맷 전환에서 많은 시간을 소비하므로 특히 그런 것으로 나옵니다.

* 용어 해설
NCHW : Num_samples x Channels x Height x Width
NHWC : Num_samples x Height x Width x Channels

윗 글의 대부분은 아래 URL에 나온 내용을 알기 쉽게 풀어 놓은 것이며 성능 자료들의 출처도 아래 URL입니다.

https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

2018년 3월 20일 화요일

구글 기계 번역을 위한 RNN에 IBM POWER9이 사용되고 있을까 ?

어제밤, Forbes지에 구글(Google)이 POWER9 프로세서를 자사 데이터센터에 활용하고 있음을 공식적으로 발표했다는 기사가 실렸습니다.

원래 자사 데이터센터에 대해서는 거의 모든 것을 비밀에 부치는 구글이 뒤늦게나마 특정 프로세서를 사용 중임을 밝힌 것은 이례적인 일입니다. 라스베가스에서 열린 OpenPOWER Summit 2018 행사에서 나온 이 발표는 굉장히 놀라운 소식은 아니고, 이미 작년부터 예고되어 있었던 것입니다. 이미 작년에 구글은 클라우드 업체인 Rackspace와 함께 POWER9 프로세서를 장착한 자체 서버 아키텍처를 OCP(Open Compute Project) 기반으로 설계 중임을 밝혔고, 그 코드네임이 자이우스(Zaius)라는 것도 공개했었습니다.

구글 부사장인 메어 머호니(Maire Mahoney)는 올해 행사에서 Zaius가 구글 데이터센터에 프로덕션용으로 배치되어 있으며, 그 숫자를 계속 늘려가고 있다고 밝혔습니다. 또한 이 플랫폼이 "Google Strong"하다라고 평가했는데, 이 표현은 안정적이고 견고하다는 뜻이라고 설명했습니다.

구글이 POWER9의 어떤 점을 마음에 들어하는지에 대해서 머호니는 3가지 사항을 이야기했습니다.

1) 핵심 구글 검색 업무를 위한 더 많은 코어와 thread
: 이는 Intel x86의 hyperthreading에서는 HW thread가 2개 밖에 나오지 않는 것에 비해 POWER9의 SMT는 4개 혹은 8개의 HW thread를 제공하는 것을 의미합니다. 이는 다수의 사용자에게 서비스를 제공해야 하는 구글 검색 업무에 있어 특히 도움이 됩니다.

2) RNN 기계 학습을 위한 더 큰 메모리 대역폭
: 가령 Intel E5-2698 v4(Broadwell)의 경우 소켓당 76.8GB/s의 메모리 대역폭을 가지지만 (여기를 클릭), 동급 2-socket용 POWER9 프로세서는 소켓당 120GB/s의 메모리 대역폭을 자랑(여기를 클릭)합니다. 이 부분이 특히 기계 번역 등에 사용되는 RNN 기계 학습에서 매우 유효하다고 구글은 판단하는 것입니다.

3) OpenCAPI 가속 bus에 자리잡은 더 빠르고 더 개방적인 flash NAND
: 'OpenCAPI 가속 bus'라고 하는 것은 POWER9에 on-chip 형태로 내장된 25G link들을 이야기하는 것입니다. 원래 IBM 내에서의 코드명이 Bluelink였던 이 HW 기술은 PCIe를 대체하여 GPU, FPGA, ASIC 및 storage-class memory 등을 low latency high bandwidth로 연결하기 위한 하드웨어 버스 기술이라고 할 수 있습니다. POWER9 프로세서에는 이런 25G link들이 최대 48-lane까지 내장되는데, 이들은 PCIe Gen4 lane 외에 추가로 장착되는 것이며, 특히 GPU나 FPGA와 같은 가속기를 위한 최고의 프로세서를 자처하는 POWER9에서 가장 자랑하는 기술이기도 합니다. 이 기술은 AMD, NVIDIA, HPE 및 Dell-EMC 등 주요 벤더들이 모두 참여하는 OpenCAPI.org를 통해 모든 CPU 아키텍처에서 표준적으로 사용하는 개방형 기술로 발전되고 있습니다.

1번은 원래부터 전통적인 POWER 아키텍처의 장점이었으니 별로 특별한 점은 없습니다.

2번의 경우 구글이 기계 번역(machine translation) 서비스를 제공한다는 점에서 특히 흥미롭습니다. 구글은 2016년에 RNN을 이용한 기계 번역 연구에 착수했다고 밝힌 바 (여기를 클릭) 있습니다. RNN은 이미지 인식에 많이 쓰이는 CNN과는 달리 CPU-memory 간의 대역폭이 많이 사용되는데 (여기를 클릭), Intel x86 대비 CPU-memory 대역폭이 월등하다는 점에서 구글이 POWER9에 주목한 것으로 보입니다.

3번은 더욱 흥미롭습니다. 구글이 OpenCAPI를 이용한 flash NAND 사용에 관심이 있거나 또는 이미 사용하고 있다는 것을 밝힌 것이기 때문입니다. OpenCAPI 컨소시엄에 참가한 Mellanox, Nallatech, Xilinx 등의 여러 업체들이 이번 OpenPOWER Summit 2018 행사에서 다양한 FPGA 등의 관련 제품을 내놓기도 했습니다만, 이렇게 구글이 실질적인 use case를 가지고 있다는 것은 의미하는 바가 큽니다.

OpenCAPI 주요 회원사들 중 일부 (출처 http://opencapi.org/membership/current-members)

OpenCAPI 주요 속성 (출처 http://opencapi.org/wp-content/uploads/2016/09/OpenCAPI-Exhibit-SC17.pdf)

자세한 소식은 아래의 포브스지 본문을 읽어보시기 바랍니다.

https://www.forbes.com/sites/patrickmoorhead/2018/03/19/headed-into-its-fifth-year-openpower-has-momentum-into-the-power9-generation

2018년 3월 15일 목요일

ddl-tensorflow에 포함된 DDL을 이용한 example python code : ddl-mnist.py

전의 posting에서는 DDL (distributed deep learning) option을 이용한 caffe-ibm 사용법에 대해 적었습니다. 이번에는 ddl-tensorflow에 대한 내용입니다.

Caffe와는 달리 tensorflow는 python으로 app code를 짜야 하는데, DDL을 이용한 python code 작성을 위한 example code도 일부 제공됩니다. 가장 간단한 MNIST training을 위한 python code가 ddl-tensorflow에 포함되어 있습니다.

먼저, PowerAI toolkit을 설치합니다. (여기서는 최신 v5가 아니라 기존 v4를 썼습니다.)

u0017649@sys-92312:~$ dpkg -l | grep mldl
ii mldl-repo-local 4.0.0 ppc64el IBM repository for Deep Learning tools for POWER linux

PowerAI에서 deb 형태로 제공되는 tensorflow와 ddl-tensorflow를 확인하고, apt-get 명령으로 설치합니다.

u0017649@sys-92312:~$ apt-cache pkgnames | grep tensor
ddl-tensorflow
tensorflow

u0017649@sys-92312:~$ sudo apt-get install tensorflow ddl-tensorflow

관련 example code는 아래 directory에 있습니다. mnist와 slim 관련 2가지가 있습니다.

u0017649@sys-92312:~$ cd /opt/DL/ddl-tensorflow/examples

u0017649@sys-92312:/opt/DL/ddl-tensorflow/examples$ ls -ltr
total 8
drwxr-xr-x 7 root root 4096 Mar 15 08:18 slim
drwxr-xr-x 2 root root 4096 Mar 15 08:18 mnist

u0017649@sys-92312:/opt/DL/ddl-tensorflow/examples$ cd mnist

u0017649@sys-92312:/opt/DL/ddl-tensorflow/examples/mnist$ ls -ltr
total 16
-rw-r--r-- 1 root root 240 Aug 2 2017 README.md
-rw-r--r-- 1 root root 8681 Aug 2 2017 ddl_mnist.py

MNIST를 위한 README.md를 읽어보면, 그냥 이 code를 mpirun을 이용하여 어떻게 돌리느냐에 대한 사용방법 안내입니다. 여기서는 single-node에 GPU 2장이 설치된 경우이므로 -rf 옵션을 통해 별도의 rank file(rf)을 지정할 필요는 없습니다. OpenMPI 특성상, 모든 GPU는 독립적인 learner로 처리되므로, 한대의 서버에 장착된 GPU 2장이나, 두대의 서버에 각각 1장씩 장착된 GPU 총 2장이나 topology가 다를 뿐 동일한 방식으로 처리됩니다.

u0017649@sys-92312:/opt/DL/ddl-tensorflow/examples/mnist$ vi README.md
# HOW TO RUN

To run the IBM PowerAI Distributed Deep Learning MNIST training example:

$ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate

$ mpirun -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -n 2 python ddl_mnist.py

아래는 ddl_mnist.py 전체 내용입니다.

u0017649@sys-92312:/opt/DL/ddl-tensorflow/examples/mnist$ vi ddl_mnist.py

'''

Based on https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py:

A Convolutional Network implementation example using TensorFlow library.
This example is using the MNIST database of handwritten digits
(http://yann.lecun.com/exdb/mnist/)

Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/

Modifications:

*****************************************************************

Licensed Materials - Property of IBM

(C) Copyright IBM Corp. 2017. All Rights Reserved.

US Government Users Restricted Rights - Use, duplication or
disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

*****************************************************************
'''
import tensorflow as tf
import numpy as np

############################################################################
# IBM PowerAI Distributed Deep Learning (DDL) setup
############################################################################

# Disable GPU memory preallocation
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

############################################################################
# DDL Initialize BEGIN
############################################################################
# Load DDL operator
ddl = tf.load_op_library('/opt/DL/ddl-tensorflow/lib/ddl_MDR.so')

# DDL initializes MPI on CPU
# ddl.init takes two inputs
# 1) the number of GPUs to utilize on each host in training.
# this number is not the number of GPUs to use for each leaner. It simply tells DDL that there are X GPUs in each host to be used for training
# 2) DDL options (refer to README for details)
with tf.Session(config=config) as sess:
with tf.device('/cpu:0'):
rank, size, gpuid = sess.run(ddl.init(2, mode = '-mode r:2 -dump_iter 100'))

# MPI info and assigned GPU
print [rank, size, gpuid]
############################################################################
# DDL Initialize END
############################################################################

# Perform all TensorFlow computation within gpuid
with tf.device('/gpu:%d' %gpuid):
##############################################################################
# Import MNIST data

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# Parameters
learning_rate = 0.001
training_iters = 200000
batch_size = 100
display_step = 1

# Network Parameters
n_input = 784 # MNIST data input (img shape: 28*28)
n_classes = 10 # MNIST total classes (0-9 digits)
dropout = 0.75 # Dropout, probability to keep units

# tf Graph input
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])
keep_prob = tf.placeholder(tf.float32) #dropout (keep probability)

# Create some wrappers for simplicity
def conv2d(x, W, b, strides=1):
# Conv2D wrapper, with bias and relu activation
def conv2d(x, W, b, strides=1):
# Conv2D wrapper, with bias and relu activation
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
x = tf.nn.bias_add(x, b)
return tf.nn.relu(x)

def maxpool2d(x, k=2):
# MaxPool2D wrapper
return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1],
padding='SAME')

# Create model
def conv_net(x, weights, biases, dropout):
# Reshape input picture
x = tf.reshape(x, shape=[-1, 28, 28, 1])

# Convolution Layer
conv1 = conv2d(x, weights['wc1'], biases['bc1'])
# Max Pooling (down-sampling)
conv1 = maxpool2d(conv1, k=2)

# Convolution Layer
conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
# Max Pooling (down-sampling)
conv2 = maxpool2d(conv2, k=2)

# Fully connected layer
# Reshape conv2 output to fit fully connected layer input
fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
fc1 = tf.nn.relu(fc1)
# Apply Dropout
fc1 = tf.nn.dropout(fc1, dropout)

# Output, class prediction
out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
return out

# Store layers weight & bias
weights = {
############################################################################
# DDL BROADCAST BEGIN
############################################################################
# This step ensures that all learners start with the same initial parameters

# 5x5 conv, 1 input, 32 outputs
'wc1': tf.Variable(ddl.bcast(tf.random_normal([5, 5, 1, 32]))),
# 5x5 conv, 32 inputs, 64 outputs
'wc2': tf.Variable(ddl.bcast(tf.random_normal([5, 5, 32, 64]))),
# fully connected, 7*7*64 inputs, 1024 outputs
'wd1': tf.Variable(ddl.bcast(tf.random_normal([7*7*64, 1024]))),
# 1024 inputs, 10 outputs (class prediction)
'out': tf.Variable(ddl.bcast(tf.random_normal([1024, n_classes])))
############################################################################
# DDL BROADCAST END
############################################################################
}

biases = {
'bc1': tf.Variable(ddl.bcast(tf.random_normal([32]))),
'bc2': tf.Variable(ddl.bcast(tf.random_normal([64]))),
'bd1': tf.Variable(ddl.bcast(tf.random_normal([1024]))),
'out': tf.Variable(ddl.bcast(tf.random_normal([n_classes])))
}

# Construct model
pred = conv_net(x, weights, biases, keep_prob)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

############################################################################
# DDL ALLREDUCE BEGIN
############################################################################

# Collect the gradients and the corresponding parameters w.r.t the given cost
grads_and_vars = optimizer.compute_gradients(cost)

# Separate out the tuple
grads, vars = zip(*grads_and_vars)

# This step takes the average of the gradients on all the learners
grads_and_vars_ddl = zip(ddl.all_reduce_n(grads, op='avg'), vars)

# Update the parameters with the averaged gradient
objective = optimizer.apply_gradients(grads_and_vars_ddl)

############################################################################
# DDL ALLREDUCE END
############################################################################

# Evaluate model
correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
##############################################################################

def split(a, n):
k, m = divmod(len(a), n)
return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in xrange(n))

# Launch the graph
with tf.Session(config=config) as sess:
sess.run(tf.global_variables_initializer())
step = 1
# Keep training until reach max iterations
while step * batch_size < training_iters:

# Each learner will read batch_size*size samples and
# use only the portion correspoding to the current learner (or rank)

batch_x, batch_y = mnist.train.next_batch(batch_size*size)

batch_x = np.split(batch_x,size)[rank]
batch_y = np.split(batch_y,size)[rank]

# Run optimization op (backprop)
sess.run(objective, feed_dict={x: batch_x, y: batch_y,
keep_prob: dropout})
if step % display_step == 0:
# Calculate batch loss and accuracy
loss, acc = sess.run([cost, accuracy], feed_dict={x: batch_x,
y: batch_y,
keep_prob: 1.})
print("MPI "+str(rank)+"] Iter " + str(step*batch_size) + ", Minibatch Loss= " + \
"{:.6f}".format(loss) + ", Training Accuracy= " + \
"{:.5f}".format(acc))
step += 1

print("MPI "+str(rank)+"] Optimization Finished!")

# Calculate accuracy for 256 mnist test images
print("MPI "+str(rank)+"] Testing Accuracy:", \
sess.run(accuracy, feed_dict={x: mnist.test.images[:256],
y: mnist.test.labels[:256],
keep_prob: 1.}))

위에서 언급된 또다른 example인 slim을 포함한 example code들 directory의 tar 파일을 아래 link에 올려두었습니다.

https://drive.google.com/open?id=1u4HT0kJa5YpURy_hyBQMQfik63jRbMIa

위 링크에 올린 파일 내용은 아래의 ddl-examples.tgz이며, 그 속에 들어있는 파일은 아래와 같습니다.

u0017649@sys-92312:/opt/DL/ddl-tensorflow$ sudo tar -zcvf ddl-examples.tgz doc examples

doc/

doc/README-API.md

doc/README.md

doc/LICENSE.pdf

doc/images/

doc/images/clones2.png

doc/images/cifar10_overview.png

examples/

examples/slim/

examples/slim/WORKSPACE

examples/slim/__init__.py

examples/slim/nets/

examples/slim/nets/__init__.py

examples/slim/nets/resnet_v1_test.py

examples/slim/nets/nets_factory_test.py

examples/slim/nets/alexnet.py

examples/slim/nets/inception_utils.py

examples/slim/nets/vgg.py

examples/slim/nets/mobilenet_v1.png

examples/slim/nets/vgg_test.py

examples/slim/nets/inception_v4_test.py

examples/slim/nets/resnet_utils.py

examples/slim/nets/inception_v2.py

examples/slim/nets/nets_factory.py

examples/slim/nets/mobilenet_v1.py

examples/slim/nets/inception_v1.py

examples/slim/nets/inception_resnet_v2.py

examples/slim/nets/inception_v2_test.py

examples/slim/nets/inception_v1_test.py

examples/slim/nets/resnet_v2.py

examples/slim/nets/alexnet_test.py

examples/slim/nets/inception_v4.py

examples/slim/nets/inception_v3.py

examples/slim/nets/inception_resnet_v2_test.py

examples/slim/nets/inception_v3_test.py

examples/slim/nets/resnet_v1.py

examples/slim/nets/inception.py

examples/slim/nets/mobilenet_v1_test.py

examples/slim/nets/overfeat.py

examples/slim/nets/overfeat_test.py

examples/slim/nets/cifarnet.py

examples/slim/nets/resnet_v2_test.py

examples/slim/nets/lenet.py

examples/slim/nets/mobilenet_v1.md

examples/slim/train-inception_v3.sh

examples/slim/download_and_convert_data.py

examples/slim/train-cifar10.sh

examples/slim/preprocessing/

examples/slim/preprocessing/__init__.py

examples/slim/preprocessing/preprocessing_factory.py

examples/slim/preprocessing/lenet_preprocessing.py

examples/slim/preprocessing/cifarnet_preprocessing.py

examples/slim/preprocessing/inception_preprocessing.py

examples/slim/preprocessing/vgg_preprocessing.py

examples/slim/README.md

examples/slim/eval_image_classifier.py

examples/slim/scripts/

examples/slim/scripts/train_lenet_on_mnist.sh

examples/slim/scripts/finetune_resnet_v1_50_on_flowers.sh

examples/slim/scripts/finetune_inception_v1_on_flowers.sh

examples/slim/scripts/finetune_inception_resnet_v2_on_flowers.sh

examples/slim/scripts/finetune_inception_v3_on_flowers.sh

examples/slim/scripts/train_cifarnet_on_cifar10.sh

examples/slim/export_inference_graph_test.py

examples/slim/slim_walkthrough.ipynb

examples/slim/deployment/

examples/slim/deployment/__init__.py

examples/slim/deployment/model_deploy_test.py

examples/slim/deployment/model_deploy.py

examples/slim/train-alexnet.sh

examples/slim/BUILD

examples/slim/datasets/

examples/slim/datasets/__init__.py

examples/slim/datasets/cifar10.py

examples/slim/datasets/dataset_utils.py

examples/slim/datasets/download_and_convert_flowers.py

examples/slim/datasets/download_and_convert_cifar10.py

examples/slim/datasets/dataset_factory.py

examples/slim/datasets/imagenet.py

examples/slim/datasets/mnist.py

examples/slim/datasets/flowers.py

examples/slim/datasets/download_and_convert_mnist.py

examples/slim/setup.py

examples/slim/train_image_classifier.py

examples/slim/export_inference_graph.py

examples/mnist/

examples/mnist/README.md

examples/mnist/ddl_mnist.py

2018년 3월 14일 수요일

ppc64le에서 PyCaffe를 이용한 Image Classification demo

여기서는 caffe가 이미 설치되어 있고, anaconda2도 설치된 상태부터 시작합니다.

[user1@gpusvr ~]$ which python
~/anaconda2/bin/python

먼저, caffe의 source code가 있는 CAFFE_ROOT 디렉토리로 가서 make pycaffe를 수행합니다.

[user1@gpusvr caffe]$ make pycaffe
CXX/LD -o python/caffe/_caffe.so python/caffe/_caffe.cpp
touch python/caffe/proto/__init__.py
PROTOC (python) src/caffe/proto/caffe.proto

이렇게 빌드된 pycaffe는 python/caffe 디렉토리 밑에 다음과 같이 설치됩니다.

[user1@gpusvr caffe]$ ls -l python/caffe
total 2016
-rw-rw-r--. 1 user1 user1 21363 Mar 13 10:50 _caffe.cpp
-rwxrwxr-x. 1 user1 user1 1897912 Mar 14 15:17 _caffe.so
-rw-rw-r--. 1 user1 user1 3546 Mar 13 10:50 classifier.py
-rw-rw-r--. 1 user1 user1 3278 Mar 14 15:53 classifier.pyc
-rw-rw-r--. 1 user1 user1 6721 Mar 13 10:50 coord_map.py
-rw-rw-r--. 1 user1 user1 8549 Mar 13 10:50 detector.py
-rw-rw-r--. 1 user1 user1 7335 Mar 14 15:53 detector.pyc
-rw-rw-r--. 1 user1 user1 11174 Mar 13 10:50 draw.py
drwxrwxr-x. 2 user1 user1 34 Mar 13 10:50 imagenet
-rw-rw-r--. 1 user1 user1 552 Mar 13 10:50 __init__.py
-rw-rw-r--. 1 user1 user1 1206 Mar 14 15:16 __init__.pyc
-rw-rw-r--. 1 user1 user1 13079 Mar 13 10:50 io.py
-rw-rw-r--. 1 user1 user1 13576 Mar 14 15:28 io.pyc
-rw-rw-r--. 1 user1 user1 8277 Mar 13 10:50 net_spec.py
-rw-rw-r--. 1 user1 user1 10008 Mar 14 15:53 net_spec.pyc
drwxrwxr-x. 2 user1 user1 86 Mar 14 15:53 proto
-rw-rw-r--. 1 user1 user1 11615 Mar 13 10:50 pycaffe.py
-rw-rw-r--. 1 user1 user1 12179 Mar 14 15:16 pycaffe.pyc
drwxrwxr-x. 2 user1 user1 256 Mar 13 10:50 test

이걸 다음과 같이 PYTHONPATH로 지정된 ~/anaconda2/lib/python2.7/site-packages 밑으로 copy 합니다. 그냥 현재의 저 python/caffe directory를 PYTHONPATH에 추가하는 방법도 있겠습니다만, 저는 해보니 왜인지는 모르겠으나 자꾸 no module named caffe 라는 error가 나더라구요.

[user1@gpusvr caffe]$ cp -r python/caffe /home/user1/anaconda2/lib/python2.7/site-packages

그리고 libcaffe.so file not found 등의 error를 피하기 위해서 다음과 같이 LD_LIBRARY_PATH를 제대로 지정해줍니다.

[user1@gpusvr caffe]$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/caffe/.build_release/lib:/usr/local/lib:/usr/local/lib64:/usr/lib:/usr/lib64

이제 jupyter notebook을 설치하고, 설정도 해준 뒤 구동합니다.

[user1@gpusvr ~]$ conda install jupyter

[user1@gpusvr ~]$ jupyter notebook --generate-config
Writing default config to: /home/user1/.jupyter/jupyter_notebook_config.py

[user1@gpusvr ~]$ vi /home/user1/.jupyter/jupyter_notebook_config.py
...
#c.NotebookApp.ip = 'localhost'
c.NotebookApp.ip = '*'

[user1@gpusvr ~]$ jupyter notebook &
...
http://localhost:8888/?token=9b9c684173d7883ae7431c02d3e8dfe831cf790322737d16

이제 다른 PC의 laptop의 웹브라우저에서 http://GPU서버주소:8888/?token=9b9c684173d7883ae7431c02d3e8dfe831cf790322737d16 를 입력창에 넣고 접속합니다.

접속해보면 jupyter를 구동한 그 홈디렉토리가 보일 겁니다. 여기서 click-click하여 caffe/examples 디렉토리를 찾아들어간 뒤, 00-classification.ipynb 라는 파일을 클릭합니다.

Jupyter 메뉴바의 play (오른쪽 삼각형) 버튼을 클릭하여 각 섹션을 넘어가다보면 다음처럼 "6. Try your own image"라는 섹션이 나옵니다. 여기서 코드 속의 my_image_url = 오른쪽 부분에 인터넷에서 구글링으로 찾은 적절한 jpg 파일의 URL을 복사해 붙여봅니다. 저는 전함 미주리호의 사진 주소를 붙여보았습니다.

# download an image
my_image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/USS_Missouri_HNL.jpg/220px-USS_Missouri_HNL.jpg" # paste your URL here

Jupyter 메뉴바의 play 버튼을 클릭하면 다음과 같이 49.4%의 확률로 항공모함, 11.7%의 확률로 스페이스 셔틀, 9.2% 확률로 군용 항공기라고 나옵니다. 아마 갑판에서 바라본 포탑과 함교의 모습을 전함으로 인식하지는 못하나봅니다.

[(0.49434182,
'n02687172 aircraft carrier, carrier, flattop, attack aircraft carrier'),
(0.11762773, 'n04266014 space shuttle'),
(0.092569701, 'n04552348 warplane, military plane'),
(0.081569709, 'n04008634 projectile, missile'),
(0.078422301,
'n04389033 tank, army tank, armored combat vehicle, armoured combat vehicle')]

다음과 같이 전체 전함의 모습을 담은 사진을 넣어도 여전히 99.9% 확률로 항공모함이라고 나옵니다. 아마도 ImageNet 2012 training dataset으로 훈련된 이 모델에는 전함과 항공모함을 구별해서 labeling이 되어있지는 않은 모양입니다.

이 데모의 웹 버전은 다음 URL에 준비되어 있으니 언제든 테스트해보실 수 있습니다.

http://demo.caffe.berkeleyvision.org/

Deep Learning용 GPU 서버 성능 측정을 위한 benchmark test 방법 몇가지

Deep Learning을 위한 GPU 서버들은 대부분 NVIDIA GPU를 사용하니까, 그 서버의 성능은 NVIDIA GPU 중 어떤 것을 몇 개 장착했느냐가 가장 중요합니다. 하지만 서버 벤더별 모델별로 그 특장점이 다 다르고 또 어떤 서버들은 NVLink 등 GPU와 GPU, GPU와 CPU의 연결 방식에 있어 차별점을 두고 있습니다. 어떤 경우엔 값비싼 최신 GPU를 잔뜩 달았지만 시스템 대역폭이 부족하여 제 성능을 내지 못할 수도 있고요.

이런 점을 확인해보려면 그냥 자신의 data를 이용하여 자신의 신경망을 직접 돌려보고 성능을 비교하는 것이 좋습니다만, 거기에는 많은 시간과 돈, 노력이 필요하지요.

여기서는 그런 점들을 비교적 쉽게 테스트해볼 수 있는 사실상 표준적인 벤치마크 방법론 몇가지를 소개합니다. 어느 경우든 linux의 time 명령을 이용해서 전체 수행 시간만 측정하면 되니까 매우 편리합니다. 평가 대상이 신경망 자체가 아니라 GPU 서버 하드웨어이고 다 같은 data와 같은 신경망을 사용하니까 테스트 결과의 accuracy 등은 비교해 보실 필요가 없습니다.

아래 테스트들은 tensorflow 1.4.1, 그리고 Anaconda3에서 제공되는 python3.6을 사용해서 수행한 것입니다.

[user1@gpusvr ~]$ which python
~/anaconda3/bin/python

먼저, 다음과 같이 github로부터 tensorflow models를 clone 합니다.

[user1@gpusvr ~]$ git clone https://github.com/tensorflow/models.git

1. CIFAR10

이중 가장 많이 사용되는 것이 cifar10입니다. 수행할 cifar10_multi_gpu_train.py code를 열어보면 조정 가능한 parameter가 무엇이고 그 default 값이 무엇인지 보실 수 있습니다. 그를 명령어에서 적절히 수정하여 수행하시면 됩니다.

[user1@gpusvr ~]$ cd models/tutorials/image/cifar10

[user1@gpusvr cifar10]$ vi cifar10_multi_gpu_train.py
...
parser.add_argument('--max_steps', type=int, default=1000000,
...
parser.add_argument('--num_gpus', type=int, default=1,

이제 아래와 같이 수행합니다. 여기서는 # max_steps=10000 num_gpus=4 으로 수행합니다. 다만, 처음에 이 테스트를 수행할 때는 internet에서 dataset을 download 받는 것부터 시작합니다. 그 부분은 당연히 테스트 수행 시간에서 제외해야 합니다. 따라서 처음 테스트는 측정하지 마시고, 2번째 테스트 이후를 측정하시면 됩니다. 그때는 이미 download 받아놓은 data를 사용하거든요.

[user1@gpusvr cifar10]$ time python cifar10_multi_gpu_train.py --max_steps=10000 --num_gpus=4 --batch_size 512
...
2018-03-09 11:36:47.084991: step 9990, loss = 0.58 (25710.1 examples/sec; 0.020 sec/batch)

이렇게 수행해보면 GPU 사용량은 17~22% 정도로 상당히 낮습니다. batch_size를 8배인 4096으로 키워도 되는데, 이때 그에 따라 max_steps도 1/8로 줄여야 합니다. 아래는 # max_steps=1250 num_gpus=4 으로 수행한 것입니다.

[user1@gpusvr cifar10]$ time python cifar10_multi_gpu_train.py --max_steps=1250 --num_gpus=4 --batch_size 4096
...
2018-03-09 13:11:57.589629: step 1240, loss = 2.08 (28222.4 examples/sec; 0.145 sec/batch)

성능이 조금 나아지긴 합니다만, 극적으로 나아지진 않습니다.

2. Alexnet

여기에 포함된 alexnet_benchmark.py은 single GPU만 이용합니다.

[user1@gpusvr alexnet]$ time python alexnet_benchmark.py --batch_size=1024 --num_batches=1000
...
2018-03-09 14:32:56.566184: step 990, duration = 0.522
2018-03-09 14:33:01.260806: Forward-backward across 1000 steps, 0.521 +/- 0.002 sec / batch

그러나 여기서도 다음과 같이 script를 짜서 여러개의 GPU를 사용하는 벤치마크를 할 수 있습니다. CUDA_VISIBLE_DEVICES 환경변수를 이용하여 각 세션마다 특정 GPU를 할당한 뒤, GPU 개수만큼 alexnet_benchmark.py를 병렬로 수행하는 것입니다. 이것도 의미가 있는 테스트입니다. 실제 대부분의 고객들이 여러개의 GPU가 달린 서버를 사용할 때, 여러개의 GPU를 이용하여 하나의 model을 training하는 경우보다는 아래 script처럼 여러 연구원이 1개씩의 GPU를 가지고 각자의 training을 수행하는 경우가 대부분이기 때문입니다. 그런 경우에도 CPU-GPU 간의 병목 없이 원활한 성능이 나오는지 확인하는 것도 중요합니다.

[user1@gpusvr alexnet]$ vi alexrun.sh
echo "Starting !"
CUDA_VISIBLE_DEVICES=0 python alexnet_benchmark.py --batch_size=1024 --num_batches=1000 &
CUDA_VISIBLE_DEVICES=1 python alexnet_benchmark.py --batch_size=1024 --num_batches=1000 &
CUDA_VISIBLE_DEVICES=2 python alexnet_benchmark.py --batch_size=1024 --num_batches=1000 &
CUDA_VISIBLE_DEVICES=3 python alexnet_benchmark.py --batch_size=1024 --num_batches=1000 &
wait
echo "Completed !"

[user1@gpusvr alexnet]$ chmod a+x alexrun.sh
[user1@gpusvr alexnet]$ time ./alexrun.sh
Starting !
....
2018-03-09 14:50:15.533991: step 990, duration = 0.523
2018-03-09 14:50:17.971840: step 990, duration = 0.521
2018-03-09 14:50:18.197058: step 990, duration = 0.524
2018-03-09 14:50:20.202855: step 990, duration = 0.525
2018-03-09 14:50:20.231726: Forward-backward across 1000 steps, 0.522 +/- 0.001 sec / batch
2018-03-09 14:50:22.691048: Forward-backward across 1000 steps, 0.524 +/- 0.002 sec / batch
2018-03-09 14:50:22.908471: Forward-backward across 1000 steps, 0.523 +/- 0.002 sec / batch
2018-03-09 14:50:24.927234: Forward-backward across 1000 steps, 0.525 +/- 0.002 sec / batch
Completed !

실제 수행 결과를 보면 아무래도 1개 GPU만 사용했을 때보다 약간 더 느리게 나오는 것을 보실 수 있습니다.

3. RNN PTB

이것도 1개의 GPU만 이용하는 benchmark test입니다. 다른 것과는 달리 image에 대한 CNN 트레이닝이 아니라 text에 대한 RNN 트레이닝이라는 점이 주목할 만 합니다.

[user1@gpusvr ptb]$ pwd
/home/user1/models/tutorials/rnn/ptb

이 테스트를 위한 sample data는 아래에서 따로 download 받아야 합니다.

[user1@gpusvr ptb]$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz

[user1@gpusvr ptb]$ tar xvfz simple-examples.tgz -C $HOME

[user1@gpusvr ptb]$ du ~/simple-examples
17256 /home/user1/simple-examples/data
74840 /home/user1/simple-examples/models
516 /home/user1/simple-examples/rnnlm-0.2b
12 /home/user1/simple-examples/1-train
12 /home/user1/simple-examples/3-combination
0 /home/user1/simple-examples/2-nbest-rescore/lattices/nbest
2860 /home/user1/simple-examples/2-nbest-rescore/lattices
2900 /home/user1/simple-examples/2-nbest-rescore
12 /home/user1/simple-examples/5-one-iter
12 /home/user1/simple-examples/6-recovery-during-training
12 /home/user1/simple-examples/7-dynamic-evaluation
0 /home/user1/simple-examples/temp
12 /home/user1/simple-examples/8-direct
12 /home/user1/simple-examples/4-data-generation
12 /home/user1/simple-examples/9-char-based-lm
95608 /home/user1/simple-examples

Training에 사용되는 data는 총 94MB 정도로서 작은 편입니다.

[user1@gpusvr ptb]$ du -sm ~/simple-examples
94 /home/user1/simple-examples

Data 내용은 아래에 보시다시피 text들입니다.

[user1@gpusvr ~]$ cd /home/user1/simple-examples/data

[user1@gpusvr data]$ head ptb.train.txt
aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter
pierre <unk> N years old will join the board as a nonexecutive director nov. N
mr. <unk> is chairman of <unk> n.v. the dutch publishing group
rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate
a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported
the asbestos fiber <unk> is unusually <unk> once it enters the <unk> with even brief exposures to it causing symptoms that show up decades later researchers said
<unk> inc. the unit of new york-based <unk> corp. that makes kent cigarettes stopped using <unk> in its <unk> cigarette filters in N
although preliminary findings were reported more than a year ago the latest results appear in today 's new england journal of medicine a forum likely to bring new attention to the problem
a <unk> <unk> said this is an old story
we 're talking about years ago before anyone heard of asbestos having any questionable properties

[user1@gpusvr data]$ head ptb.char.train.txt
a e r _ b a n k n o t e _ b e r l i t z _ c a l l o w a y _ c e n t r u s t _ c l u e t t _ f r o m s t e i n _ g i t a n o _ g u t e r m a n _ h y d r o - q u e b e c _ i p o _ k i a _ m e m o t e c _ m l x _ n a h b _ p u n t s _ r a k e _ r e g a t t a _ r u b e n s _ s i m _ s n a c k - f o o d _ s s a n g y o n g _ s w a p o _ w a c h t e r
p i e r r e _ _ N _ y e a r s _ o l d _ w i l l _ j o i n _ t h e _ b o a r d _ a s _ a _ n o n e x e c u t i v e _ d i r e c t o r _ n o v . _ N
m r . _ _ i s _ c h a i r m a n _ o f _ _ n . v . _ t h e _ d u t c h _ p u b l i s h i n g _ g r o u p
r u d o l p h _ _ N _ y e a r s _ o l d _ a n d _ f o r m e r _ c h a i r m a n _ o f _ c o n s o l i d a t e d _ g o l d _ f i e l d s _ p l c _ w a s _ n a m e d _ a _ n o n e x e c u t i v e _ d i r e c t o r _ o f _ t h i s _ b r i t i s h _ i n d u s t r i a l _ c o n g l o m e r a t e
a _ f o r m _ o f _ a s b e s t o s _ o n c e _ u s e d _ t o _ m a k e _ k e n t _ c i g a r e t t e _ f i l t e r s _ h a s _ c a u s e d _ a _ h i g h _ p e r c e n t a g e _ o f _ c a n c e r _ d e a t h s _ a m o n g _ a _ g r o u p _ o f _ w o r k e r s _ e x p o s e d _ t o _ i t _ m o r e _ t h a n _ N _ y e a r s _ a g o _ r e s e a r c h e r s _ r e p o r t e d
t h e _ a s b e s t o s _ f i b e r _ _ i s _ u n u s u a l l y _ _ o n c e _ i t _ e n t e r s _ t h e _ _ w i t h _ e v e n _ b r i e f _ e x p o s u r e s _ t o _ i t _ c a u s i n g _ s y m p t o m s _ t h a t _ s h o w _ u p _ d e c a d e s _ l a t e r _ r e s e a r c h e r s _ s a i d
 _ i n c . _ t h e _ u n i t _ o f _ n e w _ y o r k - b a s e d _ _ c o r p . _ t h a t _ m a k e s _ k e n t _ c i g a r e t t e s _ s t o p p e d _ u s i n g _ _ i n _ i t s _ _ c i g a r e t t e _ f i l t e r s _ i n _ N
a l t h o u g h _ p r e l i m i n a r y _ f i n d i n g s _ w e r e _ r e p o r t e d _ m o r e _ t h a n _ a _ y e a r _ a g o _ t h e _ l a t e s t _ r e s u l t s _ a p p e a r _ i n _ t o d a y _ ' s _ n e w _ e n g l a n d _ j o u r n a l _ o f _ m e d i c i n e _ a _ f o r u m _ l i k e l y _ t o _ b r i n g _ n e w _ a t t e n t i o n _ t o _ t h e _ p r o b l e m
a _ _ _ s a i d _ t h i s _ i s _ a n _ o l d _ s t o r y
w e _ ' r e _ t a l k i n g _ a b o u t _ y e a r s _ a g o _ b e f o r e _ a n y o n e _ h e a r d _ o f _ a s b e s t o s _ h a v i n g _ a n y _ q u e s t i o n a b l e _ p r o p e r t i e s

이제 training을 해보겠습니다. 아래와 같이 전체 training 및 test에 걸린 시간을 측정하면 됩니다.

[user1@gpusvr ptb]$ time python ptb_word_lm.py --data_path=$HOME/simple-examples/data/ --model=small
...
Epoch: 1 Learning rate: 1.000
...
Epoch: 1 Train Perplexity: 268.322
Epoch: 1 Valid Perplexity: 178.848
Epoch: 2 Learning rate: 1.000
...
Epoch: 13 Train Perplexity: 40.549
Epoch: 13 Valid Perplexity: 119.536
Test Perplexity: 114.159

위의 Alexnet 테스트에서처럼, 이 test도 여러개의 세션을 동시에 수행함으로써 시스템 대역폭이 충분하여 multi-user 동시 사용시에도 충분한 성능을 내는지 확인해보실 수 있습니다.

[user1@gpusvr ptb]$ vi ptbrun.sh
echo "Starting !"
CUDA_VISIBLE_DEVICES=0 python ptb_word_lm.py --data_path=$HOME/simple-examples/data/ --model=small &
CUDA_VISIBLE_DEVICES=1 python ptb_word_lm.py --data_path=$HOME/simple-examples/data/ --model=small &
CUDA_VISIBLE_DEVICES=2 python ptb_word_lm.py --data_path=$HOME/simple-examples/data/ --model=small &
CUDA_VISIBLE_DEVICES=3 python ptb_word_lm.py --data_path=$HOME/simple-examples/data/ --model=small &
wait
echo "Completed !"

실제로 아래와 같이 수행해보면 single session으로 single GPU를 이용했을 때에 비해 약간 성능이 떨어지는 것을 보실 수 있습니다.

[user1@gpusvr ptb]$ time ./ptbrun.sh
Starting !
...
Epoch: 13 Valid Perplexity: 119.785
...
Test Perplexity: 113.431
Completed !

2018년 3월 13일 화요일

AC922에서 bvlc caffe 빌드하고 cifar10 돌려보기

(Pycaffe를 위해) Anaconda2가 설치된 환경을 가정하겠습니다. 먼저 아래 package들부터 설치합니다.

[user1@ac922 ~]$ sudo yum install git gcc gcc-c++ python-devel python-enum34 numpy cmake automake snappy.ppc64le boost-python.ppc64le libgfortran4.ppc64le gtk+.ppc64le gtk+-devel.ppc64le gtk2.ppc64le gtk3.ppc64le gstreamer.ppc64le gstreamer-tools.ppc64le libdc1394.ppc64le libdc1394-tools.ppc64le

1. HDF5를 설치합니다.

[user1@ac922 ~]$ wget https://support.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.10.1.tar
[user1@ac922 ~]$ tar -xf hdf5-1.10.1.tar
[user1@ac922 ~]$ cd hdf5-1.10.1
[user1@ac922 hdf5-1.10.1]$ ./configure --prefix=/usr/local --enable-fortran --enable-cxx --build=powerpc64le-linux-gnu
[user1@ac922 hdf5-1.10.1]$ make && sudo make install

2. boost를 설치합니다.

[user1@ac922 ~]$ wget https://dl.bintray.com/boostorg/release/1.66.0/source/boost_1_66_0.tar.gz
[user1@ac922 ~]$ tar -zxf boost_1_66_0.tar.gz
[user1@ac922 ~]$ cd boost_1_66_0
[user1@ac922 boost_1_66_0]$ ./bootstrap.sh --prefix=/usr/local
[user1@ac922 boost_1_66_0]$ ./b2
[user1@ac922 boost_1_66_0]$ sudo ./b2 install

3. GFLAGS를 설치합니다.

[user1@ac922 ~]$ wget https://github.com/schuhschuh/gflags/archive/master.zip
[user1@ac922 ~]$ unzip master.zip && cd gflags-master
[user1@ac922 gflags-master]$ mkdir build && cd build
[user1@ac922 build]$ cmake .. -DBUILD_SHARED_LIBS=ON -DBUILD_STATIC_LIBS=ON -DBUILD_gflags_LIB=ON
[user1@ac922 build]$ make && sudo make install

4. GLOG를 설치합니다.

[user1@ac922 ~]$ wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/google-glog/glog-0.3.3.tar.gz
[user1@ac922 ~]$ tar zxvf glog-0.3.3.tar.gz
[user1@ac922 ~]$ cd glog-0.3.3
[user1@ac922 glog-0.3.3]$ ./configure --build=powerpc64le-redhat-linux-gnu
[user1@ac922 glog-0.3.3]$ make && sudo make install

5. LMDB를 설치합니다.

[user1@ac922 ~]$ git clone https://github.com/LMDB/lmdb
[[user1@ac922 ~]$ cd lmdb/libraries/liblmdb
[user1@ac922 liblmdb]$ make && sudo make install

6. LEVELDB를 설치합니다.

[user1@ac922 files]$ wget https://rpmfind.net/linux/epel/7/ppc64le/Packages/l/leveldb-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ wget https://www.rpmfind.net/linux/epel/7/ppc64le/Packages/l/leveldb-devel-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ sudo rpm -Uvh leveldb-1.12.0-11.el7.ppc64le.rpm
[user1@ac922 files]$ sudo rpm -Uvh leveldb-devel-1.12.0-11.el7.ppc64le.rpm

7. OpenBLAS를 설치합니다.

[user1@ac922 ~]$ git clone https://github.com/xianyi/OpenBLAS.git
[user1@ac922 ~]$ cd OpenBLAS
[user1@ac922 OpenBLAS]$ git checkout power8
[user1@ac922 OpenBLAS]$ make TARGET=POWER8 LDFLAGS="-fopenmp"
[user1@ac922 OpenBLAS]$ sudo make TARGET=POWER8 LDFLAGS="-fopenmp" install
...
Copying the static library to /opt/OpenBLAS/lib
Copying the shared library to /opt/OpenBLAS/lib
Generating OpenBLASConfig.cmake in /opt/OpenBLAS/lib/cmake/openblas
Generating OpenBLASConfigVersion.cmake in /opt/OpenBLAS/lib/cmake/openblas
Install OK!
make[1]: Leaving directory `/home/user1/OpenBLAS'

8. OpenCV를 설치합니다.

[user1@ac922 ~]$ git clone --recursive https://github.com/opencv/opencv.git
[user1@ac922 ~]$ git clone --recursive https://github.com/opencv/opencv_contrib.git
[user1@ac922 ~]$ cd opencv
[user1@ac922 opencv]$ git checkout tags/3.4.1
[user1@ac922 opencv]$ mkdir build && cd build
[user1@ac922 build]$ which protoc
~/anaconda2/bin/protoc
[user1@ac922 build]$ export PROTOBUF_PROTOC_EXECUTABLE="~/anaconda2/bin/protoc"
[user1@ac922 build]$ cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local -DOPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules -D WITH_EIGEN=OFF -DBUILD_LIBPROTOBUF_FROM_SOURCES=ON ..
[user1@ac922 build]$ make && sudo make install
...
-- Installing: /usr/local/bin/opencv_visualisation
-- Set runtime path of "/usr/local/bin/opencv_visualisation" to "/usr/local/lib64:/usr/local/cuda/lib64"
-- Installing: /usr/local/bin/opencv_interactive-calibration
-- Set runtime path of "/usr/local/bin/opencv_interactive-calibration" to "/usr/local/lib64:/usr/local/cuda/lib64"
-- Installing: /usr/local/bin/opencv_version
-- Set runtime path of "/usr/local/bin/opencv_version" to "/usr/local/lib64:/usr/local/cuda/lib64"

9. NCCL을 빌드합니다.

[user1@ac922 ~]$ git clone https://github.com/NVIDIA/nccl
[user1@ac922 ~]$ cd nccl
[user1@ac922 nccl]$ make
[user1@ac922 nccl]$ sudo make install

10. 이제 비로소 caffe를 빌드할 수 있습니다.

[user1@ac922 ~]$ git clone https://github.com/BVLC/caffe.git
[user1@ac922 ~]$ cd caffe
[user1@ac922 caffe]$ cp Makefile.config.example Makefile.config
[user1@ac922 caffe]$ vi Makefile.config
...
# USE_CUDNN := 1
USE_CUDNN := 1
...
# OPENCV_VERSION := 3
OPENCV_VERSION := 3
...
#CUDA_ARCH := -gencode arch=compute_20,code=sm_20 \
-gencode arch=compute_20,code=sm_21 \
-gencode arch=compute_30,code=sm_30 \
-gencode arch=compute_35,code=sm_35 \
-gencode arch=compute_50,code=sm_50 \
-gencode arch=compute_52,code=sm_52 \
-gencode arch=compute_60,code=sm_60 \
-gencode arch=compute_61,code=sm_61 \
-gencode arch=compute_61,code=compute_61
CUDA_ARCH := -gencode arch=compute_60,code=sm_60 \
-gencode arch=compute_61,code=sm_61 \
-gencode arch=compute_61,code=compute_61
...
# BLAS := atlas
BLAS := open
...
#PYTHON_INCLUDE := /usr/include/python2.7 \
/usr/lib/python2.7/dist-packages/numpy/core/include
ANACONDA_HOME := $(HOME)/anaconda2
PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
$(ANACONDA_HOME)/include/python2.7 \
$(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include
...
# PYTHON_LIB := /usr/lib
PYTHON_LIB := $(ANACONDA_HOME)/lib
...
# WITH_PYTHON_LAYER := 1
WITH_PYTHON_LAYER := 1
...
# USE_NCCL := 1
USE_NCCL := 1

LINKFLAGS := -Wl,-rpath,$(HOME)/anaconda2/lib # added to prevent "~/anaconda2/lib/libpng16.so.16 undefined reference to `inflateValidate@ZLIB_1.2.9" error

여기서 아래와 같이 soft link를 걸어주어야 cannot find -lsnappy 등의 error를 피할 수 있습니다.

[user1@ac922 caffe]$ sudo ln -s /usr/lib64/libsnappy.so.1 /usr/lib64/libsnappy.so
[user1@ac922 caffe]$ sudo ln -s /usr/lib64/libboost_python.so.1.53.0 /usr/lib64/libboost_python.so

[user1@ac922 caffe]$ make all
...
CXX/LD -o .build_release/examples/cpp_classification/classification.bin
CXX examples/mnist/convert_mnist_data.cpp
CXX/LD -o .build_release/examples/mnist/convert_mnist_data.bin
CXX examples/siamese/convert_mnist_siamese_data.cpp
CXX/LD -o .build_release/examples/siamese/convert_mnist_siamese_data.bin

[user1@ac922 caffe]$ sudo mkdir /opt/caffe
[user1@ac922 caffe]$ sudo cp -r build/* /opt/caffe

11. 이제 cifar10을 수행해봅니다.

[user1@ac922 caffe]$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib64:/usr/lib:/usr/lib64

[user1@ac922 caffe]$ export CAFFE_HOME=/home/user1/caffe/build/tools

[user1@ac922 caffe]$ cd data/cifar10/

[user1@ac922 cifar10]$ ./get_cifar10.sh
Downloading...
--2018-03-13 14:13:14-- http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
Resolving www.cs.toronto.edu (www.cs.toronto.edu)... 128.100.3.30
Connecting to www.cs.toronto.edu (www.cs.toronto.edu)|128.100.3.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170052171 (162M) [application/x-gzip]
Saving to: ‘cifar-10-binary.tar.gz’

100%[==================================================================>] 170,052,171 11.0MB/s in 15s

2018-03-13 14:13:29 (11.0 MB/s) - ‘cifar-10-binary.tar.gz’ saved [170052171/170052171]

Unzipping...
Done.

[user1@ac922 cifar10]$ ls -la
total 180084
drwxrwxr-x. 2 user1 user1 213 Mar 13 14:13 .
drwxrwxr-x. 5 user1 user1 50 Mar 13 10:50 ..
-rw-r--r--. 1 user1 user1 61 Jun 5 2009 batches.meta.txt
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 data_batch_1.bin
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 data_batch_2.bin
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 data_batch_3.bin
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 data_batch_4.bin
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 data_batch_5.bin
-rwxrwxr-x. 1 user1 user1 506 Mar 13 10:50 get_cifar10.sh
-rw-r--r--. 1 user1 user1 88 Jun 5 2009 readme.html
-rw-r--r--. 1 user1 user1 30730000 Jun 5 2009 test_batch.bin

[user1@ac922 caffe]$ ./examples/cifar10/create_cifar10.sh

[user1@ac922 caffe]$ ls -l examples/cifar10/*_lmdb
examples/cifar10/cifar10_test_lmdb:
total 35656
-rw-rw-r--. 1 user1 user1 36503552 Mar 13 14:14 data.mdb
-rw-rw-r--. 1 user1 user1 8192 Mar 13 14:14 lock.mdb

examples/cifar10/cifar10_train_lmdb:
total 177992
-rw-rw-r--. 1 user1 user1 182255616 Mar 13 14:14 data.mdb
-rw-rw-r--. 1 user1 user1 8192 Mar 13 14:14 lock.mdb

[user1@ac922 caffe]$ vi ./examples/cifar10/train_full.sh
#!/usr/bin/env sh
set -e

TOOLS=./build/tools

$TOOLS/caffe train \
--solver=examples/cifar10/cifar10_full_solver.prototxt $@

# reduce learning rate by factor of 10
$TOOLS/caffe train \
--solver=examples/cifar10/cifar10_full_solver_lr1.prototxt \
--snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate.h5 $@
# --snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate $@

# reduce learning rate by factor of 10
$TOOLS/caffe train \
--solver=examples/cifar10/cifar10_full_solver_lr2.prototxt \
--snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate.h5 $@
# --snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate $@

# 무슨 이유에선지 cifar10_full_iter_60000.solverstate 대신 cifar10_full_iter_60000.solverstate.h5 이라는 파일이 생기므로 그에 따라 파일 이름 변경

[user1@ac922 caffe]$ time ./examples/cifar10/train_full.sh
I0313 14:15:55.463438 114263 caffe.cpp:204] Using GPUs 0
I0313 14:15:55.529319 114263 caffe.cpp:209] GPU 0: Tesla V100-SXM2-16GB
...
I0313 14:34:16.333791 126407 solver.cpp:239] Iteration 69800 (177.976 iter/s, 1.12375s/200 iters), loss = 0.332006
I0313 14:34:16.333875 126407 solver.cpp:258] Train net output #0: loss = 0.332006 (* 1 = 0.332006 loss)
I0313 14:34:16.333892 126407 sgd_solver.cpp:112] Iteration 69800, lr = 1e-05
I0313 14:34:17.436130 126413 data_layer.cpp:73] Restarting data prefetching from start.
I0313 14:34:17.453459 126407 solver.cpp:478] Snapshotting to HDF5 file examples/cifar10/cifar10_full_iter_70000.caffemodel.h5
I0313 14:34:17.458664 126407 sgd_solver.cpp:290] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_full_iter_70000.solverstate.h5
I0313 14:34:17.461360 126407 solver.cpp:331] Iteration 70000, loss = 0.294117
I0313 14:34:17.461383 126407 solver.cpp:351] Iteration 70000, Testing net (#0)
I0313 14:34:17.610864 126424 data_layer.cpp:73] Restarting data prefetching from start.
I0313 14:34:17.612763 126407 solver.cpp:418] Test net output #0: accuracy = 0.8169
I0313 14:34:17.612794 126407 solver.cpp:418] Test net output #1: loss = 0.533315 (* 1 = 0.533315 loss)
I0313 14:34:17.612810 126407 solver.cpp:336] Optimization Done.
I0313 14:34:17.612821 126407 caffe.cpp:250] Optimization Done.

real 6m51.615s
user 7m30.483s
sys 1m5.158s

2018년 2월 28일 수요일

AC922의 기본 setup 및 anaconda virutalenv, tensorflow 등의 설치

먼저, AC922에서 CUDA 버전이 아직 불안정한 관계로 일부 손볼 부분이 있습니다. 혹시 그것이 아직 설정되어 있지 않다면 아래 link를 참조하시어 설정하시기 바랍니다. 설정이 제대로 되었다면 "nvidia-smi -l 3" 명령으로 볼 때 아무 문제가 없을 것입니다. 혹시 GPU 사용%에 UNKNOW ERROR 등이 표시된다면 그 설정이 안 된 것입니다.

https://hwengineer.blogspot.kr/2018/01/ac922-ubuntu-container-image-caffe.html

ppc64le 환경에서 EPEL의 rpm은 다음과 같이 download 받습니다.

[root@ac922 ~]$ wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

이걸 다음과 같이 설치하면 /etc/yum.repos.d/epel.repo 등 epel repository 관련 file들이 설치됩니다.

[root@ac922 ~]# rpm -Uvh epel-release-latest-7.noarch.rpm

[root@ac922 ~]# rpm -ql epel-release-7-11.noarch
/etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
/etc/yum.repos.d/epel-testing.repo
/etc/yum.repos.d/epel.repo
/usr/lib/systemd/system-preset/90-epel.preset
/usr/share/doc/epel-release-7
/usr/share/doc/epel-release-7/GPL

/etc/yum.repos.d/epel.repo의 내용을 보면 아래와 같습니다.

[root@ac922 ~]# vi /etc/yum.repos.d/epel.repo
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
#baseurl=http://download.fedoraproject.org/pub/epel/7/$basearch
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=$basearch
failovermethod=priority
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
#baseurl=http://download.fedoraproject.org/pub/epel/7/$basearch/debug
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-debug-7&arch=$basearch
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=1

[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
#baseurl=http://download.fedoraproject.org/pub/epel/7/SRPMS
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-source-7&arch=$basearch
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=1

이제 yum update를 수행한 뒤 yum list를 해보면 epel repository에 python34.ppc64le 등이 보입니다. 이걸 설치하시면 됩니다.

[root@ac922 ~]# yum update

[root@ac922 ~]# yum list | grep python34 | more
python34.ppc64le 3.4.5-5.el7 epel
python34-libs.ppc64le 3.4.5-5.el7 epel
libpeas-loader-python34.ppc64le 1.20.0-1.el7 epel
python34-Cython.ppc64le 0.23.5-1.el7 epel
...

[root@ac922 ~]# yum install python34
...
Running transaction
Installing : python34-3.4.5-5.el7.ppc64le 1/2
Installing : python34-libs-3.4.5-5.el7.ppc64le 2/2
Verifying : python34-libs-3.4.5-5.el7.ppc64le 1/2
Verifying : python34-3.4.5-5.el7.ppc64le 2/2

Installed:
python34.ppc64le 0:3.4.5-5.el7

Dependency Installed:
python34-libs.ppc64le 0:3.4.5-5.el7

Complete!

다만 보통 Redhat OS에 포함된 python34를 이용하기보다는 Anaconda를 많이 이용합니다.
Anaconda 설치는 이미 되어 있다고 가정하겠습니다. 안 되어 있다면 다음과 같이 받아서 저 file을 수행하시면 됩니다.

$ wget https://repo.continuum.io/archive/Anaconda2-5.1.0-Linux-ppc64le.sh
$ wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-ppc64le.sh

여기서는 Anaconda2를 사용한다고 가정하겠습니다.

[user1@ac922 ~]$ which conda
~/anaconda2/bin/conda

virutalenv는 다음과 같이 x86과 동일하게 생성하시면 됩니다. 여기서는 python 3.6 환경을 만들겠습니다.

[user1@ac922 ~]$ conda create -n env_py36 python=3.6 anaconda
...
isort 4.2.15: ###################################################################################### | 100%
six 1.11.0: ######################################################################################## | 100%
conda-build 3.4.1: ################################################################################# | 100%
numpy 1.13.3: ###################################################################################### | 100%
click 6.7: ######################################################################################### | 100%
send2trash 1.4.2: ################################################################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use:
# > source activate env_py36
#
# To deactivate an active environment, use:
# > source deactivate
#

virtualenv의 사용법도 x86에서와 동일합니다. 다만 pip 등의 이름이 pip3가 아니라 그냥 pip일 뿐입니다. pip 버전은 물론 python 3.6에 딸린 버전입니다.

[user1@ac922 ~]$ source activate env_py36

(env_py36) [user1@ac922 ~]$ which python
~/anaconda2/envs/env_py36/bin/python

(env_py36) [user1@ac922 ~]$ which pip
~/anaconda2/envs/env_py36/bin/pip

(env_py36) [user1@ac922 ~]$ pip --version
pip 9.0.1 from /home/user1/anaconda2/envs/env_py36/lib/python3.6/site-packages (python 3.6)

이제 jupyterhub나 sudospawner 등 필요하신 것을 그대로 설치하시면 됩니다.

(env_py36) [user1@ac922 ~]$ pip install jupyterhub
...
Successfully built alembic python-oauth2 Mako python-editor
Installing collected packages: pamela, Mako, python-editor, alembic, python-oauth2, jupyterhub
Successfully installed Mako-1.0.7 alembic-0.9.8 jupyterhub-0.8.1 pamela-0.3.0 python-editor-1.0.3 python-oauth2-1.1.0

(env_py36) [user1@ac922 ~]$ pip install sudospawner
...
Requirement already satisfied: parso==0.1.* in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from jedi>=0.10->ipython>=4.0.0->ipykernel->notebook->sudospawner)
Requirement already satisfied: wcwidth in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from prompt_toolkit<2.0.0,>=1.0.4->ipython>=4.0.0->ipykernel->notebook->sudospawner)
Installing collected packages: sudospawner
Successfully installed sudospawner-0.5.1

다만, tensorflow도 그대로 설치하시려 하면 아래처럼 error가 날 겁니다.

(env_py36) [user1@ac922 ~]$ pip install tensorflow
Collecting tensorflow
Could not find a version that satisfies the requirement tensorflow (from versions: )
No matching distribution found for tensorflow

이는 아직 ppc64le 환경에서는 CUDA 9을 위한 tensorflow 1.4.1은 없기 때문입니다. 대신 제가 source에서 빌드해놓은 wheel file이 있습니다. (혹시 필요하실 경우) tensorflow 1.4.1을 source에서 빌드하는 방법은 이 link를 참조하시면 됩니다. 제가 빌드한 tensorflow-1.4.1-cp36-cp36m-linux_ppc64le.whl 파일은 여기서 download 받으시면 됩니다.

(env_py36) [user1@ac922 ~]$ ls files
...
tensorflow-1.4.1-cp36-cp36m-linux_ppc64le.whl

이 wheel 파일을 다음과 같이 pip로 설치하시면 됩니다.

(env_py36) [user1@ac922 ~]$ pip install ./files/tensorflow-1.4.1-cp36-cp36m-linux_ppc64le.whl
Processing ./files/tensorflow-1.4.1-cp36-cp36m-linux_ppc64le.whl
Requirement already satisfied: six>=1.10.0 in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from tensorflow==1.4.1)
Requirement already satisfied: numpy>=1.12.1 in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from tensorflow==1.4.1)
Collecting protobuf>=3.3.0 (from tensorflow==1.4.1)
Downloading protobuf-3.5.1-py2.py3-none-any.whl (388kB)
100% |████████████████████████████████| 389kB 1.5MB/s
Collecting tensorflow-tensorboard<0.5.0,>=0.4.0rc1 (from tensorflow==1.4.1)
Downloading tensorflow_tensorboard-0.4.0-py3-none-any.whl (1.7MB)
100% |████████████████████████████████| 1.7MB 629kB/s
Collecting enum34>=1.1.6 (from tensorflow==1.4.1)
Downloading enum34-1.1.6-py3-none-any.whl
Requirement already satisfied: wheel>=0.26 in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from tensorflow==1.4.1)
Requirement already satisfied: setuptools in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from protobuf>=3.3.0->tensorflow==1.4.1)
Collecting bleach==1.5.0 (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.1)
Downloading bleach-1.5.0-py2.py3-none-any.whl
Collecting html5lib==0.9999999 (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.1)
Downloading html5lib-0.9999999.tar.gz (889kB)
100% |████████████████████████████████| 890kB 1.2MB/s
Collecting markdown>=2.6.8 (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.1)
Downloading Markdown-2.6.11-py2.py3-none-any.whl (78kB)
100% |████████████████████████████████| 81kB 1.5MB/s
Requirement already satisfied: werkzeug>=0.11.10 in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.1)
Building wheels for collected packages: html5lib
Running setup.py bdist_wheel for html5lib ... done
Stored in directory: /home/user1/.cache/pip/wheels/6f/85/6c/56b8e1292c6214c4eb73b9dda50f53e8e977bf65989373c962
Successfully built html5lib
Installing collected packages: protobuf, html5lib, bleach, markdown, tensorflow-tensorboard, enum34, tensorflow
Found existing installation: html5lib 1.0.1
Uninstalling html5lib-1.0.1:
Successfully uninstalled html5lib-1.0.1
Found existing installation: bleach 2.1.2
Uninstalling bleach-2.1.2:
Successfully uninstalled bleach-2.1.2
Successfully installed bleach-1.5.0 enum34-1.1.6 html5lib-0.9999999 markdown-2.6.11 protobuf-3.5.1 tensorflow-1.4.1 tensorflow-tensorboard-0.4.0

(env_py36) [user1@ac922 ~]$ pip list | grep tensorflow
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
tensorflow (1.4.1)
tensorflow-tensorboard (0.4.0)

이 tensorflow의 이름이 tensorflow-gpu가 아니라고 해서 이것이 CPU 버전의 tensorlfow인 것은 아닙니다. Source로부터 build한 것은 tensorlfow-gpu와 tensorflow가 따로 있진 않고, 이것도 GPU를 이용하도록 opt=cuda로 build한 것입니다. 다음과 같이 간단한 test로 tensorlfow가 V100 GPU를 제대로 물고 오는 것을 확인하실 수 있습니다.

(env_py36) [user1@ac922 ~]$ python
Python 3.6.4 |Anaconda, Inc.| (default, Feb 11 2018, 08:19:13)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> sess=tf.Session()
2018-02-28 15:08:07.063120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
totalMemory: 15.75GiB freeMemory: 15.33GiB
2018-02-28 15:08:07.432906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-02-28 15:08:07.856193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
totalMemory: 15.75GiB freeMemory: 15.33GiB
2018-02-28 15:08:08.288151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-02-28 15:08:08.292038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-02-28 15:08:08.295822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3
2018-02-28 15:08:08.295838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0: Y Y N N
2018-02-28 15:08:08.295851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1: Y Y N N
2018-02-28 15:08:08.295863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2: N N Y Y
2018-02-28 15:08:08.295874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3: N N Y Y
2018-02-28 15:08:08.295894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2018-02-28 15:08:08.295909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0004:05:00.0, compute capability: 7.0)
2018-02-28 15:08:08.295921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0035:03:00.0, compute capability: 7.0)
2018-02-28 15:08:08.295933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)

이때 nvidia-smi로 다음과 같이 tensorflow가 GPU를 점유하는 것을 보실 수 있습니다.

Wed Feb 28 15:08:17 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.36 Driver Version: 387.36 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 37C P0 51W / 300W | 15340MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 42C P0 52W / 300W | 15345MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 39C P0 51W / 300W | 15343MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 43C P0 52W / 300W | 15344MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 10141 C python 15336MiB |
| 1 10141 C python 15334MiB |
| 2 10141 C python 15332MiB |
| 3 10141 C python 15338MiB |
+-----------------------------------------------------------------------------+