Tuesday, August 21, 2018

A Python coding guide for TensorFlow LMS (README-LMS.md)

TensorFlow 1.8, as shipped in PowerAI 5.2, integrates IBM's Large Model Support (LMS) feature. When GPU memory runs short, LMS lets you use the host server's RAM as if it were GPU memory, much like swapping in and out to disk swap space when real memory is exhausted.

In Caffe-ibm this is implemented as the -lms option, so it can be used without any extra coding. TensorFlow, however, is itself a Python library, so using LMS requires a small amount of Python coding. There is no guide for this on the internet yet, but /opt/DL/tensorflow/doc/README-LMS.md, which comes with PowerAI 5.2, contains a fairly detailed explanation.

IBM PowerAI 5.2 is free software that customers who own a Minsky or AC922 server can order at no additional cost. (You do, however, need to place an official order to receive it.) For your convenience, that README-LMS.md is posted below.

As described in detail in the document below, there are broadly two ways to train in TensorFlow, and accordingly two ways to enable LMS. My rough understanding is summarized as follows.

1) Session-based training
If the word "graph" appears in your existing TensorFlow Python code, insert the block below before the training part to enable LMS.

from tensorflow.contrib.lms import LMS
lms_obj = LMS({'adam_optimizer'})
lms_obj.run(graph=tf.get_default_graph())   # insert these three lines

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})

2) Estimator-based training:
If the word "hook" appears in your existing TensorFlow Python code, insert the block below to enable LMS.

from tensorflow.contrib.lms import LMSHook
lms_hook = LMSHook({'adam_optimizer'})    # insert this part
...

mnist_classifier.train(
      input_fn=train_input_fn,
      steps=20000,
      hooks=[logging_hook, lms_hook])  # insert lms_hook


Two parameters matter when tuning LMS. In practice it seems you rarely need to set n_tensors directly in your code, while lb is used from time to time.

n_tensors : the number of tensors to swap out to the host server's RAM and swap back in when needed. The more tensors you swap, the less GPU memory you use, but the worse the training performance. The default is -1, which treats every reachable tensor as a swap candidate; after an initial estimation, TensorFlow LMS then settles on an appropriate value automatically.

lb : short for Lower Bound, it controls how early a tensor is swapped back in: a tensor is brought back at least lb nodes before the operation that uses it. The larger the value, the better the performance, but the more GPU memory is used. The default is 1.
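
For example, these values are passed when creating the LMS (or LMSHook) object. The sketch below assumes the optimizer scope is named 'adam_optimizer'; the numbers are only illustrative, not recommendations:

from tensorflow.contrib.lms import LMS
lms_obj = LMS({'adam_optimizer'}, n_tensors=80, lb=3)   # illustrative values only
lms_obj.run(graph=tf.get_default_graph())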


For the full details, see the README-LMS.md below.

[bsyu@p57a22 doc]$ vi /opt/DL/tensorflow/doc/README-LMS.md

# TFLMS: Graph Editing Library for Large Model Support (LMS) in TensorFlow

This library provides an approach to training large models that cannot be fit into GPU memory.
It takes a computational graph defined by users, and automatically adds swap-in and swap-out nodes for transferring tensors from GPUs to the host and vice versa.
The computational graph is statically modified. Hence, it needs to be done before a session actually starts.

## How to use
TFLMS needs to know some information about user-defined models.
There is one requirement for a user-defined model: it must have scopes for the optimizers/solvers.

Enabling LMS for a model depends on how users write their training. The following are guidelines for the two approaches: [Session](https://www.tensorflow.org/programmers_guide/graphs)-based training and [Estimator](https://www.tensorflow.org/programmers_guide/estimators)-based training.

### [Session](https://www.tensorflow.org/programmers_guide/graphs)-based training
#### Step 1: define optimizer/solver scopes
```python
with tf.name_scope('adam_optimizer'):
        train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
```
#### Step 2: define an LMS object and run it
```python
from tensorflow.contrib.lms import LMS
lms_obj = LMS({'adam_optimizer'})
lms_obj.run(graph=tf.get_default_graph())
```
The above lines must be put before starting a training session, for example:
- Before inserting LMS code
```python
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
```
- After inserting LMS code
```python
from tensorflow.contrib.lms import LMS
lms_obj = LMS({'adam_optimizer'})
lms_obj.run(graph=tf.get_default_graph())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
```
For a working example of LMS integration with Session based training see:
`/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/contrib/lms/examples/mnist_deep_lms.py`
which is an LMS enabled version of `/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/examples/tutorials/mnist/mnist_deep.py`.

### [Estimator](https://www.tensorflow.org/programmers_guide/estimators)-based training
#### Step 1: define optimizer/solver scopes
```python
with tf.name_scope('adam_optimizer'):
      optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
      train_op = optimizer.minimize(
        loss=loss,
        global_step=tf.train.get_global_step())
```
#### Step 2: define an LMSHook (LMSHook and LMS share the same set of parameters)
```python
# Hook for Large Model Support
from tensorflow.contrib.lms import LMSHook
lms_hook = LMSHook({'adam_optimizer'})
```
#### Step 3: add the LMSHook into Estimator's hook list
```python
mnist_classifier.train(
      input_fn=train_input_fn,
      steps=20000,
      hooks=[logging_hook, lms_hook])
```

For a working example of LMS integration with Estimator based training see:
`/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/contrib/lms/examples/cnn_mnist_lms.py`
which is an LMS enabled version of `/opt/DL/tensorflow/lib/python*/site-packages/tensorflow/examples/tutorials/layers/cnn_mnist.py`.

### High-Performance Models
A version of TensorFlow High-Performance Models which includes options to use Distributed Deep Learning is included in the tensorflow-performance-models package. These models also have integrated
Large Model Support.  For more information, see:

`/opt/DL/tensorflow-performance-models/tf_cnn_benchmarks/README.md`

### GPU scaling tips

To achieve better scaling performance with LMS on multiple GPUs,
the training script should be updated to use PowerAI Distributed Deep Learning
and should be run with the `ddlrun` command.  Additionally,
if running on a single system without an InfiniBand setup,
the `--mpiarg -pami_noib` parameter must be added to the ddlrun command
line, for example:

```bash
ddlrun --mpiarg -pami_noib -H host1 python tf_cnn_benchmarks.py --batch_size=512 --num_batches=100 --model=resnet50 --gpu_thread_mode=gpu_shared --num_gpus=1 --display_every=10 --lms=True --lms_lb=3 --lms_n_tensors=80 --variable_update=ddl
```

For more information on using ddlrun, see: `/opt/DL/ddl/doc/README.md` and `/opt/DL/ddl-tensorflow/doc/README.md`.

### Parameters for LMS/LMSHook
#### Required parameters
_graph_ :: the graph we will modify for LMS. This should be the graph of the user-defined neural network. (Not required for LMSHook.)

_optimizer_scopes_ :: scopes for the optimizers/solvers.

#### Optional parameters
_starting_scope_ :: Tensors that are reachable from the operations in this scope will be swapped for LMS. Set this to the scope of the first layer if we would like to modify the whole graph. Default `None`.

_starting_op_names_ :: Tensors that are reachable from the operations with these names will be swapped for LMS. Default `None`.

_excl_scopes_ :: a set of scopes for operations whose tensors will not be swapped out to the host. Default `empty`.

_incl_scopes_ :: a set of scopes for operations whose tensors will be swapped out to the host. Default `empty`.

_excl_types_ :: a set of types for operations whose tensors will not be swapped out to the host. Default `empty`.

_incl_types_ :: a set of types for operations whose tensors will be swapped out to the host. Default `empty`.

_n_tensors_ :: The number of tensors for LMS, counting from the `starting_scope`. To turn off LMS, set `n_tensors` to `0`. Default `-1` (all reachable tensors will be swapped for LMS).

_lb_ :: Lowerbound value for LMS. A tensor will be swapped in during the backward phase at least `lb` nodes before it in the graph. Default `1`.

_ub_ :: Upperbound value for LMS. Default `10000`.

_fuse_swapins_ :: Fuse "close" swap-in operations into one operation. This may improve the performance. Default `False`.

_ctrld_strategy_ :: Two strategies to find control dependency ops for swap-in ops: `chain_rule` and `direct_order`. The `chain_rule` strategy starts from a forward operation, goes forward and finds a corresponding backward operation to be a control dependency operation. The `direct_order` strategy directly takes a backward op in topological order to be a control dependency operation. Both strategies depend on `lb` and `ub` to choose a control dependency operation. While `direct_order` is more exact than `chain_rule` with respect to `lb` and `ub`, it experimentally often results in a smaller maximum batch size than `chain_rule`. Default `chain_rule`.

_swap_branches_ :: If True, LMS will swap tensors in branches in the forward phase. Default `False`.

_branch_threshold_ :: If `swap_branches` is enabled and the topological-sort distance between the consuming operation and generating operation of a tensor is greater than `branch_threshold`, then swap the tensor. Default `0`.

_debug_ :: Debug mode for LMS. Default `False`.

_debug_level_ :: Debug level for LMS (1 or 2). Default `1`.
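
As a rough illustration of how several of these parameters fit together, the sketch below constructs an LMS object with a few of the optional parameters set. The scope names and numeric values are placeholders, not recommendations:

```python
import tensorflow as tf
from tensorflow.contrib.lms import LMS

# A minimal sketch; the scope names and numbers below are placeholders only.
lms_obj = LMS({'adam_optimizer'},        # required: optimizer/solver scopes
              starting_scope='conv1',    # hypothetical scope of the first layer
              excl_scopes={'loss'},      # hypothetical scope whose tensors are never swapped
              n_tensors=80,              # number of tensors to swap
              lb=3,                      # swap a tensor back in at least 3 nodes before use
              fuse_swapins=True)         # fuse "close" swap-in operations
lms_obj.run(graph=tf.get_default_graph())
```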


### Performance Tuning LMS

Once you have enabled LMS graph modification in your code you will want to find the combination of tuning parameters that gives the fastest training time and best accuracy with your model. The goal of the performance tuning is to swap out enough tensors to allow your training to run without hitting out of memory errors, while not swapping too many such that the extra swapping communication overhead degrades performance.

The two tuning parameters you should focus on are `n_tensors` and `lb`.  Since `n_tensors` controls the number of tensors that will be swapped, the higher this is set, the lower the peak GPU memory usage will be. The `lb` controls how soon the tensor is swapped back in before use. A low value of `lb` can make the training on the GPU pause and wait while the swap in finishes.  This will degrade performance. A higher value of `lb` can allow the tensor swap in to finish before it's needed and allow training to run without pause.  The downside to swapping in too early is that more tensors will be in GPU memory at any point in time, resulting in higher peak GPU memory usage.

The tuning thus becomes finding the correct balance between `n_tensors` and `lb` that provides the best performance for a given model.  To start the performance tuning it's suggested that `n_tensors` be set to -1 which will swap all reachable tensors. The `lb` should be set to the default 1, which is the latest possible swap in. If `tf.logging` verbosity is set to `tf.logging.INFO`, LMS will output a log statement with a count of the number of tensors swapped. It is useful to run with `n_tensors=-1` for the first run to find this maximum value and then adjust it downward. If your model has branches like some UNet models do, you will likely want to set `swap_branches=True` and tune the branch threshold as well.
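
For example, a first tuning pass along these lines might look like the following sketch; the optimizer scope name is an assumption about your model:

```python
import tensorflow as tf
from tensorflow.contrib.lms import LMS

# First tuning pass: swap all reachable tensors (n_tensors=-1) with the latest
# possible swap-in (lb=1), and raise logging verbosity so LMS reports how many
# tensors it actually swapped. 'adam_optimizer' is an assumed scope name.
tf.logging.set_verbosity(tf.logging.INFO)
lms_obj = LMS({'adam_optimizer'}, n_tensors=-1, lb=1)
lms_obj.run(graph=tf.get_default_graph())
```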

By default LMS will analyze your graph to find the starting operations to use for finding tensor swap candidates. You can bypass this analysis by placing your starting operations in a named scope and providing the scope on the `starting_scope` parameter, or by providing the names of the starting operations on the `starting_op_names` parameter. This can speed up repeated runs of LMS during tuning. Furthermore, you can enable `debug=True` and `debug_level=1` and LMS will print out the name and type of the starting operations it finds. These names could be passed in on the `starting_op_names` parameter on subsequent runs.
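
For instance, a subsequent tuning run could skip the graph analysis by naming the starting operations directly; the operation names below are purely hypothetical and would normally be copied from the debug output of an earlier run:

```python
from tensorflow.contrib.lms import LMSHook

# Hypothetical sketch: reuse starting operation names printed by a previous
# run with debug=True, debug_level=1, so LMS can skip its graph analysis.
lms_hook = LMSHook({'adam_optimizer'},
                   starting_op_names={'reshape/Reshape', 'conv1/Conv2D'},
                   debug=True,
                   debug_level=1)
```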

It is recommended that you start with tuning training on a single GPU before enabling your code for multi-GPU with DDL.
