As in the previous post, we will build Spark in an Ubuntu 16.04 LTS ppc64le environment, carry the resulting binaries over to another POWER server running the same OS, and verify that they run properly.
Install the required base OS filesets as follows.
u0017496@sys-85548:~/spark$ sudo apt install openjdk-8-jdk cmake automake autoconf texlive-latex-base maven
Next, R must be installed first; see the earlier post for that. Once the R installation is done, download the R packages needed for the SparkR build as follows. (The R.tgz uploaded in the earlier post already contains these R packages, so you may simply use that instead.)
u0017496@sys-85548:/usr/local/lib/R$ sudo R
> install.packages("knitr")
Installing package into '/usr/local/lib/R/site-library'
> install.packages("e1071")
Installing package into '/usr/local/lib/R/site-library'
> install.packages('survival')
Installing package into '/usr/local/lib/R/site-library'
> install.packages('rmarkdown')
Installing package into '/usr/local/lib/R/site-library'
> install.packages('testthat')
Installing package into '/usr/local/lib/R/site-library'
> install.packages("rJava")
Installing package into '/usr/local/lib/R/site-library'
u0017496@sys-85548:/usr/local/lib/R$ ls -l site-library/
total 120
drwxrwxr-x 7 root staff 4096 Feb 6 23:54 mime
drwxrwxr-x 8 root staff 4096 Feb 6 23:56 stringi
drwxrwxr-x 7 root staff 4096 Feb 6 23:56 magrittr
drwxrwxr-x 9 root staff 4096 Feb 6 23:56 digest
drwxrwxr-x 7 root staff 4096 Feb 6 23:56 highr
drwxrwxr-x 8 root staff 4096 Feb 6 23:56 yaml
drwxrwxr-x 11 root staff 4096 Feb 6 23:56 markdown
drwxrwxr-x 9 root staff 4096 Feb 6 23:56 stringr
drwxrwxr-x 6 root staff 4096 Feb 6 23:56 evaluate
drwxrwxr-x 14 root staff 4096 Feb 6 23:57 knitr
drwxrwxr-x 10 root staff 4096 Feb 7 02:12 MASS
drwxrwxr-x 8 root staff 4096 Feb 7 02:12 class
drwxrwxr-x 8 root staff 4096 Feb 7 02:12 e1071
drwxrwxr-x 10 root staff 4096 Feb 7 04:33 lattice
drwxrwxr-x 12 root staff 4096 Feb 7 04:34 Matrix
drwxrwxr-x 9 root staff 4096 Feb 7 04:35 survival
drwxrwxr-x 16 root staff 4096 Feb 7 04:37 Rcpp
drwxrwxr-x 7 root staff 4096 Feb 7 04:37 bitops
drwxrwxr-x 6 root staff 4096 Feb 7 04:37 backports
drwxrwxr-x 7 root staff 4096 Feb 7 04:37 base64enc
drwxrwxr-x 8 root staff 4096 Feb 7 04:37 jsonlite
drwxrwxr-x 7 root staff 4096 Feb 7 04:37 htmltools
drwxrwxr-x 7 root staff 4096 Feb 7 04:37 caTools
drwxrwxr-x 7 root staff 4096 Feb 7 04:37 rprojroot
drwxrwxr-x 8 root staff 4096 Feb 7 04:37 rmarkdown
drwxrwxr-x 6 root staff 4096 Feb 7 07:03 crayon
drwxrwxr-x 6 root staff 4096 Feb 7 07:03 praise
drwxrwxr-x 7 root staff 4096 Feb 7 07:03 R6
drwxrwxr-x 9 root staff 4096 Feb 7 07:04 testthat
drwxrwxr-x 10 root staff 4096 Feb 7 21:19 rJava
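As a quick sanity check before moving on (a minimal sketch; it assumes Rscript from the R installation above is on the PATH), you can confirm that each build-time package loads cleanly:
# load each package in a batch R session; any failure aborts with a non-zero exit
Rscript -e 'for (p in c("knitr","e1071","survival","rmarkdown","testthat","rJava")) library(p, character.only=TRUE)' && echo "all SparkR build dependencies load OK"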
After that, start the spark build.
u0017496@sys-85548:~$ git clone https://github.com/apache/spark.git
u0017496@sys-85548:~$ cd spark
First, declare JAVA_HOME as follows.
u0017496@sys-85548:~/spark$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-ppc64el
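The Spark Maven build is memory-hungry, so it is worth raising Maven's heap and JIT code-cache limits beforehand; the values below are the usual recommendation from the Spark build documentation, not anything ppc64le-specific.
# give Maven enough heap and code cache for the Spark build
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"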
To create a package for distribution, use make-distribution.sh under the dev directory. Note that MAKE_R defaults to false, so if you intend to use Spark together with R, you must pass the --r option. Also, edit the script in advance as follows to give the tar directory a simpler name.
u0017496@sys-85548:~/spark$ vi dev/make-distribution.sh
...
TARDIR_NAME=spark-bin
# TARDIR_NAME=spark-$VERSION-bin-$NAME
...
tar czf "spark-$VERSION-bin.tgz" -C "$SPARK_HOME" "$TARDIR_NAME"
# tar czf "spark-$VERSION-bin-$NAME.tgz" -C "$SPARK_HOME" "$TARDIR_NAME"
...
In addition, ng, the launcher executable for the zinc server that the build uses (zinc is an incremental compile server for Scala), ships as an x86 binary as shown below, so running the build as-is will fail with an error.
root@sys-85548:/home/u0017496/spark# file ./build/zinc-0.3.11/bin/ng/linux64/ng
./build/zinc-0.3.11/bin/ng/linux64/ng: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.18, BuildID[sha1]=8946a7b64cafd2b99faac05b088b8943aa0ec2e6, stripped
It must be rebuilt for ppc64le. That in turn requires installing sbt first, which you can do as follows.
u0017496@sys-85548:~$ wget https://dl.bintray.com/sbt/native-packages/sbt/0.13.13/sbt-0.13.13.tgz
u0017496@sys-85548:~$ tar -zxvf sbt-0.13.13.tgz
u0017496@sys-85548:~$ export PATH=$PATH:~/sbt-launcher-packaging-0.13.13/bin
u0017496@sys-85548:~$ which sbt
/home/u0017496/sbt-launcher-packaging-0.13.13/bin/sbt
sbt is now ready to use. Next, build zinc.
u0017496@sys-85548:~$ git clone https://github.com/typesafehub/zinc.git
u0017496@sys-85548:~$ cd zinc/
u0017496@sys-85548:~/zinc$ sbt universal:packageZipTarball
Overwrite the ng in the spark build directory with the ng generated in the linux32 directory of this build.
u0017496@sys-85548:~/zinc$ cp ./src/universal/bin/ng/linux32/ng ~/spark/build/zinc-0.3.11/bin/ng/linux32/ng
u0017496@sys-85548:~$ file ./spark/build/zinc-0.3.11/bin/ng/linux32/ng
/home/u0017496/spark/build/zinc-0.3.11/bin/ng/linux32/ng: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), dynamically linked, interpreter /lib64/ld64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=142230d9ad8099f606dcf8144b308bc938915812, stripped
Note: if you run the build as-is, you may hit the error 'Required file not found: scala-compiler-2.11.8.jar' along the way, even though the file actually exists. This is resolved by shutting down the zinc server as follows.
u0017496@sys-85548:/home/u0017496/zinc# ./src/universal/bin/zinc -shutdown
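If you want to make sure the zinc server is really gone before retrying the build, you can probe its default nailgun port (3030, unless you changed it); this is just an optional check, not part of the official procedure.
# no output from ss means nothing is listening on zinc's default port
ss -lnt | grep 3030 || echo "zinc is not listening"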
Preparation is now complete. Simply run the dev/make-distribution.sh script as follows. This build takes a very long time: on an IBM Power Cloud instance with just 2 virtual CPUs it takes a good 140 minutes.
u0017496@sys-85548:~/spark$ ./dev/make-distribution.sh --name spark-ppc64le --tgz -Psparkr -Phive -Phive-thriftserver
When the build finishes, a tgz file like the following is created.
u0017496@sys-85548:~/spark$ ls -l *.tgz
-rw-r--r-- 1 root root 177688849 Feb 9 10:12 spark-2.2.0-SNAPSHOT-bin.tgz
First extract it under the /usr/local directory, then export the resulting spark-bin directory as SPARK_HOME.
u0017496@sys-85548:/usr/local$ sudo tar -zxvf ~/spark/spark-2.2.0-SNAPSHOT-bin.tgz
u0017496@sys-85548:/usr/local$ tail ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-ppc64el
export R_HOME=/usr/local/lib/R
export SPARK_HOME=/usr/local/spark-bin
export PATH=$SPARK_HOME/bin:$R_HOME/bin:$PATH
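After sourcing ~/.bashrc, a quick way to confirm the distribution is picked up correctly (a simple check) is to ask for the version banner:
source ~/.bashrc
# prints the ASCII-art banner and "version 2.2.0-SNAPSHOT" if the binaries work
spark-submit --version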
Also, in the R directory under the original spark directory, the SparkR package for R has been generated as a gz file, as shown below.
u0017496@sys-85548:~/spark$ ls R/*.gz
R/SparkR_2.2.0-SNAPSHOT.tar.gz
Let's install this SparkR package into R.
u0017496@sys-85548:/usr/local/spark-bin$ sudo R
> pkgPath <- '/home/u0017496/spark/R/SparkR_2.2.0-SNAPSHOT.tar.gz'
> install.packages(pkgPath)
Installing package into '/usr/local/lib/R/site-library'
..
You can see the result in /usr/local/lib/R/site-library. The R.tgz with the SparkR package shown below already installed has been uploaded in the earlier post, so you may skip this whole step and use that as-is.
u0017496@sys-85548:/usr/local/spark-bin$ ls -ltr /usr/local/lib/R/site-library
total 124
drwxrwxr-x 7 root staff 4096 Feb 6 23:54 mime
drwxrwxr-x 8 root staff 4096 Feb 6 23:56 stringi
drwxrwxr-x 7 root staff 4096 Feb 6 23:56 magrittr
drwxrwxr-x 9 root staff 4096 Feb 6 23:56 digest
drwxrwxr-x 7 root staff 4096 Feb 6 23:56 highr
drwxrwxr-x 8 root staff 4096 Feb 6 23:56 yaml
drwxrwxr-x 11 root staff 4096 Feb 6 23:56 markdown
drwxrwxr-x 9 root staff 4096 Feb 6 23:56 stringr
drwxrwxr-x 6 root staff 4096 Feb 6 23:56 evaluate
drwxrwxr-x 14 root staff 4096 Feb 6 23:57 knitr
drwxrwxr-x 10 root staff 4096 Feb 7 02:12 MASS
drwxrwxr-x 8 root staff 4096 Feb 7 02:12 class
drwxrwxr-x 8 root staff 4096 Feb 7 02:12 e1071
drwxrwxr-x 10 root staff 4096 Feb 7 04:33 lattice
drwxrwxr-x 12 root staff 4096 Feb 7 04:34 Matrix
drwxrwxr-x 9 root staff 4096 Feb 7 04:35 survival
drwxrwxr-x 16 root staff 4096 Feb 7 04:37 Rcpp
drwxrwxr-x 7 root staff 4096 Feb 7 04:37 bitops
drwxrwxr-x 6 root staff 4096 Feb 7 04:37 backports
drwxrwxr-x 7 root staff 4096 Feb 7 04:37 base64enc
drwxrwxr-x 8 root staff 4096 Feb 7 04:37 jsonlite
drwxrwxr-x 7 root staff 4096 Feb 7 04:37 htmltools
drwxrwxr-x 7 root staff 4096 Feb 7 04:37 caTools
drwxrwxr-x 7 root staff 4096 Feb 7 04:37 rprojroot
drwxrwxr-x 8 root staff 4096 Feb 7 04:37 rmarkdown
drwxrwxr-x 6 root staff 4096 Feb 7 07:03 crayon
drwxrwxr-x 6 root staff 4096 Feb 7 07:03 praise
drwxrwxr-x 7 root staff 4096 Feb 7 07:03 R6
drwxrwxr-x 9 root staff 4096 Feb 8 02:56 testthat
drwxrwxr-x 10 root staff 4096 Feb 8 02:57 rJava
drwxrwxr-x 10 root staff 4096 Feb 9 19:39 SparkR
Now let's move these binaries to a separate server and install them there.
First, tar up the entire /usr/local/lib/R directory containing that SparkR package, move it to the target server, and extract it under the same /usr/local/lib/ path.
u0017496@sys-85549:/usr/local/lib$ tar -zxvf R.tgz R
Next, move the spark-2.2.0-SNAPSHOT-bin.tgz built above to the target server and extract it under the /usr/local directory. The spark-bin directory then appears as below; export it as SPARK_HOME.
u0017496@sys-85549:/usr/local$ sudo tar -zxvf /tmp/spark-2.2.0-SNAPSHOT-bin.tgz
u0017496@sys-85549:/usr/local$ tail ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-ppc64el
export R_HOME=/usr/local/lib/R
export SPARK_HOME=/usr/local/spark-bin
export PATH=$SPARK_HOME/bin:$R_HOME/bin:$PATH
u0017496@sys-85549:/usr/local$ cd spark-bin
For Spark to start, passwordless ssh must work even to the local server itself. Set that up as follows.
u0017496@sys-85549:/usr/local/spark-bin/sbin$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/u0017496/.ssh/id_rsa):
Created directory '/home/u0017496/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
u0017496@sys-85549:/usr/local/spark-bin/sbin$ ssh-copy-id localhost
Now start the spark master.
u0017496@sys-85549:/usr/local/spark-bin/sbin$ sudo mkdir /usr/local/spark-bin/logs
u0017496@sys-85549:/usr/local/spark-bin/sbin$ sudo chown -R u0017496:u0017496 /usr/local/spark-bin/logs
u0017496@sys-85549:/usr/local/spark-bin/sbin$ ./start-all.sh
Start sparkR and check that it connects to this spark master.
u0017496@sys-85549:/usr/local/spark-bin/bin$ sparkR
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0-SNAPSHOT
/_/
SparkSession available as 'spark'.
>
It worked.
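Beyond the interactive prompt, you can also exercise SparkR non-interactively. The Spark source tree ships a small R example; assuming you still have the source tree from the build (or copy the single file over), something like this should run end to end:
# run the stock SparkR DataFrame example on the local master
spark-submit ~/spark/examples/src/main/r/dataframe.R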
Now let's draw a graph from sparkR. To do that, first set DISPLAY to the PC client where Xmanager is running.
u0017496@sys-85549:/usr/local/spark-bin/bin$ who
u0017496 pts/0 2017-02-09 19:21 (172.29.97.138)
u0017496@sys-85549:/usr/local/spark-bin/bin$ export DISPLAY=172.29.97.138:0
u0017496@sys-85549:/usr/local/spark-bin/bin$ sparkR
> plot(BOD, type = 'l')
You can then see the graph drawn on the PC client through Xmanager.
The tgz files built here have been uploaded to the URL below, so you may use them as-is. Again, please keep in mind that I bear no legal or technical responsibility for them.
https://drive.google.com/drive/folders/0B-F0jEb44gqUN3cxaGVDdklwdWc?usp=sharing
Friday, February 10, 2017
Tuesday, February 7, 2017
Installing an R server and R-Studio on Linux on POWER (ppc64le)
Revolution R became MS-R when it was sold to MS, but it is still open source and its source code remains public. We will build MS-R and R-Studio in an Ubuntu 16.04 LTS ppc64le environment, carry the binaries over to another POWER server running the same OS, and verify that they run properly.
Install the required base OS filesets as follows. (Some of them are probably unnecessary, but let's just install them all for now.)
u0017496@sys-85548:~$ sudo apt-get install gfortran gfortran-5 libgfortran-5-dev openssl libcurl4-openssl-dev libglobus-openssl-module-dev libboost-dev libboost-regex1.58-dev libboost-filesystem1.58-dev libboost-math1.58-dev libboost-tools-dev libboost-date-time1.58-dev libboost-iostreams1.58-dev libboost-program-options1.58-dev libboost-signals1.58-dev libboost-thread1.58-dev libboost-chrono1.58-dev libpam0g-dev libr3-0 xorg-dev xserver-xorg-dev nautilus ubuntu-gnome-desktop r-base r-base-dev r-base-core r-base-html cmake openssl libcurl4-openssl-dev libglobus-openssl-module-dev libboost-dev libboost-regex1.58-dev libboost-filesystem1.58-dev libboost-math1.58-dev libboost-tools-dev libboost-date-time1.58-dev libboost-iostreams1.58-dev libboost-program-options1.58-dev libboost-signals1.58-dev libboost-thread1.58-dev libboost-chrono1.58-dev libpam0g-dev libr3-0 xorg-dev xserver-xorg-dev nautilus ubuntu-gnome-desktop uuid-dev libuuid1 ant libasio-dev cmake automake autoconf libpango1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0 libpangomm-1.4-1v5 libpangoxft-1.0-0 gir1.2-coglpango-1.0 gir1.2-pango-1.0 libcogl-pango20
First, build the MS-R server. Download the source code from Github.
u0017496@sys-85548:~$ wget https://github.com/Microsoft/microsoft-r-open/archive/MRO-3.3.2.tar.gz
u0017496@sys-85548:~$ tar -zxvf MRO-3.3.2.tar.gz
u0017496@sys-85548:~$ cd microsoft-r-open-MRO-3.3.2/source
u0017496@sys-85548:~/microsoft-r-open-MRO-3.3.2/source$ ./configure --enable-R-shlib --without-recommended-packages ; make ; sudo make install
The R server built and installed this way is placed under the /usr/local/lib/R directory; we will now tar up that entire directory.
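First, though, a quick smoke test of the fresh install is worthwhile (a minimal check, nothing MRO-specific):
# confirm the new interpreter starts and can evaluate an expression
/usr/local/lib/R/bin/R --version
/usr/local/lib/R/bin/Rscript -e 'sum(1:10)'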
u0017496@sys-85548:~/microsoft-r-open-MRO-3.3.2/source$ cd /usr/local/lib/
u0017496@sys-85548:/usr/local/lib$ sudo tar -zcvf R.tgz R
u0017496@sys-85548:/usr/local/lib$ ls -l R.tgz
-rw-r--r-- 1 root root 27970104 Feb 6 03:59 R.tgz
We will move this R.tgz file over to another server and install it there shortly.
Now build R-Studio.
u0017496@sys-85548:~$ wget https://github.com/rstudio/rstudio/archive/master.zip
u0017496@sys-85548:~$ unzip master.zip
u0017496@sys-85548:~$ cd rstudio-master
u0017496@sys-85548:~/rstudio-master$ ./package/linux/install-dependencies
u0017496@sys-85548:~/rstudio-master$ cd dependencies/common/
u0017496@sys-85548:~/rstudio-master/dependencies/common$ ./install-cef
u0017496@sys-85548:~/rstudio-master/dependencies/common$ ./install-boost
u0017496@sys-85548:~/rstudio-master/dependencies/common$ ./install-common
u0017496@sys-85548:~/rstudio-master/dependencies/common$ ./install-gwt
u0017496@sys-85548:~/rstudio-master/dependencies/common$ ./install-libclang
u0017496@sys-85548:~/rstudio-master/dependencies/common$ ./install-mathjax
u0017496@sys-85548:~/rstudio-master/dependencies/common$ ./install-packages
u0017496@sys-85548:~/rstudio-master/dependencies/common$ ./install-pandoc
u0017496@sys-85548:~/rstudio-master/dependencies/common$ cd ../..
u0017496@sys-85548:~/rstudio-master$ mkdir build
u0017496@sys-85548:~/rstudio-master$ cd build
u0017496@sys-85548:~/rstudio-master/build$ cmake .. -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_TYPE=Release
u0017496@sys-85548:~/rstudio-master/build$ make ; sudo make install
R-Studio built this way is installed under the /usr/local/lib/rstudio-server directory. Again, let's tar up that entire directory.
u0017496@sys-85548:~/rstudio-master/build$ cd /usr/local/lib
u0017496@sys-85548:/usr/local/lib$ sudo tar -zcvf ./rstudio-server.tgz rstudio-server
u0017496@sys-85548:/usr/local/lib$ ls -l rstudio-server.tgz
-rw-r--r-- 1 root root 61493341 Feb 6 03:27 rstudio-server.tgz
* Move these files to the server where MS-R and R-Studio are to be installed, and untar the tgz files there as follows. Before that, however, every base fileset that was installed with apt-get install on the build server above must also be installed in advance on this target server with the same apt-get install command.
u0017496@sys-85549:~$ sudo apt-get install gfortran gfortran-5 libgfortran-5-dev openssl libcurl4-openssl-dev libglobus-openssl-module-dev libboost-dev libboost-regex1.58-dev libboost-filesystem1.58-dev libboost-math1.58-dev libboost-tools-dev libboost-date-time1.58-dev libboost-iostreams1.58-dev libboost-program-options1.58-dev libboost-signals1.58-dev libboost-thread1.58-dev libboost-chrono1.58-dev libpam0g-dev libr3-0 xorg-dev xserver-xorg-dev nautilus ubuntu-gnome-desktop r-base r-base-dev r-base-core r-base-html cmake openssl libcurl4-openssl-dev libglobus-openssl-module-dev libboost-dev libboost-regex1.58-dev libboost-filesystem1.58-dev libboost-math1.58-dev libboost-tools-dev libboost-date-time1.58-dev libboost-iostreams1.58-dev libboost-program-options1.58-dev libboost-signals1.58-dev libboost-thread1.58-dev libboost-chrono1.58-dev libpam0g-dev libr3-0 xorg-dev xserver-xorg-dev nautilus ubuntu-gnome-desktop uuid-dev libuuid1 ant libasio-dev cmake automake autoconf libpango1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0 libpangomm-1.4-1v5 libpangoxft-1.0-0 gir1.2-coglpango-1.0 gir1.2-pango-1.0 libcogl-pango20
u0017496@sys-85549:~$ cd /usr/local/lib
Bring the tar files packaged on the build server earlier into this server's /usr/local/lib directory and extract them. First, check that the R server works properly.
u0017496@sys-85549:/usr/local/lib$ ls -ltr
total 60068
drwxrwsr-x 3 root staff 4096 Jan 13 09:17 python3.5
drwxrwsr-x 4 root staff 4096 Feb 6 01:44 python2.7
drwxrwsr-x 3 root staff 4096 Feb 6 01:50 R
-rw-r--r-- 1 root root 61493341 Feb 6 03:34 rstudio-server.tgz
-rw-r--r-- 1 root root 27970104 Feb 6 04:02 R.tgz
u0017496@sys-85549:/usr/local/lib$ sudo tar -zxvf R.tgz
From the Windows PC client, check with the who command which IP you are connecting from, and export the DISPLAY environment variable to that address.
u0017496@sys-85549:/usr/local/lib$ who
u0017496 pts/0 Feb 6 19:18 (172.29.97.114)
u0017496@sys-85549:/usr/local/lib$ export DISPLAY=172.29.97.114:0
Now, with Xmanager started in passive mode on the client PC at 172.29.97.114, drawing a graph in R as below makes the graphics window appear on the PC.
u0017496@sys-85549:/usr/local/lib$ /usr/local/lib/R/bin/R
> plot(BOD, type = 'l')
Next up is R-Studio. Extract its tar likewise, then register the script named rstudio-server into /etc/init.d as follows.
u0017496@sys-85549:/usr/local/lib$ sudo tar -zxvf rstudio-server.tgz
u0017496@sys-85549:/usr/local/lib$ sudo cp rstudio-server/extras/init.d/debian/rstudio-server /etc/init.d/rstudio-server
u0017496@sys-85549:/usr/local/lib$ sudo update-rc.d rstudio-server defaults
u0017496@sys-85549:/usr/local/lib$ sudo ln -f -s /usr/local/lib/rstudio-server/bin/rstudio-server /usr/sbin/rstudio-server
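RStudio Server ships a built-in self-check, which is handy to run before involving systemd; assuming the symlink above is in place (this subcommand is available in the server builds I have seen), run:
# checks installation, config and permissions, and reports any problem found
sudo rstudio-server verify-installation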
Start rstudio-server with the systemctl command, then check its status.
u0017496@sys-85549:/usr/local/lib$ sudo systemctl start rstudio-server.service
u0017496@sys-85549:/usr/local/lib$ sudo systemctl status rstudio-server.service
● rstudio-server.service - LSB: RStudio Server
Loaded: loaded (/etc/init.d/rstudio-server; bad; vendor preset: enabled)
Active: active (running) since Mon 2017-02-06 19:37:08 EST; 25s ago
Docs: man:systemd-sysv-generator(8)
Process: 19387 ExecStart=/etc/init.d/rstudio-server start (code=exited, status=0/SUCCESS)
Tasks: 3
Memory: 7.5M
CPU: 447ms
CGroup: /system.slice/rstudio-server.service
└─19395 /usr/local/lib/rstudio-server/bin/rserver
Feb 06 19:37:08 sys-85549 systemd[1]: Starting LSB: RStudio Server...
Feb 06 19:37:08 sys-85549 systemd[1]: Started LSB: RStudio Server.
You can now reach it through a web browser at http://<server address>:8787. For the user id and password, simply use those of an OS user.
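If the login page does not come up, first confirm that rserver is actually listening (a generic check):
# the rserver process should be bound to port 8787
sudo ss -lntp | grep 8787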
These files are shared at the link below, so feel free to use them. Of course, please note that I assume no technical or legal responsibility.
https://drive.google.com/drive/folders/0B-F0jEb44gqUN3cxaGVDdklwdWc?usp=sharing
Thursday, February 2, 2017
Running Caffe Alexnet training with nvidia-docker on a Minsky server
As mentioned in the previous post, nvidia-docker makes it easy for multiple users to run deep learning frameworks in a variety of environments without application conflicts between users. This time, we will build Caffe into a docker image on a Minsky server, a ppc64le machine equipped with NVIDIA P100 GPUs, and run Alexnet training with it.
First, create a dockerfile as follows, to build a docker image containing CUDA, CUDNN, and IBM's PowerAI toolkit (which includes Caffe). Note that a directory used with the COPY instruction must end with a '/'.
root@minsky:/data/mydocker# vi dockerfile.caffe
FROM bsyu/p2p:ppc64le-xenial
# RUN executes a shell command
# You can chain multiple commands together with &&
# A \ is used to split long lines to help with readability
# This particular instruction installs the source files
# for deviceQuery by installing the CUDA samples via apt
RUN apt-get update && apt-get install -y cuda
RUN mkdir /tmp/temp
COPY libcudnn5* /tmp/temp/
COPY cuda-repo-* /tmp/temp/
COPY mldl-repo-local_1-3ibm5_ppc64el.deb /tmp/temp/
RUN dpkg -i /tmp/temp/cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el.deb && \
dpkg -i /tmp/temp/libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb && \
dpkg -i /tmp/temp/libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb && \
dpkg -i /tmp/temp/mldl-repo-local_1-3ibm5_ppc64el.deb && \
rm -rf /tmp/temp && \
apt-get update && apt-get install -y caffe-nv libnccl1 && \
rm -rf /var/lib/apt/lists/*
# set the working directory
WORKDIR /opt/DL/caffe-nv/bin
ENV LD_LIBRARY_PATH="/opt/DL/nccl/lib:/opt/DL/openblas/lib:/opt/DL/nccl/lib:/usr/local/cuda-8.0/lib64:/usr/lib:/usr/local/lib"
# CMD defines the default command to be run in the container
# CMD is overridden by supplying a command + arguments to
# `docker run`, e.g. `nvcc --version` or `bash`
CMD ./caffe
Now start the build with this dockerfile. Naturally, the relevant deb files must already be copied into the current host directory.
root@minsky:/data/mydocker# docker build -t bsyu/caffe:ppc64le-xenial -f dockerfile.caffe .
Sending build context to Docker daemon 1.664 GB
Step 1 : FROM bsyu/p2p:ppc64le-xenial
---> 2fe1b4ac3b03
Step 2 : RUN apt-get update && apt-get install -y cuda
---> Using cache
---> ae24f9bb0f23
Step 3 : RUN mkdir /tmp/temp
---> Using cache
---> 5340f9d1b49c
Step 4 : COPY libcudnn5* /tmp/temp/
---> Using cache
---> 4d1ff5eed9f0
Step 5 : COPY mldl-repo-local_1-3ibm5_ppc64el.deb /tmp/temp/
---> Using cache
---> c3e3840d33e5
Step 6 : RUN dpkg -i /tmp/temp/libcudnn5_5.1.5-1+cuda8.0_ppc64el.deb && dpkg -i /tmp/temp/libcudnn5-dev_5.1.5-1+cuda8.0_ppc64el.deb && dpkg -i /tmp/temp/mldl-repo-local_1-3ibm5_ppc64el.deb && rm -rf /tmp/temp && apt-get update && apt-get install -y caffe-nv libnccl1 && rm -rf /var/lib/apt/lists/*
---> Using cache
---> 1868adb5dc10
Step 7 : WORKDIR /opt/DL/caffe-nv/bin
---> Running in 875c714591e5
---> ec7c68de4d7e
Step 8 : ENV LD_LIBRARY_PATH "/opt/DL/nccl/lib:/opt/DL/openblas/lib:/opt/DL/nccl/lib:/usr/local/cuda-8.0/lib64:/usr/lib:/usr/local/lib"
---> Running in e78eaede0f62
---> 7450b81fde8d
Removing intermediate container e78eaede0f62
Step 9 : CMD ./caffe
---> Running in a95e655fee4f
---> be9b92d51239
Now check the docker image. Having thrown in everything within reach, the image size is a bit over 4GB; you really should leave out whatever you don't need.
root@minsky:/data/mydocker# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
bsyu/caffe ppc64le-xenial 2705abb3bbc5 13 seconds ago 4.227 GB
bsyu/p2p ppc64le-xenial 2fe1b4ac3b03 17 hours ago 2.775 GB
registry latest 781e109ba95f 44 hours ago 612.6 MB
127.0.0.1/ubuntu-xenial gevent 4ce0e6ba8a69 44 hours ago 282.5 MB
localhost:5000/ubuntu-xenial gevent 4ce0e6ba8a69 44 hours ago 282.5 MB
ubuntu/xenial gevent 4ce0e6ba8a69 44 hours ago 282.5 MB
bsyu/cuda8-cudnn5-devel cudnn5-devel d8d0da2fbdf2 46 hours ago 1.895 GB
bsyu/cuda 8.0-devel dc3faec17c11 46 hours ago 1.726 GB
bsyu/ppc64le cuda8.0-devel dc3faec17c11 46 hours ago 1.726 GB
cuda 8.0 dc3faec17c11 46 hours ago 1.726 GB
cuda 8.0-devel dc3faec17c11 46 hours ago 1.726 GB
cuda devel dc3faec17c11 46 hours ago 1.726 GB
cuda latest dc3faec17c11 46 hours ago 1.726 GB
cuda8-cudnn5-runtime latest 8a3b0a60e741 46 hours ago 942.2 MB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 46 hours ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 46 hours ago 942.2 MB
cuda8-runtime latest 8e9763b6296f 46 hours ago 844.9 MB
cuda 8.0-runtime 8e9763b6296f 46 hours ago 844.9 MB
cuda runtime 8e9763b6296f 46 hours ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 6 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 6 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 6 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 6 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 6 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 3 months ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
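About the image-size remark above: two generic dockerfile habits help keep such images lean (a sketch only, not tested against this exact dockerfile): skip recommended packages, and clean the apt lists in the same RUN layer so the deletion actually shrinks the image.
# combine install and cleanup in ONE layer; a later rm cannot shrink an earlier layer
RUN apt-get update && \
    apt-get install -y --no-install-recommends caffe-nv libnccl1 && \
    rm -rf /var/lib/apt/lists/*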
Let's run the docker image with nvidia-docker. You can check the Caffe version.
root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/caffe:ppc64le-xenial ./caffe --version
caffe version 0.15.13
root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/caffe:ppc64le-xenial
caffe: command line brew
usage: caffe <command> <args>
commands:
train train or finetune a model
test score a model
device_query show GPU diagnostic information
time benchmark model execution time
Flags from tools/caffe.cpp:
-gpu (Optional; run in GPU mode on given device IDs separated by ','.Use
'-gpu all' to run on all available GPUs. The effective training batch
size is multiplied by the number of devices.) type: string default: ""
-iterations (The number of iterations to run.) type: int32 default: 50
-model (The model definition protocol buffer text file.) type: string
default: ""
-sighup_effect (Optional; action to take when a SIGHUP signal is received:
snapshot, stop or none.) type: string default: "snapshot"
-sigint_effect (Optional; action to take when a SIGINT signal is received:
snapshot, stop or none.) type: string default: "stop"
-snapshot (Optional; the snapshot solver state to resume training.)
type: string default: ""
-solver (The solver definition protocol buffer text file.) type: string
default: ""
-weights (Optional; the pretrained weights to initialize finetuning,
separated by ','. Cannot be set simultaneously with snapshot.)
type: string default: ""
If you run plain docker, the filesystems inside the container look like this. You can see that /nvme, a filesystem of the host server, does not even have a mount point.
root@minsky:/data/mydocker# nvidia-docker run --rm -ti bsyu/caffe:ppc64le-xenial bash
root@8f2141cfade6:/opt/DL/caffe-nv/bin# df -h
Filesystem Size Used Avail Use% Mounted on
none 845G 184G 619G 23% /
tmpfs 256G 0 256G 0% /dev
tmpfs 256G 0 256G 0% /sys/fs/cgroup
/dev/sda2 845G 184G 619G 23% /etc/hosts
shm 64M 0 64M 0% /dev/shm
root@8f2141cfade6:/opt/DL/caffe-nv/bin# cd /nvme
bash: cd: /nvme: No such file or directory
However, if you run with the -v (--volume) option as follows, the host server's filesystem can be used as well.
root@minsky:/data/mydocker# nvidia-docker run --rm -ti -v /nvme:/nvme bsyu/caffe:ppc64le-xenial bash
root@ee2866a65362:/opt/DL/caffe-nv/bin# df -h
Filesystem Size Used Avail Use% Mounted on
none 845G 184G 619G 23% /
tmpfs 256G 0 256G 0% /dev
tmpfs 256G 0 256G 0% /sys/fs/cgroup
/dev/nvme0n1 2.9T 290G 2.5T 11% /nvme
/dev/sda2 845G 184G 619G 23% /etc/hosts
shm 64M 0 64M 0% /dev/shm
root@ee2866a65362:/opt/DL/caffe-nv/bin# ls /nvme
caffe_alexnet_train_iter_102000.caffemodel caffe_alexnet_train_iter_50000.caffemodel data
caffe_alexnet_train_iter_102000.solverstate caffe_alexnet_train_iter_50000.solverstate ilsvrc12_train_lmdb
caffe_alexnet_train_iter_208.caffemodel caffe_alexnet_train_iter_51000.caffemodel ilsvrc12_val_lmdb
caffe_alexnet_train_iter_208.solverstate caffe_alexnet_train_iter_51000.solverstate imagenet_mean.binaryproto
caffe_alexnet_train_iter_28.caffemodel caffe_alexnet_train_iter_56250.caffemodel kkk
caffe_alexnet_train_iter_28.solverstate caffe_alexnet_train_iter_56250.solverstate lost+found
caffe_alexnet_train_iter_37500.caffemodel caffe_alexnet_train_iter_6713.caffemodel solver.prototxt
caffe_alexnet_train_iter_37500.solverstate caffe_alexnet_train_iter_6713.solverstate train_val.prototxt
Now run Alexnet training using the caffe docker image. You can see it runs just fine.
root@minsky:/data/mydocker# nvidia-docker run --rm -v /nvme:/nvme bsyu/caffe:ppc64le-xenial ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
I0202 02:27:22.200032 1 caffe.cpp:197] Using GPUs 0, 1, 2, 3
I0202 02:27:22.201119 1 caffe.cpp:202] GPU 0: Tesla P100-SXM2-16GB
I0202 02:27:22.201659 1 caffe.cpp:202] GPU 1: Tesla P100-SXM2-16GB
I0202 02:27:22.202191 1 caffe.cpp:202] GPU 2: Tesla P100-SXM2-16GB
I0202 02:27:22.202721 1 caffe.cpp:202] GPU 3: Tesla P100-SXM2-16GB
I0202 02:27:23.986641 1 solver.cpp:48] Initializing solver from parameters:
...
I0202 02:27:28.246285 1 parallel.cpp:334] Starting Optimization
I0202 02:27:28.246449 1 solver.cpp:304] Solving AlexNet
I0202 02:27:28.246492 1 solver.cpp:305] Learning Rate Policy: step
I0202 02:27:28.303807 1 solver.cpp:362] Iteration 0, Testing net (#0)
I0202 02:27:44.866096 1 solver.cpp:429] Test net output #0: accuracy = 0.000890625
I0202 02:27:44.866148 1 solver.cpp:429] Test net output #1: loss = 6.91031 (* 1 = 6.91031 loss)
I0202 02:27:45.356459 1 solver.cpp:242] Iteration 0 (0 iter/s, 17.1098s/200 iter), loss = 6.91465
I0202 02:27:45.356503 1 solver.cpp:261] Train net output #0: loss = 6.91465 (* 1 = 6.91465 loss)
I0202 02:27:45.356540 1 sgd_solver.cpp:106] Iteration 0, lr = 0.01
...
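Incidentally, nvidia-docker 1.0 honors the NV_GPU environment variable, so you can hand a container only a subset of the GPUs; for example (a sketch using the same image and solver as above):
# expose only GPU 0 and 1 to the container, and train on just those two
NV_GPU=0,1 nvidia-docker run --rm -v /nvme:/nvme bsyu/caffe:ppc64le-xenial ./caffe train -gpu 0,1 --solver=/nvme/solver.prototxt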
While training runs in the docker container like this, monitor GPU usage on the host server with the nvidia-smi command. You can see that the application using the GPUs shows up under the name caffe.
Thu Feb 2 11:37:52 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107 Driver Version: 361.107 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... On | 0002:01:00.0 Off | 0 |
| N/A 74C P0 242W / 300W | 9329MiB / 16280MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... On | 0003:01:00.0 Off | 0 |
| N/A 69C P0 256W / 300W | 8337MiB / 16280MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-SXM2... On | 0006:01:00.0 Off | 0 |
| N/A 75C P0 244W / 300W | 8337MiB / 16280MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-SXM2... On | 0007:01:00.0 Off | 0 |
| N/A 67C P0 222W / 300W | 8337MiB / 16280MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 121885 C ./caffe 9317MiB |
| 1 121885 C ./caffe 8325MiB |
| 2 121885 C ./caffe 8325MiB |
| 3 121885 C ./caffe 8325MiB |
+-----------------------------------------------------------------------------+
root@minsky:/data/mydocker# ps -ef | grep 121885
root 121885 121867 99 11:27 ? 01:30:11 ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
root 121992 116723 0 11:39 pts/0 00:00:00 grep --color=auto 121885
Tracing further through that pid shows that the parent of the caffe process is docker-containerd.
root@minsky:/data/mydocker# ps -ef | grep 121867
root 121867 106109 0 11:27 ? 00:00:00 docker-containerd-shim 61b16f54712439496aec5d04cca0906425a1106a6dda935f47e228e498ddb94c /var/run/docker/libcontainerd/61b16f54712439496aec5d04cca0906425a1106a6dda935f47e228e498ddb94c docker-runc
root 121885 121867 99 11:27 ? 01:34:04 ./caffe train -gpu 0,1,2,3 --solver=/nvme/solver.prototxt
root 121996 116723 0 11:39 pts/0 00:00:00 grep --color=auto 121867
This docker image has been pushed to https://hub.docker.com/r/bsyu/ , so anyone with a Minsky server is free to pull it and use it.
Wednesday, February 1, 2017
Installing and testing nvidia-docker on a Minsky server with Ubuntu 16.04 (Xenial) ppc64le (POWER8)
GPU servers are fairly expensive machines, so in most cases several researchers share one. But one researcher may be on CUDA 7.5 while another needs 8.0, and one may need Tensorflow r0.9 while another wants to try out r0.12.
In such cases docker, and nvidia-docker in particular, can be a good solution. In the example below, for instance, the host server has nvcc v7.5.17 installed outright.
root@minsky:/data# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:31:50_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
Yet inside a docker container running on that same host, you will see nvcc v8.0.44 in operation (as shown further below).
Let's go through the procedure for using this handy nvidia-docker, step by step.
First, the base docker engine must be installed. Register the APT repository of docker for ppc64le under /etc/apt/sources.list.d as follows.
root@minsky:~# echo deb http://ftp.unicamp.br/pub/ppc64el/ubuntu/16_04/docker-1.12.6-ppc64el/ xenial main > /etc/apt/sources.list.d/xenial-docker.list
root@minsky:~# apt-get update
Install as follows and start the docker service.
root@minsky:~# apt-get install docker-engine
root@minsky:~# service docker restart
Next, build nvidia-docker from source.
root@minsky:~# cd /data
root@minsky:/data# git clone https://github.com/NVIDIA/nvidia-docker.git
root@minsky:/data# cd nvidia-docker
root@minsky:/data/nvidia-docker# git fetch --all
root@minsky:/data/nvidia-docker# git checkout ppc64le
root@minsky:/data/nvidia-docker# ls
centos centos-7 LICENSE mk samples ubuntu ubuntu-16.04
centos-6 CLA Makefile README.md tools ubuntu-14.04
root@minsky:/data/nvidia-docker# make deb
Once this finishes, an installable nvidia-docker debian package sits under tools/dist. Install it with the dpkg command.
root@minsky:/data/nvidia-docker# dpkg -i tools/dist/nvidia-docker_1.0.0~rc.3-1_ppc64el.deb
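nvidia-docker 1.0 relies on a companion daemon, nvidia-docker-plugin, which serves GPU information over a local REST endpoint (port 3476 by default). A quick way to check that it came up with the package (an optional check, using the endpoint documented for the 1.0 plugin):
# a healthy plugin answers with driver and GPU details
curl -s http://localhost:3476/v1.0/gpu/info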
Looking with the nvidia-docker images command, you can see a few base images.
root@minsky:/data/nvidia-docker# nvidia-docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvidia-docker deb 332eaa8c9f9d 3 minutes ago 430.1 MB
nvidia-docker build 8cbc22512d15 5 minutes ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
If you docker run ppc64le/ubuntu, it runs the docker image tagged latest by default. But since none of the existing images is ppc64le/ubuntu:latest, it first downloads that image fresh from docker hub and then runs it.
root@minsky:/data/nvidia-docker# docker run -it ppc64le/ubuntu bash
Unable to find image 'ppc64le/ubuntu:latest' locally
latest: Pulling from ppc64le/ubuntu
0847857e6401: Pull complete
f8c18c152457: Pull complete
8643975d001d: Pull complete
d5802da4b3a0: Pull complete
fe172ed92137: Pull complete
Digest: sha256:5349f00594c719455f2c8e6f011b32758dcd326d8e225c737a55c15cf3d6948c
Status: Downloaded newer image for ppc64le/ubuntu:latest
We are now inside the docker image's bash.
root@ba07ff7529b3:/# df -h
Filesystem Size Used Avail Use% Mounted on
none 845G 743G 60G 93% /
tmpfs 256G 0 256G 0% /dev
tmpfs 256G 0 256G 0% /sys/fs/cgroup
/dev/sda2 845G 743G 60G 93% /etc/hosts
shm 64M 0 64M 0% /dev/shm
root@ba07ff7529b3:/# ls
bin dev home lib64 mnt proc run srv tmp var
boot etc lib media opt root sbin sys usr
root@ba07ff7529b3:/# uname -a
Linux ba07ff7529b3 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:05:18 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
Since this is Ubuntu's minimal image, there isn't even an ifconfig command.
root@ba07ff7529b3:/# ifconfig
bash: ifconfig: command not found
root@ba07ff7529b3:/# exit
You can see the state of the containers with docker ps -a.
root@minsky:/data/nvidia-docker# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ba07ff7529b3 ppc64le/ubuntu "bash" About a minute ago Exited (127) 9 seconds ago small_ride
Now let's build a docker image ourselves. The basics here just follow the URL below.
https://www.ibm.com/developerworks/library/d-docker-on-power-linux-platform/
We build a docker image of Ubuntu 16.04 Xenial ppc64le. For that, first install the debootstrap package and download the debootstrap.sh script as below.
root@minsky:/data# apt-get install -y debootstrap
root@minsky:/data# curl -o debootstrap.sh https://raw.githubusercontent.com/docker/docker/master/contrib/mkimage/debootstrap
root@minsky:/data# chmod a+x ./debootstrap.sh
Run the script so that it builds against Xenial's four repositories: main, universe, multiverse, and restricted.
root@minsky:/data# ./debootstrap.sh ubuntu --components=main,universe,multiverse,restricted xenial
When it finishes, the directories for an OS image appear right under the ubuntu directory.
root@minsky:/data# ls ubuntu
bin dev home lib64 mnt proc run srv tmp var
boot etc lib media opt root sbin sys usr
Now tar up this directory and pipe it straight into docker import.
root@minsky:/data# tar -C ubuntu -c . | docker import - ubuntu:16.04
sha256:09621ebd4cfd280af86ef61e2c5a41e8ef4e0081d6ec51203dba1fceaf69e625
Check the imported docker image.
root@minsky:/data# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu 16.04 09621ebd4cfd 31 seconds ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d About an hour ago 430.1 MB
nvidia-docker build 8cbc22512d15 About an hour ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 12 weeks ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
Apply several tags. In particular, be sure to add the latest tag; otherwise, whenever you invoke the image, docker will try to download an image of the same name with the latest tag from docker hub over the internet.
root@minsky:/data# docker tag ubuntu:16.04 ubuntu:xenial
root@minsky:/data# docker tag ubuntu:16.04 ubuntu:latest
root@minsky:/data# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu 16.04 09621ebd4cfd About a minute ago 234.3 MB
ubuntu latest 09621ebd4cfd About a minute ago 234.3 MB
ubuntu xenial 09621ebd4cfd About a minute ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d About an hour ago 430.1 MB
nvidia-docker build 8cbc22512d15 About an hour ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 12 weeks ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
You can check whether a given docker image is for ppc64le or for x86 as follows.
root@minsky:/data# docker inspect ubuntu | grep -i arch
"Architecture": "ppc64le",
Now build the nvidia-docker images on top of the ubuntu:16.04 image created above. First, edit a few of the dockerfiles appropriately as follows.
root@minsky:/data# cd nvidia-docker
root@minsky:/data/nvidia-docker# vi ./ubuntu-16.04/cuda/8.0/runtime/Dockerfile.ppc64le
#FROM ppc64le/ubuntu:16.04
FROM ubuntu:16.04
...
The edit below avoids an error caused by a mismatch between the sha256sum value already embedded in Dockerfile.ppc64le and the actual checksum of the cudnn-8.0-linux-ppc64le-v5.1.tgz file downloaded from the internet.
root@minsky:/data/nvidia-docker# sha256sum cudnn-8.0-linux-ppc64le-v5.1.tgz
663aac0328f821d90ae9c74ee43e90751706546c2ce769ea9c96f92864300af6 cudnn-8.0-linux-ppc64le-v5.1.tgz
root@minsky:/data/nvidia-docker# vi ./ubuntu-16.04/cuda/8.0/runtime/cudnn5/Dockerfile.ppc64le
...
RUN CUDNN_DOWNLOAD_SUM=663aac0328f821d90ae9c74ee43e90751706546c2ce769ea9c96f92864300af6 && \
#RUN CUDNN_DOWNLOAD_SUM=51f698d468401cef2e3e2ef9bb557bd57cbeb4dca895d1d1ae8a751d090bbe39 && \
root@minsky:/data/nvidia-docker# vi ./ubuntu-16.04/cuda/8.0/devel/cudnn5/Dockerfile.ppc64le
...
RUN CUDNN_DOWNLOAD_SUM=663aac0328f821d90ae9c74ee43e90751706546c2ce769ea9c96f92864300af6 && \
#RUN CUDNN_DOWNLOAD_SUM=51f698d468401cef2e3e2ef9bb557bd57cbeb4dca895d1d1ae8a751d090bbe39 && \
Now build with make.
root@minsky:/data/nvidia-docker# make cuda OS=ubuntu-16.04
When it completes, several images named cuda have been imported, as below: runtime and develop variants, and flavors with and without cudnn.
root@minsky:/data/nvidia-docker# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
cuda 8.0-cudnn5-devel d8d0da2fbdf2 7 minutes ago 1.895 GB
cuda cudnn d8d0da2fbdf2 7 minutes ago 1.895 GB
cuda cudnn-devel d8d0da2fbdf2 7 minutes ago 1.895 GB
cuda 8.0 dc3faec17c11 9 minutes ago 1.726 GB
cuda 8.0-devel dc3faec17c11 9 minutes ago 1.726 GB
cuda devel dc3faec17c11 9 minutes ago 1.726 GB
cuda latest dc3faec17c11 9 minutes ago 1.726 GB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 15 minutes ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 15 minutes ago 942.2 MB
cuda 8.0-runtime 8e9763b6296f 17 minutes ago 844.9 MB
cuda runtime 8e9763b6296f 17 minutes ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 4 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 4 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 4 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 4 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 4 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
Tag them as needed.
root@minsky:/data/nvidia-docker# docker tag cuda:8.0-cudnn5-devel cuda8-cudnn5-devel:cudnn5-devel
root@minsky:/data/nvidia-docker# docker tag cuda:8.0-cudnn5-devel cuda8-cudnn5-devel:latest
root@minsky:/data/nvidia-docker# docker tag cuda:8.0-runtime cuda8-runtime:latest
root@minsky:/data/nvidia-docker# docker tag cuda:cudnn-runtime cuda8-cudnn5-runtime:latest
root@minsky:/data/nvidia-docker# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
cuda8-cudnn5-devel cudnn5-devel d8d0da2fbdf2 12 minutes ago 1.895 GB
cuda8-cudnn5-devel latest d8d0da2fbdf2 12 minutes ago 1.895 GB
cuda 8.0-cudnn5-devel d8d0da2fbdf2 12 minutes ago 1.895 GB
cuda cudnn d8d0da2fbdf2 12 minutes ago 1.895 GB
cuda cudnn-devel d8d0da2fbdf2 12 minutes ago 1.895 GB
cuda 8.0 dc3faec17c11 13 minutes ago 1.726 GB
cuda 8.0-devel dc3faec17c11 13 minutes ago 1.726 GB
cuda devel dc3faec17c11 13 minutes ago 1.726 GB
cuda latest dc3faec17c11 13 minutes ago 1.726 GB
cuda8-cudnn5-runtime latest 8a3b0a60e741 20 minutes ago 942.2 MB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 20 minutes ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 20 minutes ago 942.2 MB
cuda8-runtime latest 8e9763b6296f 22 minutes ago 844.9 MB
cuda 8.0-runtime 8e9763b6296f 22 minutes ago 844.9 MB
cuda runtime 8e9763b6296f 22 minutes ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 4 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 4 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 4 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 4 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 4 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
You can examine the differences between the nvidia-docker images as follows.
root@minsky:/data/nvidia-docker# docker run --rm cuda nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sat_Sep__3_19:09:38_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44
root@minsky:/data/nvidia-docker# docker run -it cuda bash
root@a93070ccdc0d:/# which nvcc
/usr/local/cuda/bin/nvcc
root@a93070ccdc0d:/# ls -l /usr/local/cuda/bin
total 61648
-rwxr-xr-x 1 root root 175952 Sep 14 21:36 bin2c
lrwxrwxrwx 1 root root 4 Sep 14 21:40 computeprof -> nvvp
drwxr-xr-x 2 root root 4096 Jan 31 02:18 crt
-rwxr-xr-x 1 root root 9746984 Sep 14 21:36 cuda-gdb
-rwxr-xr-x 1 root root 500841 Sep 14 21:36 cuda-gdbserver
-rwxr-xr-x 1 root root 297576 Sep 14 21:36 cuda-memcheck
-rwxr-xr-x 1 root root 4581048 Sep 14 21:36 cudafe
-rwxr-xr-x 1 root root 4105352 Sep 14 21:36 cudafe++
-rwxr-xr-x 1 root root 699528 Sep 14 21:36 cuobjdump
-rwxr-xr-x 1 root root 245696 Sep 14 21:36 fatbinary
-rwxr-xr-x 1 root root 1108824 Sep 14 21:36 gpu-library-advisor
-rwxr-xr-x 1 root root 303928 Sep 14 21:36 nvcc
-rw-r--r-- 1 root root 411 Sep 14 21:36 nvcc.profile
-rwxr-xr-x 1 root root 16178272 Sep 14 21:36 nvdisasm
-rwxr-xr-x 1 root root 8126880 Sep 14 21:36 nvlink
-rwxr-xr-x 1 root root 8805704 Sep 14 21:36 nvprof
-rwxr-xr-x 1 root root 204712 Sep 14 21:36 nvprune
-rwxr-xr-x 1 root root 8015368 Sep 14 21:36 ptxas
root@edf46f371b00:/# find / -name libcudnn*
root@edf46f371b00:/#
As you just saw, there are no cudnn libraries in this one. Now let's look at the develop image that includes cudnn.
root@minsky:/data/nvidia-docker# docker run --rm cuda8-cudnn5-devel nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sat_Sep__3_19:09:38_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44
root@minsky:/data/nvidia-docker# docker run -it cuda8-cudnn5-devel bash
root@54c686bbec15:/# find / -name libcudnn*
/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libcudnn.so.5.1.10
/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libcudnn.so.5
/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libcudnn_static.a
/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libcudnn.so
Now let's build a docker image for a simple CUDA application. First, create a dockerfile as follows. The base used below is the cuda8-cudnn5-devel image made above, tagged as bsyu/cuda8-cudnn5-devel:cudnn5-devel. On top of it, we install CUDA and compile simpleP2P from the cuda samples into the image.
root@minsky:/data/mydocker# vi dockerfile.p2p
FROM bsyu/cuda8-cudnn5-devel:cudnn5-devel
# RUN executes a shell command
# You can chain multiple commands together with &&
# A \ is used to split long lines to help with readability
# This particular instruction installs the source files
# for deviceQuery by installing the CUDA samples via apt
RUN apt-get update && apt-get install -y cuda && \
rm -rf /var/lib/apt/lists/*
# set the working directory
WORKDIR /usr/local/cuda/samples/0_Simple/simpleP2P
RUN make
# CMD defines the default command to be run in the container
# CMD is overridden by supplying a command + arguments to
# `docker run`, e.g. `nvcc --version` or `bash`
CMD ./simpleP2P
Build with the dockerfile.p2p above.
root@minsky:/data/mydocker# docker build -t bsyu/p2p:ppc64le-xenial -f dockerfile.p2p .
When the build finishes, you can see that a fairly large docker image of about 2.77GB has been created.
root@minsky:/data/mydocker# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
bsyu/p2p ppc64le-xenial c307ae42d1aa About a minute ago 2.77 GB
registry latest 781e109ba95f 26 hours ago 612.6 MB
ubuntu/xenial gevent 4ce0e6ba8a69 26 hours ago 282.5 MB
127.0.0.1/ubuntu-xenial gevent 4ce0e6ba8a69 26 hours ago 282.5 MB
localhost:5000/ubuntu-xenial gevent 4ce0e6ba8a69 26 hours ago 282.5 MB
bsyu/cuda8-cudnn5-devel cudnn5-devel d8d0da2fbdf2 28 hours ago 1.895 GB
bsyu/cuda 8.0-devel dc3faec17c11 28 hours ago 1.726 GB
bsyu/ppc64le cuda8.0-devel dc3faec17c11 28 hours ago 1.726 GB
cuda 8.0 dc3faec17c11 28 hours ago 1.726 GB
cuda 8.0-devel dc3faec17c11 28 hours ago 1.726 GB
cuda devel dc3faec17c11 28 hours ago 1.726 GB
cuda latest dc3faec17c11 28 hours ago 1.726 GB
cuda8-cudnn5-runtime latest 8a3b0a60e741 29 hours ago 942.2 MB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 29 hours ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 29 hours ago 942.2 MB
cuda8-runtime latest 8e9763b6296f 29 hours ago 844.9 MB
cuda 8.0-runtime 8e9763b6296f 29 hours ago 844.9 MB
cuda runtime 8e9763b6296f 29 hours ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 5 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 5 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 5 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 5 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 5 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 3 months ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
If you run this with the ordinary docker run command, you get a CUDA error like the following.
root@minsky:/data/mydocker# docker run --rm bsyu/p2p:ppc64le-xenial
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA error at simpleP2P.cu:63 code=30(cudaErrorUnknown) "cudaGetDeviceCount(&gpu_n)"
Now it's nvidia-docker's turn. Think of nvidia-docker as a kind of wrapper, or plugin, that lets docker use CUDA. Usage is identical, and you can see that what errored out above now runs properly.
root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/p2p:ppc64le-xenial
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4
> GPU0 = "Tesla P100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
> GPU1 = "Tesla P100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
> GPU2 = "Tesla P100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
> GPU3 = "Tesla P100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> Tesla P100-SXM2-16GB (GPU0) supports UVA: Yes
> Tesla P100-SXM2-16GB (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 32.91GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
P2P 대역폭 32.91GB/sec의 위엄... NVLink라서 행복합니다... PCIe Gen3에서는 기껏해야 8GB/sec 못 넘습니다...
해당 docker image의 bash 속으로 들어가서 nvidia-smi 명령도 수행해 봅니다.
root@minsky:/data/mydocker# nvidia-docker run --rm -ti bsyu/p2p:ppc64le-xenial bash
oot@d4770bd8ec53:/usr/local/cuda-8.0/samples/0_Simple/simpleP2P# nvidia-smi -l 3
Wed Feb 1 07:26:39 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107 Driver Version: 361.107 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... On | 0002:01:00.0 Off | 0 |
| N/A 26C P0 28W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... On | 0003:01:00.0 Off | 0 |
| N/A 29C P0 31W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-SXM2... On | 0006:01:00.0 Off | 0 |
| N/A 25C P0 30W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-SXM2... On | 0007:01:00.0 Off | 0 |
| N/A 27C P0 29W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
docker container 속에서 보면 network 환경은 다음과 같습니다. 기본으로 주어진 172.17.0.3가 할당된 것을 보실 수 있습니다.
root@2a663f3cd0f5:/usr/local/cuda-8.0/samples/0_Simple/simpleP2P# ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:ac:11:00:03
inet addr:172.17.0.3 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::42:acff:fe11:3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:7 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:578 (578.0 B) TX bytes:508 (508.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Host에서 보면 docker0라는 interface에 172.17.0.1이 할당되어 있습니다.
root@minsky:~# ifconfig
docker0 Link encap:Ethernet HWaddr 02:42:16:b1:40:08
inet addr:172.17.0.1 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::42:16ff:feb1:4008/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:242445 errors:0 dropped:0 overruns:0 frame:0
TX packets:666734 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:9924007 (9.9 MB) TX bytes:2676663029 (2.6 GB)
enP5p7s0f0 Link encap:Ethernet HWaddr 70:e2:84:14:19:25
inet addr:172.18.229.115 Bcast:172.18.229.255 Mask:255.255.255.0
inet6 addr: fe80::72e2:84ff:fe14:1925/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:10039400 errors:0 dropped:160 overruns:0 frame:0
TX packets:1471125 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4229498546 (4.2 GB) TX bytes:1221935620 (1.2 GB)
Interrupt:205
다시 docker container에서 외부의 다른 서버로 ssh를 해보면, docker container는 독자적인 IP를 가지는 것이 아니라 host의 IP를 그대로 유지하는 것을 아래와 같이 보실 수 있습니다.
root@901ee2ecf38a:/usr/local/cuda-8.0/samples/0_Simple/simpleP2P# ssh test@172.18.229.117
test@k8002:~$ who
test hvc0 2017-01-09 17:05
test tty1 2017-01-11 14:49
test pts/0 2017-02-01 16:51 (172.18.229.115)
이제 public docker hub ( https://hub.docker.com ) 에 login하여 만든 이미지를 push, pull 해보겠습니다. ID/passwd는 따로 web browser를 통해 등록해두셔야 합니다.
root@minsky:/data/registry_volume# docker login --username=bsyu
Password:
Login Succeeded
User name이 bsyu로 되어 있으므로, 기존 image를 docker hub에 올리려면 앞에 bsyu/ 를 붙여 tagging을 해주어야 합니다.
root@minsky:/data/registry_volume# docker tag cuda8-cudnn5-devel:latest bsyu/cuda8-cudnn5-devel:cudnn5-devel
이제 push하면 됩니다.
root@minsky:/data/registry_volume# docker push bsyu/cuda8-cudnn5-devel:cudnn5-devel
The push refers to a repository [docker.io/bsyu/cuda8-cudnn5-devel]
c0fe73e43621: Pushed
4ce979019d1d: Pushed
724befd94678: Pushed
84f99f1bf79b: Pushed
7f7c1dccec82: Pushed
5b8880a35736: Pushed
41b97cb9a404: Pushed
08f34ce6b3fb: Pushed
cudnn5-devel: digest: sha256:c463051a8f78430d9de187386cca68294e2826830f085c677e5b20e70caeaf3f size: 2003
root@minsky:/data/registry_volume# docker tag cuda:8.0-devel bsyu/ppc64le:cuda8.0-devel
root@minsky:/data/registry_volume# docker push bsyu/ppc64le:cuda8.0-devel
The push refers to a repository [docker.io/bsyu/ppc64le]
724befd94678: Mounted from bsyu/cuda
84f99f1bf79b: Mounted from bsyu/cuda
7f7c1dccec82: Mounted from bsyu/cuda
5b8880a35736: Mounted from bsyu/cuda
41b97cb9a404: Mounted from bsyu/cuda
08f34ce6b3fb: Mounted from bsyu/cuda
cuda8.0-devel: digest: sha256:5943540e7f404d9c900c8acc188f4eab85e345a282e9ad37d6e2476093afc6c5 size: 1579
root@minsky:/data/registry_volume# docker tag cuda8-cudnn5-devel:cudnn5-devel bsyu/ppc64le:cuda8-cudnn5-devel
root@minsky:/data/registry_volume# docker push bsyu/ppc64le:cuda8-cudnn5-devel
The push refers to a repository [docker.io/bsyu/ppc64le]
c0fe73e43621: Mounted from bsyu/cuda8-cudnn5-devel
4ce979019d1d: Mounted from bsyu/cuda8-cudnn5-devel
724befd94678: Layer already exists
84f99f1bf79b: Layer already exists
7f7c1dccec82: Layer already exists
5b8880a35736: Layer already exists
41b97cb9a404: Layer already exists
08f34ce6b3fb: Layer already exists
cuda8-cudnn5-devel: digest: sha256:c463051a8f78430d9de187386cca68294e2826830f085c677e5b20e70caeaf3f size: 2003
Push된 것들을 https://hub.docker.com 에서 브라우저를 통해 확인해봅니다.
반대로 pull 해보기 위해, 방금 올렸던 image들을 일괄적으로 삭제합니다. image id에 대해 rmi 명령을 날리면 같은 id의 tag들이 모두 삭제됩니다.
root@minsky:/data# docker rmi -f d8d0da2fbdf2
Untagged: bsyu/cuda8-cudnn5-devel:cudnn5-devel
Untagged: bsyu/cuda8-cudnn5-devel@sha256:c463051a8f78430d9de187386cca68294e2826830f085c677e5b20e70caeaf3f
Untagged: bsyu/ppc64le:cuda8-cudnn5-devel
Untagged: bsyu/ppc64le@sha256:c463051a8f78430d9de187386cca68294e2826830f085c677e5b20e70caeaf3f
Untagged: cuda8-cudnn5-devel:cudnn5-devel
Untagged: cuda8-cudnn5-devel:latest
Untagged: cuda:8.0-cudnn5-devel
Untagged: cuda:cudnn
Untagged: cuda:cudnn-devel
Deleted: sha256:d8d0da2fbdf24a97787e6f1b4d8531d60e665b3d0f9cac5c14d1814a91b3b946
Deleted: sha256:2320a2aed314994ad77b5cc8e8b3faf295253bed8cf8a7be8a7806be6e9c50cf
Deleted: sha256:9d10e971aaf429133422b957bd1bfb583ebd03aaea9e796c2db8b6edca0d2836
Deleted: sha256:d8877708ee88e10086ce367b63e5da965c5e21ba2c8a199ab2c7b84c2c3ff699
Deleted: sha256:37e6b06a871334c369047ac9f9ae214cd63fe29700b3ad14901702a8044548e5
Deleted: sha256:0b0445d2e213d4eeed1760f8339a4b6433b134b75ba29336ce7759e67a397f5a
Deleted: sha256:5b4eda52a5b16a564381952434146055cb690de918359587d05273c23acade22
없어진 것을 확인합니다.
root@minsky:/data# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
bsyu/cuda 8.0-devel dc3faec17c11 6 hours ago 1.726 GB
bsyu/ppc64le cuda8.0-devel dc3faec17c11 6 hours ago 1.726 GB
cuda 8.0 dc3faec17c11 6 hours ago 1.726 GB
cuda 8.0-devel dc3faec17c11 6 hours ago 1.726 GB
cuda devel dc3faec17c11 6 hours ago 1.726 GB
cuda latest dc3faec17c11 6 hours ago 1.726 GB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 6 hours ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 6 hours ago 942.2 MB
cuda8-cudnn5-runtime latest 8a3b0a60e741 6 hours ago 942.2 MB
cuda8-runtime latest 8e9763b6296f 6 hours ago 844.9 MB
cuda 8.0-runtime 8e9763b6296f 6 hours ago 844.9 MB
cuda runtime 8e9763b6296f 6 hours ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 4 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 4 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 4 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 5 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 5 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
이제 pull 명령으로 그 image를 가져옵니다.
root@minsky:/data# docker pull bsyu/cuda8-cudnn5-devel:cudnn5-devel
cudnn5-devel: Pulling from bsyu/cuda8-cudnn5-devel
ffa99da61f7b: Already exists
6b239e02a89e: Already exists
aecbc9abccdc: Already exists
8f458a3f0497: Already exists
4903f7ce6675: Already exists
0c588ac98d19: Already exists
12e624e884fc: Pull complete
18dd28bbb571: Pull complete
Digest: sha256:c463051a8f78430d9de187386cca68294e2826830f085c677e5b20e70caeaf3f
Status: Downloaded newer image for bsyu/cuda8-cudnn5-devel:cudnn5-devel
제대로 가져왔는지 확인합니다.
root@minsky:/data# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
bsyu/cuda8-cudnn5-devel cudnn5-devel d8d0da2fbdf2 6 hours ago 1.895 GB
bsyu/cuda 8.0-devel dc3faec17c11 6 hours ago 1.726 GB
bsyu/ppc64le cuda8.0-devel dc3faec17c11 6 hours ago 1.726 GB
cuda 8.0 dc3faec17c11 6 hours ago 1.726 GB
cuda 8.0-devel dc3faec17c11 6 hours ago 1.726 GB
cuda devel dc3faec17c11 6 hours ago 1.726 GB
cuda latest dc3faec17c11 6 hours ago 1.726 GB
cuda8-cudnn5-runtime latest 8a3b0a60e741 6 hours ago 942.2 MB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 6 hours ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 6 hours ago 942.2 MB
cuda8-runtime latest 8e9763b6296f 6 hours ago 844.9 MB
cuda 8.0-runtime 8e9763b6296f 6 hours ago 844.9 MB
cuda runtime 8e9763b6296f 6 hours ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 4 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 4 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 4 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 5 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 5 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
When you need a CUDA toolkit version different from the one on the host, docker, and nvidia-docker in particular, can be a good solution. For example, in the case below, the host server itself has only nvcc v7.5.17 installed.
root@minsky:/data# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:31:50_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
But inside a docker container running on that very host, you can see nvcc v8.0.44 in use.
root@minsky:/data# docker run --rm bsyu/nvcc:ppc64le-xenial nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sat_Sep__3_19:09:38_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44
Let's go through the procedure for using this convenient nvidia-docker, step by step.
First, the base docker engine must be installed. Register the APT repository of the ppc64le docker build under /etc/apt/sources.list.d as follows.
root@minsky:~# echo deb http://ftp.unicamp.br/pub/ppc64el/ubuntu/16_04/docker-1.12.6-ppc64el/ xenial main > /etc/apt/sources.list.d/xenial-docker.list
root@minsky:~# apt-get update
Install the package and start the docker service as follows.
root@minsky:~# apt-get install docker-engine
root@minsky:~# service docker restart
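Before going further, it is worth confirming that the engine actually came up. A quick check (the exact output varies by build):
root@minsky:~# docker version
root@minsky:~# docker info | grep -i architecture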
Next, build nvidia-docker from source.
root@minsky:~# cd /data
root@minsky:/data# git clone https://github.com/NVIDIA/nvidia-docker.git
root@minsky:/data# cd nvidia-docker
root@minsky:/data/nvidia-docker# git fetch --all
root@minsky:/data/nvidia-docker# git checkout ppc64le
root@minsky:/data/nvidia-docker# ls
centos centos-7 LICENSE mk samples ubuntu ubuntu-16.04
centos-6 CLA Makefile README.md tools ubuntu-14.04
root@minsky:/data/nvidia-docker# make deb
Once this finishes, an installable nvidia-docker debian package appears under tools/dist. Install it with dpkg.
root@minsky:/data/nvidia-docker# dpkg -i tools/dist/nvidia-docker_1.0.0~rc.3-1_ppc64el.deb
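The deb should also register the nvidia-docker-plugin as a system service (the 1.0 packaging named the unit nvidia-docker, but verify on your system). A hedged sanity check that the plugin came up cleanly:
root@minsky:/data/nvidia-docker# systemctl status nvidia-docker
root@minsky:/data/nvidia-docker# journalctl -u nvidia-docker | tail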
The nvidia-docker images command now shows a few base images.
root@minsky:/data/nvidia-docker# nvidia-docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvidia-docker deb 332eaa8c9f9d 3 minutes ago 430.1 MB
nvidia-docker build 8cbc22512d15 5 minutes ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
If you docker run ppc64le/ubuntu, the image tagged latest is run by default. Since no local ppc64le/ubuntu:latest exists yet, docker first downloads that image from docker hub and then runs it.
root@minsky:/data/nvidia-docker# docker run -it ppc64le/ubuntu bash
Unable to find image 'ppc64le/ubuntu:latest' locally
latest: Pulling from ppc64le/ubuntu
0847857e6401: Pull complete
f8c18c152457: Pull complete
8643975d001d: Pull complete
d5802da4b3a0: Pull complete
fe172ed92137: Pull complete
Digest: sha256:5349f00594c719455f2c8e6f011b32758dcd326d8e225c737a55c15cf3d6948c
Status: Downloaded newer image for ppc64le/ubuntu:latest
We are now inside a bash shell in the container.
root@ba07ff7529b3:/# df -h
Filesystem Size Used Avail Use% Mounted on
none 845G 743G 60G 93% /
tmpfs 256G 0 256G 0% /dev
tmpfs 256G 0 256G 0% /sys/fs/cgroup
/dev/sda2 845G 743G 60G 93% /etc/hosts
shm 64M 0 64M 0% /dev/shm
root@ba07ff7529b3:/# ls
bin dev home lib64 mnt proc run srv tmp var
boot etc lib media opt root sbin sys usr
root@ba07ff7529b3:/# uname -a
Linux ba07ff7529b3 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:05:18 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
Because this is a minimal Ubuntu image, even the ifconfig command is missing (a quick fix is sketched right after this session).
root@ba07ff7529b3:/# ifconfig
bash: ifconfig: command not found
root@ba07ff7529b3:/# exit
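If you do want ifconfig in such a minimal container, installing net-tools is enough. A sketch, to be run inside the container, assuming the ppc64le Ubuntu archives are reachable from it:
root@ba07ff7529b3:/# apt-get update && apt-get install -y net-tools
root@ba07ff7529b3:/# ifconfig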
You can see the state of the containers with docker ps -a.
root@minsky:/data/nvidia-docker# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ba07ff7529b3 ppc64le/ubuntu "bash" About a minute ago Exited (127) 9 seconds ago small_ride
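Exited containers like small_ride above linger until removed. You can remove one by name, or sweep all exited containers at once with a status filter:
root@minsky:/data/nvidia-docker# docker rm small_ride
root@minsky:/data/nvidia-docker# docker rm $(docker ps -aq -f status=exited)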
Now let's build a docker image ourselves. The steps below essentially follow the article at the URL below.
https://www.ibm.com/developerworks/library/d-docker-on-power-linux-platform/
We will build a docker image of Ubuntu 16.04 Xenial for ppc64le. First install the debootstrap package and download the debootstrap.sh script as shown below.
root@minsky:/data# apt-get install -y debootstrap
root@minsky:/data# curl -o debootstrap.sh https://raw.githubusercontent.com/docker/docker/master/contrib/mkimage/debootstrap
root@minsky:/data# chmod a+x ./debootstrap.sh
Run the script so that the build pulls in the four Xenial repositories main, universe, multiverse, and restricted.
root@minsky:/data# ./debootstrap.sh ubuntu --components=main,universe,multiverse,restricted xenial
When the run finishes, the directories of the OS image appear under the ubuntu directory right below.
root@minsky:/data# ls ubuntu
bin dev home lib64 mnt proc run srv tmp var
boot etc lib media opt root sbin sys usr
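Before importing, you can sanity-check that the rootfs really is the ppc64le port (dpkg reports the Debian architecture name, ppc64el):
root@minsky:/data# chroot ubuntu dpkg --print-architecture
ppc64el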
Now tar up this directory and pipe it straight into docker import.
root@minsky:/data# tar -C ubuntu -c . | docker import - ubuntu:16.04
sha256:09621ebd4cfd280af86ef61e2c5a41e8ef4e0081d6ec51203dba1fceaf69e625
Check the imported docker image.
root@minsky:/data# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu 16.04 09621ebd4cfd 31 seconds ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d About an hour ago 430.1 MB
nvidia-docker build 8cbc22512d15 About an hour ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 12 weeks ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
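A quick smoke test that the imported image actually runs natively; uname -m should print ppc64le:
root@minsky:/data# docker run --rm ubuntu:16.04 uname -m
ppc64le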
Tag it in several ways, and be sure to include the latest tag. Otherwise, every time you reference the image without a tag, docker will try to download an image of the same name tagged latest from docker hub over the internet.
root@minsky:/data# docker tag ubuntu:16.04 ubuntu:xenial
root@minsky:/data# docker tag ubuntu:16.04 ubuntu:latest
root@minsky:/data# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu 16.04 09621ebd4cfd About a minute ago 234.3 MB
ubuntu latest 09621ebd4cfd About a minute ago 234.3 MB
ubuntu xenial 09621ebd4cfd About a minute ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d About an hour ago 430.1 MB
nvidia-docker build 8cbc22512d15 About an hour ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 12 weeks ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
You can check whether a docker image is built for ppc64le or for x86 as follows.
root@minsky:/data# docker inspect ubuntu | grep -i arch
"Architecture": "ppc64le",
Now build the nvidia-docker images on top of this freshly made ubuntu:16.04 image. First, edit a few of the dockerfiles as follows.
root@minsky:/data# cd nvidia-docker
root@minsky:/data/nvidia-docker# vi ./ubuntu-16.04/cuda/8.0/runtime/Dockerfile.ppc64le
#FROM ppc64le/ubuntu:16.04
FROM ubuntu:16.04
...
The next edits exist to avoid an error caused by a checksum mismatch: the sha256sum of the cudnn-8.0-linux-ppc64le-v5.1.tgz file to be downloaded does not match the value already hard-coded in Dockerfile.ppc64le, so we replace it with the real one.
root@minsky:/data/nvidia-docker# sha256sum cudnn-8.0-linux-ppc64le-v5.1.tgz
663aac0328f821d90ae9c74ee43e90751706546c2ce769ea9c96f92864300af6 cudnn-8.0-linux-ppc64le-v5.1.tgz
root@minsky:/data/nvidia-docker# vi ./ubuntu-16.04/cuda/8.0/runtime/cudnn5/Dockerfile.ppc64le
...
RUN CUDNN_DOWNLOAD_SUM=663aac0328f821d90ae9c74ee43e90751706546c2ce769ea9c96f92864300af6 && \
#RUN CUDNN_DOWNLOAD_SUM=51f698d468401cef2e3e2ef9bb557bd57cbeb4dca895d1d1ae8a751d090bbe39 && \
root@minsky:/data/nvidia-docker# vi ./ubuntu-16.04/cuda/8.0/devel/cudnn5/Dockerfile.ppc64le
...
RUN CUDNN_DOWNLOAD_SUM=663aac0328f821d90ae9c74ee43e90751706546c2ce769ea9c96f92864300af6 && \
#RUN CUDNN_DOWNLOAD_SUM=51f698d468401cef2e3e2ef9bb557bd57cbeb4dca895d1d1ae8a751d090bbe39 && \
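Editing by hand works, but the checksum swap can also be scripted. A sketch that computes the sum of the local tgz and patches both Dockerfiles in place:
root@minsky:/data/nvidia-docker# SUM=$(sha256sum cudnn-8.0-linux-ppc64le-v5.1.tgz | awk '{print $1}')
root@minsky:/data/nvidia-docker# sed -i "s/CUDNN_DOWNLOAD_SUM=[0-9a-f]*/CUDNN_DOWNLOAD_SUM=$SUM/" \
    ./ubuntu-16.04/cuda/8.0/runtime/cudnn5/Dockerfile.ppc64le \
    ./ubuntu-16.04/cuda/8.0/devel/cudnn5/Dockerfile.ppc64le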
Now build with make.
root@minsky:/data/nvidia-docker# make cuda OS=ubuntu-16.04
When it finishes, several images named cuda have been created, split into runtime and devel variants, with and without cudnn.
root@minsky:/data/nvidia-docker# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
cuda 8.0-cudnn5-devel d8d0da2fbdf2 7 minutes ago 1.895 GB
cuda cudnn d8d0da2fbdf2 7 minutes ago 1.895 GB
cuda cudnn-devel d8d0da2fbdf2 7 minutes ago 1.895 GB
cuda 8.0 dc3faec17c11 9 minutes ago 1.726 GB
cuda 8.0-devel dc3faec17c11 9 minutes ago 1.726 GB
cuda devel dc3faec17c11 9 minutes ago 1.726 GB
cuda latest dc3faec17c11 9 minutes ago 1.726 GB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 15 minutes ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 15 minutes ago 942.2 MB
cuda 8.0-runtime 8e9763b6296f 17 minutes ago 844.9 MB
cuda runtime 8e9763b6296f 17 minutes ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 4 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 4 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 4 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 4 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 4 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
Tag them as needed.
root@minsky:/data/nvidia-docker# docker tag cuda:8.0-cudnn5-devel cuda8-cudnn5-devel:cudnn5-devel
root@minsky:/data/nvidia-docker# docker tag cuda:8.0-cudnn5-devel cuda8-cudnn5-devel:latest
root@minsky:/data/nvidia-docker# docker tag cuda:8.0-runtime cuda8-runtime:latest
root@minsky:/data/nvidia-docker# docker tag cuda:cudnn-runtime cuda8-cudnn5-runtime:latest
root@minsky:/data/nvidia-docker# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
cuda8-cudnn5-devel cudnn5-devel d8d0da2fbdf2 12 minutes ago 1.895 GB
cuda8-cudnn5-devel latest d8d0da2fbdf2 12 minutes ago 1.895 GB
cuda 8.0-cudnn5-devel d8d0da2fbdf2 12 minutes ago 1.895 GB
cuda cudnn d8d0da2fbdf2 12 minutes ago 1.895 GB
cuda cudnn-devel d8d0da2fbdf2 12 minutes ago 1.895 GB
cuda 8.0 dc3faec17c11 13 minutes ago 1.726 GB
cuda 8.0-devel dc3faec17c11 13 minutes ago 1.726 GB
cuda devel dc3faec17c11 13 minutes ago 1.726 GB
cuda latest dc3faec17c11 13 minutes ago 1.726 GB
cuda8-cudnn5-runtime latest 8a3b0a60e741 20 minutes ago 942.2 MB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 20 minutes ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 20 minutes ago 942.2 MB
cuda8-runtime latest 8e9763b6296f 22 minutes ago 844.9 MB
cuda 8.0-runtime 8e9763b6296f 22 minutes ago 844.9 MB
cuda runtime 8e9763b6296f 22 minutes ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 4 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 4 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 4 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 4 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 4 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
You can check how the individual nvidia-docker images differ, as follows.
root@minsky:/data/nvidia-docker# docker run --rm cuda nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sat_Sep__3_19:09:38_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44
root@minsky:/data/nvidia-docker# docker run -it cuda bash
root@a93070ccdc0d:/# which nvcc
/usr/local/cuda/bin/nvcc
root@a93070ccdc0d:/# ls -l /usr/local/cuda/bin
total 61648
-rwxr-xr-x 1 root root 175952 Sep 14 21:36 bin2c
lrwxrwxrwx 1 root root 4 Sep 14 21:40 computeprof -> nvvp
drwxr-xr-x 2 root root 4096 Jan 31 02:18 crt
-rwxr-xr-x 1 root root 9746984 Sep 14 21:36 cuda-gdb
-rwxr-xr-x 1 root root 500841 Sep 14 21:36 cuda-gdbserver
-rwxr-xr-x 1 root root 297576 Sep 14 21:36 cuda-memcheck
-rwxr-xr-x 1 root root 4581048 Sep 14 21:36 cudafe
-rwxr-xr-x 1 root root 4105352 Sep 14 21:36 cudafe++
-rwxr-xr-x 1 root root 699528 Sep 14 21:36 cuobjdump
-rwxr-xr-x 1 root root 245696 Sep 14 21:36 fatbinary
-rwxr-xr-x 1 root root 1108824 Sep 14 21:36 gpu-library-advisor
-rwxr-xr-x 1 root root 303928 Sep 14 21:36 nvcc
-rw-r--r-- 1 root root 411 Sep 14 21:36 nvcc.profile
-rwxr-xr-x 1 root root 16178272 Sep 14 21:36 nvdisasm
-rwxr-xr-x 1 root root 8126880 Sep 14 21:36 nvlink
-rwxr-xr-x 1 root root 8805704 Sep 14 21:36 nvprof
-rwxr-xr-x 1 root root 204712 Sep 14 21:36 nvprune
-rwxr-xr-x 1 root root 8015368 Sep 14 21:36 ptxas
root@edf46f371b00:/# find / -name libcudnn*
root@edf46f371b00:/#
As you can see, there are no cudnn libraries here. Now let's look at the devel image, which does contain cudnn.
root@minsky:/data/nvidia-docker# docker run --rm cuda8-cudnn5-devel nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sat_Sep__3_19:09:38_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44
root@minsky:/data/nvidia-docker# docker run -it cuda8-cudnn5-devel bash
root@54c686bbec15:/# find / -name libcudnn*
/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libcudnn.so.5.1.10
/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libcudnn.so.5
/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libcudnn_static.a
/usr/local/cuda-8.0/targets/ppc64le-linux/lib/libcudnn.so
Now let's build a docker image for a simple CUDA application. First create a dockerfile as below. The base image used here is the cuda8-cudnn5-devel image built above, retagged as bsyu/cuda8-cudnn5-devel:cudnn5-devel. On top of it we install CUDA and compile the simpleP2P sample from the cuda samples into the image.
root@minsky:/data/mydocker# vi dockerfile.p2p
FROM bsyu/cuda8-cudnn5-devel:cudnn5-devel
# RUN executes a shell command
# You can chain multiple commands together with &&
# A \ is used to split long lines to help with readability
# This particular instruction installs the source files
# for deviceQuery by installing the CUDA samples via apt
RUN apt-get update && apt-get install -y cuda && \
rm -rf /var/lib/apt/lists/*
# set the working directory
WORKDIR /usr/local/cuda/samples/0_Simple/simpleP2P
RUN make
# CMD defines the default command to be run in the container
# CMD is overridden by supplying a command + arguments to
# `docker run`, e.g. `nvcc --version` or `bash`
CMD ./simpleP2P
Build with this dockerfile.p2p.
root@minsky:/data/mydocker# docker build -t bsyu/p2p:ppc64le-xenial -f dockerfile.p2p .
When the build finishes, you can see that a fairly large docker image of about 2.77GB has been created.
root@minsky:/data/mydocker# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
bsyu/p2p ppc64le-xenial c307ae42d1aa About a minute ago 2.77 GB
registry latest 781e109ba95f 26 hours ago 612.6 MB
ubuntu/xenial gevent 4ce0e6ba8a69 26 hours ago 282.5 MB
127.0.0.1/ubuntu-xenial gevent 4ce0e6ba8a69 26 hours ago 282.5 MB
localhost:5000/ubuntu-xenial gevent 4ce0e6ba8a69 26 hours ago 282.5 MB
bsyu/cuda8-cudnn5-devel cudnn5-devel d8d0da2fbdf2 28 hours ago 1.895 GB
bsyu/cuda 8.0-devel dc3faec17c11 28 hours ago 1.726 GB
bsyu/ppc64le cuda8.0-devel dc3faec17c11 28 hours ago 1.726 GB
cuda 8.0 dc3faec17c11 28 hours ago 1.726 GB
cuda 8.0-devel dc3faec17c11 28 hours ago 1.726 GB
cuda devel dc3faec17c11 28 hours ago 1.726 GB
cuda latest dc3faec17c11 28 hours ago 1.726 GB
cuda8-cudnn5-runtime latest 8a3b0a60e741 29 hours ago 942.2 MB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 29 hours ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 29 hours ago 942.2 MB
cuda8-runtime latest 8e9763b6296f 29 hours ago 844.9 MB
cuda 8.0-runtime 8e9763b6296f 29 hours ago 844.9 MB
cuda runtime 8e9763b6296f 29 hours ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 5 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 5 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 5 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 5 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 5 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 3 months ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
If you run this with the plain docker run command as before, you get a CUDA error as shown below: plain docker does not map the NVIDIA device nodes (/dev/nvidia*) or the driver libraries into the container, so cudaGetDeviceCount fails.
root@minsky:/data/mydocker# docker run --rm bsyu/p2p:ppc64le-xenial
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA error at simpleP2P.cu:63 code=30(cudaErrorUnknown) "cudaGetDeviceCount(&gpu_n)"
Now it is nvidia-docker's turn. Think of nvidia-docker as a kind of wrapper, or plugin, that lets docker containers use CUDA. The usage is identical, and the command that failed above now runs correctly.
root@minsky:/data/mydocker# nvidia-docker run --rm bsyu/p2p:ppc64le-xenial
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4
> GPU0 = "Tesla P100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
> GPU1 = "Tesla P100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
> GPU2 = "Tesla P100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
> GPU3 = "Tesla P100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> Tesla P100-SXM2-16GB (GPU0) supports UVA: Yes
> Tesla P100-SXM2-16GB (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 32.91GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
Behold a P2P bandwidth of 32.91GB/sec... NVLink makes me happy. On PCIe Gen3 you would struggle to get past 8GB/sec.
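Under the hood, the nvidia-docker 1.0 wrapper just asks the nvidia-docker-plugin REST service (port 3476 by default) for the extra --device and --volume arguments and appends them to an ordinary docker run. If the plugin is running, the following should be roughly equivalent to the nvidia-docker run above (endpoint path as used by the 1.0 plugin; verify against your build):
root@minsky:/data/mydocker# curl -s http://localhost:3476/docker/cli
root@minsky:/data/mydocker# docker run --rm $(curl -s http://localhost:3476/docker/cli) bsyu/p2p:ppc64le-xenial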
Let's also drop into a bash shell in that image and run nvidia-smi.
root@minsky:/data/mydocker# nvidia-docker run --rm -ti bsyu/p2p:ppc64le-xenial bash
root@d4770bd8ec53:/usr/local/cuda-8.0/samples/0_Simple/simpleP2P# nvidia-smi -l 3
Wed Feb 1 07:26:39 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107 Driver Version: 361.107 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... On | 0002:01:00.0 Off | 0 |
| N/A 26C P0 28W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... On | 0003:01:00.0 Off | 0 |
| N/A 29C P0 31W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-SXM2... On | 0006:01:00.0 Off | 0 |
| N/A 25C P0 30W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-SXM2... On | 0007:01:00.0 Off | 0 |
| N/A 27C P0 29W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Seen from inside the docker container, the network environment looks like this: the container has been given the default-assigned address 172.17.0.3.
root@2a663f3cd0f5:/usr/local/cuda-8.0/samples/0_Simple/simpleP2P# ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:ac:11:00:03
inet addr:172.17.0.3 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::42:acff:fe11:3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:7 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:578 (578.0 B) TX bytes:508 (508.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Seen from the host, the docker0 interface holds 172.17.0.1.
root@minsky:~# ifconfig
docker0 Link encap:Ethernet HWaddr 02:42:16:b1:40:08
inet addr:172.17.0.1 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::42:16ff:feb1:4008/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:242445 errors:0 dropped:0 overruns:0 frame:0
TX packets:666734 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:9924007 (9.9 MB) TX bytes:2676663029 (2.6 GB)
enP5p7s0f0 Link encap:Ethernet HWaddr 70:e2:84:14:19:25
inet addr:172.18.229.115 Bcast:172.18.229.255 Mask:255.255.255.0
inet6 addr: fe80::72e2:84ff:fe14:1925/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:10039400 errors:0 dropped:160 overruns:0 frame:0
TX packets:1471125 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4229498546 (4.2 GB) TX bytes:1221935620 (1.2 GB)
Interrupt:205
If we now ssh from the docker container to another external server, you can see below that the container does not show up with an IP of its own; its outbound traffic is NATed, so it appears under the host's IP (172.18.229.115).
root@901ee2ecf38a:/usr/local/cuda-8.0/samples/0_Simple/simpleP2P# ssh test@172.18.229.117
test@k8002:~$ who
test hvc0 2017-01-09 17:05
test tty1 2017-01-11 14:49
test pts/0 2017-02-01 16:51 (172.18.229.115)
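Outbound connections are NATed to the host address, as the who output shows. Inbound is the opposite: to reach a service inside the container from outside, you must publish ports explicitly with -p. A sketch (host port 8888 is an arbitrary choice):
root@minsky:/data/mydocker# nvidia-docker run --rm -ti -p 8888:8888 bsyu/p2p:ppc64le-xenial bash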
Now let's log in to the public docker hub ( https://hub.docker.com ) and push and pull the images we made. You need to have registered an ID and password beforehand through a web browser.
root@minsky:/data/registry_volume# docker login --username=bsyu
Password:
Login Succeeded
Since the user name is bsyu, an existing image must be retagged with a bsyu/ prefix before it can be pushed to docker hub.
root@minsky:/data/registry_volume# docker tag cuda8-cudnn5-devel:latest bsyu/cuda8-cudnn5-devel:cudnn5-devel
Now just push.
root@minsky:/data/registry_volume# docker push bsyu/cuda8-cudnn5-devel:cudnn5-devel
The push refers to a repository [docker.io/bsyu/cuda8-cudnn5-devel]
c0fe73e43621: Pushed
4ce979019d1d: Pushed
724befd94678: Pushed
84f99f1bf79b: Pushed
7f7c1dccec82: Pushed
5b8880a35736: Pushed
41b97cb9a404: Pushed
08f34ce6b3fb: Pushed
cudnn5-devel: digest: sha256:c463051a8f78430d9de187386cca68294e2826830f085c677e5b20e70caeaf3f size: 2003
root@minsky:/data/registry_volume# docker tag cuda:8.0-devel bsyu/ppc64le:cuda8.0-devel
root@minsky:/data/registry_volume# docker push bsyu/ppc64le:cuda8.0-devel
The push refers to a repository [docker.io/bsyu/ppc64le]
724befd94678: Mounted from bsyu/cuda
84f99f1bf79b: Mounted from bsyu/cuda
7f7c1dccec82: Mounted from bsyu/cuda
5b8880a35736: Mounted from bsyu/cuda
41b97cb9a404: Mounted from bsyu/cuda
08f34ce6b3fb: Mounted from bsyu/cuda
cuda8.0-devel: digest: sha256:5943540e7f404d9c900c8acc188f4eab85e345a282e9ad37d6e2476093afc6c5 size: 1579
root@minsky:/data/registry_volume# docker tag cuda8-cudnn5-devel:cudnn5-devel bsyu/ppc64le:cuda8-cudnn5-devel
root@minsky:/data/registry_volume# docker push bsyu/ppc64le:cuda8-cudnn5-devel
The push refers to a repository [docker.io/bsyu/ppc64le]
c0fe73e43621: Mounted from bsyu/cuda8-cudnn5-devel
4ce979019d1d: Mounted from bsyu/cuda8-cudnn5-devel
724befd94678: Layer already exists
84f99f1bf79b: Layer already exists
7f7c1dccec82: Layer already exists
5b8880a35736: Layer already exists
41b97cb9a404: Layer already exists
08f34ce6b3fb: Layer already exists
cuda8-cudnn5-devel: digest: sha256:c463051a8f78430d9de187386cca68294e2826830f085c677e5b20e70caeaf3f size: 2003
Check the pushed images through a browser at https://hub.docker.com.
Conversely, to try a pull, delete the images we just pushed in one sweep. Running rmi (here with -f) against an image id untags every tag that points to that id.
root@minsky:/data# docker rmi -f d8d0da2fbdf2
Untagged: bsyu/cuda8-cudnn5-devel:cudnn5-devel
Untagged: bsyu/cuda8-cudnn5-devel@sha256:c463051a8f78430d9de187386cca68294e2826830f085c677e5b20e70caeaf3f
Untagged: bsyu/ppc64le:cuda8-cudnn5-devel
Untagged: bsyu/ppc64le@sha256:c463051a8f78430d9de187386cca68294e2826830f085c677e5b20e70caeaf3f
Untagged: cuda8-cudnn5-devel:cudnn5-devel
Untagged: cuda8-cudnn5-devel:latest
Untagged: cuda:8.0-cudnn5-devel
Untagged: cuda:cudnn
Untagged: cuda:cudnn-devel
Deleted: sha256:d8d0da2fbdf24a97787e6f1b4d8531d60e665b3d0f9cac5c14d1814a91b3b946
Deleted: sha256:2320a2aed314994ad77b5cc8e8b3faf295253bed8cf8a7be8a7806be6e9c50cf
Deleted: sha256:9d10e971aaf429133422b957bd1bfb583ebd03aaea9e796c2db8b6edca0d2836
Deleted: sha256:d8877708ee88e10086ce367b63e5da965c5e21ba2c8a199ab2c7b84c2c3ff699
Deleted: sha256:37e6b06a871334c369047ac9f9ae214cd63fe29700b3ad14901702a8044548e5
Deleted: sha256:0b0445d2e213d4eeed1760f8339a4b6433b134b75ba29336ce7759e67a397f5a
Deleted: sha256:5b4eda52a5b16a564381952434146055cb690de918359587d05273c23acade22
Confirm that they are gone.
root@minsky:/data# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
bsyu/cuda 8.0-devel dc3faec17c11 6 hours ago 1.726 GB
bsyu/ppc64le cuda8.0-devel dc3faec17c11 6 hours ago 1.726 GB
cuda 8.0 dc3faec17c11 6 hours ago 1.726 GB
cuda 8.0-devel dc3faec17c11 6 hours ago 1.726 GB
cuda devel dc3faec17c11 6 hours ago 1.726 GB
cuda latest dc3faec17c11 6 hours ago 1.726 GB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 6 hours ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 6 hours ago 942.2 MB
cuda8-cudnn5-runtime latest 8a3b0a60e741 6 hours ago 942.2 MB
cuda8-runtime latest 8e9763b6296f 6 hours ago 844.9 MB
cuda 8.0-runtime 8e9763b6296f 6 hours ago 844.9 MB
cuda runtime 8e9763b6296f 6 hours ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 4 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 4 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 4 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 5 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 5 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB
Now fetch the image back with the pull command.
root@minsky:/data# docker pull bsyu/cuda8-cudnn5-devel:cudnn5-devel
cudnn5-devel: Pulling from bsyu/cuda8-cudnn5-devel
ffa99da61f7b: Already exists
6b239e02a89e: Already exists
aecbc9abccdc: Already exists
8f458a3f0497: Already exists
4903f7ce6675: Already exists
0c588ac98d19: Already exists
12e624e884fc: Pull complete
18dd28bbb571: Pull complete
Digest: sha256:c463051a8f78430d9de187386cca68294e2826830f085c677e5b20e70caeaf3f
Status: Downloaded newer image for bsyu/cuda8-cudnn5-devel:cudnn5-devel
Check that it was pulled correctly.
root@minsky:/data# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
bsyu/cuda8-cudnn5-devel cudnn5-devel d8d0da2fbdf2 6 hours ago 1.895 GB
bsyu/cuda 8.0-devel dc3faec17c11 6 hours ago 1.726 GB
bsyu/ppc64le cuda8.0-devel dc3faec17c11 6 hours ago 1.726 GB
cuda 8.0 dc3faec17c11 6 hours ago 1.726 GB
cuda 8.0-devel dc3faec17c11 6 hours ago 1.726 GB
cuda devel dc3faec17c11 6 hours ago 1.726 GB
cuda latest dc3faec17c11 6 hours ago 1.726 GB
cuda8-cudnn5-runtime latest 8a3b0a60e741 6 hours ago 942.2 MB
cuda 8.0-cudnn5-runtime 8a3b0a60e741 6 hours ago 942.2 MB
cuda cudnn-runtime 8a3b0a60e741 6 hours ago 942.2 MB
cuda8-runtime latest 8e9763b6296f 6 hours ago 844.9 MB
cuda 8.0-runtime 8e9763b6296f 6 hours ago 844.9 MB
cuda runtime 8e9763b6296f 6 hours ago 844.9 MB
ubuntu 16.04 09621ebd4cfd 4 days ago 234.3 MB
ubuntu latest 09621ebd4cfd 4 days ago 234.3 MB
ubuntu xenial 09621ebd4cfd 4 days ago 234.3 MB
nvidia-docker deb 332eaa8c9f9d 5 days ago 430.1 MB
nvidia-docker build 8cbc22512d15 5 days ago 1.012 GB
ppc64le/ubuntu 14.04 c040fcd69c12 12 weeks ago 227.8 MB
ppc64le/ubuntu latest 1967d889e07f 3 months ago 168 MB
ppc64le/golang 1.6.3 6a579d02d32f 5 months ago 704.7 MB