When NVIDIA CUDA 10.2 is installed on the ppc64le architecture (IBM POWER8/9), nvidia-persistenced sometimes fails to start and exits with an error, as shown below, so nvidia-smi and the other CUDA tools cannot be used.
[cecuser@p615-met1 ~]$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
[cecuser@p615-met1 ~]$ sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Mon 2020-02-03 20:33:38 EST; 36min ago
Process: 11480 ExecStart=/usr/bin/nvidia-persistenced --verbose (code=exited, status=1/FAILURE)
Feb 03 20:33:38 p615-met1 systemd[1]: nvidia-persistenced.service: control process exited, code=exite...us=1
Feb 03 20:33:38 p615-met1 systemd[1]: Failed to start NVIDIA Persistence Daemon.
Feb 03 20:33:38 p615-met1 systemd[1]: Unit nvidia-persistenced.service entered failed state.
Feb 03 20:33:38 p615-met1 systemd[1]: nvidia-persistenced.service failed.
Feb 03 20:33:38 p615-met1 systemd[1]: nvidia-persistenced.service holdoff time over, scheduling restart.
Feb 03 20:33:38 p615-met1 systemd[1]: Stopped NVIDIA Persistence Daemon.
Feb 03 20:33:38 p615-met1 systemd[1]: start request repeated too quickly for nvidia-persistenced.service
Feb 03 20:33:38 p615-met1 systemd[1]: Failed to start NVIDIA Persistence Daemon.
Feb 03 20:33:38 p615-met1 systemd[1]: Unit nvidia-persistenced.service entered failed state.
Feb 03 20:33:38 p615-met1 systemd[1]: nvidia-persistenced.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
In most cases this happens because the memory hotplug rule in /etc/udev/rules.d/40-redhat.rules has not been commented out. The fix looks like this:
[cecuser@p615-met1 ~]$ sudo vi /etc/udev/rules.d/40-redhat.rules
...
#SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
SUBSYSTEM=="*", GOTO="memory_hotplug_end"
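To check quickly whether a rules file already carries this change, something like the following could be used. It is only a minimal sketch: the function name hotplug_rule_disabled is made up for illustration, and it assumes the file uses exactly the two lines shown in the listing above.

```shell
# Hypothetical helper (not part of CUDA or udev): returns 0 when the memory
# hotplug rule in the given rules file has been neutralized as shown above.
hotplug_rule_disabled() {
    rules_file="$1"
    # An active (uncommented) original rule means the workaround is missing.
    if grep -Eq '^[[:space:]]*SUBSYSTEM!="memory"' "$rules_file"; then
        return 1
    fi
    # The replacement line that jumps straight to memory_hotplug_end must exist.
    grep -Eq '^[[:space:]]*SUBSYSTEM=="\*",[[:space:]]*GOTO="memory_hotplug_end"' "$rules_file"
}
```

For example, `hotplug_rule_disabled /etc/udev/rules.d/40-redhat.rules && echo "hotplug rule is disabled"` prints the message only when the original rule is commented out and the replacement line is present.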
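However, even after making this change and rebooting, I still got the same error. I struggled with it for quite a while, and it now looks like CUDA 10.2 has a small bug: /etc/ld.so.conf.d/cuda-10-2.conf contains the x86_64-linux directory instead of the ppc64le-linux one, so the dynamic linker is pointed at a library directory for the wrong architecture.
Just edit the file by hand as follows and then run ldconfig.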
[cecuser@p615-met1 ~]$ sudo vi /etc/ld.so.conf.d/cuda-10-2.conf
#/usr/local/cuda-10.2/targets/x86_64-linux/lib
/usr/local/cuda-10.2/targets/ppc64le-linux/lib
[cecuser@p615-met1 ~]$ sudo ldconfig
[cecuser@p615-met1 ~]$ nvidia-smi
Mon Feb 3 23:50:00 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000004:04:00.0 Off | 0 |
| N/A 31C P0 51W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000004:05:00.0 Off | 0 |
| N/A 34C P0 54W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000035:03:00.0 Off | 0 |
| N/A 32C P0 54W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000035:04:00.0 Off | 0 |
| N/A 35C P0 54W / 300W | 0MiB / 32510MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
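If you need to apply the same ld.so.conf fix on several machines, the manual edit above could be scripted roughly as follows. This is a sketch under assumptions: the function name fix_cuda_ld_conf is hypothetical, and it assumes the only problem in the file is the x86_64-linux path component on an uncommented line, as it was in my case.

```shell
# Hypothetical helper: rewrite an x86_64-linux target path to ppc64le-linux
# in a linker configuration file passed as the first argument.
fix_cuda_ld_conf() {
    conf_file="$1"
    tmp_file="$(mktemp)" || return 1
    # Replace the architecture component only on lines that are not comments.
    sed '/^[[:space:]]*#/!s|/targets/x86_64-linux/|/targets/ppc64le-linux/|' \
        "$conf_file" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }
    # Write back through the original path so its ownership and mode are kept.
    cat "$tmp_file" > "$conf_file" && rm -f "$tmp_file"
}
```

Run as root, `fix_cuda_ld_conf /etc/ld.so.conf.d/cuda-10-2.conf` followed by `ldconfig` should have the same effect as the edit shown earlier. Afterwards, `ldconfig -p | grep ppc64le-linux` is one way to see whether libraries from the corrected directory made it into the linker cache.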
I lost more than two hours on something this trivial...