GPU Isolation

September 10, 2017 | Docker CUDA GPU

There are four GeForce GTX 1080 Ti cards in the 4-GPU workstation Hydra:

# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Sep  8 17:14 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Sep  8 17:14 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Sep  8 17:14 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Sep  8 17:14 /dev/nvidia3
crw-rw-rw- 1 root root 195, 255 Sep  8 17:14 /dev/nvidiactl
crw-rw-rw- 1 root root 243,   0 Sep  8 17:14 /dev/nvidia-uvm
crw-rw-rw- 1 root root 243,   1 Sep  8 17:14 /dev/nvidia-uvm-tools

# lspci | grep -i nvidia
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
83:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
84:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)

A caveat of running TensorFlow on a multi-GPU system such as Hydra is that, by default, a TensorFlow session allocates all of the memory on every visible GPU, even if you only use a single GPU. A better usage pattern is to launch multiple jobs in parallel, each one restricted to a subset of the available GPUs.
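If a process does have to share the box, TensorFlow 1.x can also be told not to grab everything up front. Here is a minimal sketch, assuming the TF 1.x ConfigProto API, that enables on-demand memory growth and limits the session to GPU 2:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True           # allocate GPU memory on demand instead of all at once
config.gpu_options.visible_device_list = "2"     # expose only GPU 2 to this session
sess = tf.Session(config=config)

Restricting visibility with CUDA_VISIBLE_DEVICES, as described below, is still the more robust approach, since the process never sees the other GPUs at all.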

Before starting any TensorFlow session, run nvidia-smi to see which GPUs are in use, then select an idle GPU and target it with the environment variable CUDA_VISIBLE_DEVICES.

For example, to select GPU 2 (/dev/nvidia2) to run your TensorFlow session, first set the environment variable:

export CUDA_VISIBLE_DEVICES=2

then launch Python to run TensorFlow.
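If you do this often, the idle-GPU check can be scripted. Here is a minimal sketch, assuming nvidia-smi is on the PATH and supports the --query-gpu option, that prints the index of the GPU with the least memory in use:

import subprocess

# Ask nvidia-smi for per-GPU memory usage in a parseable form.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,memory.used",
     "--format=csv,noheader,nounits"]).decode()

# Pick the GPU with the smallest memory footprint.
usage = [tuple(map(int, line.split(","))) for line in out.strip().splitlines()]
idle_gpu = min(usage, key=lambda x: x[1])[0]
print(idle_gpu)

You can then use the printed index to set CUDA_VISIBLE_DEVICES before launching your job.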

Alternatively, you can set the environment variables within Python, before you import tensorflow. Setting CUDA_DEVICE_ORDER to PCI_BUS_ID makes the CUDA device indices follow the PCI bus order reported by nvidia-smi and lspci, rather than CUDA's default fastest-first ordering. For example:

# python
Python 3.6.2 |Anaconda custom (64-bit)| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"   # see issue #152
>>> os.environ["CUDA_VISIBLE_DEVICES"]="2"
>>> os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 17032249930653661893
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 235274240
locality {
  bus_id: 2
}
incarnation: 9956660329545292162
physical_device_desc: "device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0"
]

Note that with CUDA_VISIBLE_DEVICES=2 only one GPU is visible to TensorFlow: it shows up as /gpu:0 and maps to the card at PCI bus 0000:83:00.0, the third card in PCI bus order. Here we also use the environment variable TF_CPP_MIN_LOG_LEVEL to filter TensorFlow logs. It defaults to 0 (all logs shown), and can be set to 1 to filter out INFO logs, 2 to additionally filter out WARNING logs, or 3 to additionally filter out ERROR logs.

If you use nvidia-docker, you can use the environment variable NV_GPU to control which GPUs are exposed to the container. For example:

# NV_GPU=2 nvidia-docker run --rm nvidia/cuda nvidia-smi
Sun Sep 10 20:35:35 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   21C    P8    16W / 250W |  10623MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+