In this post, we describe how we installed TensorFlow 1.4 on the 4-GPU workstation Hydra . Previously, we installed TensorFlow 1.3 with Anaconda on Hydra. But because there are a lot of major features and improvements introduced in release 1.4 , it is worthwhile to spend some effort to install the latest release of TensorFlow on the GPU box.
Python 3.6 compatibility
The latest Anaconda3 version (5.0.1 as of this writing) includes Python 3.6 by default. As noted in previous post , if we attempt a “native” pip of the official TensorFlow release:
[root@hydra ~]# pip install tensorflow-gpu
Collecting tensorflow-gpu
Downloading tensorflow_gpu-1.4.1-cp36-cp36m-manylinux1_x86_64.whl (170.3MB)
100% |████████████████████████████████| 170.3MB 8.2kB/s
...
Downloading tensorflow_tensorboard-0.4.0rc3-py3-none-any.whl (1.7MB)
100% |████████████████████████████████| 1.7MB 863kB/s
...
Successfully installed bleach-1.5.0 enum34-1.1.6 html5lib-0.9999999 markdown-2.6.11 protobuf-3.5.1 tensorflow-gpu-1.4.1 tensorflow-tensorboard-0.4.0rc3
the installation will fail to pass the simple validation:
[root@hydra ~]# python
Python 3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
> >> import tensorflow as tf
/opt/tensorflow14/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
It looks like we need to downgrade to Python 3.5!
Anaconda Python 3.5
There are three ways to downgrade to Anaconda Python 3.5 :
Install the latest version of Anaconda and then make a Python 3.5 environment
Install the latest version of Anaconda and run this command to install Python 3.5 in the root environment: conda install python=3.5
Install the most recent Anaconda that included Python 3.5 by default, Anaconda 4.2.0
We’ll take the second approach.
1) Install a separate copy of Anaconda3 at /opt/tensorflow14
so that it will be accessible by all users on Hydra:
# cd ~/Downloads
# ./Anaconda3-5.0.1-Linux-x86_64.sh
2) Create an environment module , python/tensorflow14
, for this Anaconda3 installation. Then load the module:
# module load python/tensorflow14
3) Install Python 3.5 in the root environment:
# conda install python = 3.5
# which python
/opt/tensorflow14/bin/python
# python --version
Python 3.5.4 :: Anaconda custom (64-bit)
4) Perform a “native” pip install of the official TensorFlow release:
[root@hydra Downloads]# pip install tensorflow-gpu
Collecting tensorflow-gpu
Downloading tensorflow_gpu-1.4.1-cp35-cp35m-manylinux1_x86_64.whl (170.1MB)
100% |████████████████████████████████| 170.1MB 8.3kB/s
...
Successfully installed bleach-1.5.0 enum34-1.1.6 html5lib-0.9999999 markdown-2.6.11 protobuf-3.5.1 tensorflow-gpu-1.4.1 tensorflow-tensorboard-0.4.0rc3
Weak-modules woes
Unfortunately, the installation still couldn’t pass the simple validation test, albeit with a new error!
# python
Python 3.5.4 |Anaconda custom (64-bit)| (default, Nov 20 2017, 18:44:38)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
> >> import tensorflow as tf
> >> sess = tf.Session()
2018-01-10 11:32:30.927717: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
After some digging, I found something might be amiss with the Nvidia driver:
# modinfo nvidia
filename: /lib/modules/3.10.0-693.11.6.el7.x86_64/weak-updates/nvidia.ko
modinfo: ERROR: could not get modinfo from 'nvidia': No such file or directory
# cd /lib/modules/3.10.0-693.11.6.el7.x86_64/weak-updates/
# ls -l nvidia*
total 0
lrwxrwxrwx 1 root root 58 Jan 4 09:45 nvidia-drm.ko -> /lib/modules/3.10.0-693.5.2.el7.x86_64/extra/nvidia-drm.ko
lrwxrwxrwx 1 root root 54 Jan 4 09:45 nvidia.ko -> /lib/modules/3.10.0-693.5.2.el7.x86_64/extra/nvidia.ko
lrwxrwxrwx 1 root root 62 Jan 4 09:45 nvidia-modeset.ko -> /lib/modules/3.10.0-693.5.2.el7.x86_64/extra/nvidia-modeset.ko
lrwxrwxrwx 1 root root 58 Jan 4 09:45 nvidia-uvm.ko -> /lib/modules/3.10.0-693.5.2.el7.x86_64/extra/nvidia-uvm.ko
However, the targets of those symbolic links were no longer existent on Hydra! Now I recall that in the first week of 2018 I upgraded all the packages on Hydra, then removed the old kernel packages after the upgrade. What I didn’t realize then was that the simple command yum erase kernel-3.10.0-693.5.2.el7.x86_64
would also erase the Nvidia kernel drivers that were required by the new kernel!
Once we knew what the problem was, it was straightforward to fix it.
1) Delete the dangling symbolic links:
# rm nvidia*
rm: remove symbolic link ‘nvidia-drm.ko’? y
rm: remove symbolic link ‘nvidia.ko’? y
rm: remove symbolic link ‘nvidia-modeset.ko’? y
rm: remove symbolic link ‘nvidia-uvm.ko’? y
2) Erase the nvidia-kmod package:
# yum erase -y nvidia-kmod
================================================================================
Removing:
nvidia-kmod x86_64 1:387.26-2.el7 @cuda 23 M
Removing for dependencies:
cuda x86_64 9.1.85-1 @cuda 0.0
cuda-8-0 x86_64 8.0.61-1 @cuda 0.0
cuda-9-0 x86_64 9.0.176-1 @cuda 0.0
cuda-9-1 x86_64 9.1.85-1 @cuda 0.0
cuda-demo-suite-8-0 x86_64 8.0.61-1 @cuda 11 M
cuda-demo-suite-9-0 x86_64 9.0.176-1 @cuda 11 M
cuda-demo-suite-9-1 x86_64 9.1.85-1 @cuda 11 M
cuda-drivers x86_64 387.26-1 @cuda 0.0
cuda-runtime-8-0 x86_64 8.0.61-1 @cuda 0.0
cuda-runtime-9-0 x86_64 9.0.176-1 @cuda 0.0
cuda-runtime-9-1 x86_64 9.1.85-1 @cuda 0.0
xorg-x11-drv-nvidia x86_64 1:387.26-1.el7 @cuda 11 M
xorg-x11-drv-nvidia-devel x86_64 1:387.26-1.el7 @cuda 579 k
xorg-x11-drv-nvidia-gl x86_64 1:387.26-1.el7 @cuda 75 M
xorg-x11-drv-nvidia-libs x86_64 1:387.26-1.el7 @cuda 93 M
3) Reboot.
4) Reinstall CUDA:
# yum install -y cuda
5) Reboot one more time. Now the latest Nvidia driver is properly installed and loaded:
# modinfo nvidia
filename: /lib/modules/3.10.0-693.11.6.el7.x86_64/extra/nvidia.ko
alias: char-major-195-*
version: 387.26
supported: external
license: NVIDIA
rhelversion: 7.4
...
Validation
Finally TensorFlow 1.4.1 appears to work as expected!
$ module load python/tensorflow14
$ python
Python 3.5.4 |Anaconda custom (64-bit)| (default, Nov 20 2017, 18:44:38)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
> >> import tensorflow as tf
> >> a = tf.constant([ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape =[ 2, 3], name = 'a' )
> >> b = tf.constant([ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape =[ 3, 2], name = 'b' )
> >> c = tf.matmul( a, b)
> >> sess = tf.Session( config = tf.ConfigProto( log_device_placement = True))
2018-01-10 12:23:56.993368: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-01-10 12:24:02.603082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2018-01-10 12:24:02.995246: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:03:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2018-01-10 12:24:03.369357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:83:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2018-01-10 12:24:03.717394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:84:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2018-01-10 12:24:03.723314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-01-10 12:24:03.723923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3
2018-01-10 12:24:03.723950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0: Y Y N N
2018-01-10 12:24:03.723969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1: Y Y N N
2018-01-10 12:24:03.723988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2: N N Y Y
2018-01-10 12:24:03.724007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3: N N Y Y
2018-01-10 12:24:03.724049: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> ( device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-01-10 12:24:03.724074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> ( device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2018-01-10 12:24:03.724136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> ( device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
2018-01-10 12:24:03.724158: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> ( device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1
2018-01-10 12:24:04.190943: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1
> >> print( sess.run( c))
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-10 12:24:27.442923: I tensorflow/core/common_runtime/placer.cc:874] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-10 12:24:27.443079: I tensorflow/core/common_runtime/placer.cc:874] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-10 12:24:27.443159: I tensorflow/core/common_runtime/placer.cc:874] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[ 22. 28.]
[ 49. 64.]]
> >> sess.close()
> >> quit()