Hydra - a 4-GPU Workstation for Machine Learning
July 28, 2017 | GPU CUDA Linux Machine Learning

Hydra is a rack-mountable 4-GPU workstation for Machine Learning research, kindly provided to us by our partners at the Pacific Research Platform.
- Hardware
- Basic Software
- CUDA
- cuDNN
- ZFS on SATA SSDs
- XFS on NVMe SSD
- Upgrading to CentOS 7.4
- Upgrading to CentOS 7.5
Hardware
- Two 8-core Intel Xeon E5-2620 v4 processors @ 2.1 GHz
- Four Nvidia GeForce GTX 1080 Ti Founder’s Edition Graphics Cards
- 128GB DDR4 ECC/REG memory (8 x 16GB)
- Two 128GB SATA DOMs
- One 3.2TB Samsung PM1725a HHHL PCIe NVMe SSD
- Eight 480GB Samsung PM863a SATA SSDs (installed in eight hot-swap Drive Bays)
- One Mellanox ConnectX-4 Lx EN (MCX4131A-BCAT) 40GbE Single Port QSFP28 Network Adapter
- Integrated Intel i350 Dual Port Gigabit Ethernet Controller
- Integrated IPMI 2.0 with Virtual Media over LAN and KVM-over-LAN Support
- Integrated ASPEED AST2400 BMC Graphics
- Supermicro 7048GR-TR 4U rack-mountable Tower chassis
- 2000W high efficiency (96%) redundant PSUs, Titanium Level
Basic Software
1) Minimal installation of CentOS 7.3 in July 2017.
2) Disable SELinux by changing the SELINUX setting in /etc/selinux/config:
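Assuming the stock default of enforcing, the change is:

```
# /etc/selinux/config
# before:
SELINUX=enforcing
# after:
SELINUX=disabled
```

A reboot (step 6 below) is required for this to take effect.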
3) After copying SSH keys to the host, disable SSH password authentication by changing the PasswordAuthentication setting in /etc/ssh/sshd_config:
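That is, presumably:

```
# /etc/ssh/sshd_config
# before:
PasswordAuthentication yes
# after:
PasswordAuthentication no
```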
4) Disable GSSAPI authentication of SSH by changing the GSSAPIAuthentication setting in /etc/ssh/sshd_config:
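Likewise (GSSAPIAuthentication is enabled in the stock CentOS sshd_config):

```
# /etc/ssh/sshd_config
# before:
GSSAPIAuthentication yes
# after:
GSSAPIAuthentication no
```

Restart sshd afterwards (systemctl restart sshd) so both sshd_config changes take effect.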
5) Update all packages:
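As root, presumably just:

```
yum -y update
```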
6) Reboot.
7) Remove the old kernel:
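One common way to do this is with package-cleanup from yum-utils (an assumption; the exact command used is not recorded here):

```
yum -y install yum-utils
package-cleanup --oldkernels --count=1
```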
8) Install Development Tools:
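Typically:

```
yum -y groupinstall "Development Tools"
```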
9) Enable EPEL repository:
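On CentOS this is simply:

```
yum -y install epel-release
```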
10) Install DKMS, which is required by CUDA packages:
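With EPEL enabled in the previous step:

```
yum -y install dkms
```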
11) Install Environment Modules:
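The package name on CentOS 7 is environment-modules:

```
yum -y install environment-modules
```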
CUDA
Initially the open source nouveau driver is loaded by default:
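This can be confirmed with, for example:

```
lsmod | grep nouveau
```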
Install the CUDA Toolkit, using Installer Type rpm (network):
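The network installer boils down to adding NVIDIA's yum repository and installing the cuda meta-package; the repo rpm filename below is a placeholder for the one actually downloaded from developer.nvidia.com:

```
rpm -i cuda-repo-rhel7-<version>.x86_64.rpm
yum clean expire-cache
yum -y install cuda
```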
This installs CUDA 8.0 at /usr/local/cuda-8.0; and also creates a symbolic link /usr/local/cuda pointing to that directory.
Now CUDA 8.0 is installed; and the nouveau driver is blacklisted with /etc/modprobe.d/blacklist-nouveau.conf.
Let’s list all Nvidia GPUs:
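For example:

```
nvidia-smi -L
# or: lspci | grep -i nvidia
```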
We note in passing that the NVIDIA CUDA Profiling Tools Interface (CUPTI) is installed at /usr/local/cuda-8.0/extras/CUPTI. This library provides advanced profiling support; and is required by the current version (1.3) of TensorFlow.
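The CUPTI library directory lives under extras/ and is typically not covered by CUDA's ld.so.conf.d entry, so for TensorFlow it may still need to be made visible to the dynamic linker. One option (an assumption, not part of the recorded setup; the file name is hypothetical):

```
# add CUPTI's lib64 directory to the linker search path
echo "/usr/local/cuda/extras/CUPTI/lib64" > /etc/ld.so.conf.d/cupti.conf
ldconfig
```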
cuDNN
The current version (1.3) of TensorFlow requires cuDNN v6.0. Download the tarball of cuDNN v6.0 for CUDA 8.0; and unpack it:
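Assuming the standard tarball name from NVIDIA's download page, something like:

```
# the tarball contains a top-level cuda/ directory (include/ and lib64/),
# so extracting under /usr/local drops the files into /usr/local/cuda
tar -xzvf cudnn-8.0-linux-x64-v6.0.tgz -C /usr/local
```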
which places the header and the libraries in /usr/local/cuda, which is a symbolic link to /usr/local/cuda-8.0.
The CUDA installation has added an entry cuda-8-0.conf in /etc/ld.so.conf.d/; so there is no need to alter the LD_LIBRARY_PATH environment variable.
Refresh the shared library cache:
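That is:

```
ldconfig
```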
ZFS on SATA SSDs
Install the ZFS repository for RHEL/CentOS 7.3:
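The ZFS on Linux project publishes a zfs-release package per EL point release; for EL7.3 the command is along these lines (URL following the project's naming scheme):

```
yum -y install http://download.zfsonlinux.org/epel/zfs-release.el7_3.noarch.rpm
```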
We’ll install the kABI-tracking kmod ZFS packages. Follow the instructions in the wiki to modify /etc/yum.repos.d/zfs.repo, then install zfs:
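The wiki's kABI-tracking instructions amount to switching the enabled flags in zfs.repo before installing:

```
# in /etc/yum.repos.d/zfs.repo:
#   [zfs]      -> enabled=0
#   [zfs-kmod] -> enabled=1
yum -y install zfs
```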
which installs zfs as well as its dependencies kmod-spl, kmod-zfs, libzfs2, libzpool2 & spl.
Enable and start ZFS:
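Presumably along these lines, matching the units listed below:

```
systemctl enable zfs-import-cache zfs-import-scan zfs-mount zfs.target
systemctl start zfs.target
```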
Note there is a preset file for ZFS at /usr/lib/systemd/system-preset/50-zfs.preset.
Here we’ve only enabled zfs-import-cache, zfs-import-scan, zfs-mount & zfs.target. But it’s worthwhile looking into zfs-zed (ZFS Event Daemon).
Find out the disk IDs of the eight 480GB Samsung PM863a SATA SSDs:
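For example:

```
ls -l /dev/disk/by-id/ | grep ata-SAMSUNG
```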
Use the following shell script, mkzfs, to create a ZFS pool on the eight 480GB Samsung PM863a SATA SSDs:
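The script is not reproduced verbatim here; a minimal sketch of the idea, with placeholder disk IDs and an assumed pool name and layout:

```
#!/bin/bash
# mkzfs -- sketch only: create a pool on the eight SATA SSDs, referenced by disk ID
set -euo pipefail

# disk IDs as listed under /dev/disk/by-id/ (serial numbers are placeholders)
DISKS=(
  /dev/disk/by-id/ata-SAMSUNG_MZ7LM480HMHQ-00005_SERIAL1
  /dev/disk/by-id/ata-SAMSUNG_MZ7LM480HMHQ-00005_SERIAL2
  /dev/disk/by-id/ata-SAMSUNG_MZ7LM480HMHQ-00005_SERIAL3
  /dev/disk/by-id/ata-SAMSUNG_MZ7LM480HMHQ-00005_SERIAL4
  /dev/disk/by-id/ata-SAMSUNG_MZ7LM480HMHQ-00005_SERIAL5
  /dev/disk/by-id/ata-SAMSUNG_MZ7LM480HMHQ-00005_SERIAL6
  /dev/disk/by-id/ata-SAMSUNG_MZ7LM480HMHQ-00005_SERIAL7
  /dev/disk/by-id/ata-SAMSUNG_MZ7LM480HMHQ-00005_SERIAL8
)

# pool name (hydra), raidz2 layout and properties are assumptions
zpool create -f -o ashift=12 -O compression=lz4 -O atime=off \
  hydra raidz2 "${DISKS[@]}"
```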
This script will automatically create /etc/zfs/zpool.cache; so we don’t need to run the following command as instructed in the Arch Linux wiki article:
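For reference, that command has the form (pool name is a placeholder):

```
zpool set cachefile=/etc/zfs/zpool.cache <pool>
```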
XFS on NVMe SSD
Partition the NVMe SSD:
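One way to do this, assuming the device shows up as /dev/nvme0n1 and a single GPT partition is wanted:

```
parted -s /dev/nvme0n1 mklabel gpt
parted -s -a optimal /dev/nvme0n1 mkpart primary 0% 100%
```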
Create XFS on the NVMe partition:
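Presumably (partition name following the assumption above):

```
mkfs.xfs /dev/nvme0n1p1
```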
We have the sticky bit set on the mount point, so users can only delete their own files.
Find out the UUID of the NVMe partition:
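For instance, with blkid:

```
blkid /dev/nvme0n1p1
```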
Append the following line to /etc/fstab so that the partition will be automatically mounted on startup:
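The entry has this shape; the UUID, mount point and mount options below are placeholders for the actual values:

```
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /scratch  xfs  defaults,noatime  0 0
```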
Upgrading to CentOS 7.4
We upgraded the OS to CentOS 7.4 in November 2017:
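The upgrade itself is presumably just a full update followed by a reboot:

```
yum -y update
reboot
```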
1) After reboot, the kernel has been upgraded to 3.10.0-693.5.2.el7.x86_64 (from 3.10.0-514.26.2.el7.x86_64):
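As reported by:

```
uname -r
```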
2) The nvidia driver has been upgraded to 384.81 (from 384.66):
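The driver version appears in the header of the nvidia-smi output:

```
nvidia-smi
```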
3) CUDA 9.0 is installed; and /usr/local/cuda now points to cuda-9.0. Note CUDA 8.0 is not removed; it is still kept at /usr/local/cuda-8.0.
4) The CUDA 9.0 installation has added an entry cuda-9-0.conf in /etc/ld.so.conf.d/:
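Its contents can be inspected with:

```
cat /etc/ld.so.conf.d/cuda-9-0.conf
```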
ZFS
But ZFS stopped working! Let’s fix it.
Remove old zfs-release, which is for RHEL/CentOS 7.3:
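For instance:

```
yum -y remove zfs-release
```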
Install zfs-release for EL7.4:
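Following the same URL pattern as before, now for EL7.4:

```
yum -y install http://download.zfsonlinux.org/epel/zfs-release.el7_4.noarch.rpm
```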
We’ll again use the kABI-tracking kmod ZFS packages. Follow the instructions in the wiki to modify /etc/yum.repos.d/zfs.repo, then update to install the latest zfs packages:
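That is, presumably, after switching the enabled flags in zfs.repo again:

```
# [zfs] -> enabled=0, [zfs-kmod] -> enabled=1 in /etc/yum.repos.d/zfs.repo, then:
yum -y update
```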
Restart zfs.target:
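That is:

```
systemctl restart zfs.target
```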
Now ZFS is mounted:
ZFS has been upgraded to 0.7.3-1 (from 0.7.1-1):
CephFS
CephFS was not mounted either! As it turns out, the systemd unit file /usr/lib/systemd/system/ceph-fuse@.service has been overwritten. We had to modify it again by following the steps in the post Mounting a Subdirectory of CephFS on a CentOS 7 Client; then it works again!
cuDNN
Download the tarball of cuDNN v7.0 for CUDA 9.0; and unpack it:
Refresh the shared library cache:
Upgrading to CentOS 7.5
We upgraded the OS to CentOS 7.5 in July 2018:
1) After reboot, the kernel has been upgraded to 3.10.0-862.9.1.el7.x86_64:
2) The nvidia driver has been upgraded to 396.37:
3) CUDA 9.2 is installed; and /usr/local/cuda now points to cuda-9.2. Note CUDA 9.0 is not removed; it is still kept at /usr/local/cuda-9.0.
4) The CUDA 9.2 installation has added an entry cuda-9-2.conf in /etc/ld.so.conf.d/:
ZFS
But ZFS again stopped working! Let’s fix it.
Remove the old zfs-release, which is now the one for RHEL/CentOS 7.4, and install zfs-release for EL7.5:
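As in the 7.4 upgrade, presumably along these lines (release rpm now for EL7.5):

```
yum -y remove zfs-release
yum -y install http://download.zfsonlinux.org/epel/zfs-release.el7_5.noarch.rpm
```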
We’ll again use the kABI-tracking kmod ZFS packages. Follow the instructions in the wiki to modify /etc/yum.repos.d/zfs.repo, then update to install the latest zfs packages:
Restart zfs.target:
Now ZFS is mounted:
ZFS has been upgraded to 0.7.9-1 (from 0.7.3-1):
CephFS
CephFS was not mounted either! As it turns out, the systemd unit file /usr/lib/systemd/system/ceph-fuse@.service has been overwritten. We had to modify it again by following the steps in the post Mounting a Subdirectory of CephFS on a CentOS 7 Client; then it works again!
cuDNN
Download the tarball of cuDNN v7.1 for CUDA 9.2; and unpack it:
Refresh the shared library cache: