Hydra - a 4-GPU Workstation for Machine Learning
July 28, 2017 | GPU CUDA Linux Machine Learning

Hydra is a rack-mountable 4-GPU workstation for Machine Learning research, kindly provided to us by our partners at the Pacific Research Platform.
- Hardware
- Basic Software
- CUDA
- cuDNN
- ZFS on SATA SSDs
- XFS on NVMe SSD
- Upgrading to CentOS 7.4
- Upgrading to CentOS 7.5
Hardware
- Two 8-core Intel Xeon E5-2620 v4 processors @ 2.1 GHz
- Four Nvidia GeForce GTX 1080 Ti Founder’s Edition Graphics Cards
- 128GB DDR4 ECC/REG memory (8 x 16GB)
- Two 128GB SATA DOMs
- One 3.2TB Samsung PM1725a HHHL PCIe NVMe SSD
- Eight 480GB Samsung PM863a SATA SSDs (installed in eight hot-swap Drive Bays)
- One Mellanox ConnectX-4 Lx EN (MCX4131A-BCAT) 40GbE Single Port QSFP28 Network Adapter
- Integrated Intel i350 Dual Port Gigabit Ethernet Controller
- Integrated IPMI 2.0 with Virtual Media over LAN and KVM-over-LAN Support
- Integrated ASPEED AST2400 BMC Graphics
- Supermicro 7048GR-TR 4U rack-mountable Tower chassis
- 2000W high efficiency (96%) redundant PSUs, Titanium Level
Basic Software
1) Minimal installation of CentOS 7.3 in July 2017.
2) Disable SELinux by changing the following line in /etc/selinux/config:
SELINUX=enforcing
to:
SELINUX=disabled
3) After copying SSH keys to the host, disable password authentication for SSH by changing the following line in /etc/ssh/sshd_config:
PasswordAuthentication yes
to:
PasswordAuthentication no
4) Disable GSSAPI authentication for SSH by changing the following line in /etc/ssh/sshd_config:
GSSAPIAuthentication yes
to:
GSSAPIAuthentication no
5) Update all packages:
# yum -y update
6) Reboot.
7) Remove the old kernel:
# yum erase -y kernel-3.10.0-514.el7.x86_64
8) Install Development Tools
# yum groupinstall -y "Development Tools"
9) Enable EPEL repository:
# yum install -y epel-release
10) Install DKMS, which is required by CUDA packages:
# yum install -y dkms
11) Install Environment Modules:
# yum -y install environment-modules
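Settings like those in steps 2)–4) can silently drift after OS updates. Here is a minimal sketch of a verification helper (check_setting is a hypothetical function, not a stock CentOS command) that confirms a key/value pair is present in a config file, covering both the KEY=value style of /etc/selinux/config and the "Key value" style of sshd_config:

```shell
# Sketch: confirm the hardening settings from steps 2)-4) are in place.
# check_setting is a hypothetical helper, not part of any CentOS tooling.
check_setting() {
    # check_setting FILE KEY VALUE
    # Matches either "KEY=VALUE" (selinux config) or "KEY VALUE" (sshd_config).
    local file=$1 key=$2 value=$3
    grep -Eq "^${key}[ =]${value}[[:space:]]*$" "$file"
}

# Intended usage on the real host:
# check_setting /etc/selinux/config SELINUX disabled || echo "SELinux not disabled"
# check_setting /etc/ssh/sshd_config PasswordAuthentication no || echo "password auth still on"
```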
CUDA
Initially the open source nouveau driver is loaded by default:
# lsmod | grep nouveau
nouveau 1527946 0
video 24400 1 nouveau
mxm_wmi 13021 1 nouveau
drm_kms_helper 146456 2 ast,nouveau
ttm 93908 2 ast,nouveau
i2c_algo_bit 13413 3 ast,igb,nouveau
drm 372540 5 ast,ttm,drm_kms_helper,nouveau
i2c_core 40756 8 ast,drm,igb,i2c_i801,ipmi_ssif,drm_kms_helper,i2c_algo_bit,nouveau
wmi 19070 2 mxm_wmi,nouveau
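The same check can be scripted. A small sketch (nouveau_loaded is a hypothetical helper) that decides from lsmod-style output whether the module is loaded, matching only the module-name column so mere dependency mentions don't count:

```shell
# Sketch: test whether nouveau appears as a loaded module.
# Reads lsmod-style text on stdin, so the logic itself needs no root.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Intended usage on the real host:
# lsmod | nouveau_loaded && echo "nouveau is loaded; the Nvidia driver is not active yet"
```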
Install the CUDA Toolkit, using the Installer Type rpm (network):
# yum install -y https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-8.0.61-1.x86_64.rpm
# yum clean all
# yum install -y cuda
# reboot
This installs CUDA 8.0 at /usr/local/cuda-8.0, and also creates a symbolic link /usr/local/cuda pointing to that directory.
Now CUDA 8.0 is installed, and the nouveau driver is blacklisted via /etc/modprobe.d/blacklist-nouveau.conf:
# RPM Fusion blacklist for nouveau driver - you need to run as root:
# dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
# if nouveau is loaded despite this file.
blacklist nouveau
Let’s list all Nvidia GPUs:
# nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-9e5e4bb8-5810-69c5-c783-1759ac6cd0ce)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-3d046d75-24b5-b06f-5ec2-ba43aeb6b17f)
GPU 2: GeForce GTX 1080 Ti (UUID: GPU-be6407ea-08f6-06d4-732b-47b40896950c)
GPU 3: GeForce GTX 1080 Ti (UUID: GPU-41444e04-2b58-56a6-2a85-318f03017dd6)
We note in passing that the NVIDIA CUDA Profiling Tools Interface (CUPTI) is installed at /usr/local/cuda-8.0/extras/CUPTI. This library provides advanced profiling support and is required by the current version (1.3) of TensorFlow.
cuDNN
The current version (1.3) of TensorFlow requires cuDNN v6.0. Download the tarball of cuDNN v6.0 for CUDA 8.0, inspect its contents, and unpack it:
# tar tfz cudnn-8.0-linux-x64-v6.0.tgz
cuda/include/cudnn.h
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.6
cuda/lib64/libcudnn.so.6.0.21
cuda/lib64/libcudnn_static.a
# tar xvfz cudnn-8.0-linux-x64-v6.0.tgz -C /usr/local/
which places the header and the libraries in /usr/local/cuda, which is a symbolic link to /usr/local/cuda-8.0.
The CUDA installation has added an entry cuda-8-0.conf in /etc/ld.so.conf.d/, so there is no need to alter the LD_LIBRARY_PATH environment variable.
# cat /etc/ld.so.conf.d/cuda-8-0.conf
/usr/local/cuda-8.0/targets/x86_64-linux/lib
Refresh the shared library cache:
# ldconfig
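To confirm which cuDNN version is actually in place, the version can be read back from the header, since cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL. A sketch (cudnn_version is a hypothetical helper; the default header path follows the tarball layout above):

```shell
# Sketch: extract the installed cuDNN version from cudnn.h.
cudnn_version() {
    local header=${1:-/usr/local/cuda/include/cudnn.h}
    awk '/^#define CUDNN_MAJOR /      { major = $3 }
         /^#define CUDNN_MINOR /      { minor = $3 }
         /^#define CUDNN_PATCHLEVEL / { patch = $3 }
         END { printf "%s.%s.%s\n", major, minor, patch }' "$header"
}

# On this machine the tarball shipped libcudnn.so.6.0.21,
# so cudnn_version should report 6.0.21.
```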
ZFS on SATA SSDs
Install the ZFS repository for RHEL/CentOS 7.3:
# yum install -y http://download.zfsonlinux.org/epel/zfs-release.el7_3.noarch.rpm
We’ll install the kABI-tracking kmod ZFS packages. Follow the instructions in the wiki to modify /etc/yum.repos.d/zfs.repo, then install zfs:
# yum install -y zfs
which installs zfs as well as its dependencies kmod-spl, kmod-zfs, libzfs2, libzpool2 and spl.
Enable and start ZFS:
# systemctl preset zfs-import-cache zfs-import-scan zfs-mount zfs.target
# systemctl start zfs.target
Note there is a preset file for ZFS at /usr/lib/systemd/system-preset/50-zfs.preset
:
# ZFS is enabled by default
enable zfs-import-cache.service
disable zfs-import-scan.service
enable zfs-mount.service
enable zfs-share.service
enable zfs-zed.service
enable zfs.target
Here we’ve only enabled zfs-import-cache, zfs-mount and zfs.target (zfs-import-scan is preset to disabled). But it’s worthwhile looking into zfs-zed (the ZFS Event Daemon).
Find out the disk IDs of the eight 480GB Samsung PM863a SATA SSDs:
# ls -lh /dev/disk/by-id/
lrwxrwxrwx 1 root root 9 Jul 28 21:05 ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411211 -> ../../sde
lrwxrwxrwx 1 root root 9 Jul 28 21:05 ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411217 -> ../../sdf
lrwxrwxrwx 1 root root 9 Jul 28 21:05 ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411300 -> ../../sdh
lrwxrwxrwx 1 root root 9 Jul 28 21:05 ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411303 -> ../../sdb
lrwxrwxrwx 1 root root 9 Jul 28 21:05 ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411318 -> ../../sdc
lrwxrwxrwx 1 root root 9 Jul 28 21:05 ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411319 -> ../../sdd
lrwxrwxrwx 1 root root 9 Jul 28 21:05 ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411320 -> ../../sda
lrwxrwxrwx 1 root root 9 Jul 28 21:05 ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411321 -> ../../sdg
Use the following shell script, mkzfs, to create a ZFS pool on the eight 480GB Samsung PM863a SATA SSDs:
#!/bin/bash
/sbin/modprobe zfs
zpool create -f -m /home home -o ashift=12 raidz1 \
ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411320 \
ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411303 \
ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411318 \
ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411319
zpool add -f home -o ashift=12 raidz1 \
ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411211 \
ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411217 \
ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411321 \
ata-SAMSUNG_MZ7LM480HMHQ-00005_S2UJNX0J411300
zfs set recordsize=1024K home
zfs set checksum=fletcher4 home
zfs set atime=off home
This script automatically creates /etc/zfs/zpool.cache, so we don’t need to run the following command as instructed in the Arch Linux wiki article:
zpool set cachefile=/etc/zfs/zpool.cache <pool>
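As a sanity check on the layout above: ashift=12 tells ZFS to use 2^12 = 4096-byte physical sectors, and each raidz1 vdev spends one disk's worth of space on parity. A back-of-envelope sketch of the expected usable capacity (decimal GB, before ZFS metadata overhead):

```shell
# Two raidz1 vdevs of 4 x 480 GB: each vdev keeps (4 - 1) disks of data.
vdevs=2
disks_per_vdev=4
disk_gb=480
usable_gb=$(( vdevs * (disks_per_vdev - 1) * disk_gb ))
sector_bytes=$(( 1 << 12 ))    # ashift=12 -> 4096-byte sectors
echo "${usable_gb} GB usable, ${sector_bytes}-byte sectors"
# -> 2880 GB usable, 4096-byte sectors
```

This squares with the df output shown later (2.5T for /home): 2880 decimal GB is roughly 2.6 TiB, minus parity-padding and metadata overhead.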
XFS on NVMe SSD
Partition the NVMe SSD:
# parted /dev/nvme0n1
GNU Parted 3.1
Using /dev/nvme0n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt
Warning: The existing disk label on /dev/nvme0n1 will be destroyed and all data on this
disk will be lost. Do you want to continue?
Yes/No? Yes
(parted) unit s
(parted) print free
Model: Unknown (unknown)
Disk /dev/nvme0n1: 6251233968s
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
34s 6251233934s 6251233901s Free Space
(parted) mkpart primary 0% 100%
(parted) print free
Model: Unknown (unknown)
Disk /dev/nvme0n1: 6251233968s
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
34s 2047s 2014s Free Space
1 2048s 6251233279s 6251231232s primary
6251233280s 6251233934s 655s Free Space
(parted) quit
Information: You may need to update /etc/fstab.
Create an XFS filesystem on the NVMe partition:
# mkfs.xfs /dev/nvme0n1p1
meta-data=/dev/nvme0n1p1 isize=512 agcount=4, agsize=195350976 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=781403904, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=381544, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
# xfs_admin -L scratch /dev/nvme0n1p1
writing all SBs
new label = "scratch"
# mkdir /scratch
# mount /dev/nvme0n1p1 /scratch
# chmod 1777 /scratch
# ls -ld /scratch
drwxrwxrwt 2 root root 6 Jul 28 21:38 /scratch
We have the sticky bit set.
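The chmod 1777 mirrors /tmp: world-writable, but the sticky bit stops users from deleting each other's files. A quick sketch of the same setup, rehearsed on a throwaway directory:

```shell
# Sketch: reproduce the /scratch permissions on a temp directory and
# read the mode back; 1777 = sticky bit + rwx for everyone.
d=$(mktemp -d)
chmod 1777 "$d"
mode=$(stat -c '%a' "$d")   # GNU stat; prints the octal mode
echo "$mode"                # 1777
rmdir "$d"
```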
Find out the UUID of the NVMe partition:
# cd /dev/disk/by-uuid/
# ls -l
lrwxrwxrwx 1 root root 15 Jul 28 21:39 f67abe6c-42db-4006-aece-ea87b89b8eaf -> ../../nvme0n1p1
Append the following line to /etc/fstab so that the partition is automatically mounted at startup:
UUID=f67abe6c-42db-4006-aece-ea87b89b8eaf /scratch xfs defaults 0 0
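Editing /etc/fstab by hand is easy to fat-finger. A sketch of a safer append (add_fstab_entry is a hypothetical helper; it takes the target file as a parameter so it can be rehearsed on a copy, and it skips the append if an entry for the UUID already exists):

```shell
# Sketch: append a UUID-based fstab entry, idempotently.
add_fstab_entry() {
    local uuid=$1 mountpoint=$2 fstype=$3 fstab=${4:-/etc/fstab}
    # already present? then do nothing
    grep -q "^UUID=${uuid} " "$fstab" 2>/dev/null && return 0
    printf 'UUID=%s %s %s defaults 0 0\n' "$uuid" "$mountpoint" "$fstype" >> "$fstab"
}

# Intended usage on the real host:
# add_fstab_entry f67abe6c-42db-4006-aece-ea87b89b8eaf /scratch xfs
```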
Upgrading to CentOS 7.4
We upgraded the OS to CentOS 7.4 in November 2017:
# yum -y update
# reboot
1) After the reboot, the kernel has been upgraded to 3.10.0-693.5.2.el7.x86_64 (from 3.10.0-514.26.2.el7.x86_64):
# uname -r
3.10.0-693.5.2.el7.x86_64
2) The nvidia driver has been upgraded to 384.81 (from 384.66):
# nvidia-smi
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
3) CUDA 9.0 is installed, and /usr/local/cuda now points to cuda-9.0. Note CUDA 8.0 is not removed; it is still kept at /usr/local/cuda-8.0.
4) The CUDA 9.0 installation has added an entry cuda-9-0.conf in /etc/ld.so.conf.d/:
# cat /etc/ld.so.conf.d/cuda-9-0.conf
/usr/local/cuda-9.0/targets/x86_64-linux/lib
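Because the rpm installs keep each toolkit side by side and only repoint the /usr/local/cuda symlink, pinning a build back to CUDA 8.0 is just a symlink change. A sketch of the mechanism, rehearsed in a temp directory rather than /usr/local (ln -sfn replaces an existing symlink without following it):

```shell
# Sketch: how the versioned-toolkit + symlink scheme behaves.
root=$(mktemp -d)
mkdir "$root/cuda-8.0" "$root/cuda-9.0"

ln -sfn cuda-9.0 "$root/cuda"      # what the CUDA 9.0 package set up
first=$(readlink "$root/cuda")     # cuda-9.0

ln -sfn cuda-8.0 "$root/cuda"      # repoint to the older toolkit
second=$(readlink "$root/cuda")    # cuda-8.0

echo "$first -> $second"
rm -rf "$root"
```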
ZFS
But ZFS stopped working! Let’s fix it.
Remove the old zfs-release package, which is for RHEL/CentOS 7.3:
# rpm -q zfs-release
zfs-release-1-4.el7_3.centos.noarch
# yum erase zfs-release
Install zfs-release for EL7.4:
# yum -y install http://download.zfsonlinux.org/epel/zfs-release.el7_4.noarch.rpm
We’ll again use the kABI-tracking kmod ZFS packages. Follow the instructions in the wiki to modify /etc/yum.repos.d/zfs.repo, then update to install the latest zfs packages:
# yum clean all
# yum update
Restart zfs.target:
# systemctl restart zfs.target
Now ZFS is mounted:
# df -h /home
Filesystem Size Used Avail Use% Mounted on
home 2.5T 379G 2.1T 16% /home
ZFS has been upgraded to 0.7.3-1 (from 0.7.1-1):
# cat /sys/module/zfs/version
0.7.3-1
CephFS
CephFS was not mounted either! As it turns out, the systemd unit file /usr/lib/systemd/system/ceph-fuse@.service had been overwritten. We had to modify it again by following the steps in the post Mounting a Subdirectory of CephFS on a CentOS 7 Client; then it worked again!
cuDNN
Download the tarball of cuDNN v7.0 for CUDA 9.0, inspect its contents, and unpack it:
# tar tfz cudnn-9.0-linux-x64-v7.tgz
cuda/include/cudnn.h
cuda/NVIDIA_SLA_cuDNN_Support.txt
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.7
cuda/lib64/libcudnn.so.7.0.4
cuda/lib64/libcudnn_static.a
# tar xvfz cudnn-9.0-linux-x64-v7.tgz -C /usr/local/
Refresh the shared library cache:
# ldconfig
Upgrading to CentOS 7.5
We upgraded the OS to CentOS 7.5 in July 2018:
# yum -y update
# reboot
1) After the reboot, the kernel has been upgraded to 3.10.0-862.9.1.el7.x86_64:
# uname -r
3.10.0-862.9.1.el7.x86_64
2) The nvidia driver has been upgraded to 396.37:
# nvidia-smi
| NVIDIA-SMI 396.37 Driver Version: 396.37 |
3) CUDA 9.2 is installed, and /usr/local/cuda now points to cuda-9.2. Note CUDA 9.0 is not removed; it is still kept at /usr/local/cuda-9.0.
4) The CUDA 9.2 installation has added an entry cuda-9-2.conf in /etc/ld.so.conf.d/:
# cat /etc/ld.so.conf.d/cuda-9-2.conf
/usr/local/cuda-9.2/targets/x86_64-linux/lib
ZFS
But ZFS again stopped working! Let’s fix it.
Remove the old zfs-release package, which is for RHEL/CentOS 7.4:
# rpm -q zfs-release
zfs-release-1-5.el7_4.noarch
# yum erase zfs-release
Install zfs-release for EL7.5:
# yum -y install http://download.zfsonlinux.org/epel/zfs-release.el7_5.noarch.rpm
We’ll again use the kABI-tracking kmod ZFS packages. Follow the instructions in the wiki to modify /etc/yum.repos.d/zfs.repo, then update to install the latest zfs packages:
# yum clean all
# yum update
Restart zfs.target:
# systemctl restart zfs.target
Now ZFS is mounted:
# df -h /home
Filesystem Size Used Avail Use% Mounted on
home 2.5T 379G 2.1T 16% /home
ZFS has been upgraded to 0.7.9-1 (from 0.7.3-1):
# cat /sys/module/zfs/version
0.7.9-1
CephFS
CephFS was not mounted either! As it turns out, the systemd unit file /usr/lib/systemd/system/ceph-fuse@.service had been overwritten again. We had to modify it once more by following the steps in the post Mounting a Subdirectory of CephFS on a CentOS 7 Client; then it worked again!
cuDNN
Download the tarball of cuDNN v7.1 for CUDA 9.2, inspect its contents, and unpack it:
# tar tfz cudnn-9.2-linux-x64-v7.1.tgz
cuda/include/cudnn.h
cuda/NVIDIA_SLA_cuDNN_Support.txt
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.7
cuda/lib64/libcudnn.so.7.1.4
cuda/lib64/libcudnn_static.a
# tar xvfz cudnn-9.2-linux-x64-v7.1.tgz -C /usr/local/
Refresh the shared library cache:
# ldconfig