venadi.ucsc.edu (IPv4 address: 128.114.109.74) is a Data Transfer Node, kindly provided by the Pacific Research Platform (PRP) to improve the data transfer performance of the Hyades cluster. The system was configured and tuned by John Graham at UCSD before being shipped to UCSC.
Venadi is a FIONA box. FIONA stands for Flash I/O Network Appliance. Designed by Phil Papadopoulos and Tom DeFanti at UCSD, FIONA is a low-cost, flash memory-based data server appliance that can handle huge data flows. Here are the hardware specs of Venadi:
The two (2x) Intel 240GB SSDs are attached to onboard SATA3 (6 Gbps) ports and serve as boot drives. Two software RAID1 (mirror) volumes, md126 and md127, are created on the two SSDs; they are mounted at / and /boot, respectively.
ZFS
ZFS on Linux 0.6.5.8 is installed on Venadi. The following script was used to create a ZFS pool on the 16x Intel 480GB SSDs:
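The original script is not reproduced here, but a minimal sketch of what it likely did, given the layout described below, is (the device names /dev/sdc through /dev/sdr are placeholders, not the actual names on venadi):

```bash
#!/bin/bash
# Sketch: create a pool named "bigdata" striped across two raidz1 VDEVs,
# each built from 8 of the 480GB SSDs. Device names are placeholders.
zpool create -f bigdata \
    raidz1 sdc sdd sde sdf sdg sdh sdi sdj \
    raidz1 sdk sdl sdm sdn sdo sdp sdq sdr
```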
So there are 2 raidz1 VDEVs (virtual devices), each on 8 SSDs, and the zpool bigdata stripes across the two raidz1 VDEVs. The ZFS pool has a usable capacity of 5.7 TB:
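The layout and capacity can be checked with the usual tools (output omitted; the /bigdata mount point is an assumption):

```bash
zpool status bigdata   # shows the two raidz1 VDEVs, 8 disks each
zpool list bigdata     # raw pool size
df -h /bigdata         # usable capacity, assuming the pool is mounted at /bigdata
```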
Next we run dd with the option conv=fdatasync, which will cause dd to physically write output file data before finishing:
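A sketch of the two runs, assuming /tmp sits on the software RAID1 root volume and the ZFS pool is mounted at /bigdata; the 10 GB size and file names are illustrative:

```bash
# 10 GB write to the software RAID1 volume; flush data to disk before dd exits
dd if=/dev/zero of=/tmp/ddtest.img bs=1M count=10240 conv=fdatasync
# 10 GB write to the ZFS pool
dd if=/dev/zero of=/bigdata/ddtest.img bs=1M count=10240 conv=fdatasync
```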
The write speed to the software RAID1 volume dropped to an appallingly low 116 MB/s, while that to ZFS was only slightly lower at 4.2 GB/s.
We also run dd with the option oflag=dsync, which will cause dd to use synchronized I/O for data:
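A sketch of the synchronized-I/O runs, again with illustrative paths and sizes:

```bash
# oflag=dsync makes dd sync after every block, i.e., every 1 MB here
dd if=/dev/zero of=/tmp/ddtest.img bs=1M count=1024 oflag=dsync
dd if=/dev/zero of=/bigdata/ddtest.img bs=1M count=1024 oflag=dsync
```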
Now both numbers are abysmal: 69.3 MB/s for software RAID1 and 142 MB/s for ZFS. However, the numbers may be artificially low, because in this mode dd syncs after every megabyte (the block size, bs).
Raising bs to 1GB may give a slightly more accurate number:
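For example (the count is illustrative):

```bash
# with bs=1G, dd syncs only once per 1 GB block
dd if=/dev/zero of=/bigdata/ddtest.img bs=1G count=10 oflag=dsync
```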
bonnie++
By default, bonnie++ uses datasets whose size is twice the amount of memory, in order to minimize the effect of file caching. The total memory of venadi is 128 GB.
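A sketch of the invocation, assuming the test directory is on the ZFS pool at /bigdata and the run is done as root (-s 256g simply makes the 2x-RAM default explicit):

```bash
# 256 GB dataset (2x the 128 GB of RAM) on the ZFS pool
bonnie++ -d /bigdata -s 256g -u root
```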
Block sequential write speed is 1476574 KB/s = 1.48 GB/s.
Block sequential rewrite speed is 1133554 KB/s = 1.13 GB/s.
Block sequential read speed is 3.53 GB/s.
The ZFS volume delivers 4029 IOPS (Random Seeks in the table). By comparison, a 10,000 RPM SAS drive has ~140 IOPS.
IOzone
IOzone is another popular filesystem benchmark tool. The benchmark generates and measures a variety of file operations. It tests file I/O performance for the following operations: read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread, mmap, aio_read, and aio_write.
By default, IOzone automatically creates temporary files of sizes from 64 KB to 512 MB to perform its various tests, and generates a lot of output. Here we fix the test file size to 256 GB, twice the amount of memory. For the sake of time and space, we only test write and read speeds:
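A sketch of such a run, assuming the test file lives on the ZFS pool (the 1 MB record size is an assumption):

```bash
# -i 0: write/rewrite; -i 1: read/re-read; 256 GB file with 1 MB records
iozone -i 0 -i 1 -s 256g -r 1m -f /bigdata/iozone.tmp
```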
Note that the write speed was only 0.87 GB/s, and the rewrite speed was even slower at 0.50 GB/s!
Let’s do a full test with a fixed file size of 8 GB:
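A sketch, assuming IOzone’s automatic mode (-a) is combined with a fixed 8 GB file size so that only the record length varies:

```bash
# full automatic test over a range of record lengths, with an 8 GB file
iozone -a -s 8g -f /bigdata/iozone.tmp
```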
Here, caching most likely distorts the IOzone results for the smaller files! The write speed reached a maximum of 6.96 GB/s, and the read speed 9.55 GB/s, when the record length (reclen) was 2048 KiB.
iperf3
Before we perform disk-to-disk data transfer tests between venadi and NERSC systems, it is worthwhile to measure the memory-to-memory performance of the network. We’ll test against the perfSONAR host at NERSC (perfsonar.nersc.gov).
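A sketch of the test, assuming an iperf3 server is reachable on the perfSONAR host (in practice the test may need to be scheduled through bwctl); the stream count and duration are illustrative:

```bash
# memory-to-memory throughput from venadi to the NERSC perfSONAR host
iperf3 -c perfsonar.nersc.gov -P 4 -t 30
# reverse direction (NERSC sends to venadi)
iperf3 -c perfsonar.nersc.gov -P 4 -t 30 -R
```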
Each NERSC DTN has four 10GbE links (bonded as bond0) for transfers over the network and two FDR IB connections to the filesystem:
Here is the route from venadi to dtn03.nersc.gov:
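For example:

```bash
traceroute dtn03.nersc.gov
```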
scp/sftp
Although not recommended, let’s see how long it takes to download the 100GB data file from NERSC to venadi, via scp/sftp:
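A sketch of the command; the file name is illustrative and the NERSC username is omitted:

```bash
# pull the 100GB file from a NERSC DTN over scp
scp dtn03.nersc.gov:/global/cscratch1/sd/shawdong/100GB.dat /bigdata/
```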
So we got an average of 182.2 MB/s (1.46 Gbps) via scp/sftp.
BaBar Copy (bbcp)
Next let’s see how long it takes to download the 100GB data file from NERSC to venadi, via bbcp:
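A sketch of the bbcp invocation; the stream count and window size are illustrative tuning values, not necessarily what was used:

```bash
# 16 parallel streams, 8 MB window, progress report every 10 seconds
bbcp -P 10 -s 16 -w 8m \
    dtn03.nersc.gov:/global/cscratch1/sd/shawdong/100GB.dat /bigdata/
```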
So we got about 440 MB/s (3.52 Gbps) via bbcp, about 2.4 times the transfer speed via scp/sftp.
My 100GB data file is stored at /global/cscratch1/sd/shawdong/, which is on Cori's Lustre file system. Perhaps the lackluster transfer speed is due to the poor performance of the Lustre file system? It could be too busy, or it may not be fully optimized yet, since Cori is new.
Let’s test against Eli Dart’s dataset, which is stored on GPFS:
First let’s download a single 50GB file:
Great! We were getting close to 1GB/s!
Next, let’s download the whole dataset:
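For the whole dataset, something along these lines, with the remote directory a placeholder for wherever Eli Dart's dataset lives on the NERSC GPFS file system:

```bash
# copy every file in the (placeholder) dataset directory; destination must exist
bbcp -P 10 -s 16 -w 8m \
    'dtn03.nersc.gov:/path/to/dart-dataset/*' /bigdata/dart-dataset/
```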
Sweet! We got a consistent 1GB/s and up when downloading the 520GB dataset!
GridFTP
Let’s see how long it takes to download the 100GB data file from NERSC to venadi, via GridFTP:
However, at the time of this test, the NERSC DTNs were having an issue with ID mapping:
But Edison works fine:
Let’s see how fast we can transfer a 100GB file from NERSC Edison to venadi:
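A sketch of the GridFTP transfer, assuming a valid grid proxy; the Edison GridFTP endpoint and remote path are placeholders:

```bash
# -vb: report throughput; -p 4: four parallel streams; -fast: reuse data channels
globus-url-copy -vb -fast -p 4 \
    gsiftp://<edison-gridftp-endpoint>/path/to/100GB.dat \
    file:///bigdata/100GB.dat
```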
The average speed was only 108.32 MB/s, though I noticed that the peak speed reached 290 MB/s. Most likely, the limiting factor was the scratch file system, which is a busy resource shared by thousands of users!
Update: a NERSC consultant confirmed that my account was not properly set up in NIM for the data transfer nodes. The issue has been fixed! Let’s test again:
The sluggish speed is likely due to the performance of Cori's Lustre file system.
Let’s test against Eli Dart’s dataset, which is stored on GPFS:
We got a whopping average transfer speed of 1.487 GB/s, with peak speed of almost 2GB/s!
Globus Online
Lastly, let’s use Globus Online to transfer the 100GB file from Edison:
The paltry effective speed of 45.56 MB/s was mostly due to the long time spent in "verifying file integrity after transfer"; the actual transfer speed was about 110 MB/s, the same as that of GridFTP. Once again, the sluggish speed is due to the performance of Edison's Lustre file system. Tests against Eli Dart's dataset, which is stored on GPFS, would give much better results.
Quick Remarks
- UCSC’s SciDMZ is in excellent shape! A well-tuned host like venadi can reach the 40GbE line rate in memory-to-memory transfers;
- Local I/O performance of FIONA boxes is fast! For small files, we observed a sequential write speed of 4.2 GB/s with dd, and even higher with IOzone; for big files, we observed a sequential write speed of 1.48 GB/s with bonnie++;
- We are able to transfer files from NERSC GPFS to venadi at lightning-fast disk-to-disk speeds:
  - we got a consistent transfer speed of 1 GB/s using bbcp;
  - we got an average transfer speed of almost 1.5 GB/s using GridFTP/Globus.