Network Tuning for Pulpos
October 10, 2017 | Linux Network
In this post, we document the network tuning we’ve done for the nodes in the Pulpos cluster.
Default Settings
Default settings on CentOS 7 hosts with 32GB RAM (such as pulpo-admin):
net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.core.wmem_default = 212992
net.core.wmem_max = 212992
net.core.netdev_max_backlog = 1000
net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_mem = 766212 1021616 1532424
net.ipv4.udp_mem = 768948 1025267 1537896
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_wmem_min = 4096
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_mtu_probing = 0
Default settings on CentOS 7 hosts with 64GB RAM (such as pulpo-mds01):
net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.core.wmem_default = 212992
net.core.wmem_max = 212992
net.core.netdev_max_backlog = 1000
net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_mem = 1540329 2053772 3080658
net.ipv4.udp_mem = 1543059 2057414 3086118
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_wmem_min = 4096
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_mtu_probing = 0
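These defaults can be queried on any of the hosts with sysctl; for example (not from the original notes, just the standard way to read several keys at once):
[root@pulpo-admin ~]# sysctl net.core.rmem_max net.core.wmem_max net.core.netdev_max_backlog net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.ipv4.tcp_congestion_control net.ipv4.tcp_mtu_probing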
And the default Transmit Queue Length (txqueuelen) is 1000:
[root@pulpo-admin ~]# ifconfig ens1f0 | grep txqueuelen
ether 90:e2:ba:6f:83:58 txqueuelen 1000 (Ethernet)
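The same value can also be read with the ip tool (the qlen field), which is handy on hosts where ifconfig isn’t installed; for example:
[root@pulpo-admin ~]# ip link show ens1f0 | grep -o 'qlen [0-9]*'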
Install network test tools
iperf3 is available in the base repo & nuttcp in EPEL:
[root@pulpo-admin ~]# yum -y install iperf3 nuttcp
[root@pulpo-admin ~]# tentakel yum -y install iperf3 nuttcp
iperf3
ESnet maintains an excellent guide on iperf3. The default port for iperf3 is 5201.
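If the server’s firewall doesn’t already allow iperf3 traffic, the default port can be opened for the duration of a test; a runtime-only example (on venadi the relevant range is already open, as the listing below shows):
[root@venadi ~]# firewall-cmd --zone=public --add-port=5201/tcp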
1) First we’ll perform a memory-to-memory test between two 10G hosts: venadi (a FIONA box) and pulpo-admin. On venadi, the following ports are already open:
[root@venadi ~]# firewall-cmd --list-all
public (default, active)
interfaces: enp5s0
sources:
services: dhcpv6-client http ssh
ports: 6001-6200/tcp 50000-51000/udp 4823/tcp 5001-5900/tcp 6001-6200/udp 2223/tcp 5001-5900/udp 50000-51000/tcp
masquerade: no
forward-ports:
icmp-blocks:
rich rules:
Start iperf3 server on venadi:
[root@venadi ~]# iperf3 -s
Start iperf3 client on pulpo-admin:
[root@pulpo-admin ~]# iperf3 -c venadi.ucsc.edu -i 1 -t 10 -V
iperf 3.1.7
Linux pulpo-admin.ucsc.edu 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64
Control connection MSS 8948
Time: Tue, 10 Oct 2017 21:05:46 GMT
Connecting to host venadi.ucsc.edu, port 5201
Cookie: pulpo-admin.ucsc.edu.1507669546.2008
TCP MSS: 8948 (default)
[ 4] local 128.114.86.2 port 51346 connected to 128.114.109.74 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 1.15 GBytes 9.91 Gbits/sec 0 830 KBytes
[ 4] 1.00-2.00 sec 1.15 GBytes 9.89 Gbits/sec 0 830 KBytes
[ 4] 2.00-3.00 sec 1.15 GBytes 9.90 Gbits/sec 0 830 KBytes
[ 4] 3.00-4.00 sec 1.15 GBytes 9.91 Gbits/sec 0 830 KBytes
[ 4] 4.00-5.00 sec 1.15 GBytes 9.90 Gbits/sec 0 830 KBytes
[ 4] 5.00-6.00 sec 1.15 GBytes 9.90 Gbits/sec 0 830 KBytes
[ 4] 6.00-7.00 sec 1.15 GBytes 9.90 Gbits/sec 0 830 KBytes
[ 4] 7.00-8.00 sec 1.15 GBytes 9.90 Gbits/sec 0 830 KBytes
[ 4] 8.00-9.00 sec 1.14 GBytes 9.82 Gbits/sec 0 830 KBytes
[ 4] 9.00-10.00 sec 1.15 GBytes 9.90 Gbits/sec 0 961 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 11.5 GBytes 9.89 Gbits/sec 0 sender
[ 4] 0.00-10.00 sec 11.5 GBytes 9.89 Gbits/sec receiver
CPU Utilization: local/sender 26.9% (0.3%u/26.7%s), remote/receiver 10.5% (0.5%u/10.0%s)
snd_tcp_congestion cubic
iperf Done.
2) Next we’ll perform a memory-to-memory test between two 40G hosts: pulpo-osd01 & pulpo-osd02.
Start iperf3 server on pulpo-osd01:
[root@pulpo-osd01 ~]# iperf3 -s
Start iperf3 client on pulpo-osd02:
[root@pulpo-osd02 ~]# iperf3 -c pulpo-osd01.cluster -i 1 -t 10 -V
iperf 3.1.7
Linux pulpo-osd02.ucsc.edu 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64
Control connection MSS 8948
Time: Tue, 10 Oct 2017 21:47:44 GMT
Connecting to host pulpo-osd01.cluster, port 5201
Cookie: pulpo-osd02.ucsc.edu.1507672064.6398
TCP MSS: 8948 (default)
[ 4] local 192.168.40.6 port 57414 connected to 192.168.40.5 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 2.62 GBytes 22.5 Gbits/sec 0 717 KBytes
[ 4] 1.00-2.00 sec 2.78 GBytes 23.9 Gbits/sec 0 717 KBytes
[ 4] 2.00-3.00 sec 2.93 GBytes 25.2 Gbits/sec 0 717 KBytes
[ 4] 3.00-4.00 sec 2.94 GBytes 25.2 Gbits/sec 20 507 KBytes
[ 4] 4.00-5.00 sec 2.87 GBytes 24.7 Gbits/sec 0 507 KBytes
[ 4] 5.00-6.00 sec 2.90 GBytes 24.9 Gbits/sec 0 533 KBytes
[ 4] 6.00-7.00 sec 2.90 GBytes 24.9 Gbits/sec 0 533 KBytes
[ 4] 7.00-8.00 sec 2.93 GBytes 25.2 Gbits/sec 0 568 KBytes
[ 4] 8.00-9.00 sec 2.85 GBytes 24.5 Gbits/sec 2 428 KBytes
[ 4] 9.00-10.00 sec 2.84 GBytes 24.4 Gbits/sec 0 472 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 28.6 GBytes 24.5 Gbits/sec 22 sender
[ 4] 0.00-10.00 sec 28.6 GBytes 24.5 Gbits/sec receiver
CPU Utilization: local/sender 34.3% (0.4%u/34.0%s), remote/receiver 15.2% (0.1%u/15.2%s)
snd_tcp_congestion cubic
rcv_tcp_congestion cubic
iperf Done.
Before network tuning, we only got about 25 Gbps between 40G hosts!
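To check whether a single TCP stream is the bottleneck, iperf3 also supports parallel streams (-P) and reverse mode (-R); for example (not part of the runs recorded above):
[root@pulpo-osd02 ~]# iperf3 -c pulpo-osd01.cluster -P 4 -t 10
[root@pulpo-osd02 ~]# iperf3 -c pulpo-osd01.cluster -R -t 10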
nuttcp
ESnet also maintains an excellent guide on nuttcp.
By default, the nuttcp server listens for commands on port 5000, and the actual nuttcp data transfers take place on port 5001.
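Both ports can be overridden if they clash with other services; the sketch below assumes nuttcp’s -P (control port) and -p (data port) options and was not part of the original tests:
[root@venadi ~]# nuttcp -S -P 5100 -p 5101
[root@pulpo-admin ~]# nuttcp -i1 -P 5100 -p 5101 venadi.ucsc.edu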
1) First we’ll perform a memory-to-memory test between two 10G hosts: venadi (a FIONA box) and pulpo-admin. Open TCP port 5000 on the server (venadi):
[root@venadi ~]# firewall-cmd --zone=public --add-port=5000/tcp
success
Open a range of TCP ports for the data connections on the client (pulpo-admin):
[root@pulpo-admin ~]# firewall-cmd --zone=public --add-port=5000-5100/tcp
success
Start nuttcp server on venadi:
[root@venadi ~]# nuttcp -S
Start nuttcp client on pulpo-admin:
[root@pulpo-admin ~]# nuttcp -i1 -r venadi.ucsc.edu
1178.5000 MB / 1.00 sec = 9885.4901 Mbps 17 retrans
1179.9375 MB / 1.00 sec = 9898.3697 Mbps 27 retrans
1180.0000 MB / 1.00 sec = 9898.5376 Mbps 34 retrans
1180.0000 MB / 1.00 sec = 9898.4585 Mbps 29 retrans
1179.4375 MB / 1.00 sec = 9893.8883 Mbps 40 retrans
1179.6250 MB / 1.00 sec = 9895.1248 Mbps 30 retrans
1179.4375 MB / 1.00 sec = 9894.0763 Mbps 38 retrans
1179.8125 MB / 1.00 sec = 9896.8658 Mbps 24 retrans
1179.5000 MB / 1.00 sec = 9894.6006 Mbps 33 retrans
1179.3750 MB / 1.00 sec = 9893.3245 Mbps 29 retrans
11801.2792 MB / 10.01 sec = 9894.3578 Mbps 11 %TX 49 %RX 301 retrans 0.18 msRTT
Almost at line rate despite the retrans!
Kill the server:
[root@venadi ~]# pkill nuttcp
Close the TCP ports again on both the server and the client:
[root@venadi ~]# firewall-cmd --zone=public --remove-port=5000/tcp
[root@pulpo-admin ~]# firewall-cmd --zone=public --remove-port=5000-5100/tcp
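Note that firewall-cmd changes made without --permanent only last until the next firewalld reload or reboot; to persist a rule one would instead do something like:
# firewall-cmd --zone=public --permanent --add-port=5000/tcp
# firewall-cmd --reload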
2) Next we’ll perform a memory-to-memory test between two 40G hosts: pulpo-osd01 & pulpo-osd02.
Start nuttcp server (one shot) on pulpo-osd01:
[root@pulpo-osd01 ~]# nuttcp -1
Start nuttcp client on pulpo-osd02:
[root@pulpo-osd02 ~]# nuttcp -i1 pulpo-osd01.cluster
2729.2500 MB / 1.00 sec = 22892.2734 Mbps 0 retrans
2761.8125 MB / 1.00 sec = 23169.7550 Mbps 0 retrans
2797.0000 MB / 1.00 sec = 23463.2181 Mbps 0 retrans
2765.2500 MB / 1.00 sec = 23196.5287 Mbps 0 retrans
2758.5625 MB / 1.00 sec = 23140.5689 Mbps 0 retrans
2744.1875 MB / 1.00 sec = 23019.5449 Mbps 0 retrans
2824.8125 MB / 1.00 sec = 23696.5291 Mbps 0 retrans
2870.1875 MB / 1.00 sec = 24077.1427 Mbps 0 retrans
2919.0000 MB / 1.00 sec = 24486.1264 Mbps 0 retrans
2784.3125 MB / 1.00 sec = 23355.1515 Mbps 0 retrans
27960.6875 MB / 10.00 sec = 23452.7184 Mbps 42 %TX 99 %RX 0 retrans 0.21 msRTT
Before network tuning, we only got about 23.5 Gbps between 40G hosts!
sysctl
FIONA settings
FIONA boxes use the following kernel parameters (/etc/sysctl.d/prp.conf) on 40G hosts:
# buffers up to 512MB
net.core.rmem_max=536870912
net.core.wmem_max=536870912
# increase Linux autotuning TCP buffer limit to 256MB
net.ipv4.tcp_rmem=4096 87380 268435456
net.ipv4.tcp_wmem=4096 65536 268435456
net.core.netdev_max_backlog=250000
net.ipv4.tcp_congestion_control=cubic
net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_no_metrics_save = 1
ESnet recommendations
ESnet maintains an excellent guide on how to tune Linux, Mac OS X, and FreeBSD hosts connected at speeds of 1 Gbps or higher for maximum I/O performance for wide-area network transfers.
Pulpos settings
We use the following kernel parameters (/etc/sysctl.d/10g.conf) on 10G hosts, which are based upon ESnet’s recommendations for a host with a 10G NIC, optimized for network paths of up to 200ms RTT:
# http://fasterdata.es.net/host-tuning/linux/
# allow testing with buffers up to 128MB
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
# increase Linux autotuning TCP buffer limit to 64MB
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# recommended default congestion control is htcp
net.ipv4.tcp_congestion_control=htcp
# recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing=1
# recommended for CentOS7/Debian8 hosts
net.core.default_qdisc = fq
We use the following kernel parameters (/etc/sysctl.d/40g.conf) on 40G hosts, which are based upon ESnet’s recommendations for a host with a Mellanox ConnectX-3 NIC:
# http://fasterdata.es.net/host-tuning/nic-tuning/mellanox-connectx-3/
# allow testing with buffers up to 256MB
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
# increase Linux autotuning TCP buffer limit to 128MB
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
# increase the length of the processor input queue
net.core.netdev_max_backlog = 250000
# recommended default congestion control is htcp
net.ipv4.tcp_congestion_control=htcp
# recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing=1
# recommended for CentOS7/Debian8 hosts
net.core.default_qdisc = fq
To load the configuration file, execute:
# sysctl -p /etc/sysctl.d/40g.conf
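The active values can then be spot-checked with sysctl, e.g.:
# sysctl net.ipv4.tcp_congestion_control net.core.default_qdisc net.core.rmem_max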
Mellanox ConnectX-3
Each OSD node has a single-port Mellanox ConnectX-3 Pro 10/40/56GbE Adapter, which shows up as ens2 in CentOS 7.
# lspci | grep -i Mellanox
02:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
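The driver and firmware versions currently in use can be checked with ethtool (output varies by host):
# ethtool -i ens2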
ESnet recommends using the latest device driver from Mellanox rather than the one that ships with CentOS 7.
Install the prerequisites:
# tentakel -g osd yum -y install lsof libxml2-python
Install Mellanox EN Driver for Linux on the OSD nodes:
# tar xvfz mlnx-en-4.1-1.0.2.0-rhel7.4-x86_64.tgz
# cd mlnx-en-4.1-1.0.2.0-rhel7.4-x86_64
# ./install
Contrary to what is said in the ESnet article on the Mellanox ConnectX-3, the driver installation script didn’t add anything to /etc/sysctl.conf, nor to /etc/sysctl.d/.
Furthermore, modify /etc/modprobe.d/mlx4.conf & /etc/rc.d/rc.local according to the Mellanox guide on ConnectX-3/Pro Tuning for Linux:
# cat /sys/class/net/ens2/device/numa_node
0
# set_irq_affinity_bynode.sh 0 ens2
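To make the IRQ affinity setting survive a reboot, one option (a sketch, not necessarily what ended up in our rc.local) is to append the call to /etc/rc.d/rc.local and make that script executable:
# echo 'set_irq_affinity_bynode.sh 0 ens2' >> /etc/rc.d/rc.local
# chmod +x /etc/rc.d/rc.local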
Speed tests
1) iperf3
Start server:
[root@pulpo-osd01 ~]# iperf3 -s
Start client:
[root@pulpo-osd02 ~]# iperf3 -c pulpo-osd01.cluster -i 1 -t 10 -V
iperf 3.1.7
Linux pulpo-osd02.ucsc.edu 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64
Control connection MSS 8948
Time: Wed, 11 Oct 2017 22:57:37 GMT
Connecting to host pulpo-osd01.cluster, port 5201
Cookie: pulpo-osd02.ucsc.edu.1507762657.0888
TCP MSS: 8948 (default)
[ 4] local 192.168.40.6 port 48812 connected to 192.168.40.5 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 3.13 GBytes 26.9 Gbits/sec 0 1.17 MBytes
[ 4] 1.00-2.00 sec 3.25 GBytes 27.9 Gbits/sec 0 1.31 MBytes
[ 4] 2.00-3.00 sec 3.24 GBytes 27.8 Gbits/sec 0 1.31 MBytes
[ 4] 3.00-4.00 sec 3.24 GBytes 27.8 Gbits/sec 0 1.32 MBytes
[ 4] 4.00-5.00 sec 3.25 GBytes 27.9 Gbits/sec 0 1.35 MBytes
[ 4] 5.00-6.00 sec 3.27 GBytes 28.1 Gbits/sec 0 1.41 MBytes
[ 4] 6.00-7.00 sec 3.23 GBytes 27.7 Gbits/sec 0 1.41 MBytes
[ 4] 7.00-8.00 sec 3.19 GBytes 27.4 Gbits/sec 0 1.43 MBytes
[ 4] 8.00-9.00 sec 3.25 GBytes 27.9 Gbits/sec 0 1.43 MBytes
[ 4] 9.00-10.00 sec 3.24 GBytes 27.9 Gbits/sec 0 1.43 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 32.3 GBytes 27.7 Gbits/sec 0 sender
[ 4] 0.00-10.00 sec 32.3 GBytes 27.7 Gbits/sec receiver
CPU Utilization: local/sender 36.3% (0.5%u/35.9%s), remote/receiver 11.8% (0.2%u/11.6%s)
snd_tcp_congestion htcp
rcv_tcp_congestion htcp
iperf Done.
Better; but far from line rate!
2) nuttcp
Start server (one shot):
[root@pulpo-osd01 ~]# nuttcp -1
Start client:
[root@pulpo-osd02 ~]# nuttcp -i1 pulpo-osd01.cluster
3209.6875 MB / 1.00 sec = 26924.4871 Mbps 0 retrans
3305.1250 MB / 1.00 sec = 27724.8712 Mbps 0 retrans
3368.3750 MB / 1.00 sec = 28256.3448 Mbps 0 retrans
3052.5625 MB / 1.00 sec = 25607.1599 Mbps 0 retrans
3052.6875 MB / 1.00 sec = 25607.6195 Mbps 0 retrans
3053.5000 MB / 1.00 sec = 25614.4608 Mbps 0 retrans
3049.8750 MB / 1.00 sec = 25584.6663 Mbps 0 retrans
3108.3750 MB / 1.00 sec = 26074.8612 Mbps 0 retrans
3125.7500 MB / 1.00 sec = 26220.5341 Mbps 0 retrans
3091.7500 MB / 1.00 sec = 25935.3750 Mbps 0 retrans
31443.2500 MB / 10.01 sec = 26360.8014 Mbps 43 %TX 89 %RX 0 retrans 0.21 msRTT
Also better; but far from line rate!
Note
- I tried changing the core affinity of the iperf3/nuttcp processes, but didn’t observe any performance improvement;
- I set the cpufreq governor to performance (see 40G/100G Tuning), but didn’t observe any performance improvement either:
# cpupower frequency-set -g performance
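The active governor can be double-checked with:
# cpupower frequency-info | grep -i governor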
weak_module?
Philip Papadopoulos suggested that the lackluster performance might be due to the fact that the driver had been installed as a weak module:
[root@pulpo-osd01 ~]# modinfo mlx4_core
filename: /lib/modules/3.10.0-693.2.2.el7.x86_64/weak-updates/mlnx-en/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
version: 4.1-1.0.2
license: Dual BSD/GPL
description: Mellanox ConnectX HCA low-level driver
author: Roland Dreier
rhelversion: 7.4
The weak-updates directory in the driver path is the smoking gun. So let’s build the Mellanox EN Driver against the current kernel.
Install the prerequisite packages:
# yum -y groupinstall "Development Tools"
# yum -y install createrepo
# yum -y install kernel-devel
Uninstall Mellanox EN Driver:
# ./uninstall.sh
Compile and install Mellanox EN Driver with kernel support:
# ./install --add-kernel-support
Load the new driver:
# /etc/init.d/mlnx-en.d restart
# modinfo mlx4_core
filename: /lib/modules/3.10.0-693.2.2.el7.x86_64/extra/mlnx-en/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
version: 4.1-1.0.2
license: Dual BSD/GPL
description: Mellanox ConnectX HCA low-level driver
author: Roland Dreier
rhelversion: 7.4
The driver was successfully installed. However, it doesn’t appear to improve performance either! I’ll look further into this when I get a chance.