Network Tuning for Pulpos

October 10, 2017 | Linux Network

In this post, we document the network tuning we’ve done for the nodes in the Pulpos cluster.

Default Settings

Default settings on CentOS 7 hosts with 32GB RAM (such as pulpo-admin):

net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.core.wmem_default = 212992
net.core.wmem_max = 212992
net.core.netdev_max_backlog = 1000
net.ipv4.tcp_rmem = 4096        87380   6291456
net.ipv4.tcp_wmem = 4096        16384   4194304
net.ipv4.tcp_mem = 766212       1021616 1532424
net.ipv4.udp_mem = 768948       1025267 1537896
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_wmem_min = 4096
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_mtu_probing = 0

Default settings on CentOS 7 hosts with 64GB RAM (such as pulpo-mds01):

net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.core.wmem_default = 212992
net.core.wmem_max = 212992
net.core.netdev_max_backlog = 1000
net.ipv4.tcp_rmem = 4096        87380   6291456
net.ipv4.tcp_wmem = 4096        16384   4194304
net.ipv4.tcp_mem = 1540329      2053772 3080658
net.ipv4.udp_mem = 1543059      2057414 3086118
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_wmem_min = 4096
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_mtu_probing = 0
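
These defaults can be read back on any node with sysctl; something along these lines was used to collect the values above:

[root@pulpo-admin ~]# sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default \
    net.core.wmem_max net.core.netdev_max_backlog net.ipv4.tcp_rmem net.ipv4.tcp_wmem \
    net.ipv4.tcp_mem net.ipv4.udp_mem net.ipv4.udp_rmem_min net.ipv4.udp_wmem_min \
    net.ipv4.tcp_congestion_control net.ipv4.tcp_mtu_probing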

And the default Transmit Queue Length (txqueuelen) is 1000:

[root@pulpo-admin ~]# ifconfig ens1f0 | grep txqueuelen
        ether 90:e2:ba:6f:83:58  txqueuelen 1000  (Ethernet)
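
We keep the default of 1000 for now. If we ever wanted to raise it (e.g. to 10000, a value sometimes suggested for high-speed hosts), it could be changed on the fly with ip link, though such a change would not persist across reboots:

[root@pulpo-admin ~]# ip link set dev ens1f0 txqueuelen 10000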

Install network test tools

iperf3 is available in the base repo & nuttcp in EPEL:

[root@pulpo-admin ~]# yum -y install iperf3 nuttcp
[root@pulpo-admin ~]# tentakel yum -y install iperf3 nuttcp

iperf3

ESnet maintains an excellent guide on iperf3. The default port for iperf3 is 5201.
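
On venadi, the open 5001-5900/tcp range shown below already covers 5201; on a host where that range is not open, something like the following would do (add --permanent and reload to make it persistent):

# firewall-cmd --zone=public --add-port=5201/tcp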

1) First we’ll perform a memory to memory test between two 10G hosts: venadi (a FIONA box) and pulpo-admin. On venadi, the following ports are already open:

[root@venadi ~]# firewall-cmd --list-all
public (default, active)
  interfaces: enp5s0
  sources:
  services: dhcpv6-client http ssh
  ports: 6001-6200/tcp 50000-51000/udp 4823/tcp 5001-5900/tcp 6001-6200/udp 2223/tcp 5001-5900/udp 50000-51000/tcp
  masquerade: no
  forward-ports:
  icmp-blocks:
  rich rules:

Start iperf3 server on venadi:

[root@venadi ~]# iperf3 -s

Start iperf3 client on pulpo-admin:

[root@pulpo-admin ~]# iperf3 -c venadi.ucsc.edu -i 1 -t 10 -V
iperf 3.1.7
Linux pulpo-admin.ucsc.edu 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64
Control connection MSS 8948
Time: Tue, 10 Oct 2017 21:05:46 GMT
Connecting to host venadi.ucsc.edu, port 5201
      Cookie: pulpo-admin.ucsc.edu.1507669546.2008
      TCP MSS: 8948 (default)
[  4] local 128.114.86.2 port 51346 connected to 128.114.109.74 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.15 GBytes  9.91 Gbits/sec    0    830 KBytes
[  4]   1.00-2.00   sec  1.15 GBytes  9.89 Gbits/sec    0    830 KBytes
[  4]   2.00-3.00   sec  1.15 GBytes  9.90 Gbits/sec    0    830 KBytes
[  4]   3.00-4.00   sec  1.15 GBytes  9.91 Gbits/sec    0    830 KBytes
[  4]   4.00-5.00   sec  1.15 GBytes  9.90 Gbits/sec    0    830 KBytes
[  4]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0    830 KBytes
[  4]   6.00-7.00   sec  1.15 GBytes  9.90 Gbits/sec    0    830 KBytes
[  4]   7.00-8.00   sec  1.15 GBytes  9.90 Gbits/sec    0    830 KBytes
[  4]   8.00-9.00   sec  1.14 GBytes  9.82 Gbits/sec    0    830 KBytes
[  4]   9.00-10.00  sec  1.15 GBytes  9.90 Gbits/sec    0    961 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  11.5 GBytes  9.89 Gbits/sec    0             sender
[  4]   0.00-10.00  sec  11.5 GBytes  9.89 Gbits/sec                  receiver
CPU Utilization: local/sender 26.9% (0.3%u/26.7%s), remote/receiver 10.5% (0.5%u/10.0%s)
snd_tcp_congestion cubic

iperf Done.

2) Next we’ll perform a memory to memory test between two 40G hosts: pulpo-osd01 & pulpo-osd02.

Start iperf3 server on pulpo-osd01:

[root@pulpo-osd01 ~]# iperf3 -s

Start iperf3 client on pulpo-osd02:

[root@pulpo-osd02 ~]# iperf3 -c pulpo-osd01.cluster -i 1 -t 10 -V
iperf 3.1.7
Linux pulpo-osd02.ucsc.edu 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64
Control connection MSS 8948
Time: Tue, 10 Oct 2017 21:47:44 GMT
Connecting to host pulpo-osd01.cluster, port 5201
      Cookie: pulpo-osd02.ucsc.edu.1507672064.6398
      TCP MSS: 8948 (default)
[  4] local 192.168.40.6 port 57414 connected to 192.168.40.5 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  2.62 GBytes  22.5 Gbits/sec    0    717 KBytes
[  4]   1.00-2.00   sec  2.78 GBytes  23.9 Gbits/sec    0    717 KBytes
[  4]   2.00-3.00   sec  2.93 GBytes  25.2 Gbits/sec    0    717 KBytes
[  4]   3.00-4.00   sec  2.94 GBytes  25.2 Gbits/sec   20    507 KBytes
[  4]   4.00-5.00   sec  2.87 GBytes  24.7 Gbits/sec    0    507 KBytes
[  4]   5.00-6.00   sec  2.90 GBytes  24.9 Gbits/sec    0    533 KBytes
[  4]   6.00-7.00   sec  2.90 GBytes  24.9 Gbits/sec    0    533 KBytes
[  4]   7.00-8.00   sec  2.93 GBytes  25.2 Gbits/sec    0    568 KBytes
[  4]   8.00-9.00   sec  2.85 GBytes  24.5 Gbits/sec    2    428 KBytes
[  4]   9.00-10.00  sec  2.84 GBytes  24.4 Gbits/sec    0    472 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  28.6 GBytes  24.5 Gbits/sec   22             sender
[  4]   0.00-10.00  sec  28.6 GBytes  24.5 Gbits/sec                  receiver
CPU Utilization: local/sender 34.3% (0.4%u/34.0%s), remote/receiver 15.2% (0.1%u/15.2%s)
snd_tcp_congestion cubic
rcv_tcp_congestion cubic

iperf Done.

Before network tuning, we only got about 25 Gbps between 40G hosts!
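
One quick follow-up check (output not captured here) is whether several parallel streams get closer to line rate than a single stream, using iperf3’s -P option:

[root@pulpo-osd02 ~]# iperf3 -c pulpo-osd01.cluster -i 1 -t 10 -P 4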

nuttcp

ESnet also maintains an excellent guide on nuttcp.

By default, the nuttcp server listens for commands on port 5000, and the actual nuttcp data transfers take place on port 5001.
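
Both ports can be overridden if need be: -P selects the control port and -p the data port (the port numbers below are just examples):

[root@venadi ~]# nuttcp -S -P5100
[root@pulpo-admin ~]# nuttcp -i1 -P5100 -p5101 -r venadi.ucsc.edu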

1) First we’ll perform a memory to memory test between two 10G hosts: venadi (a FIONA box) and pulpo-admin. Open TCP port 5000 on the server (venadi):

[root@venadi ~]# firewall-cmd --zone=public --add-port=5000/tcp
success

Open TCP ports 5000-5100 on the client (pulpo-admin):

[root@pulpo-admin ~]#  firewall-cmd --zone=public --add-port=5000-5100/tcp
success

Start nuttcp server on venadi:

[root@venadi ~]# nuttcp -S

Start nuttcp client on pulpo-admin:

[root@pulpo-admin ~]# nuttcp -i1 -r venadi.ucsc.edu
 1178.5000 MB /   1.00 sec = 9885.4901 Mbps    17 retrans
 1179.9375 MB /   1.00 sec = 9898.3697 Mbps    27 retrans
 1180.0000 MB /   1.00 sec = 9898.5376 Mbps    34 retrans
 1180.0000 MB /   1.00 sec = 9898.4585 Mbps    29 retrans
 1179.4375 MB /   1.00 sec = 9893.8883 Mbps    40 retrans
 1179.6250 MB /   1.00 sec = 9895.1248 Mbps    30 retrans
 1179.4375 MB /   1.00 sec = 9894.0763 Mbps    38 retrans
 1179.8125 MB /   1.00 sec = 9896.8658 Mbps    24 retrans
 1179.5000 MB /   1.00 sec = 9894.6006 Mbps    33 retrans
 1179.3750 MB /   1.00 sec = 9893.3245 Mbps    29 retrans

11801.2792 MB /  10.01 sec = 9894.3578 Mbps 11 %TX 49 %RX 301 retrans 0.18 msRTT

Almost at line rate despite the retrans!

Kill the server:

[root@venadi ~]# pkill nuttcp

Remove the firewall rules on both the server and the client:

[root@venadi ~]# firewall-cmd --zone=public --remove-port=5000/tcp

[root@pulpo-admin ~]# firewall-cmd --zone=public --remove-port=5000-5100/tcp

2) Next we’ll perform a memory to memory test between two 40G hosts: pulpo-osd01 & pulpo-osd02.

Start nuttcp server (one shot) on pulpo-osd01:

[root@pulpo-osd01 ~]# nuttcp -1

Start nuttcp client on pulpo-osd02:

[root@pulpo-osd02 ~]# nuttcp -i1 pulpo-osd01.cluster
 2729.2500 MB /   1.00 sec = 22892.2734 Mbps     0 retrans
 2761.8125 MB /   1.00 sec = 23169.7550 Mbps     0 retrans
 2797.0000 MB /   1.00 sec = 23463.2181 Mbps     0 retrans
 2765.2500 MB /   1.00 sec = 23196.5287 Mbps     0 retrans
 2758.5625 MB /   1.00 sec = 23140.5689 Mbps     0 retrans
 2744.1875 MB /   1.00 sec = 23019.5449 Mbps     0 retrans
 2824.8125 MB /   1.00 sec = 23696.5291 Mbps     0 retrans
 2870.1875 MB /   1.00 sec = 24077.1427 Mbps     0 retrans
 2919.0000 MB /   1.00 sec = 24486.1264 Mbps     0 retrans
 2784.3125 MB /   1.00 sec = 23355.1515 Mbps     0 retrans

27960.6875 MB /  10.00 sec = 23452.7184 Mbps 42 %TX 99 %RX 0 retrans 0.21 msRTT

Again, before network tuning we only got about 23.5 Gbps between the 40G hosts!

sysctl

FIONA settings

FIONA boxes use the following kernel parameters (/etc/sysctl.d/prp.conf) on 40G hosts:

# buffers up to 512MB
net.core.rmem_max=536870912
net.core.wmem_max=536870912
# increase Linux autotuning TCP buffer limit to 256MB
net.ipv4.tcp_rmem=4096 87380 268435456
net.ipv4.tcp_wmem=4096 65536 268435456
net.core.netdev_max_backlog=250000
net.ipv4.tcp_congestion_control=cubic
net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_no_metrics_save = 1

ESnet recommendations

ESnet maintains an excellent guide on how to tune Linux, Mac OS X, and FreeBSD hosts connected at speeds of 1 Gbps or higher for maximum I/O performance over wide-area network transfers.

Pulpos settings

We use the following kernel parameters (/etc/sysctl.d/10g.conf) on 10G hosts, which are based upon ESnet’s recommendations for a host with a 10G NIC optimized for network paths up to 200ms RTT:

# http://fasterdata.es.net/host-tuning/linux/
# allow testing with buffers up to 128MB
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
# increase Linux autotuning TCP buffer limit to 64MB
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# recommended default congestion control is htcp
net.ipv4.tcp_congestion_control=htcp
# recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing=1
# recommended for CentOS7/Debian8 hosts
net.core.default_qdisc = fq

We use the following kernel parameters (/etc/sysctl.d/40g.conf) on 40G hosts, which are based upon ESnet’s recommendations for a host with a Mellanox ConnectX-3 NIC:

# http://fasterdata.es.net/host-tuning/nic-tuning/mellanox-connectx-3/
# allow testing with buffers up to 256MB
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
# increase Linux autotuning TCP buffer limit to 128MB
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
# increase the length of the processor input queue
net.core.netdev_max_backlog = 250000
# recommended default congestion control is htcp
net.ipv4.tcp_congestion_control=htcp
# recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing=1
# recommended for CentOS7/Debian8 hosts
net.core.default_qdisc = fq

To load the configuration file, execute:

# sysctl -p /etc/sysctl.d/40g.conf
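
(The same goes for /etc/sysctl.d/10g.conf on the 10G hosts.) To confirm the settings took effect, the keys can simply be read back:

# sysctl net.core.rmem_max net.ipv4.tcp_rmem net.ipv4.tcp_congestion_control net.core.default_qdisc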

Mellanox ConnectX-3

Each OSD node has a single-port Mellanox ConnectX-3 Pro 10/40/56GbE Adapter, showing up as ens2 in CentOS 7.

# lspci | grep -i Mellanox
02:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]

ESnet recommends using the latest device driver from Mellanox rather than the one that comes with CentOS 7.

Install pre-requisites:

# tentakel -g osd yum -y install lsof libxml2-python

Install Mellanox EN Driver for Linux on the OSD nodes:

# tar xvfz mlnx-en-4.1-1.0.2.0-rhel7.4-x86_64.tgz
# cd mlnx-en-4.1-1.0.2.0-rhel7.4-x86_64
# ./install

Contrary to what is said in the ESnet article on the Mellanox ConnectX-3, the driver installation script didn’t add anything to /etc/sysctl.conf, nor to /etc/sysctl.d/.

Furthermore, modify /etc/modprobe.d/mlx4.conf & /etc/rc.d/rc.local according to the Mellanox guide on ConnectX-3/Pro Tuning for Linux:

# cat /sys/class/net/ens2/device/numa_node
0
# set_irq_affinity_bynode.sh 0 ens2
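
The changes boil down to something like the following. This is only a sketch: the module option shown (log_num_mgm_entry_size) and the IRQ-affinity line should be taken from the Mellanox guide itself rather than copied from here:

# cat /etc/modprobe.d/mlx4.conf
options mlx4_core log_num_mgm_entry_size=-7

# tail -1 /etc/rc.d/rc.local
set_irq_affinity_bynode.sh 0 ens2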

Speed tests

1) iperf3

Start server:

[root@pulpo-osd01 ~]# iperf3 -s

Start client:

[root@pulpo-osd02 ~]# iperf3 -c pulpo-osd01.cluster -i 1 -t 10 -V
iperf 3.1.7
Linux pulpo-osd02.ucsc.edu 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64
Control connection MSS 8948
Time: Wed, 11 Oct 2017 22:57:37 GMT
Connecting to host pulpo-osd01.cluster, port 5201
      Cookie: pulpo-osd02.ucsc.edu.1507762657.0888
      TCP MSS: 8948 (default)
[  4] local 192.168.40.6 port 48812 connected to 192.168.40.5 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  3.13 GBytes  26.9 Gbits/sec    0   1.17 MBytes
[  4]   1.00-2.00   sec  3.25 GBytes  27.9 Gbits/sec    0   1.31 MBytes
[  4]   2.00-3.00   sec  3.24 GBytes  27.8 Gbits/sec    0   1.31 MBytes
[  4]   3.00-4.00   sec  3.24 GBytes  27.8 Gbits/sec    0   1.32 MBytes
[  4]   4.00-5.00   sec  3.25 GBytes  27.9 Gbits/sec    0   1.35 MBytes
[  4]   5.00-6.00   sec  3.27 GBytes  28.1 Gbits/sec    0   1.41 MBytes
[  4]   6.00-7.00   sec  3.23 GBytes  27.7 Gbits/sec    0   1.41 MBytes
[  4]   7.00-8.00   sec  3.19 GBytes  27.4 Gbits/sec    0   1.43 MBytes
[  4]   8.00-9.00   sec  3.25 GBytes  27.9 Gbits/sec    0   1.43 MBytes
[  4]   9.00-10.00  sec  3.24 GBytes  27.9 Gbits/sec    0   1.43 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  32.3 GBytes  27.7 Gbits/sec    0             sender
[  4]   0.00-10.00  sec  32.3 GBytes  27.7 Gbits/sec                  receiver
CPU Utilization: local/sender 36.3% (0.5%u/35.9%s), remote/receiver 11.8% (0.2%u/11.6%s)
snd_tcp_congestion htcp
rcv_tcp_congestion htcp

iperf Done.

Better, but still far from line rate!

2) nuttcp

Start server (one shot):

[root@pulpo-osd01 ~]# nuttcp -1

Start client:

[root@pulpo-osd02 ~]# nuttcp -i1 pulpo-osd01.cluster
 3209.6875 MB /   1.00 sec = 26924.4871 Mbps     0 retrans
 3305.1250 MB /   1.00 sec = 27724.8712 Mbps     0 retrans
 3368.3750 MB /   1.00 sec = 28256.3448 Mbps     0 retrans
 3052.5625 MB /   1.00 sec = 25607.1599 Mbps     0 retrans
 3052.6875 MB /   1.00 sec = 25607.6195 Mbps     0 retrans
 3053.5000 MB /   1.00 sec = 25614.4608 Mbps     0 retrans
 3049.8750 MB /   1.00 sec = 25584.6663 Mbps     0 retrans
 3108.3750 MB /   1.00 sec = 26074.8612 Mbps     0 retrans
 3125.7500 MB /   1.00 sec = 26220.5341 Mbps     0 retrans
 3091.7500 MB /   1.00 sec = 25935.3750 Mbps     0 retrans

31443.2500 MB /  10.01 sec = 26360.8014 Mbps 43 %TX 89 %RX 0 retrans 0.21 msRTT

Also better, but still far from line rate!

Note

  1. I tried changing the core affinity of the iperf3/nuttcp processes, but didn’t observe improved performance;
  2. I set the cpufreq governor to performance (see 40G/100G Tuning), but didn’t observe improved performance either:
# cpupower frequency-set -g performance

weak_module?

Philip Papadopoulos suggested the lackluster performance might be because the driver had been installed as a weak module:

[root@pulpo-osd01 ~]# modinfo mlx4_core
filename:       /lib/modules/3.10.0-693.2.2.el7.x86_64/weak-updates/mlnx-en/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
version:        4.1-1.0.2
license:        Dual BSD/GPL
description:    Mellanox ConnectX HCA low-level driver
author:         Roland Dreier
rhelversion:    7.4

The weak-updates directory in the driver path is the smoking gun. So let’s build the Mellanox EN Driver against the current kernel.

Install the prerequisite packages:

# yum -y groupinstall "Development Tools"
# yum -y install createrepo
# yum -y install kernel-devel

Uninstall Mellanox EN Driver:

# ./uninstall.sh

Compile and install Mellanox EN Driver with kernel support:

# ./install --add-kernel-support

Load the new driver:

# /etc/init.d/mlnx-en.d restart

# modinfo mlx4_core
filename:       /lib/modules/3.10.0-693.2.2.el7.x86_64/extra/mlnx-en/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
version:        4.1-1.0.2
license:        Dual BSD/GPL
description:    Mellanox ConnectX HCA low-level driver
author:         Roland Dreier
rhelversion:    7.4
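
As an extra check, ethtool -i reports which driver (and version) the running interface is actually bound to:

# ethtool -i ens2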

The driver was successfully installed. However, it doesn’t appear to improve performance either! I’ll look further into this when I get a chance.