VyOS - 100Gbit BGP - Part 1
I have been using VyOS as a routing platform for various networks and projects for a long time, with great success. At speeds between 1 and 40 Gbit/s I have always been able to rely on VyOS. This is, however, the first time those speeds are insufficient, and we have to scale up to multiple 100 Gbit/s routers.
To achieve those speeds, I have rented two machines at Hetzner in the first data center, with the following specifications:
- AMD EPYC™ 7502P 32-Core
- 128GB DDR4 3200 Ram
- 2x 1TB Nvme
- 1x Broadcom BCM57508 100/200Gbit adapter
- 1x Intel 10Gbps adapter (OOB)
These two servers are connected to two separate Juniper switches with 2x 100Gbps in between, so more than sufficient capacity. The other data center is not yet connected at the time of writing, so I have started installing and testing with these two machines.
Initial installation
Not exactly what we hoped for
After installing the 1.5 rolling release, I immediately ran a benchmark with iperf, without any adjustments or tweaks. The outcome was not quite what we had hoped for in terms of speed.
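For reference, the server side in all of these tests is just a plain iperf server on the other machine (iperf 2, which defaults to TCP port 5001):
# iperf -s
The first run from the client side: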
# iperf -c 192.168.1.200
------------------------------------------------------------
Client connecting to 192.168.1.200, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.100 port 46940 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=14/1448/317)
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-10.0566 sec 3.51 GBytes 3.00 Gbits/sec
Based on the documentation, I adjusted the system profile to put the CPU governor in performance mode and added some minor sysctl adjustments. The result shows that these options had almost no effect at all.
# set system option performance throughput
[edit]
# commit
# cpupower frequency-info
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: Cannot determine or is not supported.
hardware limits: 1.50 GHz - 2.50 GHz
available frequency steps: 2.50 GHz, 2.20 GHz, 1.50 GHz
available cpufreq governors: conservative ondemand performance schedutil
current policy: frequency should be within 1.50 GHz and 2.50 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency: 2.50 GHz (asserted by call to hardware)
boost state support:
Supported: yes
Active: yes
Boost States: 0
Total States: 3
Pstate-P0: 2500MHz
Pstate-P1: 2200MHz
Pstate-P2: 1500MHz
[edit]
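As for the sysctl adjustments: these were just larger network buffers, set through the system sysctl node. The parameters and values below are illustrative rather than the exact ones used:
# set system sysctl parameter net.core.rmem_max value 67108864
# set system sysctl parameter net.core.wmem_max value 67108864
# set system sysctl parameter net.core.netdev_max_backlog value 8192
# commit
The benchmark barely moved: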
# iperf -c 192.168.1.100
------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.200 port 46720 connected with 192.168.1.100 port 5001 (icwnd/mss/irtt=14/1448/328)
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-10.0576 sec 3.72 GBytes 3.18 Gbits/sec
[edit]
The next option was to increase the ring buffers on the network card. This article gives a clear explanation of what ring buffers are and how they can impact performance.
# ethtool -g eth2
Ring parameters for eth2:
Pre-set maximums:
RX: 8191
RX Mini: n/a
RX Jumbo: n/a
TX: 2047
TX push buff len: n/a
Current hardware settings:
RX: 511
RX Mini: n/a
RX Jumbo: n/a
TX: 511
RX Buf Len: n/a
CQE Size: n/a
TX Push: off
RX Push: off
TX push buff len: n/a
TCP data split: off
[edit]
# set interfaces ethernet eth2 ring-buffer rx 8191
[edit]
# set interfaces ethernet eth2 ring-buffer tx 2047
[edit]
# commit
[edit]
# iperf -c 192.168.1.100
------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.200 port 54470 connected with 192.168.1.100 port 5001 (icwnd/mss/irtt=14/1448/334)
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-10.0349 sec 3.76 GBytes 3.22 Gbits/sec
This also seems to have little influence; speeds are still only around 3 Gbit/s. Next, let's set the MTU to 9000 (jumbo frames) and see if that helps.
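On VyOS the MTU is a single setting per interface (eth2 here, the same interface as in the ring-buffer example above; the other side needs the same change):
# set interfaces ethernet eth2 mtu 9000
[edit]
# commit
With jumbo frames enabled on both ends, the same test looks a lot better: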
# iperf -c 192.168.1.100
------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.200 port 34698 connected with 192.168.1.100 port 5001 (icwnd/mss/irtt=87/8948/116)
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-10.0170 sec 15.7 GBytes 13.4 Gbits/sec
Clearly that helped. I then enabled pause frames on the interface, but that seemed to have no effect: the speed didn't change compared to just raising the MTU. What did help, however, was disabling Simultaneous Multithreading (SMT):
# echo off > /sys/devices/system/cpu/smt/control
# iperf -c 192.168.1.100
------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.200 port 40052 connected with 192.168.1.100 port 5001 (icwnd/mss/irtt=87/8948/113)
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-10.0072 sec 18.2 GBytes 15.6 Gbits/sec
But still no 100 Gbit/s, and no 100% usage on the server-side core. Just to be sure, I checked whether the card is in the correct PCI slot and is capable of handling these speeds:
# lspci | grep Ethernet
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 12)
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 12)
c1:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Network Connection (rev 01)
c5:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
# lspci -s 01:00.0 -vvv | grep PCIe
Subsystem: Broadcom Inc. and subsidiaries NetXtreme-E Dual-port 100G QSFP56 Ethernet PCIe4.0 x16 Adapter (BCM957508-P2100G)
Product Name: Broadcom NetXtreme-E Dual-port 100Gb Ethernet PCIe Adapter
# lspci -s 01:00.0 -vvv | grep Speed
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
LnkSta: Speed 16GT/s, Width x16
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
This seems correct: a PCIe 4.0 x16 link can comfortably handle 100 Gbit/s of throughput, which is confirmed by the negotiated link speed and width (16 GT/s, x16). In some cases a limit is set in 'MaxReadReq'; in our case it is already at the maximum (4096 bytes).
# lspci -s 01:00.0 -vvv | grep MaxReadReq
MaxPayload 512 bytes, MaxReadReq 4096 bytes
The command to set it to the max is:
setpci -s 01:00.0 68.w=5936
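Note that 0x68 is simply where the PCIe Device Control register (which contains the MaxReadReq field) happens to sit on this card. If your pciutils supports named capabilities, the same register can presumably be addressed more portably as:
setpci -s 01:00.0 CAP_EXP+8.w=5936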
So we are still stuck at around 15 Gbit/s on a single TCP stream, while not even using 100% of a core. Changing the txqueuelen also brought no improvement there. What did cause some improvement was enabling large receive offload (LRO) on the NICs:
# ethtool -K eth1 lro on
[edit]
# ethtool -k eth1|grep larg
large-receive-offload: on
# iperf -c 192.168.1.100
------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.200 port 54574 connected with 192.168.1.100 port 5001 (icwnd/mss/irtt=87/8948/290)
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-10.0095 sec 22.0 GBytes 18.9 Gbits/sec
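Since settings applied directly with ethtool are not part of the VyOS configuration (and may be reverted by a later commit or reboot), it makes sense to also set this through the interface's offload node, assuming the 1.5 rolling release exposes LRO there:
# set interfaces ethernet eth1 offload lro
[edit]
# commit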
Playing around with the other offload settings, I found that none of them really mattered except for TCP segmentation offload (TSO); enabling that greatly improved the speed.
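TSO was enabled the same way, at runtime with ethtool and, presumably, persistently via the offload node (again on eth1):
# ethtool -K eth1 tso on
# set interfaces ethernet eth1 offload tso
[edit]
# commit
With TSO on, the single-stream result: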
# iperf -c 192.168.1.200
------------------------------------------------------------
Client connecting to 192.168.1.200, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.100 port 42850 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/202)
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-10.1071 sec 29.8 GBytes 25.4 Gbits/sec
Testing with 4 threads also seems pretty reasonable:
# iperf -c 192.168.1.200 -P 4
------------------------------------------------------------
Client connecting to 192.168.1.200, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 2] local 192.168.1.100 port 39640 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/169)
[ 3] local 192.168.1.100 port 39642 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/365)
[ 1] local 192.168.1.100 port 39670 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/176)
[ 4] local 192.168.1.100 port 39638 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/396)
[ ID] Interval Transfer Bandwidth
[ 4] 0.0000-10.0088 sec 25.4 GBytes 21.8 Gbits/sec
[ 3] 0.0000-10.0088 sec 22.8 GBytes 19.6 Gbits/sec
[ 2] 0.0000-10.0086 sec 22.2 GBytes 19.0 Gbits/sec
[ 1] 0.0000-10.0087 sec 23.6 GBytes 20.3 Gbits/sec
[SUM] 0.0000-10.0004 sec 94.0 GBytes 80.8 Gbits/sec
And with 6 threads we max the card out:
# iperf -c 192.168.1.200 -P 6
------------------------------------------------------------
Client connecting to 192.168.1.200, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.100 port 39134 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/130)
[ 1] local 192.168.1.100 port 39144 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/142)
[ 4] local 192.168.1.100 port 39138 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/127)
[ 2] local 192.168.1.100 port 39146 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/135)
[ 5] local 192.168.1.100 port 39164 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/91)
[ 6] local 192.168.1.100 port 39172 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/69)
[ ID] Interval Transfer Bandwidth
[ 4] 0.0000-10.0057 sec 18.5 GBytes 15.9 Gbits/sec
[ 6] 0.0000-10.0056 sec 17.5 GBytes 15.0 Gbits/sec
[ 5] 0.0000-10.0054 sec 18.9 GBytes 16.2 Gbits/sec
[ 2] 0.0000-10.0055 sec 19.0 GBytes 16.3 Gbits/sec
[ 3] 0.0000-10.0055 sec 19.1 GBytes 16.4 Gbits/sec
[ 1] 0.0000-10.0055 sec 18.3 GBytes 15.7 Gbits/sec
[SUM] 0.0000-10.0002 sec 111 GBytes 95.6 Gbits/sec
So far I'm pretty impressed. In the next part I will configure BGP and test not only between these two hosts, but also between clients connected to them, to see what speeds we get then.