VyOS - 100Gbit BGP - Part 1

I have been using VyOS as a routing platform for various networks and projects for a long time, with great success. At speeds between 1 and 40 Gbit/s I have been able to rely on VyOS to date, but this is the first time those speeds are insufficient and we have to scale up to multiple 100 Gbit/s routers.

To achieve those speeds, I have two machines rented at Hetzner in the first data center, with the following specifications:

  • AMD EPYC™ 7502P 32-Core
  • 128GB DDR4 3200 RAM
  • 2x 1TB NVMe
  • 1x Broadcom BCM57508 100/200Gbit adapter
  • 1x Intel 10Gbps adapter (OOB)

These two servers are connected to two separate Juniper switches with 2x 100Gbps in between, so there is more than sufficient capacity. The other data center is not yet connected at the time of writing, so I have started installing and testing with these two machines.

Initial installation

Not exactly what we hoped for

After installing the 1.5 rolling release, I immediately ran a benchmark with iPerf, without any adjustments or tweaks. The outcome was not quite what we had hoped for in terms of speed.

# iperf -c 192.168.1.200
------------------------------------------------------------
Client connecting to 192.168.1.200, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 192.168.1.100 port 46940 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=14/1448/317)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0566 sec  3.51 GBytes  3.00 Gbits/sec

Based on the documentation, I set the system profile to put the CPU governor in performance mode and apply some minor sysctl adjustments. The result shows that these options had almost no effect at all.

# set system option performance throughput 
[edit]
# commit
# cpupower frequency-info
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1.50 GHz - 2.50 GHz
  available frequency steps:  2.50 GHz, 2.20 GHz, 1.50 GHz
  available cpufreq governors: conservative ondemand performance schedutil
  current policy: frequency should be within 1.50 GHz and 2.50 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: 2.50 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes
    Boost States: 0
    Total States: 3
    Pstate-P0:  2500MHz
    Pstate-P1:  2200MHz
    Pstate-P2:  1500MHz
[edit]
# iperf -c 192.168.1.100
------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 192.168.1.200 port 46720 connected with 192.168.1.100 port 5001 (icwnd/mss/irtt=14/1448/328)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0576 sec  3.72 GBytes  3.18 Gbits/sec
[edit]
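
As an aside, the sysctl adjustments mentioned above usually mean the kernel's network buffer and backlog limits. Done by hand they would look roughly like the sketch below; the values are purely illustrative and not the exact ones applied here.

# sysctl -w net.core.rmem_max=67108864
# sysctl -w net.core.wmem_max=67108864
# sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
# sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"
# sysctl -w net.core.netdev_max_backlog=30000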

The next option was to increase the ring buffers on the network card. This article gives a clear explanation of what ring buffers are and how they can impact performance.

# ethtool -g eth2
Ring parameters for eth2:
Pre-set maximums:
RX:			8191
RX Mini:		n/a
RX Jumbo:		n/a
TX:			2047
TX push buff len:	n/a
Current hardware settings:
RX:			511
RX Mini:		n/a
RX Jumbo:		n/a
TX:			511
RX Buf Len:		n/a
CQE Size:		n/a
TX Push:		off
RX Push:		off
TX push buff len:	n/a
TCP data split:		off
[edit]
# set interfaces ethernet eth2 ring-buffer rx 8191
[edit]
# set interfaces ethernet eth2 ring-buffer tx 2047
[edit]
# commit
[edit]
# iperf -c 192.168.1.100
------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 192.168.1.200 port 54470 connected with 192.168.1.100 port 5001 (icwnd/mss/irtt=14/1448/334)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0349 sec  3.76 GBytes  3.22 Gbits/sec

This also seems to have little influence; speeds are still only around 3 Gbit/s. Let's set the MTU to 9000 (Jumbo Frames) and see if that helps.
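
The MTU change itself is not in the capture; on VyOS it is a single option per interface, committed on both machines. Assuming eth2 is the 100G interface as above:

# set interfaces ethernet eth2 mtu 9000
# commit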

# iperf -c 192.168.1.100
------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 192.168.1.200 port 34698 connected with 192.168.1.100 port 5001 (icwnd/mss/irtt=87/8948/116)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0170 sec  15.7 GBytes  13.4 Gbits/sec

Clearly that helped. I then enabled pause frames on the interface, but that seemed to have no effect; the speed didn't change compared to the jumbo-frame result. What did help, however, was disabling Simultaneous Multithreading (SMT):

# echo off > /sys/devices/system/cpu/smt/control

# iperf -c 192.168.1.100
------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 192.168.1.200 port 40052 connected with 192.168.1.100 port 5001 (icwnd/mss/irtt=87/8948/113)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0072 sec  18.2 GBytes  15.6 Gbits/sec
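
For completeness: the pause-frame experiment mentioned above is not in the capture either. One way to toggle it is through ethtool's pause parameters, roughly as below; the interface name and flags are assumptions, since the original invocation was not recorded.

# ethtool -A eth2 rx on tx on
# ethtool -a eth2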

But still no 100 Gbit/s, and the core on the server side is not at 100% usage. Just to be sure, I checked whether the card is in the correct PCI slot and capable of handling these speeds:

# lspci | grep Ethernet
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 12)
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 12)
c1:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Network Connection (rev 01)
c5:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
# lspci -s 01:00.0 -vvv | grep PCIe
	Subsystem: Broadcom Inc. and subsidiaries NetXtreme-E Dual-port 100G QSFP56 Ethernet PCIe4.0 x16 Adapter (BCM957508-P2100G)
		Product Name: Broadcom NetXtreme-E Dual-port 100Gb Ethernet PCIe Adapter
# lspci -s 01:00.0 -vvv | grep Speed
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM not supported
		LnkSta:	Speed 16GT/s, Width x16
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-

This looks correct: the link is running at PCIe 4.0 speed (LnkSta: 16 GT/s, x16 width), which works out to roughly 31 GB/s, or about 250 Gbit/s of raw bandwidth, so well above 100 Gbit/s. In some cases throughput is limited by the 'MaxReadReq' setting; in our case it is already at the maximum (4096 bytes).

# lspci -s 01:00.0 -vvv | grep MaxReadReq
			MaxPayload 512 bytes, MaxReadReq 4096 bytes

For reference, the command to set it to the maximum is:

setpci -s 01:00.0 68.w=5936

So we are still stuck at around 15 Gbit/s on a single TCP stream, even though the core is not running at 100% capacity. Changing the txqueuelen brought no improvement either. What did cause some improvement was enabling large receive offload (LRO) on the NICs:

# ethtool -K eth1 lro on
[edit]
# ethtool -k eth1|grep larg
large-receive-offload: on
# iperf -c 192.168.1.100
------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  1] local 192.168.1.200 port 54574 connected with 192.168.1.100 port 5001 (icwnd/mss/irtt=87/8948/290)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0095 sec  22.0 GBytes  18.9 Gbits/sec
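
A side note on persistence: offloads toggled with ethtool directly do not survive a reboot. VyOS exposes the same knobs in its config tree, so the durable equivalent should be something along these lines:

# set interfaces ethernet eth1 offload lro
# commit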

Playing around with the offload settings, I found that none of them really mattered except for TCP segmentation offload (TSO); enabling it improved the speed considerably.
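
The TSO toggle itself is not in the capture; with ethtool it would be the following (or persistently via the VyOS offload option, as noted above for LRO):

# ethtool -K eth1 tso on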

# iperf -c 192.168.1.200
------------------------------------------------------------
Client connecting to 192.168.1.200, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  1] local 192.168.1.100 port 42850 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/202)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.1071 sec  29.8 GBytes  25.4 Gbits/sec

Testing with 4 threads also looks pretty reasonable:

# iperf -c 192.168.1.200 -P 4
------------------------------------------------------------
Client connecting to 192.168.1.200, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  2] local 192.168.1.100 port 39640 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/169)
[  3] local 192.168.1.100 port 39642 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/365)
[  1] local 192.168.1.100 port 39670 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/176)
[  4] local 192.168.1.100 port 39638 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/396)
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.0088 sec  25.4 GBytes  21.8 Gbits/sec
[  3] 0.0000-10.0088 sec  22.8 GBytes  19.6 Gbits/sec
[  2] 0.0000-10.0086 sec  22.2 GBytes  19.0 Gbits/sec
[  1] 0.0000-10.0087 sec  23.6 GBytes  20.3 Gbits/sec
[SUM] 0.0000-10.0004 sec  94.0 GBytes  80.8 Gbits/sec

And with 6 threads we max the card out:

# iperf -c 192.168.1.200 -P 6
------------------------------------------------------------
Client connecting to 192.168.1.200, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.100 port 39134 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/130)
[  1] local 192.168.1.100 port 39144 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/142)
[  4] local 192.168.1.100 port 39138 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/127)
[  2] local 192.168.1.100 port 39146 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/135)
[  5] local 192.168.1.100 port 39164 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/91)
[  6] local 192.168.1.100 port 39172 connected with 192.168.1.200 port 5001 (icwnd/mss/irtt=87/8948/69)
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.0057 sec  18.5 GBytes  15.9 Gbits/sec
[  6] 0.0000-10.0056 sec  17.5 GBytes  15.0 Gbits/sec
[  5] 0.0000-10.0054 sec  18.9 GBytes  16.2 Gbits/sec
[  2] 0.0000-10.0055 sec  19.0 GBytes  16.3 Gbits/sec
[  3] 0.0000-10.0055 sec  19.1 GBytes  16.4 Gbits/sec
[  1] 0.0000-10.0055 sec  18.3 GBytes  15.7 Gbits/sec
[SUM] 0.0000-10.0002 sec   111 GBytes  95.6 Gbits/sec

So far I'm pretty impressed. In the next part I will configure BGP and test not only between these two hosts, but also between clients connected to them, to see what speeds we get.