David Ahern

When evaluating networking for a host the focus is typically on latency, throughput or packets per second (pps) to see the maximum load a system can handle for a given configuration. While those are important and often telling metrics, results for such benchmarks do not tell you the impact processing those packets has on the workloads running on that system.

This post looks at the cost of networking in terms of CPU cycles stolen from processes running in a host.

Packet Processing in Linux

Linux will process a fair number of packets in the context of whatever is running on the CPU at the moment the irq is handled. System accounting will attribute those CPU cycles to the process running at that moment even though the packet processing is not done on its behalf. For example, 'top' can show a process appearing to use 99+% cpu when in reality 60% of that time is spent processing packets, meaning the process is really only getting 40% of the CPU to make progress on its workload.

net_rx_action, the handler for network Rx traffic, usually runs really fast – like under 25 usecs[1],[2] – dealing with up to 64 packets per napi instance (NIC and RPS) at a time before deferring to another softirq cycle. softirq cycles can be back to back, up to 10 times or 2 msec (see __do_softirq), before taking a break. If the softirq vector still has more work to do after the maximum number of loops or time is reached, it defers further work to the ksoftirqd thread for that CPU. When that happens the system is a bit more transparent about the networking overhead in the sense that CPU usage can be monitored (though with the assumption that it is packet handling versus other softirqs).
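
When work does spill to ksoftirqd, standard process tools can show it. As a quick sketch (assuming the sysstat package for pidstat is installed, and using CPU 5 since that is the CPU used later in this post):

# per-second CPU usage of the ksoftirqd thread servicing CPU 5
pidstat -p $(pgrep -x 'ksoftirqd/5') 1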

One way to observe this sequence is with perf:

sudo perf record -a \
        -e irq:irq_handler_entry,irq:irq_handler_exit \
        -e irq:softirq_entry --filter="vec == 3" \
        -e irq:softirq_exit --filter="vec == 3"  \
        -e napi:napi_poll \
        -- sleep 1

sudo perf script

The output is something like:

swapper     0 [005] 176146.491879: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0
swapper     0 [005] 176146.491880:  irq:irq_handler_exit: irq=152 ret=handled
swapper     0 [005] 176146.491880:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491942:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 64 budget 64
swapper     0 [005] 176146.491943:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491943:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491971:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 27 budget 64
swapper     0 [005] 176146.491971:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.492200: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0

In this case the cpu is idle (hence swapper as the process). An irq fired for an Rx queue on CPU 5, softirq processing looped twice handling 64 packets and then 27 packets before exiting, and the next irq fired 229 usec later to start the sequence again.

The above was recorded on an idle system. In general, any task can be running on the CPU, in which case the same series of events plays out by interrupting that task, doing the irq/softirq dance, and having system accounting attribute the cycles to the interrupted process. Thus, processing packets is typically hidden from the usual CPU monitoring as it is done in the context of some random, victim process. So how do you view or quantify the time a process is interrupted handling packets? And how can you compare two different networking solutions to see which one is less disruptive to a workload?

With RSS, RPS, and flow steering, packet processing is usually distributed across cores, so the packet processing sequence described above is all per-CPU. As packet rates increase (think 100,000 pps and up), that load means thousands to tens of thousands of packets are processed per second per CPU. Processing that many packets will inevitably have an impact on the workloads running on those systems.
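
A crude way to see that per-CPU distribution is to compare two snapshots of the NET_RX counters in /proc/softirqs; the per-CPU deltas give a rough softirq servicing rate:

grep -E 'CPU|NET_RX' /proc/softirqs; sleep 1; grep NET_RX /proc/softirqs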

Let's take a look at one way to see this impact.

Undo the Distributed Processing

First, let's undo the distributed processing by disabling RPS and installing flow rules to force the processing of all packets for a specific MAC address on a single, known CPU. My system has 2 nics enslaved to a bond in an 802.3ad configuration with the networking load targeted at a single virtual machine running in the host.

RPS is disabled on the 2 nics using

for d in eth0 eth1; do
    find /sys/class/net/${d}/queues -name rps_cpus |
    while read f; do
            echo 0 | sudo tee ${f}
    done
done
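
As a quick check, every rps_cpus mask should now read as all zeroes:

grep . /sys/class/net/eth0/queues/rx-*/rps_cpus
grep . /sys/class/net/eth1/queues/rx-*/rps_cpus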

Next, add flow rules to push packets for the VM under test to a single CPU

DMAC=12:34:de:ad:ca:fe
sudo ethtool -N eth0 flow-type ether dst ${DMAC} action 2
sudo ethtool -N eth1 flow-type ether dst ${DMAC} action 2

Together, lack of RPS + flow rules ensure all packets destined to the VM are processed on the same CPU. You can use a command like ethq[3] to verify packets are directed to the expected queue and then map that queue to a CPU using /proc/interrupts. In my case queue 2 is handled on CPU 5.
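
For example (mlx5_comp2 and IRQ 152 come from the perf trace shown earlier; the interrupt naming is driver specific, so adjust for your NIC):

# show the installed flow rules
sudo ethtool -n eth0
sudo ethtool -n eth1

# map the Rx queue's interrupt to the CPU servicing it
grep mlx5_comp2 /proc/interrupts
cat /proc/irq/152/smp_affinity_list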

openssl speed

I could use perf or a bpf program to track softirq entry and exit for network Rx, but that gets complicated quickly, and the observation will definitely influence the results. A much simpler and more intuitive solution is to infer the networking overhead using a well known workload such as 'openssl speed' and look at how much CPU access it really gets versus what it is perceived to get (recognizing the squishiness of process accounting).

'openssl speed' is a nearly 100% userspace command and when pinned to a CPU will use all available cycles for that CPU for the duration of its tests. The command works by setting an alarm for a given interval (e.g., 10 seconds here for easy math), launching into its benchmark, and then using times() when the alarm fires to check how much CPU time it was actually given. From a syscall perspective it looks like this:

alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726601344
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 2782545353
times({tms_utime=1000, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726602344

so there are very few system calls between setting the alarm and checking the results with times(). With no or few interruptions, tms_utime will match the test time (10 seconds in this case).
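
A trace like the one above can be captured with something along these lines (the exact output will vary):

strace -f -e trace=alarm,times,rt_sigaction,rt_sigreturn \
        taskset -c 5 openssl speed -seconds 10 aes-256-cbc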

Since it is a pure userspace benchmark, ANY system time that shows up in times() is overhead. openssl may be the process on the CPU, but the CPU is actually doing something else, like processing packets. For example:

alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726617896
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 4079301579
times({tms_utime=178, tms_stime=571, tms_cutime=0, tms_cstime=0}) = 1726618896

shows that openssl was on the cpu for 7.49 seconds (178 + 571, in 0.01 second increments), but 5.71 seconds of that time was system time. Since openssl is not doing anything in the kernel, that 5.71 seconds is all overhead – time stolen from this process for “system needs.”

Using openssl to Infer Networking Overhead

With an understanding of how 'openssl speed' works, let's look at a near idle server:

$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc >/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 66675623 aes-256 cbc's in 9.99s
Doing aes-256 cbc for 10s on 64 size blocks: 18096647 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 256 size blocks: 4607752 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 1024 size blocks: 1162429 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 8192 size blocks: 145251 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 16384 size blocks: 72831 aes-256 cbc's in 10.00s

so in this case openssl reports 9.99 to 10.00 seconds of run time for each of the block sizes, confirming no contention for the CPU. Let's add network load, netperf TCP_STREAM from 2 sources, and re-do the test:

$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc >/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 12061658 aes-256 cbc's in 1.96s
Doing aes-256 cbc for 10s on 64 size blocks: 3457491 aes-256 cbc's in 2.10s
Doing aes-256 cbc for 10s on 256 size blocks: 893939 aes-256 cbc's in 2.01s
Doing aes-256 cbc for 10s on 1024 size blocks: 201756 aes-256 cbc's in 1.86s
Doing aes-256 cbc for 10s on 8192 size blocks: 25117 aes-256 cbc's in 1.78s
Doing aes-256 cbc for 10s on 16384 size blocks: 13859 aes-256 cbc's in 1.89s

Much different outcome. Each block size test wants to run for 10 seconds, but times() reports the actual user time to be between 1.78 and 2.10 seconds. Thus, the other 7.90 to 8.22 seconds were spent processing packets – either in the context of openssl or via ksoftirqd.
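
For reference, the competing load is nothing exotic; from each of the two netperf source machines it is something like the following, with the VM address here being hypothetical:

netperf -H 10.1.1.10 -t TCP_STREAM -l 60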

Looking at top for the previous openssl run:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P 
 8180 libvirt+  20   0 33.269g 1.649g 1.565g S 279.9  0.9  18:57.81 qemu-system-x86     75
 8374 root      20   0       0      0      0 R  99.4  0.0   2:57.97 vhost-8180          89
 1684 dahern    20   0   17112   4400   3892 R  73.6  0.0   0:09.91 openssl              5    
   38 root      20   0       0      0      0 R  26.2  0.0   0:31.86 ksoftirqd/5          5

One would think openssl is using ~73% of cpu 5, with ksoftirqd taking the rest, but in reality so many packets are getting processed in the context of openssl that it is only effectively getting 18-21% of the CPU to make progress on its workload.

If I drop the network load to just 1 stream, openssl appears to be running at 99% CPU:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P
 8180 libvirt+  20   0 33.269g 1.722g 1.637g S 325.1  0.9 166:38.12 qemu-system-x86     29
44218 dahern    20   0   17112   4488   3996 R  99.2  0.0   0:28.55 openssl              5
 8374 root      20   0       0      0      0 R  64.7  0.0  60:40.50 vhost-8180          55
   38 root      20   0       0      0      0 S   1.0  0.0   4:51.98 ksoftirqd/5          5

but openssl reports ~4 seconds of userspace time:

Doing aes-256 cbc for 10s on 16 size blocks: 26596388 aes-256 cbc's in 4.01s
Doing aes-256 cbc for 10s on 64 size blocks: 7137481 aes-256 cbc's in 4.14s
Doing aes-256 cbc for 10s on 256 size blocks: 1844565 aes-256 cbc's in 4.31s
Doing aes-256 cbc for 10s on 1024 size blocks: 472687 aes-256 cbc's in 4.28s
Doing aes-256 cbc for 10s on 8192 size blocks: 59001 aes-256 cbc's in 4.46s
Doing aes-256 cbc for 10s on 16384 size blocks: 28569 aes-256 cbc's in 4.16s

Again, monitoring tools show a lot of CPU access, but the reality is much different, with 55-80% of the CPU spent processing packets. The throughput numbers look great (22+ Gbps for a 25G link), but the impact on processes is huge.

In this example, the process robbed of CPU cycles is a silly benchmark. On a fully populated host the interrupted process can be anything – virtual cpus for a VM, emulator threads for the VM, vhost threads for the VM, or host level system processes, with varying degrees of impact on the performance of those processes and the system.

Up Next

This post is the basis for a follow up post where I will discuss a comparison of the overhead of “full stack” on a VM host to “XDP”.

[1] Measured using an ebpf program on entry and exit. See net_rx_action in https://github.com/dsahern/bpf-progs

[2] Assuming no bugs in the networking stack and driver. I have examined systems where net_rx_action takes well over 20,000 usec to process less than 64 packets due to a combination of bugs in the NIC driver (ARFS path) and OVS (thundering herd wakeup).

[3] https://github.com/isc-projects/ethq

Running Docker over Management VRF

Running the docker service over management VRF requires the service to be started bound to the VRF. Since docker and systemd do not natively understand VRF, the vrf exec helper in iproute2 can be used.

This series of steps worked for me on Ubuntu 19.10 and should work on 18.04 as well:

  • Configure mgmt VRF and disable systemd-resolved as noted in a previous post about management vrf and DNS

  • Install docker-ce

  • Edit /lib/systemd/system/docker.service and add /usr/sbin/ip vrf exec mgmt to the Exec lines like this:

    ExecStart=/usr/sbin/ip vrf exec mgmt /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
    
  • Tell systemd about the change and restart docker

    systemctl daemon-reload
    systemctl restart docker
    

With that, docker pull should work fine – in mgmt vrf or default vrf.
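
To double check the binding, 'ip vrf pids' lists the processes associated with a VRF; dockerd should show up in the list after the restart:

sudo ip vrf pids mgmt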

Management VRF and DNS

Someone recently asked me why apt-get was not working when he enabled management VRF on Ubuntu 18.04. After a few back and forths and a little digging I was reminded of why. The TL;DR is systemd-resolved. This blog post documents how I came to that conclusion and what you need to do to use management VRF with Ubuntu (or any OS using a DNS caching service such as systemd-resolved).

The following example is based on a newly created Ubuntu 18.04 VM. The VM comes up with the 4.15.0-66-generic kernel which is missing the VRF module:

$ modprobe vrf
modprobe: FATAL: Module vrf not found in directory /lib/modules/4.15.0-66-generic

despite VRF being enabled and built:

$ grep VRF /boot/config-4.15.0-66-generic
CONFIG_NET_VRF=m

which is really weird.[4] So for this blog post I shifted to the v5.3 HWE kernel:

$ sudo apt-get install --install-recommends linux-generic-hwe-18.04

Nothing about the DNS problem is kernel specific, though; any 4.14 or newer kernel with VRF enabled and usable will work.

First, let's enable Management VRF. All of the following commands need to be run as root. For simplicity in getting started, you will want to enable this sysctl so that sshd works across VRFs:

    echo "net.ipv4.tcp_l3mdev_accept=1" >> /etc/sysctl.d/99-sysctl.conf
    sysctl -p /etc/sysctl.d/99-sysctl.conf
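
To confirm the setting took effect:

    $ sysctl net.ipv4.tcp_l3mdev_accept
    net.ipv4.tcp_l3mdev_accept = 1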

Advanced users can leave that disabled and use something like the systemd instances to run sshd in Management VRF only.[1]

Ubuntu has moved to netplan for network configuration, and apparently netplan is still missing VRF support despite requests from multiple users since May 2018: https://bugs.launchpad.net/netplan/+bug/1773522

One option to work around the problem is to put the following in /etc/networkd-dispatcher/routable.d/50-ifup-hooks:

#!/bin/bash

ip link show dev mgmt 2>/dev/null
if [ $? -ne 0 ]
then
        # capture default route
        DEF=$(ip ro ls default)

        # only need to do this once
        ip link add mgmt type vrf table 1000
        ip link set mgmt up
        ip link set eth0 vrf mgmt
        sleep 1

        # move the default route
        ip route add vrf mgmt ${DEF}
        ip route del default

        # fix up rules to look in VRF table first
        ip ru add pref 32765 from all lookup local
        ip ru del pref 0
        ip -6 ru add pref 32765 from all lookup local
        ip -6 ru del pref 0
fi
ip route del default

The above assumes eth0 is the nic to put into Management VRF and that it has a static IP address (remember to make the hook script executable). If using DHCP instead of a static address, create or update the dhclient-exit-hook to put the default route in the Management VRF table (a rough sketch is below).[3] Another option is to use ifupdown2 for network management; it has good support for VRF.[1]
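
As a sketch of the DHCP case (untested here; reason, interface and new_routers are the standard dhclient-script hook variables, and [3] has a complete example):

# /etc/dhcp/dhclient-exit-hooks.d/vrf (sketch)
# move the DHCP-learned default route into the mgmt VRF table
case "$reason" in
BOUND|RENEW|REBIND|REBOOT)
        if [ -n "$new_routers" ]; then
                ip route replace vrf mgmt default via ${new_routers%% *} dev ${interface}
                ip route del default 2>/dev/null
        fi
        ;;
esac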

Reboot the node to make the changes take effect.

WARNING: If you run these commands from an active ssh session, you will lose connectivity since you are shifting the L3 domain of eth0, which impacts existing logins. You can avoid the reboot by running the above commands from the console.

After logging back in to the node with Management VRF enabled, the first thing to remember is that when VRF is enabled network addresses become relative to the VRF – and that includes loopback addresses (they are not that special).
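
A few quick sanity checks at this point (names per the setup above):

$ ip vrf show                 # mgmt should be listed with table 1000
$ ip link show master mgmt    # eth0 should be enslaved to the mgmt device
$ ip route show vrf mgmt      # the default route should live here
$ ip rule show                # local lookup should now be at pref 32765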

Any command that needs to contact a service over the Management VRF needs to be run in that context. If the command does not have native VRF support, then you can use 'ip vrf exec' as a helper to do the VRF binding. 'ip vrf exec' uses a small eBPF program to bind all IPv4 and IPv6 sockets opened by the command to the given device ('mgmt' in this case) which causes all routing lookups to go to the table associated with the VRF (table 1000 per the setting above).

Let's see what happens:

$ ip vrf exec mgmt apt-get update
Err:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'
Err:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
  Temporary failure resolving 'security.ubuntu.com'
Err:3 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'
Err:4 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'

Theoretically, this should Just Work, but it clearly does not. Why?

Ubuntu uses the systemd-resolved service by default, with /etc/resolv.conf configured to send DNS lookups to it:

$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Oct 21 15:48 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

$ cat /etc/resolv.conf
...
nameserver 127.0.0.53
options edns0

So when a process (e.g., apt) does a name lookup, the message is sent to 127.0.0.53/53. In theory, systemd-resolved gets the request and attempts to contact the actual nameserver.

Where does the theory break down? In 3 places.

First, 127.0.0.53 is not configured for Management VRF, so attempts to reach it fail:

$ ip ro get vrf mgmt 127.0.0.53
127.0.0.53 via 157.245.160.1 dev eth0 table 1000 src 157.245.160.132 uid 0
    cache

That one is easy enough to fix. The VRF device is meant to be the loopback for a VRF, so let's add the loopback addresses to it:

    $ ip addr add dev mgmt 127.0.0.1/8
    $ ip addr add dev mgmt ::1/128

    $ ip ro get vrf mgmt 127.0.0.53
    127.0.0.53 dev mgmt table 1000 src 127.0.0.1 uid 0
        cache

The second problem is systemd-resolved binds its socket to the loopback device:

    $ ss -apn | grep systemd-resolve
    udp  UNCONN   0      0       127.0.0.53%lo:53     0.0.0.0:*     users:(("systemd-resolve",pid=803,fd=12))
    tcp  LISTEN   0      128     127.0.0.53%lo:53     0.0.0.0:*     users:(("systemd-resolve",pid=803,fd=13))

The loopback device is in the default VRF and cannot be moved to Management VRF. A process bound to the Management VRF cannot communicate with a socket bound to the loopback device.

The third issue is that systemd-resolved runs in the default VRF, so its attempts to reach the real DNS server happen over the default VRF. Those attempts fail since the servers are only reachable from the Management VRF and systemd-resolved has no knowledge of it.

Since systemd-resolved is hardcoded (from a quick look at the source) to bind to the loopback device, there is no option but to disable it. It is not compatible with Management VRF – or VRF at all.

$ rm /etc/resolv.conf
$ grep nameserver /run/systemd/resolve/resolv.conf > /etc/resolv.conf
$ systemctl stop systemd-resolved.service
$ systemctl disable systemd-resolved.service

With that it works as expected:

$ ip vrf exec mgmt apt-get update
Get:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease [242 kB]
Get:2 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:5 http://mirrors.digitalocean.com/ubuntu bionic-updates/universe amd64 Packages [1054 kB]

When using Management VRF, it is convenient (i.e., less typing) to bind the shell to the VRF and let all commands run from it inherit the VRF binding:

$ ip vrf exec mgmt su - dsahern

Now all commands run will automatically use Management VRF. This can be done at login using libpamscript[2].

Personally, I like a reminder about the network bindings in my bash prompt. I do that by adding the following to my .bashrc:

NS=$(ip netns identify)
[ -n "$NS" ] && NS=":${NS}"

VRF=$(ip vrf identify)
[ -n "$VRF" ] && VRF=":${VRF}"

And then adding '${NS}${VRF}' after the host in PS1:

PS1='${debian_chroot:+($debian_chroot)}\u@\h${NS}${VRF}:\w\$ '

For example, now the prompt becomes: dsahern@myhost:mgmt:~$

References:

[1] VRF tutorial, Open Source Summit, North America, Sept 2017 http://schd.ws/hosted_files/ossna2017/fe/vrf-tutorial-oss.pdf

[2] VRF helpers, e.g., systemd instances for VRF https://github.com/CumulusNetworks/vrf

[3] Example using VRF in dhclient-exit-hook https://github.com/CumulusNetworks/vrf/blob/master/etc/dhcp/dhclient-exit-hooks.d/vrf

[4] Vincent Bernat informed me that some modules were moved to a non-standard package; installing linux-modules-extra-$(uname -r) provides the vrf module. Thanks Vincent.