Jakub Kicinski

Cross-posting my blog post about a TCP overload investigation, since I didn't know how to add pictures here :( https://developers.facebook.com/blog/post/2022/04/25/investigating-tcp-self-throttling-triggered-overload/

In Linux 5.13 ethtool gains an interface for querying IEEE and IETF statistics. This removes the need to parse vendor-specific strings in ethtool -S.

Status quo

Linux has two sources of NIC statistics: the common interface stats (which show up in ifconfig, ip link, sysfs and a few other places) and ethtool -S. The former – common interface stats – are a mix of basic info (packets, bytes, drops, errors in each direction) and a handful of lower-level stats like CRC errors, framing errors, collisions or FIFO errors. Many of these statistics became either irrelevant (collisions) or semantically unclear (FIFO errors) in modern NICs.

This is why deployments increasingly depend on ethtool -S statistics for error tracking. ethtool -S is a free-form list of stats provided by the driver. It started out as a place for drivers to report custom, implementation-specific stats, but ended up also serving as a reporting place for new statistics as the networking standards developed.

Sadly there is no commonality in how vendors name their ethtool statistics. The spelling and abbreviation of IEEE stats always differ; sometimes the names chosen do not resemble the standard names at all (reportedly because vendors consider those names “too confusing” for the users). This forces infrastructure teams to maintain translations and custom per-vendor logic to scrape ethtool -S output.

What changed

Starting with Linux 5.6, Michal Kubecek has been progressively porting ethtool from ioctls to a more structured and extensible netlink interface. Thanks to that we can now augment the old commands to carry statistics. When the user specifies -I | --include-statistics on the command line (or the appropriate flag in netlink), the kernel will include relevant statistics in its response, e.g. for flow control:

 # ethtool -I -a eth0
 Pause parameters for eth0:
 Autonegotiate:    off
 RX:        off
 TX:        on
 Statistics:
   tx_pause_frames: 25545561
   rx_pause_frames: 0

General statistics such as PHY and MAC counters are now available via ethtool -S under standards-based names through a new --groups switch, e.g.:

 # ethtool -S eth0 --groups eth-mac
 Standard stats for eth0:
 eth-mac-FramesTransmittedOK: 902623288966
 eth-mac-FramesReceivedOK: 28727667047
 eth-mac-FrameCheckSequenceErrors: 1
 eth-mac-AlignmentErrors: 0
 eth-mac-OutOfRangeLengthField: 0

Each of the commands supports JSON-formatted output for ease of parsing (--json).
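Because the names are now the same across vendors, a scraper no longer needs per-driver translation tables. Below is a minimal sketch that parses the plain-text form shown above into a dictionary. The function name and the sample string are made up for illustration, and the sketch deliberately sticks to the text layout from this post rather than guessing at the shape of the --json output.

```python
def parse_standard_stats(text):
    """Turn `ethtool -S <dev> --groups <group>` text output into {name: value}."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # The "Standard stats for eth0:" header has no numeric value, so it is skipped.
        if sep and value.isdigit():
            stats[name] = int(value)
    return stats


# Sample copied from the example output above.
SAMPLE = """\
Standard stats for eth0:
eth-mac-FramesTransmittedOK: 902623288966
eth-mac-FramesReceivedOK: 28727667047
eth-mac-FrameCheckSequenceErrors: 1
eth-mac-AlignmentErrors: 0
eth-mac-OutOfRangeLengthField: 0
"""

if __name__ == "__main__":
    for name, value in parse_standard_stats(SAMPLE).items():
        print(name, value)
```

The same dictionary keys should then work on any NIC whose driver implements the group, which is the whole point of the new interface.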

So little, so late

Admittedly the new interface is quite basic. It mostly includes statistics provided in IEEE or IETF standards, and NICs may report more interesting data. There is also no metadata about “freshness” of the stats here, or filtering built into the interface.

The starting point is based on fulfilling immediate needs. We hope the interfaces will be extended as needed. Statistics can be made arbitrarily complex, so after a couple of false starts with complex interfaces we decided to let the use cases drive the interface.

It’s also very useful to lean on the standards for clear definition of the semantics. Going forward we can work with vendors on codifying the definitions of other counters they have.

List of currently supported stats

IEEE 802.3 attributes:

 30.3.2.1.5 aSymbolErrorDuringCarrier
 30.3.1.1.2 aFramesTransmittedOK
 30.3.1.1.3 aSingleCollisionFrames
 30.3.1.1.4 aMultipleCollisionFrames
 30.3.1.1.5 aFramesReceivedOK
 30.3.1.1.6 aFrameCheckSequenceErrors
 30.3.1.1.7 aAlignmentErrors
 30.3.1.1.8 aOctetsTransmittedOK
 30.3.1.1.9 aFramesWithDeferredXmissions
 30.3.1.1.10 aLateCollisions
 30.3.1.1.11 aFramesAbortedDueToXSColls
 30.3.1.1.12 aFramesLostDueToIntMACXmitError
 30.3.1.1.13 aCarrierSenseErrors
 30.3.1.1.14 aOctetsReceivedOK
 30.3.1.1.15 aFramesLostDueToIntMACRcvError
 
 30.3.1.1.18 aMulticastFramesXmittedOK
 30.3.1.1.19 aBroadcastFramesXmittedOK
 30.3.1.1.20 aFramesWithExcessiveDeferral
 30.3.1.1.21 aMulticastFramesReceivedOK
 30.3.1.1.22 aBroadcastFramesReceivedOK
 30.3.1.1.23 aInRangeLengthErrors
 30.3.1.1.24 aOutOfRangeLengthField
 30.3.1.1.25 aFrameTooLongErrors

 30.3.3.3 aMACControlFramesTransmitted
 30.3.3.4 aMACControlFramesReceived
 30.3.3.5 aUnsupportedOpcodesReceived
 
 30.3.4.2 aPAUSEMACCtrlFramesTransmitted
 30.3.4.3 aPAUSEMACCtrlFramesReceived

 30.5.1.1.17 aFECCorrectedBlocks
 30.5.1.1.18 aFECUncorrectableBlocks

IETF RMON (RFC 2819):

 etherStatsUndersizePkts
 etherStatsOversizePkts
 etherStatsFragments
 etherStatsJabbers

 etherStatsPkts64Octets
 etherStatsPkts65to127Octets
 etherStatsPkts128to255Octets
 etherStatsPkts256to511Octets
 etherStatsPkts512to1023Octets
 etherStatsPkts1024to1518Octets
 (incl. further stats for jumbo MTUs)

Kernel side changes: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=8203c7ce4ef2840929d38b447b4ccd384727f92b

Recent months saw a lot of changes in the venerable NAPI polling mechanism, which warrants a write-up.

NAPI primer

NAPI (New API) is so old it could buy alcohol in the US – which is why I will not attempt a full history lesson. It is, however, worth clearing up one major misconception. The basic flow of NAPI operation does not involve any explicit processing delays.

Before we even talk about how NAPI works, however, we need a sidebar on software interrupts.

Software interrupts (softirq) or bottom halves are a kernel concept which helps decrease interrupt service latency. Because normal interrupts don't nest in Linux, the system can't service any new interrupt while it's already processing one. Therefore doing a lot of work directly in an IRQ handler is a bad idea. softirqs are a form of processing which allows the IRQ handler to schedule a function to run as soon as the IRQ handler exits. This adds a tier of “low latency processing” which does not block hardware interrupts. If software interrupts start consuming a lot of cycles, however, the kernel will wake up a ksoftirqd thread to take over the I/O portion of the processing. This helps back-pressure the I/O, and makes sure random threads don't get their scheduler slice depleted by softirq work.

Now that we understand softirqs, this is what NAPI does:

  1. Device interrupt fires
  2. Interrupt handler masks the individual NIC IRQ which has fired (modern NICs mask their IRQs automatically)
  3. Interrupt handler “schedules” NAPI in softirq
  4. Interrupt handler exits
  5. softirq runs NAPI callback immediately (or less often in ksoftirqd)
  6. At the end of processing NAPI re-enables the appropriate NIC IRQ

As you can see there is no explicit delay from IRQ firing to NAPI, or extra batching, or re-polling built in.
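The six steps can be mimicked with a toy state machine. This is purely an illustration in Python, not kernel code: the class, its method names and the `budget` parameter are invented for the sketch. Packets that arrive while the IRQ is masked do not raise a second interrupt, and the poll simply picks up whatever is in the ring when it runs.

```python
from collections import deque


class ToyNapi:
    """Toy model of one NAPI instance; not how the kernel implements it."""

    def __init__(self):
        self.irq_enabled = True
        self.scheduled = False
        self.rx_ring = deque()
        self.log = []

    def packet_arrives(self, pkt):
        self.rx_ring.append(pkt)
        if self.irq_enabled:          # step 1: device interrupt fires
            self._hard_irq()

    def _hard_irq(self):
        self.log.append("irq")
        self.irq_enabled = False      # step 2: mask this NIC IRQ
        self.scheduled = True         # step 3: schedule NAPI in softirq
        # step 4: the handler exits, nothing else happens in hard IRQ context

    def run_softirq(self, budget=64):
        """Step 5: poll the ring. `budget` is an assumed per-poll packet limit."""
        if not self.scheduled:
            return 0
        done = 0
        while self.rx_ring and done < budget:
            self.rx_ring.popleft()
            done += 1
        self.log.append(f"poll:{done}")
        if not self.rx_ring:
            self.scheduled = False
            self.irq_enabled = True   # step 6: re-enable the NIC IRQ
        return done


if __name__ == "__main__":
    napi = ToyNapi()
    for pkt in ("a", "b", "c"):
        napi.packet_arrives(pkt)
    napi.run_softirq()
    print(napi.log)
```

Note there is no timer or sleep anywhere in the model: any batching comes only from packets that happened to arrive before the poll ran.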

Problem

NAPI was designed several years before Intel released its first multi-core CPU. Today systems have tens of CPUs and all of the cores can have dedicated networking queues. Experiments show that separating network processing from application processing yields better application performance. That said, manual tuning and CPU allocation for every workload is tedious and often not worth the effort.

In terms of raw compute throughput, having many cores service interrupts means more interrupts (less batching) and more cache pollution. Interrupts are also bad for application latency. Application workers are periodically stopped to service networking traffic. It would be much better to let the application finish its calculations and then service I/O only once it needs more data.

Last but not least, NAPI semi-randomly gets kicked out into the ksoftirqd thread, which degrades the network latency.

Busy polling

Busy polling is a kernel feature which was originally intended for low latency processing. Whenever an application was out of work it could check if the NIC had any packets to service, thus circumventing the interrupt coalescing delay.

Recent work by Bjorn Topel reused the concept to avoid application interruptions altogether. An application can now make a “promise” to the kernel that it will periodically check for new packets itself (the kernel sets a timer/alarm to make sure the application doesn't break that promise). The application is expected to use busy polling to process packets, replacing the interrupt driven parts of NAPI.
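On the socket side the “promise” is made with socket options. The sketch below tries to set them and reports which ones the kernel accepted. The option numbers are the asm-generic values (x86, arm64 and most others; a few architectures differ) and are spelled out because Python's socket module may not export them. The `usecs` and `budget` defaults are placeholders, not recommendations. As far as I know the options need a 5.11 or newer kernel and CAP_NET_ADMIN, and the timer half of the deal is configured separately per device through the napi_defer_hard_irqs and gro_flush_timeout sysfs attributes.

```python
import socket

# Values from include/uapi/asm-generic/socket.h; assumed to match this machine.
SO_BUSY_POLL = 46
SO_PREFER_BUSY_POLL = 69
SO_BUSY_POLL_BUDGET = 70


def request_busy_poll(sock, usecs=200, budget=64):
    """Try to opt a socket into preferred busy polling.

    Returns {option name: True/False} so the caller can see what was accepted;
    failures (old kernel, missing privileges) are reported, not raised.
    """
    accepted = {}
    for name, opt, val in (
        ("SO_PREFER_BUSY_POLL", SO_PREFER_BUSY_POLL, 1),
        ("SO_BUSY_POLL", SO_BUSY_POLL, usecs),
        ("SO_BUSY_POLL_BUDGET", SO_BUSY_POLL_BUDGET, budget),
    ):
        try:
            sock.setsockopt(socket.SOL_SOCKET, opt, val)
            accepted[name] = True
        except OSError:
            accepted[name] = False
    return accepted


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        print(request_busy_poll(s))
```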

For example, the usual timeline of NAPI processing would look something like:

EVENTS            irq coalescing delay (e.g. 50us)
                  <--->
packet arrival | p  p   ppp   p   pp  p  p  p  pp   p  ppp  p
           IRQ |       X      X        X      X     X     X    
-----------------------------------------------------------------
CPU USE
          NAPI |       NN     NNN      N      N     NN    N
   application |AAAAAAA  AAAAA   AAA AA AAAAAA AAAAA   AA  AAA
                < process req 1    > <process req 2     > <proc..

[Forgive the rough diagram, one space is 10us, assume app needs 150us per req, A is time used by app, N by NAPI.]

With the new busy polling we want to achieve this:

EVENTS
packet arrival | p  p   ppp   p   pp  p  p  p  pp   p  ppp  p
           IRQ |
-----------------------------------------------------------
CPU USE
          NAPI |               NNNN                  NNN
   application |AAAAAAAAAAAAAAA^     AAAAAAAAAAAAAAAA^    AA 
                < process req 1>     <process req 2 >     <proc..

Here the application does not get interrupted. Once it's done with a request it asks the kernel to process packets. This allows the app to improve the latency by the amount of time NAPI processing would steal from request processing.

The two caveats of this approach are:

  • application processing has to be at a similar granularity to NAPI processing (the typical cycle shouldn't be longer than 200us)
  • the application itself needs to be designed with CPU mapping in mind, or to put it simply the app architecture needs to follow the thread-per-core design – since NAPI instances are mapped to cores and there needs to be a thread responsible for polling each NAPI

Threaded NAPI

For applications which don't want to take on the polling challenge a new “threaded” NAPI processing option was added (after years of poking from many teams).

Unlike normal NAPI, which relies on the built-in softirq processing, threaded NAPI queues have their own threads which always do the processing. Conceptually it's quite similar to the ksoftirqd thread, but:

  • it never does processing “in-line” right after the hardware IRQ; it always wakes up the thread
  • it only processes NAPI, not work from other subsystems
  • there is a thread per NAPI (NIC queue pair), rather than thread per core

The main advantage of threaded NAPI is that the network processing load is visible to the CPU scheduler, allowing it to make better choices. In tests performed by Google, NAPI threads were explicitly pinned to cores but the application threads were not.
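Threaded mode is switched per device through the `threaded` attribute in sysfs. Here is a small sketch of flipping it and reading the value back. The helper name and the `sysfs_root` parameter (there only so the function can be tried against a scratch directory) are my own additions, and writing to the real file needs root.

```python
from pathlib import Path


def set_threaded_napi(dev, enable=True, sysfs_root="/sys"):
    """Write 1/0 to <sysfs_root>/class/net/<dev>/threaded and confirm by reading back."""
    wanted = "1" if enable else "0"
    path = Path(sysfs_root) / "class" / "net" / dev / "threaded"
    path.write_text(wanted)
    return path.read_text().strip() == wanted


if __name__ == "__main__":
    import sys

    dev = sys.argv[1] if len(sys.argv) > 1 else "eth0"
    try:
        print("threaded NAPI enabled:", set_threaded_napi(dev))
    except OSError as err:
        print("could not switch", dev, "to threaded NAPI:", err)
```

Pinning the resulting kernel threads to cores, as in the tests mentioned above, is a separate step left out of the sketch.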

TAPI (work in progress)

The main disadvantage of threaded NAPI is that, according to my tests, it does in fact require explicit CPU allocation unless the system is relatively idle; otherwise NAPI threads suffer latencies similar to ksoftirqd latencies.

The idea behind “TAPI” is to automatically pack and rebalance multiple instances of NAPI to each thread. The hope is that each thread reaches high enough CPU consumption to get a CPU core all to itself from the scheduler. Rather than having 3 threaded NAPI workers at 30% CPU each, TAPI would have one at 90% which services its 3 instances in a round robin fashion. The automatic packing and balancing should therefore remove the need to manually allocate CPU cores to networking. This mode of operation is inspired by Linux workqueues, but with higher locality and latency guarantees.
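To make the packing idea concrete, here is a greedy first-fit-decreasing sketch. It is only an illustration of the “3 workers at 30% become one at 90%” arithmetic: the function name, the percent-based loads and the 90% capacity are assumptions, and this is not the actual TAPI balancing logic, which is not described here.

```python
def pack_napi_instances(loads, capacity=90):
    """Pack NAPI instances onto as few worker threads as fit.

    loads:    {instance name: CPU load in percent} (hypothetical measurements)
    capacity: maximum load in percent one thread should carry
    Returns a list of {"load": int, "members": [names]} dicts, one per thread.
    """
    threads = []
    # Largest instances first; ties keep their original order.
    for name, load in sorted(loads.items(), key=lambda kv: kv[1], reverse=True):
        for thread in threads:
            if thread["load"] + load <= capacity:
                thread["members"].append(name)
                thread["load"] += load
                break
        else:
            # No existing thread has room, start a new one.
            threads.append({"load": load, "members": [name]})
    return threads


if __name__ == "__main__":
    print(pack_napi_instances({"q0": 30, "q1": 30, "q2": 30}))
```

A real implementation would also have to rebalance as load shifts, which a one-shot packing like this ignores.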

Unfortunately, after initial promising results, the TAPI work has been on hold for the last 2 months due to upstream emergencies.

All modern NICs implement IRQ coalescing (ethtool -c/-C), which delays RX/TX interrupts hoping more frames arrive in the meantime to allow for batch processing. IRQ coalescing trades off latency for system throughput.

It’s commonly believed that the higher the packet rate, the more batching the system needs to keep up. At lower rates the batching would be limited, anyway. This leads to the idea of adaptive IRQ coalescing where the NIC itself – or more likely the driver – adjusts the IRQ timeouts based on the recent rate of packet arrivals.

Unfortunately adaptive coalescing is not a panacea, as it often has a predefined range of values it chooses from, and it costs extra CPU processing to continuously recalculate and update the rate (especially with modern NICs which often need to talk to firmware to change settings rather than simply writing to device registers).

The summary above, while correct (I hope :)), misses one important point. There are two sets of IRQ coalescing settings, and only one of them has a significant latency impact. NICs (and the Linux kernel ethtool API) have separate settings for RX and TX. RX processing is more costly, so playing with RX settings feels more significant, but RX batching costs latency. For TX processing (or actually TX completion processing) the latency matters much, much less.

With a simple bpftrace command:

bpftrace -e 'tracepoint:napi:napi_poll { @[args->work] = count(); }'

we can check how many RX packets get received on every NAPI poll (that’s to say how many packets get coalesced). On a moderately loaded system the top entries may look something like:

@[4]: 750
@[3]: 2180
@[2]: 15828
@[1]: 233080
@[0]: 298525

where the first number (in square brackets) is the number of packets coalesced, and the second number is a counter of occurrences.

The 0 work done entries (@[0]: 298525) usually mean that the driver received a TX interrupt, and there were no RX packets to process. Drivers will generally clear their TX rings while doing RX processing, so, with TX processing being less latency sensitive, in an ideal scenario we’d like to see no TX interrupts at all, but rather have TX processing piggyback on the RX interrupts.

How high can we set the TX coalescing parameters, then? If the workload is mostly using TCP, all we really need to ensure is that we don’t run afoul of the TCP Small Queues limit (/proc/sys/net/ipv4/tcp_limit_output_bytes), which is the number of bytes the TCP stack is willing to queue up to the NIC without getting a TX completion.

For example, recent upstream kernels have a TSQ limit of 1MB, so even with a 50Gbps NIC, delaying TX interrupts for up to 350us should be fine. Obviously we want to give ourselves a safety margin for scheduling delays, timer slack etc. Additionally, according to my experiments the gains of TX coalescing above 200us are perhaps too low for the risk.
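A back-of-the-envelope way to bound the TX delay is to ask how long one socket takes to drain its TSQ budget. The sketch below assumes the limit is 1048576 bytes, that it applies per socket, and that flows share the link evenly; the function name and the `flows` parameter are made up for illustration.

```python
def tsq_drain_time_us(limit_bytes, link_gbps, flows=1):
    """Microseconds for one socket to drain `limit_bytes` when `flows`
    sockets share a `link_gbps` link evenly (a simplifying assumption)."""
    per_flow_bits_per_sec = link_gbps * 1e9 / flows
    return limit_bytes * 8 / per_flow_bits_per_sec * 1e6


TSQ_LIMIT = 1048576  # assumed value of tcp_limit_output_bytes

if __name__ == "__main__":
    for flows in (1, 2, 4):
        print(flows, "flow(s):", round(tsq_drain_time_us(TSQ_LIMIT, 50, flows)), "us")
```

Under these assumptions a lone flow saturating a 50Gbps link drains its budget in roughly 168us, while two flows sharing the link get roughly 336us each. The safe ceiling therefore depends on how concentrated the traffic is, which is one more reason to leave a margin.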

Repeating the bpftrace command from above after setting coalescing to 150us / 128 frames:

@[4]: 831
@[3]: 2066
@[2]: 16056
@[0]: 177186
@[1]: 228985

We see far fewer @[0] occurrences compared to the first run. The gain in system throughput depends on the workload; I’ve seen an increase of 6% on the workload I tested with.
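To put a number on the difference, the two listings can be parsed and compared. The sketch below uses only the top entries quoted above, so the shares are approximate, and comparing raw counts between runs is only meaningful if both ran for the same amount of time (an assumption here). The function names are invented.

```python
import re


def parse_napi_hist(text):
    """Parse bpftrace map lines like '@[2]: 15828' into {work: count}."""
    return {int(m.group(1)): int(m.group(2))
            for m in re.finditer(r"@\[(\d+)\]:\s*(\d+)", text)}


def zero_work_share(hist):
    """Fraction of NAPI polls which processed no RX packets."""
    return hist.get(0, 0) / sum(hist.values())


BEFORE = """
@[4]: 750
@[3]: 2180
@[2]: 15828
@[1]: 233080
@[0]: 298525
"""

AFTER = """
@[4]: 831
@[3]: 2066
@[2]: 16056
@[0]: 177186
@[1]: 228985
"""

if __name__ == "__main__":
    before, after = parse_napi_hist(BEFORE), parse_napi_hist(AFTER)
    print(f"zero-work polls before: {zero_work_share(before):.1%}")
    print(f"zero-work polls after:  {zero_work_share(after):.1%}")
    print(f"drop in @[0] count:     {1 - after[0] / before[0]:.1%}")
```

With the numbers above the share of polls which did no RX work falls from about 54% to about 42%.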

A word of warning – even though upstream reviewers try to make sure drivers behave sanely and return errors for unsupported configurations – there are vendors out there who will silently ignore TX coalescing settings...
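One cheap sanity check is to read the settings back with ethtool -c after changing them (for instance with something like ethtool -C eth0 tx-usecs 150 tx-frames 128) and compare. The sketch below parses the "name: value" lines of that output. The sample text is a shortened, hypothetical example and the helper names are invented. A driver that accepts and echoes a value without acting on it will still pass, so the bpftrace histogram remains the real test.

```python
def parse_coalesce(text):
    """Parse `ethtool -c <dev>` output into {parameter: int}, skipping n/a entries."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            params[key] = int(value)
    return params


def coalesce_mismatches(requested, reported):
    """Return {parameter: (requested, reported)} for settings that did not stick."""
    return {k: (v, reported.get(k))
            for k, v in requested.items() if reported.get(k) != v}


# Hypothetical, abbreviated readback from a driver which ignored tx-frames.
SAMPLE_READBACK = """\
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
rx-usecs: 8
rx-frames: n/a
tx-usecs: 150
tx-frames: 0
"""

if __name__ == "__main__":
    wanted = {"tx-usecs": 150, "tx-frames": 128}
    print(coalesce_mismatches(wanted, parse_coalesce(SAMPLE_READBACK)))
```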