Jakub Kicinski

Kernel TLS implements the record encapsulation and cryptography of the TLS protocol. There are four areas where implementing (a portion of) TLS in the kernel helps:

  • enabling seamless acceleration (NIC or crypto accelerator offload)
  • enabling sendfile on encrypted connections
  • saving extra data copies (data can be encrypted as it is copied into the kernel)
  • enabling the use of TLS on kernel sockets (nbd, NFS etc.)

Kernel TLS handles only data records turning them into a cleartext data stream, all the control records (TLS handshake etc.) get sent to the application via a side channel for user space (OpenSSL or such) to process. The first implementation of kTLS was designed in the good old days of TLS 1.2. When TLS 1.3 came into the picture the interest in kTLS had slightly diminished and the implementation, although functional, was rather simple and did not retain all the benefits. This post covers developments in the Linux 5.20 implementation of TLS which claws back the performance lost moving to TLS 1.3. One of the features we lost in TLS 1.3 was the ability to decrypt data as it was copied into the user buffer during read. TLS 1.3 hides the true type of the record. Recall that kTLS wants to punt control records to a different path than data records. TLS 1.3 always populates the TLS header with application_data as the record type and the real record type is appended at the end, before record padding. This means that the data has to be decrypted for the true record type to be known.

Problem 1 – CoW on big GRO segments is inefficient

kTLS was made to dutifully decrypt the TLS 1.3 records first before copying the data to user space. Modern CPUs are relatively good at copying data, so the copy is not a huge problem in itself. What’s more problematic is how the kTLS code went about performing the copy. The data queued on TCP sockets is considered read-only by the kernel. The pages data sits in may have been zero-copy-sent and for example belong to a file. kTLS tried to decrypt “in place” because it didn’t know how to deal with separate input/output skbs. To decrypt “in place” it calls skb_cow_data(). As the name suggests this function makes a copy of the memory underlying an skb, to make it safe for writing. This function, however, is intended to be run on MTU-sized skbs (individual IP packets), not skbs from the TCP receive queue. The skbs from the receive queue can be much larger than a single TLS record (16kB). As a result TLS would CoW a 64kB skb 4 times to extract the 4 records inside it. Even worse if we consider that the last record will likely straddle skbs so we need to CoW two 64kB skbs to decrypt it “in place”. The diagram below visualizes the problem and the solution. SKB CoW The possible solutions are quite obvious – either create a custom version of skb_cow_data() or teach TLS to deal with different input and output skbs. I opted for the latter (due to further optimizations it enables). Now we use a fresh buffer for the decrypted data and there is no need to CoW the big skbs TCP produces. This fix alone results in ~25-45% performance improvement (depending on the exact CPU SKU and available memory bandwidth). A jump in performance from abysmal to comparable with the user space OpenSSL.

Problem 2 – direct decrypt

Removing pointless copies is all well and good, but as mentioned we also lost the ability to decrypt directly to the user space buffer. We still need to copy the data to user space after it has been decrypted (A in the diagram below, here showing just a single record not full skb). SKB direct decrypt We can’t regain the full efficiency of TLS 1.2 because we don’t know the record type upfront. In practice, however, most of the records are data/application records (records carrying the application data rather than TLS control traffic like handshake messages or keys), so we can optimize for that case. We can optimistically decrypt to the user buffer, hoping the record contains data, and then check if we were right. Since decrypt to a user space buffer does not destroy the original encrypted record if we turn out to be wrong we can decrypting again, this time to a kernel skb (which we can then direct to the control message queue). Obviously this sort of optimization would not be acceptable in the Internet wilderness, as attackers could force us to waste time decrypting all records twice. The real record type in TLS 1.3 is at the tail of the data. We must either trust that the application will not overwrite the record type after we place it in its buffer (B in the diagram below), or assume there will be no padding and use a kernel address as the destination of that chunk of data (C). Since record padding is also rare – I chose option (C). It improves the single stream performance by around 10%.

Problem 3 – latency

Applications tests have also showed that kTLS performs much worse than user space TLS in terms of the p99 RPC response latency. This is due to the fact that kTLS holds the socket lock for very long periods of time, preventing TCP from processing incoming packets. Inserting periodic TCP processing points into the kTLS code fixes the problem. The following graph shows the relationship between the TCP processing frequency (on the x axis in kB of consumed data, 0 = inf), throughput of a single TLS flow (“data”) and TCP socket state. TCP CWND SWND The TCP-perceived RTT of the connection grows the longer TLS hogs the socket lock without letting TCP process the ingress backlog. TCP responds by growing the congestion window. Delaying the TCP processing will prevent TCP from responding to network congestion effectively, therefore I decided to be conservative and use 128kB as the TCP processing threshold. Processing the incoming packets has the additional benefit of TLS being able to consume the data as it comes in from the NIC. Previously TLS had access to the data already processed by TCP when the read operation began. Any packets coming in from the NIC while TLS was decrypting would be backlogged at TCP input. On the way to user space TLS would release the socket lock, allowing the TCP backlog processing to kick in. TCP processing would schedule a TLS worker. TLS worker would tell the application there is more data.

In light of ongoing work to improve the TCP Tx zero-copy efficiency [1] one begins to wonder what can be done on the Rx side. Tx zero-copy is generally easier to implement because it requires no extra HW support. It's primarily a SW exercise in keeping references to user data rather than owning a copy of it.

[1] https://lore.kernel.org/all/cover.1653992701.git.asml.silence@gmail.com/

There had been efforts to support direct Rx to user space buffers by performing header-data splitting in the HW and sending headers (for kernel consumption) to a different buffer than the data. There are two known implementations which depend on header-data splitting (HDS).

First one, which is upstream today [2] depends on data payloads being received in page-sized chunks and mapping (mmap'ing?) the data into the process's virtual address space. Even though modifying the virtual address map is not cheap this scheme works well for large transfers. Importantly applications which use zero-copy are often more DRAM bandwidth constrained than CPU bound. Meaning that even for smaller transfers they may prefer to burn extra cycles modifying page tables than copying data and using up the precious DRAM transfers.

[2] https://lwn.net/Articles/752188/

The second approach is to pre-register user memory and let the NIC DMA the data directly there [3]. The memory in this case can be main DRAM or accelerator memory, that's not really important for us here. The model is similar to that of AF_XDP UMEMs. It still depends on header-data split as we don't want the application to be able to modify the headers, as they flow thru the networking stack and crash the kernel. Additionally the device must provide sufficient flow steering to be able to direct traffic for an application to its buffers / queues – we don't want the data to end up in memory of the wrong application. Once the TCP stack is done processing the headers, it simply tells the application where the data is.

[3] https://lore.kernel.org/netdev/20200727224444.2987641-1-jonathan.lemon@gmail.com/

Apart from the ability to direct the data to memory other than DRAM (compute accelerators, disks) the second approach has the advantage of not requiring page table changes and neatly page-sized payloads. It is harder to implement in SW because the stack does not necessarily have access to the payload memory. Although netgpu/zctap patches have stalled the idea may have sufficient merit to eventually come to fruition.

Looking ahead

Both of the approaches described above deal with packet-sized chunks of data. Admittedly some HW supports data coalescing (GRO-HW/LRO) but it's limited in size, best effort, latency inducing and generally unproven at scale.

Given that we already modify the HW to support zero-copy Rx (HDS for the former case, HDS+steering for the latter) what modifications would help us get to the next level? The data coalescing can certainly be improved.

LRO depends on intelligence in the NIC and maintaining state about connections which always poses scaling challenges. At the same time modern applications most often know the parameters of the transfer upfront, so they can put processing hints in the packet.

The receiver needs to know (1) which memory pool / region to place the data into and (2) at what offset. We can assume the sender gets these parameters thru the RPC layer. Sender can insert the placement information into the packet it sends. Either wrapped in a UDP header in front of the TCP header, or as a TCP option.

One useful modification may be to express the offset in terms of the TCP sequence number. Instead of providing the absolute offset where the packet data has to land (i.e. “data of this packet has to land in memory X at offset Y”) provide a base TCP sequence number and offset. The destination address would then be computed as

address = mem_pool[packet.dma.mem_id].base + 
          packet.dma.offset +
          packet.tcp.seq_no - packet.dma.base_tcp_seq

This simplifies the sender's job as it no longer has to change the offset as it breaks up a TSO super-frame into MTU-sized segments.

Last thing – we must protect applications from rogue writes. This can be done by programming an allow list of flow to memory pool pairs into the NIC. Unfortunately the full list scales linearly with the number of flows. A better approach would be to depend on a security protocol like Google's PSP [4]. PSP hands out association IDs to the senders, and only after authentication. The PSP IDs are much smaller (4B) than a IPv6 flow key (36B) if we have to keep a full list. We can try to be clever about how we hand them out – for instance allocating the “direct write” IDs from high numbers and less trusted connections from low numbers. We can also modify PSP to maintain a key per memory pool rather than per device, or incorporate the memory region ID in the key derivation algorithm. I'm not sufficiently crypto-savvy to know if the latter would weaken the protection too much.

[4] https://raw.githubusercontent.com/google/psp/main/doc/PSP_Arch_Spec.pdf

To sum up in the scheme I'm dreaming up we add the following fields after the PSP header:

  • 16b – memory id – ID of the region;
  • 64b – base offset – absolute offset within the memory region;
  • 32b – TCP sequence number base for the transfer.

Since we have 16 extra bits to round up to full 128b header we can consider adding a generation number for the memory region. This could be useful if memory region configuration / page table update is asynchronous so that we can post the read request before NIC confirmed the configuration is complete. In most cases configuration should be done by the time the data arrives, if it's not we can fall back to non-zero-copy TCP.

I don't work for Netronome/Corigine any more but it's certainly something their HW can easily support with a FW update. Much like PSP itself. What an amazing piece of HW that is...

Cross-posting my blog post about TCP overload investigation, I didn't know how to add pictures here :( https://developers.facebook.com/blog/post/2022/04/25/investigating-tcp-self-throttling-triggered-overload/

In Linux 5.13 ethtool gains an interface for querying IEEE and IETF statistics. This removes the need to parse vendor specific strings in ethtool -S.

Status quo

Linux has two sources of NIC statistics, the common interface stats (which show up in ifconfig, ip link, sysfs and few other places) and ethtool -S. The former – common interface stats – are a mix of basic info (packets, bytes, drops, errors in each direction) and a handful of lower level stats like CRC errors, framing errors, collisions or FIFO errors. Many of these statistics became either irrelevant (collisions) or semantically unclear (FIFO errors) in modern NICs.

This is why deployments increasingly depend on ethtool -S statistics for error tracking. ethtool -S is a free form list of stats provided by the driver. It started out as a place for drivers to report custom, implementation specific stats, but ended up also serving as a reporting place for new statistics as the networking standards developed.

Sadly there is no commonality in how vendors name their ethtool statistics. The spelling and abbreviation of IEEE stats always differ, sometimes the names chosen do not resemble the standard names at all (reportedly because vendors consider those names “too confusing” for the users). This forces infrastructure teams to maintain translations and custom per-vendor logic to scrape ethtool -S output.

What changed

Starting with Linux 5.6 Michal Kubecek has been progressively porting ethtool from ioctls to a more structured and extensible netlink interface. Thanks to that we can now augment the old commands to carry statistics. When user specifies -I | --include-statistics on the command line (or the appropriate flag in netlink) kernel will include relevant statistics in its response, e.g. for flow control:

 # ethtool -I -a eth0
 Pause parameters for eth0:
 Autonegotiate:    off
 RX:        off
 TX:        on
 Statistics:
   tx_pause_frames: 25545561
   rx_pause_frames: 0

General statistics such as PHY and MAC counters are now available via ethtool -S under standard-based names though a new --groups switch, e.g.:

 # ethtool -S eth0 --groups eth-mac
 Standard stats for eth0:
 eth-mac-FramesTransmittedOK: 902623288966
 eth-mac-FramesReceivedOK: 28727667047
 eth-mac-FrameCheckSequenceErrors: 1
 eth-mac-AlignmentErrors: 0
 eth-mac-OutOfRangeLengthField: 0

Each of the commands supports JSON-formatted output for ease of parsing (--json).

So little, so late

Admittedly the new interface is quite basic. It mostly includes statistics provided in IEEE or IETF standards, and NICs may report more interesting data. There is also no metadata about “freshness” of the stats here, or filtering built into the interface.

The starting point is based on fulfilling immediate needs. We hope the interfaces will be extended as needed. Statistics can be made arbitrarily complex, so after a couple false-starts with complex interfaces we decided to let the use cases drive the interface.

It’s also very useful to lean on the standards for clear definition of the semantics. Going forward we can work with vendors on codifying the definitions of other counters they have.

List of currently supported stats

IEEE 802.3 attributes::

 30.3.2.1.5 aSymbolErrorDuringCarrier
 30.3.1.1.2 aFramesTransmittedOK
 30.3.1.1.3 aSingleCollisionFrames
 30.3.1.1.4 aMultipleCollisionFrames
 30.3.1.1.5 aFramesReceivedOK
 30.3.1.1.6 aFrameCheckSequenceErrors
 30.3.1.1.7 aAlignmentErrors
 30.3.1.1.8 aOctetsTransmittedOK
 30.3.1.1.9 aFramesWithDeferredXmissions
 30.3.1.1.10 aLateCollisions
 30.3.1.1.11 aFramesAbortedDueToXSColls
 30.3.1.1.12 aFramesLostDueToIntMACXmitError
 30.3.1.1.13 aCarrierSenseErrors
 30.3.1.1.14 aOctetsReceivedOK
 30.3.1.1.15 aFramesLostDueToIntMACRcvError
 
 30.3.1.1.18 aMulticastFramesXmittedOK
 30.3.1.1.19 aBroadcastFramesXmittedOK
 30.3.1.1.20 aFramesWithExcessiveDeferral
 30.3.1.1.21 aMulticastFramesReceivedOK
 30.3.1.1.22 aBroadcastFramesReceivedOK
 30.3.1.1.23 aInRangeLengthErrors
 30.3.1.1.24 aOutOfRangeLengthField
 30.3.1.1.25 aFrameTooLongErrors

 30.3.3.3 aMACControlFramesTransmitted
 30.3.3.4 aMACControlFramesReceived
 30.3.3.5 aUnsupportedOpcodesReceived
 
 30.3.4.2 aPAUSEMACCtrlFramesTransmitted
 30.3.4.3 aPAUSEMACCtrlFramesReceived

 30.5.1.1.17 aFECCorrectedBlocks
 30.5.1.1.18 aFECUncorrectableBlocks

IETF RMON (RFC 2819)

 etherStatsUndersizePkts
 etherStatsOversizePkts
 etherStatsFragments
 etherStatsJabbers

 etherStatsPkts64Octets
 etherStatsPkts65to127Octets
 etherStatsPkts128to255Octets
 etherStatsPkts256to511Octets
 etherStatsPkts512to1023Octets
 etherStatsPkts1024to1518Octets
 (incl. further stats for jumbo MTUs)

Kernel side changes: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=8203c7ce4ef2840929d38b447b4ccd384727f92b

Recent months saw a lot of changes in the venerable NAPI polling mechanism, which warrants a write up.

NAPI primer

NAPI (New API) is so old it could buy alcohol in the US – which is why I will not attempt a full history lesson. It is, however, worth clearing up one major misconception. The basic flow of NAPI operation does not involve any explicit processing delays.

Before we even talk about how NAPI works, however, we need a sidebar on software interrupts.

Software interrupts (softirq) or bottom halves are a kernel concept which helps decrease interrupt service latency. Because normal interrupts don't nest in Linux, the system can't service any new interrupt while it's already processing one. Therefore doing a lot of work directly in an IRQ handler is a bad idea. softirqs are a form of processing which allows the IRQ handler to schedule a function to run as soon as IRQ handler exits. This adds a tier of “low latency processing” which does not block hardware interrupts. If software interrupts start consuming a lot of cycles, however, kernel will wake up a ksoftirq thread to take over the I/O portion of the processing. This helps back-pressure the I/O, and makes sure random threads don't get their scheduler slice depleted by softirq work.

Now that we understand softirqs, this is what NAPI does:

  1. Device interrupt fires
  2. Interrupt handler masks the individual NIC IRQ which has fired (modern NICs mask their IRQs automatically)
  3. Interrupt handler “schedules” NAPI in softirq
  4. Interrupt handler exits
  5. softirq runs NAPI callback immediately (or less often in ksoftirqd)
  6. At the end of processing NAPI re-enables the appropriate NIC IRQ again

As you can see there is no explicit delay from IRQ firing to NAPI, or extra batching, or re-polling built in.

Problem

NAPI was designed several years before Intel released its first multi-core CPU. Today systems have tens of CPUs and all of the cores can have dedicated networking queues. Experiments show that separating network processing from application processing yields better application performance. That said manual tuning and CPU allocation for every workload is tedious and often not worth the effort.

In terms of raw compute throughput having many cores service interrupts means more interrupts (less batching) and more cache pollution. Interrupts are also bad for application latency. Application workers are periodically stopped to service networking traffic. It would be much better to let the application finish its calculations and then service I/O only once it needs more data.

Last but not least NAPI semi-randomly gets kicked out into the ksoftirqd thread which degrades the network latency.

Busy polling

Busy polling is a kernel feature which was originally intended for low latency processing. Whenever an application was out of work it could check if the NIC has any packets to service thus circumventing the interrupt coalescing delay.

Recent work by Bjorn Topel reused the concept to avoid application interruptions altogether. An application can now make a “promise” to the kernel that it will periodically check for new packets itself (kernel sets a timer/alarm to make sure application doesn't break that promise.) The application is expected to use busy polling to process packets, replacing the interrupt driven parts of NAPI.

For example the usual timeline of NAPI processing would look something like:

EVENTS            irq coalescing delay (e.g. 50us)
                  <--->
packet arrival | p  p   ppp   p   pp  p  p  p  pp   p  ppp  p
           IRQ |       X      X        X      X     X     X    
-----------------------------------------------------------------
CPU USE
          NAPI |       NN     NNN      N      N     NN    N
   application |AAAAAAA  AAAAA   AAA AA AAAAAA AAAAA   AA  AAA
                < process req 1    > <process req 2     > <proc..

[Forgive the rough diagram, one space is 10us, assume app needs 150us per req, A is time used by app, N by NAPI.]

With new busy polling we want to achieve this:

EVENTS
packet arrival | p  p   ppp   p   pp  p  p  p  pp   p  ppp  p
           IRQ |
-----------------------------------------------------------
CPU USE
          NAPI |               NNNN                  NNN
   application |AAAAAAAAAAAAAAA^     AAAAAAAAAAAAAAAA^    AA 
                < process req 1>     <process req 2 >     <proc..

Here the application does not get interrupted. Once it's done with a request it asks the kernel to process packets. This allows the app to improve the latency by the amount of time NAPI processing would steal from request processing.

The two caveats of this approach are:

  • application processing has to be on similar granularity as NAPI processing (the typical cycle shouldn't be longer than 200us)
  • the application itself needs to be designed with CPU mapping in mind, or to put it simply the app architecture needs to follow the thread per core design – since NAPI instances are mapped to cores and there needs to be a thread responsible for polling each NAPI

Threaded NAPI

For applications which don't want to take on the polling challenge a new “threaded” NAPI processing option was added (after years of poking from many teams).

Unlike normal NAPI which relies on the built-in softirq processing, threaded NAPI queues have their own threads which always do the processing. Conceptually it's quite similar to the ksoftirq thread, but:

  • it never does processing “in-line” right after hardware IRQ, it always wakes up the thread
  • it only processes NAPI, not work from other subsystems
  • there is a thread per NAPI (NIC queue pair), rather than thread per core

The main advantage of threaded NAPI is that the network processing load is visible to the CPU scheduler, allowing it to make better choices. In tests performed by Google NAPI threads were explicitly pinned to cores but the application threads were not.

TAPI (work in progress)

The main disadvantage of threaded NAPI is that according to my tests it in fact requires explicit CPU allocation, unless the system is relatively idle, otherwise NAPI threads suffer latencies similar to ksoftirq latencies.

The idea behind “TAPI” is to automatically pack and rebalance multiple instances of NAPI to each thread. The hope is that each thread reaches high enough CPU consumption to get a CPU core all to itself from the scheduler. Rather than having 3 threaded NAPI workers at 30% CPU each, TAPI would have one at 90% which services its 3 instances in a round robin fashion. The automatic packing and balancing should therefore remove the need to manually allocate CPU cores to networking. This mode of operation is inspired by Linux workqueues, but with higher locality and latency guarantees.

Unfortunately, due to upstream emergencies after initial promising results the TAPI work has been on hold for the last 2 months.

All modern NICs implement IRQ coalescing (ethtool -c/-C), which delays RX/TX interrupts hoping more frames arrive in the meantime to allow for batch processing. IRQ coalescing trades off latency for system throughput.

It’s commonly believed that the higher the packet rate, the more batching system needs to keep up. At lower rates the batching would be limited, anyway. This leads to the idea of adaptive IRQ coalescing where the NIC itself – or more likely the driver – adjusts the IRQ timeouts based on the recent rate of packet arrivals.

Unfortunately adaptive coalescing is not a panacea, as it often has predefined range of values it chooses from, and it costs extra CPU processing to continuously recalculate and update the rate (especially with modern NICs which often need to talk to firmware to change settings rather than simply writing to device registers).

The summary above – while correct (I hope :)) misses one important point. There are two sets of IRQ coalescing settings, and only one of them has a significant latency impact. NICs (and the Linux kernel ethtool API) have separate settings for RX and TX. While RX processing is more costly, and therefore playing with RX settings feels more significant – RX batching costs latency. For TX processing (or actually TX completion processing) the latency matters much, much less.

With a simple bpftrace command:

bpftrace -e 'tracepoint:napi:napi_poll { @[args->work] = count(); }'

we can check how many RX packets get received on every NAPI poll (that’s to say how many packets get coalesced). On a moderately loaded system the top entries may look something like:

@[4]: 750
@[3]: 2180
@[2]: 15828
@[1]: 233080
@[0]: 298525

where the first number (in square brackets) is the number of packets coalesced, and the second number is a counter of occurrences.

The 0 work done entries (@[0]: 298525) usually mean that the driver received a TX interrupt, and there were no RX packets to process. Drivers will generally clear their TX rings while doing RX processing – so with TX processing being less latency sensitive – in an ideal scenario we’d like to see no TX interrupts at all, but rather have TX processing piggy back on the RX interrupts.

How high can we set the TX coalescing parameters, then? If the workload is mostly using TCP all we really need to ensure is that we don’t run awry of TCP Small Queues (/proc/sys/net/ipv4/tcp_limit_output_bytes) which is the number of bytes TCP stack is willing to queue up to the NIC without getting a TX completion.

For example recent upstream kernels have TSQ of 1MB, so even with a 50GB NIC – delaying TX interrupts for up to 350us should be fine. Obviously we want to give ourselves a safety margin for scheduling delays, timer slack etc. Additionally, according to my experiments the gains of TX coalescing above 200us are perhaps too low for the risk.

Repeating the bpftrace command from above after setting coalescing to 150us / 128 frames:

@[4]: 831
@[3]: 2066
@[2]: 16056
@[0]: 177186
@[1]: 228985

We see far less @[0] occurrences compared to the first run. The gain in system throughput depends on the workload, I’ve seen an increase of 6% on the workload I tested with.

A word of warning – even though upstream reviewers try to make sure drivers behave sanely and return errors for unsupported configurations – there are vendors out there who will silently ignore TX coalescing settings...