
Easy network performance wins with IRQ coalescing

All modern NICs implement IRQ coalescing (ethtool -c/-C), which delays RX/TX interrupts hoping more frames arrive in the meantime to allow for batch processing. IRQ coalescing trades off latency for system throughput.
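For reference, reading and tweaking the settings is done per interface (eth0 here is just a placeholder, and which parameters a given driver actually supports varies):

ethtool -c eth0                              # show the current coalescing settings
ethtool -C eth0 rx-usecs 50 rx-frames 64     # e.g. delay RX IRQs by up to 50us or 64 frames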

It’s commonly believed that the higher the packet rate, the more batching the system needs to keep up. At lower rates the batching would be limited, anyway. This leads to the idea of adaptive IRQ coalescing, where the NIC itself – or more likely the driver – adjusts the IRQ timeouts based on the recent rate of packet arrivals.

Unfortunately adaptive coalescing is not a panacea, as it often has a predefined range of values to choose from, and it costs extra CPU cycles to continuously recalculate and update the rate (especially with modern NICs, which often need to talk to firmware to change settings rather than simply writing to device registers).
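For drivers which do support it, adaptive mode is toggled through the same ethtool interface (eth0 again being a placeholder):

ethtool -C eth0 adaptive-rx on adaptive-tx off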

The summary above – while correct (I hope :)) – misses one important point. There are two sets of IRQ coalescing settings, and only one of them has a significant latency impact. NICs (and the Linux kernel ethtool API) have separate settings for RX and TX. While RX processing is more costly, and therefore playing with the RX settings feels more significant – it is RX batching that costs latency. For TX processing (or actually TX completion processing) the latency matters much, much less.
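Concretely, the two sets of knobs show up as separate rx-* and tx-* fields in the ethtool output (eth0 is a placeholder):

ethtool -c eth0 | grep -E '^(rx|tx)-(usecs|frames):'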

With a simple bpftrace command:

bpftrace -e 'tracepoint:napi:napi_poll { @[args->work] = count(); }'

we can check how many RX packets get received on every NAPI poll (that’s to say how many packets get coalesced). On a moderately loaded system the top entries may look something like:

@[4]: 750
@[3]: 2180
@[2]: 15828
@[1]: 233080
@[0]: 298525

where the first number (in square brackets) is the number of packets coalesced, and the second number is a counter of occurrences.
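On a machine with more than one NIC it may be worth narrowing the count down to the interface of interest. The napi_poll tracepoint carries the device name, so a filter along these lines should do it (eth0 is a placeholder):

bpftrace -e 'tracepoint:napi:napi_poll /str(args->dev_name) == "eth0"/ { @[args->work] = count(); }'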

The 0 work done entries (@[0]: 298525) usually mean that the driver received a TX interrupt and there were no RX packets to process. Drivers will generally clear their TX rings while doing RX processing – so with TX processing being less latency sensitive – in an ideal scenario we’d like to see no TX interrupts at all, but rather have TX processing piggyback on the RX interrupts.

How high can we set the TX coalescing parameters, then? If the workload is mostly using TCP, all we really need to ensure is that we don’t run afoul of TCP Small Queues (/proc/sys/net/ipv4/tcp_limit_output_bytes), which is the number of bytes the TCP stack is willing to queue up to the NIC without getting a TX completion.
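The limit is easy to check on a given system:

cat /proc/sys/net/ipv4/tcp_limit_output_bytes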

For example, recent upstream kernels have a TSQ limit of 1MB, so even with a 50Gbps NIC – delaying TX interrupts for up to 350us should be fine. Obviously we want to give ourselves a safety margin for scheduling delays, timer slack etc. Additionally, in my experiments the gains from TX coalescing above 200us are perhaps too small to be worth the risk.
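Setting the TX side could then look something like this (eth0 is a placeholder; drivers may round the values or only honor one of the two parameters):

ethtool -C eth0 tx-usecs 150 tx-frames 128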

Repeating the bpftrace command from above after setting coalescing to 150us / 128 frames:

@[4]: 831
@[3]: 2066
@[2]: 16056
@[0]: 177186
@[1]: 228985

We see far fewer @[0] occurrences compared to the first run. The gain in system throughput depends on the workload; I’ve seen an increase of 6% on the workload I tested with.

A word of warning – even though upstream reviewers try to make sure drivers behave sanely and return errors for unsupported configurations – there are vendors out there who will silently ignore TX coalescing settings...
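A quick way to catch that is to read the settings back after changing them – and, more reliably, re-run the napi_poll one-liner from above, since some devices will happily echo back values they don’t actually honor (eth0 is a placeholder):

ethtool -c eth0 | grep -E '^tx-(usecs|frames):'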