Recent months have seen a lot of changes in the venerable NAPI polling mechanism, which warrants a write-up.
NAPI (New API) is so old it could buy alcohol in the US – which is why I will not attempt a full history lesson. It is, however, worth clearing up one major misconception. The basic flow of NAPI operation does not involve any explicit processing delays.
Before we even talk about how NAPI works, however, we need a sidebar on software interrupts.
Software interrupts (softirqs), or bottom halves, are a kernel concept which helps decrease interrupt service latency. Because normal interrupts don't nest in Linux, the system can't service any new interrupt while it's already processing one; doing a lot of work directly in an IRQ handler is therefore a bad idea. Softirqs are a form of deferred processing which allows the IRQ handler to schedule a function to run as soon as the IRQ handler exits. This adds a tier of “low latency processing” which does not block hardware interrupts. If software interrupts start consuming a lot of cycles, however, the kernel will wake up the ksoftirqd thread to take over the I/O portion of the processing. This helps back-pressure the I/O, and makes sure random threads don't get their scheduler slice depleted by softirq work.
Now that we understand softirqs, this is what NAPI does:
- Device interrupt fires
- Interrupt handler masks the individual NIC IRQ which has fired (modern NICs mask their IRQs automatically)
- Interrupt handler “schedules” NAPI in softirq
- Interrupt handler exits
- softirq runs the NAPI callback immediately (or, less often, in ksoftirqd)
- At the end of processing NAPI re-enables the appropriate NIC IRQ again
As you can see, there is no explicit delay between the IRQ firing and NAPI running, no extra batching, and no re-polling built in.
NAPI was designed several years before Intel released its first multi-core CPU. Today systems have tens of CPUs, and all of the cores can have dedicated networking queues. Experiments show that separating network processing from application processing yields better application performance. That said, manual tuning and CPU allocation for every workload is tedious and often not worth the effort.
In terms of raw compute throughput, having many cores service interrupts means more interrupts (less batching) and more cache pollution. Interrupts are also bad for application latency: application workers are periodically stopped to service networking traffic. It would be much better to let the application finish its calculations and then service I/O only once it needs more data.
Last but not least, NAPI semi-randomly gets kicked out into the ksoftirqd thread, which degrades network latency.
Busy polling is a kernel feature which was originally intended for low latency processing. Whenever an application was out of work, it could check whether the NIC had any packets to service, thus circumventing the interrupt coalescing delay.
Recent work by Bjorn Topel reused the concept to avoid application interruptions altogether. An application can now make a “promise” to the kernel that it will periodically check for new packets itself (the kernel arms a timer to make sure the application doesn't break that promise). The application is expected to use busy polling to process packets, replacing the interrupt-driven parts of NAPI.
For example the usual timeline of NAPI processing would look something like:
EVENTS
  irq coalescing delay (e.g. 50us)
        <--->
  packet arrival | p  p ppp  p pp   p   p  p  pp  p  ppp   p
  IRQ            |      X       X       X     X      X     X
-----------------------------------------------------------------
CPU USE
  NAPI           |      NN      NNN     N     N      NN    N
  application    |AAAAAAA  AAAAA   AAA AA AAAAAA AAAAA  AA  AAA
                  < process req 1 >   <process req 2 >   <proc..
[Forgive the rough diagram, one space is 10us, assume app needs 150us per req, A is time used by app, N by NAPI.]
With new busy polling we want to achieve this:
EVENTS
  packet arrival | p  p ppp  p pp   p   p  p  pp  p  ppp   p
  IRQ            |
-----------------------------------------------------------
CPU USE
  NAPI           |               NNNN              NNN
  application    |AAAAAAAAAAAAAAA^   AAAAAAAAAAAAAAAA^  AA
                  < process req 1>   <process req 2 >   <proc..
Here the application does not get interrupted. Once it's done with a request it asks the kernel to process packets. This allows the app to improve the latency by the amount of time NAPI processing would steal from request processing.
The two caveats of this approach are:
- application processing has to have similar granularity to NAPI processing (the typical cycle shouldn't be longer than 200us)
- the application itself needs to be designed with CPU mapping in mind; to put it simply, the app architecture needs to follow the thread-per-core design, since NAPI instances are mapped to cores and there needs to be a thread responsible for polling each NAPI instance
For applications which don't want to take on the polling challenge a new “threaded” NAPI processing option was added (after years of poking from many teams).
Unlike normal NAPI which relies on the built-in softirq processing, threaded NAPI queues have their own threads which always do the processing. Conceptually it's quite similar to the ksoftirqd thread, but:
- it never does processing “in-line” right after hardware IRQ, it always wakes up the thread
- it only processes NAPI, not work from other subsystems
- there is a thread per NAPI (NIC queue pair), rather than thread per core
The main advantage of threaded NAPI is that the network processing load is visible to the CPU scheduler, allowing it to make better choices. In tests performed by Google NAPI threads were explicitly pinned to cores but the application threads were not.
TAPI (work in progress)
The main disadvantage of threaded NAPI is that, according to my tests, it in fact requires explicit CPU allocation unless the system is relatively idle; otherwise NAPI threads suffer latencies similar to ksoftirqd latencies.
The idea behind “TAPI” is to automatically pack and rebalance multiple instances of NAPI to each thread. The hope is that each thread reaches high enough CPU consumption to get a CPU core all to itself from the scheduler. Rather than having 3 threaded NAPI workers at 30% CPU each, TAPI would have one at 90% which services its 3 instances in a round robin fashion. The automatic packing and balancing should therefore remove the need to manually allocate CPU cores to networking. This mode of operation is inspired by Linux workqueues, but with higher locality and latency guarantees.
Unfortunately, after initial promising results, the TAPI work has been on hold for the last two months due to upstream emergencies.