NIC memory reserve

NIC drivers pre-allocate memory for received packets. Once packets arrive, the NIC can DMA them into the buffers, potentially hundreds of them, before host processing kicks in.

For efficiency reasons, each packet-processing CPU (in extreme cases every CPU on the system) will have its own set of packet queues, each with its own pre-allocated buffers.
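To make the pre-allocation concrete, here is a minimal sketch in plain C of what per-queue Rx memory amounts to: a ring of descriptors, each pointing at a buffer allocated up front so the device can DMA into it at any moment. The structure names and sizes are illustrative, not taken from any real driver.

```c
#include <stdlib.h>

#define RX_RING_ENTRIES 4096	/* queue depth */
#define RX_BUF_SIZE     8192	/* ~MTU rounded up to 2 pages */

struct rx_desc {
	void *buf;	/* pre-allocated packet buffer; a real descriptor
			 * also carries the DMA address, length, flags */
};

struct rx_queue {
	struct rx_desc ring[RX_RING_ENTRIES];
};

/* Every buffer is allocated before any traffic arrives; the memory is
 * consumed whether or not the ring ever fills up. */
static int rx_queue_init(struct rx_queue *q)
{
	for (unsigned int i = 0; i < RX_RING_ENTRIES; i++) {
		q->ring[i].buf = malloc(RX_BUF_SIZE);
		if (!q->ring[i].buf)
			return -1;
	}
	return 0;
}
```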

The amount of memory pre-allocated for Rx is a product of:

- buffer size
- number of queues
- queue depth (entries per queue)

A reasonable example in data centers would be:

8k * 32 queues * 4k entries = 1GB

Buffer size is roughly dictated by the MTU of the network; for a modern datacenter network 8k (2 pages) is likely the right ballpark figure. The number of queues depends on the number of cores on the system and the requests-per-second rate of the workload. 32 queues is a reasonable choice for the example (either 100+ threads or a network-heavy workload).

Last but not least – the queue depth. Because networking is bursty, and NAPI processing is at the whim of the scheduler (the latter is more of an issue in practice), queue depths of 4k or 8k entries are not uncommon.

Can we do better?

Memory is not cheap; having 1GB of memory sitting around unused 99% of the time has a real cost. If we were advising a NIC design (or had access to highly flexible devices like the Netronome/Corigine NICs) we could use the following scheme to save memory:

Normal processing rarely requires a queue depth of more than 512 entries. We could therefore have smaller dedicated queues, and a larger “reserve” – a queue from which every Rx queue can draw, but which requires additional synchronization on the host side. To achieve the equivalent of 4k entries we'd only need:

8k * 32 queues * 512 entries + 8k * 1 reserve * 4k entries = 160MB
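A quick sanity check of the two figures, using the sizes from the examples above:

```c
#include <stdio.h>

int main(void)
{
	unsigned long buf = 8192;	/* 8k buffers (2 pages) */
	unsigned long queues = 32;

	unsigned long dedicated_only = buf * queues * 4096;
	unsigned long with_reserve = buf * queues * 512 + buf * 4096;

	printf("dedicated rings only: %lu MB\n", dedicated_only >> 20); /* 1024 MB */
	printf("512 + shared reserve: %lu MB\n", with_reserve >> 20);   /*  160 MB */
	return 0;
}
```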

The NIC would try to use the 512 entries dedicated to each queue first, but if they run out (due to a packet burst or a scheduling delay) it could use the entries from the reserve. Bursts and latency spikes are rarely synchronized across the queues.
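Here is a rough sketch (hypothetical structures, not a real device interface) of how host-side buffer selection could work under such a scheme: take from the queue's small dedicated ring first, and only grab the lock and dip into the shared reserve once the dedicated ring has been drained.

```c
#include <pthread.h>
#include <stddef.h>

struct buf_ring {
	void **bufs;
	unsigned int head, tail, size;	/* size is a power of two */
};

static void *ring_pop(struct buf_ring *r)
{
	if (r->head == r->tail)
		return NULL;			/* ring is empty */
	return r->bufs[r->head++ & (r->size - 1)];
}

struct rx_queue {
	struct buf_ring dedicated;	/* e.g. 512 entries, owned by one queue */
};

struct rx_reserve {
	struct buf_ring shared;		/* e.g. 4k entries, shared by all queues */
	pthread_mutex_t lock;		/* the extra host-side synchronization */
};

static void *rx_get_buffer(struct rx_queue *q, struct rx_reserve *res)
{
	void *buf = ring_pop(&q->dedicated);	/* common case: dedicated entries */

	if (!buf) {				/* burst or scheduling delay */
		pthread_mutex_lock(&res->lock);
		buf = ring_pop(&res->shared);
		pthread_mutex_unlock(&res->lock);
	}
	return buf;				/* NULL means both ran dry: drop */
}
```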

Can we do worse?

In practice, memory savings are rarely top-of-mind for NIC vendors. Multiple drivers in Linux allocate a set of rings for every CPU thread in the system. I can only guess that this is to make sure iperf tests run without a hitch...

As we wait for vendors to improve their devices, double-check that the queue count and queue size you use are justified (ethtool -l and ethtool -g, respectively).