KASan for ARM32 Decompression Stop

January 14, 2021

This is a retrospect of my work on the KASan Kernel Address Sanitizer for the ARM32 platform. The name is a pun on the diving decompression stop that is something you perform after going down below the surface to avoid decompression sickness.

Where It All Began

The AddressSanitizer (ASan) is a really clever invention by Google, hats off. It is one of those development tools that like git just take on the world in a short time. It was invented by some smart russians, especially Андрей Коновалов (Andrey Konovalov) and Дмитрий Вьюков (Dmitry Vyukov). It appears to be not just funded by Google but also part of a PhD thesis work.

The idea with ASan is to help ensure memory safety by intercepting all memory accesses through compiler instrumentation, and consequently providing “ASan splats” (runtime problem detections) while stressing the code. Code instrumented with ASan gets significantly slower than normal and uses up a bunch of memory for “shadowing” (I will explain this) making it a pure development tool: it is not intended to be enabled on production systems.

The way that ASan instruments code is by linking every load and store into symbols like these:

__asan_load1(unsigned long addr);
__asan_store1(unsigned long addr);
__asan_load2(unsigned long addr);
__asan_store2(unsigned long addr);
__asan_load4(unsigned long addr);
__asan_store4(unsigned long addr);
__asan_load8(unsigned long addr);
__asan_store8(unsigned long addr);
__asan_load16(unsigned long addr);
__asan_store16(unsigned long addr);

As you can guess these calls loads or stores 1, 2, 4, 8 or 16 bytes of memory in a chunk into the virtual address addr and reflects how the compiler thinks that the compiled code (usually C) thinks about these memory accesses. ASan intercepts all reads and writes by placing itself between the executing program and any memory management. The above symbols can be implemented by any runtime environment. The address will reflect what the compiler assumed runtime environment thinks about the (usually virtual) memory where the program will execute.

You will instrument your code with ASan, toss heavy test workloads on the code, and see if something breaks. If something breaks, you go and investigate the breakage to find the problem. The problem will often be one or another instance of buffer overflow, out-of-bounds array or string access, or the use of a dangling pointer such as use-after-free. These problems are a side effect of using the C programming language.

When resolving the mentioned load/store symbols, ASan instrumentation is based on shadow memory, and this is on turn based on the idea that one bit in a single byte “shadows” 8 bytes of memory, so you allocate 1/8 the amount of memory that your instrumented program will use and shadow that to some other memory using an offset calculation like this:

shadow = (address >> 3) + offset

The shadow memory is located at offset, and if our instrumented memory is N bytes then we need to allocate N/8 = N >> 3 bytes to be used as shadow memory. Notice that I say instrumented memory not code: ASan shadows not only the actual compiled code but mainly and most importantly any allocations and referenced pointers the code maintains. Also the DATA (contants) and BSS (global variables) part of the executable image are shadowed. To achive this the userspace links to a special malloc() implementation that overrides the default and manages all of this behind the scenes. One aspect of it is that malloc() will of course return chunks of memory naturally aligned to 8, so that the shadow memory will be on an even byte boundary.

The ASan shadow memory shadows the memory you're interested in inspecting for memory safety.

The helper library will allocate shadow memory as part of the runtime, and use it to shadow the code, data and runtime allocations of the program. Some will be allocated up front as the program is started, some will be allocated to shadow allocations at runtime, such as dynamically allocated buffers or anything else you malloc().

The error detection was based on the observation that a shadowing byte with each bit representing an out-of-bounds access error can have a “no error” state (0x00) and 8 error states, in total 9 states. Later on a more elaborate scheme was adopted. Values 1..7 indicate how many of the bytes are valid for access (if you malloc() just 5 bytes then it will be 5) and then there are magic bytes for different conditions.

When a piece of memory is legally allocated and accessed the corresponding bits are zeroed. Uninitialized memory is “poisoned”, i.e. set to a completely illegal value != 0. Further SLAB allocations are padded with “red zones” poisoning memory in front and behind of every legal allocation. When accessing a byte in memory, it is easy to verify that the access is legal: is the shadow byte == 0? That means all 8 bytes can be freely accessed and we can quickly proceed. Else we need a closer look. Values 1 thru 7 means bytes 1 thru 7 are valid for access (partly addressable) so we check that and any other values means uh oh.

0xFA and 0xFB means we have hit a heap left/right redzone so an out-of-bounds access has happened
0xFD means access to a free:ed heap region, so use-after-free
etc

Decoding the hex values gives a clear insight into what access violation we should be looking for.

To be fair the one bit per byte (8-to-1) mapping is not compulsory. This should be pretty obvious. Other schemes such as mapping even 32 bytes to one byte have been discussed for memory-constrained systems.

All memory access calls (such as any instance of dereferencing a pointer) and all functions in the library such as all string functions are patched to check for these conditions. It's easy when you have compiler instrumentation. We check it all. It is not very fast but it's bareable for testing.

Researchers in one camp argue that we should all be writing software (including operating systems) in the programming language Rust in order to avoid the problems ASan is trying to solve altogether. This is a good point, but rewriting large existing software such as the Linux kernel in Rust is not seen as realistic. Thus we paper over the problem instead of using the silver bullet. Hybrid approaches to using Rust in kernel development are being discussed but so far not much has materialized.

KASan Arrives

The AddressSanitizer (ASan) was written with userspace in mind, and the userspace project is very much alive.

As the mechanism used by ASan was quite simple, and the compilers were already patched to handle shadow memory, the Kernel Address Sanitizer (KASan) was a natural step. At this point (2020) the original authors seem to spend a lot of time with the kernel, so the kernel hardening project has likely outgrown the userspace counterpart.

The magic values assigned to shadow memory used by KASan is different:

0xFA means the memory has been free:ed so accessing it means use-after-free.
0xFB is a free:ed managed resources (devm_* accessors) in the Linux kernel.
0xFC and 0xFE means we access a kmalloc() redzone indicating an out-of-bounds access.

This is why these values often occur in KASan splats. The full list of specials (not very many) can be found in mm/kasan/kasan.h.

The crucial piece to create KASan was a compiler flag to indicate where to shadow the kernel memory: when the kernel Image is linked, addresses are resolved to absolute virtual memory locations, and naturally all of these, plus the area where kernel allocates memory (SLABs) at runtime need to be shadowed. As can be seen in the kernel Makefile.kasan include, this boils down to passing the flags -fsanitize=kernel-address and -asan-mapping-offset=$(KASAN_SHADOW_OFFSET) when linking the kernel.

The kernel already had some related tools, notably kmemcheck which can detect some cases of use-after-free and references to uninitialized memory. It was based on a slower mechanism so KASan has since effectively superceded it, as kmemcheck was removed.

KASan was added to the kernel in a commit dated february 2015 along with support for the x86_64 architecture.

To exercise the kernel to find interesting bugs, the inventors were often using syzkaller, a tool similar to the older Trinity: it bombs the kernel with fuzzy system calls to try to provoke undefined and undesired behaviours yielding KASan splats and revealing vulnerabilities.

Since the kernel is the kernel we need to explicitly assign memory for shadowing. Since we are the kernel we need to do some manouvers that userspace can not do or do not need to do:

During early initialization of the kernel we point all shadow memory to a single page of just zeroes making all accesses seem fine until we have proper memory management set up. Userspace programs do not need this phase as “someone else” (the C standard library) handles all memory set up for them.
Memory areas which are just big chunks of code and data can all point to a single physical page with poison. In the virtual memory it might look like kilobytes and megabytes of poison bytes but it all points to the same physical page of 4KB.
We selectively de-instrument code as well: code like KASan itself, the memory manager per se, or the code that patches the kernel for ftrace, or the code that unwinds the stack pointer for a kernel splat clearly cannot be instrumented with KASan: it is part of the design of these facilities to poke around at random locations in memory, it's not a bug. Since KASan was added all of these sites in the generic kernel code have been de-instrumented, more or less.

Once these generic kernel instrumentations were in place, other architectures could be amended with KASan support, with ARM64 following x86 soon in the autumn of 2015.

Some per-architecture code, usually found in arch/xxxx/mm/kasan_init.c is needed for KASan. What this code does is to initalize the shadow memory during early initialization of the virtual memory to point to a “zero page” and later on to populate all the shadow memory with poisoned shadow pages.

The shadow memory is special and needs to be populated accessing the very lowest layer of the virtual memory abstraction: we manipulate the page tables from the top to bottom: pgd, p4d, pud, pmd, pte to make sure that the $(KASAN_SHADOW_OFFSET) points to memory that has valid page table entries.

We need to use the kernel memblock early memory management to set up memory to hold the page tables themselves in some cases. The memblock memory manager also provide us with a list of all the kernel RAM: we loop over it using for_each_mem_range() and populate the shadow memory for each range. As mentioned we first point all shadows to a zero page, and later on to proper KASan shadow memory, and then KASan kicks into action.

A special case happens when moving from using the “zero page” KASan memory to proper shadow memory: we would risk running kernel threads into partially initialized shadow memory and pull the ground out under ourselves. Not good. Therefore the global page table for the entire kernel (the one that has all shadow memory pointing to a zero page) is copied and used during this phase. It is then replaced, finally, with the proper KASan-instrumented page table with pointers to the shadow memory in a single atomic operation.

Further all optimized memory manipulation functions from the standard library need to be patched as these often have assembly-optimized versions in the architecture. This concerns memcpy(), memmove() and memset() especially. While the quick optimized versions are nice for production systems, we replace these with open-coded variants that do proper memory accesses in C and therefore will be intercepted by KASan.

All architectures follow this pattern, more or less. ARM64 supports hardware tags, which essentially means that the architecture supports hardware acceleration of KASan. As this is pretty fast, there is a discussion about using KASan even on production systems to capture problems seen very seldom.

KASan on ARM32

Then there was the attempt to provide KASan for ARM32.

The very first posting of KASan in 2014 was actually targeting x86 and ARM32 and was already working-kind-of-prototype-ish on ARM32. This did not proceed. The main reason was that when using modules, these are loaded into a designated virtual memory area rather than the kernel “vmalloc area” which is the main area used for memory allocations and what most architectures use. So when trying to use loadable modules the code would crash as this RAM was not shadowed.

The developers tried to create the option to move modules into the vmalloc area and enable this by default when using KASan to work around this limitation.

The special module area is however used for special reasons. Since it was placed in close proximity to the main kernel TEXT segment, the code could be accessed using short jumps rather than long jumps: no need to load the whole 32-bit program counter anew whenever a piece of code loaded from a module was accessed. It made code in modules as quick as normal compiled-in kernel code +/– cache effects. This provided serious performance benefits.

As a result KASan support for ARM was dropped from the initial KASan proposal and the scope was limited to x86, then followed by ARM64. “We will look into this later”.

In the spring of 2015 I started looking into KASan and was testing the patches on ARM64 for Linaro. In june I tried to get KASan working on ARM32. Andrey Ryabinin pointed out that he actually had KASan running on ARM32. After some iterations we got it working on some ARM32 platforms and I was successfully stressing it a bit using the Trinity syscall fuzzer. This version solved the problem of shadowing the loadable modules by simply shadowing all that memory as well.

The central problem with running KASan on a 32-bit platform as opposed to a 64-bit platform was that the simplest approach used up 1/8 of the whole address space which was not a problem for 64-bit platforms that have ample virtual address space available. (Notice that the amount of physical memory doesn't really matter, as KASan will use the trick to just point large chunks of virtual memory to a single physical poison page.) On 32-bit platforms this approach ate our limited address space for lunch.

We were setting aside several static assigned allocations in the virtual address space, so we needed to make sure that we only shadow the addresses actually used by the kernel. We would not need to shadow the addresses used by userspace and the shadow memory virtual range requirement could thus be shrunk from 512 MB to 130 MB for the traditional 3/1 GB kernel/userspace virtual address split used on ARM32. (To understand this split, read my article How the ARM32 Kernel Starts which tries to tell the story.)

Sleeping Beauty

This more fine-grained approach to assigning shadow memory would create some devil-in-the-details bugs that will not come out if you shadow the whole virtual address space, as the 64-bit platforms do.

A random access to some memory that should not be poked (and thus lacking shadow memory) will lead to a crash. While QEMU and some hardware was certainly working, there were some real hardware platforms that just would not boot. Things were getting tedious.

KASan for ARM32 development ground to a halt because we were unable to hash out the bugs. The initial patches from Andrey started trading hands and these out-of-tree patches were used by some vendors to test code on some hardware.

Personally, I had other assignments and could not take over and develop KASan at this point. I'm not saying that I was necessarily a good candidate at the time either, I was just testing and tinkering with KASan because ARM32 vendors had a generic interest in the technology.

As a result, KASan for ARM32 was pending out-of-tree for almost 5 years. In 2017 Abbot Liu was working on it and fixed up the support for LPAE (large physical address extension) and in 2019 Florian Fainelli picked up where Abbot left off.

Some things were getting fixed along the road, but what was needed was some focused attention and these people had other things on their plate as well.

Finally Fixing the Bugs

In April 2020 I decided to pick up the patches and have a go at it. I sloppily named my first iteration “v2” while it was something like v7.

I quickly got support from two key people: Florian Fainelli and Ard Biesheuvel. Florian had some systems with the same odd behaviour of just not working as my pet Qualcomm APQ8060 DragonBoard that I had been using all along for testing. Ard was using the patches for developing and debugging things like EFI and KASLR.

During successive iterations we managed to find and patch the remaining bugs, one by one:

A hard-coded bitmask assuming thread size order to be 1 (4096 bytes) on ARMv4 and ARMv5 silicon made the kernel crash when entering userspace. KASan increases the thread order so that there would be space for redzones before and after allocations, so it needed more space. After reading assembly one line at the time I finally figured this out and patched it.
The code was switching MMU table by simply altering the TTBR0 register. This worked in some machines, especially ARMv7 silicon, but the right way to do it was to use the per-CPU macro cpu_switch_mm() which looks intuitive but is an ARM32-ism which is why the original KASan authors didn't know about it. This macro accounts for tiny differences between different ARM cores, some even custom to certain vendors.
Something fishy was going on with the attached device tree. It turns out, after much debugging, that the attached devicetree could end up in memory that was outside of the kernel 1:1 physical-to-virtual mapping. The page table entries that would have translated the physical memory area where the device tree was stored was wiped clean yielding a page fault. The problem was not caused by KASan per se: it was a result of the kernel getting over a certain size, and all the instrumentation added to the kernel makes it bigger to the point that it revealed the bug. We were en route to fix a bug related to big compressed kernel images. I developed debugging code specifically to find this bug and then made a patch for this making sure not to wipe that part of the mapping. (This post gives a detailed explanation of the problem.) Ard quickly came up with a better fix: let's move the device tree to determined place in the fixed mappings and handle it as if it was a ROM.

These were the major roadblocks. Fixing these bugs created new bugs which we also fixed. Ard and Florian mopped up the fallout.

In the middle of development, five level page tables were introduced and Mike Rapoport made some generic improvements and cleanup to the core memory management code, so I had to account for these changes as well, effectively rewriting the KASan ARM32 shadow memory initialization code. At one point I also broke the LPAE support and had to repair it.

Eventually the v16 patch set was finalized in october 2020 and submitted to Russell Kings patch tracker and he merged it for Linux v5.11-rc1.

Retrospect

After the fact three things came out nice in the design of KASan for ARM32:

We do not shadow or intercept highmem allocations, which is nice because we want to get rid of highmem altogether.
We do not shadow the userspace memory, which is nice because we want to move userspace to its own address space altogether.
Personally I finally got a detailed idea of how the ARM32 kernel decompresses and starts, and the abstract concepts of highmem, lowmem, and the rest of those wild animals. I have written three different articles on this blog as a result, with ideas for even more of them. By explaining how things work to others I realize what I can't explain and as a result I go and research it.

Andrey and Dmitry has since worked on not just ASan and KASan but also on what was intially called the KernelThreadSanitizer (KTSAN) but which was eventually merged under the name KernelConcurrencySanitizer (KCSAN). The idea is again to use shadow memory, but now for concurrency debugging at runtime. I do not know more than this.

FOSDEM Testing and Automation CfP

December 8, 2020

from metan's blog

We are organizing Testing and Automation devroom on FOSDEM this year again details at: https://fosdem-testingautomation.github.io/

C++ rvalue references

October 26, 2020

from joelfernandes

The writer works in the ChromeOS kernel team, where most of the system libraries, low-level components and user space is written in C++. Thus the writer has no choice but to be familiar with C++. It is not that hard, but some things are confusing. rvalue references are definitely confusing.

In this post, I wish to document rvalue references by simple examples, before I forget it.

Refer to this article for in-depth coverage on rvalue references.

In a nutshell: An rvalue reference can be used to construct a C++ object efficiently using a “move constructor”. This efficiency is achieved by the object's move constructor by moving the underlying memory of the object efficiently to the destination instead of a full copy. Typically the move constructor of the object will copy pointers within the source object into the destination object, and null the pointer within the source object.

An rvalue reference is denoted by a double ampersand (&&) when you want to create an rvalue reference as a variable.

For example T &&y; defines a variable y which holds an rvalue reference of type T. I have almost never seen an rvalue reference variable created this way in real code. I also have no idea when it can be useful. Almost always they are created by either of the 2 methods in the next section. These methods create an “unnamed” rvalue reference which can be passed to a class's move constructor.

When is an rvalue reference created?

In the below example, we create an rvalue reference to a vector, and create another vector object from this.

This can happen in 2 ways (that I know off):

1. Using std::move

This converts an lvalue reference to an rvalue reference.

Example:

#include <iostream>
#include <vector>

int main()
{
    int *px, *py;
    std::vector<int> x = {4,3};
    px = &(x[0]);
 
    // Convert lvalue 'x' to rvalue reference and pass
    // it to vector's overloaded move constructor.
    std::vector<int> y(std::move(x)); 
    py = &(y[0]);

    // Confirm the new vector uses same storage
    printf("same vector? : %d\n", px == py); // prints 1
}

2. When returning something from a function

The returned object from the function can be caught as an rvalue reference to that object.

#include <iostream>
#include <vector>

int *pret;
int *py;

std::vector<int> myf(int a)
{
    vector<int> ret;

    ret.push_back(a * a);

    pret = &(ret[0]);

    // Return is caught as an rvalue ref: vector<int> &&
    return ret;
}

int main()
{
    // Invoke vector's move constructor.
    std::vector<int> y(myf(4)); 
    py = &(y[0]);

    // Confirm the vectors share the same underlying storage
    printf("same vector? : %d\n", pret == py); // prints 1
}

Note on move asssignment

Interestingly, if you construct vector 'y' using the assignment operator: std::vector<int> y = myf(4);, the compiler may decide to use the move constructor automatically even though assignment is chosen. I believe this is because of vector's move assignment operator overload.

Further, the compiler may even not invoke a constructor at all and just perform RVO (Return Value Optimization).

Quiz

Question:

If I create a named rvalue reference using std::move and then use this to create a vector, the underlying storage of the new vector is different. Why?

#include <iostream>
#include <vector>

int *pret;
int *py;

std::vector<int> myf(int a)
{
    vector<int> ret;

    ret.push_back(a * a);

    pret = &(ret[0]);

    // Return is caught as an rvalue ref: vector<int> &&
    return ret;
}

int main()
{
    // Invoke vector's move constructor.
    std::vector<int>&& ref = myf(4);
    std::vector<int> y(ref); 
    py = &(y[0]);

    // Confirm the vectors share the same underlying storage
    printf("same vector? : %d\n", pret == py); // prints 0
}

Answer

The answer is: because the value category of the id-expression 'ref' is lvalue, the copy constructor will be chosen. To use the move constructor, it has to be std::vector<int> y(std::move(ref));.

Conclusion

rvalue references are confusing and sometimes the compiler can do different optimizations to cause further confusion. It is best to follow well known design patterns when designing your code. It may be best to also try to avoid rvalue references altogether but hopefully this article helps you understand it a bit more when you come across large C++ code bases.

XDP vs OVS

August 17, 2020

from David Ahern

Long overdue blog post on XDP; so many details uncovered during testing causing tests to be redone.

This post focuses on a comparison of XDP and OVS in delivering packets to a VM from the perspective of CPU cycles spent by the host in processing those packets. There are a lot of variables at play, and changing any one of them radically affects the outcome, though it should be no surprise XDP is always lighter and faster.

Setup

I believe I am covering all of the settings here that I discovered over the past few months that caused variations in the data.

Host

The host is a standard, modern server (Dell PowerEdge R640) with an Intel® Xeon® Platinum 8168 CPU @ 2.70GHz with 96 hardware threads (48 cores + hyper threading to yield 96 logical cpus in the host). The server is running Ubuntu 18.04 with the recently released 5.8.0 kernel. It has a Mellanox Connectx4-LX ethernet card with 2 25G ports into an 802.3ad (LACP) bond, and the bond is connected to an OVS bridge.

Host setup

As discussed in [1] to properly compare the CPU costs of the 2 networking solutions, we need to consolidate packet processing to a single CPU. Handling all packets destined to the same VM on the same CPU avoids lock contention on the tun ring, so consolidating packets to a single CPU is actually best case performance.

Ensure RPS is disabled in the host:

for d in eth0 eth1; do
    find /sys/class/net/${d}/queues -name rps_cpus |
    while read f; do
            echo 0 | sudo tee ${f}
    done
done

and add flow rules in the NIC to push packets for the VM under test to a single CPU:

sudo ethtool -N eth0 flow-type ether dst 12:34:de:ad:ca:fe action 2
sudo ethtool -N eth1 flow-type ether dst 12:34:de:ad:ca:fe action 2

For this host and ethernet card, packets for queue 2 are handled on CPU 5 (consult /proc/interrupts for the mapping on your host).

XDP bypasses the qdisc layer, so to have a fair comparison make noqueue the default qdisc before starting the VM:

sudo sysctl -w net.core.default_qdisc=noqueue

(or add a udev rule [2]).

Finally, the host is fairly quiet with only one VM running (the one under test) and very little network traffic outside of the VM under test and a few, low traffic ssh sessions used to run commands to collect data about the tests.

Virtual Machine

The VM has 8 cpus and is also running Ubuntu 18.04 with a 5.8.0 kernel. It uses tap+vhost for networking with the tap device a port in the OVS bridge as shown in the picture above. The tap device has a single queue, and RPS is also disabled in the guest:

echo 00 | sudo tee /sys/class/net/eth0/queues -name rps_cpus

The VM is also quiet with no load running in the guest OS.

The point of this comparison is host side processing of packets, so packets are dropped in the guest as soon as possible using a bpf program [3] attached to eth0 as a tc filter. (Note: Theoretically, XDP should be used to drop the packets in the guest OS since it truly is the fewest cycles per packet. However, XDP in the VM requires a multi-queue NIC[5], and adding queues to the guest NIC has a huge affect on the results.)

In the host, the qemu threads corresponding to the guest CPUs (vcpus) are affined (as a set) to 8 hardware threads in the same NUMA node as CPU 5 (the host CPU processing packets per the RSS rules mentioned earlier). The vhost thread for the VM's tap device is also affined to a small set of host CPUs in the same NUMA node to avoid scheduling collisions with the vcpu threads, the CPU processing packets (5) and its sibling hardware thread (CPU 53 in my case) – all of which add variability to the results.

Forwarding with XDP

Packet forwarding with XDP is done by attaching an L2 forwarding program [4] to eth0 and eth1. The program pulls the VLAN and destination mac from the ethernet header, and uses the pair as a key for a lookup in a hash map. The lookup returns the next device index for the packet which for packets destined to the VM is the index of its tap device. If an entry is found, the packet is redirected to the device via XDP_REDIRECT. The use case was presented in depth at netdevconf 0x14 [5].

Packet generator

Packets are generated using 2 VMs on a server that is directly connected to the same TOR switches as the hypervisor running the VM under test. The point of the investigation is to measure the overhead of delivering packets to a VM, so memcpy is kept to a minimum by having the packet generator [6] in the VMs send 1-byte UDP packets.

Test setup

Each VM can generate a little over 1 million packets per sec (1M pps), for a maximum load of 2.2M pps based on 2 separate source addresses.

CPU Measurement

As discussed in [1] a fair number of packets are processed in the context of some interrupted, victim process or when handled on an idle CPU the cycles are not fully accounted in the softirq time shown in tools like mpstat.

This test binds openssl speed, a purely userspace command[1], to the CPU handling packets to fully consume 100% of all CPU cycles which makes the division of CPU time between user, system and softirq more transparent. In this case, the output of mpstat -P 5 shows how all of the cycles for CPU 5 were spent (within the resolution of system accounting): * %softirq is the time spent handling packets. This data is shown in the graphs below. * %usr represents the usable CPU time for processes to make progress on their workload. In this test, it shows the percentage of CPU consumed by openssl and compares to the times shown by openssl within 1-2%. * %sys is the percentage of kernel time and for the data shown below was always <0.2%.

As an example, in this mpstat output openssl is only getting 14.2% of the CPU while 85.8% was spent handling the packet load:

CPU    %usr   %nice    %sys  %iowait   %irq   %soft   %idle
  5   14.20    0.00    0.00     0.00   0.00   85.80   0.00

(%steal, %guest and %gnice dropped were always 0 and dropped for conciseness.)

Let's get to the data.

CPU Comparison

This chart shows a comparison of the %softirq required to handle various PPS rates for both OVS and XDP. Lower numbers are better (higher percentages mean more CPU cycles).

1-VM softirq

There is 1-2% variability in ksoftirqd percentages despite the 5-second averaging, but the variability does not really affect the important points of this comparison.

The results should not be that surprising. OVS has well established scaling problems and the chart shows that as packet rates increase. In my tests it was not hard to saturate a CPU with OVS, reaching a maximum packet rate to the VM of 1.2M pps. The 100% softirq at 1.5M pps and up is saturation of ksoftirqd alone with nothing else running on that CPU. Running another process on CPU 5 immediately affects the throughput rate as the CPU splits time between processing packets and running that process. With openssl, the packet rate to the VM is cut in half with packet drops at the host ingress as it can no longer keep up with the packet rate given the overhead of OVS.

XDP on the other hand could push 2M pps to the VM before the guest could no longer keep up with packet drops at the tap device (ie., no room in the tun ring meaning the guest has not processed the previous packets). As shown above, the host still has plenty of CPU to handle more packets or run workloads (preferred condition for a cloud host).

One thing to notice about the chart above is the apparent flat lining of CPU usage between 50k pps and 500k pps. That is not a typo, and the results are very repeatable. This needs more investigation, but I believe it shows the efficiencies kicking in from a combination of more packets getting handled per napi poll cycle (closer to maximum of the netdev budget) and the kernel side bulking in XDP before a flush is required.

Hosts typically run more than 1 VM, so let's see the effect of adding a second VM to the mix. For this case a second VM is started with the same setup as mentioned earlier, but now the traffic load is split equally between 2 VMs. The key point here is a single CPU processing interleaved network traffic for 2 different destinations.

2-VM softirq

For OVS, CPU saturation with ksoftirqd happens with a maximum packet rate to each VM of 800k pps (compared to 1.2M with only a single VM). The saturation is in the host with packet drops shown at host ingress, and again any competition for the CPU processing packets cuts the rate in half.

Meanwhile, XDP is barely affected by the second VM with a modest increase of 3-4% in softirq at the upper packet rates. In this case, the redirected packets are just hitting separate bulking queues in the kernel. The two packet generators are not able to hit 4+M pps to find the maximum per-VM rate.

Final Thoughts

CPU cycles are only the beginning for comparing network solutions. A full OVS-vs-XDP comparison needs to consider all the resources consumed – e.g., memory as well as CPU. For example, OVS has ovs-vswitchd which consumes a high amount of memory (>750MB RSS on this server with only the 2 VMs) and additional CPU cycles to handle upcalls (flow misses) and revalidate flow entries in the kernel which on an active hypervisor can easily consume 50+% cpu (not counting increased usage from various bugs[7]).

Meanwhile, XDP is still early in its lifecycle. Right now, using XDP for this setup requires VLAN acceleration in the NIC [5] to be disabled meaning the VLAN header has to be removed by the ebpf program before forwarding to the VM. Using the proposed hardware hints solution reduces the softirq time by another 1-2% meaning 1-2% more usable CPU by leveraging hardware acceleration with XDP. This is just an example of how XDP will continue to get faster as it works better with hardware offloads.

Acronyms

LACP Link Aggregation Control Protocol NIC Nework Interface Card NUMA Non-Uniform Memory Access OVS Open VSwitch PPS Packets per Second RPS Receive Packet Steering RSS Receive Side Scaling TOR Top-of-Rack VM Virtual Machine XDP Express Data Path in Linux

References

[1] https://people.kernel.org/dsahern/the-cpu-cost-of-networking-on-a-host [2] https://people.kernel.org/dsahern/rss-rps-locking-qdisc [3] https://github.com/dsahern/bpf-progs/blob/master/ksrc/rx_acl.c [4] https://github.com/dsahern/bpf-progs/blob/master/ksrc/xdp_l2fwd.c [5] https://netdevconf.info/0x14/session.html?tutorial-XDP-and-the-cloud [6] https://github.com/dsahern/random-cmds/blob/master/src/pktgen.c [7] https://www.mail-archive.com/ovs-dev@openvswitch.org/msg39266.html

The Seccomp Notifier - Cranking up the crazy with bpf()

August 7, 2020

from Christian Brauner

In my last article I looked at the seccomp notifier in detail and how it allows us to make unprivileged containers way more capable (Sorry, kernel joke.). This is the (very) crazy (but very short) sequel. (Sorry Jon, no novella this time. :))

Last time I mentioned two new features that we had landed:

The 2. feature just landed in the merge window for v5.9. So what better time than now to boot a v5.9 pre-rc1 kernel and play with the new features.

I said that these features make it possible to intercept syscalls that return file descriptors or that pass file descriptors to the kernel. Syscalls that come to mind are open(), connect(), dup2(), but also bpf(). People that read the first blogpost might not have realized how crazy^serious one can get with these two new features so I thought it be a good exercise to illustrate it. And what better victim than bpf().

As we know, bpf() and unprivileged containers don't get along too well. But that doesn't need to be the case. For the demo you're about to see I enabled LXD to supervise the bpf() syscalls for tasks running in unprivileged containers. We will intercept the bpf() syscalls for the BPF_PROG_LOAD command for BPF_PROG_TYPE_CGROUP_DEVICE program types and the BPF_PROG_ATTACH, and BPF_PROG_DETACH commands for the BPF_CGROUP_DEVICE attach type. This allows a nested unprivileged container to load its own device profile in the cgroup2 hierarchy.

This is just a tiny glimpse into how this can be used and extended. ;) The pull request for LXD is already up here. Let's see if the rest of the team thinks I'm going crazy. :)

RSS/RPS + locking qdisc

August 6, 2020

from David Ahern

I recently learned this fun fact: With RSS or RPS enabled [1] and a lock-based qdisc on a VM's tap device (e.g., fq_codel) a UDP packet storm targeted at the VM can severely impact the entire server.

The point of RSS/RPS is to distribute the packet processing load across all hardware threads (CPUs) in a server / host. However, when those packets are forwarded to a single device that has a lock-based qdisc (e.g., virtual machines and a tap device or a container and veth based device) that distributed processing causes heavy spinlock contention resulting in ksoftirqd spinning on all CPUs trying to handle the packet load.

As an example, my server has 96 cpus and 1 million udp packets per second targeted at the VM is enough to push all of the ksoftirqd threads to near 100%:

  PID %CPU COMMAND               P
   58 99.9 ksoftirqd/9           9
  128 99.9 ksoftirqd/23         23
  218 99.9 ksoftirqd/41         41
  278 99.9 ksoftirqd/53         53
  318 99.9 ksoftirqd/61         61
  328 99.9 ksoftirqd/63         63
  358 99.9 ksoftirqd/69         69
  388 99.9 ksoftirqd/75         75
  408 99.9 ksoftirqd/79         79
  438 99.9 ksoftirqd/85         85
 7411 99.9 CPU 7/KVM            64
   28 99.9 ksoftirqd/3           3
   38 99.9 ksoftirqd/5           5
   48 99.9 ksoftirqd/7           7
   68 99.9 ksoftirqd/11         11
   78 99.9 ksoftirqd/13         13
   88 99.9 ksoftirqd/15         15
   ...

perf top shows the spinlock contention:

    96.79%  [kernel]          [k] queued_spin_lock_slowpath
     0.40%  [kernel]          [k] _raw_spin_lock
     0.23%  [kernel]          [k] __netif_receive_skb_core
     0.23%  [kernel]          [k] __dev_queue_xmit
     0.20%  [kernel]          [k] __qdisc_run

With the callchain leading to

    94.25%  [kernel.vmlinux]    [k] queued_spin_lock_slowpath
            |
             --94.25%--queued_spin_lock_slowpath
                       |
                        --93.83%--__dev_queue_xmit
                                  do_execute_actions
                                  ovs_execute_actions
                                  ovs_dp_process_packet
                                  ovs_vport_receive

A little code analysis shows this is the qdisc lock in __dev_xmit_skb.

The overloaded ksoftirqd threads means it takes longer to process packets resulting in budget limits getting hit and packet drops at ingress. The packet drops can cause ssh sessions to stall or drop or cause disruptions in protocols like LACP.

Changing the qdisc on the device to a lockless one (e.g., noqueue) dramatically lowers the ksoftirqd load. perf top still shows the hot spot as a spinlock:

    25.62%  [kernel]          [k] queued_spin_lock_slowpath
     6.87%  [kernel]          [k] tasklet_action_common.isra.21
     3.28%  [kernel]          [k] _raw_spin_lock
     3.15%  [kernel]          [k] tun_net_xmit

but this time it is the lock for the tun ring:

    25.10%  [kernel.vmlinux]    [k] queued_spin_lock_slowpath
            |
             --25.05%--queued_spin_lock_slowpath
                       |
                        --24.93%--tun_net_xmit
                                  dev_hard_start_xmit
                                  __dev_queue_xmit
                                  do_execute_actions
                                  ovs_execute_actions
                                  ovs_dp_process_packet
                                  ovs_vport_receive

which is a much lighter lock in the sense of the amount of work done with the lock held.

systemd commit e6c253e363dee, released in systemd 217, changed the default qdisc from pfifo_fast (kernel default) to fq_codel (/usr/lib/sysctl.d/50-default.conf for Ubuntu). As of v5.8 kernel fq_codel still has a lock to enqueue packets, so systems using fq_codel with RSS/RPS are hitting this lock contention which affects overall system performance. pfifo_fast is lockless as of v4.16 so for newer kernels the kernel's default is best.

But, it begs the question why have a qdisc for a VM tap device (or a container's veth device) at all? To the VM a host is just part of the network. You would not want a top-of-rack switch to buffer packets for the server, so why have the host buffer packets for a VM? (The “Tx” path for a tap device represents packets going to the VM.)

You can change the default via:

sysctl -w net.core.default_qdisc=noqueue

or add that to a sysctl file (e.g., /etc/sysctl.d/90-local.conf). sysctl changes affect new devices only.

Alternatively, the default can be changed for selected devices via a udev rule:

cat > /etc/udev/rules.d/90-tap.rules <<EOF
ACTION=="add|change", SUBSYSTEM=="net", KERNEL=="tap*", PROGRAM="/sbin/tc qdisc add dev $env{INTERFACE} root handle 1000: noqueue"

Running sudo udevadm trigger should update existing devices. Check using tc qdisc sh dev <NAME>:

$ tc qdisc sh dev tapext4798884
qdisc noqueue 1000: root refcnt 2
qdisc ingress ffff: parent ffff:fff1 ----------------

[1] https://www.kernel.org/doc/Documentation/networking/scaling.txt

Seccomp Notify - New Frontiers in Unprivileged Container Development

July 23, 2020

from Christian Brauner

Introduction

As most people know by know we do a lot of upstream kernel development. This stretches over multiple areas and of course we also do a lot of kernel work around containers. In this article I'd like to take a closer look at the new seccomp notify feature we have been developing both in the kernel and in userspace and that is seeing more and more users. I've talked about this feature quite a few times at various conferences (just recently again at OSS NA) over the last two years but never actually sat down to write a blogpost about it. This is something I had wanted to do for quite some time. First, because it is a very exciting feature from a purely technical perspective but also from the new possibilities it opens up for (unprivileged) containers and other use-cases.

The Limits of Unprivileged Containers

That (Linux) Containers are a userspace fiction is a well-known dictum nowadays. It simply expresses the fact that there is no container kernel object in the Linux kernel. Instead, userspace is relatively free to define what a container is. But for the most part userspace agrees that a container is somehow concerned with isolating a task or a task tree from the host system. This is achieved by combining a multitude of Linux kernel features. One of the better known kernel features that is used to build containers are namespaces. The number of namespaces the kernel supports has grown over time and we are currently at eight. Before you go and look them up on namespaces(7) here they are:

cgroup: cgroup_namespaces(7)
ipc: ipc_namespaces(7)
network: network_namespaces(7)
mount: mount_namespaces(7)
pid: pid_namespaces(7)
time: time_namespaces(7)
user: user_namespaces(7)
uts: uts_namespaces(7)

Of these eight namespaces the user namespace is the only one concerned with isolating core privilege concepts on Linux such as user- and group ids, and capabilities.

Quite often we see tasks in userspace that check whether they run as root or whether they have a specific capability (e.g. CAP_MKNOD is required to create device nodes) and it seems that when the answer is “yes” then the task is actually a privileged task. But as usual things aren't that simple. What the task thinks it's checking for and what the kernel really is checking for are possibly two very different things. A naive task, i.e. a task not aware of user namespaces, might think it's asking whether it is privileged with respect to the whole system aka the host but what the kernel really checks for is whether the task has the necessary privileges relative to the user namespace it is located in.

In most cases the kernel will not check whether the task is privileged with respect to the whole system. Instead, it will almost always call a function called ns_capable() which is the kernel's way of checking whether the calling task has privilege in its current user namespace.

For example, when a new user namespace is created by setting the CLONE_NEWUSER flag in unshare(2) or in clone3(2) the kernel will grant a full set of capabilities to the task that called unshare(2) or the newly created child task via clone3(2) within the new user namespace. When this task now e.g. checks whether it has the CAP_MKNOD capability the kernel will report back that it indeed has that capability. The key point though is that this “yes” is not a global “yes”, i.e. the question “Am I privileged enough to perform this operation?” only applies to the current user namespace (and technically any nested user namespaces) not the host itself.

This distinction is important when trying to understand why a task running as root in a new user namespace with all capabilities raised will still see EPERM when e.g. trying to call mknod("/dev/mem", makedev(1, 1)) even though it seems to have all necessary privileges. The reason for this counterintuitive behavior is that the kernel isn't always checking whether you are privileged against your current user namespace. Instead, for any operation that it thinks is dangerous to expose to unprivileged users it will check whether the task is privileged in the initial user namespace, i.e. the host's user namespace.

Creating device nodes is one such example: if a task running in a user namespace were to be able to create character or block device nodes it could e.g. create /dev/kmem or any other critical device and use the device to take over the host. So the kernel simply blocks creating all device nodes in user namespaces by always performing the check for required privileges against the initial user namespace. This is of course technically inconsistent since capabilities are per user namespace as we observed above.

Other examples where the kernel requires privileges in the initial user namespace are mounting of block devices. So simply making a disk device node available to an unprivileged container will still not make it useable since it cannot mount it. On the other hand, some filesystems like cgroup, cgroup2, tmpfs, proc, sysfs, and fuse can be mounted in user namespace (with some caveats for proc and sys but we're ignoring those details for now) because the kernel can guarantee that this is safe.

But of course these restrictions are annoying. Not being able to mount block devices or create device nodes means quite a few workloads are not able to run in containers even though they could be made to run safely. Quite often a container manager like LXD will know better than the kernel when an operation that a container tries to perform is safe.

A good example are device nodes. Most containers bind-mount the set of standard devices into the container otherwise it would not work correctly:

/dev/console
/dev/full
/dev/null
/dev/random
/dev/tty
/dev/urandom
/dev/zero

Allowing a container to create these devices would be safe. Of course, the container will simply bind-mount these devices during container startup into the container so this isn't really a serious problem. But any program running inside the container that wants to create these harmless devices nodes would fail.

The other example that was mentioned earlier is mounting of block-based filesystems. Our users often instruct LXD to make certain disk devices available to their containers because they know that it is safe. For example, they could have a dedicated disk for the container or they want to share data with or among containers. But the container could not mount any of those disks.

For any use-case where the administrator is aware that a device node or disk device is missing from the container LXD provides the ability to hotplug them into one or multiple containers. For example, here is how you'd hotplug /dev/zero into a running container:

 brauner@wittgenstein|~
> lxc exec f5 -- ls -al /my/zero

brauner@wittgenstein|~
> lxc config device add f5 zero-device unix-char source=/dev/zero path=/my/zero
Device zero-device added to f5

brauner@wittgenstein|~
> lxc exec f5 -- ls -al /my/zero
crw-rw---- 1 root root 1, 5 Jul 23 10:47 /my/zero

But of course, that doesn't help at all when a random application inside the container calls mknod(2) itself. In these cases LXD has no way of helping the application by hotplugging the device as it's unaware that a mknod syscall has been performed.

So the root of the problem seems to be: – A task inside the container performs a syscall that will fail. – The syscall would not need to fail since the container manager knows that it is safe. – The container manager has no way of knowing when such a syscall is performed. – Even if the the container manager would know when such a syscall is performed it has no way of inspecting it in detail.

So a potential solution to this problem seems to be to enable the container manager or any sufficiently privileged task to take action on behalf of the container whenever it performs a syscall that would usually fail. So somehow we need to be able to interact with the syscalls of another task.

Seccomp – The Basics of Syscall Interception

The obvious candidate to look at is seccomp. Short for “secure computing” it provides a way of restricting the syscalls of a task either by allowing only a subset of the syscalls the kernel supports or by denying a set of syscalls it thinks would be unsafe for the task in question. But seccomp allows even more advanced configurations through so-called “filters”. Filters are BPF programs (Not to be equated with eBPF. BPF is a predecessor of eBPF.) that can be written in userspace and loaded into the kernel. For example, a task could use a seccomp filter to only allow the mount() syscall and only those mount syscalls that create bind mounts. This simple syscall management mechanism has made seccomp an essential security feature for a lot of userspace programs. Nowadays it is considered good practice to restrict any critical programs to only those syscalls it absolutely needs to run successfully. Browser-based sandboxes and containers being prime examples but even systemd services can be seccomp restricted.

At its core seccomp is nothing but a syscall interception mechanism. One way or another every operating system has something that is at least roughly comparable. The way seccomp works is that it intercepts syscalls right in the architecture specific syscall entry paths. So the seccomp invocations themselves live in the architecture specific codepaths although most of the logical around it is architecture agnostic.

Usually, when a syscall is performed, and no seccomp filter has been applied to the task issuing the syscall the kernel will simply lookup the syscall number in the architecture specific syscall table and if it is a known syscall will perform it reporting back the result to userspace.

But when a seccomp filter is loaded for the task issuing the syscall instead of directly looking up the syscall number in the architecture's syscall table the kernel will first call into seccomp and run the loaded seccomp filter.

Depending on whether a deny or allow approach is used for the seccomp filter any syscall that the filter is not handling specifically is either performed or denied reporting back a specified default value to the calling task. If the requested syscall is supposed to be specifically handled by the seccomp filter the kernel can e.g. be caused to report back a specific error code. This way, it is for example possible to have the kernel pretend like it doesn't know the mount(2) syscall by creating a seccomp filter that reports back ENOSYS whenever the task tries to call mount(2).

But the way seccomp used to work isn't very dynamic. Specifically, once a filter is loaded the decision whether or not the syscall is successful or not is fixed based on the policy expressed by the filter. So there is no way to make a case-by-case decision which might come in handy in some scenarios.

In addition seccomp itself can't make a syscall actually succeed other than in the trivial way of reporting back success to the caller. So seccomp will only allow the kernel to pretend that a syscall succeeded. So while it is possible to instruct the kernel to return 0 for the mount(2) syscall it cannot actually be instructed to make the mount(2) syscall succeed. So just making the seccomp filter return 0 for mounting a dedicated ext4 disk device to /mnt will still not actually mount it at /mnt; it just pretends to the caller that it did. Of course that is in itself already a useful property for a bunch of use-cases but it doesn't really help with the mknod(2) or mount(2) problem outlined above.

Extending Seccomp

So from the section above it should be clear that seccomp provides a few desirable properties that make it a natural candiate to look at to help solve our mknod(2) and mount(2) problem. Since seccomp intercepts syscalls early in the syscall path it already gives us a hook into the syscall path of a given task. What is missing though is a way to bring another task such as the LXD container manager into the picture. Somehow we need to modify seccomp in a way that makes it possible for a container manager to not just be informed when a task inside the container performs a syscall it wants to be informed about but also how to make it possible to block the task until the container manager instructs the kernel to allow it to proceed.

The answer to these questions is seccomp notify. This is as good a time as any to bring in some historical context. The exact origins of the idea for a more dynamic way to intercept syscalls is probably not recoverable and it has been thrown around in unspecific form in various discussions but nothing serious every materialized. The first concrete details around seccomp notify were conceived in early 2017 in the LXD team. The first public talk around the basic idea for this feature was given by Stéphane Graber at the Linux Plumbers Conference 2017 during the Container's Microconference in Los Angeles. The details of this talk are still listed here here and I'm sure Stéphane can still provide the slides we came up with. I didn't find a video recording even though I somehow thought we did have one. If someone is really curious I can try to investigate with the Linux Plumbers committee. After this talk implementation specifics were discussed in a hallway meeting later that day. And after a long arduous journey the implementation was upstreamed by Tycho Andersen who used to be on the LXD team. The rest is history^wchangelog.

Seccomp Notify – Syscall Interception 2.0

In its essence, the seccomp notify mechanism is simply a file descriptor (fd) for a specific seccomp filter. When a container starts it will usually load a seccomp filter to restrict its attack surface. That is even done for unprivileged containers even though it is not strictly necessary.

With the addition of seccomp notify a container wishing to have a subset of syscalls handled by another process can set the new SECCOMP_RET_USER_NOTIF flag on its seccomp filter. This flag instructs the kernel to return a file descriptor to the calling task after having loaded its filter. This file descriptor is a seccomp notify file descriptor.

Of course, the seccomp notify fd is not very useful to the task itself. First, since it doesn't make a lot of sense apart from very weird use-cases for a task to listen for its own syscalls. Second, because the task would likely block itself indefinitely pretty quickly without taking extreme care.

But what the task can do with the seccomp notify fd is to hand to another task. Usually the task that it will hand the seccomp notify fd to will be more privileged than itself. For a container the most obvious candidate would be the container manager of course.

Since the seccomp notify fd is pollable it is possible to put it into an event loop such as epoll(7), poll(2), or select(2) and wait for the file descriptor to become readable, i.e. for the kernel to return EPOLLIN to userspace. For the seccomp notify fd to become readable means that the seccomp filter it refers to has detected that one of the tasks it has been applied to has performed a syscall that is part of the policy it implements. This is a complicated way of saying the kernel is notifying the container manager that a task in the container has performed a syscall it cares about, e.g. mknod(2) or mount(2).

Put another way, this means the container manager can listen for syscall events for tasks running in the container. Now instead of simply running the filter and immediately reporting back to the calling task the kernel will send a notification to the container manager on the seccomp notify fd and block the task performing the syscall.

After the seccomp notify fd indicates that it is readable the container manager can use the new SECCOMP_IOCTL_NOTIF_RECV ioctl() associated with seccomp notify fds to read a struct seccomp_notif message for the syscall. Currently the data to be read from the seccomp notify fd includes the following pieces. But please be aware that we are in the process of discussing potentially intrusive changes for future versions:

struct seccomp_notif {
	__u64 id;
	__u32 pid;
	__u32 flags;
	struct seccomp_data data;
};

Let's look at this in a little more detail. The pid field is the pid of the task that performed the syscall as seen in the caller's pid namespace. To stay within the realm of our current examples, this is simply the pid of the task in the container the e.g. called mknod(2) as seen in the pid namespace of the container manager. The id field is a unique identifier for the performed syscall. This can be used to verify that the task is still alive and the syscall request still valid to avoid any race conditions caused by pid recycling. The flags argument is currently unused and reserved for future extensions.

The struct seccomp_data argument is probably the most interesting one as it contains the really exciting bits and pieces:

struct seccomp_data {
	int nr;
	__u32 arch;
	__u64 instruction_pointer;
	__u64 args[6];
};

The int field is the syscall number which can only be correctly interpreted relative to the arch field. The arch field is the (audit) architecture for which this syscall was made. This field is very relevant since compatible architectures (For the x86 architectures this encompasses at least x32, i386, and x86_64. The arm, mips, and power architectures also have compatible “sub” architectures.) are stackable and the returned syscall number might be different than the current headers imply (For example, you could be making a syscall from a 32bit userspace on a 64bit kernel. If the intercepted syscall has different syscall numbers on 32 bit and on 64bit, for example syscall foo() might have syscall number 1 on 32 bit and 2 on 64 bit. So the task reading the seccomp data can't simply assume that since it itself is running in a 32 bit environment the syscall number must be 1. Rather, it must check what the audit arch is and then either check that the value of the syscall is 1 on 32 bit and 2 on 64 bit. Otherwise the container manager might end up emulating mount() when it should be emulating mknod().). The instruction_pointer is set to the address of the instruction that performed the syscall. This is of course also architecture specific. And last the args member are the syscall arguments that the task performed the syscall with.

The args need to be interpreted and treated differently depending on the syscall layout and their type. If they are non-pointer arguments (unsigned int etc.) they can be copied into a local variable and interpreted right away. But if they are pointer arguments they are offsets into the virtual memory of the task that performed the syscall. In the latter case the memory needs to be read and copied before it can be interpreted.

Let's look at a concrete example to figure out why it is vital to know the syscall layout other than for knowing the types of the syscall arguments. Say the performed syscall was mount(2). In order to interpret the args field correctly we look at the syscall layout of mount(). (Please note, that I'm stressing that we need to look at the layout of syscall and the only reliable source for this is actually the kernel source code. The Linux manpages often list the wrapper provided by the system's libc and these wrapper do not necessarily line-up with the syscall itself (compare the waitid() wrapper and the waitid() syscall or the various clone() syscall layouts).) From the layout of mount(2) we see that args[0] is a pointer argument identifying the source path, args[1] is another pointer argument identifying the target path, args[2] is a pointer argument identifying the filesystem type, args[3] is a non-pointer argument identifying the options, and args[4] is another pointer argument identifying additional mount options.

So if we were to be interested in the source path of this mount(2) syscall we would need to open the /proc/<pid>/mem file of the task that performed this syscall and e.g. use the pread(2) function with args[0] as the offset into the task's virtual memory and read it into a buffer at least the length of a standard path. Alternatively, we can use a single syscall like process_vm_readv(2) to read multiple remote pointers at different locations all in one go. Once we have done this we can interpret it.

A friendly advice: in general it is a good idea for the container manager to read all syscall arguments once into a local buffer and base its decisions on how to proceed on the data in this local buffer. Not just because it will otherwise not be able for the container manager to interpret pointer arguments but it's also a possible attack vector since a sufficiently privileged attacker (e.g. a thread in the same thread-group) can write to /proc/<pid>/mem and change the contents of e.g. args[0] or any other syscall argument. Also note, that the container manager should ensure that /proc/<pid> still refers to the same task after opening it by checking the validity of the syscall request via the id field and the associated SECCOMP_IOCTL_NOTIF_ID_VALID ioctl() to exclude the possibility of the task having exited, been reaped and its pid having been recycled.

But let's assume we have done all that. Now that the container manager has the task's syscall arguments available in a local buffer it can interpret the syscall arguments. While it is doing so the target task remains blocked waiting for the kernel to tell it to proceed. After the container manager is done interpreting the arguments and has performed whatever action it wanted to perform it can use the SECCOMP_IOCTL_NOTIF_SEND ioctl() on the seccomp notify fd to tell the kernel what it should do with the blocked task's syscall. The response is given in the form struct seccomp_notif_resp:

struct seccomp_notif_resp {
	__u64 id;
	__s64 val;
	__s32 error;
	__u32 flags;
};

Let's look at this struct in a little more detail too. The id field is set to the id of the syscall request to respond to and should correspond to the received id in the struct seccomp_notif that the container manager read via the SECCOMP_IOCTL_NOTIF_RECV ioctl() when the seccomp notify fd became readable. The val field is the return value of the syscall and is only set if the error field is set to 0. The error field is the error to return from the syscall and should be set to a negative errno(3) code if the syscall is supposed to fail (For example, to trick the caller into thinking that mount(2) is not supported on this kernel set error to -ENOSYS.). The flags value can be used to tell the kernel to continue the syscall by setting the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag which I added to be able to intercept mount(2) and other syscalls that are difficult for seccomp to filter efficiently because of the restrictions around pointer arguments. More on that in a little bit.

With this machinery in place we are for now ;) done with the kernel bits.

Emulating Syscalls In Userspace

So what is the container manager supposed to do after having read and interpreted the syscall information for the task running in the container and telling the kernel to let the task continue. Probably emulate it. Otherwise we just have a fancy and less performant seccomp userspace policy (Please read my comments on why that is a very bad idea.).

Emulating syscalls in userspace is not a very new thing to do. It has been done for a long time. For example, libc's can choose to emulate the execveat(2) syscall which allows a task to exec a program by providing a file descriptor to the binary instead of a path. On a kernel that doesn't support the execveat(2) syscall the libc can emulate it by calling exec(3) with the path set to /proc/self/fd/<nr>. The problem of course is that this emulation only works when the task in question actually uses the libc wrapper (fexecve(3) for our example). Any task using syscall(__NR_execveat, [...]) to perform the syscall without going through the provided wrapper will be bypassing libc and so libc doesn't know that the task wants to perform the execveat(2) syscall and will not be able to emulate it in case the kernel doesn't support it.

Seccomp notify doesn't suffer from this problem since its syscall interception abilities aren't located in userspace at the library level but directly in the syscall path as we have seen. This greatly expands the abilities to emulate syscalls.

So now we have all the kernel pieces in place to solve our mknod(2) and mount(2) problem in unprivileged containers. Instead of simply letting the container fail on such harmless requests as creating the /dev/zero device node we can use seccomp notify to intercept the syscall and emulate it for the container in userspace by simply creating the device node for it. Similarly, we can intercept mount(2) requests requiring the user to e.g. give us a list of allowed filesystems to mount for the container and performing the mount for the container. We can even make this a lot safer by providing a user with the ability to specify a fuse binary that should be used when a task in the container tries to mount a filesystem. We actually support this feature in LXD. Since fuse is a safe way for unprivileged users to mount filesystems rewriting mount(2) requests is a great way to expose filesystems to containers.

In general, the possibilities of seccomp notify can't be overstated and we are extremely happy that this work is now not just fully integrated into the Linux kernel but also into both LXD and LXC. As with many other technologies we have driven both in the upstream kernel and in userspace it directly benefits not just our users but all of userspace with seccomp notify seeing adoption in browsers and by other companies. A whole range of Travis workloads can now run in unprivileged LXD containers thanks to seccomp notify.

Seccomp Notify in action – LXD

After finishing the kernel bits we implemented support for it in LXD and the LXC shared library it uses. Instead of simply exposing the raw seccomp notify fd for the container's seccomp filter directly to LXD each container connects to a multi-threaded socket that the LXD container manager exposes and on which it listens for new clients. Clients here are new containers who the administrator has signed up for syscall supervisions through LXD. Each container has a dedicated syscall supervisor which runs as a separate go routine and stays around for as long as the container is running.

When the container performs a syscall that the filter applies to a notification is generated on the seccomp notify fd. The container then forwards this request including some additional data on the socket it connected to during startup by sending a unix message including necessary credentials. LXD then interprets the message, checking the validity of the request, verifying the credentials, and processing the syscall arguments. If LXD can prove that the request is valid according to the policy the administrator specified for the container LXD will proceed to emulate the syscall. For mknod(2) it will create the device node for the container and for mount(2) it will mount the filesystem for the container. Either by directly mounting it or by using a specified fuse binary for additional security.

If LXD manages to emulate the syscall successfully it will prepare a response that it will forward on the socket to the container. The container then parses the message, verifying the credentials and will use the SECCOMP_IOCTL_NOTIF_SEND ioctl() sending a struct seccomp_notif_resp causing the kernel to unblock the task performing the syscall and reporting back that the syscall succeeded. Conversely, if LXD fails to emulate the syscall for whatever reason or the syscall is not allowed by the policy the administrator specified it will prepare a message that instructs the container to report back that the syscall failed and unblocking the task.

Show Me!

Ok, enough talk. Let's intercept some syscalls. The following demo shows how LXD uses the seccomp notify fd to emulate the mknod(2) and mount(2) syscalls for an unprivileged container:

Current Work and Future Directions

`SECCOMP_USER_NOTIF_FLAG_CONTINUE`

After the initial support for the seccomp notify fd landed we ran into limitations pretty quickly. We realized we couldn't intercept the mount syscall. Since the mount syscall has various pointer arguments it is difficult to write highly specific seccomp filters such that we only accept syscalls that we intended to intercept. This is caused by seccomp not being able to handle pointer arguments. They are opaque for seccomp. So while it is possible to tell seccomp to only intercept mount(2) requests for real filesystems by only intercepting mount(2) syscalls where the MS_BIND flag is not set in the flags argument it is not possible to write a seccomp filter that only notifies the container manager about mount(2) syscalls for the ext4 or btrfs filesystem because the filesystem argument is a pointer.

But this means we will inadvertently intercept syscalls that we didn't intend to intercept. That is a generic problem but for some syscalls it's not really a big deal. For example, we know that mknod(2) fails for all character and block devices in unprivileged containers. So as long was we write a seccomp filter that intercepts only character and block device mknod(2) syscalls but no socket or fifo mknod() syscalls we don't have a problem. For any character or block device that is not in the list of allowed devices in LXD we can simply instruct LXD to prepare a seccomp message that tells the kernel to report EPERM and since the syscalls would fail anyway there's no problem.

But any system call that we intercepted as a consequence of seccomp not being able to filter on pointer arguments that would succeed in unprivileged containers would need to be emulated in userspace. But this would of course include all mount(2) syscalls for filesystems that can be mounted in unprivileged containers. I've listed a subset of them above. It includes at least tmpfs, proc, sysfs, devpts, cgroup, cgroup2 and probably a few others I'm forgetting. That's not ideal. We only want to emulate syscalls that we really have to emulate, i.e. those that would actually fail.

The solution to this problem was a patchset of mine that added the ability to continue an intercepted syscall. To instruct the kernel to continue the syscall the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag can be set in struct seccomp_notif_resp's flag argument when instructing the kernel to unblock the task.

This is of course a very exciting feature and has a few readers probably thinking “Hm, I could implement a dynamic userspace seccomp policy.” to which I want to very loudly respond “No, you can't!”. In general, the seccomp notify fd cannot be used to implement any kind of security policy in userspace. I'm now going to mostly quote verbatim from my comment for the extension: The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with extreme caution! If set by the task supervising the syscalls of another task the syscall will continue. This is problematic is inherent because of TOCTOU (Time of Check-Time of Use). An attacker can exploit the time while the supervised task is waiting on a response from the supervising task to rewrite syscall arguments which are passed as pointers of the intercepted syscall. It should be absolutely clear that this means that seccomp notify cannot be used to implement a security policy on syscalls that read from dereferenced pointers in user space! It should only ever be used in scenarios where a more privileged task supervises the syscalls of a lesser privileged task to get around kernel-enforced security restrictions when the privileged task deems this safe. In other words, in order to continue a syscall the supervising task should be sure that another security mechanism or the kernel itself will sufficiently block syscalls if arguments are rewritten to something unsafe.

Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the same syscall, the most recently added filter takes precedence. This means that the new SECCOMP_RET_USER_NOTIF filter can override any SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all such filtered syscalls to be executed by sending the response SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.

Retrieving file descriptors `pidfd_getfd()`

Another extension that was added by Sargun Dhillon recently building on top of my pidfd work was to make it possible to retrieve file descriptors from another task. This works even without seccomp notify since it is a new syscall but is of course especially useful in conjunction with it.

Often we would like to intercept syscalls such as connect(2). For example, the container manager might want to rewrite the connect(2) request to something other than the task intended for security reasons or because the task lacks the necessary information about the networking layout to connect to the right endpoint. In these cases pidfd_getfd(2) can be used to retrieve a copy of the file descriptor of the task and perform the connect(2) for it. This unblocks another wide range of use-cases.

For example, it can be used for further introspection into file descriptors than ss, or netstat would typically give you, as you can do things like run getsockopt(2) on the file descriptor, and you can use options like TCP_INFO to fetch a significant amount of information about the socket. Not only can you fetch information about the socket, but you can also set fields like TCP_NODELAY, to tune the socket without requiring the user's intervention. This mechanism, in conjunction can be used to build a rudimentary layer 4 load balancer where connect(2) calls are intercepted, and the destination is changed to a real server instead.

Early results indicate that this method can yield incredibly good latency as compared to other layer 4 load balancing techniques.

Injecting file descriptors `SECCOMP_NOTIFY_IOCTL_ADDFD`

Current work for the upcoming merge window is focussed on making it possible to inject file descriptors into a task. As things stand, we are unable to intercept syscalls (Unless we share the file descriptor table with the task which is usually never the case for container managers and the containers they supervise.) such as open(2) that cause new file descriptors to be installed in the task performing the syscall.

The new seccomp extension effectively allows the container manager to instructs the target task to install a set of file descriptors into its own file descriptor table before instructing it to move on. This way it is possible to intercept syscalls such as open(2) or accept(2), and install (or replace, like dup2(2)) the container manager's resulting fd in the target task.

This new technique opens the door to being able to make massive changes in userspace. For example, techniques such as enabling unprivileged access to perf_event_open(2), and bpf(2) for tracing are available via this mechanism. The manager can inspect the program, and the way the perf events are being setup to prevent the user from doing ill to the system. On top of that, various network techniques are being introducd, such as zero-cost IPv6 transition mechanisms in the future.

Last, I want to note that Sargun Dhillon was kind enough to contribute paragraphs to the pidfd_getfd(2) and SECCOMP_NOTIFY_IOCTL_ADDFD sections. He also provided the graphic in the pidfd_getfd(2) sections to illustrate the performance benefits of this solution.

Christian

The CPU Cost of Networking on a Host

May 15, 2020

from David Ahern

When evaluating networking for a host the focus is typically on latency, throughput or packets per second (pps) to see the maximum load a system can handle for a given configuration. While those are important and often telling metrics, results for such benchmarks do not tell you the impact processing those packets has on the workloads running on that system.

This post looks at the cost of networking in terms of CPU cycles stolen from processes running in a host.

Packet Processing in Linux

Linux will process a fair amount of packets in the context of whatever is running on the CPU at the moment the irq is handled. System accounting will attribute those CPU cycles to any process running at that moment even though that process is not doing any work on its behalf. For example, 'top' can show a process appears to be using 99+% cpu but in reality 60% of that time is spent processing packets meaning the process is really only get 40% of the CPU to make progress on its workload.

net_rx_action, the handler for network Rx traffic, usually runs really fast – like under 25 usecs[1] – dealing with up to 64 packets per napi instance (NIC and RPS) at a time before deferring to another softirq cycle. softirq cycles can be back to back, up to 10 times or 2 msec (see __do_softirq), before taking a break. If the softirq vector still has more work to do after the maximum number of loops or time is reached, it defers further work to the ksoftirqd thread for that CPU. When that happens the system is a bit more transparent about the networking overhead in the sense that CPU usage can be monitored (though with the assumption that it is packet handling versus other softirqs).

One way to see the above description is using perf:

sudo perf record -a \
        -e irq:irq_handler_entry,irq:irq_handler_exit \
        -e irq:softirq_entry --filter="vec == 3" \
        -e irq:softirq_exit --filter="vec == 3"  \
        -e napi:napi_poll \
        -- sleep 1

sudo perf script

The output is something like:

swapper     0 [005] 176146.491879: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0
swapper     0 [005] 176146.491880:  irq:irq_handler_exit: irq=152 ret=handled
swapper     0 [005] 176146.491880:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491942:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 64 budget 64
swapper     0 [005] 176146.491943:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491943:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491971:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 27 budget 64
swapper     0 [005] 176146.491971:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.492200: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0

In this case the cpu is idle (hence swapper for the process), an irq fired for an Rx queue on CPU 5, softirq processing looped twice handling 64 packets and then 27 packets before exiting with the next irq firing 229 usec later and starting the loop again.

The above was recorded on an idle system. In general, any task can be running on the CPU in which case the above series of events plays out by interrupting that task, doing the irq/softirq dance and with system accounting attributing cycles to the interrupted process. Thus, processing packets is typically hidden from the usual CPU monitoring as it is done in the context of some random, victim process, so how do you view or quantify the time a process is interrupted handling packets? And how can you compare 2 different networking solutions to see which one is less disruptive to a workload?

With RSS, RPS, and flow steering, packet processing is usually distributed across cores, so the packet processing sequence describe above is all per-CPU. As packet rates increase (think 100,000 pps and up) the load means 1000's to 10,000's of packets are processed per second per cpu. Processing that many packets will inevitably have an impact on the workloads running on those systems.

Let's take a look at one way to see this impact.

Undo the Distributed Processing

First, let's undo the distributed processing by disabling RPS and installing flow rules to force the processing of all packets for a specific MAC address on a single, known CPU. My system has 2 nics enslaved to a bond in an 802.3ad configuration with the networking load targeted at a single virtual machine running in the host.

RPS is disabled on the 2 nics using

for d in eth0 eth1; do
    find /sys/class/net/${d}/queues -name rps_cpus |
    while read f; do
            echo 0 | sudo tee ${f}
    done
done

Next, add flow rules to push packets for the VM under test to a single CPU

DMAC=12:34:de:ad:ca:fe
sudo ethtool -N eth0 flow-type ether dst ${DMAC} action 2
sudo ethtool -N eth1 flow-type ether dst ${DMAC} action 2

Together, lack of RPS + flow rules ensure all packets destined to the VM are processed on the same CPU. You can use a command like ethq[3] to verify packets are directed to the expected queue and then map that queue to a CPU using /proc/interrupts. In my case queue 2 is handled on CPU 5.

openssl speed

I could use perf or a bpf program to track softirq entry and exit for network Rx, but that gets complicated quick, and the observation will definitely influence the results. A much simpler and more intuitive solution is to infer the networking overhead using a well known workload such as 'openssl speed' and look at how much CPU access it really gets versus is perceived to get (recognizing the squishiness of process accounting).

'openssl speed' is a nearly 100% userspace command and when pinned to a CPU will use all available cycles for that CPU for the duration of its tests. The command works by setting an alarm for a given interval (e.g., 10 seconds here for easy math), launches into its benchmark and then uses times() when the alarm fires as a way of checking how much CPU time it was actually given. From a syscall perspective it looks like this:

alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726601344
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 2782545353
times({tms_utime=1000, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726602344

so very few system calls between the alarm and checking the results of times(). With no/few interruptions the tms_utime will match the test time (10 seconds in this case).

Since it is is a pure userspace benchmark ANY system time that shows up in times() is overhead. openssl may be the process on the CPU, but the CPU is actually doing something else, like processing packets. For example:

alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726617896
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 4079301579
times({tms_utime=178, tms_stime=571, tms_cutime=0, tms_cstime=0}) = 1726618896

shows that openssl was on the cpu for 7.49 seconds (178 + 571 in .01 increments), but 5.71 seconds of that time was in system time. Since openssl is not doing anything in the kernel, that 5.71 seconds is all overhead – time stolen from this process for “system needs.”

Using openssl to Infer Networking Overhead

With an understanding of how 'openssl speed' works, let's look at a near idle server:

$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc >/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 66675623 aes-256 cbc's in 9.99s
Doing aes-256 cbc for 10s on 64 size blocks: 18096647 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 256 size blocks: 4607752 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 1024 size blocks: 1162429 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 8192 size blocks: 145251 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 16384 size blocks: 72831 aes-256 cbc's in 10.00s

so in this case openssl reports 9.99 to 10.00 seconds of run time for each of the block sizes confirming no contention for the CPU. Let's add network load, netperf TCP_STREAM from 2 sources, and re-do the test:

$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc >/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 12061658 aes-256 cbc's in 1.96s
Doing aes-256 cbc for 10s on 64 size blocks: 3457491 aes-256 cbc's in 2.10s
Doing aes-256 cbc for 10s on 256 size blocks: 893939 aes-256 cbc's in 2.01s
Doing aes-256 cbc for 10s on 1024 size blocks: 201756 aes-256 cbc's in 1.86s
Doing aes-256 cbc for 10s on 8192 size blocks: 25117 aes-256 cbc's in 1.78s
Doing aes-256 cbc for 10s on 16384 size blocks: 13859 aes-256 cbc's in 1.89s

Much different outcome. Each block size test wants to run for 10 seconds, but times() is reporting the actual user time to be between 1.78 and 2.10 seconds. Thus, the other 7.9 to 8.22 seconds was spent processing packets – either in the context of openssl or via ksoftirqd.

Looking at top for the previous openssl run:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P 
 8180 libvirt+  20   0 33.269g 1.649g 1.565g S 279.9  0.9  18:57.81 qemu-system-x86     75
 8374 root      20   0       0      0      0 R  99.4  0.0   2:57.97 vhost-8180          89
 1684 dahern    20   0   17112   4400   3892 R  73.6  0.0   0:09.91 openssl              5    
   38 root      20   0       0      0      0 R  26.2  0.0   0:31.86 ksoftirqd/5          5

one would think openssl is using ~73% of cpu 5 with ksoftirqd taking the rest but in reality so many packets are getting processed in the context of openssl that it is only effectively getting 18-21% time on the cpu to make progress on its workload.

If I drop the network load to just 1 stream, openssl appears to be running at 99% CPU:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P
 8180 libvirt+  20   0 33.269g 1.722g 1.637g S 325.1  0.9 166:38.12 qemu-system-x86     29
44218 dahern    20   0   17112   4488   3996 R  99.2  0.0   0:28.55 openssl              5
 8374 root      20   0       0      0      0 R  64.7  0.0  60:40.50 vhost-8180          55
   38 root      20   0       0      0      0 S   1.0  0.0   4:51.98 ksoftirqd/5          5

but openssl reports ~4 seconds of userspace time:

Doing aes-256 cbc for 10s on 16 size blocks: 26596388 aes-256 cbc's in 4.01s
Doing aes-256 cbc for 10s on 64 size blocks: 7137481 aes-256 cbc's in 4.14s
Doing aes-256 cbc for 10s on 256 size blocks: 1844565 aes-256 cbc's in 4.31s
Doing aes-256 cbc for 10s on 1024 size blocks: 472687 aes-256 cbc's in 4.28s
Doing aes-256 cbc for 10s on 8192 size blocks: 59001 aes-256 cbc's in 4.46s
Doing aes-256 cbc for 10s on 16384 size blocks: 28569 aes-256 cbc's in 4.16s

Again, monitoring tools show a lot of CPU access, but reality is much different with 55-80% of the CPU spent processing packets. The throughput numbers look great (22+Gbps for a 25G link), but the impact on processes is huge.

In this example, the process robbed of CPU cycles is a silly benchmark. On a fully populated host the interrupted process can be anything – virtual cpus for a VM, emulator threads for the VM, vhost threads for the VM, or host level system processes with varying degrees of impact on performance of those processes and the system.

Up Next

This post is the basis for a follow up post where I will discuss a comparison of the overhead of “full stack” on a VM host to “XDP”.

[1] Measured using ebpf program on entry and exit. See net_rx_action in https://github.com/dsahern/bpf-progs

[2] Assuming no bugs in the networking stack and driver. I have examined systems where net_rx_action takes well over 20,000 usec to process less than 64 packets due to a combination of bugs in the NIC driver (ARFS path) and OVS (thundering herd wakeup).

[3] https://github.com/isc-projects/ethq

GUS (Global Unbounded Sequences)

April 24, 2020

from joelfernandes

GUS is a memory reclaim algorithm used in FreeBSD, similar to RCU. It is borrows concepts from Epoch and Parsec. A video of a presentation describing the integration of GUS with UMA (FreeBSD's slab implementation) is here: https://www.youtube.com/watch?v=ZXUIFj4nRjk

The best description of GUS is in the FreeBSD code itself. It is based on the concept of global write clock, with readers catching up to writers.

Effectively, I see GUS as an implementation of light traveling from distant stars. When a photon leaves a star, it is no longer needed by the star and is ready to be reclaimed. However, on earth we can't see the photon yet, we can only see what we've been shown so far, and in a way, if we've not seen something because enough “time” has not passed, then we may not reclaim it yet. If we've not seen something, we will see it at some point in the future. Till then we need to sit tight.

Roughly, an implementation has 2+N counters (with N CPUs): 1. Global write sequence. 2. Global read sequence. 3. Per-cpu read sequence (read from #1 when a reader starts)

On freeing, the object is tagged with the write sequence. Only once global read sequence has caught up with global write sequence, the object is freed. Until then, the free'ing is deferred. The poll() operation updates #2 by referring to #3 of all CPUs. Whatever was tagged between the old read sequence and new read sequence can be freed. This is similar to synchronize_rcu() in the Linux kernel which waits for all readers to have finished observing the object being reclaimed.

Note the scalability drawbacks of this reclaim scheme:

Expensive poll operation if you have 1000s of CPUs. (Note: Parsec uses a tree-based mechanism to improve the situation which GUS could consider)
Heavy-weight memory barriers are needed (SRCU has a similar drawback) to ensure ordering properties of reader sections with respect to poll() operation.
There can be a delay between reading the global write-sequence number and writing it into the per-cpu read-sequence number. This can cause the per-cpu read-sequence to advance past the global write-sequence. Special handling is needed.

One advantage of the scheme could be implementation simplicity.

RCU (not SRCU or Userspace RCU) doesn't suffer from these drawbacks. Reader-sections in Linux kernel RCU are extremely scalable and lightweight.

Docker and Management VRF

March 13, 2020

from David Ahern

Running docker service over management VRF requires the service to be started bound to the VRF. Since docker and systemd do not natively understand VRF, the vrf exec helper in iproute2 can be used.

This series of steps worked for me on Ubuntu 19.10 and should work on 18.04 as well:

Configure mgmt VRF and disable systemd-resolved as noted in a previous post about management vrf and DNS
Install docker-ce

Edit /lib/systemd/system/docker.service and add /usr/sbin/ip vrf exec mgmt to the Exec lines like this:

ExecStart=/usr/sbin/ip vrf exec mgmt /usr/bin/dockerd -H fd://
--containerd=/run/containerd/containerd.sock

Tell systemd about the change and restart docker

systemctl daemon-reload
systemctl restart docker

With that, docker pull should work fine – in mgmt vrf or default vrf.

Management VRF and DNS

March 13, 2020

from David Ahern

Someone recently asked me why apt-get was not working when he enabled management VRF on Ubuntu 18.04. After a few back and forths and a little digging I was reminded of why. The TL;DR is systemd-resolved. This blog post documents how I came to that conclusion and what you need to do to use management VRF with Ubuntu (or any OS using a DNS caching service such as systemd-resolved).

The following example is based on a newly created Ubuntu 18.04 VM. The VM comes up with the 4.15.0-66-generic kernel which is missing the VRF module:

$ modprobe vrf
modprobe: FATAL: Module vrf not found in directory /lib/modules/4.15.0-66-generic

despite VRF being enabled and built:

$ grep VRF /boot/config-4.15.0-66-generic
CONFIG_NET_VRF=m

which is really weird.[4] So for this blog post I shifted to the v5.3 HWE kernel: $ sudo apt-get install --install-recommends linux-generic-hwe-18.04

although nothing about the DNS problem is kernel specific. A 4.14 or better kernel with VRF enabled and usable will work.

First, let's enable Management VRF. All of the following commands need to be run as root. For simplicity getting started, you will want to enable this sysctl to allow sshd to work across VRFs:

    echo "net.ipv4.tcp_l3mdev_accept=1" >> /etc/sysctl.d/99-sysctl.conf
    sysctl -p /etc/sysctl.d/99-sysctl.conf

Advanced users can leave that disabled and use something like the systemd instances to run sshd in Management VRF only.[1]

Ubuntu has moved to netplan for network configuration, and apparently netplan is still missing VRF support despite requests from multiple users since May 2018: https://bugs.launchpad.net/netplan/+bug/1773522

One option to workaround the problem is to put the following in /etc/networkd-dispatcher/routable.d/50-ifup-hooks:

#!/bin/bash

ip link show dev mgmt 2>/dev/null
if [ $? -ne 0 ]
then
        # capture default route
        DEF=$(ip ro ls default)

        # only need to do this once
        ip link add mgmt type vrf table 1000
        ip link set mgmt up
        ip link set eth0 vrf mgmt
        sleep 1

        # move the default route
        ip route add vrf mgmt ${DEF}
        ip route del default

        # fix up rules to look in VRF table first
        ip ru add pref 32765 from all lookup local
        ip ru del pref 0
        ip -6 ru add pref 32765 from all lookup local
        ip -6 ru del pref 0
fi
ip route del default

The above assumes eth0 is the nic to put into Management VRF, and it has a static IP address. If using DHCP instead of a static route, create or update the dhclient-exit-hook to put the default route in the Management VRF table.[3] Another option is to use ifupdown2 for network management; it has good support for VRF.[1]

Reboot node to make the changes take effect.

WARNING: If you run these commands from an active ssh session, you will lose connectivity since you are shifting the L3 domain of eth0 and that impacts existing logins. You can avoid the reboot by running the above commands from console.

After logging back in to the node with Management VRF enabled, the first thing to remember is that when VRF is enabled network addresses become relative to the VRF – and that includes loopback addresses (they are not that special).

Any command that needs to contact a service over the Management VRF needs to be run in that context. If the command does not have native VRF support, then you can use 'ip vrf exec' as a helper to do the VRF binding. 'ip vrf exec' uses a small eBPF program to bind all IPv4 and IPv6 sockets opened by the command to the given device ('mgmt' in this case) which causes all routing lookups to go to the table associated with the VRF (table 1000 per the setting above).

Let's see what happens:

$ ip vrf exec mgmt apt-get update
Err:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'
Err:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
  Temporary failure resolving 'security.ubuntu.com'
Err:3 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'
Err:4 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'

Theoretically, this should Just Work, but it clearly does not. Why?

Ubuntu uses systemd-resolved service by default with /etc/resolv.conf configured to send DNS lookups to it:

$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Oct 21 15:48 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

$ cat /etc/resolv.conf
...
nameserver 127.0.0.53
options edns0

So when a process (e.g., apt) does a name lookup, the message is sent to 127.0.0.53/53. In theory, systemd-resolved gets the request and attempts to contact the actual nameserver.

Where does the theory breakdown? In 3 places.

First, 127.0.0.53 is not configured for Management VRF, so attempts to reach it fail:

$ ip ro get vrf mgmt 127.0.0.53
127.0.0.53 via 157.245.160.1 dev eth0 table 1000 src 157.245.160.132 uid 0
    cache

That one is easy enough to fix. The VRF device is meant to be the loopback for a VRF, so let's add the loopback addresses to it:

    $ ip addr add dev mgmt 127.0.0.1/8
    $ ip addr add dev mgmt ::1/128

    $ ip ro get vrf mgmt 127.0.0.53
    127.0.0.53 dev mgmt table 1000 src 127.0.0.1 uid 0
        cache

The second problem is systemd-resolved binds its socket to the loopback device:

    $ ss -apn | grep systemd-resolve
    udp  UNCONN   0      0       127.0.0.53%lo:53     0.0.0.0:*     users:(("systemd-resolve",pid=803,fd=12))
    tcp  LISTEN   0      128     127.0.0.53%lo:53     0.0.0.0:*     users:(("systemd-resolve",pid=803,fd=13))

The loopback device is in the default VRF and can not be moved to Management VRF. A process bound to the Management VRF can not communicate with a socket bound to the loopback device.

The third issue is that systemd-resolved runs in the default VRF, so its attempts to reach the real DNS server happen over the default VRF. Those attempts fail since the servers are only reachable from the Management VRF and systemd-resolved has no knowledge of it.

Since systemd-resolved is hardcoded (from a quick look at the source) to bind to the loopback device, there is no option but to disable it. It is not compatible with Management VRF – or VRF at all.

$ rm /etc/resolv.conf
$ grep nameserver /run/systemd/resolve/resolv.conf > /etc/resolv.conf
$ systemctl stop systemd-resolved.service
$ systemctl disable systemd-resolved.service

With that it works as expected:
$ ip vrf exec mgmt apt-get update
Get:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease [242 kB]
Get:2 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:5 http://mirrors.digitalocean.com/ubuntu bionic-updates/universe amd64 Packages [1054 kB]

When using Management VRF, it is convenient (ie., less typing) to bind the shell to the VRF and let all commands run by it inherit the VRF binding: $ ip vrf exec mgmt su – dsahern

Now all commands run will automatically use Management VRF. This can be done at login using libpamscript[2].

Personally, I like a reminder about the network bindings in my bash prompt. I do that by adding the following to my .bashrc:

NS=$(ip netns identify)
[ -n "$NS" ] && NS=":${NS}"

VRF=$(ip vrf identify)
[ -n "$VRF" ] && VRF=":${VRF}"

And then adding '${NS}${VRF}' after the host in PS1:

PS1='${debian_chroot:+($debian_chroot)}\u@\h${NS}${VRF}:\w\$ '

For example, now the prompt becomes: dsahern@myhost:mgmt:~$

References:

[1] VRF tutorial, Open Source Summit, North America, Sept 2017 http://schd.ws/hosted_files/ossna2017/fe/vrf-tutorial-oss.pdf

[2] VRF helpers, e.g., systemd instances for VRF https://github.com/CumulusNetworks/vrf

[3] Example using VRF in dhclient-exit-hook https://github.com/CumulusNetworks/vrf/blob/master/etc/dhcp/dhclient-exit-hooks.d/vrf

[4] Vincent Bernat informed me that some modules were moved to non-standard package; installing linux-modules-extra-$(uname -r)-generic provides the vrf module. Thanks Vincent.

SRCU state double scan

March 6, 2020

from joelfernandes

The SRCU flavor of RCU uses per-cpu counters to detect that every CPU has passed through a quiescent state for a particular SRCU lock instance (srcu_struct).

There's are total of 4 counters per-cpu. One pair for locks, and another for unlocks. You can think of the SRCU instance to be split into 2 parts. The readers sample srcu_idx and decided which part to use. Each part corresponds to one pair of lock and unlock counters. A reader increments a part's lock counter during locking and likewise for unlock.

During an update, the updater flips srcu_idx (thus attempting to force new readers to use the other part) and waits for the lock/unlock counters on the previous value of srcu_idx to match. Once the sum of the lock counters of all CPUs match that of unlock, the system knows all pre-existing read-side critical sections have completed.

Things are not that simple, however. It is possible that a reader samples the srcu_idx, but before it can increment the lock counter corresponding to it, it undergoes a long delay. We thus we end up in a situation where there are readers in both srcu_idx = 0 and srcu_idx = 1.

To prevent such a situation, a writer has to wait for readers corresponding to both srcu_idx = 0 and srcu_idx = 1 to complete. This depicted with 'A MUST' in the below pseudo-code:

        reader 1        writer                        reader 2
        -------------------------------------------------------
        // read_lock
        // enter
        Read: idx = 0;
        <long delay>    // write_lock
                        // enter
                        wait_for lock[1]==unlock[1]
                        idx = 1; /* flip */
                        wait_for lock[0]==unlock[0]
                        done.
                                                      Read: idx = 1;
        lock[0]++;
                                                      lock[1]++;
                        // write_lock
                        // return
        // read_lock
        // return
        /**** NOW BOTH lock[0] and lock[1] are non-zero!! ****/
                        // write_lock
                        // enter
                        wait_for lock[0]==unlock[0] <- A MUST!
                        idx = 0; /* flip */
                        wait_for lock[1]==unlock[1] <- A MUST!

NOTE: QRCU has a similar issue. However it overcomes such a race in the reader by retrying the sampling of its 'srcu_idx' equivalent.

Q: If you have to wait for readers of both srcu_idx = 0, and 1, then why not just have a single counter and do away with the “flipping” logic? Ans: Because of updater forward progress. If we had a single counter, then it is possible that new readers would constantly increment the lock counter, thus updaters would be waiting all the time. By using the 'flip' logic, we are able to drain pre-existing readers using the inactive part of srcu_idx to be drained in a bounded time. The number of readers of a 'flipped' part would only monotonically decrease since new readers go to its counterpart.

Modeling (lack of) store ordering using PlusCal - and a wishlist

October 18, 2019

from joelfernandes

The Message Passing pattern (MP pattern) is shown in the snippet below (borrowed from LKMM docs). Here, P0 and P1 are 2 CPUs executing some code. P0 stores a message in buf and then signals to consumers like P1 that the message is available — by doing a store to flag. P1 reads flag and if it is set, knows that some data is available in buf and goes ahead and reads it. However, if flag is not set, then P1 does nothing else. Without memory barriers between P0's stores and P1's loads, the stores can appear out of order to P1 (on some systems), thus breaking the pattern. The condition r1 == 0 and r2 == 1 is a failure in the below code and would violate the condition. Only after the flag variable is updated, should P1 be allowed to read the buf (“message”).

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2 = 0;

                r1 = READ_ONCE(flag);
                if (r1)
                        r2 = READ_ONCE(buf);
        }

Below is a simple program in PlusCal to model the “Message passing” access pattern and check whether the failure scenario r1 == 0 and r2 == 1 could ever occur. In PlusCal, we can model the non deterministic out-of-order stores to buf and flag using an either or block. This makes PlusCal evaluate both scenarios of stores (store to buf first and then flag, or viceversa) during model checking. The technique used for modeling this non-determinism is similar to how it is done in Promela/Spin using an “if block” (Refer to Paul McKenney's perfbook for details on that).

EXTENDS Integers, TLC
(*--algorithm mp_pattern
variables
    buf = 0,
    flag = 0;

process Writer = 1
variables
    begin
e0:
       either
e1:        buf := 1;
e2:        flag := 1;
        or
e3:        flag := 1;
e4:        buf := 1;
        end either;
end process;

process Reader = 2
variables
    r1 = 0,
    r2 = 0;  
    begin
e5:     r1 := flag;
e6:     if r1 = 1 then
e7:         r2 := buf;
        end if;
e8:     assert r1 = 0 \/ r2 = 1;
end process;

end algorithm;*)

Sure enough, the assert r1 = 0 \/ r2 = 1; fires when the PlusCal program is run through the TLC model checker.

I do find the either or block clunky, and wish I could just do something like:

non_deterministic {
        buf := 1;
        flag := 1;
}

And then, PlusCal should evaluate both store orders. In fact, if I wanted more than 2 stores, then it can get crazy pretty quickly without such a construct. I should try to hack the PlusCal sources soon if I get time, to do exactly this. Thankfully it is open source software.

Other notes:

PlusCal is a powerful language that translates to TLA+. TLA+ is to PlusCal what assembler is to C. I do find PlusCal's syntax to be non-intuitive but that could just be because I am new to it. In particular, I hate having to mark statements with labels if I don't want them to atomically execute with neighboring statements. In PlusCal, a label is used to mark a statement as an “atomic” entity. A group of statements under a label are all atomic. However, if you don't specific labels on every statement like I did above (eX), then everything goes under a neighboring label. I wish PlusCal had an option, where a programmer could add implict labels to all statements, and then add explicit atomic { } blocks around statements that were indeed atomic. This is similar to how it is done in Promela/Spin.
I might try to hack up my own compiler to TLA+ if I can find the time to, or better yet modify PlusCal itself to do what I want. Thankfully the code for the PlusCal translator is open source software.

Now how many USB-C™ to USB-C™ cables are there? (USB4™ Update, September 12, 2019)

September 6, 2019

from Benson Leung

tl;dr: There are now 8. Thunderbolt 3 cables officially count too. It's getting hard to manage, but help is on the way.

Edited lightly 09-16-2019: Tables 3-1 and 5-1 from USB Type-C Spec reproduced as tables instead of images. Made an edit to clarify that Thunderbolt 3 passive cables have always been compliant USB-C cables.

If you recall my first cable post, there were 6 kinds of cables with USB-C plugs on both ends. I was also careful to preface that it was true as of USB Type-C™ Specification 1.4 on June 2019.

Last week, the USB-IF officially published the USB Type-C™ Specification Version Revision 2.0, August 29, 2019.

This is a major update to USB-C and contains required amendments to support the new USB4™ Spec.

One of those amendments? Introducing a new data rate, 20Gbps per lane, or 40Gbps total. This is called “USB4 Gen 3” in the new spec. One more data rate means the matrix of cables increases by a row, so we now have 8 C-to-C cable kinds, see Table 3-1:

Table 3-1 USB Type-C Standard Cable Assemblies

Cable Ref	Plug 1	Plug 2	USB Version	Cable Length	Current Rating	USB Power Delivery	USB Type-C Electronically Marked
CC2-3	C	C	USB 2.0	≤ 4 m	3 A	Supported	Optional
CC2-5	C	C	USB 2.0	≤ 4 m	5 A	Supported	Required
CC3G1-3	C	C	USB 3.2 Gen1 and USB4 Gen2	≤ 2 m	3 A	Supported	Required
CC3G1-5	C	C	USB 3.2 Gen1 and USB4 Gen2	≤ 2 m	5 A	Supported	Required
CC3G2-3	C	C	USB 3.2 Gen2 and USB4 Gen2	≤ 1 m	3 A	Supported	Required
CC3G2-5	C	C	USB 3.2 Gen2 and USB4 Gen2	≤ 1 m	5 A	Supported	Required
CC3G3-3	C	C	USB4 Gen3	≤ 0.8 m	3 A	Supported	Required
CC3G3-5	C	C	USB4 Gen3	≤ 0.8 m	5 A	Supported	Required

Listed, with new cables in bold: 1. USB 2.0 rated at 3A 2. USB 2.0 rated at 5A 3. USB 3.2 Gen 1 rated at 3A 4. USB 3.2 Gen 1 rated at 5A 5. USB 3.2 Gen 2 rated at 3A 6. USB 3.2 Gen 2 rated at 5A 7. USB4 Gen 3 rated at 3A 8. USB4 Gen 3 rated at 5A

New cables 7 and 8 have the same number of wires as cables 3 through 6, but are built to tolerances such that they can sustain 20Gbps per set of differential pairs, or 40Gbps for the whole cable. This is the maximum data rate in the USB4 Spec.

Also, please notice in the table above that (informative) maximum cable length shrinks as speed increases. Gen 1 cables can be 2M long, while Gen 3 cables can be 0.8m. This is just a practical consequence of physics and signal integrity when it comes to passive cables.

Data Rates

Data rates require some explanation too, as advancements since USB 3.1 means that the same physical cable is capable of way more when used in a USB4 system.

A USB 3.1 Gen 1 cable built and sold in 2015 would have been advertised to support 5Gbps operation in 2015. Fast forward to 2019 or 2020, that exact same physical cable (Gen 1), will actually allow you to hit 20gbps using USB4. This is due to advancements in the underlying phy on the host and client-side, but also because USB4 uses all 8 SuperSpeed wires simultaneously, while USB 3.1 only used 4 (single lane operation versus dual-lane operation).

The same goes for USB 3.1 Gen 2 cables, which would have been sold as 10gbps cables. They are able to support 20gbps operation in USB4, again, because of dual-lane.

Table 5-1 Certified Cables Where USB4-compatible Operation is Expected

	Cable Signaling	USB4 Operation	Notes
USB Type-C Full-Featured Cables (Passive)	USB 3.2 Gen1	20 Gbps	This cable will indicate support for USB 3.2 Gen1 (001b) in the USB Signaling field of its Passive Cable VDO response. Note: even though this cable isn’t explicitly tested, certified or logo’ed for USB 3.2 Gen2 operation, USB4 Gen2 operation will generally work.
	USB 3.2 Gen2 (USB4 Gen2)	20 Gbps	This cable will indicate support for USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response.
	USB4 Gen3	40 Gbps	This cable will indicate support for USB4 Gen3 (011b) in the USB Signaling field of its Passive Cable VDO response.
Thunderbolt™ 3 Cables (Passive)	TBT3 Gen2	20 Gbps	This cable will indicate support for USB 3.2 Gen1 (001b) or USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response.
Thunderbolt™ 3 Cables (Passive)	TBT3 Gen3	40 Gbps	In addition to indicating support for USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response, this cable will indicate that it supports TBT3 Gen3 in the Discover Mode VDO response.
USB Type-C Full-Featured Cables (Active)	USB4 Gen2	20 Gbps	This cable will indicate support for USB4 Gen2 (010b) in the USB Signaling field of its Active Cable VDO response.
USB Type-C Full-Featured Cables (Active)	USB4 Gen3	40 Gbps	This cable will indicate support for USB4 Gen3 (011b) in the USB Signaling field of its Active Cable VDO response.

What about Thunderbolt 3 cables? Thunderbolt 3 cables physically look the same as a USB-C to USB-C cable and the passive variants of the cables comply with the existing USB-C spec and are to be regarded as USB-C cables of kinds 3 through 6. In addition to being compliant USB-C cables, Intel needed a way to mark some of their cables as 40Gbps capable, years before USB-IF defined the Gen 3 40gbps data rate level. They did so using extra alternate mode data objects in the Thunderbolt 3 cables' electronic marker, amounting to extra registers that mark the cable as high speed capable.

The good news is that since Intel decided to open up the Thunderbolt 3 spec, the USB-IF was able to completely take in and make Passive 20Gbps and 40Gbps Thunderbolt 3 cables supported by USB4 devices. A passive 40Gbps TBT3 cable you bought in 2016 or 2017 will just work at 40Gbps on a USB4 device in 2020.

How Linux USB PD and USB4 systems can help identify cables for users

By now, you are likely ever so confused by this mess of cable and data rate possibilities. The fact that I need a matrix and a decoder ring to explain the landscape of USB-C cables is a bad sign.

In the real world, your average user will pick a cable and will simply not be able to determine the capabilities of the cable by looking at it. Even if the cable has the appropriate logo to distinguish them, not every user will understand what the hieroglyphs mean.

Software, however, and Power Delivery may very well help with this. I've been looking very closely at the kernel's USB Type-C Connector Class.

The connector class creates the following structure in sysfs, populating these nodes with important properties queried from the cable, the USB-C port, and the port's partner:

/sys/class/typec/
/sys/class/typec/port0 <---------------------------Me
/sys/class/typec/port0/port0-partner/ <------------My Partner
/sys/class/typec/port0/port0-cable/ <--------------Our Cable
/sys/class/typec/port0/port0-cable/port0-plug0 <---Cable SOP'
/sys/class/typec/port0/port0-cable/port0-plug1 <---Cable SOP"

You may see where I'm going from here. Once user space is able to see what the cable and its e-marker chip has advertised, an App or Settings panel in the OS could tell the user what the cable is, and hopefully in clear language what the cable can do, even if the cable is unlabeled, or the user doesn't understand the obscure logos.

Lots of work remains here. The present Type-C Connector class needs to be synced with the latest version of the USB-C and PD spec, but this gives me hope that users will have a tool (any USB-C phone with PD) in their pocket to quickly identify cables.

Notes about Netiquette

September 3, 2019

from tglx

E-Mail interaction with the community

You might have been referred to this page with a form letter reply. If so the form letter has been sent to you because you sent e-mail in a way which violates one or more of the common rules of email communication in the context of the Linux kernel or some other Open Source project.

Private mail

Help from the community is provided as a free of charge service on a best effort basis. Sending private mail to maintainers or developers is pretty much a guarantee for being ignored or redirected to this page via a form letter:

Private e-mail does not scale Maintainers and developers have limited time and cannot answer the same questions over and over.
Private e-mail is limiting the audience Mailing lists allow people other than the relevant maintainers or developers to answer your question. Mailing lists are archived so the answer to your question is available for public search and helps to avoid the same question being asked again and again. Private e-mail is also limiting the ability to include the right experts into a discussion as that would first need your consent to give a person who was not included in your Cc list access to the content of your mail and also to your e-mail address. When you post to a public mailing list then you already gave that consent by doing so. It's usually not required to subscribe to a mailing list. Most mailing lists are open. Those which are not are explicitly marked so. If you send e-mail to an open list the replies will have you in Cc as this is the general practice.
Private e-mail might be considered deliberate disregard of documentation The documentation of the Linux kernel and other Open Source projects gives clear advice how to contact the community. It's clearly spelled out that the relevant mailing lists should always be included. Adding the relevant maintainers or developers to CC is good practice and usually helps to get the attention of the right people especially on high volume mailing lists like LKML.
Corporate policies are not an excuse for private e-mail If your company does not allow you to post on public mailing lists with your work e-mail address, please go and talk to your manager.

Confidentiality disclaimers

When posting to public mailing lists the boilerplate confidentiality disclaimers are not only meaningless, they are absolutely wrong for obvious reasons.

If that disclaimer is automatically inserted by your corporate e-mail infrastructure, talk to your manager, IT department or consider to use a different e-mail address which is not affected by this. Quite some companies have dedicated e-mail infrastructure to avoid this problem.

Reply to all

Trimming Cc lists is usually considered a bad practice. Replying only to the sender of an e-mail immediately excludes all other people involved and defeats the purpose of mailing lists by turning a public discussion into a private conversation. See above.

HTML e-mail

HTML e-mail – even when it is a multipart mail with a corresponding plain/text section – is unconditionally rejected by mailing lists. The plain/text section of multipart HTML e-mail is generated by e-mail clients and often results in completely unreadable gunk.

Multipart e-mail

Again, use plain/text e-mail and not some magic format. Also refrain from attaching patches as that makes it impossible to reply to the patch directly. The kernel documentation contains elaborate explanations how to send patches.

Text mail formatting

Text-based e-mail should not exceed 80 columns per line of text. Consult the documentation of your e-mail client to enable proper line breaks around column 78.

Top-posting

If you reply to an e-mail on a mailing list do not top-post. Top-posting is the preferred style in corporate communications, but that does not make an excuse for it:

A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

A: Top-posting. Q: What is the most annoying thing in e-mail?

A: No. Q: Should I include quotations after my reply?

Trim replies

If you reply to an e-mail on a mailing list trim unneeded content of the e-mail you are replying to. It's an annoyance to have to scroll down through several pages of quoted text to find a single line of reply or to figure out that after that reply the rest of the e-mail is just useless ballast.

Quoting code

If you want to refer to code or a particular function then mentioning the file and function name is completely sufficient. Maintainers and developers surely do not need a link to a git-web interface or one of the source cross-reference sites. They are definitely able to find the code in question with their favorite editor.

If you really need to quote code to illustrate your point do not copy that from some random web interface as that turns again into unreadable gunk. Insert the code snippet from the source file and only insert the absolute minimum of lines to make your point. Again people are able to find the context on their own and while your hint might be correct in many cases the issue you are looking into is root caused at a completely different place.

Does not work for you?

In case you can't follow the rules above and the documentation of the Open Source project you want to communicate with, consider to seek professional help to solve your problem.

Open Source consultants and service providers charge for their services and therefore are willing to deal with HTML e-mail, disclaimers, top-posting and other nuisances of corporate style communications.

Notes about Netiquette

September 3, 2019

from tglx

E-Mail interaction with the community

Private mail

Private e-mail does not scale Maintainers and developers have limited time and cannot answer the same questions over and over.
Private e-mail is limiting the audience Mailing lists allow people other than the relevant maintainers or developers to answer your question. Mailing lists are archived so the answer to your question is available for public search and helps to avoid the same question being asked again and again. Private e-mail is also limiting the ability to include the right experts into a discussion as that would first need your consent to give a person who was not included in your Cc list access to the content of your mail and also to your e-mail address. When you post to a public mailing list then you already gave that consent by doing so. It's usually not required to subscribe to a mailing list. Most mailing lists are open. Those which are not are explicitly marked so. If you send e-mail to an open list the replies will have you in Cc as this is the general practice.
Private e-mail might be considered deliberate disregard of documentation The documentation of the Linux kernel and other Open Source projects gives clear advice how to contact the community. It's clearly spelled out that the relevant mailing lists should always be included. Adding the relevant maintainers or developers to CC is good practice and usually helps to get the attention of the right people especially on high volume mailing lists like LKML.
Corporate policies are not an excuse for private e-mail If your company does not allow you to post on public mailing lists with your work e-mail address, please go and talk to your manager.

Confidentiality disclaimers

When posting to public mailing lists the boilerplate confidentiality disclaimers are not only meaningless, they are absolutely wrong for obvious reasons.

Reply to all

HTML e-mail

Multipart e-mail

Text mail formatting

Text-based e-mail should not exceed 80 columns per line of text. Consult the documentation of your e-mail client to enable proper line breaks around column 78.

Top-posting

If you reply to an e-mail on a mailing list do not top-post. Top-posting is the preferred style in corporate communications, but that does not make an excuse for it:

A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

A: Top-posting. Q: What is the most annoying thing in e-mail?

A: No. Q: Should I include quotations after my reply?

Trim replies

Quoting code

Does not work for you?

In case you can't follow the rules above and the documentation of the Open Source project you want to communicate with, consider to seek professional help to solve your problem.

Reader

Where It All Began

KASan Arrives

KASan on ARM32

Sleeping Beauty

Finally Fixing the Bugs

Retrospect

When is an rvalue reference created?

1. Using std::move

2. When returning something from a function

Note on move asssignment

Quiz

Question:

Answer

Conclusion

Setup

Host

Virtual Machine

Forwarding with XDP

Packet generator

CPU Measurement

CPU Comparison

Final Thoughts

Acronyms

References

Introduction

The Limits of Unprivileged Containers

Seccomp – The Basics of Syscall Interception

Extending Seccomp

Seccomp Notify – Syscall Interception 2.0

Emulating Syscalls In Userspace

Seccomp Notify in action – LXD

Show Me!

Current Work and Future Directions

SECCOMP_USER_NOTIF_FLAG_CONTINUE

Retrieving file descriptors pidfd_getfd()

Injecting file descriptors SECCOMP_NOTIFY_IOCTL_ADDFD

Packet Processing in Linux

Undo the Distributed Processing

openssl speed

Using openssl to Infer Networking Overhead

Up Next

Table 3-1 USB Type-C Standard Cable Assemblies

Data Rates

Table 5-1 Certified Cables Where USB4-compatible Operation is Expected

How Linux USB PD and USB4 systems can help identify cables for users

E-Mail interaction with the community

Private mail

Confidentiality disclaimers

Reply to all

HTML e-mail

Multipart e-mail

Text mail formatting

Top-posting

Trim replies

Quoting code

Does not work for you?

E-Mail interaction with the community

Private mail

Confidentiality disclaimers

Reply to all

HTML e-mail

Multipart e-mail

Text mail formatting

Top-posting

Trim replies

Quoting code

Does not work for you?

`SECCOMP_USER_NOTIF_FLAG_CONTINUE`

Retrieving file descriptors `pidfd_getfd()`

Injecting file descriptors `SECCOMP_NOTIFY_IOCTL_ADDFD`