people.kernel.org

Reader

Read the latest posts from people.kernel.org.

from linusw

My previous article on how the kernel decompresses generated a lot of traffic and commentary, much to my surprise. I suppose that this may be because musings of this kind fill the same niche as the original Lions' Commentary on UNIX 6th Edition, with Source Code which was a major hit in the late 1970s. Operating system developers simply like to read expanded code comments, which is what this is.

When I’m talking about “ARM32” the proper ARM name for this is Aarch32 and what is implemented physically in the ARMv4 thru ARMv7 ARM architectures.

In this post I will discuss how the kernel bootstraps itself from executing in physical memory after decompression/boot loader and all the way to executing generic kernel code written in C from virtual memory.

This Is Where It All Begins

After decompression and after augmenting and passing the device tree blob (DTB) the ARM32 kernel is invoked by putting the program counter pc at the physical address of the symbol stext(), start of text segment. This code can be found in arch/arm/kernel/head.S.

A macro __HEAD places the code here in a linker section called .head.text and if you inspect the linker file for the ARM architecture at arch/arm/kernel/vmlinux.lds.S you will see that this means that the object code in this section will be, well, first.

This will be at a physical address which is evenly divisible by 16MB and an additional 32KB TEXT_OFFSET (which will be explained more later) so you will find stext() at an address such as 0x10008000, which is the address we will use in our example.

head.S contains a small forest of exceptions for different old ARM platforms making it hard to tell the trees from the forest. The standards for ATAGs and device tree boots came about since it was written so this special code has become increasingly complex over the years.

To understand the following you need a basic understanding of paged virtual memory. If Wikipedia is too terse, refer to Hennesy & Patterson’s book “Computer Architecture: A Quantitative Approach”. Knowing a bit of ARM assembly language and Linux kernel basics is implied.

ARM’s Virtual Memory Split

Let’s first figure out where in virtual memory the kernel will actually execute. The kernel RAM base is defined in the PAGE_OFFSET symbol, which is something you can configure. The name PAGE_OFFSET shall be understood as the virtual memory offset for the first page of kernel RAM.

You choose 1 out of 4 memory split alternatives, which reminds me about fast food restaurants. It is currently defined like this in arch/arm/Kconfig:

config PAGE_OFFSET
        hex
        default PHYS_OFFSET if !MMU
        default 0x40000000 if VMSPLIT_1G
        default 0x80000000 if VMSPLIT_2G
        default 0xB0000000 if VMSPLIT_3G_OPT
        default 0xC0000000

First notice that if we don’t have an MMU (such as when running on ARM Cortex-R class devices or old ARM7 silicon) we will create a 1:1 mapping between physical and virtual memory. The page table will then only be used to populate the cache and addresses are not rewritten. PAGE_OFFSET could very typically be at address 0x00000000 for such set-ups. Using the Linux kernel without virtual memory is referred to as “uClinux” and was for some years a fork of the Linux kernel before being brought in as part of the mainline kernel.

Not using virtual memory is considered an oddity when using Linux or indeed any POSIX-type system. Let us from here on assume a boot using virtual memory.

The PAGE_OFFSET virtual memory split symbol creates a virtual memory space at the addresses above it, for the kernel to live in. So the kernel keep all its code, state and data structures (including the virtual-to-physical memory translation table) at one of these locations in virtual memory:

  • 0x40000000-0xFFFFFFFF
  • 0x80000000-0xFFFFFFFF
  • 0xB0000000-0xFFFFFFFF
  • 0xC0000000-0xFFFFFFFF

Of these four, the last location at 0xC0000000-0xFFFFFFFF is by far the most common. So the kernel has 1GB of address space to live in.

The memory below the kernel, from 0x00000000-PAGE_OFFSET-1, i.e. typically at address 0x00000000-0xBFFFFFFF (3 GB) is used for userspace code. This is combined with the Unix habit to overcommit, which means that you optimistically offer programs more virtual memory space than what you have physical memory available for. Every time a new userspace process starts up it thinks that it has 3 gigabytes of memory to use! This type of overcommit has been a characteristic of Unix systems since their inception in the 1970s.

Why are there four different splits?

This is pretty straightforward to answer: ARM is heavily used for embedded systems and these can be either userspace-heavy (such as an ordinary tablet or phone, or even a desktop computer) or kernel-heavy (such as a router). Most systems are userspace-heavy or have so little memory that the split doesn’t really matter (neither will be very occupied), so the most common split is to use PAGE_OFFSET 0xC0000000.

Kernelspace userspace split The most common virtual memory split between kernelspace and userspace memory is at 0xC0000000. A notice on these illustrations: when I say memory is “above” something I mean lower in the picture, along the arrow, toward higher addresses. I know that some people perceive this as illogical and turn the figure upside-down with 0xFFFFFFFF on the top, but it is my personal preference and also the convention used in most hardware manuals.

It could happen that you have a lot of memory and a kernel-heavy use case, such as a router or NAS with lots of memory, say a full 4GB of RAM. You would then like the kernel to be able to use some of that memory for page cache and network cache to speed up your most typical use case, so you will select a split that gives you more kernel memory, such as in the extreme case PAGE_OFFSET 0x40000000.

This virtual memory map is always there, also when the kernel is executing userspace code. The idea is that by keeping the kernel constantly mapped in, context switches from userspace to kernelspace become really quick: when a userspace process want to ask the kernel for something, you do not need to replace any page tables. You just issue a software trap to switch to supervisor mode and execute kernel code, and the virtual memory set-up stays the same.

Context switches between different userspace processes also becomes quicker: you will only need to replace the lower part of the page table, and typically the kernel mapping as it is simple, uses a predetermined chunk of physical RAM (the physical RAM where it was loaded to begin with) and is linearly mapped, is even stored in a special place, the translation lookaside buffer. The translation lookaside buffer, “special fast translation tables in silicon”, makes it even faster to get into kernelspace. These addresses are always present, always linearly mapped, and will never generate a page fault.

Where Are We Executing?

We continue looking at arch/arm/kernel/head.S, symbol stext().

The next step is to deal with the fact that we are running in some unknown memory location. The kernel can be loaded anywhere (provided it is a reasonably even address) and just executed, so now we need to deal with that. Since the kernel code is not position-independent, it has been linked to execute at a certain address by the linker right after compilation. We don’t know that address yet.

The kernel first checks for some special features like virtualization extensions and LPAE (large physical address extension) then does this little dance:

    adr        r3, 2f
    ldmia      r3, {r4, r8}
    sub        r4, r3, r4            @ (PHYS_OFFSET - PAGE_OFFSET)
    add        r8, r8, r4            @ PHYS_OFFSET
    (...)
    2:         .long        .
               .long      PAGE_OFFSET

.long . is link-time assigned to the address of the label 2: itself, so . is resolved to the address that the label 2: is actually linked to, where the linker imagines that it will be located in memory. This location will be in the virtual memory assigned to the kernel, i.e. typically somewhere above 0xC0000000.

After that is the compiled-in constant PAGE_OFFSET, and we already know that this will be something like 0xC0000000.

We load in the compile-time address if 2: into r4 and the constant PAGE_OFFSET into r8. Then we subtract that from the actual address of 2:, which we have obtained in r3 using relative instructions, from r4. Remember that ARM assembly argument order is like a pocket calculator: sub ra, rb, rc => ra = rb - rc.

The result is that we have obtained in r4 the offset between the address the kernel was compiled to run at, and the address it is actually running at. So the comment @ (PHYS_OFFSET - PAGE_OFFSET) means we have this offset. If the kernel symbol 2: was compiled to execute at 0xC0001234 in virtual memory but we are now executing at 0x10001234 r4 will contain 0x10001234 - 0xC0001234 = 0x50000000. This can be understood as “-0xB0000000” because the arithmetic is commutative: 0xC0001234 + 0x50000000 = 0x10001234, QED.

Next we add this offset to the compile-time assigned PAGE_OFFSET that we know will be something like 0xC0000000. With wrap-around arithmetic, if we are again executing at 0x10000000 we get 0xC0000000 + 0x50000000 = 0x10000000 and we obtained the base physical address where the kernel is currently actually executing in r8 – hence the comment @PHYS_OFFSET. This value in r8 is what we are actually going to use.

The elder ARM kernel had a symbol called PLAT_PHYS_OFFSET that contained exactly this offset (such as 0x10000000), albeit compile-time assigned. We don’t do that anymore: we assign this dynamically as we shall see. If you are working with operating systems less sophisticated than Linux you will find that the developers usually just assume something like this in order to make things simple: the physical offset is a constant. Linux has evolved to this because we need to handle booting of one single kernel image on all kinds of memory layouts.

Physical virtual split 1 I do not know if this picture makes things easier or harder to understand. It illustrates physical to virtual memory mapping in our example.

There are some rules around PHYS_OFFSET: it needs to adhere to some basic alignment requirements. When we determine the location of the first block of physical memory in the uncompress code we do so by performing PHYS = pc & 0xF8000000, which means that the physical RAM must start at an even 128 MB boundary. If it starts at 0x00000000 that’s great for example.

There are some special considerations around this code for the XIP “execute in place” case when the kernel is executing from ROM (read only memory) but we leave that aside, it is another oddity, even less common than not using virtual memory.

Notice one more thing: you might have attempted to load an uncompressed kernel and boot it and noticed that the kernel is especially picky about where you place it: you better load it to a physical address like 0x00008000 or 0x10008000 (assuming that your TEXT_OFFSET is 0x8000). If you use a compressed kernel you get away from this problem because the decompressor will decompress the kernel to a suitable location (very often 0x00008000) and solve this problem for you. This is another reason why people might feel that compressed kernels “just work” and is kind of the norm.

Patching Physical to Virtual (P2V)

Now that we have the offset between virtual memory where we will be executing and the physical memory where we are actually executing, we reach the first sign of the Kconfig symbol CONFIG_ARM_PATCH_PHYS_VIRT.

This symbol was created because developers were creating kernels that needed to boot on systems with different memory configurations without recompiling the kernel. The kernel would be compiled to execute at a certain virtual address, like 0xC0000000, but can be loaded into memory at 0x10000000 like in our case, or maybe at 0x40000000, or some other address.

Most of the symbols in the kernel will not need to care of course: they are executed in virtual memory at an adress to which they were linked and to them we are always running at 0xC0000000. But now we are not writing some userspace program: things are not easy. We have to know about the physical memory where we are executing, because we are the kernel, which means that among other things we need to set up the physical-to-virtual mapping in the page tables, and regularly update these page tables.

Also, since we don’t know where in the physical memory we will be running, we cannot rely on any cheap tricks like compile-time constants, that is cheating and will create hard to maintain code full of magic numbers.

For converting between physical and virtual addresses the kernel has two functions: __virt_to_phys() and __phys_to_virt() converting kernel addresses in each direction (not any other addresses than those used by the kernel memory). This conversion is linear in the memory space (uses an offset in each direction) so it should be possible to achieve this with a simple addition or subtraction. And this is what we set out to do, and we give it the name “P2V runtime patching” . This scheme was invented by Nicolas Pitre, Eric Miao and Russell King in 2011 and in 2013 Santosh Shilimkar extended the scheme to also apply to LPAE systems, specifically the TI Keystone SoC.

The main observation is that if it holds that, for a kernel physical address PHY and a kernel virtual address VIRT (you can convince yourself of these two by looking at the last illustration):

PHY = VIRT – PAGE_OFFSET + PHYS_OFFSET VIRT = PHY – PHYS_OFFSET + PAGE_OFFSET

Then by pure laws of arithmetics it also holds that:

PHY = VIRT + (PHYS_OFFSET – PAGE_OFFSET) VIRT = PHY – (PHYS_OFFSET – PAGE_OFFSET)

So it is possible to always calculate the physical address from a virtual address by an addition of a constant and to calculate the virtual address from a physical address by a subtraction, QED. That is why the initial stubs looked like this:

static inline unsigned long __virt_to_phys(unsigned long x)
{
    unsigned long t;
    __pv_stub(x, t, "add");
    return t;
}

static inline unsigned long __phys_to_virt(unsigned long x)
{
    unsigned long t;
    __pv_stub(x, t, "sub");
    return t;
}

The __pv_stub() would contain an assembly macro to do add or sub. Since then the LPAE support for more than 32-bit addresses made this code a great deal more complex, but the general idea is the same.

Whenever __virt_to_phys() or __phys_to_virt() is called in the kernel, it is replaced with a piece of inline assembly code from arch/arm/include/asm/memory.h, then the linker switches section to a section called .pv_table, and then adds a entry to that section with a pointer back to the assembly instruction it just added. This means that the section .pv_table will expand into a table of pointers to any instance of these assembly inlines.

During boot, we will go through this table, take each pointer, inspect each instruction it points to, patch it using the offset to the physical and virtual memory where we were actually loaded.

Patching phys to virt Each and every location with an assembly macro doing a translation from physical to virtual memory is patched in the early boot process.

Why are we doing this complex manouver instead of just storing the offset in a variable? This is for efficiency reasons: it is on the hot data path of the kernel. Calls updating page tables and cross-referencing physical to virtual kernel memory are extremely performance-critical, all use cases exercising the kernel virtual memory, whether block layer or network layer operations, or user-to-kernelspace translations, in principle any data passing through the kernel, will at some point call these functions. They must be fast.

It would be wrong to call this a simple solution to the problem. Alas it is a really complicated solution to the problem. But it works and it is very efficient!

Looping through the patch table

We perform the actual patching by figuring out the offset like illustrated above, and iteratively patching all the assembly stubs. This is done in the call to the symbol __fixup_pv_table, where our just calculated offset in r8 comes into play: a table of 5 symbols that need to refer directly to physical memory known as the __pv_table is read into registers r3..r7 augmented using the same method described above (that is why the table is preceded by a .long .):

__fixup_pv_table:
	adr	r0, 1f
	ldmia	r0, {r3-r7}
	mvn	ip, #0
	subs	r3, r0, r3	@ PHYS_OFFSET - PAGE_OFFSET
	add	r4, r4, r3	@ adjust table start address
	add	r5, r5, r3	@ adjust table end address
	add	r6, r6, r3	@ adjust __pv_phys_pfn_offset address
	add	r7, r7, r3	@ adjust __pv_offset address
	mov	r0, r8, lsr #PAGE_SHIFT	@ convert to PFN
	str	r0, [r6]	@ save computed PHYS_OFFSET to __pv_phys_pfn_offset
	(...)
	b	__fixup_a_pv_table

1:	.long	.
	.long	__pv_table_begin
	.long	__pv_table_end
2:	.long	__pv_phys_pfn_offset
	.long	__pv_offset

The code uses the first value loaded into r3 to calculate the offset to physical memory, then adds that to each of the other registers so that r4 thru r7 now point directly to the physical memory contained in each of these labels. So r4 is a pointer to the physical memory storing __pv_table_begin, r5 points to __pv_table_end, r6 point to __pv_phys_pfn_offset and r7 points to __pv_offset. In C these would be u32 *, so pointers to some 32 bit numbers.

__pv_phys_pfn_offset is especially important, this means patch physical to virtual page frame number offset, so we first calculate this by making a logical shift right with mov r0, r8, lsr #PAGE_SHIFT using the value we earlier calculated in r8 (the offset from 0 to the kernel memory in our case 0x10000000), and writing it right into whatever location actually stores that variable with str r0, [r6]. This isn't used at this early stage of the kernel start-up but it will be used by the virtual memory manager later.

Next we call __fixup_a_pv_table which will loop over the addresses from r4 thru r5 (the table of pointers to instructions to be patched) and patch each of them in order using a custom binary patcher that convert the instructions whether in ARM or THUMB2 (the instruction set is known at compile time) to one with an immediate offset between physical and virtual memory encoded right into it. The code is pretty complex and contains quirks to accommodate for big endian byte order as well.

Notice that this also has to happen every time the kernel loads a module, who knows if the new module needs to convert between physical and virtual addresses! For this reason all module ELF files will contain a .pv_table section with the same kind of table and the very same assembly loop is called also every time we load a module.

Setting Up the Initial Page Table

Before we can start to execute in this virtual memory we have just chiseled out, we need to set up a MMU translation table for mapping the physical memory to virtual memory. This is commonly known as a page table, albeit what will use for the initial map is not pages but sections. The ARM architecture mandates that this must be placed on an even 16KB boundary in physical memory. Since it is also always 16KB in size this makes sense.

The location of the initial page table is defined in a symbol named swapper_pg_dir, “swapper page directory”, which is an idiomatic name for the initial page table for the kernel. As you realize it will later on be replaced (swapped, hence “swapper”) by a more elaborate page table.

The name “page table” is a bit misleading because what the initial mapping is using in ARM terminology is actually called sections, not pages. But the term is used a bit fuzzily to refer to “that thing which translates physical addresses to virtual addresses at boot”.

The symbol swapper_pg_dir is defined to be at KERNEL_RAM_VADDR - PG_DIR_SIZE. Let’s examine each of these.

KERNEL_RAM_VADDR is just what you would suspect: the address in the virtual memory where the kernel is located. It is the address to which the kernel was linked during compile.

The KERNEL_RAM_VADDR is defined to (PAGE_OFFSET + TEXT_OFFSET). PAGE_OFFSET can be one of four locations from the Kconfig symbol we just investigated, typically 0xC0000000. The TEXT_OFFSET is typically 0x8000, so the KERNEL_RAM_VADDR will typically be 0xC0008000 but with a different virtual memory split setting or exotic TEXT_OFFSET it will be something different. TEXT_OFFSET comes from textof-y in arch/arm/Makefile and is usually 0x8000, but can also be 0x00208000 on some Qualcomm platforms and 0x00108000 again on some Broadcom platforms, making KERNEL_RAM_VADDR 0xC0208000 or similar.

What we know for sure is that this address was passed to the linker when linking the kernel: if you inspect the linker file for the ARM architecture at arch/arm/kernel/vmlinux.lds.S like we did in the beginning of the article you can see that the linker is indeed instructed to place this at . = PAGE_OFFSET + TEXT_OFFSET. The kernel has been compiled to execute at the address given in KERNEL_RAM_VADDR, even if the very earliest code, that we are analyzing right now, has been written in assembly to be position-independent.

TEXT OFFSET illustration The TEXT_OFFSET is a small, typically 32KB area above the kernel RAM in both physical and virtual memory space. The location of the kernel in the physical RAM at 0x10000000 is just an example, this can be on any even 16MB boundary.

Notice that the little gap above the kernel, most commonly at 0xC0000000-0xC0007FFF (the typical 32KB if TEXT_OFFSET). While part of the kernelspace memory this is not memory that the kernel was compiled into: it contains the initial page table, potentially ATAGs from the boot loader and some scratch area and could be used for interrupt vectors if the kernel is located at 0x00000000. The initial page table here is certainly programmed and used by the kernel, but it is abandoned afterwards.

Notice that we add the same TEXT_OFFSET to physical and virtual memory alike. This is even done in the decompression code, which will decompress the kernel first byte of the kernel to PHYS_OFFSET + TEXT_OFFSET, so the linear delta between a kernel location in physical memory and the same location in virtual memory is always expressed in a few high bits of a 32bit word, such as bits 24 thru 31 (8 bits) making it possible to use immediate arithmetic and include the offset directly in the instruction we use when facilitating ARM_PATCH_PHYS_VIRT. From this comes the restriction that kernel RAM must begin at an address evenly divisible by 16MB (0x01000000).

So we find the physical location of swapper_pg_dir in virtual memory by adding TEXT_OFFSET to PAGE_OFFSET and then backing off PG_DIR_SIZE from there. In the common case this will be 0xC0000000 + 0x8000 - 0x4000 so the initial page table will be at 0xC0004000 in virtual memory and at the corresponding offset in physical memory, in my example with PHYS_OFFSET being 0x10000000 it will be at 0x10004000.

Inital page translation table in swapper pg dir The swapper_pg_dir symbol will in our example reside 16KB (0x4000 bytes) before the .text segment in physical memory, at 0x10004000 given that our PHYS_OFFSET is 0x10000000 and TEXT_OFFSET is 0x8000 while using classic ARM MMU.

If you use LPAE the page table PG_DIR_SIZE is 0x5000 so we end up at 0xC0003000 virtual and 0x10003000 physical. The assembly macro pgtbl calculates this for us: we take the physical address we have calculated in r8, add TEXT_OFFSET and subtract PG_DIR_SIZE and we end up on the physical address for the initial translation table.

The initial page table is constructed in a very simple way: we first fill the page table with zeroes, then build the initial page table. This happens at symbol __create_page_tables.

The ARM32 Page Table Format

The page table layout on ARM32 consists of 2 or 3 levels. The 2-level page table is referred to as the short format page table and the 3-level format is called the long format page table in ARM documentation. The long format is a feature of the large physical address extension, LPAE and as the name says it is there to handle larger physical memories requiring up to 40 bits of physical address.

These translation tables can translate memory in sections of 1MB (this is 2MB on LPAE but let’s leave that aside for now) or in pages of 16kB. The initial page table uses sections, so memory translation in the initial page table is exclusively done in sections of 1MB.

This simplifies things for the initial mapping: sections can be encoded directly in the first level of the page table. This way there is in general no need to deal with the complex 2-level hierarchy, and we are essentially treating the machine as something with 1-level 1MB section tables.

However if we are running on LPAE there is another level inbetween, so we need to deal with the first two levels and not just one level. That is why you find some elaborate LPAE code here. All it does is insert a 64bit pointer to our crude 2MB-sections translation table.

Page table format In the classic MMU we just point the MMU to a 1st level table with 1MB sections while for LPAE we need an intermediate level to get to the similar format with 2MB. The entries are called section descriptors.

Linux Page Table Lingo

At this point the code starts to contain some three-letter acronyms. It is dubious whether these are actually meaningful when talking about this crude initial page table, but developers refer to these short-hands out of habit.

  • PGD page global directory is Linux lingo for the uppermost translation table, where the whole MMU traversal to resolve sections and pages start. In our case this is at 0x10004000 in physical memory. In the ARM32 world, this is also the address that we write into the special CP15 translation table register (sometimes called TTBR0) to tell the MMU where to find the translations. If we were using LPAE this would be at 0x10003000 just to make space for that 0x1000 with a single 64bit pointer to the next level called PMD.
  • P4D page 4th level directory and PUD page upper directory are concepts in the Linux VMM (virtual memory manager) unused on ARM as these deal with translation table hierarchies of 4 or 5 levels and we use only 2 or 3.
  • PMD page middle directory is the the name for 3rd level translation stage only used in LPAE. This is the reason we need to reserve another 0x1000 bytes for the LPAE initial page table. For classic ARM MMU (non-LPAE) the PMD and the PGD is the same thing. This is why the code refers to PMD_ORDER for classic MMU and LPAE alike. Because of Linux VMM terminology this is understood as the “format of the table right above the PTE:s”, and since we’re not using PTEs but section mappings, the “PMD section table” is the final product of our mappings.
  • PTE:s, page table entries maps pages of RAM from physical to virtual memory. The initial boot translation table does not use these. We use sections.

Notice again: for classic ARM MMU PGD and PMD is the same thing. For LPAE it is two different things. This is sometimes referred to as “folding” PMD into PGD. Since in a way we are “folding” P4D and PUD as well, it becomes increasingly confusing.

If you work with virtual memory you will need to repeat the above from time to time. For our purposes of constructing the initial “page” table none of this really matters. What we are constructing is a list of 1MB sections mapping virtual memory to physical memory. We are not even dealing with pages at this point, so the entire “pages” terminology is mildly confusing.

The Binary Format of the Translation Table

Let use consider only the classic ARM MMU in the following.

From physical address 0x10004000 to 0x10007FFF we have 0x1000 (4096) section descriptors of 32 bits (4 bytes) each. How are these used?

The program counter and all CPU access will be working on virtual addresses after we turn on the MMU, so the translation will work this way: a virtual address is translated into a physical address before we access the bus.

The way the physical address is determined from the virtual address is as follows:

We take bits 31..20 of the virtual address and use this as index into the translation table to look up a 32bit section descriptor.

  • Addresses 0x00000000-0x000FFFFF (the first 1MB) in virtual memory will be translated by the index at 0x10004000-0x10004003, the first four bytes of the translation table, we call this index 0.
  • Virtual address 0x00100000-0x001FFFFF will be translated by index 1 at 0x10004004-0x10004007, so index times four is the byte address of the descriptor.
  • Virtual address 0xFFF00000-0xFFFFFFFF will be translated by index 0xFFF at 0x10003FFC-0x10003FFF

Oh. Clever. 0x4000 (16KB) of memory exactly spans a 32bit i.e. 4GB memory space. So using this MMU table we can make a map of any virtual memory address 1MB chunk to any physical memory 1MB chunk. This is not a coincidence.

This means for example that for the kernel virtual base 0xC0000000 the index into the table will be 0xC0000000 >> 20 = 0xC00 and since the index need to be multiplied by 4 to get the actual byte index we get 0xC00 * 4 = 0x3000 so at physical address 0x10004000 + 0x3000 = 0x10007000 we find the section descriptor for the first 1MB of the kernelspace memory.

The 32-bit 1MB section descriptors we will use for our maps may have something like this format:

Mock section descriptor

The MMU will look at bits 1,0 and determine that “10” means that this is a section mapping. We set up some default values for the other bits. Other than that we only really care about setting bits 31..20 to the right physical address and we have a section descriptor that will work. And this is what the code does.

For LPAE this story is a bit different: we use 64bit section descriptors (8 bytes) but at the same time, the sections are bigger: 2 MB instead of 1 MB, so in the end the translation table will have exactly the same size of 0x4000 bytes.

Identity mapping around MMU enablement code

First of all we create an identity mapping around the symbol __turn_mmu_on, meaning this snippet of code will be executing in memory that is mapped 1:1 between physical and virtual addresses. If the code is at 0x10009012 then the virtual address for this code will also be at 0x10009012. If we inspect the code we see that it is put into a separate section called .idmap.text. Creating a separate section means that this gets linked into a separate physical page with nothing else in it, so we map an entire 1MB section for just this code (maybe even two, if it happens to align just across a section boundary in memory), so that an identity mapping is done for this and this code only.

If we think about it this is usually cool, even if it spans say 2MB of memory: if we loaded the kernel at 0x10000000 as in our example, this code will be at something like 0x10000120 with an identity mapping at the same address this isn’t ever going to disturb the kernel at 0xC0000000 or even at 0x40000000 which is an extreme case of kernel memory split. If someone would start to place the start of physical memory at say 0xE0000000 we would be in serious trouble. Let’s pray that this never happens.

Mapping the rest

Next we create the physical-to virtual memory mapping for the main matter, starting at PHYS_OFFSET in the physical memory (the value we calculated in r8 in the section “Where Are We Executing”) and PAGE_OFFSET (a compile-time constant) in the virtual memory, then we move along one page at the time all the way until we reach the symbol _end in virtual memory which is set to the end of the .bss section of the kernel object itself.

Initial virtual memory mapping The initial page table swapper_pg_dir and the 1:1 mapped one-page-section __turn_mmu_on alongside the physical to virtual memory mapping at early boot. In this example we are not using LPAE so the initial page table is -0x4000 from PHYS_OFFSET and memory ends at 0xFFFFFFFF.

BSS is an idiomatic name for the section at the very end of the binary kernel footprint in memory and this is where the C compiler assigns locations for all runtime variables. The addresses for this section are well defined, but there is no binary data for them: this memory has undefined contents, i.e. whatever was in it when we mapped it in.

It is worth studying this loop of assembly in detail to understand how the mapping code works:

    ldr    r7, [r10, #PROCINFO_MM_MMUFLAGS] @ mm_mmuflags
    (...)
    add    r0, r4, #PAGE_OFFSET >> (SECTION_SHIFT - PMD_ORDER)
    ldr    r6, =(_end - 1)
    orr    r3, r8, r7
    add    r6, r4, r6, lsr #(SECTION_SHIFT - PMD_ORDER)
1:  str    r3, [r0], #1 << PMD_ORDER
    add    r3, r3, #1 << SECTION_SHIFT
    cmp    r0, r6
    bls    1b

Let us take this step by step. We assume that we use non-LPAE classic ARM MMU in this example (you can convince yourself that the same holds for LPAE).

add    r0, r4, #PAGE_OFFSET >> (SECTION_SHIFT - PMD_ORDER)

r4 contains the physical address of the page table (PGD or PMD) where we will set up our sections. (SECTION_SHIFT - PMD_ORDER) resolves to (20 – 2) = 18, so we take PAGE_OFFSET 0xC0000000 >> 18 = 0x3000 which is incidentally the absolute index into the translation table for 0xC0000000 as we saw earlier: aha. Well that makes sense since the index is 4 bytes. So whenever we see (SECTION_SHIFT - PMD_ORDER) that means “convert to absolute physical index into the translation table for this virtual address”, in our example case this will be 0x10003000.

So the first statement generates the physical address for the first 32bit section descriptor for the kernelspace memory in r0.

ldr    r6, =(_end - 1)

r6 is obviously set to the very end of the kernelspace memory.

    ldr    r7, [r10, #PROCINFO_MM_MMUFLAGS] @ mm_mmuflags
    (...)
    orr    r3, r8, r7

r8 contains the PHYS_OFFSET, in our case 0x10000000 (we count on bits 19..0 to be zero) and we OR that with r7 which represents what is jokingly indicated as “who cares” in the illustration above, is a per-CPU setting for the MMU flags defined for each CPU in arch/arm/mm/proc-*.S. Each of these files contain a special section named .proc.info.init and at index PROCINFO_MM_MMUFLAGS (which will be something like 0x08) there is the right value to OR into the section descriptor for the specific CPU we are using. The struct itself is named struct proc_info_list and can be found in arch/arm/include/asm/procinfo.h. Since assembly cannot really handle C structs some indexing trickery is used to get at this magic number.

So the section descriptor has a physical address in bits 31-20 and this value in r7 sets some more bits, like the two lowest, so that the MMU can handle this section descriptor right.

add    r6, r4, r6, lsr #(SECTION_SHIFT - PMD_ORDER)

This will construct the absolute physical index address for the section descriptor for the last megabyte of memory that we map. We are not OR:in on the value from r7: we till just use this for a loop comparison, not really write it into the translation table, so it is not needed.

So now r0 is the physical address of the first section descriptor we are setting up and r6 is the physical address of the last section descriptor we will set up. We enter the loop:

1:  str    r3, [r0], #1 << PMD_ORDER
    add    r3, r3, #1 << SECTION_SHIFT
    cmp    r0, r6
    bls    1b

This writes the first section descriptor into the MMU table, to the address in r0 which will start at 0x10003000 in our example. Then we post-increment the address in r0 with (1 << PMD_ORDER) which will be 4. Then we increase the descriptor physical address portion (bit 20 and upward) with 1 MB (1 << SECTION_SHIFT), check we we reached the last descriptor else loop to 1:.

This creates a virtual-to-physical map of the entire kernel including all segments and .bss in 1MB chunks.

Final mappings

Next we map some other things: we map the boot parameters specifically: this can be ATAGs or an appended device tree blob (DTB). ATAGs are usually the very first page of memory (at PHYS_OFFSET + 0x0100) While contemporary DTBs are usually somewhere below the kernel. You will notice that it would probably be unwise to place the DTB too far above the kernel, lest it will wrap around to lower addresses.

If we are debugging we also map the serial port specifically from/to predefined addresses in physical and virtual memory. This is done so that debugging can go on as we execute in virtual memory.

There is again an exception for “execute in place”: if we are executing from ROM we surely need to map the kernel from that specific address rather than the compile-time memory locations.

Jump to Virtual Memory

We are nearing the end of the procedure stext where we started executing the kernel.

First we call the per-cpu-type “procinit” function. This is some C/assembly clever low-level CPU management code that is collected per-CPU in arch/arm/mm/proc-*.S. For example most v7 CPUs have the initialization code in proc-v7.S and the ARM920 has its initialization code in proc-arm920.S. This will be useful later, but the “procinit” call is usually empty: only XScale really does anything here an it is about bugs pertaining to the boot loader initial state.

The way the procinit code returns is by the conventional ret lr which actually means that the value of the link register (lr) is assigned to the program counter (pc).

Before entering the procinit function we set lr to the physical address of label 1: which will make a relative branch to the symbol __enable_mmu. We also assign r13 the address of __mmap_switched, which is the compiletime-assigned, non-relative virtual address of the next execution point after the MMU has been enabled. We are nearing the end of relative code constructions.

We jump to __enable_mmu. r4 contains the address of the initial page table. We load this using a special CP15 instruction designed to load the page table pointer (in physical memory) into the MMU:

mcr    p15, 0, r4, c2, c0, 0

Nothing happens so far. The page table address is set up in the MMU but it is not yet translating between physical and virtual addresses. Next we jump to __turn_mmu_on. This is where the magic happens. __turn_mmu_on as we said is cleverly compiled into the section .idmap.text which means it has the same address in physical and virtual memory. Next we enable the MMU:

    mcr    p15, 0, r0, c1, c0, 0        @ write control reg
    mrc    p15, 0, r3, c0, c0, 0        @ read id reg

BAM! The MMU is on. The next instruction (which is incidentally an instruction cache flush) will be executed from virtual memory. We don’t notice anything at first, but we are executing in virtual memory. When we return by jumping to the address passed in r13, we enter __mmap_switched at the virtual memory address of this function, somewhere below PAGE_OFFSET (typically 0xC0nnnnnn). We can now facilitate absolute addressing: the kernel is executing as intended.

Executing in virtual memory The little “dance” of the program counter as we switch execution from physical to virtual memory.

We have successfully bootstrapped the initial page table and we are now finally executing the kernel at the location the C compiler thought the kernel was going to execute at.

Now Let’s Get Serious

__mmap_switched can be found in the file arch/arm/kernel/head-common.S and will do a few specific things before we call the C runtime-enabled kernel proper.

First there is an exceptional clause, again for execute-in-place (XIP) kernels: while the .text segment of the kernel can certainly continue executing from ROM we cannot store any kind of variables from the .data segment there. So first this is set up by either just copying the segment to RAM or, using some fancy extra code, uncompressed to RAM (saving more silicon memory).

Next we zero out the .bss segment we got to know earlier: in the Linux kernel we rely on static variables being initialized to zero. Other C runtime environments may not do this, but when you run the Linux kernel you can rely on static variables being zero the first time you enter a function.

The machine has now switched to virtual memory and fully prepared the C runtime environment. We have also patched all cross-references between physical and virtual memory. We are ready to go.

So next we save the processor ID, machine type and ATAG or DTB pointer and branch to the symbol start_kernel(). This symbol resolves to an absolute address and is a C function found a bit down in the init/main.c file. It is fully generic: it is the same point that all other Linux architectures call, so we have now reached the level of generic kernel code written in C.

Let’s see where this is: to disassemble the kernel I just use the objdump tool from the toolchain and pipe it to less:

arm-linux-gnueabihf-objdump -D vmlinux |less

By searching for start_kernel by /start_kernel in less and skipping to the second hit we find:

c088c9d8 <start_kernel>:
c088c9d8:       e92d4ff0        push    {r4, r5, r6, r7, r8, r9, sl, fp, lr}
c088c9dc:       e59f53e8        ldr     r5, [pc, #1000] ; c088cdcc
c088c9e0:       e59f03e8        ldr     r0, [pc, #1000] ; c088cdd0
c088c9e4:       e5953000        ldr     r3, [r5]
c088c9e8:       e24dd024        sub     sp, sp, #36     ; 0x24
c088c9ec:       e58d301c        str     r3, [sp, #28]
c088c9f0:       ebde25e8        bl      c0016198 <set_task_stack_end_magic>

Oh this is nice! We are executing C at address 0xC088C9D8, and we can now disassemble and debug the kernel as we like. When I have random crash dumps of the kernel I usually use exactly this method of combining objdump and less to disassemble the kernel and just search for the symbol that crashed me to figure out what is going on.

Another common trick among kernel developers is to enable low-level kernel debugging and putting a print at start_kernel() so that we know that we reach this point. My own favourite construction looks like this (I just insert these lines first in start_kernel()):

#if defined(CONFIG_ARM) && defined(CONFIG_DEBUG_LL)
{
    extern void printascii(char *);
    printascii("start_kernel\n");
}
#endif

As can be seen, getting low-level debug prints like this to work involves enabling CONFIG_DEBUG_LL, and then you should be able to see a life sign from the kernel before even the kernel banner is “Linux …” is printed.

This file and function should be well known to Linux kernel developers and a good thing to do on a boring day is to read through the code in this file, as this is how the generic parts of Linux bootstrap themselves.

The happiness of generic code will not last as we soon call setup_arch() and we are then back down in arch/arm again. We know for sure that our crude initial translation table will be replaced by a more elaborate proper translation table. There is not yet any such thing as a virtual-to-physical mapping of the userspace memory for example. But this is the topic of another discussion.

 
Read more...

from David Ahern

Long overdue blog post on XDP; so many details uncovered during testing causing tests to be redone.

This post focuses on a comparison of XDP and OVS in delivering packets to a VM from the perspective of CPU cycles spent by the host in processing those packets. There are a lot of variables at play, and changing any one of them radically affects the outcome, though it should be no surprise XDP is always lighter and faster.

Setup

I believe I am covering all of the settings here that I discovered over the past few months that caused variations in the data.

Host

The host is a standard, modern server (Dell PowerEdge R640) with an Intel® Xeon® Platinum 8168 CPU @ 2.70GHz with 96 hardware threads (48 cores + hyper threading to yield 96 logical cpus in the host). The server is running Ubuntu 18.04 with the recently released 5.8.0 kernel. It has a Mellanox Connectx4-LX ethernet card with 2 25G ports into an 802.3ad (LACP) bond, and the bond is connected to an OVS bridge.

Host setup

As discussed in [1] to properly compare the CPU costs of the 2 networking solutions, we need to consolidate packet processing to a single CPU. Handling all packets destined to the same VM on the same CPU avoids lock contention on the tun ring, so consolidating packets to a single CPU is actually best case performance.

Ensure RPS is disabled in the host:

for d in eth0 eth1; do
    find /sys/class/net/${d}/queues -name rps_cpus |
    while read f; do
            echo 0 | sudo tee ${f}
    done
done

and add flow rules in the NIC to push packets for the VM under test to a single CPU:

sudo ethtool -N eth0 flow-type ether dst 12:34:de:ad:ca:fe action 2
sudo ethtool -N eth1 flow-type ether dst 12:34:de:ad:ca:fe action 2

For this host and ethernet card, packets for queue 2 are handled on CPU 5 (consult /proc/interrupts for the mapping on your host).

XDP bypasses the qdisc layer, so to have a fair comparison make noqueue the default qdisc before starting the VM:

sudo sysctl -w net.core.default_qdisc=noqueue

(or add a udev rule [2]).

Finally, the host is fairly quiet with only one VM running (the one under test) and very little network traffic outside of the VM under test and a few, low traffic ssh sessions used to run commands to collect data about the tests.

Virtual Machine

The VM has 8 cpus and is also running Ubuntu 18.04 with a 5.8.0 kernel. It uses tap+vhost for networking with the tap device a port in the OVS bridge as shown in the picture above. The tap device has a single queue, and RPS is also disabled in the guest:

echo 00 | see tee /sys/class/net/eth0/queues -name rps_cpus

The VM is also quiet with no load running in the guest OS.

The point of this comparison is host side processing of packets, so packets are dropped in the guest as soon as possible using a bpf program [3] attached to eth0 as a tc filter. (Note: Theoretically, XDP should be used to drop the packets in the guest OS since it truly is the fewest cycles per packet. However, XDP in the VM requires a multi-queue NIC[5], and adding queues to the guest NIC has a huge affect on the results.)

In the host, the qemu threads corresponding to the guest CPUs (vcpus) are affined (as a set) to 8 hardware threads in the same NUMA node as CPU 5 (the host CPU processing packets per the RSS rules mentioned earlier). The vhost thread for the VM's tap device is also affined to a small set of host CPUs in the same NUMA node to avoid scheduling collisions with the vcpu threads, the CPU processing packets (5) and its sibling hardware thread (CPU 53 in my case) – all of which add variability to the results.

Forwarding with XDP

Packet forwarding with XDP is done by attaching an L2 forwarding program [4] to eth0 and eth1. The program pulls the VLAN and destination mac from the ethernet header, and uses the pair as a key for a lookup in a hash map. The lookup returns the next device index for the packet which for packets destined to the VM is the index of its tap device. If an entry is found, the packet is redirected to the device via XDP_REDIRECT. The use case was presented in depth at netdevconf 0x14 [5].

Packet generator

Packets are generated using 2 VMs on a server that is directly connected to the same TOR switches as the hypervisor running the VM under test. The point of the investigation is to measure the overhead of delivering packets to a VM, so memcpy is kept to a minimum by having the packet generator [6] in the VMs send 1-byte UDP packets.

Test setup

Each VM can generate a little over 1 million packets per sec (1M pps), for a maximum load of 2.2M pps based on 2 separate source addresses.

CPU Measurement

As discussed in [1] a fair number of packets are processed in the context of some interrupted, victim process or when handled on an idle CPU the cycles are not fully accounted in the softirq time shown in tools like mpstat.

This test binds openssl speed, a purely userspace command[1], to the CPU handling packets to fully consume 100% of all CPU cycles which makes the division of CPU time between user, system and softirq more transparent. In this case, the output of mpstat -P 5 shows how all of the cycles for CPU 5 were spent (within the resolution of system accounting): * %softirq is the time spent handling packets. This data is shown in the graphs below. * %usr represents the usable CPU time for processes to make progress on their workload. In this test, it shows the percentage of CPU consumed by openssl and compares to the times shown by openssl within 1-2%. * %sys is the percentage of kernel time and for the data shown below was always <0.2%.

As an example, in this mpstat output openssl is only getting 14.2% of the CPU while 85.8% was spent handling the packet load:

CPU    %usr   %nice    %sys  %iowait   %irq   %soft   %idle
  5   14.20    0.00    0.00     0.00   0.00   85.80   0.00

(%steal, %guest and %gnice dropped were always 0 and dropped for conciseness.)

Let's get to the data.

CPU Comparison

This chart shows a comparison of the %softirq required to handle various PPS rates for both OVS and XDP. Lower numbers are better (higher percentages mean more CPU cycles).

1-VM softirq

There is 1-2% variability in ksoftirqd percentages despite the 5-second averaging, but the variability does not really affect the important points of this comparison.

The results should not be that surprising. OVS has well established scaling problems and the chart shows that as packet rates increase. In my tests it was not hard to saturate a CPU with OVS, reaching a maximum packet rate to the VM of 1.2M pps. The 100% softirq at 1.5M pps and up is saturation of ksoftirqd alone with nothing else running on that CPU. Running another process on CPU 5 immediately affects the throughput rate as the CPU splits time between processing packets and running that process. With openssl, the packet rate to the VM is cut in half with packet drops at the host ingress as it can no longer keep up with the packet rate given the overhead of OVS.

XDP on the other hand could push 2M pps to the VM before the guest could no longer keep up with packet drops at the tap device (ie., no room in the tun ring meaning the guest has not processed the previous packets). As shown above, the host still has plenty of CPU to handle more packets or run workloads (preferred condition for a cloud host).

One thing to notice about the chart above is the apparent flat lining of CPU usage between 50k pps and 500k pps. That is not a typo, and the results are very repeatable. This needs more investigation, but I believe it shows the efficiencies kicking in from a combination of more packets getting handled per napi poll cycle (closer to maximum of the netdev budget) and the kernel side bulking in XDP before a flush is required.

Hosts typically run more than 1 VM, so let's see the effect of adding a second VM to the mix. For this case a second VM is started with the same setup as mentioned earlier, but now the traffic load is split equally between 2 VMs. The key point here is a single CPU processing interleaved network traffic for 2 different destinations.

2-VM softirq

For OVS, CPU saturation with ksoftirqd happens with a maximum packet rate to each VM of 800k pps (compared to 1.2M with only a single VM). The saturation is in the host with packet drops shown at host ingress, and again any competition for the CPU processing packets cuts the rate in half.

Meanwhile, XDP is barely affected by the second VM with a modest increase of 3-4% in softirq at the upper packet rates. In this case, the redirected packets are just hitting separate bulking queues in the kernel. The two packet generators are not able to hit 4+M pps to find the maximum per-VM rate.

Final Thoughts

CPU cycles are only the beginning for comparing network solutions. A full OVS-vs-XDP comparison needs to consider all the resources consumed – e.g., memory as well as CPU. For example, OVS has ovs-vswitchd which consumes a high amount of memory (>750MB RSS on this server with only the 2 VMs) and additional CPU cycles to handle upcalls (flow misses) and revalidate flow entries in the kernel which on an active hypervisor can easily consume 50+% cpu (not counting increased usage from various bugs[7]).

Meanwhile, XDP is still early in its lifecycle. Right now, using XDP for this setup requires VLAN acceleration in the NIC [5] to be disabled meaning the VLAN header has to be removed by the ebpf program before forwarding to the VM. Using the proposed hardware hints solution reduces the softirq time by another 1-2% meaning 1-2% more usable CPU by leveraging hardware acceleration with XDP. This is just an example of how XDP will continue to get faster as it works better with hardware offloads.

Acronyms

LACP Link Aggregation Control Protocol NIC Nework Interface Card NUMA Non-Uniform Memory Access OVS Open VSwitch PPS Packets per Second RPS Receive Packet Steering RSS Receive Side Scaling TOR Top-of-Rack VM Virtual Machine XDP Express Data Path in Linux

References

[1] https://people.kernel.org/dsahern/the-cpu-cost-of-networking-on-a-host [2] https://people.kernel.org/dsahern/rss-rps-locking-qdisc [3] https://github.com/dsahern/bpf-progs/blob/master/ksrc/rx_acl.c [4] https://github.com/dsahern/bpf-progs/blob/master/ksrc/xdp_l2fwd.c [5] https://netdevconf.info/0x14/session.html?tutorial-XDP-and-the-cloud [6] https://github.com/dsahern/random-cmds/blob/master/src/pktgen.c [7] https://www.mail-archive.com/ovs-dev@openvswitch.org/msg39266.html

 
Read more...

from Jakub Kicinski

All modern NICs implement IRQ coalescing (ethtool -c/-C), which delays RX/TX interrupts hoping more frames arrive in the meantime to allow for batch processing. IRQ coalescing trades off latency for system throughput.

It’s commonly believed that the higher the packet rate, the more batching system needs to keep up. At lower rates the batching would be limited, anyway. This leads to the idea of adaptive IRQ coalescing where the NIC itself – or more likely the driver – adjusts the IRQ timeouts based on the recent rate of packet arrivals.

Unfortunately adaptive coalescing is not a panacea, as it often has predefined range of values it chooses from, and it costs extra CPU processing to continuously recalculate and update the rate (especially with modern NICs which often need to talk to firmware to change settings rather than simply writing to device registers).

The summary above – while correct (I hope :)) misses one important point. There are two sets of IRQ coalescing settings, and only one of them has a significant latency impact. NICs (and the Linux kernel ethtool API) have separate settings for RX and TX. While RX processing is more costly, and therefore playing with RX settings feels more significant – RX batching costs latency. For TX processing (or actually TX completion processing) the latency matters much, much less.

With a simple bpftrace command:

bpftrace -e 'tracepoint:napi:napi_poll { @[args->work] = count(); }'

we can check how many RX packets get received on every NAPI poll (that’s to say how many packets get coalesced). On a moderately loaded system the top entries may look something like:

@[4]: 750
@[3]: 2180
@[2]: 15828
@[1]: 233080
@[0]: 298525

where the first number (in square brackets) is the number of packets coalesced, and the second number is a counter of occurrences.

The 0 work done entries (@[0]: 298525) usually mean that the driver received a TX interrupt, and there were no RX packets to process. Drivers will generally clear their TX rings while doing RX processing – so with TX processing being less latency sensitive – in an ideal scenario we’d like to see no TX interrupts at all, but rather have TX processing piggy back on the RX interrupts.

How high can we set the TX coalescing parameters, then? If the workload is mostly using TCP all we really need to ensure is that we don’t run awry of TCP Small Queues (/proc/sys/net/ipv4/tcp_limit_output_bytes) which is the number of bytes TCP stack is willing to queue up to the NIC without getting a TX completion.

For example recent upstream kernels have TSQ of 1MB, so even with a 50GB NIC – delaying TX interrupts for up to 350us should be fine. Obviously we want to give ourselves a safety margin for scheduling delays, timer slack etc. Additionally, according to my experiments the gains of TX coalescing above 200us are perhaps too low for the risk.

Repeating the bpftrace command from above after setting coalescing to 150us / 128 frames:

@[4]: 831
@[3]: 2066
@[2]: 16056
@[0]: 177186
@[1]: 228985

We see far less @[0] occurrences compared to the first run. The gain in system throughput depends on the workload, I’ve seen an increase of 6% on the workload I tested with.

A word of warning – even though upstream reviewers try to make sure drivers behave sanely and return errors for unsupported configurations – there are vendors out there who will silently ignore TX coalescing settings...

 
Read more...

from linusw

ARM traditionally uses compressed kernels. This is done for two major reasons:

  • It saves space on the flash memory or other storage media holding the kernel, and memory is money. For example for the Gemini platform that I work on, the vmlinux uncompressed kernel is 11.8 MB while the compressed zImage is a mere 4.8 MB, we save more than 50%
  • It is faster to load because the time it takes for the decompression to run is shorter than the time that it takes to transfer an uncompressed image from the storage media, such as flash. For NAND flash controllers this can easily be the case.

This is intended as a comprehensive rundown of how the Linux kernel self-decompresses on ARM 32-bit legacy systems. All machines under arch/arm/* uses this method if they are booted using a compressed kernel, and most of them are using compressed kernels.

Bootloader

The bootloader, whether RedBoot, U-Boot or EFI places the kernel image somewhere in physical memory and executes it passing some parameters in the lower registers.

Russell King defined the ABI for booting the Linux kernel from a bootloader in 2002 in the Booting ARM Linux document. The boot loader puts 0 into register r0, an architecture ID into register r1 and a pointer to the ATAGs in register r2. The ATAGs would contain the location and size of the physical memory. The kernel would be placed somewhere in this memory. It can be executed from any address as long as the decompressed kernel fits. The boot loader then jumps to the kernel in supervisor mode, with all interrupts, MMUs and caches disabled.

On contemporary device tree kernels, r2 is repurposed as a pointer to the device tree blob (DTB) in physical memory. (In this case r1 is ignored.) A DTB can also be appended to the kernel image, and optionally amended using the ATAGs from r2. We will discuss this more below.

Decompression of the zImage

If the kernel is compressed, execution begins in arch/arm/boot/compressed/head.S in the symbol start: a little bit down the file. (This is not immediately evident.) It begins with 8 or 7 NOP instructions for legacy reasons. It jumps over some magic numbers and saves the pointer to the ATAGs. So now the kernel decompression code is executing from the physical address of the physical memory where it was loaded.

The decompression code then locates the start of physical memory. On most modern platforms this is done with the Kconfig-selected code AUTO_ZRELADDR, which means a logical AND between the program counter and 0xf8000000. This means that the kernel readily assumes that it has been loaded and executed in the first part of the first block of physical memory.

There are patches being made that would instead attempt to get this information from the device tree.

Then the TEXT_OFFSET is added to the pointer to the start of physical memory. As the name says, this is where the kernel .text segment (as output from the compiler) should be located. The .text segment contains the executable code so this is the actual starting address of the kernel after decompression. The TEXT_OFFSET is usually 0x8000 so the kernel will be located 0x8000 bytes into the physical memory. This is defined in arch/arm/Makefile.

The 0x8000 (32KB) offset is a convention, because usually there is some immobile architecture-specific data placed at 0x00000000 such as interrupt vectors, and many elder systems place the ATAGs at 0x00000100. There also must be some space, because when the kernel finally boots, it will subtract 0x4000 (or 0x5000 for LPAE) from this address and store the initial kernel page table there.

For some specific platforms the TEXT_OFFSET will be pushed downwards in memory, notably some Qualcomm platforms will push it to 0x00208000 because the first 0x00200000 (2 MB) of the physical memory is used for shared memory communication with the modem CPU.

Next the decompression code sets up a page table, if it is possible to fit one over the whole uncompressed+compressed kernel image. The page table is not for virtual memory, but for enabling cache, which is then turned on. The decompression will for natural reasons be much faster if we can use cache.

Next the kernel sets up a local stack pointer and malloc() area so we can handle subroutine calls and small memory allocation going forward, executing code written in C. This is set to point right after the end of the kernel image.

Memory set-up Compressed kernel in memory with an attached DTB.

Next we check for an appended DTB blob enabled by the ARM_APPENDED_DTB symbol. This is a DTB that is added to the zImage during build, often with the simple cat foo.dtb >> zImage. The DTB is identified using a magic number, 0xD00DFEED.

If an appended DTB is found, and CONFIG_ARM_ATAG_DTB_COMPAT is set, we first expand the DTB by 50% and call atagstofdt that will augment the DTB with information from the ATAGs, such as memory blocks and sizes.

Next. the DTB pointer (what was passed in as r2 in the beginning) is overwritten with a pointer to the appended DTB, we also save the size of the DTB, and set the end of the kernel image after the DTB so the appended DTB (optionally modified with the ATAGs) is included in the total size of the compressed kernel. If an appended DTB was found, we also bump the stack and the malloc() location so we don’t destroy the DTB.

Notice: if a device tree pointer was passed in in r2, and an appended DTB was also supplied, the appended DTB “wins” and is what the system will use. This can sometimes be used to override a default DTB passed by a boot loader.

Notice: if ATAGs were passed in in r2, there certainly was no DTB passed in through that register. You almost always want the CONFIG_ARM_ATAG_DTB_COMPAT symbol if you use an elder boot loader that you do not want to replace, as the ATAGs properly defines the memory on elder platforms. It is possible to define the memory in the device tree, but more often than not, people skip this and rely on the boot loader to provide this, one way (the bootloader alters the DTB) or another (the ATAGs augment the appended DTB at boot).

Memory overlap The decompressed kernel may overlap the compressed kernel.

Next we check if we would overwrite the compressed kernel with the uncompressed kernel. That would be unfortunate. If this would happen, we check where in the memory the uncompressed kernel would end, and then we copy ourselves (the compressed kernel) past that location.

Then the code simply does a trick to jump back to the relocated address of a label called restart: which is the start of the code to set up the stack pointer and malloc() area, but now executing at the new physical address.

This means it will again set up the stack and malloc() area and look for the appended DTB and everything will look like the kernel was loaded in this location to begin with. (With one difference though: we have already augmented the DTB with ATAGs, so that will not be done again.) This time the uncompressed kernel will not overwrite the compressed kernel.

Moving the compressed kernel We move the compressed kernel down so the decompressed kernel can fit.

There is no check for if the memory runs out, i.e. if we would happen to copy the kernel beyond the end of the physical memory. If this happens, the result is unpredictable. This can happen if the memory is 8MB or less, in these situations: do not use compressed kernels.

Moving the compressed kernel below the decompressed kernel The compressed kernel is moved below the decompressed kernel.

Now we know that the kernel can be decompressed into a memory that is below the compressed image and that they will not collide during decompression and we execute at the label wont_overwrite:.

We check if we are executing on the address the decompressor was linked to, and possibly alter some pointer tables. This is for the C runtime environment executing the decompressor.

We make sure that the caches are turned on. (There is not certainly space for a page table.)

We clear the BSS area (so all uninitialized variables will be 0), also for the C runtime environment.

Next we call the decompress_kernel() symbol in boot/compressed/misc.c which in turn calls do_decompress() which calls __decompress() which will perform the actual decompression.

This is implemented in C and the type of decompression is different depending on Kconfig options: the same decompressor as the compression selected when building the kernel will be linked into the image and executed from physical memory. All architectures share the same decompression library. The __decompress() function called will depend on which of the decompressors in lib/decompress_*.c that was linked into the image. The selection of decompressor happens in arch/arm/boot/compressed/decompress.c by simply including the whole decompressor into the file.

All the variables the decompressor needs about the location of the compressed kernel are set up in the registers before calling the decompressor.

After decompression After decompression, the decompressed kernel is at TEXT_OFFSET and the appended DTB (if any) remains where the compressed kernel was.

After the decompression, we call get_inflated_image_size() to get the size of the final, decompressed kernel. We then flush and turn off the caches again.

We then jump to the symbol __enter_kernel which sets r0, r1 and r2 as the boot loader would have left them, unless we have an attached device tree blob, in which case r2 now points to that DTB. We then set the program counter to the start of the kernel, which will be the start of physical memory plus TEXT_OFFSET, typically 0x00008000 on a very conventional system, maybe 0x20008000 on some Qualcomm systems.

We are now at the same point as if we had loaded an uncompressed kernel, the vmlinux file, into memory at TEXT_OFFSET, passing (typically) a device tree in r2.

Kernel startup: executing vmlinux

The uncompressed kernel begins executing at the symbol stext(), start of text segment. This code can be found in arch/arm/kernel/head.S.

This is a subject of another discussion. However notice that the code here does not look for an appended device tree! If an appended device tree should be used, you must use a compressed kernel. The same goes for augmenting any device tree with ATAGs. That must also use a compressed kernel image, for the code to do this is part of the assembly that bootstraps a compressed kernel.

Looking closer at a kernel uncompress

Let us look closer at a Qualcomm APQ8060 decompression.

First you need to enable CONFIG_DEBUG_LL, which enables you to hammer out characters on the UART console without any intervention of any higher printing mechanisms. All it does is to provide a physical address to the UART and routines to poll for pushing out characters. It sets up DEBUG_UART_PHYS so that the kernel knows where the physical UART I/O area is located. Make sure these definitions are correct.

First enable a Kconfig option called CONFIG_DEBUG_UNCOMPRESS. All this does is to print the short message “Uncompressing Linux…” before decompressing the kernel and , “done, booting the kernel” after the decompression. It is a nice smoke test to show that the CONFIG_DEBUG_LL is set up and DEBUG_UART_PHYS is correct and decompression is working but not much more. This does not provide any low-level debug.

The actual kernel decompression can be debugged and inspected by enabling the DEBUG define in arch/arm/boot/compressed/head.S, this is easiest done by tagging on -DDEBUG to the AFLAGS (assembler flags) for head.S in the arch/arm/boot/compressed/Makefile like this:

AFLAGS_head.o += -DTEXT_OFFSET=$(TEXT_OFFSET) -DDEBUG

We then get this message when booting:

C:0x403080C0-0x40DF0CC0->0x41801D00-0x422EA900

This means that as we were booting I loaded the kernel to 0x40300000 which would collide with the uncompressed kernel. Therefore the kernel was copied to 0x41801D00 which is where the uncompressed kernel will end. Adding some further debug prints we can see that an appended DTB is first found at 0x40DEBA68 and after moving the kernel down it is found at 0x422E56A8, which is where it remains when the kernel is booted.

 
Read more...

from Christian Brauner

In my last article I looked at the seccomp notifier in detail and how it allows us to make unprivileged containers way more capable (Sorry, kernel joke.). This is the (very) crazy (but very short) sequel. (Sorry Jon, no novella this time. :))

Last time I mentioned two new features that we had landed:

  1. Retrieving file descriptors from another task via pidfd_getfd()
  2. Injection file descriptors via the new SECCOMP_IOCTL_NOTIF_ADDFD ioctl on the seccomp notifier

The 2. feature just landed in the merge window for v5.9. So what better time than now to boot a v5.9 pre-rc1 kernel and play with the new features.

I said that these features make it possible to intercept syscalls that return file descriptors or that pass file descriptors to the kernel. Syscalls that come to mind are open(), connect(), dup2(), but also bpf(). People that read the first blogpost might not have realized how crazy^serious one can get with these two new features so I thought it be a good exercise to illustrate it. And what better victim than bpf().

As we know, bpf() and unprivileged containers don't get along too well. But that doesn't need to be the case. For the demo you're about to see I enabled LXD to supervise the bpf() syscalls for tasks running in unprivileged containers. We will intercept the bpf() syscalls for the BPF_PROG_LOAD command for BPF_PROG_TYPE_CGROUP_DEVICE program types and the BPF_PROG_ATTACH, and BPF_PROG_DETACH commands for the BPF_CGROUP_DEVICE attach type. This allows a nested unprivileged container to load its own device profile in the cgroup2 hierarchy.

This is just a tiny glimpse into how this can be used and extended. ;) The pull request for LXD is already up here. Let's see if the rest of the team thinks I'm going crazy. :)

asciicast

 
Read more...

from David Ahern

I recently learned this fun fact: With RSS or RPS enabled [1] and a lock-based qdisc on a VM's tap device (e.g., fq_codel) a UDP packet storm targeted at the VM can severely impact the entire server.

The point of RSS/RPS is to distribute the packet processing load across all hardware threads (CPUs) in a server / host. However, when those packets are forwarded to a single device that has a lock-based qdisc (e.g., virtual machines and a tap device or a container and veth based device) that distributed processing causes heavy spinlock contention resulting in ksoftirqd spinning on all CPUs trying to handle the packet load.

As an example, my server has 96 cpus and 1 million udp packets per second targeted at the VM is enough to push all of the ksoftirqd threads to near 100%:

  PID %CPU COMMAND               P
   58 99.9 ksoftirqd/9           9
  128 99.9 ksoftirqd/23         23
  218 99.9 ksoftirqd/41         41
  278 99.9 ksoftirqd/53         53
  318 99.9 ksoftirqd/61         61
  328 99.9 ksoftirqd/63         63
  358 99.9 ksoftirqd/69         69
  388 99.9 ksoftirqd/75         75
  408 99.9 ksoftirqd/79         79
  438 99.9 ksoftirqd/85         85
 7411 99.9 CPU 7/KVM            64
   28 99.9 ksoftirqd/3           3
   38 99.9 ksoftirqd/5           5
   48 99.9 ksoftirqd/7           7
   68 99.9 ksoftirqd/11         11
   78 99.9 ksoftirqd/13         13
   88 99.9 ksoftirqd/15         15
   ...

perf top shows the spinlock contention:

    96.79%  [kernel]          [k] queued_spin_lock_slowpath
     0.40%  [kernel]          [k] _raw_spin_lock
     0.23%  [kernel]          [k] __netif_receive_skb_core
     0.23%  [kernel]          [k] __dev_queue_xmit
     0.20%  [kernel]          [k] __qdisc_run

With the callchain leading to

    94.25%  [kernel.vmlinux]    [k] queued_spin_lock_slowpath
            |
             --94.25%--queued_spin_lock_slowpath
                       |
                        --93.83%--__dev_queue_xmit
                                  do_execute_actions
                                  ovs_execute_actions
                                  ovs_dp_process_packet
                                  ovs_vport_receive

A little code analysis shows this is the qdisc lock in __dev_xmit_skb.

The overloaded ksoftirqd threads means it takes longer to process packets resulting in budget limits getting hit and packet drops at ingress. The packet drops can cause ssh sessions to stall or drop or cause disruptions in protocols like LACP.

Changing the qdisc on the device to a lockless one (e.g., noqueue) dramatically lowers the ksoftirqd load. perf top still shows the hot spot as a spinlock:

    25.62%  [kernel]          [k] queued_spin_lock_slowpath
     6.87%  [kernel]          [k] tasklet_action_common.isra.21
     3.28%  [kernel]          [k] _raw_spin_lock
     3.15%  [kernel]          [k] tun_net_xmit

but this time it is the lock for the tun ring:

    25.10%  [kernel.vmlinux]    [k] queued_spin_lock_slowpath
            |
             --25.05%--queued_spin_lock_slowpath
                       |
                        --24.93%--tun_net_xmit
                                  dev_hard_start_xmit
                                  __dev_queue_xmit
                                  do_execute_actions
                                  ovs_execute_actions
                                  ovs_dp_process_packet
                                  ovs_vport_receive

which is a much lighter lock in the sense of the amount of work done with the lock held.

systemd commit e6c253e363dee, released in systemd 217, changed the default qdisc from pfifo_fast (kernel default) to fq_codel (/usr/lib/sysctl.d/50-default.conf for Ubuntu). As of v5.8 kernel fq_codel still has a lock to enqueue packets, so systems using fq_codel with RSS/RPS are hitting this lock contention which affects overall system performance. pfifo_fast is lockless as of v4.16 so for newer kernels the kernel's default is best.

But, it begs the question why have a qdisc for a VM tap device (or a container's veth device) at all? To the VM a host is just part of the network. You would not want a top-of-rack switch to buffer packets for the server, so why have the host buffer packets for a VM? (The “Tx” path for a tap device represents packets going to the VM.)

You can change the default via:

sysctl -w net.core.default_qdisc=noqueue

or add that to a sysctl file (e.g., /etc/sysctl.d/90-local.conf). sysctl changes affect new devices only.

Alternatively, the default can be changed for selected devices via a udev rule:

cat > /etc/udev/rules.d/90-tap.rules <<EOF
ACTION=="add|change", SUBSYSTEM=="net", KERNEL=="tap*", PROGRAM="/sbin/tc qdisc add dev $env{INTERFACE} root handle 1000: noqueue"

Running sudo udevadm trigger should update existing devices. Check using tc qdisc sh dev <NAME>:

$ tc qdisc sh dev tapext4798884
qdisc noqueue 1000: root refcnt 2
qdisc ingress ffff: parent ffff:fff1 ----------------

[1] https://www.kernel.org/doc/Documentation/networking/scaling.txt

 
Read more...

from linusw

Recently my less-used desktop computer became sluggish, and would randomly crash. It seemed to be fully occupied with disk activity and quickly became uninteractive to the point that not even ssh login would work. This is easily identified as thrashing: constantly swapping to disk because of short core memory.

When Linux runs out of memory processes will of course be killed by the OOM killer, but if you have ample swap space, instead you will get thrashing. In this case the OOM killer would have been better: the system was so uninteractive that there is no point in trying to use swap. This was on a flash drive but still would just thrash.

Normally you would interact with the machine throught the UI or a terminal to shut down some processes, but it would not work: the memory used by the interactive processes like the desktop itself or even an SSH terminal was subject to swap!

After an update to the latest Fedora distribution the thrashing was the same but with one difference: the UI did not become completely uninteractive, making it possible to close down e.g. the web browser and recover the system.

The thrashing was caused by one of the DIMMs in the computer starting to malfunction reducing the core memory to a mere 4GB. (I have since replaced the memory.)

Something happened in Fedora that made it cope better with thrashing.

The most likely improvement in Fedora is BFQ the Budget Fair Queue block scheduler, that will use heuristics to keep the interactive processes higher in priority. This was recently made default for single queue devices in Fedora using a udev ruleset.

My flash drive was a single queue elder device – no fancy NVME – so it would become the bottleneck while constantly swapping, but with BFQ inbetween the interactive processes got a priority boost and the system remains interactive under this heavy stress, and swapping is again a better alternative to the OOM killer.

Having worked a bit with BFQ over the years this is a welcome surprise: the user-percieved stability of the system is better.

This might also illustrate the rule to make swap space around double the physical memory: now that my swap space suddenly became 4 times the physical memory the OOM killer would never step in, if it was just 2 times the physical memory, maybe it would. (I do not know if this holds or if the thrashing would be the same.)

 
Read more...

from Konstantin Ryabitsev

With git, a cryptographic signature on a commit provides strong integrity guarantees of the entire history of that branch going backwards, including all metadata and all contents of the repository, all the way back to the initial commit. This is possible because git records the hash of the previous commit in each next commit's metadata, creating an unbreakable cryptographic chain of records. If you can verify the cryptographic signature at the tip of the branch, you effectively verify that branch's entire history.

For example, let's take a look at linux.git, where the latest tag at the time of writing, v5.8-rc7, is signed by Linus Torvalds. (Tag signatures are slightly different from commit signatures — but in every practical sense they offer the same guarantees.)

If you have a git checkout of torvalds/linux.git, you are welcome to follow along.

$ git cat-file -p v5.8-rc7
object 92ed301919932f777713b9172e525674157e983d
type commit
tag v5.8-rc7
tagger Linus Torvalds <torvalds@linux-foundation.org> 1595798046 -0700

Linux 5.8-rc7
-----BEGIN PGP SIGNATURE-----

iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8d8h4eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGd0sH/2iktYhMwPxzzpnb
eI3OuTX/mRn4vUFOfpx9dmGVleMfKkpbvnn3IY7wA62Qfv7J7lkFRa1Bd1DlqXfW
yyGTGDSKG5chiRCOU3s9ni92M4xIzFlrojyt/dIK2lUGMzUPI9FGlZRGQLKqqwLh
2syOXRWbcQ7e52IHtDSy3YBNveKRsP4NyqV+GxGiex18SMB/M3Pw9EMH614eDPsE
QAGQi5uGv4hPJtFHgXgUyBPLFHIyFAiVxhFRIj7u2DSEKY79+wO1CGWFiFvdTY4B
CbqKXLffY3iQdFsLJkj9Dl8cnOQnoY44V0EBzhhORxeOp71StUVaRwQMFa5tp48G
171s5Hs=
=BQIl
-----END PGP SIGNATURE-----

If we have Linus's key in our GnuPG keyring, we can easily verify that this tag is valid:

$ git verify-tag v5.8-rc7
gpg: Signature made Sun 26 Jul 2020 05:14:06 PM EDT
gpg:                using RSA key ABAF11C65A2970B130ABE3C479BE3E4300411886
gpg:                issuer "torvalds@linux-foundation.org"
gpg: Good signature from "Linus Torvalds <torvalds@kernel.org>" [unknown]
gpg:                 aka "Linus Torvalds <torvalds@linux-foundation.org>" [full]

The entire contents of this tag are signed, so this tells us that when Linus signed the tag, the “object hash” on his system was 92ed301919932f777713b9172e525674157e983d. But what exactly is that “object hash?” What are the contents that are hashed here? We can find out by asking git to tell us more about that object:

$ git cat-file -p 92ed301919932f777713b9172e525674157e983d
tree f16e3e4bcea2d875a17d2278ff67364b3277b10a
parent 1c8594b8427290c178c5d39885eacd9e41f68743
author Linus Torvalds <torvalds@linux-foundation.org> 1595798046 -0700
committer Linus Torvalds <torvalds@linux-foundation.org> 1595798046 -0700

Linux 5.8-rc7

The above contents in their entirety (slightly differently formatted) is what gives us the sha1 hash 92ed301919932f777713b9172e525674157e983d. So, thus far, we have unbroken cryptographic attestation from Linus's PGP signature to two other important bits about his git repository:

  • information about the state of his source code (tree)
  • information about the previous commit in the history (parent)
  • information about the author of the commit and the committer, which are the one and the same in this particular case
  • information about the date and time when the commit was made

Let's take a look a the tree line — what contents were hashed to arrive at that checksum? Let's ask git:

$ git cat-file -p f16e3e4bcea2d875a17d2278ff67364b3277b10a
100644 blob a0a96088c74f49a961a80bc0851a84214b0a9f83    .clang-format
100644 blob 43967c6b20151ee126db08e24758e3c789bcb844    .cocciconfig
100644 blob a64d219137455f407a7b1f2c6b156c5575852e9e    .get_maintainer.ignore
100644 blob 4b32eaa9571e64e47b51c43537063f56b204d8b3    .gitattributes
100644 blob d5f4804ed07cd36336a5e80f2a24e45104f902cf    .gitignore
100644 blob db4f2295bd9d792b47eb77aab179a9db0d968454    .mailmap
100644 blob a635a38ef9405fdfcfe97f3a435393c1e9cae971    COPYING
100644 blob 0787b5872906c8a92a63cde3961ed630e2ec93b6    CREDITS
040000 tree 37e1b4166d912d69738beca645d3d539da4bbf30    Documentation
[...]
040000 tree ba6955ee6228666d9ef117fdd45df2e53ba0e221    virt

This is the entirety of the top-level Linux kernel directory contents. The blob entries are sha1sum's of the actual file contents in that directory, so these are straightforward. Subdirectories are represented as other tree entries, which also consist of blob and tree records going all the way down to the last sublevel, which will only contain blobs.

So, tree f16e3e4bcea2d875a17d2278ff67364b3277b10a in the commit record is a checksum of other checksums and it allows us to verify that each and every file in linux.git is exactly the same as it was on Linus Torvalds' system when he created the commit. If any file is changed, the tree checksum would be different and the whole repository would be considered invalid, because the object hash would be different than in the commit.

Finally, if we look at the object mentioned in parent 1c8594b8427290c178c5d39885eacd9e41f68743, we will see that it is a hash of another commit, containing its own tree and parent records:

$ git cat-file -p 1c8594b8427290c178c5d39885eacd9e41f68743
tree d56de40028d9ecdbebfc2121fd1ce1213fa09fa2
parent 40c60ac32174f0c0c090cd31d0d1712f2478e689
parent ca9b31f6bb9c6aa9b4e5f0792f39a97bbffb8c51
author Linus Torvalds <torvalds@linux-foundation.org> 1595796417 -0700
committer Linus Torvalds <torvalds@linux-foundation.org> 1595796417 -0700
mergetag object ca9b31f6bb9c6aa9b4e5f0792f39a97bbffb8c51
[...]

If we cared to, we could walk each commit all the way back to the beginning of Linux git history, but we don't need to do that — verifying the checksum of the latest commit is sufficient to provide us all the necessary assurances about the entire history of that tree.

So, if we verify the signature on the tag and confirm that it matches the key belonging to Linus Torvalds, we will have strong cryptographic assurances that the repository on our disk is byte-for-byte the same as the repository on the computer belonging to Linus Torvalds — with all its contents and its entire history going back to the initial commit.

The difference between signed tags and signed commits is minimal — in the case of commit signing, the signature goes into the commit object itself. It is generally a good practice to PGP-sign commits, particularly in environments where multiple people can push to the same repository branch. Signed commits provide easy forensic proof of code origins (e.g. without commit signing Alice can fake a commit to pretend that it was actually authored by Bob). It also allows for easy verification in cases where someone wants to cherry-pick specific commits into their own tree without performing a git merge.

If you are looking to get started with git and PGP signatures, I can recommend my own Protecting Code Integrity guide, or its kernel-specific adaptation that ships with the kernel docs: Kernel Maintainer PGP Guide.

Obligatory note: sha1 is not considered sufficiently strong for hashing purposes these days, and this is widely acknowledged by the git development community. Significant efforts are under way to migrate git to stronger cryptographic hashes, but they require careful planning and implementation in order to minimize disruption to various projects using git. To my knowledge, there are no effective attacks against sha1 as used by git, and git developers have added further precautions against sha1 collision attacks in git itself, which helps buy some time until stronger hashing implementations are considered ready for real-world use.

 
Read more...

from Christian Brauner

Introduction

As most people know by know we do a lot of upstream kernel development. This stretches over multiple areas and of course we also do a lot of kernel work around containers. In this article I'd like to take a closer look at the new seccomp notify feature we have been developing both in the kernel and in userspace and that is seeing more and more users. I've talked about this feature quite a few times at various conferences (just recently again at OSS NA) over the last two years but never actually sat down to write a blogpost about it. This is something I had wanted to do for quite some time. First, because it is a very exciting feature from a purely technical perspective but also from the new possibilities it opens up for (unprivileged) containers and other use-cases.

The Limits of Unprivileged Containers

That (Linux) Containers are a userspace fiction is a well-known dictum nowadays. It simply expresses the fact that there is no container kernel object in the Linux kernel. Instead, userspace is relatively free to define what a container is. But for the most part userspace agrees that a container is somehow concerned with isolating a task or a task tree from the host system. This is achieved by combining a multitude of Linux kernel features. One of the better known kernel features that is used to build containers are namespaces. The number of namespaces the kernel supports has grown over time and we are currently at eight. Before you go and look them up on namespaces(7) here they are:

  • cgroup: cgroup_namespaces(7)
  • ipc: ipc_namespaces(7)
  • network: network_namespaces(7)
  • mount: mount_namespaces(7)
  • pid: pid_namespaces(7)
  • time: time_namespaces(7)
  • user: user_namespaces(7)
  • uts: uts_namespaces(7)

Of these eight namespaces the user namespace is the only one concerned with isolating core privilege concepts on Linux such as user- and group ids, and capabilities.

Quite often we see tasks in userspace that check whether they run as root or whether they have a specific capability (e.g. CAP_MKNOD is required to create device nodes) and it seems that when the answer is “yes” then the task is actually a privileged task. But as usual things aren't that simple. What the task thinks it's checking for and what the kernel really is checking for are possibly two very different things. A naive task, i.e. a task not aware of user namespaces, might think it's asking whether it is privileged with respect to the whole system aka the host but what the kernel really checks for is whether the task has the necessary privileges relative to the user namespace it is located in.

In most cases the kernel will not check whether the task is privileged with respect to the whole system. Instead, it will almost always call a function called ns_capable() which is the kernel's way of checking whether the calling task has privilege in its current user namespace.

For example, when a new user namespace is created by setting the CLONE_NEWUSER flag in unshare(2) or in clone3(2) the kernel will grant a full set of capabilities to the task that called unshare(2) or the newly created child task via clone3(2) within the new user namespace. When this task now e.g. checks whether it has the CAP_MKNOD capability the kernel will report back that it indeed has that capability. The key point though is that this “yes” is not a global “yes”, i.e. the question “Am I privileged enough to perform this operation?” only applies to the current user namespace (and technically any nested user namespaces) not the host itself.

This distinction is important when trying to understand why a task running as root in a new user namespace with all capabilities raised will still see EPERM when e.g. trying to call mknod("/dev/mem", makedev(1, 1)) even though it seems to have all necessary privileges. The reason for this counterintuitive behavior is that the kernel isn't always checking whether you are privileged against your current user namespace. Instead, for any operation that it thinks is dangerous to expose to unprivileged users it will check whether the task is privileged in the initial user namespace, i.e. the host's user namespace.

Creating device nodes is one such example: if a task running in a user namespace were to be able to create character or block device nodes it could e.g. create /dev/kmem or any other critical device and use the device to take over the host. So the kernel simply blocks creating all device nodes in user namespaces by always performing the check for required privileges against the initial user namespace. This is of course technically inconsistent since capabilities are per user namespace as we observed above.

Other examples where the kernel requires privileges in the initial user namespace are mounting of block devices. So simply making a disk device node available to an unprivileged container will still not make it useable since it cannot mount it. On the other hand, some filesystems like cgroup, cgroup2, tmpfs, proc, sysfs, and fuse can be mounted in user namespace (with some caveats for proc and sys but we're ignoring those details for now) because the kernel can guarantee that this is safe.

But of course these restrictions are annoying. Not being able to mount block devices or create device nodes means quite a few workloads are not able to run in containers even though they could be made to run safely. Quite often a container manager like LXD will know better than the kernel when an operation that a container tries to perform is safe.

A good example are device nodes. Most containers bind-mount the set of standard devices into the container otherwise it would not work correctly:

/dev/console
/dev/full
/dev/null
/dev/random
/dev/tty
/dev/urandom
/dev/zero

Allowing a container to create these devices would be safe. Of course, the container will simply bind-mount these devices during container startup into the container so this isn't really a serious problem. But any program running inside the container that wants to create these harmless devices nodes would fail.

The other example that was mentioned earlier is mounting of block-based filesystems. Our users often instruct LXD to make certain disk devices available to their containers because they know that it is safe. For example, they could have a dedicated disk for the container or they want to share data with or among containers. But the container could not mount any of those disks.

For any use-case where the administrator is aware that a device node or disk device is missing from the container LXD provides the ability to hotplug them into one or multiple containers. For example, here is how you'd hotplug /dev/zero into a running container:

 brauner@wittgenstein|~
> lxc exec f5 -- ls -al /my/zero

brauner@wittgenstein|~
> lxc config device add f5 zero-device unix-char source=/dev/zero path=/my/zero
Device zero-device added to f5

brauner@wittgenstein|~
> lxc exec f5 -- ls -al /my/zero
crw-rw---- 1 root root 1, 5 Jul 23 10:47 /my/zero

But of course, that doesn't help at all when a random application inside the container calls mknod(2) itself. In these cases LXD has no way of helping the application by hotplugging the device as it's unaware that a mknod syscall has been performed.

So the root of the problem seems to be: – A task inside the container performs a syscall that will fail. – The syscall would not need to fail since the container manager knows that it is safe. – The container manager has no way of knowing when such a syscall is performed. – Even if the the container manager would know when such a syscall is performed it has no way of inspecting it in detail.

So a potential solution to this problem seems to be to enable the container manager or any sufficiently privileged task to take action on behalf of the container whenever it performs a syscall that would usually fail. So somehow we need to be able to interact with the syscalls of another task.

Seccomp – The Basics of Syscall Interception

The obvious candidate to look at is seccomp. Short for “secure computing” it provides a way of restricting the syscalls of a task either by allowing only a subset of the syscalls the kernel supports or by denying a set of syscalls it thinks would be unsafe for the task in question. But seccomp allows even more advanced configurations through so-called “filters”. Filters are BPF programs (Not to be equated with eBPF. BPF is a predecessor of eBPF.) that can be written in userspace and loaded into the kernel. For example, a task could use a seccomp filter to only allow the mount() syscall and only those mount syscalls that create bind mounts. This simple syscall management mechanism has made seccomp an essential security feature for a lot of userspace programs. Nowadays it is considered good practice to restrict any critical programs to only those syscalls it absolutely needs to run successfully. Browser-based sandboxes and containers being prime examples but even systemd services can be seccomp restricted.

At its core seccomp is nothing but a syscall interception mechanism. One way or another every operating system has something that is at least roughly comparable. The way seccomp works is that it intercepts syscalls right in the architecture specific syscall entry paths. So the seccomp invocations themselves live in the architecture specific codepaths although most of the logical around it is architecture agnostic.

Usually, when a syscall is performed, and no seccomp filter has been applied to the task issuing the syscall the kernel will simply lookup the syscall number in the architecture specific syscall table and if it is a known syscall will perform it reporting back the result to userspace.

But when a seccomp filter is loaded for the task issuing the syscall instead of directly looking up the syscall number in the architecture's syscall table the kernel will first call into seccomp and run the loaded seccomp filter.

Depending on whether a deny or allow approach is used for the seccomp filter any syscall that the filter is not handling specifically is either performed or denied reporting back a specified default value to the calling task. If the requested syscall is supposed to be specifically handled by the seccomp filter the kernel can e.g. be caused to report back a specific error code. This way, it is for example possible to have the kernel pretend like it doesn't know the mount(2) syscall by creating a seccomp filter that reports back ENOSYS whenever the task tries to call mount(2).

But the way seccomp used to work isn't very dynamic. Specifically, once a filter is loaded the decision whether or not the syscall is successful or not is fixed based on the policy expressed by the filter. So there is no way to make a case-by-case decision which might come in handy in some scenarios.

In addition seccomp itself can't make a syscall actually succeed other than in the trivial way of reporting back success to the caller. So seccomp will only allow the kernel to pretend that a syscall succeeded. So while it is possible to instruct the kernel to return 0 for the mount(2) syscall it cannot actually be instructed to make the mount(2) syscall succeed. So just making the seccomp filter return 0 for mounting a dedicated ext4 disk device to /mnt will still not actually mount it at /mnt; it just pretends to the caller that it did. Of course that is in itself already a useful property for a bunch of use-cases but it doesn't really help with the mknod(2) or mount(2) problem outlined above.

Extending Seccomp

So from the section above it should be clear that seccomp provides a few desirable properties that make it a natural candiate to look at to help solve our mknod(2) and mount(2) problem. Since seccomp intercepts syscalls early in the syscall path it already gives us a hook into the syscall path of a given task. What is missing though is a way to bring another task such as the LXD container manager into the picture. Somehow we need to modify seccomp in a way that makes it possible for a container manager to not just be informed when a task inside the container performs a syscall it wants to be informed about but also how to make it possible to block the task until the container manager instructs the kernel to allow it to proceed.

The answer to these questions is the seccomp notifier. This is as good a time as any to bring in some historical context. The exact origins of the idea for a more dynamic way to intercept syscalls is probably not recoverable and it has been thrown around in unspecific form in various discussions but nothing serious every materialized. The first concrete details around the seccomp notifier were conceived in early 2017 in the LXD team. The first public talk around the basic idea for this feature was given by Stéphane Graber at the Linux Plumbers Conference 2017 during the Container's Microconference in Los Angeles. The details of this talk are still listed here here and I'm sure Stéphane can still provide the slides we came up with. I didn't find a video recording even though I somehow thought we did have one. If someone is really curious I can try to investigate with the Linux Plumbers committee. After this talk implementation specifics were discussed in a hallway meeting later that day. And after a long arduous journey the implementation was upstreamed by Tycho Andersen who used to be on the LXD team. The rest is history^wchangelog.

Seccomp Notify – Syscall Interception 2.0

In its essence, the seccomp notify mechanism is simply a file descriptor (fd) for a specific seccomp filter. When a container starts it will usually load a seccomp filter to restrict its attack surface. That is even done for unprivileged containers even though it is not strictly necessary.

With the addition of the seccomp notifier a container wishing to have a subset of syscalls handled by another process can set the new SECCOMP_RET_USER_NOTIF flag on its seccomp filter. This flag instructs the kernel to return a file descriptor to the calling task after having loaded its filter. This file descriptor is a seccomp notify file descriptor.

Of course, the seccomp notify fd is not very useful to the task itself. First, since it doesn't make a lot of sense apart from very weird use-cases for a task to listen for its own syscalls. Second, because the task would likely block itself indefinitely pretty quickly without taking extreme care.

But what the task can do with the seccomp notifier is to hand to another task. Usually the task that it will hand the seccomp notify fd to will be more privileged than itself. For a container the most obvious candidate would be the container manager of course.

Since the seccomp notify fd is pollable it is possible to put it into an event loop such as epoll(7), poll(2), or select(2) and wait for the file descriptor to become readable, i.e. for the kernel to return EPOLLIN to userspace. For the seccomp notify fd to become readable means that the seccomp filter it refers to has detected that one of the tasks it has been applied to has performed a syscall that is part of the policy it implements. This is a complicated way of saying the kernel is notifying the container manager that a task in the container has performed a syscall it cares about, e.g. mknod(2) or mount(2).

Put another way, this means the container manager can listen for syscall events for tasks running in the container. Now instead of simply running the filter and immediately reporting back to the calling task the kernel will send a notification to the container manager on the seccomp notify fd and block the task performing the syscall.

After the seccomp notify fd indicates that it is readable the container manager can use the new SECCOMP_IOCTL_NOTIF_RECV ioctl() associated with seccomp notify fds to read a struct seccomp_notif message for the syscall. Currently the data to be read from the seccomp notify fd includes the following pieces. But please be aware that we are in the process of discussing potentially intrusive changes for future versions:

struct seccomp_notif {
	__u64 id;
	__u32 pid;
	__u32 flags;
	struct seccomp_data data;
};

Let's look at this in a little more detail. The pid field is the pid of the task that performed the syscall as seen in the caller's pid namespace. To stay within the realm of our current examples, this is simply the pid of the task in the container the e.g. called mknod(2) as seen in the pid namespace of the container manager. The id field is a unique identifier for the performed syscall. This can be used to verify that the task is still alive and the syscall request still valid to avoid any race conditions caused by pid recycling. The flags argument is currently unused and reserved for future extensions.

The struct seccomp_data argument is probably the most interesting one as it contains the really exciting bits and pieces:

struct seccomp_data {
	int nr;
	__u32 arch;
	__u64 instruction_pointer;
	__u64 args[6];
};

The int field is the syscall number which can only be correctly interpreted relative to the arch field. The arch field is the (audit) architecture for which this syscall was made. This field is very relevant since compatible architectures (For the x86 architectures this encompasses at least x32, i386, and x86_64. The arm, mips, and power architectures also have compatible “sub” architectures.) are stackable and the returned syscall number might be different than the current headers imply (For example, you could be making a syscall from a 32bit userspace on a 64bit kernel. If the intercepted syscall has different syscall numbers on 32 bit and on 64bit, for example syscall foo() might have syscall number 1 on 32 bit and 2 on 64 bit. So the task reading the seccomp data can't simply assume that since it itself is running in a 32 bit environment the syscall number must be 1. Rather, it must check what the audit arch is and then either check that the value of the syscall is 1 on 32 bit and 2 on 64 bit. Otherwise the container manager might end up emulating mount() when it should be emulating mknod().). The instruction_pointer is set to the address of the instruction that performed the syscall. This is of course also architecture specific. And last the args member are the syscall arguments that the task performed the syscall with.

The args need to be interpreted and treated differently depending on the syscall layout and their type. If they are non-pointer arguments (unsigned int etc.) they can be copied into a local variable and interpreted right away. But if they are pointer arguments they are offsets into the virtual memory of the task that performed the syscall. In the latter case the memory needs to be read and copied before it can be interpreted.

Let's look at a concrete example to figure out why it is vital to know the syscall layout other than for knowing the types of the syscall arguments. Say the performed syscall was mount(2). In order to interpret the args field correctly we look at the syscall layout of mount(). (Please note, that I'm stressing that we need to look at the layout of syscall and the only reliable source for this is actually the kernel source code. The Linux manpages often list the wrapper provided by the system's libc and these wrapper do not necessarily line-up with the syscall itself (compare the waitid() wrapper and the waitid() syscall or the various clone() syscall layouts).) From the layout of mount(2) we see that args[0] is a pointer argument identifying the source path, args[1] is another pointer argument identifying the target path, args[2] is a pointer argument identifying the filesystem type, args[3] is a non-pointer argument identifying the options, and args[4] is another pointer argument identifying additional mount options.

So if we were to be interested in the source path of this mount(2) syscall we would need to open the /proc/<pid>/mem file of the task that performed this syscall and e.g. use the pread(2) function with args[0] as the offset into the task's virtual memory and read it into a buffer at least the length of a standard path. Alternatively, we can use a single syscall like process_vm_readv(2) to read multiple remote pointers at different locations all in one go. Once we have done this we can interpret it.

A friendly advice: in general it is a good idea for the container manager to read all syscall arguments once into a local buffer and base its decisions on how to proceed on the data in this local buffer. Not just because it will otherwise not be able for the container manager to interpret pointer arguments but it's also a possible attack vector since a sufficiently privileged attacker (e.g. a thread in the same thread-group) can write to /proc/<pid>/mem and change the contents of e.g. args[0] or any other syscall argument. Also note, that the container manager should ensure that /proc/<pid> still refers to the same task after opening it by checking the validity of the syscall request via the id field and the associated SECCOMP_IOCTL_NOTIF_ID_VALID ioctl() to exclude the possibility of the task having exited, been reaped and its pid having been recycled.

But let's assume we have done all that. Now that the container manager has the task's syscall arguments available in a local buffer it can interpret the syscall arguments. While it is doing so the target task remains blocked waiting for the kernel to tell it to proceed. After the container manager is done interpreting the arguments and has performed whatever action it wanted to perform it can use the SECCOMP_IOCTL_NOTIF_SEND ioctl() on the seccomp notify fd to tell the kernel what it should do with the blocked task's syscall. The response is given in the form struct seccomp_notif_resp:

struct seccomp_notif_resp {
	__u64 id;
	__s64 val;
	__s32 error;
	__u32 flags;
};

Let's look at this struct in a little more detail too. The id field is set to the id of the syscall request to respond to and should correspond to the received id in the struct seccomp_notif that the container manager read via the SECCOMP_IOCTL_NOTIF_RECV ioctl() when the seccomp notify fd became readable. The val field is the return value of the syscall and is only set if the error field is set to 0. The error field is the error to return from the syscall and should be set to a negative errno(3) code if the syscall is supposed to fail (For example, to trick the caller into thinking that mount(2) is not supported on this kernel set error to -ENOSYS.). The flags value can be used to tell the kernel to continue the syscall by setting the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag which I added to be able to intercept mount(2) and other syscalls that are difficult for seccomp to filter efficiently because of the restrictions around pointer arguments. More on that in a little bit.

With this machinery in place we are for now ;) done with the kernel bits.

Emulating Syscalls In Userspace

So what is the container manager supposed to do after having read and interpreted the syscall information for the task running in the container and telling the kernel to let the task continue. Probably emulate it. Otherwise we just have a fancy and less performant seccomp userspace policy (Please read my comments on why that is a very bad idea.).

Emulating syscalls in userspace is not a very new thing to do. It has been done for a long time. For example, libc's can choose to emulate the execveat(2) syscall which allows a task to exec a program by providing a file descriptor to the binary instead of a path. On a kernel that doesn't support the execveat(2) syscall the libc can emulate it by calling exec(3) with the path set to /proc/self/fd/<nr>. The problem of course is that this emulation only works when the task in question actually uses the libc wrapper (fexecve(3) for our example). Any task using syscall(__NR_execveat, [...]) to perform the syscall without going through the provided wrapper will be bypassing libc and so libc doesn't know that the task wants to perform the execveat(2) syscall and will not be able to emulate it in case the kernel doesn't support it.

The seccomp notifier doesn't suffer from this problem since its syscall interception abilities aren't located in userspace at the library level but directly in the syscall path as we have seen. This greatly expands the abilities to emulate syscalls.

So now we have all the kernel pieces in place to solve our mknod(2) and mount(2) problem in unprivileged containers. Instead of simply letting the container fail on such harmless requests as creating the /dev/zero device node we can use the seccomp notifier to intercept the syscall and emulate it for the container in userspace by simply creating the device node for it. Similarly, we can intercept mount(2) requests requiring the user to e.g. give us a list of allowed filesystems to mount for the container and performing the mount for the container. We can even make this a lot safer by providing a user with the ability to specify a fuse binary that should be used when a task in the container tries to mount a filesystem. We actually support this feature in LXD. Since fuse is a safe way for unprivileged users to mount filesystems rewriting mount(2) requests is a great way to expose filesystems to containers.

In general, the possibilities of the seccomp notifier can't be overstated and we are extremely happy that this work is now not just fully integrated into the Linux kernel but also into both LXD and LXC. As with many other technologies we have driven both in the upstream kernel and in userspace it directly benefits not just our users but all of userspace with the seccomp notifier seeing adoption in browsers and by other companies. A whole range of Travis workloads can now run in unprivileged LXD containers thanks to the seccomp notifier.

Seccomp Notify in action – LXD

After finishing the kernel bits we implemented support for it in LXD and the LXC shared library it uses. Instead of simply exposing the raw seccomp notify fd for the container's seccomp filter directly to LXD each container connects to a multi-threaded socket that the LXD container manager exposes and on which it listens for new clients. Clients here are new containers who the administrator has signed up for syscall supervisions through LXD. Each container has a dedicated syscall supervisor which runs as a separate go routine and stays around for as long as the container is running.

When the container performs a syscall that the filter applies to a notification is generated on the seccomp notify fd. The container then forwards this request including some additional data on the socket it connected to during startup by sending a unix message including necessary credentials. LXD then interprets the message, checking the validity of the request, verifying the credentials, and processing the syscall arguments. If LXD can prove that the request is valid according to the policy the administrator specified for the container LXD will proceed to emulate the syscall. For mknod(2) it will create the device node for the container and for mount(2) it will mount the filesystem for the container. Either by directly mounting it or by using a specified fuse binary for additional security.

If LXD manages to emulate the syscall successfully it will prepare a response that it will forward on the socket to the container. The container then parses the message, verifying the credentials and will use the SECCOMP_IOCTL_NOTIF_SEND ioctl() sending a struct seccomp_notif_resp causing the kernel to unblock the task performing the syscall and reporting back that the syscall succeeded. Conversely, if LXD fails to emulate the syscall for whatever reason or the syscall is not allowed by the policy the administrator specified it will prepare a message that instructs the container to report back that the syscall failed and unblocking the task.

Show Me!

Ok, enough talk. Let's intercept some syscalls. The following demo shows how LXD uses the seccomp notify fd to emulate the mknod(2) and mount(2) syscalls for an unprivileged container:

asciicast

Current Work and Future Directions

SECCOMP_USER_NOTIF_FLAG_CONTINUE

After the initial support for the seccomp notify fd landed we ran into limitations pretty quickly. We realized we couldn't intercept the mount syscall. Since the mount syscall has various pointer arguments it is difficult to write highly specific seccomp filters such that we only accept syscalls that we intended to intercept. This is caused by seccomp not being able to handle pointer arguments. They are opaque for seccomp. So while it is possible to tell seccomp to only intercept mount(2) requests for real filesystems by only intercepting mount(2) syscalls where the MS_BIND flag is not set in the flags argument it is not possible to write a seccomp filter that only notifies the container manager about mount(2) syscalls for the ext4 or btrfs filesystem because the filesystem argument is a pointer.

But this means we will inadvertently intercept syscalls that we didn't intend to intercept. That is a generic problem but for some syscalls it's not really a big deal. For example, we know that mknod(2) fails for all character and block devices in unprivileged containers. So as long was we write a seccomp filter that intercepts only character and block device mknod(2) syscalls but no socket or fifo mknod() syscalls we don't have a problem. For any character or block device that is not in the list of allowed devices in LXD we can simply instruct LXD to prepare a seccomp message that tells the kernel to report EPERM and since the syscalls would fail anyway there's no problem.

But any system call that we intercepted as a consequence of seccomp not being able to filter on pointer arguments that would succeed in unprivileged containers would need to be emulated in userspace. But this would of course include all mount(2) syscalls for filesystems that can be mounted in unprivileged containers. I've listed a subset of them above. It includes at least tmpfs, proc, sysfs, devpts, cgroup, cgroup2 and probably a few others I'm forgetting. That's not ideal. We only want to emulate syscalls that we really have to emulate, i.e. those that would actually fail.

The solution to this problem was a patchset of mine that added the ability to continue an intercepted syscall. To instruct the kernel to continue the syscall the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag can be set in struct seccomp_notif_resp's flag argument when instructing the kernel to unblock the task.

This is of course a very exciting feature and has a few readers probably thinking “Hm, I could implement a dynamic userspace seccomp policy.” to which I want to very loudly respond “No, you can't!”. In general, the seccomp notify fd cannot be used to implement any kind of security policy in userspace. I'm now going to mostly quote verbatim from my comment for the extension: The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with extreme caution! If set by the task supervising the syscalls of another task the syscall will continue. This is problematic is inherent because of TOCTOU (Time of Check-Time of Use). An attacker can exploit the time while the supervised task is waiting on a response from the supervising task to rewrite syscall arguments which are passed as pointers of the intercepted syscall. It should be absolutely clear that this means that the seccomp notifier cannot be used to implement a security policy on syscalls that read from dereferenced pointers in user space! It should only ever be used in scenarios where a more privileged task supervises the syscalls of a lesser privileged task to get around kernel-enforced security restrictions when the privileged task deems this safe. In other words, in order to continue a syscall the supervising task should be sure that another security mechanism or the kernel itself will sufficiently block syscalls if arguments are rewritten to something unsafe.

Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the same syscall, the most recently added filter takes precedence. This means that the new SECCOMP_RET_USER_NOTIF filter can override any SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all such filtered syscalls to be executed by sending the response SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.

Retrieving file descriptors pidfd_getfd()

Another extension that was added by Sargun Dhillon recently building on top of my pidfd work was to make it possible to retrieve file descriptors from another task. This works even without the seccomp notifier since it is a new syscall but is of course especially useful in conjunction with it.

Often we would like to intercept syscalls such as connect(2). For example, the container manager might want to rewrite the connect(2) request to something other than the task intended for security reasons or because the task lacks the necessary information about the networking layout to connect to the right endpoint. In these cases pidfd_getfd(2) can be used to retrieve a copy of the file descriptor of the task and perform the connect(2) for it. This unblocks another wide range of use-cases.

For example, it can be used for further introspection into file descriptors than ss, or netstat would typically give you, as you can do things like run getsockopt(2) on the file descriptor, and you can use options like TCP_INFO to fetch a significant amount of information about the socket. Not only can you fetch information about the socket, but you can also set fields like TCP_NODELAY, to tune the socket without requiring the user's intervention. This mechanism, in conjunction can be used to build a rudimentary layer 4 load balancer where connect(2) calls are intercepted, and the destination is changed to a real server instead.

Early results indicate that this method can yield incredibly good latency as compared to other layer 4 load balancing techniques.

Plot 63
Injecting file descriptors SECCOMP_NOTIFY_IOCTL_ADDFD

Current work for the upcoming merge window is focussed on making it possible to inject file descriptors into a task. As things stand, we are unable to intercept syscalls (Unless we share the file descriptor table with the task which is usually never the case for container managers and the containers they supervise.) such as open(2) that cause new file descriptors to be installed in the task performing the syscall.

The new seccomp extension effectively allows the container manager to instructs the target task to install a set of file descriptors into its own file descriptor table before instructing it to move on. This way it is possible to intercept syscalls such as open(2) or accept(2), and install (or replace, like dup2(2)) the container manager's resulting fd in the target task.

This new technique opens the door to being able to make massive changes in userspace. For example, techniques such as enabling unprivileged access to perf_event_open(2), and bpf(2) for tracing are available via this mechanism. The manager can inspect the program, and the way the perf events are being setup to prevent the user from doing ill to the system. On top of that, various network techniques are being introducd, such as zero-cost IPv6 transition mechanisms in the future.

Last, I want to note that Sargun Dhillon was kind enough to contribute paragraphs to the pidfd_getfd(2) and SECCOMP_NOTIFY_IOCTL_ADDFD sections. He also provided the graphic in the pidfd_getfd(2) sections to illustrate the performance benefits of this solution.

Christian

 
Read more...

from David Ahern

When evaluating networking for a host the focus is typically on latency, throughput or packets per second (pps) to see the maximum load a system can handle for a given configuration. While those are important and often telling metrics, results for such benchmarks do not tell you the impact processing those packets has on the workloads running on that system.

This post looks at the cost of networking in terms of CPU cycles stolen from processes running in a host.

Packet Processing in Linux

Linux will process a fair amount of packets in the context of whatever is running on the CPU at the moment the irq is handled. System accounting will attribute those CPU cycles to any process running at that moment even though that process is not doing any work on its behalf. For example, 'top' can show a process appears to be using 99+% cpu but in reality 60% of that time is spent processing packets meaning the process is really only get 40% of the CPU to make progress on its workload.

net_rx_action, the handler for network Rx traffic, usually runs really fast – like under 25 usecs[1] – dealing with up to 64 packets per napi instance (NIC and RPS) at a time before deferring to another softirq cycle. softirq cycles can be back to back, up to 10 times or 2 msec (see __do_softirq), before taking a break. If the softirq vector still has more work to do after the maximum number of loops or time is reached, it defers further work to the ksoftirqd thread for that CPU. When that happens the system is a bit more transparent about the networking overhead in the sense that CPU usage can be monitored (though with the assumption that it is packet handling versus other softirqs).

One way to see the above description is using perf:

sudo perf record -a \
        -e irq:irq_handler_entry,irq:irq_handler_exit \
        -e irq:softirq_entry --filter="vec == 3" \
        -e irq:softirq_exit --filter="vec == 3"  \
        -e napi:napi_poll \
        -- sleep 1

sudo perf script

The output is something like:

swapper     0 [005] 176146.491879: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0
swapper     0 [005] 176146.491880:  irq:irq_handler_exit: irq=152 ret=handled
swapper     0 [005] 176146.491880:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491942:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 64 budget 64
swapper     0 [005] 176146.491943:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491943:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491971:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 27 budget 64
swapper     0 [005] 176146.491971:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.492200: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0

In this case the cpu is idle (hence swapper for the process), an irq fired for an Rx queue on CPU 5, softirq processing looped twice handling 64 packets and then 27 packets before exiting with the next irq firing 229 usec later and starting the loop again.

The above was recorded on an idle system. In general, any task can be running on the CPU in which case the above series of events plays out by interrupting that task, doing the irq/softirq dance and with system accounting attributing cycles to the interrupted process. Thus, processing packets is typically hidden from the usual CPU monitoring as it is done in the context of some random, victim process, so how do you view or quantify the time a process is interrupted handling packets? And how can you compare 2 different networking solutions to see which one is less disruptive to a workload?

With RSS, RPS, and flow steering, packet processing is usually distributed across cores, so the packet processing sequence describe above is all per-CPU. As packet rates increase (think 100,000 pps and up) the load means 1000's to 10,000's of packets are processed per second per cpu. Processing that many packets will inevitably have an impact on the workloads running on those systems.

Let's take a look at one way to see this impact.

Undo the Distributed Processing

First, let's undo the distributed processing by disabling RPS and installing flow rules to force the processing of all packets for a specific MAC address on a single, known CPU. My system has 2 nics enslaved to a bond in an 802.3ad configuration with the networking load targeted at a single virtual machine running in the host.

RPS is disabled on the 2 nics using

for d in eth0 eth1; do
    find /sys/class/net/${d}/queues -name rps_cpus |
    while read f; do
            echo 0 | sudo tee ${f}
    done
done

Next, add flow rules to push packets for the VM under test to a single CPU

DMAC=12:34:de:ad:ca:fe
sudo ethtool -N eth0 flow-type ether dst ${DMAC} action 2
sudo ethtool -N eth1 flow-type ether dst ${DMAC} action 2

Together, lack of RPS + flow rules ensure all packets destined to the VM are processed on the same CPU. You can use a command like ethq[3] to verify packets are directed to the expected queue and then map that queue to a CPU using /proc/interrupts. In my case queue 2 is handled on CPU 5.

openssl speed

I could use perf or a bpf program to track softirq entry and exit for network Rx, but that gets complicated quick, and the observation will definitely influence the results. A much simpler and more intuitive solution is to infer the networking overhead using a well known workload such as 'openssl speed' and look at how much CPU access it really gets versus is perceived to get (recognizing the squishiness of process accounting).

'openssl speed' is a nearly 100% userspace command and when pinned to a CPU will use all available cycles for that CPU for the duration of its tests. The command works by setting an alarm for a given interval (e.g., 10 seconds here for easy math), launches into its benchmark and then uses times() when the alarm fires as a way of checking how much CPU time it was actually given. From a syscall perspective it looks like this:

alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726601344
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 2782545353
times({tms_utime=1000, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726602344

so very few system calls between the alarm and checking the results of times(). With no/few interruptions the tms_utime will match the test time (10 seconds in this case).

Since it is is a pure userspace benchmark ANY system time that shows up in times() is overhead. openssl may be the process on the CPU, but the CPU is actually doing something else, like processing packets. For example:

alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726617896
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 4079301579
times({tms_utime=178, tms_stime=571, tms_cutime=0, tms_cstime=0}) = 1726618896

shows that openssl was on the cpu for 7.49 seconds (178 + 571 in .01 increments), but 5.71 seconds of that time was in system time. Since openssl is not doing anything in the kernel, that 5.71 seconds is all overhead – time stolen from this process for “system needs.”

Using openssl to Infer Networking Overhead

With an understanding of how 'openssl speed' works, let's look at a near idle server:

$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc >/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 66675623 aes-256 cbc's in 9.99s
Doing aes-256 cbc for 10s on 64 size blocks: 18096647 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 256 size blocks: 4607752 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 1024 size blocks: 1162429 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 8192 size blocks: 145251 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 16384 size blocks: 72831 aes-256 cbc's in 10.00s

so in this case openssl reports 9.99 to 10.00 seconds of run time for each of the block sizes confirming no contention for the CPU. Let's add network load, netperf TCP_STREAM from 2 sources, and re-do the test:

$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc >/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 12061658 aes-256 cbc's in 1.96s
Doing aes-256 cbc for 10s on 64 size blocks: 3457491 aes-256 cbc's in 2.10s
Doing aes-256 cbc for 10s on 256 size blocks: 893939 aes-256 cbc's in 2.01s
Doing aes-256 cbc for 10s on 1024 size blocks: 201756 aes-256 cbc's in 1.86s
Doing aes-256 cbc for 10s on 8192 size blocks: 25117 aes-256 cbc's in 1.78s
Doing aes-256 cbc for 10s on 16384 size blocks: 13859 aes-256 cbc's in 1.89s

Much different outcome. Each block size test wants to run for 10 seconds, but times() is reporting the actual user time to be between 1.78 and 2.10 seconds. Thus, the other 7.9 to 8.22 seconds was spent processing packets – either in the context of openssl or via ksoftirqd.

Looking at top for the previous openssl run:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P 
 8180 libvirt+  20   0 33.269g 1.649g 1.565g S 279.9  0.9  18:57.81 qemu-system-x86     75
 8374 root      20   0       0      0      0 R  99.4  0.0   2:57.97 vhost-8180          89
 1684 dahern    20   0   17112   4400   3892 R  73.6  0.0   0:09.91 openssl              5    
   38 root      20   0       0      0      0 R  26.2  0.0   0:31.86 ksoftirqd/5          5

one would think openssl is using ~73% of cpu 5 with ksoftirqd taking the rest but in reality so many packets are getting processed in the context of openssl that it is only effectively getting 18-21% time on the cpu to make progress on its workload.

If I drop the network load to just 1 stream, openssl appears to be running at 99% CPU:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P
 8180 libvirt+  20   0 33.269g 1.722g 1.637g S 325.1  0.9 166:38.12 qemu-system-x86     29
44218 dahern    20   0   17112   4488   3996 R  99.2  0.0   0:28.55 openssl              5
 8374 root      20   0       0      0      0 R  64.7  0.0  60:40.50 vhost-8180          55
   38 root      20   0       0      0      0 S   1.0  0.0   4:51.98 ksoftirqd/5          5

but openssl reports ~4 seconds of userspace time:

Doing aes-256 cbc for 10s on 16 size blocks: 26596388 aes-256 cbc's in 4.01s
Doing aes-256 cbc for 10s on 64 size blocks: 7137481 aes-256 cbc's in 4.14s
Doing aes-256 cbc for 10s on 256 size blocks: 1844565 aes-256 cbc's in 4.31s
Doing aes-256 cbc for 10s on 1024 size blocks: 472687 aes-256 cbc's in 4.28s
Doing aes-256 cbc for 10s on 8192 size blocks: 59001 aes-256 cbc's in 4.46s
Doing aes-256 cbc for 10s on 16384 size blocks: 28569 aes-256 cbc's in 4.16s

Again, monitoring tools show a lot of CPU access, but reality is much different with 55-80% of the CPU spent processing packets. The throughput numbers look great (22+Gbps for a 25G link), but the impact on processes is huge.

In this example, the process robbed of CPU cycles is a silly benchmark. On a fully populated host the interrupted process can be anything – virtual cpus for a VM, emulator threads for the VM, vhost threads for the VM, or host level system processes with varying degrees of impact on performance of those processes and the system.

Up Next

This post is the basis for a follow up post where I will discuss a comparison of the overhead of “full stack” on a VM host to “XDP”.

[1] Measured using ebpf program on entry and exit. See net_rx_action in https://github.com/dsahern/bpf-progs

[2] Assuming no bugs in the networking stack and driver. I have examined systems where net_rx_action takes well over 20,000 usec to process less than 64 packets due to a combination of bugs in the NIC driver (ARFS path) and OVS (thundering herd wakeup).

[3] https://github.com/isc-projects/ethq

 
Read more...

from metan's blog

What's a race anyways?

Most of the readers, for sure, do know what a race condition is. Let me however include short description just for the sake of completeness. Race condition, in terms of a computer programming, is a bug where two pieces of code cause an error if executed concurrently.

As a kernel QA I'm mostly interested in writing testcases that can reproduce once fixed races in kernel code in order to avoid regressions and also to make sure all code streams, such as stable kernels, are bug free.

The main problem with these tests is that they are notoriously unreliable. Especially when the race window is very small, just a few instructions in length, it's nearly impossible to write a test that can reasonably reliably trigger the problem.

Naive approach to races

Race reproducer tests are usually implemented so that two threads runs two different loops each with a different piece of code in a hope that the race would trigger. In a case of a kernel this usually means two threads each calling a different syscall in a loop. The problem with this approach is that we are depending on system jitter in order to hit the race window. And as you may know computers are mostly deterministic, hence we are strongly depending on luck in this case.

Fuzzy sync to the rescue

We started to ponder if we can do better while we were converting a CVE proof of concepts into a LTP testcases. We ended up, after three redesigns and rewrites, with a nice library that can greatly improve the likelihood of triggering a race successfully. As a side effect we need much less iterations and our testcases finish much faster on race-free systems.

The overall idea is quite simple. First we sample how long it takes for the two pieces of code that cause the race to run. Once that is settled all we need to do is to synchronize the code sections accordingly. As we do not know which exact parts of the two sections needs to be aligned to trigger the race, we synchronize the code sections so that the alignment is random in each iteration. But we also make sure that the two sections overlap, otherwise there is no chance to trigger the race.

The implementation is a bit more complicated though. In the sampling phase we are using moving average and we wait for the deviation to settle down so that we have reasonable approximation of the race sections duration. Synchronization of the sections depends on atomic increments and spinlocks and the delay, used for alignment randomization, is introduced by a calibrated busy loops.

Problem solved?

No, not completely. The real world problem is a bit more complex than that. It's true that fuzzy sync library applies well to many race reproducers but some cases does not work at all. One of the problems we found is that the syscall duration may vary considerably depending on the alignment of the second piece of the racing code.

Consider for example recvmsg() racing with close(). If we align these two syscalls in a way that the file descriptor would be closed at the beginning of recvmsg() the syscall will quickly return EBADF which has zero chance of hitting the race. This was the case for CVE-2016-7117 so had to introduce a function to bias the offset of the code sections in order to finish sampling phase successfully in this case.

Real world example

Piece of code that attempts to reproduce d90a10e2444b “fsnotify: Fix fsnotify_mark_connector race” follows.

static struct tst_fzsync_pair fzsync_pair;
static int fd;

static void *write_seek(void *unused)
{
        char buf[64];

        while (tst_fzsync_run_b(&fzsync_pair)) {
                tst_fzsync_start_race_b(&fzsync_pair);
                SAFE_WRITE(0, fd, buf, sizeof(buf));
                SAFE_LSEEK(fd, 0, SEEK_SET);
                tst_fzsync_end_race_b(&fzsync_pair);
        }
        return unused;
}

static void setup(void)
{
        fd = SAFE_OPEN(FNAME, O_CREAT | O_RDWR, 0600);
        tst_fzsync_pair_init(&fzsync_pair);
}

static void cleanup(void)
{
        if (fd > 0)
                SAFE_CLOSE(fd);

        tst_fzsync_pair_cleanup(&fzsync_pair);
}

static void verify_inotify(void)
{
        int inotify_fd;
        int wd;

        inotify_fd = SAFE_MYINOTIFY_INIT1(0);

        tst_fzsync_pair_reset(&fzsync_pair, write_seek);
        while (tst_fzsync_run_a(&fzsync_pair)) {
                wd = SAFE_MYINOTIFY_ADD_WATCH(inotify_fd, FNAME, IN_MODIFY);

                tst_fzsync_start_race_a(&fzsync_pair);
                wd = myinotify_rm_watch(inotify_fd, wd);
                tst_fzsync_end_race_a(&fzsync_pair);
                if (wd < 0)
                        tst_brk(TBROK | TERRNO, "inotify_rm_watch() failed.");
        }
        SAFE_CLOSE(inotify_fd);
        /* We survived for given time - test succeeded */
        tst_res(TPASS, "kernel survived inotify beating");
}

static struct tst_test test = {
        .needs_tmpdir = 1,
        .setup = setup,
        .cleanup = cleanup,
        .test_all = verify_inotify,
};

As you can see the initialization and exit are handled in the setup() and cleanup() functions.

The syscalls that are racing here are enclosed between tst_fzsync_start_race_*() and tst_fzsync_end_race_*().

The tst_fzync_pair_reset() functions clears the counters and also starts a thread that is racing against the main thread.

And that's about it, all the complexity is neatly packed in the fzsync library.

 
Read more...

from joelfernandes

GUS is a memory reclaim algorithm used in FreeBSD, similar to RCU. It is borrows concepts from Epoch and Parsec. A video of a presentation describing the integration of GUS with UMA (FreeBSD's slab implementation) is here: https://www.youtube.com/watch?v=ZXUIFj4nRjk

The best description of GUS is in the FreeBSD code itself. It is based on the concept of global write clock, with readers catching up to writers.

Effectively, I see GUS as an implementation of light traveling from distant stars. When a photon leaves a star, it is no longer needed by the star and is ready to be reclaimed. However, on earth we can't see the photon yet, we can only see what we've been shown so far, and in a way, if we've not seen something because enough “time” has not passed, then we may not reclaim it yet. If we've not seen something, we will see it at some point in the future. Till then we need to sit tight.

Roughly, an implementation has 2+N counters (with N CPUs): 1. Global write sequence. 2. Global read sequence. 3. Per-cpu read sequence (read from #1 when a reader starts)

On freeing, the object is tagged with the write sequence. Only once global read sequence has caught up with global write sequence, the object is freed. Until then, the free'ing is deferred. The poll() operation updates #2 by referring to #3 of all CPUs. Whatever was tagged between the old read sequence and new read sequence can be freed. This is similar to synchronize_rcu() in the Linux kernel which waits for all readers to have finished observing the object being reclaimed.

Note the scalability drawbacks of this reclaim scheme:

  1. Expensive poll operation if you have 1000s of CPUs. (Note: Parsec uses a tree-based mechanism to improve the situation which GUS could consider)

  2. Heavy-weight memory barriers are needed (SRCU has a similar drawback) to ensure ordering properties of reader sections with respect to poll() operation.

  3. There can be a delay between reading the global write-sequence number and writing it into the per-cpu read-sequence number. This can cause the per-cpu read-sequence to advance past the global write-sequence. Special handling is needed.

One advantage of the scheme could be implementation simplicity.

RCU (not SRCU or Userspace RCU) doesn't suffer from these drawbacks. Reader-sections in Linux kernel RCU are extremely scalable and lightweight.

 
Read more...

from Konstantin Ryabitsev

For the past few weeks I've been working on a tool to fetch patches from lore.kernel.org and perform the kind of post-processing that is common for most maintainers:

  • rearrange the patches in proper order
  • tally up various follow-up trailers like Reviewed-by, Acked-by, etc
  • check if a newer series revision exists and automatically grab it

The tool started out as get-lore-mbox, but has now graduated into its own project called b4 — you can find it on git.kernel.org and pypi.

To use it, all you need to know is the message-id of one of the patches in the thread you want to grab. Once you have that, you can use the lore.kernel.org archive to grab the whole thread and prepare an mbox file that is ready to be fed to git-am:

$ b4 am 20200312131531.3615556-1-christian.brauner@ubuntu.com
Looking up https://lore.kernel.org/r/20200312131531.3615556-1-christian.brauner@ubuntu.com
Grabbing thread from lore.kernel.org
Analyzing 26 messages in the thread
Found new series v2
Will use the latest revision: v2
You can pick other revisions using the -vN flag
---
Writing ./v2_20200313_christian_brauner_ubuntu_com.mbx
  [PATCH v2 1/3] binderfs: port tests to test harness infrastructure
    Added: Reviewed-by: Kees Cook <keescook@chromium.org>
  [PATCH v2 2/3] binderfs_test: switch from /dev to a unique per-test mountpoint
    Added: Reviewed-by: Kees Cook <keescook@chromium.org>
  [PATCH v2 3/3] binderfs: add stress test for binderfs binder devices
    Added: Reviewed-by: Kees Cook <keescook@chromium.org>
---
Total patches: 3
---
 Link: https://lore.kernel.org/r/20200313152420.138777-1-christian.brauner@ubuntu.com
 Base: 2c523b344dfa65a3738e7039832044aa133c75fb
       git checkout -b v2_20200313_christian_brauner_ubuntu_com 2c523b344dfa65a3738e7039832044aa133c75fb
       git am ./v2_20200313_christian_brauner_ubuntu_com.mbx

As you can see, it was able to:

  • grab the whole thread
  • find the latest revision of the series (v2)
  • tally up the Reviewed-by trailers from Kees Cook and insert them into proper places
  • save all patches into an mbox file
  • show the commit-base (since it was specified)
  • show example git checkout and git am commands

Pretty neat, eh? You don't even need to know on which list the thread was posted — lore.kernel.org, through the magic of public-inbox, will try to find it automatically.

If you want to try it out, you can install b4 using:

pip install b4

(If you are wondering about the name, then you should click the following links: V'ger, Lore, B-4.)

The same, but now with patch attestation

On top of that, b4 also introduces support for cryptographic patch attestation, which makes it possible to verify that patches (and their metadata) weren't modified in transit between developers. This is still an experimental feature, but initial tests have been pretty encouraging.

I tried to design this mechanism so it fulfills the following requirements:

  • it must be unobtrusive and not pollute the mailing lists with attestation data
  • it must be possible to submit attestation after the patches were already sent off to the list (for example, from a different system, or after being asked to do so by the maintainer/reviewer)
  • it must not invent any new crypto or key distribution routines; this means sticking with PGP/GnuPG — at least for the time being

If you are curious about the technical details, I refer you to my original RFC where I describe the implementation.

If you simply want to start using it, then read on.

Submitting patch attestation

If you would like to submit attestation for a patch or a series of patches, the best time to do that is right after you use git send-email to submit your patches to the list. Simply run the following:

b4 attest *.patch

This will do the following:

  • create a set of 3 hashes per each patch (for the metadata, for the commit message, and for the patch itself)
  • add these hashes to a YAML-style document
  • PGP-sign the attestation document using the PGP key you set up with git
  • connect to mail.kernel.org:587 and send the attestation document to the signatures@kernel.org mailing list.

If you don't want to send that attestation right away, use the -n flag to simply generate the message and save it locally for review.

Verifying patch attestation

When running b4 am, the tool will automatically check if attestation is available by querying the signatures archive on lore.kernel.org. If it finds the attestation document, it will run gpg --verify on it. All of the following checks must pass before attestation is accepted:

  1. The signature must be “good” (signed contents weren't modified)
  2. The signature must be “valid” (not done with a revoked/expired key)
  3. The signature must be “trusted” (more on this below)

If all these checks pass, b4 am will show validation checkmarks next to the patches as it processes them:

$ b4 am 202003131609.228C4BBEDE@keescook
Looking up https://lore.kernel.org/r/202003131609.228C4BBEDE@keescook
Grabbing thread from lore.kernel.org
...
---
Writing ./v2_20200313_keescook_chromium_org.mbx
  [✓] [PATCH v2 1/2] selftests/harness: Move test child waiting logic
  [✓] [PATCH v2 2/2] selftests/harness: Handle timeouts cleanly
  ---
  [✓] Attestation-by: Kees Cook <keescook@chromium.org> (pgp: 8972F4DFDC6DC026)
---
Total patches: 2
---
...

These checkmarks give you assurance that all patches are exactly the same as when they were generated by the developer on their system.

Trusting on First Use (TOFU)

The most bothersome part of PGP is key management. In fact, it's the most bothersome part of any cryptographic attestation scheme — you either have to delegate your trust management to some shadowy Certification Authority, or you have to do a lot of decision making of your own when evaluating which keys to trust.

GnuPG tries to make it a bit easier by introducing the “Trust on First Use” (TOFU) model. The first time you come across a key, it is considered automatically trusted. If you suddenly come across a different key with the same identity on it, GnuPG will mark both keys as untrusted and let you decide on your own which one is “the right one.”

If you want to use the TOFU trust policy for patch attestation, you can add the following configuration parameter to your $HOME/.gitconfig:

[b4]
  attestation-trust-model = tofu

Alternatively, you can use the traditional GnuPG trust model, where you rely on cross-certification (“key signing”) to make a decision on which keys you trust.

Where to get help

If either b4 or patch attestation are breaking for you — or with any questions or comments — please reach out for help on the kernel.org tools mailing list:

  • tools@linux.kernel.org
 
Read more...

from David Ahern

Running docker service over management VRF requires the service to be started bound to the VRF. Since docker and systemd do not natively understand VRF, the vrf exec helper in iproute2 can be used.

This series of steps worked for me on Ubuntu 19.10 and should work on 18.04 as well:

  • Configure mgmt VRF and disable systemd-resolved as noted in a previous post about management vrf and DNS

  • Install docker-ce

  • Edit /lib/systemd/system/docker.service and add /usr/sbin/ip vrf exec mgmt to the Exec lines like this:

    ExecStart=/usr/sbin/ip vrf exec mgmt /usr/bin/dockerd -H fd://
    --containerd=/run/containerd/containerd.sock
    
  • Tell systemd about the change and restart docker

    systemctl daemon-reload
    systemctl restart docker
    

With that, docker pull should work fine – in mgmt vrf or default vrf.

 
Read more...

from David Ahern

Someone recently asked me why apt-get was not working when he enabled management VRF on Ubuntu 18.04. After a few back and forths and a little digging I was reminded of why. The TL;DR is systemd-resolved. This blog post documents how I came to that conclusion and what you need to do to use management VRF with Ubuntu (or any OS using a DNS caching service such as systemd-resolved).

The following example is based on a newly created Ubuntu 18.04 VM. The VM comes up with the 4.15.0-66-generic kernel which is missing the VRF module:

$ modprobe vrf
modprobe: FATAL: Module vrf not found in directory /lib/modules/4.15.0-66-generic

despite VRF being enabled and built:

$ grep VRF /boot/config-4.15.0-66-generic
CONFIG_NET_VRF=m

which is really weird.[4] So for this blog post I shifted to the v5.3 HWE kernel: $ sudo apt-get install --install-recommends linux-generic-hwe-18.04

although nothing about the DNS problem is kernel specific. A 4.14 or better kernel with VRF enabled and usable will work.

First, let's enable Management VRF. All of the following commands need to be run as root. For simplicity getting started, you will want to enable this sysctl to allow sshd to work across VRFs:

    echo "net.ipv4.tcp_l3mdev_accept=1" >> /etc/sysctl.d/99-sysctl.conf
    sysctl -p /etc/sysctl.d/99-sysctl.conf

Advanced users can leave that disabled and use something like the systemd instances to run sshd in Management VRF only.[1]

Ubuntu has moved to netplan for network configuration, and apparently netplan is still missing VRF support despite requests from multiple users since May 2018: https://bugs.launchpad.net/netplan/+bug/1773522

One option to workaround the problem is to put the following in /etc/networkd-dispatcher/routable.d/50-ifup-hooks:

#!/bin/bash

ip link show dev mgmt 2>/dev/null
if [ $? -ne 0 ]
then
        # capture default route
        DEF=$(ip ro ls default)

        # only need to do this once
        ip link add mgmt type vrf table 1000
        ip link set mgmt up
        ip link set eth0 vrf mgmt
        sleep 1

        # move the default route
        ip route add vrf mgmt ${DEF}
        ip route del default

        # fix up rules to look in VRF table first
        ip ru add pref 32765 from all lookup local
        ip ru del pref 0
        ip -6 ru add pref 32765 from all lookup local
        ip -6 ru del pref 0
fi
ip route del default

The above assumes eth0 is the nic to put into Management VRF, and it has a static IP address. If using DHCP instead of a static route, create or update the dhclient-exit-hook to put the default route in the Management VRF table.[3] Another option is to use ifupdown2 for network management; it has good support for VRF.[1]

Reboot node to make the changes take effect.

WARNING: If you run these commands from an active ssh session, you will lose connectivity since you are shifting the L3 domain of eth0 and that impacts existing logins. You can avoid the reboot by running the above commands from console.

After logging back in to the node with Management VRF enabled, the first thing to remember is that when VRF is enabled network addresses become relative to the VRF – and that includes loopback addresses (they are not that special).

Any command that needs to contact a service over the Management VRF needs to be run in that context. If the command does not have native VRF support, then you can use 'ip vrf exec' as a helper to do the VRF binding. 'ip vrf exec' uses a small eBPF program to bind all IPv4 and IPv6 sockets opened by the command to the given device ('mgmt' in this case) which causes all routing lookups to go to the table associated with the VRF (table 1000 per the setting above).

Let's see what happens:

$ ip vrf exec mgmt apt-get update
Err:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'
Err:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
  Temporary failure resolving 'security.ubuntu.com'
Err:3 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'
Err:4 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'

Theoretically, this should Just Work, but it clearly does not. Why?

Ubuntu uses systemd-resolved service by default with /etc/resolv.conf configured to send DNS lookups to it:

$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Oct 21 15:48 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

$ cat /etc/resolv.conf
...
nameserver 127.0.0.53
options edns0

So when a process (e.g., apt) does a name lookup, the message is sent to 127.0.0.53/53. In theory, systemd-resolved gets the request and attempts to contact the actual nameserver.

Where does the theory breakdown? In 3 places.

First, 127.0.0.53 is not configured for Management VRF, so attempts to reach it fail:

$ ip ro get vrf mgmt 127.0.0.53
127.0.0.53 via 157.245.160.1 dev eth0 table 1000 src 157.245.160.132 uid 0
    cache

That one is easy enough to fix. The VRF device is meant to be the loopback for a VRF, so let's add the loopback addresses to it:

    $ ip addr add dev mgmt 127.0.0.1/8
    $ ip addr add dev mgmt ::1/128

    $ ip ro get vrf mgmt 127.0.0.53
    127.0.0.53 dev mgmt table 1000 src 127.0.0.1 uid 0
        cache

The second problem is systemd-resolved binds its socket to the loopback device:

    $ ss -apn | grep systemd-resolve
    udp  UNCONN   0      0       127.0.0.53%lo:53     0.0.0.0:*     users:(("systemd-resolve",pid=803,fd=12))
    tcp  LISTEN   0      128     127.0.0.53%lo:53     0.0.0.0:*     users:(("systemd-resolve",pid=803,fd=13))

The loopback device is in the default VRF and can not be moved to Management VRF. A process bound to the Management VRF can not communicate with a socket bound to the loopback device.

The third issue is that systemd-resolved runs in the default VRF, so its attempts to reach the real DNS server happen over the default VRF. Those attempts fail since the servers are only reachable from the Management VRF and systemd-resolved has no knowledge of it.

Since systemd-resolved is hardcoded (from a quick look at the source) to bind to the loopback device, there is no option but to disable it. It is not compatible with Management VRF – or VRF at all.

$ rm /etc/resolv.conf
$ grep nameserver /run/systemd/resolve/resolv.conf > /etc/resolv.conf
$ systemctl stop systemd-resolved.service
$ systemctl disable systemd-resolved.service

With that it works as expected:
$ ip vrf exec mgmt apt-get update
Get:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease [242 kB]
Get:2 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:5 http://mirrors.digitalocean.com/ubuntu bionic-updates/universe amd64 Packages [1054 kB]

When using Management VRF, it is convenient (ie., less typing) to bind the shell to the VRF and let all commands run by it inherit the VRF binding: $ ip vrf exec mgmt su – dsahern

Now all commands run will automatically use Management VRF. This can be done at login using libpamscript[2].

Personally, I like a reminder about the network bindings in my bash prompt. I do that by adding the following to my .bashrc:

NS=$(ip netns identify)
[ -n "$NS" ] && NS=":${NS}"

VRF=$(ip vrf identify)
[ -n "$VRF" ] && VRF=":${VRF}"

And then adding '${NS}${VRF}' after the host in PS1:

PS1='${debian_chroot:+($debian_chroot)}\u@\h${NS}${VRF}:\w\$ '

For example, now the prompt becomes: dsahern@myhost:mgmt:~$

References:

[1] VRF tutorial, Open Source Summit, North America, Sept 2017 http://schd.ws/hosted_files/ossna2017/fe/vrf-tutorial-oss.pdf

[2] VRF helpers, e.g., systemd instances for VRF https://github.com/CumulusNetworks/vrf

[3] Example using VRF in dhclient-exit-hook https://github.com/CumulusNetworks/vrf/blob/master/etc/dhcp/dhclient-exit-hooks.d/vrf

[4] Vincent Bernat informed me that some modules were moved to non-standard package; installing linux-modules-extra-$(uname -r)-generic provides the vrf module. Thanks Vincent.

 
Read more...