How the ARM32 Linux kernel decompresses

August 12, 2020

ARM traditionally uses compressed kernels. This is done for two major reasons:

It saves space on the flash memory or other storage media holding the kernel, and memory is money. For example, on the Gemini platform that I work on, the uncompressed vmlinux kernel is 11.8 MB while the compressed zImage is a mere 4.8 MB: we save more than 50%.
It is faster to load because the time it takes for the decompression to run is shorter than the time that it takes to transfer an uncompressed image from the storage media, such as flash. For NAND flash controllers this can easily be the case.

This is intended as a comprehensive rundown of how the Linux kernel self-decompresses on ARM 32-bit legacy systems. All machines under arch/arm/* use this method if they are booted using a compressed kernel, and most of them are using compressed kernels.

Bootloader

The bootloader, whether RedBoot, U-Boot or EFI, places the kernel image somewhere in physical memory and executes it, passing some parameters in the lower registers.

Russell King defined the ABI for booting the Linux kernel from a bootloader in 2002 in the Booting ARM Linux document. The boot loader puts 0 into register r0, an architecture ID into register r1 and a pointer to the ATAGs in register r2. The ATAGs would contain the location and size of the physical memory. The kernel would be placed somewhere in this memory. It can be executed from any address as long as the decompressed kernel fits. The boot loader then jumps to the kernel in supervisor mode, with all interrupts, MMUs and caches disabled.
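To make the role of the ATAGs in r2 concrete, here is a minimal sketch of walking an ATAG list to find the first memory entry. The tag values and the two-word header (size in 32-bit words, then tag ID) follow the ATAG definitions as I understand them; the function name atags_find_mem is made up for this illustration and is not a kernel function.

```c
#include <stdint.h>

/* ATAG identifiers; each tag starts with a header of {size in words, tag}. */
#define ATAG_NONE 0x00000000u
#define ATAG_CORE 0x54410001u
#define ATAG_MEM  0x54410002u

/*
 * Hypothetical helper: scan an ATAG list for the first ATAG_MEM entry.
 * Returns 1 and fills in *start and *size on success, 0 otherwise.
 */
int atags_find_mem(const uint32_t *atags, uint32_t *start, uint32_t *size)
{
    const uint32_t *p = atags;

    /* A valid list must begin with ATAG_CORE. */
    if (p[1] != ATAG_CORE)
        return 0;

    /* The list is terminated by ATAG_NONE. */
    while (p[1] != ATAG_NONE) {
        if (p[0] < 2)
            return 0; /* malformed: a tag is at least its own header */
        if (p[1] == ATAG_MEM && p[0] >= 4) {
            *size = p[2];  /* payload word 0: size of the memory block */
            *start = p[3]; /* payload word 1: physical start address */
            return 1;
        }
        p += p[0]; /* advance by the tag size, counted in 32-bit words */
    }
    return 0;
}
```

A boot loader describing 128 MB of RAM at physical address 0 would hand over a list containing an ATAG_MEM entry with size 0x08000000 and start 0x00000000.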

On contemporary device tree kernels, r2 is repurposed as a pointer to the device tree blob (DTB) in physical memory. (In this case r1 is ignored.) A DTB can also be appended to the kernel image, and optionally amended using the ATAGs from r2. We will discuss this more below.

Decompression of the zImage

If the kernel is compressed, execution begins in arch/arm/boot/compressed/head.S at the symbol start:, a little way down the file. (This is not immediately evident.) It begins with 8 or 7 NOP instructions for legacy reasons. It then jumps over some magic numbers and saves the pointer to the ATAGs. The kernel decompression code is now executing from the physical address where it was loaded.

The decompression code then locates the start of physical memory. On most modern platforms this is done with the Kconfig-selected code AUTO_ZRELADDR, which performs a bitwise AND between the program counter and 0xf8000000, clearing the lower 27 bits and so rounding the address down to a 128 MB boundary. This means that the kernel readily assumes that it has been loaded and executed in the first 128 MB of the first block of physical memory.
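The masking step can be sketched in C as follows. This is only an illustration of the arithmetic: the real code is a couple of assembly instructions in head.S, and the names AUTO_ZRELADDR_MASK and guess_phys_start are invented here.

```c
#include <stdint.h>

/* Mask applied to the program counter, as described in the text. */
#define AUTO_ZRELADDR_MASK 0xf8000000u

/* Hypothetical helper: guess the start of physical memory from the PC. */
uint32_t guess_phys_start(uint32_t pc)
{
    /* Clearing the low 27 bits rounds down to a 128 MB boundary. */
    return pc & AUTO_ZRELADDR_MASK;
}
```

With the program counter at 0x403080C0 this yields 0x40000000. If the same image were loaded at 0x48000000 on a machine whose RAM starts at 0x40000000, the guess would be 0x48000000, which is why the image has to sit in the first 128 MB of memory for this method to work.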

There are patches being made that would instead attempt to get this information from the device tree.

Then the TEXT_OFFSET is added to the pointer to the start of physical memory. As the name says, this is where the kernel .text segment (as output from the compiler) should be located. The .text segment contains the executable code so this is the actual starting address of the kernel after decompression. The TEXT_OFFSET is usually 0x8000 so the kernel will be located 0x8000 bytes into the physical memory. This is defined in arch/arm/Makefile.

The 0x8000 (32 KB) offset is a convention, because usually there is some immobile architecture-specific data placed at 0x00000000 such as interrupt vectors, and many older systems place the ATAGs at 0x00000100. There must also be some space, because when the kernel finally boots, it will subtract 0x4000 (or 0x5000 for LPAE) from this address and store the initial kernel page table there.
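As a worked example of this address arithmetic, the sketch below computes where the decompressed kernel and the initial page table end up for a given start of physical memory. The struct and function names are hypothetical, and TEXT_OFFSET is fixed at the conventional 0x8000 for illustration.

```c
#include <stdint.h>

#define TEXT_OFFSET      0x00008000u /* conventional value from the text */
#define PG_DIR_SIZE      0x4000u     /* initial page table, non-LPAE */
#define PG_DIR_SIZE_LPAE 0x5000u     /* initial page table, LPAE */

/* Hypothetical container for the two addresses of interest. */
struct boot_layout {
    uint32_t kernel_start; /* where the decompressed kernel begins */
    uint32_t page_table;   /* where the initial page table is stored */
};

struct boot_layout compute_layout(uint32_t phys_start, int lpae)
{
    struct boot_layout l;

    l.kernel_start = phys_start + TEXT_OFFSET;
    /* The page table sits directly below the kernel text. */
    l.page_table = l.kernel_start - (lpae ? PG_DIR_SIZE_LPAE : PG_DIR_SIZE);
    return l;
}
```

For physical memory starting at 0x40000000, the kernel starts at 0x40008000 and the page table at 0x40004000, or 0x40003000 with LPAE.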

For some specific platforms the TEXT_OFFSET will be pushed downwards in memory. Notably, some Qualcomm platforms will push it to 0x00208000 because the first 0x00200000 (2 MB) of the physical memory is used for shared memory communication with the modem CPU.

Next the decompression code sets up a page table, if it is possible to fit one over the whole uncompressed+compressed kernel image. The page table is not for virtual memory, but for enabling cache, which is then turned on. The decompression will for natural reasons be much faster if we can use cache.

Next the kernel sets up a local stack pointer and malloc() area so we can handle subroutine calls and small memory allocation going forward, executing code written in C. This is set to point right after the end of the kernel image.

Figure: Memory set-up. Compressed kernel in memory with an attached DTB.

Next we check for an appended DTB, enabled by the ARM_APPENDED_DTB symbol. This is a DTB that is added to the zImage during build, often with the simple cat foo.dtb >> zImage. The DTB is identified using a magic number, 0xD00DFEED.
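The magic number check can be sketched like this. It assumes the standard flattened device tree header, where the magic is stored big-endian at offset 0 and the total blob size big-endian at offset 4; the function appended_dtb_size is a made-up name, not the code in head.S.

```c
#include <stdint.h>

#define FDT_MAGIC 0xd00dfeedu

/* Read a 32-bit big-endian value, independent of host byte order. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: image_end points at the first byte after the
 * zImage proper. Returns the total size of the appended DTB, or 0 if
 * the magic number does not match.
 */
uint32_t appended_dtb_size(const unsigned char *image_end)
{
    if (read_be32(image_end) != FDT_MAGIC)
        return 0;
    return read_be32(image_end + 4);
}
```

Because the header is big-endian while most ARM32 kernels run little-endian, the comparison has to be done on a byte-swapped value, which is what read_be32 takes care of here.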

If an appended DTB is found, and CONFIG_ARM_ATAG_DTB_COMPAT is set, we first expand the DTB by 50% and call atags_to_fdt, which will augment the DTB with information from the ATAGs, such as memory blocks and sizes.

Next, the DTB pointer (what was passed in as r2 in the beginning) is overwritten with a pointer to the appended DTB. We also save the size of the DTB, and set the end of the kernel image after the DTB so the appended DTB (optionally modified with the ATAGs) is included in the total size of the compressed kernel. If an appended DTB was found, we also bump the stack and the malloc() location so we don’t destroy the DTB.

Notice: if a device tree pointer was passed in r2, and an appended DTB was also supplied, the appended DTB “wins” and is what the system will use. This can sometimes be used to override a default DTB passed by a boot loader.

Notice: if ATAGs were passed in r2, there certainly was no DTB passed in through that register. You almost always want the CONFIG_ARM_ATAG_DTB_COMPAT symbol if you use an older boot loader that you do not want to replace, as the ATAGs properly define the memory on older platforms. It is possible to define the memory in the device tree, but more often than not, people skip this and rely on the boot loader to provide this, one way (the bootloader alters the DTB) or another (the ATAGs augment the appended DTB at boot).

Figure: Memory overlap. The decompressed kernel may overlap the compressed kernel.

Next we check if we would overwrite the compressed kernel with the uncompressed kernel. That would be unfortunate. If this would happen, we check where in the memory the uncompressed kernel would end, and then we copy ourselves (the compressed kernel) past that location.
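The overlap test and the choice of a new location can be sketched as below. This is a simplification under stated assumptions: the real check in head.S also has to account for the page table, stack and malloc() area, and the 256-byte rounding of the target is assumed here for illustration. Both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Would the decompressed kernel, placed at kernel_start, overlap the
 * compressed image occupying [image_start, image_end)?
 */
bool would_overwrite(uint32_t kernel_start, uint32_t kernel_size,
                     uint32_t image_start, uint32_t image_end)
{
    uint32_t kernel_end = kernel_start + kernel_size;

    /* Two half-open ranges intersect if each starts before the other ends. */
    return kernel_start < image_end && image_start < kernel_end;
}

/*
 * Where to copy the compressed image: just past the end of the
 * decompressed kernel, rounded up to a 256-byte boundary (assumption).
 */
uint32_t relocation_target(uint32_t kernel_start, uint32_t kernel_size)
{
    uint32_t kernel_end = kernel_start + kernel_size;

    return (kernel_end + 255u) & ~255u;
}
```

For a kernel of 0x00B00000 bytes decompressing to 0x40008000, an image loaded at 0x40300000 overlaps and must move, while one loaded at 0x41000000 can stay where it is.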

Then the code simply does a trick to jump back to the relocated address of a label called restart: which is the start of the code to set up the stack pointer and malloc() area, but now executing at the new physical address.

This means it will again set up the stack and malloc() area and look for the appended DTB and everything will look like the kernel was loaded in this location to begin with. (With one difference though: we have already augmented the DTB with ATAGs, so that will not be done again.) This time the uncompressed kernel will not overwrite the compressed kernel.

Figure: Moving the compressed kernel. We move the compressed kernel down so the decompressed kernel can fit.

There is no check for whether the memory runs out, i.e. whether we would happen to copy the kernel beyond the end of the physical memory. If this happens, the result is unpredictable. This can happen if the memory is 8 MB or less. In these situations, do not use compressed kernels.

Figure: Moving the compressed kernel below the decompressed kernel. The compressed kernel is moved below the decompressed kernel.

Now we know that the kernel can be decompressed into a memory that is below the compressed image and that they will not collide during decompression and we execute at the label wont_overwrite:.

We check if we are executing on the address the decompressor was linked to, and possibly alter some pointer tables. This is for the C runtime environment executing the decompressor.

We make sure that the caches are turned on. (It is not certain that there is space for a page table.)

We clear the BSS area (so all uninitialized variables will be 0), also for the C runtime environment.

Next we call the decompress_kernel() symbol in boot/compressed/misc.c which in turn calls do_decompress() which calls __decompress() which will perform the actual decompression.

This is implemented in C and the type of decompression is different depending on Kconfig options: the same decompressor as the compression selected when building the kernel will be linked into the image and executed from physical memory. All architectures share the same decompression library. The __decompress() function called will depend on which of the decompressors in lib/decompress_*.c was linked into the image. The selection of decompressor happens in arch/arm/boot/compressed/decompress.c by simply including the whole decompressor into the file.
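The compile-time selection pattern can be illustrated with a toy example. Everything here is a stand-in: the SKETCH_KERNEL_* macros play the role of the Kconfig options, and the two toy algorithms (a plain copy and a run-length decoder) replace the real decompressors, which the kernel pulls in with #include of the chosen lib/decompress_*.c file. The point is only that exactly one definition of the common entry point ends up in the build.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the Kconfig choice of compression. */
#define SKETCH_KERNEL_STORED 1

#if defined(SKETCH_KERNEL_STORED)
/* Toy "decompressor": the payload is stored uncompressed. */
int sketch_decompress(const unsigned char *in, size_t len, unsigned char *out)
{
    memcpy(out, in, len);
    return 0;
}
#elif defined(SKETCH_KERNEL_RLE)
/* Toy run-length decoder: input is a series of (count, byte) pairs. */
int sketch_decompress(const unsigned char *in, size_t len, unsigned char *out)
{
    if (len % 2)
        return -1;
    for (size_t i = 0; i < len; i += 2) {
        memset(out, in[i + 1], in[i]);
        out += in[i];
    }
    return 0;
}
#else
#error "no decompressor selected"
#endif

/* Common wrapper: callers never need to know which variant was built. */
int sketch_do_decompress(const unsigned char *in, size_t len,
                         unsigned char *out)
{
    return sketch_decompress(in, len, out);
}
```

The design choice mirrored here is that the caller is written once against a single function name, and the build configuration decides which body that name gets.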

All the variables the decompressor needs about the location of the compressed kernel are set up in the registers before calling the decompressor.

After decompression, the decompressed kernel is at TEXT_OFFSET and the appended DTB (if any) remains where the compressed kernel was.

After the decompression, we call get_inflated_image_size() to get the size of the final, decompressed kernel. We then flush and turn off the caches again.

We then jump to the symbol __enter_kernel which sets r0, r1 and r2 as the boot loader would have left them, unless we have an attached device tree blob, in which case r2 now points to that DTB. We then set the program counter to the start of the kernel, which will be the start of physical memory plus TEXT_OFFSET, typically 0x00008000 on a very conventional system, maybe 0x20008000 on some Qualcomm systems.
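The register state at this hand-over can be summarised in a small sketch. The struct and function are hypothetical and model only what the paragraph above describes: r0 and r1 as the boot loader left them, r2 replaced by the appended DTB if there is one, and the program counter set to the start of physical memory plus TEXT_OFFSET.

```c
#include <stdint.h>

/* Hypothetical snapshot of the registers when entering the kernel. */
struct entry_state {
    uint32_t r0; /* always 0 per the boot ABI */
    uint32_t r1; /* architecture (machine) ID */
    uint32_t r2; /* ATAGs or DTB pointer */
    uint32_t pc; /* first instruction of the decompressed kernel */
};

struct entry_state enter_kernel_state(uint32_t machine_id, uint32_t boot_r2,
                                      uint32_t appended_dtb,
                                      uint32_t phys_start,
                                      uint32_t text_offset)
{
    struct entry_state s;

    s.r0 = 0;
    s.r1 = machine_id;
    /* An appended DTB (non-zero address) takes precedence over r2. */
    s.r2 = appended_dtb ? appended_dtb : boot_r2;
    s.pc = phys_start + text_offset;
    return s;
}
```

On a conventional system with memory at 0x00000000 and a TEXT_OFFSET of 0x8000, this gives a program counter of 0x00008000.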

We are now at the same point as if we had loaded an uncompressed kernel, the vmlinux file, into memory at TEXT_OFFSET, passing (typically) a device tree in r2.

Kernel startup: executing vmlinux

The uncompressed kernel begins executing at the symbol stext(), start of text segment. This code can be found in arch/arm/kernel/head.S.

This is a subject for another discussion. However, notice that the code here does not look for an appended device tree! If an appended device tree should be used, you must use a compressed kernel. The same goes for augmenting any device tree with ATAGs. That must also use a compressed kernel image, for the code to do this is part of the assembly that bootstraps a compressed kernel.

Looking closer at a kernel uncompress

Let us look closer at a Qualcomm APQ8060 decompression.

First you need to enable CONFIG_DEBUG_LL, which enables you to hammer out characters on the UART console without any intervention of any higher printing mechanisms. All it does is to provide a physical address to the UART and routines to poll for pushing out characters. It sets up DEBUG_UART_PHYS so that the kernel knows where the physical UART I/O area is located. Make sure these definitions are correct.

Next, enable a Kconfig option called CONFIG_DEBUG_UNCOMPRESS. All this does is print the short message “Uncompressing Linux…” before decompressing the kernel and “done, booting the kernel” after the decompression. It is a nice smoke test to show that CONFIG_DEBUG_LL is set up, DEBUG_UART_PHYS is correct and decompression is working, but not much more. This does not provide any low-level debug.

The actual kernel decompression can be debugged and inspected by enabling the DEBUG define in arch/arm/boot/compressed/head.S. This is most easily done by adding -DDEBUG to the AFLAGS (assembler flags) for head.S in arch/arm/boot/compressed/Makefile, like this:

AFLAGS_head.o += -DTEXT_OFFSET=$(TEXT_OFFSET) -DDEBUG

We then get this message when booting:

C:0x403080C0-0x40DF0CC0->0x41801D00-0x422EA900

This means that as we were booting I loaded the kernel to 0x40300000 which would collide with the uncompressed kernel. Therefore the kernel was copied to 0x41801D00 which is where the uncompressed kernel will end. Adding some further debug prints we can see that an appended DTB is first found at 0x40DEBA68 and after moving the kernel down it is found at 0x422E56A8, which is where it remains when the kernel is booted.
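The numbers in this debug line can be checked with a small sketch. It reads the four fields as source start, source end, destination start and destination end, which is my interpretation based on the explanation above; the struct and function names are invented for the example.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical representation of one relocation debug line. */
struct reloc_info {
    uint32_t src_start, src_end; /* where the image was copied from */
    uint32_t dst_start, dst_end; /* where the image was copied to */
};

/* Parse a line like "C:0x403080C0-0x40DF0CC0->0x41801D00-0x422EA900". */
int parse_reloc_line(const char *line, struct reloc_info *r)
{
    int n = sscanf(line,
                   "C:0x%" SCNx32 "-0x%" SCNx32 "->0x%" SCNx32 "-0x%" SCNx32,
                   &r->src_start, &r->src_end, &r->dst_start, &r->dst_end);

    return n == 4;
}

/* Translate an address inside the source range to its new location. */
uint32_t relocate_addr(const struct reloc_info *r, uint32_t addr)
{
    return addr - r->src_start + r->dst_start;
}
```

Parsing the line from the boot log gives a source and a destination range of 0xAE8C00 bytes each, and relocating the first DTB address 0x40DEBA68 yields 0x422E56A8, matching the address where the DTB is found after the move.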