
In kernel v6.10 we managed to merge two security hardening patches to the ARM32 architecture:

  • PAN for LPAE CONFIG_CPU_TTBR0_PAN
  • KCFI on ARM32 CONFIG_CFI_CLANG

As of kernel v6.12 these seem sufficiently stable that users such as distributions and embedded systems can start taking a closer look at them. Below are the technical details!

A good rundown of these and other historically interesting security features can be found in Russell Currey's abridged history of kernel hardening which sums up what has been done up to now in a very approachable form.

PAN for LPAE

PAN is an abbreviation for the somewhat grammatically incorrect Privileged Access Never.

The fundamental idea with PAN on different architectures is to disable any access from kernelspace to userspace memory, unless explicitly requested through the dedicated user access helpers such as get_user(), put_user() and copy_from_user()/copy_to_user(). Attackers may want to compromise userspace from the kernel to access things such as keys, and we want to make this hard for them; in general it also protects userspace memory from corruption from kernelspace.

In some architectures such as S390 the userspace memory is completely separate from the kernel memory, but most simpler CPUs will just map the userspace into low memory (address 0x00000000 and up) and there it is always accessible from the kernel.

The ARM32 kernel has for a few years had a config option named CONFIG_CPU_SW_DOMAIN_PAN which uses a hardware feature (memory domains) whereby userspace memory is made inaccessible from kernelspace. There is a special bit in the page descriptors saying that a certain page or segment etc belongs to userspace, so this is possible for the hardware to deduce.

For modern ARM32 systems with large memories configured to use LPAE nothing like PAN was available: this version of the MMU simply did not implement a PAN option.

As of the patch originally developed by Catalin Marinas, we deploy a scheme that will use the fact that LPAE has two separate translation table base registers (TTBR:s): one for userspace (TTBR0) and one for kernelspace (TTBR1).

By simply disabling the use of any translations (page walks) on TTBR0 when executing in kernelspace – unless explicitly enabled in the user access helpers (get_user(), put_user(), copy_from_user(), copy_to_user() and friends) – we achieve the same effect as PAN. This is now turned on by default for LPAE configurations.
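To make this concrete, here is a minimal, hypothetical driver-style sketch (not taken from the kernel tree) of what the rule means in practice: with PAN or TTBR0-PAN enabled, dereferencing a userspace pointer directly from the kernel would fault, while the dedicated helpers briefly permit the access.

#include <linux/uaccess.h>
#include <linux/errno.h>
#include <linux/types.h>

/*
 * Hypothetical example: fetch a 32-bit value from a userspace pointer.
 * With PAN enabled, doing "*uptr" directly in kernelspace traps; the
 * access has to go through the uaccess helpers, which temporarily allow
 * userspace accesses (on LPAE: re-enable TTBR0 page table walks).
 */
static int read_user_value(const u32 __user *uptr, u32 *out)
{
	u32 val;

	if (get_user(val, uptr))	/* fails with -EFAULT on a bad pointer */
		return -EFAULT;

	*out = val;
	return 0;
}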

KCFI on ARM32

The Kernel Control Flow Integrity is a “forward edge control flow checker”, which in practice means that the compiler will store a hash of the function prototype in memory right before every function that can be called indirectly, and verify that hash at every indirect call site, so that an attacker cannot easily redirect such calls to arbitrary code.

KCFI is currently only implemented in the LLVM CLANG compiler, so the kernel needs to be compiled using CLANG. This is typically achieved by passing the build flag LLVM=1 to the kernel build. As the CLANG compiler is universal for all targets, the build system will figure out the rest.

Further, to support KCFI a fairly recent version of CLANG is needed. The kernel build will check whether the compiler is new enough to support the option -fsanitize=kcfi, otherwise the option will be disabled.

The patch set is pretty complex but gives you an overview of how the feature was implemented on ARM32. It involved patching the majority of functions written in assembly and called from C with the special SYM_TYPED_FUNC_START() and SYM_FUNC_END() macros, so that KCFI hashes get inserted also before functions written in assembly.
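To illustrate the mechanism in plain C (a sketch, not kernel code; the check shown in the comment is conceptual rather than the exact instructions clang emits):

typedef int (*handler_fn)(int);

/* With -fsanitize=kcfi, clang stores a hash derived from the prototype
 * "int (int)" in the bytes right before this function. */
static int add_one(int x)
{
	return x + 1;
}

int run_handler(void)
{
	handler_fn fn = add_one;

	/*
	 * Conceptually, the compiler inserts something like this before
	 * the indirect call:
	 *
	 *     if (hash_just_before(fn) != KCFI_HASH_OF(int (int)))
	 *             trap();
	 *
	 * so the call can only land on functions with a matching prototype.
	 */
	return fn(41);
}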

The overhead of this feature seems to be small so I recommend checking it out if you are able to use the CLANG compiler.

Recently I needed to understand how the ARM32 architecture switches control of execution between normal, userspace processes and the kernel processes, such as the init task and the kernel threads. Understanding this invariably involves understanding two aspects of the ARM32 kernel:

  • How tasks are actually scheduled on ARM32
  • How the kernelspace and userspace are actually separated, and thus how we move from one to the other

This is going to require knowledge from some other (linked) articles and a good understanding of ARM32 assembly.

Terminology

With tasks we mean processes, threads and kernel threads. The kernel scheduler sees no major difference between these: they are schedulable entities that live on a certain CPU.

Kernel threads are the easiest to understand: in the big computer program that is the kernel, different threads execute on behalf of the kernel itself, managing its internals. They are all instantiated by a special thread called kthreadd — the kernel thread daemon. They exist for various purposes: one is to provide process context to interrupt threads, another to run workqueues such as delayed work, and so on. It is handy for e.g. kernel drivers to be able to hand over execution to a process context that can churn on in the background.
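As a hypothetical example of how a driver would make use of this (the worker function and thread name below are made up, but kthread_run() and kthread_should_stop() are the standard kthread API, which internally asks kthreadd to spawn the thread):

#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

/* A background worker running in process context, spawned via kthreadd. */
static int my_background_worker(void *data)
{
	while (!kthread_should_stop()) {
		/* ...do some housekeeping on behalf of the driver... */
		msleep(1000);	/* we can sleep: we have process context */
	}
	return 0;
}

static struct task_struct *worker_task;

/* Somewhere in the driver setup code: */
static int start_worker(void)
{
	worker_task = kthread_run(my_background_worker, NULL, "my_worker");
	return PTR_ERR_OR_ZERO(worker_task);
}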

Processes in userspace are in essence executing computer programs, or objects with an older terminology, giving the origin of expressions such as object file format. The kernel will start very few such processes, but modprobe and init (which always has process ID 1) are notable exceptions. Any other userspace processes are started by init. Processes can fork new processes, and a process can also create separate threads of execution within itself; these become schedulable entities as well, so a certain process (executing computer program) can have concurrency within itself. POSIX threads is usually the way this happens, and further abstractions such as the GLib GThread etc exist.
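A minimal userspace sketch of that last point: the thread created below becomes a schedulable entity of its own, alongside the main thread of the process.

#include <pthread.h>
#include <stdio.h>

/* Both main() and worker() end up as separate schedulable tasks
 * in the kernel, sharing the same memory context. */
static void *worker(void *arg)
{
	printf("hello from the second schedulable entity\n");
	return NULL;
}

int main(void)
{
	pthread_t thread;

	pthread_create(&thread, NULL, worker, NULL);
	pthread_join(thread, NULL);
	return 0;
}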

Task pie chart: a pie chart of tasks according to priority on a certain system, produced using CGFreak. It shows that from a scheduler point of view there are just tasks; any kernel threads or threads spawned from processes just become schedulable task entities.

The userspace is the commonplace name given to a specific context of execution where we execute processes. What defines this context is that it has its own memory context, a unique MMU table, which in the ARM32 case gives each process a huge virtual memory to live in. Its execution is isolated from the kernel and also from other processes, but not from its own threads (typically POSIX threads). To communicate with either the kernel or other userspace processes, it needs to use system calls, “syscalls”, or emit or receive signals. Both mechanisms are realized as software interrupts. (To communicate with its own spawned threads, shortcuts are available.)

The kernelspace conversely is the context of execution of the operating system, in our case Linux. It has its own memory context (MMU table) but some of the kernel memory is usually also accessible by the userspace processes, and the virtual memory space is shared, so that exceptions can jump directly into kernel code in virtual memory, and the kernel can directly read and write into userspace memory. This is done to facilitate quick communication between the kernel and userspace. Depending on the architecture we are executing Linux on, executing in kernelspace is associated with elevated machine privileges, and means the operating system can issue certain privileged instructions or otherwise access certain restricted resources. The MMU table permissions protect kernel code from being inspected or overwritten by userspace processes.

Background

This separation, along with everything else we take for granted in modern computers and operating systems, was created in the first time-sharing systems such as the CTSS running on the IBM 700/7000 series computers in the late 1950s. The Ferranti Atlas Computer in 1962-1967 and its supervisor program followed shortly after these. The Atlas invented nifty features such as virtual memory and memory-mapped I/O, and was of course also using time-sharing. As can be easily guessed, the designs of these computers and their operating systems (supervisors) inspired the hardware design and operating system designs of later computers such as the PDP-11, where Unix began. This is why Unix-like operating systems such as Linux more or less take all of these features and concepts for granted.

The idea of a supervisor or operating system goes deep into the design of CPUs, so for example the Motorola 68000 CPU had three function code pins routed out on the package, FC2, FC1 and FC0, comprising three bits of system mode; four of these bit combinations represent user data, user program, supervisor data and supervisor program. (These modes even reflect the sectioning of program and supervisor objects into program code or TEXT segments and DATA segments.) In the supervisor mode, FC2 was always asserted. This way physical access to memory-mapped peripherals could be electronically constrained to access only from supervisor mode. Machines such as the Atari ST exploited this possibility, while others such as the Commodore Amiga did not.

All this said to give you a clear idea why the acronym SVC as in Supervisor Call is used rather than e.g. operating system call or kernel call which would have been more natural. This naming is historical.

Execution Modes or Levels

We will restrict the following discussion to the ARMv4 and later ARM32 architectures which is what Linux supports.

When it comes to the older CPUs in the ARMv4, ARMv5 and ARMv6 range, these have a special supervisor mode (SVC mode) and a user mode, and as you could guess these two modes are mapped directly to kernelspace and userspace in Linux. In addition to this there are actually 5 additional exception modes for FIQ, IRQ, system mode, abort and undefined, so 7 modes in total! To cut a long story short, all of the modes except the user mode belong to kernelspace.

Apart from restricting certain instructions, the only thing actually separating the kernelspace from userspace is the MMU, which is protecting kernelspace from userspace in the same way that different userspace processes are protected from each other: by using virtual memory to hide physical memory, and in the cases where it is not hidden: using protection bits in the page table to restrict access to certain memory areas. The MMU table can naturally only be altered from supervisor mode and this way it is clear who is in control.

The later versions of the ARM32 CPU, the ARMv7, add some further modes: a hypervisor mode and an even deeper secure monitor or just monitor mode.

For reference, these modes in the ARMv8 architecture correspond to “privilege levels”. Here the kernelspace executes at exception level EL1, and userspace at exception level EL0; then there are further EL2 and EL3 “higher” privilege levels. EL2 is used for hypervisors (virtualization) and EL3 is used for a secure monitor that oversees the switch back and forth to the trusted execution environment (TEE), which is a parallel and different operating environment, essentially like a different computer: Linux can interact with it (as can be seen in drivers/tee in the kernel) but it is a different thing than Linux entirely.

These higher privilege levels and the secure mode with its hypervisor and TEE are not always used and may be dormant. Strictly speaking, the security and virtualization functionality is optional, so it is perfectly fine to fabricate ARMv7 silicon without them. To accompany the supervisor call (SVC), ARMv7 added a hypervisor call (HVC) and a secure monitor call (SMC) instruction.

Exceptional Events

We discussed that different execution modes pertain to certain exceptions. So let's recap ARM32 exceptions.

As exceptions go, these happen both in kernelspace and userspace, but they are always handled in kernelspace. If a userspace process for example divides by zero, an exception occurs that takes us into the kernel, all the time pushing state onto the stack, and resuming execution inside the kernel, which will simply terminate the process over this. If the kernel itself divides by zero we get a kernel crash since there is no way out.

The most natural exception is of course a hardware interrupt, such as when a user presses a key or a hard disk signals that a sector of data has been placed in a buffer, or a network card indicates that an ethernet packet is available from the interface.

Additionally, as mentioned previously, most architectures support a special type of software exception that is initiated for carrying out system calls, and on ARM and Aarch64 that is what these days is called the SVC (supervisor call) instruction. This very same instruction — i.e. with the same binary operation code — was previously called SWI (software interrupt), which makes things a bit confusing at times, especially when reading old documentation and old code, but the assembly mnemonics SVC and SWI have the same semantics. For comparison, on m68k this instruction is named TRAP, on x86 there is the INT instruction, and RISC-V uses the ECALL instruction (which is also how SBI, supervisor binary interface, calls are made).
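To see what this looks like from userspace, here is a small hypothetical ARM32 EABI example that issues the getpid() system call directly with the SVC instruction (assuming the usual EABI convention: syscall number in r7, return value in r0, and __NR_getpid being 20 on ARM32):

#include <stdio.h>

/* Issue the getpid() syscall on ARM32 EABI using SVC directly. */
static long raw_getpid(void)
{
	register long r7 __asm__("r7") = 20;	/* __NR_getpid on ARM32 */
	register long r0 __asm__("r0");

	__asm__ volatile("svc #0"		/* old mnemonic: swi 0 */
			 : "=r" (r0)
			 : "r" (r7)
			 : "memory");
	return r0;
}

int main(void)
{
	printf("pid = %ld\n", raw_getpid());
	return 0;
}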

In my article about how the ARM32 architecture is set up I talk about the exception vector table, which is eight 32-bit vectors stored in virtual memory from 0xFFFF0000 to 0xFFFF0020, and it corresponds roughly to everything that can take us from userspace to kernelspace and back.

The transitions occur at these distinct points:

  • A hardware RESET occurs. This is pretty obvious: we need to abort all user program execution, return to the kernel and take everything offline.
  • An undefined instruction is encountered. The program flow cannot continue if this happens and the kernel has to do something about it. The most typical use for this is to implement software fallbacks for floating-point arithmetic instructions that some hardware may be lacking. These fallbacks will in that case be implemented by the kernel. (Since doing this with a context switch and a software fallback in the kernel is expensive, you would normally just compile the program with a compiler that replaces the floating point instructions with software fallbacks to begin with, but not everyone has the luxury of source code and a build environment available, and some have to run pre-compiled binaries with floating point instructions.)
  • A software interrupt occurs. This is the most common way that a userspace application issues a system call (supervisor call) into the operating system. As mentioned, on ARM32 this is implemented by the special SVC (aka SWI) instruction that passes a 1-byte parameter to the software interrupt handler.
  • A prefetch abort occurs. This happens when the instruction pointer runs into unpaged memory, and the virtual memory manager (mm) needs to page in new virtual memory to continue execution. Naturally this is a kernel task.
  • A data abort occurs. This is essentially the same as the prefetch abort but the program is trying to access unpaged data rather than unpaged instructions.
  • An address exception occurs. This doesn't happen on modern ARM32 CPUs, because the exception is for when the CPU moves outside the former 26bit address space on ARM26 architectures that Linux no longer supports.
  • A hardware interrupt occurs – since the operating system handles all hardware, naturally whenever one of these occurs, we have to switch to kernel context. The ARM CPUs have two hardware interrupt lines: IRQ and FIQ. Each can be routed to an external interrupt controller, the most common being the GIC (Generic Interrupt Controller), especially for multicore systems, but many ARM systems use their own custom interrupt controllers.
  • A fault occurs such as through division by zero or other arithmetic fault – the CPU runs into an undefined state and has no idea how to recover and continue. This is also called a processor abort.

That's all. But these are indeed exceptions. What is the rule? The computer programs that correspond to the kernel and each userspace process have to start somewhere, and then they are executed in time slices, which means that somehow they get interrupted by one of these exceptions and preempted, a procedure that in turn invariably involves transitions back and forth from userspace to kernelspace and back into userspace again.

So how does that actually happen? Let's look at that next.

Entering Kernelspace

Everything has a beginning. I have explained in a previous article how the kernel bootstraps from the function start_kernel() in init/main.c and sets up the architecture, including virtual memory, to a point where the architecture-neutral parts of the kernel start executing.

Further down start_kernel() we initialize the timers, start the clocksource (the Linux system timeline) and initialize the scheduler so that process scheduling can happen. But nothing really happens, because there are no processes. Then the kernel reaches the end of the start_kernel() function where arch_call_rest_init() is called. This is in most cases a call to rest_init() in the same file (only S390 does anything different) and that in turn actually initializes some processes:

pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
(...)
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);

We create separate threads running the in-kernel functions kernel_init and kthreadd, the latter being the kernel thread daemon which in turn spawns all kernel threads.

The user_mode_thread() or kernel_thread() calls create a new processing context: they both call kernel_clone() which calls copy_process() with NULL as first argument, meaning it will not actually copy any process but instead create a new one. It will create a new task using dup_task_struct() passing current as argument, which is the init task and thus any new task is eventually derived from the compiled-in init task. Then there is a lot of cloning going on, and we reach copy_thread() which calls back into the architecture to initialize struct thread_info for the new task. This is a struct we will look at later, but notice one thing, and that is that when a new kernel or user mode thread is created like this (with a function such as kernel_init passed instead of just forking), the following happens:

memset(childregs, 0, sizeof(struct pt_regs));
thread->cpu_context.r4 = (unsigned long)args->fn_arg;
thread->cpu_context.r5 = (unsigned long)args->fn;
childregs->ARM_cpsr = SVC_MODE;
(...)
thread->cpu_context.pc = (unsigned long)ret_from_fork;

fn_arg will be NULL in this case but fn is kernel_init or kthreadd. And we execute in SVC_MODE which is the supervisor mode: as the kernel. Also user mode threads are initialized as supervisor mode tasks to begin with, but they will eventually modify themselves into userspace tasks. Setting the CPU context to ret_from_fork will be significant, so notice this!

Neither of the functions kernel_init or kthreadd will execute at this point! We will just return. The threads are initialized but nothing is scheduled yet: we have not yet called schedule() a single time, so nothing happens yet.

kernel_init is a function in the same file that, as indicated, will initialize the first userspace process. If you inspect this function you will see that it keeps executing some kernel code for quite a while: it waits for kthreadd to finish initialization so the kernel is ready for action, then it will actually do some housekeeping such as freeing up the kernel initmem (functions tagged __init) and only then proceed to run_init_process(). As indicated, this will start the init process using kernel_execve(), usually /sbin/init, which will then proceed to spawn all usermode processes/tasks. kernel_execve() will check for supported binary formats and most likely call the ELF loader to process the binary and page the file into memory from the file system etc. If this goes well, it will end with a call to the macro START_THREAD() which in turn wraps the ARM32-specific start_thread(), which will, amongst other things, do this:

regs->ARM_cpsr = USR_MODE;
(...)
regs->ARM_pc = pc & ~1;

So the new userspace process will get pushed into userspace mode by the ELF loader, and that will also set the program counter to wherever the ELF file is set to execute. regs->ARM_cpsr will be pushed into the CPSR register when the task is scheduled, and we start the first task executing in userspace.

kthreadd on the other hand will execute a perpetual loop starting other kernel daemons as they are placed on a creation list.

But as said: neither is executing.

In order to actually start the scheduling we call schedule_preempt_disabled() which will issue schedule() with preemption disabled: we can schedule tasks, and they will not interrupt each other (preempt) in a fine-grained manner, so the scheduling is more “blocky” at this point. However: we already have the clockevent timer running so the operating system is now ticking, and new calls to the main scheduler callbacks scheduler_tick() and schedule() will happen at different points in future time, at least at the system tick granularity (HZ) if nothing else happens. We will explain more about this further on in the article.

Until this point we have been running in the context of the Linux init task, which is an elusive hard-coded kernel thread with PID 0 that is defined in init/init_task.c and which I have briefly discussed in a previous article. This task does not even appear in procfs in /proc.

As we call schedule(), the kernel init task will yield and give way to kthreadd and then to the userspace init process. However when the scheduler again schedules the init task with PID 0, we return to rest_init(), and we will call cpu_startup_entry(CPUHP_ONLINE). That function is in kernel/sched/idle.c and looks like this:

void cpu_startup_entry(enum cpuhp_state state)
{
        arch_cpu_idle_prepare();
        cpuhp_online_idle(state);
        while (1)
                do_idle();
}

That's right: this function never returns. Nothing ever breaks out of the while(1) loop. All that do_idle() does is wait until nothing needs to be scheduled, and then call down into the cpuidle subsystem. This will make the CPU “idle”, i.e. sleep, since nothing is going on. Then the loop repeats. The kernel init task, PID 0 or the “main() function” that begins at start_kernel() and ends here, will just try to push the system down to idle, forever. So this is the eventual fate of the init task. The kernel has some documentation of the inner loop that assumes that you know this context.

Let's look closer at do_idle() in the same file, which has roughly this look (the actual code is more complex, but this is the spirit of it):

while (!need_resched()) {
    local_irq_disable();
    enter_arch_idle_code();
    /* here a considerable amount of wall-clock time can pass */
    exit_arch_idle_code();
    local_irq_enable();
}
(...)
schedule_idle();

This will spin here until something else needs to be scheduled, meaning the init task has the TIF_NEED_RESCHED bit set, and should be preempted. The call to schedule_idle() soon after exiting this loop makes sure that this rescheduling actually happens: this calls right into the scheduler to select a new task and is a variant of the more generic schedule() call which we will see later.

We will look into the details soon, but we see the basic pattern of this perpetual task: see if someone else needs to run, else idle; and when someone else wants to run, stop idling and explicitly yield to whatever task was waiting.

Scheduling the first task

So we know that schedule() has been called once on the primary CPU, and we know that this will set the memory management context to the first task, set the program counter to it and execute it. This is the most brutal approach to having a process scheduled, and we will detail what happens further down.

We must however look at the bigger picture of kernel preemption to get the full picture of what happens here.

Scheduler model: a mental model of the scheduler. scheduler_tick() sets the flag TIF_NEED_RESCHED and a later call to schedule() will actually call out to check_and_switch_context() which does the job of switching task.

Scheduler tick and TIF_NEED_RESCHED

As part of booting the kernel in start_kernel() we first initialized the scheduler with a call to sched_init() and the system tick with a call to tick_init() and then the timer drivers using time_init(). The time_init() call will go through some loops and hoops and end up initializing and registering the clocksource driver(s) for the system, such as those that can be found in drivers/clocksource.

There will sometimes be only a broadcast timer to be used by all CPU:s on the system (the interrupts will then need to be broadcast to all the CPU:s using IPI:s, inter-processor interrupts) and sometimes more elaborate architectures have timers dedicated to each CPU, so these can be used individually by each core to plan events and drive the system tick on that specific CPU.

The most suitable timer will also be started as part of the clockevent device(s) being registered. However, its interrupt will not be able to fire until local_irq_enable() is called further down in start_kernel(). After this point the system has a running scheduling tick.

As scheduling happens separately on each CPU, scheduler timer interrupts and rescheduling calls need to be done separately on each CPU as well.

The clockevent drivers can provide a periodic tick, and then the process will be interrupted after an appropriate number of ticks, or the driver can provide oneshot interrupts, and then it can plan an event further on, avoiding firing interrupts while the task is running just for the sake of ticking and switching (a shortcut known as NO_HZ).

What we know for sure is that this subsystem always has a new tick event planned for the system. It can happen in 1/HZ seconds if periodic ticks are used, or it can happen several minutes into the future if nothing happens for a while in the system.

When the clockevent eventually fires, in the form of an interrupt from the timer, it calls its own ->event_handler() which is set up by the clockevent subsystem code. When the interrupt happens it will fast-forward the system tick by repetitive calls to do_timer() followed by a call to scheduler_tick(). (We reach this point through different paths depending on whether HRTimers and other kernel features are enabled or not.)

As a result of calling scheduler_tick(), some scheduler policy code such as deadline, CFS, etc (this is explained by many others elsewhere) will decide that the current task needs to be preempted, “rescheduled” and calls resched_curr(rq) on the runqueue for the CPU, which in turn will call set_tsk_need_resched(curr) on the current task, which flags it as ready to be rescheduled.

set_tsk_need_resched() will set the flag TIF_NEED_RESCHED for the task. The flag is implemented as an arch-specific bitfield, in the ARM32 case in arch/arm/include/asm/thread_info.h and ARM32 has a bitmask version of this flag helpfully named _TIF_NEED_RESCHED that can be used by assembly snippets to check it quickly with a logical AND operation.

This bit having been set does not in any way mean that a new process will start executing immediately. The flag semantically means “at your earliest convenience, yield to another task”. So the kernel waits until it finds an appropriate time to preempt the task, and that time is when schedule() is called.
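Simplified for illustration, the generic helpers involved look roughly like this (the real definitions live in include/linux/sched.h and friends):

/* Mark a task as needing to be rescheduled at the next opportunity. */
static inline void set_tsk_need_resched(struct task_struct *tsk)
{
	set_tsk_thread_flag(tsk, TIF_NEED_RESCHED);
}

/* Does the current task have TIF_NEED_RESCHED set? This is what e.g.
 * the idle loop and the preemption points poll. */
static __always_inline bool need_resched(void)
{
	return unlikely(test_thread_flag(TIF_NEED_RESCHED));
}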

The Task State and Stack

We mentioned the architecture-specific struct thread_info so let's hash out where that is actually stored. It is a simpler story than it used to be, because these days the ARM32 thread_info is simply part of the task_struct. The struct task_struct is the central per-task information repository that the generic parts of the Linux kernel hold for a certain task, and paramount to keeping the task state. Here is a simplified view that gives you an idea about how much information and how many pointers it actually contains:

struct task_struct {
    struct thread_info thread_info;
    (...)
    unsigned int state;
    (...)
    void *stack;
    (...)
    struct mm_struct *mm;
    (...)
    pid_t pid;
    (...)
};

The struct thread_info which in our case is a member of task_struct contains all the architecture-specific aspects of the state.

The task_struct refers to thread_info, but also to a separate piece of memory void *stack called the task stack, which is where the task will store its activation records when executing code. The task stack is of size THREAD_SIZE, usually 8KB (2 * PAGE_SIZE). These days, in most systems, the task stack is mapped into the VMALLOC area.

The last paragraph deserves some special mentioning with regards to ARM32 because things changed. Ard Biesheuvel recently first enabled THREAD_INFO_IN_TASK which enabled thread info to be contained in the task_struct, and then enabled CONFIG_VMAP_STACK for all systems in the ARM32 kernel. This means that the VMALLOC memory area is used to map and access the task stack. This is good for security reasons: the task stack is a common target for kernel security exploits, and by moving this to the VMALLOC area, which is simply a huge area of virtual memory addresses, and surrounding it below and above with unmapped pages, we will get a page violation if the kernel tries to access memory outside the current task stack!

Task struct: the task_struct in the Linux kernel is where the kernel keeps a nexus of all information about a certain task, i.e. a certain processing context. It contains .mm, the memory context where all the virtual memory mappings live for the task. The thread_info is inside it, and inside the thread_info is a cpu_context_save. It has a task stack of size THREAD_SIZE, which for ARM32 is typically twice the PAGE_SIZE, i.e. 8KB, surrounded by unmapped memory for protection. Again this memory is mapped in the memory context of the process. The split between task_struct and thread_info is such that task_struct is Linux-generic and thread_info is architecture-specific, and they correspond 1-to-1.
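For reference, here is a simplified view of the ARM32-specific structs referenced above (see arch/arm/include/asm/thread_info.h for the full versions):

/* The registers saved across a context switch by __switch_to(). */
struct cpu_context_save {
	__u32	r4;
	__u32	r5;
	__u32	r6;
	__u32	r7;
	__u32	r8;
	__u32	r9;
	__u32	sl;
	__u32	fp;
	__u32	sp;
	__u32	pc;
	__u32	extra[2];		/* Xscale 'acc' register, etc */
};

/* Architecture-specific per-task state, embedded in task_struct. */
struct thread_info {
	unsigned long		flags;		/* TIF_NEED_RESCHED lives here */
	int			preempt_count;
	__u32			cpu;		/* cpu this task is running on */
	struct cpu_context_save	cpu_context;	/* cpu context for __switch_to() */
	(...)
};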

Actual Preemption

In my mind, preemption happens when the program counter is actually set to a code segment in a different process, and this will happen at different points depending on how the kernel is configured. This happens as a result of schedule() getting called, and will in essence be a call down to the architecture to switch memory management context and active task. But where and when does schedule() get called?

schedule() can be called for two reasons:

  • Voluntary preemption: such as when a kernel thread wants to give up its time slice because it knows it cannot proceed for a while. This is the case for most instances of this call that you find in the kernel. A special case is when we start the kernel and call schedule_preempt_disabled() the very first time: we voluntarily preempt the kernel execution of the init task with PID 0 to instead execute whatever is queued and prioritized in the scheduler, and that will be the kthreadd process. Other places can be found by git grep:ing for calls to cond_resched() or just an explicit call to schedule() (see the sketch after this list).
  • Forced preemption: this happens when a task is simply scheduled out. This happens to kernel threads and userspace processes alike, when a task has used up its timeslice and scheduler_tick() has set the TIF_NEED_RESCHED flag, as described in the previous section.
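A hypothetical sketch of the voluntary case: a kernel thread doing a long-running job offers to yield on every iteration, and if TIF_NEED_RESCHED has been set in the meantime, cond_resched() ends up calling schedule().

#include <linux/sched.h>

/* Hypothetical long-running kernel thread function. */
static int crunch_numbers(void *data)
{
	int i;

	for (i = 0; i < 1000000; i++) {
		/* ...do one bounded chunk of work... */

		cond_resched();	/* voluntary preemption point: may call schedule() */
	}
	return 0;
}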

Places where forced preemption happens:

The short answer to the question “where does forced preemption happen?” is “at the end of exception handlers”. Here are the details.

The most classical place for preemption of userspace processes is on the return path of a system call. This happens from arch/arm/kernel/entry-common.S in the assembly snippets for ret_slow_syscall() and ret_fast_syscall(), where the ARM32 kernel makes an explicit call to do_work_pending() in arch/arm/kernel/signal.c. This will issue a call to schedule() if the flag _TIF_NEED_RESCHED is set for the thread, and the kernel will hand over execution to whichever task is prioritized next, no matter whether it is a userspace or kernelspace task. A special case is ret_from_fork, which means a new userspace process has been forked; in many cases the parent gets preempted immediately in favor of the new child through this path.

The most common place for preemption is however when returning from a hardware interrupt. Interrupts on ARM32 are handled in assembly in arch/arm/kernel/entry-armv.S with a piece of assembly that saves the processor state for the current CPU into a struct pt_regs and from there just calls the generic interrupt handling code in kernel/irq/handle.c named generic_handle_arch_irq(). This code is used by other archs than ARM32 and will nominally just store the system state and registers in a struct pt_regs record on entry and restore it on exit. However when the simplistic code in generic_handle_arch_irq() is done, it exits through the same routines in arch/arm/kernel/entry-common.S as fast and slow syscalls, and we can see that in ret_to_user_from_irq the code will explicitly check for the resched and other flags with ldr r1, [tsk, #TI_FLAGS] and branch to the handler doing do_work_pending(), and consequently preempt to another task instead of returning from an interrupt.

Now study do_work_pending():

asmlinkage int do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
{
        /*
         * The assembly code enters us with IRQs off, (...)
         */

        do {
                if (likely(thread_flags & _TIF_NEED_RESCHED)) {
                        schedule();
                } else {
                        (...)
                }
                local_irq_disable();
                thread_flags = read_thread_flags();
        } while (...);
        return 0;
}

Notice the comment: we enter do_work_pending() with local IRQs disabled so we can't get interrupted in an interrupt (other exceptions can still happen though). Then we likely call schedule() and another thread needs to start to run. When we return after having scheduled another thread we are supposed to proceed to exit the exception handler with interrupts disabled, which is why the first instruction after the if/else-clause is local_irq_disable() – we might have come back from a kernel thread which was happily executing with interrupts enabled. So disable them. In fact, if you grep for do_work_pending you will see that this looks the same on other architectures with a similar setup.

In reality do_work_pending() does a few more things than preemption: it also handles signals between processes and process termination etc. But for this exercise we only need to know that it calls schedule() followed by local_irq_disable().

The struct pt_regs should be understood as “processor trace registers” which is another historical naming, much due to its use in tracing. On ARM32 it is in reality 18 32-bit words representing all the registers and status bits of the CPU for a certain task, i.e. the CPU state, including the program counter pc, which is the place where the task was supposed to resume execution, unless it got preempted by schedule(). This way, if we preempt and leave a task behind, the CPU state contains all we need to know to continue where we left off. These pt_regs are stored in the task stack during the call to generic_handle_arch_irq().
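Slightly simplified, the ARM32 definition (from arch/arm/include/asm/ptrace.h) is just an array of those 18 words, with named accessors on top:

struct pt_regs {
	unsigned long uregs[18];
};

#define ARM_ORIG_r0	uregs[17]	/* original r0: syscall argument/restart info */
#define ARM_cpsr	uregs[16]	/* saved processor status */
#define ARM_pc		uregs[15]	/* where to resume execution */
#define ARM_lr		uregs[14]
#define ARM_sp		uregs[13]
/* ...and so on down to ARM_r0 at uregs[0]... */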

The assembly in entry-common.S can be a bit hard to follow; here are the core essentials for the return path from an interrupt that occurs while we are executing in userspace:

	(...)
slow_work_pending:
	mov	r0, sp				@ 'regs'
	mov	r2, why				@ 'syscall'
	bl	do_work_pending
	cmp	r0, #0
	beq	no_work_pending
	(...)

ENTRY(ret_to_user_from_irq)
	ldr	r1, [tsk, #TI_FLAGS]
	movs	r1, r1, lsl #16
	bne	slow_work_pending
no_work_pending:
	asm_trace_hardirqs_on save = 0
	ct_user_enter save = 0
	restore_user_regs fast = 0, offset = 0

We see that when we return from an IRQ, we check the flags in the thread and if any bit is set we branch to execute slow work, which is done by do_work_pending() which will potentially call schedule(), then return, possibly much later, and if all went fine branch back to no_work_pending, restore the usermode registers and continue execution.

Notice that the exception we are returning from here can be the timer interrupt that was handled by the Linux clockevent and driving the scheduling by calling scheduler_tick()! This means we can preempt directly on the return path of the interrupt that was triggered by the timer tick. This way the slicing of task time is as precise as it can get: scheduler_tick() gets called by the timer interrupt, and if it sets TIF_NEED_RESCHED a different thread will start to execute on our way out of the exception handler!

The same path will be taken by SVC/SWI software exceptions, so these will also lead to rescheduling if necessary. The routine named restore_user_regs can be found in entry-header.S and it will pretty much do what it says, ending with the following instructions (if we remove quirks and assume the slowpath):

	mov	r2, sp
	(...)
	ldmdb	r2, {r0 - lr}^			@ get calling r0 - lr
	add	sp, sp, #\offset + PT_REGS_SIZE
	movs	pc, lr				@ return & move spsr_svc into cpsr

r2 is set to the stack pointer, where pt_regs are stored: 16 registers plus the CPSR (current program status register) and ORIG_r0. We pull the registers r0 - lr from the stack (including r2, which gets overwritten) — NOTE: the little caret (^) after the ldmdb instruction means “load into the user-mode (banked) registers” — then move the stack pointer past the saved registers and return. The s-suffix on movs pc, lr makes the CPU copy spsr_svc back into the CPSR at the same time, restoring the user-mode status register.

Using the exceptions as a point for preemption is natural: exceptions by their very nature are designed to store the processor state before jumping to the exception handler, and it is strictly defined how to store this state into memory such as onto the per-task task stack, and how to reliably restore it at the end of an exception. So this is a good point to do something else, such as switch to something completely different.

Also notice that this must happen at the end of the interrupt (exception) handler. You can probably imagine what would happen on a system with level-triggered interrupts if we were to preempt at, say, the beginning of the interrupt handler instead of at the end: we would not reach the hardware interrupt handler, and the interrupt would not be cleared. Instead, we handle the exception, and then when we are done we optionally check if preemption should happen right before returning to the interrupted task.

But let's not skip the last part of what schedule() does.

Setting the Program Counter

So we now know a few places where the system can preempt and on ARM32 we see that this mostly happens in the function named do_work_pending() which in turn will call schedule() for us.

The scheduler's schedule() call is supposed to very quickly select a process to run next. Eventually it will call context_switch() in kernel/sched/core.c, which in turn will do essentially two things:

  • Check if the next task has a unique memory management context (next->mm is not NULL) and in that case switch the memory management context to the next task. This means updating the MMU to use a different MMU table. Kernel threads do not have any unique memory management context so for those we can just keep the previous context (the kernel virtual memory is mapped into all processes on ARM32 so we can just go on). If the memory management context does switch, we call switch_mm_irqs_off() which in the ARM32 case is just defined to the ARM32-specific switch_mm() which will call the ARM32-specific check_and_switch_context() (NOTE that this function for any system with MMU is hidden in the arch/arm/include/asm/mmu_context.h header file) which in turn does one of two things:
    • If interrupts are disabled, we will just set mm->context.switch_pending = 1 so that the memory management context switch will happen at a later time when we are running with interrupts enabled, because it will be very costly to switch task memory context on ARM32 if interrupts are disabled on certain VIVT (virtually indexed, virtually tagged) cache types, and this in turn would cause unpredictable IRQ latencies on these systems. This concerns some ARMv6 cores. The reason why interrupts would be disabled in a schedule() call is that it will be holding a runqueue lock, which in turn disables interrupts. Just like the comment in the code says, this will be done later in the arch-specific finish_arch_post_lock_switch() which is implemented right below and gets called right after dropping the runqueue lock.
    • If interrupts are not disabled, we will immediately call cpu_switch_mm(). This is a per-cpu callback which is written in assembly for each CPU as cpu_NNNN_switch_mm() inside arch/arm/mm/proc-NNNN.S. For example, all v7 CPUs have the cpu_v7_switch_mm() in arch/arm/mm/proc-v7.S.
  • Switch context (such as the register states and stack) to the new task by calling switch_to() with the new task and the previous one as parameter. In most cases this latches to an architecture-specific __switch_to(). In the ARM32 case, this routine is written in assembly and can be found in arch/arm/kernel/entry-armv.S.

Now the final details happen in __switch_to(), which is supplied the struct thread_info (i.e. the architecture-specific state) for both the current and the previous task:

  • We store the registers of the current task in the task stack, at the TI_CPU_SAVE index of struct thread_info, which corresponds to the .cpu_context entry in the struct, which is in turn a struct cpu_context_save, which is 12 32-bit values to store r4-r9, sl, fp, sp and pc. This is everything needed to continue as if nothing has happened when we “return” after the schedule() call. I put “return” in quotation marks, because a plethora of other tasks may have run before we actually get back there. You may ask why r0, r1, r2 and r3 are not stored. This will be addressed shortly.
  • Then the TLS (Thread Local Storage) settings for the new task are obtained and we issue switch_tls(). On v6 CPUs this has special implications, but in most cases we end up using switch_tls_software() which sets TLS to 0xffff0ff0 for the task. This is a hard-coded value in virtual memory used by the kernel-provided user helpers, which in turn are a few kernel routines “similar to but different from VDSO” that are utilized by the userspace C library. On ARMv7 CPUs that support the thread ID register (TPIDRURO) this will be used to store the struct thread_info pointer, so it cannot be used for TLS on ARMv7. (More on this later.)
  • We then broadcast THREAD_NOTIFY_SWITCH using kernel notifiers. These are usually written in C but called from the assembly snippet __switch_to() here. A notable use case is that if the task is making use of VFP (the Vector Floating Point unit) then the state of the VFP gets saved here, so that it will be cleanly restored when the task resumes as well.

Then we reach the final step in __switch_to(), which is a bit different depending on whether we use CONFIG_VMAP_STACK or not.

The simple path when we are not using VMAP:ed stacks looks like this:

	set_current r7, r8
	ldmia	r4, {r4 - sl, fp, sp, pc}	@ Load all regs saved previously

Here r7 contains a pointer to the next task's thread_info (which will be somewhere in the kernel virtual memory map), and set_current() will store the pointer to that task in such a way that the CPU can look it up with a few instructions at any point in time. On older non-SMP ARMv4 and ARMv5 CPU:s this will simply be the memory location pointed out by the label __current, but ARMv7 and SMP systems have a dedicated special CP15 TPIDRURO thread ID register to store this in the CPU so that the thread_info can be located very quickly. (The only user of this information is, no surprise, the get_current() assembly snippet, but that is in turn called from a lot of places and contexts.)

The next ldmia instruction does the real trick: it loads registers r4 thru sl (r10), fp (r11), sp(r13) and pc(r15) from the location pointed out by r4, which again is the .cpu_context entry in the struct thread_info, the struct cpu_context_save, which is all the context there is including pc so the next instruction after this will be whatever pc was inside the struct cpu_context_save. We have switched to the new task and preemption is complete.

But wait a minute. r4 and up, you say? What about r0, r1, r2, r3, r12 (ip) and r14 (lr)? Isn't the task we're switching to going to miss those registers?

For r0-r3 the short answer is that when we call schedule() explicitly (which only happens inside the kernel) then r0 thru r3 are scratch registers that are free to be “clobbered” during any function call. So since we call schedule() the caller should be prepared that those registers are clobbered anyway. The same goes for the status register CPSR. It's a function call to inline assembly and not an exception.

And even if we look around the context after a call to schedule(), since we were either (A) starting a brand new task or (B) on our way out of an exception handler for a software or hardware interrupt or (C) explicitly called schedule() when this happened, this just doesn't matter.

Then r12 is a scratch register and we are not calling down the stack using lr at this point either (we just jump to pc!) so these two do not need to be saved or restored. (On the ARM or VMAP exit path you will find ip and lr being used.)

When starting a completely new task all the contents of struct cpu_context_save will be zero, and the return address will be set to ret_from_fork, and then the new task will bootstrap itself in userspace or as a kernel thread anyway.

If we're on the exit path of an exception handler, we call various C functions and r0 thru r3 are used as scratch registers, meaning that their content doesn't matter. At the end of the exception (which we are close to when we call schedule()) all registers and the CPSR will be restored from the pt_regs record on the kernel exception stack before the exception returns anyway, which is another good reason to use exception handlers as preemption points.

This is why r0 thru r3 are missing from struct cpu_context_save and need not be preserved.

When the scheduler later on decides to schedule in the task that was interrupted again, we will return to execution right after the schedule(); call. If we were on our way out of an exception in do_work_pending() we will proceed to return from the exception handler, and to the process it will “feel” like it just returned from a hardware or software interrupt, and execution will go on from that point like nothing happened.

Running init

So how does /sbin/init actually come to execute?

We saw that after start_kernel we get to rest_init which creates the thread with pid = user_mode_thread(kernel_init, NULL, CLONE_FS).

Then kernel_init calls kernel_execve() to execute /sbin/init. It locates an ELF parser to read and page in the file. Then it will eventually issue start_thread() which will set regs->ARM_cpsr = USR_MODE and regs->ARM_pc to the start of the executable.

Then this task's task_struct, including memory context etc, will be selected after a call to schedule().

But every call to schedule() will return to the point right after a schedule() call, and the only place a userspace task is ever preempted to get schedule() called on it is in the exception handlers, such as when a timer interrupt occurs. Well, this is where we “cheat”:

When we initialized the process in arch/arm/kernel/process.c, we set the program counter to ret_from_fork so we are not going back after any schedule() call: we are going back to ret_from_fork! And this is just an exception return path, so this will restore regs->ARM_cpsr to USR_MODE, and “return from an exception” into whatever is in regs->ARM_pc, which is the start of the binary program from the ELF file!

So /sbin/init is executed as a consequence of returning from a fake exception through ret_from_fork. From that point on, only real exceptions, such as getting interrupted by the IRQ, will happen to the process.

This is how ARM32 schedules and executes processes.

We are discussing and working toward adding the language Rust as a second implementation language in the Linux kernel. A year ago Jake Edge made an excellent summary of the discussions so far on Rust for the Linux kernel and we (or rather Miguel and Wedson) have made further progress since then. For the record I think this is overall a good idea and worth a try. I wanted to add some background that was sketched in a mail thread for the kernel summit.

TL;DR: my claim is that Rust is attempting to raise the abstraction in the programming language and ultimately to join computer science and software engineering into one single discipline, an ambition that has been around since these disciplines were created.

Beginning with ALGOL

The first general high-level language was FORTRAN, which is still in use for some numerical analysis tasks around the world. Then came ALGOL, which attracted a wider audience.

The first “real” operating system (using virtual memory etc), the supervisor for the Atlas Machine in 1962, was as far as I can tell implemented in Atlas Autocode, which was a dialect of ALGOL, the lingua franca at the time. Pure ALGOL could not be used because ALGOL 60 had no input/output primitives, so every real-world application of ALGOL, i.e. any application not solely relying on compiled-in constants, required custom I/O additions.

Algol specifications: copies of the first specifications of ALGOL 60, belonging at one time to Carl-Erik Fröberg at Lund University.

ALGOL inspired CPL that inspired BCPL that inspired the B programming language that inspired the C programming language, which we use for the Linux kernel.

Between 1958 and 1968 ALGOL was the nexus in a wide attempt to join computer languages with formal logic. In this timespan we saw the ALGOL 58, ALGOL 60 and ALGOL 68 revisions come out. The outcome was that it established computer science as a discipline and people could start building their academic careers on that topic. One notable outcome was the BNF form for describing syntax in languages. This time was in many ways formative for computer science: the first three volumes of Donald Knuth's The Art of Computer Programming were published in close proximity to these events.

To realize that ALGOL was popular and widespread at the time that Unix was born, and that C was in no way universally accepted, it suffices to read a piece of the original Bourne Shell source code, for example:

setlist(arg,xp)
	REG ARGPTR	arg;
	INT		xp;
{
	WHILE arg
	DO REG STRING	s=mactrim(arg->argval);
	   setname(s, xp);
	   arg=arg->argnxt;
	   IF flags&execpr
	   THEN prs(s);
		IF arg THEN blank(); ELSE newline(); FI
	   FI
	OD
}

This doesn't look much like C as we know it; it looks much more like ALGOL 68. The ALGOL 68 definition added constructions such as IF/FI, DO/OD etc, which were not present in ALGOL 60. The reason is that Stephen Bourne was an influential contributor to ALGOL 68 and created a set of macros so that the C preprocessor would turn his custom dialect of ALGOL into C, for which I think someone on Reddit suggested nominating bash for the obfuscated C contest.

This is just one of the instances where we can see that the C programming language was not universally loved. The Bourne Shell scripting language that we all love and use is also quite close to ALGOL 68, so the descendants of this language are used more than we may think.

Around 1970 Niklaus Wirth had been working to improve ALGOL with his proposal called ALGOL W. Tired of the slowness of the language committee process he went his own way and created the programming language Pascal, which was a success in its own right. In his very interesting IEEE article named A Brief History of Software Engineering, Professor Wirth gives his perspective on some of the events around that time: first he writes about the very influential NATO conference on software engineering in 1968 in Garmisch, Germany, which served to define software engineering as a distinct discipline. To counter the so-called software crisis – the problems presented by emerging large complex systems – the suggestion was to raise the abstraction in new languages.

To raise the abstraction means to use more mathematical, machine-independent constructs in the language. First consider the difference between low-level and high-level languages: a simple operation such as x = x + 1 is not high level, just a fancy assembly instruction; if we compile it we can readily observe the resulting code as some kind of ADD instruction in the object code. However a[i] = x + 1 is where high-level languages begin: indexing into an array requires knowledge of target machine specifics – the base address, element size, memory layout etc – and the language now hides all of this from the programmer. This makes the statement more abstract and thus raises the abstraction of the language. The assumption is that several further higher levels of abstraction exist. We will look into some of these languages in the following sections.
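A worked illustration of that abstraction step, in C: the second statement is roughly what the compiler lowers the first one into, exposing the base address and element size that the high-level form hides.

void store(int *a, int i, int x)
{
	/* high-level form: the language hides the addressing */
	a[i] = x + 1;

	/* roughly what it lowers to: explicit base + index * element-size */
	*(int *)((char *)a + (long)i * sizeof(int)) = x + 1;
}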

The Garmisch conference is famous in Unix circles because Douglas McIlroy was present and presented his idea of componentized software as a remedy against rising complexity, an idea that was later realized in the form of Unix's pipes and filters mechanism. D-Bus and similar component interoperation mechanisms are contemporary examples of such software componentry — another way to counter complexity and make software less fragile, but not the focus in this article.

Wirth makes one very specific and very important observation about the Garmisch conference:

Ultimately, analytic verification and correctness proofs were supposed to replace testing.

This means exactly what it says: with formally verified programming languages, all the features and constructs that are formally proven need not be tested for. Software engineering is known for advocating test-driven development (TDD) to this day, and the ambition was to make large chunks of TDD completely unnecessary. Software testing has its own chapter in the mentioned report from the Garmisch NATO conference where the authors A.I. Llewelyn and R.F. Wickens conclude:

There are, fundamentally, two different methods of determining whether a product meets its specification. One can analyse the product in great detail and from this determine if it is in accordance with its specification, or one can measure its performance experimentally and see if the results are in accord with the specification; the number and sophistication of the experiments can be varied to provide the degree of confidence required of the results.

The first part of this paragraph i.e. “analyze in great detail” is what Wirth calls analytic verification and is today called formal verification. The latter part of this paragraph is what we call test-driven development, TDD. Also: the former is a matter of computer science, while the latter is a matter of software engineering. So here is a fork in the road.

Wirth also claims the discussions in Garmisch had a distinct influence on Pascal. This can be easily spotted in Pascal strings, which was one of his principal improvements over ALGOL: Pascal strings are arrays of char, but unlike C char, a Pascal char is not the same as a byte; instead it is defined as belonging to an “ordered character set”, which can very well be ISO8859-1 or Unicode, less, more or equal to 255 characters in size. Strings stored in memory begin with a positive integer array length which defines how long the string is, but this is none of the programmer's business: it shall be altered by the language runtime and not by any custom code. Indexing out of bounds is therefore not possible and can be trivially prohibited during compilation and at runtime. This raises the abstraction of strings: they are set entities, they have clear boundaries, they need special support code to handle the length field in memory. Further, Pascal also has set types, such as:

var
    JanuaryDays : set of 1..31;

Perhaps Pascal's application to real-world problems didn't work out as expected, as it has since also defined PChar as a NULL-terminated pointer to a sequence of characters, akin to C strings. However it should be noted that Pascal pointers are persistently typed and cannot be converted: casting is not possible in Pascal. A Pascal pointer to an integer is always a pointer to an integer.

From Wirth's perspective, C “presented a great leap backward” and he claims “it revealed that the community at large had hardly grasped the true meaning of the term 'high-level language' which became an ill-understood buzzword”. He attributes the problem to Unix which he says “acted like a Trojan horse for C”. He further details the actual technical problems with C:

C offers abstractions which it does not in fact support: Arrays remain without index checking, data types without consistency check, pointers are merely addresses where addition and subtraction are applicable. One might have classified C as being somewhere between misleading and even dangerous.

His point about C lacking index checking is especially important: it can be called into question whether C is really a high-level language, since it does not fully abstract away the machine specifics of handling an array. Language theorists occasionally refer to C as a “big macro assembler”: the only thing really abstracted away is the raw instruction set.

Wirth however also goes on to state the appealing aspects of the C programming language:

people at large, particularly in academia, found it intriguing and “better than assembly code” (...) its rules could easily be broken, exactly what many programmers cherished. It was possible to manage access to all of a computer’s idiosyncracies, to items that a high-level language would properly hide. C provided freedom, where high-level languages were considered as straight-jackets enforcing unwanted discipline. It was an invitation to use tricks which had been necessary to achieve efficiency in the early days of computers.

We can see why an efficiency-oriented operating system kernel such as Linux will tend toward C.

It's not like these tricks stopped after the early days of computing. Just the other day I wrote a patch for Linux with two similar code paths, which could be eliminated by cast:ing a (const void *) into a (void *), which I then quipped about in the commit message of the revised patch. The reason for violating formal rules in this case is a choice between two evils, and choosing the lesser evil: in a choice between formal correctness and code reuse I chose code reuse. And C enables that kind of choice. The languages presented later in this article absolutely do not allow that kind of choice, and C casts are seen as nothing less than an abomination.

The language family including C and Pascal is referred to as imperative programming languages. The defining characteristic is that the programmer “thinks like a computer”, or imagines themselves as the program counter to be exact. “First I do this, next I do this, then I do this” – a sequence of statements executed in order, keeping the computer state (such as registers, memory locations and stacks) in the back of your head.

The immediate appeal to operating system programmers should be evident: this closely models what an OS developer needs to keep in mind, such as registers, stacks, cache frames, MMU tables, state transitions in hardware and so on. It is possible to see the whole family of imperative languages as domain specific languages for the domain of writing operating systems, so it would be for operating system developers what OpenGL is for computer graphics software developers.

Lambda Calculus for Defining Languages

In 1965 one of the early adopters and contributors to ALGOL (alongside Peter Naur, Tony Hoare and Niklaus Wirth), Peter Landin, published two articles in the Communications of the ACM titled Correspondence between ALGOL 60 and Church's Lambda-notation part I and part II. In the first article he begins with a good portion of dry humour:

Anyone familiar with both Church's λ-calculi and ALGOL 60 will have noticed a superficial resemblance between the way variables tie up with the λ's in a nest of λ-expressions, and the way that identifiers tie up with the headings in a nest of procedures and blocks.

He is of course aware that hardly anyone besides himself had been in a position to realize this: the overlap between people familiar with Alonzo Church's λ-calculus and people familiar with ALGOL 60 was pretty much down to one person on the planet. What is surprising is that it was even one person.

Alonzo Church was a scholar of mathematical logic and computability, the supervisor of Alan Turing's doctoral thesis and active in the same field as Kurt Gödel (the two quoted each other in their respective articles). The lambda calculus ties into the theory of types created by Bertrand Russell and the logical-mathematical programme, another universe of history we will not discuss here.

What λ-calculus (lambda calculus) does for a programming language definition is analogous to what regular expressions do for a language's syntax, but for its semantics. While regular expressions can express how to parse a body of text in a language with a regular grammar, expressions in λ-calculus can go on from the abstract syntax tree and express what an addition is, what a subtraction is, or what a bitwise OR is. This exercise is seldom done in e.g. compiler construction courses, but defining semantics is an inherent part of a programming language definition.

Perhaps the most remembered part of Landin's papers is his humorous term syntactic sugar, which denotes things added to a language to make the life of the programmer easier, but which have no semantic content that cannot be expressed by the basic features of the language. The basic mathematical features of the language, on the other hand, are best expressed with λ-calculus.

A notable invention in Landin's first article about defining ALGOL in terms of λ-calculus is the keywords let and where, chosen to correspond to λ-calculus' Applicative Expressions. These keywords do not exist in ALGOL: they are part of a language to talk about a language, or in more complicated terms: a meta-language. So here we see the first steps toward a new language derived from λ-calculus. Landin does not give this language a name in this article, but just refers to it as “AE”. The AE executes in a theoretical machine called SECD, which is another trick of the trade, like Alan Turing's “Turing machine”: rather close to a mathematician's statement “let's assume we have...” The complete framework for defining ALGOL in λ-calculus is called AE/SECD.

Functional Programming

Functional programming languages, then, implement lambda calculus. The central idea, after some years of experience with defining languages such as ALGOL in terms of lambda calculus, is to just make the language resemble lambda calculus expressions to begin with, so that the verification of the semantics becomes simple and obvious.

In 1966 Peter Landin followed up his articles using λ-calculus to describe ALGOL with the article The Next 700 Programming Languages. Here he invents functional programming in the form of a language called ISWIM (If You See What I Mean), again, as you can see, with a good dry humour. The language is λ-calculus with “syntactic sugar” on top, so a broad family of languages can be created using the framework as a basis. Landin's article was popular, and people did invent languages. Maybe not 700 of them. Yet.

In section 10 of his article, named Eliminating explicit sequencing, Landin starts speculating and talks about a game that can be played with ALGOL: by removing any goto statements and labels, the program gets a less sequential nature, i.e. the program counter is just advancing to the next line or iterating a loop. He quips:

What other such features are there? This question is considered because, not surprisingly, it turns out that an emphasis on describing things in terms of other things leads to the same kind of requirements as an emphasis against explicit sequencing.

He then goes on to show how to transform an ALGOL program into a purely functional ISWIM program and concludes:

The special claim of ISWIM is that it grafts procedural notions onto a purely functional base without disturbing many of the desirable properties. (...) This paper can do no more than begin the task of explaining their practical significance.

This reads as a call to action: we need to create functional programming languages akin to ISWIM, and we need to get rid of the J operator (the program control flow operator). Landin never did that himself.

The Meta Language ML

A few years later, in 1974, computer scientist Robin Milner, inspired by ISWIM and as a response to Landin's challenge, created the language ML, short for Meta Language. This is one of the next 700 languages, and it clearly builds on Landin's ideas about a language for defining languages, a grammar for defining grammar: a meta language with a meta grammar.

He implemented the language on the DEC10 computer with the help of Malcolm Newey, Lockwood Morris, Mike Gordon and Chris Wadsworth. The language was later ported to the VAX architecture.

The language was based on ISWIM and dropped the so-called J operator (program point operator). It is domain-specific, and intended for authoring a tool for theorem proving called LCF. Standard ML has been fully semantically specified and formally verified. This language became widely popular, both in academia and industry.

Removing the J operator made ML a declarative language, i.e. it does not specify the order of execution of statements, putting it in the same class of languages as Prolog or, for that matter, Makefiles: there is no control flow in a Makefile, just a number of conditions that need to be evaluated to arrive at a complete target.

ML still has one imperative language feature: assignment. Around this time, some scholars thought both the J operator and assignment were unnecessary and went on to define purely functional languages such as Haskell. We will not consider them here, they are outside the scope of this article. ML and everything else we discuss can be labelled as impure: a pejorative term invented by people who like purely functional languages. These people dislike not only the sequencing nature of imperative languages but also the assignment (such as happens with the keyword let) and prefer to think about evaluating relationships between abstract entities.

ML can be grasped intuitively. For example this expression in ML evaluates to the integer 64:

let
    val m : int = 4
    val n : int = m*m
in
    m*n
end

Here we see some still prominent AE/SECD and ISWIM features, such as the keyword let for binding variables, or rather, associating names with elements such as integers and functions (similar to := assignment in some languages). Then we see an implementation section after the keyword in. We can define functions in ML, like this to compute the square root of five times x:

val rootfivex : real -> real =
    fn x : real => Math.sqrt (5.0 * x)

Notice the absence of constructs such as BEGIN/END or terminating semicolons. The notation real -> real clearly states that the function takes a real number as input and produces a real number as output. The name real reflects some kind of mathematical ambition. The language cannot handle the mathematical set of real numbers: the ML real is what other languages call a float.

ML has more syntactic sugar, so the following is equivalent using the keyword fun (fun-notation):

fun rootfivex (x:real):real = Math.sqrt (5.0 * x)

The syntax should be possible to grasp intuitively. Another feature of ML and other functional languages is that they easily operate on tuples, i.e. an ordered sequence of values, and tuples can also be returned from functions. For example you can calculate the distance between the origin and a point (x, y) in an x/y-oriented plane like this:

fun dist (x:real, y:real):real = Math.sqrt (x*x + y*y)

This function can then be called elsewhere like this:

val coor : real * real = (3.0, 4.0)  (* some example coordinates *)
val d = dist coor

The type real of d will be inferred from the fact that the dist() function returns a real.

ML gets much more complex than this. One of the universally admired upsides of the language is that ML programs, like most programs written in functional languages, can be proven correct in the computational sense. This can be done within certain limits: for example input/output operations need to specify exactly which values are input, or undefined behaviour will occur.

CAML and OCaml

In 1987 Ascánder Suárez at the French Institute for Research in Computer Science and Automation (INRIA) reimplemented a compiler and runtime system for ML in LISP and called the result CAML for Categorical Abstract Machine Language, a pun on the fact that it ran on a virtual machine (the Categorical Abstract Machine) and on the heritage from ML proper. The abstract machine used was the LLM3 abstract LISP machine, which in turn ran on another computer. It was not fast.

CAML was reimplemented in C in 1990-91 by Xavier Leroy, creating Caml Light, which was faster, because it was no longer a virtual machine running on top of another virtual machine. Caml Light was more like Java and used a bytecode interpreter for its virtual machine.

In 1995, Caml Special Light introduced a native compiler, so the bytecode produced from the Caml compiler could be compiled to object code and executed with no virtual machine overhead, using a native runtime environment. Didier Rémy, Jérôme Vouillon and Jacques Garrigue continued the development of Caml.

Objective Caml arrived in 1996 and added some object oriented features to Caml. In 2011 this extended compiler and language, a derivative (dialect) of ML, was renamed OCaml. In essence the compiler and the language have a symbiotic relationship. There is no second implementation of OCaml.

From the 1990s and forward, what is now the OCaml language and implementation has gained traction. It is a very popular functional programming language, or rather, popular as far as functional programming goes. It has optimized implementations for most architectures. The compiler itself is now written mostly in OCaml, but the runtime in C is still around, to hook into each operating system where the program will eventually run. The language and compiler have been used for a variety of applications. Every major Linux distribution carries packages with the OCaml compiler and libraries. There is even a GTK+ 3 OCaml library binding, so OCaml GUI programs can be created.

OCaml simplifies binding labels to numbers etc. Here is bubblesort implemented in OCaml:

(* Bubblesort in OCaml, Linus Walleij 2022 *)
let sort v =
  let newv = Array.make (Array.length v) 0 in
  for i = 1 to (Array.length v) - 1 do
    if v.(i - 1) > v.(i) then begin
      newv.(i - 1) <- v.(i);
      newv.(i) <- v.(i - 1);
      (* Copy back so we are working on the same thing *)
      v.(i - 1) <- newv.(i - 1);
      v.(i) <- newv.(i);
    end else begin
      newv.(i - 1) <- v.(i - 1);
      newv.(i) <- v.(i);
    end
  done;
  newv

let rec ordered v =
  if Array.length v = 0 then true
  else if Array.length v = 1 then true
  (* ... or if the rest of the array is ordered *)
  else if v.(0) < v.(1) && ordered (Array.sub v 1 (Array.length v - 1)) then true
  else false;;

let plist v =
  print_string "V = ";
  for i = 0 to (Array.length v) - 1 do begin
    print_int v.(i);
    if i < (Array.length v - 1) then print_string ",";
    end
  done;
  print_endline "";;

let rec sortme v =
  if ordered v then v
  else sortme (sort v);;

let v = [| 14 ; 4 ; 55 ; 100 ; 11 ; 29 ; 76 ; 19 ; 6 ; 82 ; 99 ; 0 ; 57 ; 36 ; 61 ; 30 |];;
plist v;;
plist (sortme v);;

My experience with working with this example is that OCaml puts up a “bit of resistance” to changing the contents of things like arrays by indexing. It “dislikes” any imperative constructs and kind of nudges you in the direction of purely logical constructs such as the ordered function above. This is just my personal take.

OCaml is still a dialect of ML. The file ending used on all files is .ml as well. OCaml – like Python's pip or Perl's CPAN – has its own package system and library repository called opam. The prime application is still the OCaml Ergo Library, a library for automatic theorem proving. If your first and foremost use of computers is theorem proving, ML and OCaml have continued to deliver since 1974. The more recent and widely popular Coq theorem prover is also written in OCaml.

Rust then

Rust was initially developed in 2006 as a hobby project by Graydon Hoare, who was at the time working at Mozilla. OCaml and ML are mentioned as the biggest influences on the language, apart from C/C++. A typical sign of this influence is that the first compiler for Rust was written in OCaml. A notable contributor to this codebase, apart from Hoare, is Brendan Eich, one of the founders of the Mozilla project and the inventor of JavaScript. While Brendan did not contribute much code he was at the time CTO of Mozilla, and this shows that when Mozilla started supporting the project in 2009 Rust was certainly well anchored in the organization, and Eich's early contributions to the language should be noted. (It may be commonplace that people in the CTO position at middle sized companies make commits to complex code bases, but I am not aware of other examples.)

Despite the OCaml codebase the first documentation of the language talks more about other functional or declarative languages such as NIL, Hermes, Erlang, Sather, Newsqueak, Limbo and Napier. These origins with extensive quotes from e.g. Joe Armstrong (the inventor of Erlang) have been toned down in contemporary Rust documentation. It is however very clear that Graydon has a deep interest in historical computer languages and is convinced that they have something to teach us, and the expressed ambition is to draw on these languages to pick the best parts. In his own words:

I've always been a language pluralist — picture my relationship towards languages like a kid enjoying a wide variety of building blocks, musical instruments or plastic dinosaurs — and I don't think evangelism or single-language puritanism is especially helpful.

What is unique about Rust is that it fuses “impure” functional programming with imperative programming, bringing several concepts from ML and OCaml over into the language.

Another characteristic is that Rust compiled to target machine code from day one, rather than using any kind of virtual machine as did Peter Landin's ISWIM or the ML and OCaml languages (and as do, say, Java or Python). Graydon probably did this intuitively, but a post he made in 2019 underscores the point: virtual machines, even as an intermediate step, are bad language engineering and just generally a bad idea.

In 2013 Graydon stepped down as main lead for Rust for personal reasons which he has detailed in a posting on Reddit.

Rust has had the same symbiotic relationship between language and a single compiler implementation as OCaml, but this is changing, as there is now a second, GCC-based implementation in the works.

Here is bubblesort implemented in Rust:

/* Bubblesort in Rust, Linus Walleij 2022 */
fn sort(array : &mut [i32]) {
   let mut x : i32;
   if array.len() == 1 {
      return;
   }
   for i in 1..array.len() {
      if array[i - 1] > array[i] {
         x = array[i - 1];
         array[i - 1] = array[i];
         array[i] = x;
      }
   }
}

fn is_ordered(array : &[i32]) -> bool {
   if array.len() <= 1 {
     return true;
   }
   for i in 1..array.len() {
     if array[i - 1] > array[i] {
       return false;
     }
   }
   return true;
}

fn parray(array : &[i32]) {
   let mut x : i32;
   print!("V = ");
   for i in 0..array.len() {
       x = array[i];
       print!("{x}");
       if i != (array.len() - 1) {
          print!(",");
       }
   }
   println!("");
}

fn main() {
   let mut array: [i32; 16] = [14, 4, 55, 100, 11, 29, 76, 19, 6, 82, 99, 0, 57, 36, 61, 30];
   parray(&array);
   while !is_ordered(&array) {
     sort(&mut array);
   }
   parray(&array);
}

Rust lends itself more easily to imperative programming than OCaml: the keyword mut becomes quite similar to C's const correctness tagging in this example. Since is_ordered and parray aren't altering the contents of the array, these functions do not need to be marked with mut. You see some familiar virtues from Pascal: arrays “know” their length, and we use a method to obtain it: array.len().

The stated ambition is improved memory safety, data-race safety (concurrency) and type safety. The article Safe Systems Programming in Rust certainly presents the ambition in a straightforward manner. Graydon also underscores the focus on memory and concurrency safety in a 2016 blog post.

But make no mistake. The current underlying ambition is definitely no different from the ambition of the ALGOL committee between 1958 and 1968: to raise the abstraction of the language by joining computer programming with formal logic. This comes with the arrival of strong academic support for the language.

A typical indication of this ambition is the well-funded RustBelt project, involving a large number of academic researchers, all familiar with formal logic, and resulting in such artefacts as Ralf Jung's PhD thesis Understanding and Evolving the Rust Programming Language. Here, the formal logic of RustBelt and the Coq proof assistant are put to use, and the thesis concludes (from the abstract):

Together, these proofs establish that, as long as the only unsafe code in a well-typed λRust program is confined to libraries that satisfy their verification conditions, the program is safe to execute.

What is meant by “safe to execute” is that no use-after-free, dangling pointers, stale references, NULL pointer exceptions etc can ever occur in safe Rust code, because it is proven by formal logic: QED. It does not stop you from e.g. dividing by zero however, that problem is out-of-scope for the exercise.

To me personally the most astonishing fact about Jung's thesis is that it manages to repeatedly cite and reference the computer scientist Tony Hoare without quoting the inventor of the Rust language, Graydon Hoare, a single time. In a way it confirms Graydon's own statement that Rust “contains nothing new” from a language point of view.

The C programming language cannot be subject to the same scrutiny as Rust, simply because of all the (ab)use it allows, which was mentioned by Wirth in his historical perspective: if a type can be changed by a cast and array indexing is merely unchecked pointer arithmetic, there is nothing much to prove. What has been interesting for scholars to investigate is a well-defined subset of C, such as the eBPF subset, which also partly explains the strong interest in eBPF: as with Rust, the build environment and language runtime have been defined with much stricter constraints and can thus be subject to formal verification.

The ambition of Rust is, as I perceive it, and whether the people driving it even know it or not, to finish what the ALGOL committee as primus motor started in 1958, and what the Garmisch NATO conference concluded was necessary in 1968: to develop a language for systems programming that relies on formal logic proof, and to fulfil what ALGOL never could, what Pascal never could, and what the whole maybe-not-700 functional programming languages never could: a language that joins the disciplines of computer science and software engineering into ONE discipline, where the scholars of each can solve problems together.

That is the ambition of Rust as an implementation language for operating systems, such as Linux: provide a language backed by current top-of-the-line computer science research, for immediate application to software engineering developing the top-of-the-line operating system.

What it offers Linux is raised abstraction to counter the problems of complexity identified in the 1968 Garmisch NATO conference and now bleeding obvious given the recurring security incidents, and thereby would bring the engineering project Linux closer to computer science.

Other approaches to increased Linux (memory and concurrency) safety are possible: notably increased testing, which is the engineering go-to panacea. And automated testing of Linux has indeed increased a lot in recent years. Raising the abstraction of the implementation language and proving it formally comes with the ambition to make testing less important.

[Mathieu Poirer and Jesper Jansson have helped out in reviewing this blog post, for which I am forever grateful: remaining errors, bugs and biased opinions are my own.]

This is a retrospect of my work on the KASan Kernel Address Sanitizer for the ARM32 platform. The name is a pun on the diving decompression stop that is something you perform after going down below the surface to avoid decompression sickness.

Where It All Began

The AddressSanitizer (ASan) is a really clever invention by Google, hats off. It is one of those development tools that, like git, just take on the world in a short time. It was invented by some smart Russians, especially Андрей Коновалов (Andrey Konovalov) and Дмитрий Вьюков (Dmitry Vyukov). It appears to be not just funded by Google but also part of a PhD thesis work.

The idea with ASan is to help ensure memory safety by intercepting all memory accesses through compiler instrumentation, and consequently providing “ASan splats” (runtime problem detections) while stressing the code. Code instrumented with ASan gets significantly slower than normal and uses up a bunch of memory for “shadowing” (I will explain this) making it a pure development tool: it is not intended to be enabled on production systems.

The way that ASan instruments code is by linking every load and store into symbols like these:

__asan_load1(unsigned long addr);
__asan_store1(unsigned long addr);
__asan_load2(unsigned long addr);
__asan_store2(unsigned long addr);
__asan_load4(unsigned long addr);
__asan_store4(unsigned long addr);
__asan_load8(unsigned long addr);
__asan_store8(unsigned long addr);
__asan_load16(unsigned long addr);
__asan_store16(unsigned long addr);

As you can guess, these calls load or store 1, 2, 4, 8 or 16 bytes of memory in a chunk at the virtual address addr, and they reflect how the compiler thinks the compiled code (usually C) accesses memory. ASan intercepts all reads and writes by placing itself between the executing program and any memory management. The above symbols can be implemented by any runtime environment. The address reflects what the assumed runtime environment thinks about the (usually virtual) memory where the program will execute.
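Just to illustrate the idea (this is my own sketch, not literally what the compiler emits), a 4-byte store gets a shadow check inserted in front of the real memory access:

/* Illustration only: roughly what an ASan-instrumented 4-byte store does.
 * __asan_store4() is provided by the ASan runtime library. */
void __asan_store4(unsigned long addr);

void example(int *p)
{
        /* what you wrote:                                   *p = 42; */
        /* what the instrumented build effectively executes: */
        __asan_store4((unsigned long)p);  /* validate the address against shadow memory */
        *p = 42;                          /* then perform the real store */
}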

You will instrument your code with ASan, toss heavy test workloads on the code, and see if something breaks. If something breaks, you go and investigate the breakage to find the problem. The problem will often be one or another instance of buffer overflow, out-of-bounds array or string access, or the use of a dangling pointer such as use-after-free. These problems are a side effect of using the C programming language.

When resolving the mentioned load/store symbols, ASan instrumentation is based on shadow memory, and this is in turn based on the idea that one byte “shadows” 8 bytes of memory (one bit per shadowed byte), so you allocate 1/8 the amount of memory that your instrumented program will use and shadow that to some other memory using an offset calculation like this:

shadow = (address >> 3) + offset

The shadow memory is located at offset, and if our instrumented memory is N bytes then we need to allocate N/8 = N >> 3 bytes to be used as shadow memory. Notice that I say instrumented memory, not code: ASan shadows not only the actual compiled code but mainly and most importantly any allocations and referenced pointers the code maintains. Also the DATA (constants) and BSS (global variables) parts of the executable image are shadowed. To achieve this the userspace program links to a special malloc() implementation that overrides the default and manages all of this behind the scenes. One aspect of it is that malloc() will of course return chunks of memory naturally aligned to 8, so that the shadow memory will be on an even byte boundary.

ASan shadow memory: The ASan shadow memory shadows the memory you're interested in inspecting for memory safety.

The helper library will allocate shadow memory as part of the runtime, and use it to shadow the code, data and runtime allocations of the program. Some will be allocated up front as the program is started, some will be allocated to shadow allocations at runtime, such as dynamically allocated buffers or anything else you malloc().

The error detection was originally based on the observation that a shadow byte, with each bit representing an out-of-bounds access error, can have a “no error” state (0x00) and 8 error states, in total 9 states. Later on a more elaborate scheme was adopted: values 1..7 indicate how many of the shadowed bytes are valid for access (if you malloc() just 5 bytes then it will be 5), and then there are magic bytes for different conditions.

When a piece of memory is legally allocated and accessed the corresponding bits are zeroed. Uninitialized memory is “poisoned”, i.e. set to a completely illegal value != 0. Further, SLAB allocations are padded with “red zones”, poisoning memory in front of and behind every legal allocation. When accessing a byte in memory it is easy to verify that the access is legal: is the shadow byte == 0? That means all 8 bytes can be freely accessed and we can quickly proceed. Else we need a closer look. Values 1 thru 7 mean that only the first 1 thru 7 bytes are valid for access (partly addressable), so we check that, and any other value means uh oh.

  • 0xFA and 0xFB means we have hit a heap left/right redzone so an out-of-bounds access has happened
  • 0xFD means access to a free:ed heap region, so use-after-free
  • etc

Decoding the hex values gives a clear insight into what access violation we should be looking for.
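As a rough illustration of how the check works, a one-byte load check with the 8-to-1 scheme could look something like the sketch below. This is not the actual ASan runtime: SHADOW_OFFSET and report_error() are made-up names for the example.

#include <stdint.h>

#define SHADOW_OFFSET 0x20000000UL   /* hypothetical shadow base address */

void report_error(unsigned long addr, int8_t shadow);  /* made-up error reporter */

static int8_t *shadow_of(unsigned long addr)
{
        /* shadow = (address >> 3) + offset */
        return (int8_t *)((addr >> 3) + SHADOW_OFFSET);
}

void __asan_load1(unsigned long addr)
{
        int8_t s = *shadow_of(addr);

        if (s == 0)
                return;                         /* all 8 shadowed bytes are addressable */
        if (s > 0 && (int8_t)(addr & 7) < s)
                return;                         /* partially addressable, and we are inside */
        report_error(addr, s);                  /* redzone, free:ed memory, poison etc */
}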

To be fair the one bit per byte (8-to-1) mapping is not compulsory. This should be pretty obvious. Other schemes such as mapping even 32 bytes to one byte have been discussed for memory-constrained systems.

All memory access calls (such as any instance of dereferencing a pointer) and all functions in the library such as all string functions are patched to check for these conditions. It's easy when you have compiler instrumentation. We check it all. It is not very fast but it's bearable for testing.

Researchers in one camp argue that we should all be writing software (including operating systems) in the programming language Rust in order to avoid the problems ASan is trying to solve altogether. This is a good point, but rewriting large existing software such as the Linux kernel in Rust is not seen as realistic. Thus we paper over the problem instead of using the silver bullet. Hybrid approaches to using Rust in kernel development are being discussed but so far not much has materialized.

KASan Arrives

The AddressSanitizer (ASan) was written with userspace in mind, and the userspace project is very much alive.

As the mechanism used by ASan was quite simple, and the compilers were already patched to handle shadow memory, the Kernel Address Sanitizer (KASan) was a natural step. At this point (2020) the original authors seem to spend a lot of time with the kernel, so the kernel hardening project has likely outgrown the userspace counterpart.

The magic values assigned to shadow memory used by KASan are different:

  • 0xFA means the memory has been free:ed so accessing it means use-after-free.
  • 0xFB is a free:ed managed resource (devm_* accessors) in the Linux kernel.
  • 0xFC and 0xFE mean we accessed a kmalloc() redzone, indicating an out-of-bounds access.

This is why these values often occur in KASan splats. The full list of specials (not very many) can be found in mm/kasan/kasan.h.

The crucial piece to create KASan was a compiler flag to indicate where to shadow the kernel memory: when the kernel Image is linked, addresses are resolved to absolute virtual memory locations, and naturally all of these, plus the area where kernel allocates memory (SLABs) at runtime need to be shadowed. As can be seen in the kernel Makefile.kasan include, this boils down to passing the flags -fsanitize=kernel-address and -asan-mapping-offset=$(KASAN_SHADOW_OFFSET) when linking the kernel.
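The address translation itself is the same simple shift-and-add as in userspace ASan. Simplified from include/linux/kasan.h (where KASAN_SHADOW_SCALE_SHIFT is 3 for generic KASan), it looks roughly like this:

/* Translate a kernel address to its shadow byte: shadow = (addr >> 3) + offset */
static inline void *kasan_mem_to_shadow(const void *addr)
{
        return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
                + KASAN_SHADOW_OFFSET;
}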

The kernel already had some related tools, notably kmemcheck, which could detect some cases of use-after-free and references to uninitialized memory. It was based on a slower mechanism, and KASan has since effectively superseded it: kmemcheck has been removed.

KASan was added to the kernel in a commit dated February 2015 along with support for the x86_64 architecture.

To exercise the kernel to find interesting bugs, the inventors were often using syzkaller, a tool similar to the older Trinity: it bombs the kernel with fuzzy system calls to try to provoke undefined and undesired behaviours yielding KASan splats and revealing vulnerabilities.

Since the kernel is the kernel we need to explicitly assign memory for shadowing. Since we are the kernel we need to do some manoeuvres that userspace cannot do or does not need to do:

  • During early initialization of the kernel we point all shadow memory to a single page of just zeroes making all accesses seem fine until we have proper memory management set up. Userspace programs do not need this phase as “someone else” (the C standard library) handles all memory set up for them.
  • Memory areas which are just big chunks of code and data can all point to a single physical page with poison. In the virtual memory it might look like kilobytes and megabytes of poison bytes but it all points to the same physical page of 4KB.
  • We selectively de-instrument code as well: code like KASan itself, the memory manager per se, or the code that patches the kernel for ftrace, or the code that unwinds the stack pointer for a kernel splat clearly cannot be instrumented with KASan: it is part of the design of these facilities to poke around at random locations in memory, it's not a bug. Since KASan was added all of these sites in the generic kernel code have been de-instrumented, more or less.

Once these generic kernel instrumentations were in place, other architectures could be amended with KASan support, with ARM64 following x86 soon in the autumn of 2015.

Some per-architecture code, usually found in arch/xxxx/mm/kasan_init.c, is needed for KASan. What this code does is to initialize the shadow memory during early initialization of the virtual memory to point to a “zero page” and later on to populate all the shadow memory with poisoned shadow pages.

The shadow memory is special and needs to be populated accessing the very lowest layer of the virtual memory abstraction: we manipulate the page tables from the top to bottom: pgd, p4d, pud, pmd, pte to make sure that the $(KASAN_SHADOW_OFFSET) points to memory that has valid page table entries.

We need to use the kernel memblock early memory management to set up memory to hold the page tables themselves in some cases. The memblock memory manager also provide us with a list of all the kernel RAM: we loop over it using for_each_mem_range() and populate the shadow memory for each range. As mentioned we first point all shadows to a zero page, and later on to proper KASan shadow memory, and then KASan kicks into action.
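In heavily condensed pseudo-C the population loop looks something like the sketch below; kasan_map_shadow() is a made-up stand-in for the per-architecture code that actually builds the shadow page tables, not a real kernel function:

/* Condensed sketch: populate shadow memory for every block of kernel RAM. */
phys_addr_t pa_start, pa_end;
u64 i;

for_each_mem_range(i, &pa_start, &pa_end) {
        void *start = __va(pa_start);
        void *end = __va(pa_end);

        if (start >= end)
                break;

        /* Walk pgd/p4d/pud/pmd/pte and install real shadow pages for this range */
        kasan_map_shadow(kasan_mem_to_shadow(start),
                         kasan_mem_to_shadow(end));
}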

A special case happens when moving from using the “zero page” KASan memory to proper shadow memory: we would risk running kernel threads into partially initialized shadow memory and pull the ground out under ourselves. Not good. Therefore the global page table for the entire kernel (the one that has all shadow memory pointing to a zero page) is copied and used during this phase. It is then replaced, finally, with the proper KASan-instrumented page table with pointers to the shadow memory in a single atomic operation.

Further all optimized memory manipulation functions from the standard library need to be patched as these often have assembly-optimized versions in the architecture. This concerns memcpy(), memmove() and memset() especially. While the quick optimized versions are nice for production systems, we replace these with open-coded variants that do proper memory accesses in C and therefore will be intercepted by KASan.
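As an example, the C replacement for memcpy() has roughly the following shape (the exact names of the check helpers have varied between kernel versions, so treat this as a sketch):

/* Sketch of the KASan memcpy() replacement: check both the source and the
 * destination range against shadow memory, then fall through to the plain
 * unchecked __memcpy() implementation. */
void *memcpy(void *dest, const void *src, size_t len)
{
        if (!kasan_check_range((unsigned long)src, len, false, _RET_IP_) ||
            !kasan_check_range((unsigned long)dest, len, true, _RET_IP_))
                return NULL;

        return __memcpy(dest, src, len);
}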

All architectures follow this pattern, more or less. ARM64 supports hardware tags, which essentially means that the architecture supports hardware acceleration of KASan. As this is pretty fast, there is a discussion about using KASan even on production systems to capture problems seen very seldom.

KASan on ARM32

Then there was the attempt to provide KASan for ARM32.

The very first posting of KASan in 2014 was actually targeting x86 and ARM32 and was already working-kind-of-prototype-ish on ARM32. This did not proceed. The main reason was that when using modules, these are loaded into a designated virtual memory area rather than the kernel “vmalloc area” which is the main area used for memory allocations and what most architectures use. So when trying to use loadable modules the code would crash as this RAM was not shadowed.

The developers tried to create the option to move modules into the vmalloc area and enable this by default when using KASan to work around this limitation.

The special module area is however used for special reasons. Since it was placed in close proximity to the main kernel TEXT segment, the code could be accessed using short jumps rather than long jumps: no need to load the whole 32-bit program counter anew whenever a piece of code loaded from a module was accessed. It made code in modules as quick as normal compiled-in kernel code +/– cache effects. This provided serious performance benefits.

As a result KASan support for ARM was dropped from the initial KASan proposal and the scope was limited to x86, then followed by ARM64. “We will look into this later”.

In the spring of 2015 I started looking into KASan and was testing the patches on ARM64 for Linaro. In June I tried to get KASan working on ARM32. Andrey Ryabinin pointed out that he actually had KASan running on ARM32. After some iterations we got it working on some ARM32 platforms and I was successfully stressing it a bit using the Trinity syscall fuzzer. This version solved the problem of shadowing the loadable modules by simply shadowing all that memory as well.

The central problem with running KASan on a 32-bit platform as opposed to a 64-bit platform was that the simplest approach used up 1/8 of the whole address space which was not a problem for 64-bit platforms that have ample virtual address space available. (Notice that the amount of physical memory doesn't really matter, as KASan will use the trick to just point large chunks of virtual memory to a single physical poison page.) On 32-bit platforms this approach ate our limited address space for lunch.

We were setting aside several statically assigned allocations in the virtual address space, so we needed to make sure that we only shadow the addresses actually used by the kernel. We would not need to shadow the addresses used by userspace, and the shadow memory virtual range requirement could thus be shrunk from 512 MB to 130 MB for the traditional 3/1 GB kernel/userspace virtual address split used on ARM32. (To understand this split, read my article How the ARM32 Kernel Starts which tries to tell the story.)

Sleeping Beauty

This more fine-grained approach to assigning shadow memory would create some devil-in-the-details bugs that will not come out if you shadow the whole virtual address space, as the 64-bit platforms do.

A random access to some memory that should not be poked (and thus lacking shadow memory) will lead to a crash. While QEMU and some hardware was certainly working, there were some real hardware platforms that just would not boot. Things were getting tedious.

KASan for ARM32 development ground to a halt because we were unable to hash out the bugs. The initial patches from Andrey started trading hands and these out-of-tree patches were used by some vendors to test code on some hardware.

Personally, I had other assignments and could not take over and develop KASan at this point. I'm not saying that I was necessarily a good candidate at the time either, I was just testing and tinkering with KASan because ARM32 vendors had a generic interest in the technology.

As a result, KASan for ARM32 was pending out-of-tree for almost 5 years. In 2017 Abbot Liu was working on it and fixed up the support for LPAE (large physical address extension) and in 2019 Florian Fainelli picked up where Abbot left off.

Some things were getting fixed along the road, but what was needed was some focused attention and these people had other things on their plate as well.

Finally Fixing the Bugs

In April 2020 I decided to pick up the patches and have a go at it. I sloppily named my first iteration “v2” while it was something like v7.

I quickly got support from two key people: Florian Fainelli and Ard Biesheuvel. Florian had some systems with the same odd behaviour of just not working as my pet Qualcomm APQ8060 DragonBoard that I had been using all along for testing. Ard was using the patches for developing and debugging things like EFI and KASLR.

During successive iterations we managed to find and patch the remaining bugs, one by one:

  • A hard-coded bitmask assuming thread size order to be 1 (4096 bytes) on ARMv4 and ARMv5 silicon made the kernel crash when entering userspace. KASan increases the thread order so that there is space for redzones before and after allocations, so it needs more space. After reading assembly one line at a time I finally figured this out and patched it.
  • The code was switching MMU table by simply altering the TTBR0 register. This worked on some machines, especially ARMv7 silicon, but the right way to do it is to use the per-CPU macro cpu_switch_mm(), which looks intuitive but is an ARM32-ism, which is why the original KASan authors didn't know about it. This macro accounts for tiny differences between different ARM cores, some even custom to certain vendors.
  • Something fishy was going on with the attached device tree. It turns out, after much debugging, that the attached devicetree could end up in memory outside of the kernel 1:1 physical-to-virtual mapping. The page table entries that would have translated the physical memory area where the device tree was stored were wiped clean, yielding a page fault. The problem was not caused by KASan per se: it was a result of the kernel getting over a certain size, and all the instrumentation added to the kernel makes it bigger to the point that it revealed the bug. We were en route to fix a bug related to big compressed kernel images. I developed debugging code specifically to find this bug and then made a patch making sure not to wipe that part of the mapping. (This post gives a detailed explanation of the problem.) Ard quickly came up with a better fix: let's move the device tree to a determined place in the fixed mappings and handle it as if it were a ROM.

These were the major roadblocks. Fixing these bugs created new bugs which we also fixed. Ard and Florian mopped up the fallout.

In the middle of development, five level page tables were introduced and Mike Rapoport made some generic improvements and cleanup to the core memory management code, so I had to account for these changes as well, effectively rewriting the KASan ARM32 shadow memory initialization code. At one point I also broke the LPAE support and had to repair it.

Eventually the v16 patch set was finalized in October 2020 and submitted to Russell King's patch tracker and he merged it for Linux v5.11-rc1.

Retrospect

After the fact, three things came out nicely in the design of KASan for ARM32:

  • We do not shadow or intercept highmem allocations, which is nice because we want to get rid of highmem altogether.
  • We do not shadow the userspace memory, which is nice because we want to move userspace to its own address space altogether.
  • Personally I finally got a detailed idea of how the ARM32 kernel decompresses and starts, and the abstract concepts of highmem, lowmem, and the rest of those wild animals. I have written three different articles on this blog as a result, with ideas for even more of them. By explaining how things work to others I realize what I can't explain and as a result I go and research it.

Andrey and Dmitry have since worked on not just ASan and KASan but also on what was initially called the KernelThreadSanitizer (KTSAN) but which was eventually merged under the name KernelConcurrencySanitizer (KCSAN). The idea is again to use shadow memory, but now for concurrency debugging at runtime. I do not know more than this.

This is the continuation of Setting Up the ARM32 Architecture, part 1.

As a recap, we have achieved the following:

  • We are executing in virtual memory
  • We figured out how to get the execution context of the CPU from the init task with task ID 0 using the sp register and nothing else
  • We have initialized the CPU
  • We have identified what type of machine (ARM system) we are running on
  • We have enumerated and registered the memory blocks available for the kernel to use with the primitive memblock memory manager
  • We processed the early parameters earlyparams
  • We have provided early fixmaps and early ioremaps
  • We have identified lowmem and highmem bounds

Memory Blocks – Part 2

We now return to the list of memory available for the Linux kernel.

arm_memblock_init() in arch/arm/mm/init.c is called, resulting in a number of memory reservations of physical memory the Linux memory allocator can NOT use, given as physical start address and size. So we saw earlier that memblock stores a list of available blocks of memory, and in addition to that it can set aside reserved memory.

For example the first thing that happens is this:

memblock_reserve(__pa(KERNEL_START), KERNEL_END - KERNEL_START);

It makes perfect sense that the kernel cannot use the physical memory occupied by the kernel – the code of the kernel we are executing. KERNEL_END is set to _end which we know from previous investigation to cover not only the TEXT segment but also BSS of the kernel, i.e. all the memory the kernel is using.

Next we call arm_mm_memblock_reserve() which will reserve the memory used by the kernel's initial page table, also known as swapper_pg_dir. Would be unfortunate if we overwrite that with YouTube videos.

Finally we reserve the memory used by the device tree (at the moment I write this there is a patch pending to fix a bug here) and any other memory reservations defined in the device tree.

A typical example of a memory reservation in the device tree is if you have a special video ram (VRAM). The following would be a typical example:

        reserved-memory {
                #address-cells = <1>;
                #size-cells = <1>;
                ranges;
 
                /* Chipselect 3 is physically at 0x4c000000 */
                vram: vram@4c000000 {
                            /* 8 MB of designated video RAM */
                            compatible = "shared-dma-pool";
                            reg = <0x4c000000 0x00800000>;
                            no-map;
                };
        };

This specific memory block (taken from the Versatile Express reference design) will be outside of the physical RAM memory and not disturb any other allocations, but it uses the very same facility in the device tree: anything with compatible “shared-dma-pool” will be set aside for special use.

When chunks of common (non-special-purpose) RAM are set aside, these chunks are referred to as “carveouts”. A typical use of such carveouts is media buffers for video and audio.

Next, before we have started to allocate any memory on the platform we set aside memory to be used for contiguous memory allocation (CMA) if this memory manager is in use. The CMA memory pool can be used for other things than contiguous memory, but we cannot have unmovable allocations in there, so we better flag this memory as “no unmoveable allocations in here” as soon as possible.

As CMA is so good at handling contiguous memory it will be used to handle the random carveouts and special memory areas indicated by “shared-dma-pool” as well. So be sure to select the Kconfig symbols CMA and DMA_CMA if you use any of these.

Next we call memblock_dump_all() which will show us nothing, normally. However if we pass the command line parameter memblock=debug to the kernel we will get a view of what things look like, first how much memory is available in total and how much is reserved in total for the things we outlined above, and then a detailed list of the memory available and set aside, similar to this:

MEMBLOCK configuration:
 memory size = 0x08000000 reserved size = 0x007c1256
 memory.cnt  = 0x1
 memory[0x0]    [0x00000000-0x07ffffff], 0x08000000 bytes flags: 0x0
 reserved.cnt  = 0x3
 reserved[0x0]    [0x00004000-0x00007fff], 0x00004000 bytes flags: 0x0
 reserved[0x1]    [0x00008400-0x007c388b], 0x007bb48c bytes flags: 0x0
 reserved[0x2]    [0x00c38140-0x00c39f09], 0x00001dca bytes flags: 0x0
 reserved[0x3]    [0xef000000-0xefffffff], 0x01000000 bytes flags: 0x0

This is taken from the ARM Versatile reference platform and shows that we have one big chunk of memory that is 0x08000000 (128 MB) in size. In this memory we have chiseled out a number of reservations: the first is the page table (swapper_pg_dir), the second is the kernel TEXT and BSS, the third is the DTB, and the last reservation is the CMA pool of 16 MB; the first three together amount to 0x007C1256 bytes (a bit more than 7 MB).

After this point, we know what memory in the system we can and cannot use: early memory management using memblock is available. This is a crude memory management mechanism that will let you do some rough memory reservations before the “real” memory manager comes up. We can (and will) call memblock_alloc() and memblock_phys_alloc() to allocate virtual and physical memory respectively.

If you grep the kernel for memblock_alloc() calls you will see that this is not a common practice: a few calls here and there. The main purpose of this mechanism is for the real memory manager to bootstrap itself: we need to allocate memory to be used for the final all-bells-and-whistles memory management: data structures to hold the bitmaps of allocated memory, page tables and so on.
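Just to illustrate the API (this is a made-up call site, not actual kernel code), an early boot allocation could look like this:

#include <linux/memblock.h>

/* Hypothetical early init code: grab a page-aligned scratch buffer before
 * the real page allocator exists. memblock_alloc() hands back a virtual
 * address; memblock_phys_alloc() would hand back a physical one. */
static void __init example_early_init(void)
{
        void *buf = memblock_alloc(PAGE_SIZE, PAGE_SIZE);

        if (!buf)
                panic("%s: could not allocate early scratch page\n", __func__);

        /* ... use buf, it survives when memblock hands over to the real allocator */
}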

At the end of this bootstrapping mem_init() will be called, and that in turn calls memblock_free_all() and effectively shuts down the memblock mechanism. But we are not there yet.

After this excursion among the memblocks we are back in setup_arch() and we call adjust_lowmem_bounds() a second time, as the reservations and carveouts in the device tree may have removed memory from underneath the kernel. Well we better take that into account. Redo the whole thing.

Setting up the paging

We now initialize the page table. early_ioremap_reset() is called, turning off the early ioremap facility (it cannot be used while the paging proper is being set up) and then we enable proper paging with the call to paging_init(). This call is really interesting and important: this is where we set up the system to perform the lower levels of proper memory management.

This is a good time to recap the inner workings of the ARM32 page tables before you move along. The relationship between kernel concepts such as PGD, PMD and PTE and the corresponding ARM level-1, level-2 and (on LPAE) level-3 page table descriptors needs to be familiar.

The first thing we do inside paging_init() is to call prepare_page_table(). What it does is to go into the PGD and clear all the PMDs that are not in use by the kernel. As we know, the PGD has 4096 32-bit/4-byte entries on the classic ARM32 MMU, grouped into 2048 PMDs, and 512 64-bit/8-byte entries on LPAE, corresponding to one PMD each. Each of them corresponds to a 2 MB chunk of memory. The action is going to hit the swapper_pg_dir at 0xC0004000 on classic MMUs or a combination of PGD pointers at 0xC0003000 and PMD pointers at 0xC0004000 on LPAE MMUs.

Our 1 MB section mappings currently covering the code we are running and all other memory we use are first level page table entries, covering 1 MB of virtual memory on the classic MMU and 2 MB of memory on LPAE systems. However the size we advance with is defined as PMD_SIZE, which will always be 2 MB, so the loop clearing the PMDs looks like so:

for (addr = 0; addr < PAGE_OFFSET; addr += PMD_SIZE)
        pmd_clear(pmd_off_k(addr));

(Here I simplified the code by removing the execute-in-place (XIP) case.)

We advance one PMD_SIZE (2 MB) chunk at a time and clear all PMDs that are not used, here we clear the PMDs covering userspace up to the point in memory where the linear kernel mapping starts at PAGE_OFFSET.

We have stopped using sections of 1 MB or any other ARM32-specific level-1 descriptor code directly. We are using PMDs of size 2 MB and the generic kernel abstractions. We talk to the kernel about PMDs and not “level-1 descriptors”. We are one level up in the abstractions, removed from the mundane internals of the ARM32 MMU.

pmd_clear() will in practice set the entry to point at physical address zero and all MMU attributes will also be set to zero so the memory becomes non-accessible for read/write and any other operations, and then flush the translation lookaside buffer and the L2 cache (if present) so that we are sure that all this virtual memory now points to unaccessible zero – a known place. If we try to obtain an instruction or data from one of these addresses we will generate a prefetch or data abort of some type.

Among the PMDs that get wiped out in this process are the two section-mapping PMDs that were 1-to-1-mapping __turn_mmu_on earlier in the boot, so we sweep the floor clean of some bootstrapping.

Next we clear all PMDs from the end of the first block of lowmem up until VMALLOC_START:

end = memblock.memory.regions[0].base + memblock.memory.regions[0].size;
if (end >= arm_lowmem_limit)
    end = arm_lowmem_limit;

for (addr = __phys_to_virt(end); addr < VMALLOC_START; addr += PMD_SIZE)
    pmd_clear(pmd_off_k(addr));

What happens here? We take the first memory block registered with the memblock mechanism handling physical memory, then we cap that off at arm_lowmem_limit, which we saw earlier is in most cases set to the end of the highest memory block – mostly the same thing, unless we have several memory blocks or someone passed in command line parameters reserving a lot of vmalloc space. This procedure thus assumes that the kernel is loaded into the first available (physical) memory block; we then start at the end of that memory block (clearly above the kernel image) and clear all PMDs until we reach VMALLOC_START.

VMALLOC_START is the end of the virtual 1-to-1 mapping of the physical memory + 8 MB. If we have 512 MB of physical memory at physical address 0x00000000 then that ends at 0x1FFFFFFF and VMALLOC_START will be at 0x20000000 + PAGE_OFFSET + 0x00800000 = 0xE0800000. If we have 1 GB of physical memory VMALLOC_START will run into the highmem limit at 0xF0000000 and the end of the 1-to-1 physical mapping will naturally be there, so VMALLOC_START will be at 0xF0800000 for anything with more than 768 MB of memory. The 8 MB between the end of lowmem and VMALLOC_START is a “buffer” to catch stray references.

For example if a system has 128 MB of RAM starting at 0x00000000 in a single memory block and the kernel zImage is 8MB in size and gets loaded into memory at 0x10008000-0x107FFFFF with the PGD (page global directory) at 0x10004000, we will start at address 0x20000000 translated to virtual memory 0xE0000000 and clear all PMDs for the virtual memory up until we reach 0xF0800000.

The physical memory in the memblock that sits above and below the start of the kernel, in this example the physical addresses 0x00000000-0x0FFFFFFF and 0x10800000-0x1FFFFFFF will be made available for allocation as we initialize the memory manager: we have made a memblock_reserve() over the kernel and that is all that will actually persist – memory in lowmem (above the kernel image) and highmem (above the arm_lowmem_limit, if we have any) will be made available to the memory allocator as well.

This code only initializes the PMDs, i.e. the entries in the first level of the page table, to zero memory with zero access to protect us from hurting ourselves.

Clearing the PMDs: This image illustrates what happens when we initialize the PMDs (the picture is not to scale). The black blocks are the PMDs we clear, making them unavailable for any references for now. We have one block of physical memory from 0x00000000-0x1FFFFFFF (512 MB) and we clear out from the end of that block in virtual memory until we reach VMALLOC_START.

Mapping lowmem

Next we call map_lowmem() which is pretty self-describing. Notice that we are talking about lowmem here: the linear kernelspace memory map that is accessed by adding or subtracting an offset from/to the physical or virtual memory address. We are not dealing with the userspace view of the memory at all, only the kernel view of the memory.

We round two physical address pointers to the start and end PMDs (we round on SECTION_SIZE, i.e. 1 MB bounds) of the executable portions of the kernel. The lower part of the kernel (typically starting at address 0xC0008000) is the executable TEXT segment, so the start of this portion of the virtual memory is assigned to the pointer kernel_x_start, and kernel_x_end is set to the symbol __init_end, the end of the executable part of the kernel, rounded up to the next section.

The BSS segment and other non-executable segments are linked after the executable part of the kernel, so they sit in the virtual memory above the executable part of the kernel.

Then we loop over all the memory blocks we have in the system and call create_mapping() where we first check the following conditions:

  • If the end of the memblock is below the start of the kernel the memory is mapped as readable/writeable/executable MT_MEMORY_RWX. This is a whole memory block sitting entirely below the kernel. We want to be able to execute code there. I honestly do not know why, but I can think of things such as small firmware areas that need to be executable, registered somewhere in a very low memblock.
  • If the start of the memblock is above kernel_x_end the memory is mapped as readable/writeable. No execution shall happen in the linear map above the executable kernel memory.

Next we reach the situation where the memblock is covering both the executable and the non-executable part of the kernel image. There is even a comment saying “this better cover the entire kernel” here: the whole kernel has to be inside one memory block under these circumstances or the logic will not work.

This is the most common scenario under all circumstances, such as my example with a single 128 MB physical memory block. Most ARM32 systems are like this.

We then employ the following pretty intuitive mapping:

  • If the memblock starts below the executable part of the kernel kernel_start_x we chop off that part and map it as readable/writeable with MT_MEMORY_RW.
  • Then we map kernel_x_start to kernel_x_end as readable/writeable/executable with MT_MEMORY_RWX.
  • Then we map the last part of the kernel above kernel_x_end to the end of the memblock as readable/writeable with MT_MEMORY_RW.
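In code, the three-way chop looks approximately like this. This is a heavily condensed sketch of the loop in map_lowmem(), not the verbatim kernel source: map_region() is my own shorthand for filling in a struct map_desc and calling create_mapping(), as shown further below.

static void __init map_region(phys_addr_t start, phys_addr_t end,
                              unsigned int type)
{
    struct map_desc map;

    map.pfn = __phys_to_pfn(start);
    map.virtual = __phys_to_virt(start);
    map.length = end - start;
    map.type = type;
    create_mapping(&map);
}

/* Inside the loop over the memory blocks, roughly: */
if (end < kernel_x_start) {
    map_region(start, end, MT_MEMORY_RWX);
} else if (start >= kernel_x_end) {
    map_region(start, end, MT_MEMORY_RW);
} else {
    /* "This better cover the entire kernel" */
    if (start < kernel_x_start)
        map_region(start, kernel_x_start, MT_MEMORY_RW);
    map_region(kernel_x_start, kernel_x_end, MT_MEMORY_RWX);
    if (kernel_x_end < end)
        map_region(kernel_x_end, end, MT_MEMORY_RW);
}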

Remapping the kernel image This illustrates how the memory above the kernel is readable/writeable, the part of the kernel image with the text segment is readable/writeable/executable while the .data and .bss part is just readable/writeable, and the rest of lowmem is also just readable/writeable. In this example we have 512 MB (0x20000000) of memory and lowmem is mapped 0x00000000 -> 0xC0000000, 0x10000000 -> 0xD0000000 and 0x20000000 -> 0xE0000000. The granularity may require individual 4K pages so this will use elaborate page mapping.

The executable kernel memory is also writable because the PGD is in the first PMD sized chunk here – we sure need to be able to write to that – and several kernel mechanisms actually rely on being able to runtime-patch the executable kernel, even if we have already finalized the crucial physical-to-virtual patching. One example would be facilities such as ftrace.

Everything that fits in the linearly mapped lowmem above and below the kernel will be readable and writable by the kernel, which leads to some optimization opportunities – especially during context switches – but also to some problems, especially regarding the highmem. But let’s not discuss that right now.

map_lowmem() will employ create_mapping() which in turn will find out the PGD entries (in practice, on a classical MMU in this case that will be one entry of 32 bits/4 bytes in the level-1 page table) for the address we pass in (which will be PMD-aligned at 2 MB), and then call alloc_init_p4d() on that pgd providing the start and end address. We know very well that ARM32 does not have any five- or four-level page tables but this is where the generic nature of the memory manager comes into play: let’s pretend we do. alloc_init_p4d() will traverse the page table ladder with alloc_init_pud() (we don’t use that either) and then alloc_init_pmd() which we actually use, and then at the end if we need it alloc_init_pte().

What do I mean by “if we need it”? Why would we not allocate PTEs, real page-to-page tables for every 0x1000 chunk of physical-to-virtual map?

It’s because alloc_init_pmd() will first see if the stuff we map is big enough to use one or more section mappings – i.e. just 32 bits/4 bytes in the PGD at 0xC0004000-somewhere to map the memory. In our case that will mostly be the case! The kernel will if possible remain section mapped, and we call __map_init_section() which will create and write the exact same value into the PGD as we had put in there before – well, maybe it was all executable up onto this point, so at least some small bits will change. But we try our best to use the big and fast section maps if we can without unnecessarily creating a myriad of PTE-level objects that turn into level-2 descriptors that need to be painstakingly traversed.

However on the very last page of the executable part of the kernel and very first page of the data segments it is very likely that we cannot use a section mapping, and for the first time the generic kernel fine-granular paging using 4 KB pages will kick in to provide this mapping. We learnt before that the linker file has been carefully tailored to make sure that each segment of code starts on an even page, so we know that we can work at page granularity here.

Down in alloc_init_pmd(), if the start and end of the memory we try to map cannot be covered entirely by sections, we will call alloc_init_pte() for that part of the address range. This will allocate and initialize PTEs, which is the next level down in the page table hierarchy.

phys_addr_t start, end;
struct map_desc map;

map.pfn = __phys_to_pfn(start);
map.virtual = __phys_to_virt(start);
map.length = end - start;
map.type = MT_MEMORY_RWX;

create_mapping(&map);

A typical case of creating a mapping with create_mapping(). We set the page frame number (pfn) to the page we want to start the remapping at, then the virtual address we want the remapping to appear at, and the size and type of the remapping, then we call create_mapping().

So Let’s Map More Stuff

We now know what actually happens when we call create_mapping() and that call is used a lot in the early architecture set-up. We know that map_lowmem() will chop up the border between the executable and non-executable part of the kernel using the just described paging layout of our MMU using either the classic complicated mode or the new shiny LPAE mode.

The early section mappings we put over the kernel during boot and start of execution from virtual memory are now gone. We have overwritten them all with the new, proper mappings, including the very memory we executed the remappings in. Nothing happens, but the world is now under generic kernel memory management control. The state of the pages is, by the way, maintained in the page tables themselves.

If we create new PTEs these will be allocated into some new available memory page as well using alloc(). At this point that means that the memblock allocator will be used, since the proper memory management with kmalloc() is not yet operational. The right type of properties (such as read/write/execute or other MT_* characteristics) will be used however so we could say that we have a “halfway” memory manager: the hardware is definitely doing the right thing now, and the page tables are manipulated the right way using generic code.

map_lowmem() is done. We call memblock_set_current_limit(arm_lowmem_limit) because from now on we only want memblock allocations to end up in lowmem proper, we have just mapped it in properly and all. In most cases this is the same as before, but in some corner cases we cannot put this restriction until now.

We remap the contiguous memory allocation area if CMA is in use, again using all the bells and whistles of the kernel’s generic memory manager. The MMU properties for DMA memory are set to MT_MEMORY_DMA_READY which is very close to normal readable/writeable memory.

Next we shut down the early fixmap allocations. The remapped memory that was using fixmaps earlier gets mapped like the kernel itself and everything else: using create_mapping() instead of the earlier hacks poking directly into the page tables. They are all one page each and use MT_DEVICE, i.e. uncached memory-mapped I/O, such as the early console UART registers.

Next we set up some special mappings in devicemaps_init() which, in addition to the early mappings we just reapplied, adds some new, not-so-early mappings. They are mostly not devices either, so the function name is completely misleading. The function is called like this because it at one point calls the machine-specific ->map_io() callback for the identified machine descriptor, which on pure device tree systems isn’t even used.

Inside devicemaps_init() we clear some more PMDs. Now we start at VMALLOC_START, the place where we previously stopped the PMD clearing, advancing 2 MB at a time up to FIXADDR_TOP which is the location of the fixmaps. Those have been redefined using the generic kernel paging engine, but they are still there, so we must not overwrite them.
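The loop mirrors the one we saw earlier and looks roughly like this (from devicemaps_init() in arch/arm/mm/mmu.c, quoted from memory):

for (addr = VMALLOC_START; addr < (FIXADDR_TOP & PMD_MASK); addr += PMD_SIZE)
    pmd_clear(pmd_off_k(addr));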

Next follow some special mappings, then we get to something really interesting: the vectors.

Setting Up and Mapping the Exception Vector Table

The vector table is a page of memory where the ARM32 CPUs will jump when an exception occurs. The exception vectors are:

  • Reset exception vector: the address the PC is set to if the RESET line to the CPU is asserted.
  • Undefined instruction exception vector: the address we jump to if an undefined instruction is executed. This is used for example to emulate floating point instructions if your CPU does not have them.
  • Software Interrupt (SWI) exception vector: also called a “trap”, is a way to programmatically interrupt the program flow and execute a special handler. In Linux this is used by userspace programs to execute system calls: to call on the kernel to respond to needs of a userspace process.
  • Prefetch abort exception vector: this happens when the CPU tries to fetch an instruction from an illegal address, such as those addresses where we have cleared the level-1 page table descriptor (PMD) so there is no valid physical memory underneath. This is also called a page fault and is used for implementing demand paging, a central concept in Unix-like operating systems.
  • Data abort exception vector: same thing but the CPU is trying to fetch data rather than an instruction. This is also used for demand paging.
  • Address exception vector: this is described in the source as “this should never happen” and some manuals describe it as “unused” or “reserved”. This is actually an architectural leftover from the ARM26 (ARMv1, v2, v3) no longer supported by Linux. In older silicon the virtual address space was 26 bits rather than the 32 bits in later architectures, and this exception would be triggered when an address outside the 26 bit range was accessed. (This information came from LWN reader farnz as a reply to this article.) On full 32-bit silicon it should indeed never happen.
  • Interrupt Request IRQ exception vector: the most natural type of exception in response to the IRQ line into the CPU. In later ARM32 CPUs this usually comes from the standard GIC (Generic Interrupt Controller) but in earlier silicon such as ARMv4 or ARMv5 some custom interrupt controller is usually connected to this line. The origin of the line was a discrete signal routed out on the CPU package, but in modern SoCs these are usually synthesized into the same silicon so the line is not visible to the outside, albeit the concept is the same.
  • Fast Interrupt Request FIQ exception vector: this is mostly unused in Linux, and on ARMv7 silicon often used to trap into the secure world interrupt handlers and thus not even accessible by the normal world where Linux is running.

These eight vectors in this order are usually all we ever need on any ARM32 CPU. They are one 32-bit word each, so the PC is for example set at address 0xFFFF0000 when reset occurs and whatever is there is executed.

The way the vector table/page works is that the CPU will store the program counter and processor state in internal registers and put the program counter at the corresponding vector address. The vector table can be put in two locations in memory: either at address 0x00000000 or address 0xFFFF0000. The location is selected with a single bit in the CP15 control register 1. Linux supports putting the vectors in either place, with a preference for 0xFFFF0000. Using address 0x00000000 is typically most helpful if the MMU is turned off and you have a 1-to-1 mapping to a physical memory that starts at address 0x00000000. If the MMU is turned on, which it is for us, the address used is the virtual one (even the vector table goes through MMU translation), and it is customary to use the vectors high up in memory, at address 0xFFFF0000.

As we noted much earlier, each exception context has its own copy of the sp register and thus is assigned an exception-specific stack. Any other registers need to be spooled out and back in by code before returning from the exception.

Exception vectors on the ARM CPUs The ARM32 exception vector table is a 4KB page where the first 8 32-bit words are used as vectors. The address exception should not happen and in some manuals is described as “unused”. The program counter is simply set to this location when any of these exceptions occur. The remainder of the page is “poisoned” with the word 0xE7FDDEF1.

ARM Linux uses two consecutive pages of memory for exception handling: the first page is the vectors, the second page is called stubs. The vectors will typically be placed at 0xFFFF0000 and the stubs at the next page at 0xFFFF1000. If you use the low vectors these will instead be at 0x00000000 and 0x00001000 respectively. The actual physical pages backing these locations are simply obtained from the rough memblock allocator using memblock_alloc(PAGE_SIZE * 2).
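The way these two pages get wired up toward the end of devicemaps_init() looks approximately like this, simplified and quoted from memory (in the source the allocation goes through a small early_alloc() helper on top of memblock, and the memory types used are MT_HIGH_VECTORS and MT_LOW_VECTORS):

struct map_desc map;
void *vectors;

vectors = early_alloc(PAGE_SIZE * 2);           /* two physical pages from memblock */
early_trap_init(vectors);                       /* poison them, copy in vectors and stubs */

map.pfn = __phys_to_pfn(virt_to_phys(vectors));
map.virtual = 0xffff0000;
map.length = PAGE_SIZE;
map.type = MT_HIGH_VECTORS;
create_mapping(&map);

map.pfn += 1;                                   /* the stubs page right after */
map.virtual = 0xffff0000 + PAGE_SIZE;           /* length stays one page */
map.type = MT_LOW_VECTORS;
create_mapping(&map);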

The stubs page requires a bit of explanation: each vector is just one 32-bit word, so there is no room in it for both an instruction and an arbitrary 32-bit target address – we simply cannot jump off to any desired memory location from it. Instead we do a relative jump into the next page, and either handle the whole exception there (if it’s a small thing) or dispatch by jumping to some other kernel code. The whole vector and stubs code is inside the files arch/arm/kernel/entry-armv.S and arch/arm/kernel/traps.c. (The “armv” portion of the filename is misleading, this is used for pretty much all ARM32 machines. “Entry” means exception entry point.)

The vector and stub pages are set up in the function early_trap_init() by first filling the page with the 32-bit word 0xE7FDDEF1 which is an undefined instruction on all ARM32 CPUs. This process is called “poisoning”, and makes sure the CPU locks up if it ever would put the program counter here. Poisoning is done to make sure wild running program counters stop running around, and as a security vulnerability countermeasure: overflow attacks and other program counter manipulations often try to lead the program counter astray. Next we copy the vectors and stubs to their respective pages.

The code in the vectors and stubs has been carefully tailored to be position-independent so we can just copy it and execute it wherever we want to. The code will execute fine at address 0xFFFF0000-0xFFFF1FFF or 0x00000000-0x00001FFF alike.

You can inspect the actual vector table between symbols __vector_start and __vector_end at the end of entry-armv.S: there are 8 32-bit vectors named vector_rst, vector_und … etc.

Clearing out PMDs setting vectors We cleared some more PMDs between VMALLOC_START and the fixmaps, so now the black blocks are bigger in the virtual memory space. At 0xFFFF0000 we install vectors and at 0xFFFF1000 we install stubs.

We will not delve into the details of ARM32 exception handling right now: it will suffice to know that this is where we set it up, and henceforth we can deal with them in the sense that ARM32 will be able to define exception handlers after this point. And that is pretty useful. We have not yet defined any generic kernel exception or interrupt interfaces.

The vectors are flushed to memory and we are ready to roll: exceptions can now be handled!

At the end of devicemaps_init() we call early_abt_enable() which enables us to handle some critical abort exceptions during the remaining start-up sequence of the kernel. The most typical case would be a secondary CPU stuck in an abort exception when brought online: we need to cope with that and recover, or it will bring the whole system down when we enable it.

The only other notable thing happening in devicemaps_init() is a call to the machine descriptor-specific .map_io() or, if that is undefined, to debug_ll_io_init(). This used to be used to set up fixed memory mappings of some device registers for the machine, since at this point the kernel could do that properly using create_mapping(). Nowadays, using device trees, this callback will only be used to remap a debug UART for DEBUG_LL (all other device memory is remapped on-demand), which is why the new function debug_ll_io_init(), which isn’t even using the machine descriptor, is preferred.

Make no mistake: DEBUG_LL is already assuming a certain virtual address for the UART I/O-port up until this point and it better remain there. We mapped that much earlier in head.S using a big fat section mapping of 1MB physical-to-virtual memory.

What happens in debug_ll_io_init() is that the same memory window is remapped properly with create_mapping() using a fine-granular map of a single page of memory using the right kernel abstractions. We obtain the virtual address for the UART using the per-serialport assembly macro addruart from the assembly file for the corresponding UART in arch/arm/include/debug/*.

This will overwrite the level-1 section mapping descriptor used for debug prints up until this point with a proper level-1 to level-2 pointer using PMDs and PTEs.

Initializing the Real Memory Manager

Back in paging_init() we call kmap_init() which initializes the mappings used for highmem, and then tcm_init() which maps some very tiny on-chip RAMs if we have them. The TCM (tightly coupled memory) consists of small SRAMs that are as fast as cache, which some vendors synthesize on their SoCs.

Finally we set top_pmd to point at address 0xFFFF0000 (the vector space) and we allocate a page called the empty_zero_page which will be a page filled with zeroes. This is sometimes very helpful for the kernel when referencing a “very empty page”.

We call bootmem_init() which brings extended memblock page handling online: it allows resizing of memblock allocations, finds the lowest and highest page frame numbers pfns, performs an early memory test (if compiled in) and initializes sparse memory handling in the generic virtual memory manager.

We then chunk the physical memory into different memory zones with the final call to free_area_init(), providing the maximum page frame numbers for the different memory zones: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM are used on ARM32. The zones are assumed to be consecutive in physical memory, so only the maximum page frame number for each zone is given.

  • ZONE_DMA is for especially low physical memory that can be accessed by DMA by some devices. There are machines with limitations on which addresses some devices can access when performing DMA bus mastering, so these need special restrictions on memory allocation.
  • ZONE_NORMAL is what we refer to as lowmem on ARM32: the memory that the kernel or userspace can use for anything.
  • ZONE_HIGHMEM is used with the ARM32 definition of highmem, which we have discussed in detail above: memory physically above the 1-to-1-mapped lowmem.
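A condensed sketch of how ARM32 fills in these limits, modeled on zone_sizes_init() in arch/arm/mm/init.c (quoted from memory, so parameter names and details may differ between kernel versions):

static void __init zone_sizes_init(unsigned long min_pfn, unsigned long max_low,
                                   unsigned long max_high)
{
    unsigned long max_zone_pfn[MAX_NR_ZONES] = { 0 };

#ifdef CONFIG_ZONE_DMA
    max_zone_pfn[ZONE_DMA] = min(arm_dma_pfn_limit, max_low);
#endif
    max_zone_pfn[ZONE_NORMAL] = max_low;
#ifdef CONFIG_HIGHMEM
    max_zone_pfn[ZONE_HIGHMEM] = max_high;
#endif
    free_area_init(max_zone_pfn);
}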

After returning from free_area_init() the generic kernel virtual memory pager is finally initialized. The memory management is not yet online: we cannot use kmalloc() and friends, but we will get there. We still have to use memblock to allocate memory.

We call request_standard_resources(), which is a call to register to the kernel what purpose specific memory areas have. Here we loop over the memblocks and request them as System RAM if nothing else applies. The resource request facility is hierarchical (resources can be requested inside resources) so the kernel memory gets requested inside the memory block where it resides. This resource allocation provides the basic output from the file /proc/iomem such as the location of the kernel in memory. This facility is bolted on top of the actual memory mapping and just works as an optional protection mechanism.

Finalizing Architecture Setup

We are getting close to the end of ARM32:s setup_arch() call.

If the machine descriptor has a restart hook we assign the global function pointer arm_pm_restart to it. Nominally, drivers for restarting modern platforms should not use this: they should provide a restart handler in drivers/power/reset/*, registering it using register_restart_handler(), but we have a bit of legacy code to handle restarts, and that will utilize this callback.

Next we unflatten the device tree, if the machine uses this, which all sufficiently modern ARM32 machines should. The device tree provided from boot is compact, binary and read only, so we need to process it so that boot code and device drivers can traverse the device tree easily. An elaborate data structure is parsed out from the device tree blob and allocated into free pages, again using the crude memblock allocator.

So far we only used very ad hoc device tree inspection to find memory areas and memory reservations.

Now we can inspect the device tree in a more civilized manner to find out some very basic things about the platform. The first thing we will actually do is to read the CPU topology information out of the device tree and build a list of available CPUs on the system. But that will happen later on during boot.

Finally we check if the machine descriptor has defined ->init_early() and needs some other early work. If it does then we call this callback. After that we are done with setup_arch().

After this we return to the function start_kernel() in init/main.c again, where we will see how the kernel builds zones of the memory blocks, initializes the page allocator and finally gets to call mm_init() which brings the proper memory management with kmalloc() and friends online. We will set up SMP, timekeeping and call back into the architecture to finalize the deal.

But this is all a topic for another time.

After having considered how the ARM32 kernel is uncompressed, and the early start-up when the kernel jumps from executing in physical memory to executing in virtual memory, we now want to see what happens next, all the way until the kernel sets up the proper page tables and starts executing from properly paged virtual memory.

To provide a specific piece of the story that does not fit into this linear explanation of things, I have also posted a separate article on how the ARM32 page tables work. This will be referenced in the text where you might need to recapture that part.

To repeat: we have a rough initial section mapping of 1 MB sections covering the kernel RAM and the provided or attached Device Tree Blob (DTB) or ATAGs if we use a legacy system that is not yet using device tree. We have started executing from virtual memory in arch/arm/kernel/head-common.S, from the symbol __mmap_switched where we set up the C runtime environment and jump to start_kernel() in init/main.c. The page table is at a pointer named swapper_pg_dir.

Initial page table layout The initial page table swapper_pg_dir and the 1:1 mapped one-page-section __turn_mmu_on alongside the physical to virtual memory mapping at early boot. In this example we are not using LPAE so the initial page table is -0x4000 from (PAGE_OFFSET + TEXT_OFFSET), usually at 0xC0004000 thru 0xC0007FFF, and memory ends at 0xFFFFFFFF.

We are executing in virtual memory, but interrupts and caches are disabled and absolutely no device drivers are available, except the initial debug console. The initial debug console can be enabled with CONFIG_DEBUG_LL and selecting the appropriate debug UART driver for your system. This makes the kernel completely non-generic and custom for your system but is great if you need to debug before the device drivers come up. We discussed how you can insert a simple print in start_kernel() using this facility.

In the following text we start at start_kernel() and move down the setup_arch() call. When the article ends, we are not yet finished with setup_arch() so there will be a second part to how we set up the architecture. (I admit I ran over the maximum size for a post, else it would be one gigantic post.)

In the following I will not discuss the “nommu” (uClinux) set-up where we do not use virtual memory, but just a 1-to-1 physical-to-virtual map with cache and memory protection. It is certainly an interesting case, but should be the topic for a separate discussion. We will describe setting up Linux on ARM32 with full classic or LPAE MMU support.

Setting Up the Stack Pointer and Memory for the init Task

In the following section the words task and thread are used to indicate the same thing: an execution context of a process, in this case the init process.

Before we start executing in virtual memory we need to figure out where our stack pointer is set. __mmap_switched in head-common.S also initializes the ARM stack pointer:

   ARM( ldmia   r4!, {r0, r1, sp} )
 THUMB( ldmia   r4!, {r0, r1, r3} )
 THUMB( mov     sp, r3 )

r4 in this case contains __mmap_switched_data where the third variable is:

.long   init_thread_union + THREAD_START_SP

THREAD_START_SP is defined as (THREAD_SIZE - 8), i.e. 8 bytes back from the end of the THREAD_SIZE-byte chunk that starts at the pointer stored in the init_thread_union variable. This is the first word the stack will use, which means that the bytes at offsets THREAD_SIZE - 8 through THREAD_SIZE - 5 will be used by the first write to the stack: only one word (4 bytes) is actually left unused at the end of the THREAD_SIZE memory chunk.

The init_thread_union is a global kernel variable pointing to the task information for the init process. You find the actual memory for this defined in the generic linker file for the kernel in include/asm-generic/vmlinux.lds.h where it is defined as a section of size THREAD_SIZE in the INIT_TASK_DATA section definition helper. ARM32 does not do any special tricks with this and it is simply included into the RW_DATA section, which you find in the linker file for the ARM kernel in arch/arm/kernel/vmlinux.lds.S surrounded by the labels _sdata and _edata and immediately followed by the BSS section:

        _sdata = .;
        RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
        _edata = .;

        BSS_SECTION(0, 0, 0)

This will create a section which is aligned at a page boundary, beginning with INIT_TASK_DATA of size THREAD_SIZE, initialized to the values assigned during compilation and linking and writable by the kernel.

The union thread_union is what is actually stored in this variable, and in the ARM32 case it actually looks like this after preprocessing:

union thread_union {
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};

It is just an array of unsigned long 4 byte chunks, as the word length on ARM32 is, well 32 bits, and sized so that it will fit a THREAD_SIZE stack. The name stack is deceptive: this chunk of THREAD_SIZE bytes stores all ARM-specific context information for the task, and the stack, with the stack in the tail of it. The remaining Linux generic accounting details are stored in a struct task_struct which is elsewhere in memory.

THREAD_SIZE is defined in arch/arm/include/asm/thread_info.h to be (PAGE_SIZE << THREAD_SIZE_ORDER), where PAGE_SIZE is defined in arch/arm/include/asm/page.h to be (1 << 12) and THREAD_SIZE_ORDER is 1, so this resolves to (1 << 13) and THREAD_SIZE is actually 0x2000 (8192) bytes, i.e. 2 consecutive pages in memory. These 0x2000 bytes hold the ARM-specific context and the stack for the task.
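Put together, the relevant definitions look roughly like this (paraphrased from arch/arm/include/asm/thread_info.h and arch/arm/include/asm/page.h, not the verbatim source):

#define PAGE_SIZE           (1 << 12)                        /* 4096 bytes */
#define THREAD_SIZE_ORDER   1
#define THREAD_SIZE         (PAGE_SIZE << THREAD_SIZE_ORDER) /* 0x2000 = 8192 bytes */
#define THREAD_START_SP     (THREAD_SIZE - 8)                /* sp starts at offset 0x1FF8 */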

The struct thread_info is the architecture-specific context for the task and for the init task this is stored in a global variable called init_thread_info that you find defined at the end of init/init_task.c:

struct thread_info init_thread_info __init_thread_info = INIT_THREAD_INFO(init_task);

The macro __init_thread_info expands to a linker directive that puts this into the .data.init_thread_info section during linking, which is defined in the INIT_TASK_DATA we just discussed, so this section is only there to hold the init task thread_info. 8 bytes from the end of this array of unsigned longs that make up the INIT_TASK_DATA is where we put the stack pointer.

Where is this init_thread_union initialized anyway?

It is pretty hard to spot actually, but the INIT_TASK_DATA linker macro in include/asm-generic/vmlinux.lds.h does this. It contains these statements:

init_thread_union = .;
init_stack = .;

So the symbols init_thread_union and init_stack are assigned the very same location: the start of the THREAD_SIZE chunk set aside for the init task. When the .stack member of init_task is assigned to init_stack during linking, it will thus resolve to a pointer to the bottom of this memory chunk. This is logical since the stack grows toward lower addresses on ARM32 systems: .stack points to the bottom of the stack, the limit that the stack grows down toward.

Init thread During the early start of the kernel we take extra care to fill out these two thread_info and task_struct data structures and the pointers to different offsets inside them. Notice the sp (stack pointer) pointing 8 bytes back from the end of the two pages assigned as memory to hold the task information. The first word on the stack will be written at sp, typically at offset 0x1FF8 .. 0x1FFB.

This init_stack is assigned to .stack of struct task_struct init_task in init/init_task.c, where the rest of the task information for the init task is hardcoded. This task struct for the init task is in another place than the thread_info for the init task: they just point back and forth to each other. The task_struct is the generic kernel part of the per-task information, while the struct thread_info is an ARM32-specific information container that is stored together with the stack.

Here we see how generic and architecture-specific code connect: the init_thread_info is something architecture-specific and stores the state and stack of the init task that is ARM32-specific, while the task_struct for the init task is something completely generic, all architectures use the same task_struct.

This init task is task 0. It is not identical to task 1, which will be the init process. That is a completely different task that gets forked in userspace later on. This task is only about providing context for the kernel itself, and a point for the first task (task 1) to fork from. The kernel is very dependent on context as we shall see, and that is why its thread/task information and even the stack pointer for this “task zero” is hardcoded into the kernel like this. This “zero task” does not even appear to userspace if you type ps aux, it is hidden inside the kernel.

Initializing the CPU

The very first thing the kernel does in start_kernel() is to initialize the stack of the init task with set_task_stack_end_magic(&init_task). This puts a STACK_END_MAGIC token (0x57AC6E9D) where the stack ends. Since the ARM stack grows downwards, this will be the last usable unsigned long before we hit the thread_info on the bottom of the THREAD_SIZE memory chunk associated with our init task.

init thread stack end marker We insert a 0x57AC6E9D token so we can see if the last word of per-task stack ever gets overwritten and corrupted. The ARM32 stack grows towards the lower addresses. (Up along the arrow in the picture.)

Next smp_setup_processor_id() is called which is a weak symbol that each architecture can override. ARM does this: arch/arm/kernel/setup.c contains this function and if we are running on an SMP system, we execute read_cpuid_mpidr() to figure out the ID of the CPU we are currently running on and initialize the cpu_logical_map() array such that the current CPU is at index 0 and we print the very first line of kernel log which will typically be something like:

Booting Linux on physical CPU 0x0

If you are running on a uniprocessor system, there will be no print like this.

The kernel then sets up some debug objects and control group information that is needed early. We then reach local_irq_disable(). Interrupts are already disabled (at least they should be) but we exercise this code anyway. local_irq_disable() resolves to arch_local_irq_disable() which in the ARM case can be found in arch/arm/include/asm/irqflags.h. As expected it boils down to the assembly instruction cpsid i which will disable any ordinary IRQ in the CPU. The mnemonic is short for change processor state: interrupt disable, where the trailing i selects the ordinary IRQ. It is also possible to issue cpsid f to disable FIQ, an interrupt type which operating systems on ARM seldom make use of.
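The ARM implementation is essentially just a wrapper around that single instruction, something like this (simplified from arch/arm/include/asm/irqflags.h for ARMv6 and later):

static inline void arch_local_irq_disable(void)
{
    asm volatile(
        "   cpsid i     @ arch_local_irq_disable"
        :
        :
        : "memory", "cc");
}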

ARM systems usually also have an interrupt controller: this is of no concern here: we are disabling the line from the interrupt controller (such as the GIC) to the CPU: this is the big main switch, like going down in the basement and cutting the power to an entire house. Any other lightswitches in the house are of no concern at this point, we haven’t loaded a driver for the interrupt controller so we just ignore any interrupt from any source in the system.

Local IRQ big powerswitch The local_irq_disable() results in cpsid i which will cut the main interrupt line to the CPU core.

This move makes a lot of sense, because at this point we have not even set up the exception vectors, which is what all IRQs have to jump through to get to the destined interrupt handler. We will get to this in due time.

Next we call boot_cpu_init(). As the name says this will initialize the one and only CPU we are currently running the kernel on. Normally any other CPUs in the system are just standing by with frozen instruction counters at this point. This does some internal kernel bookkeeping. We note that smp_processor_id() is called and __boot_cpu_id is assigned in the SMP case. smp_processor_id() resolves to raw_smp_processor_id() in include/linux/smp.h, and that will go back into the architecture in arch/arm/include/asm/smp.h and reference current_thread_info()->cpu, while on uniprocessor (UP) systems it is simply defined to 0.

Let’s see where that takes us!

In the SMP case, current_thread_info() is defined in arch/arm/include/asm/thread_info.h and is dereferenced from the current_stack_pointer like this:

static inline struct thread_info *current_thread_info(void)
{
        return (struct thread_info *)
                (current_stack_pointer & ~(THREAD_SIZE - 1));
}

current_stack_pointer in turn is defined in arch/arm/include/asm/percpu.h and is implemented by simply reading the sp register. We remember that we initialized sp to point 8 bytes from the end of the THREAD_SIZE chunk of memory reserved for init_thread_info, and that the stack grows backward from there. This memory follows right after the thread_info for the init task, so by doing this arithmetic we get a pointer to the current struct thread_info, which in this case is init_thread_info.
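The definition of current_stack_pointer is essentially a register variable permanently bound to sp, along these lines (paraphrased, not the verbatim source):

register unsigned long current_stack_pointer asm ("sp");

Further we see that the linker file contains this: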

. = ALIGN(THREAD_SIZE);
(...)
RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)

So the linker takes care to put thread info into a THREAD_SIZE:d chunk in memory. This means that if THREAD_SIZE is 0x2000, then the init task thread_info will always be on addresses like 0x10000000, 0x10002000, 0x10004000...

For example if THREAD_SIZE is 0x2000 then (THREAD_SIZE - 1) is 0x00001FFF and ~0x00001FFF is 0xFFFFE000 and since we know that the struct thread_info always start on an even page, this arithmetic will give us the address of the task information from the stack pointer. If the init_thread_info ends up at 0x10002000 then sp points to 0x10003FF8 and we calculate 0x10003FF8 & 0xFFFFE000 = 0x10002000 and we have a pointer to our thread_info, which in turn has a pointer to init_task: thus we know all about our context from just using sp.

Thus sp is everything we need to keep around to quickly look up the information of the currently running task, i.e. the process context. This is a central ARM32 Linux idea.

Since the kernel is now busy booting we are dealing with the init task, task 0, but all tasks that ever get created on the system will follow the same pattern: we go into the task context and use sp to find any information associated with the task. The init_task provides context so we don't crash. Now it becomes evident why this special “task zero” is needed so early.

So to conclude: when raw_smp_processor_id() inspects current_thread_info()->cpu to figure out what CPU we are running on, this thread info is going to be the init_thread_info and ->cpu is going to be zero because the compiler has assigned the default value zero to this member during linking (we haven’t assigned anything explicitly). ->cpu is an index and at index zero in the cpu_logical_map() is where we poked in our own CPU ID a little earlier using smp_setup_processor_id() so we can now find ourselves through the sp. We have come full circle.

As I pointed out, this clever sp mechanism isn’t just used for the init task. It is used for all tasks on the system, which means that any allocated thread info must strictly be on a memory boundary evenly divisible by THREAD_SIZE. This works fine because kmalloc(), which is used to allocate kernel memory, returns chunks that are “naturally aligned”, which means “aligned to the object size if that size is a power of two” – this ascertains that the allocation ends up on an address evenly divisible by the size of the allocated object, and the current_thread_info() call will always work fine. This is one of the places where the kernel assumes this kind of natural alignment.

Hello World

Next we call page_address_init() which mostly does nothing, but if the system is using highmem (i.e. has a lot of memory) this will initialize a page hash table. We will explain more about how highmem is handled later in this article.

Next we print the Linux banner. The familiar first text of the kernel appears in the kernel print buffer, stating what version and git hash we built from and what compiler and linker were used to build this kernel:

Linux version 5.9.0-rc1-00021-gbbe281ed6cfe-dirty (...)

Even if we have an early console defined for the system, i.e. if we defined CONFIG_DEBUG_LL these first few messages will still NOT be hammered out directly on the console. They will not appear until we have initialized the actual serial console driver much later on. Console messages from CONFIG_DEBUG_LL only come out on the console as a result of printascii() calls. This banner is just stashed aside in the printk memory buffer until we get to a point where there is a serial device that can actually output this text. If we use earlyprintk and the serial driver has the proper callbacks, it will be output slightly earlier than when the memory management is properly up and running. Otherwise you will not see this text until the serial driver actually gets probed much later in the kernel start-up process.

But since you might have enabled CONFIG_DEBUG_LL there is actually a trick to get these prints as they happen: go into kernel/printk/printk.c and in the function vprintk_store() insert some code like this at the end of the function, right before the call to log_output():

#if defined(CONFIG_ARM) && defined(CONFIG_DEBUG_LL)
       {
               extern void printascii(char *);
               printascii(textbuf);
       }
#endif

Alternatively, after kernel version v5.11-rc1:

if (dev_info)
                memcpy(&r.info->dev_info, dev_info, sizeof(r.info->dev_info));
#if defined(CONFIG_ARM) && defined(CONFIG_DEBUG_LL)
       {
               extern void printascii(char *);
               printascii(&r.text_buf[0]);
       }
#endif

This will make all printk:s get hammered out on the console immediately as they happen, with the side effect that once the serial port is up you get a conflict over the hardware and double prints of everything. However if the kernel grinds to a halt before that point, this hack can be really handy: even the very first prints to console come out immediately. It’s one of the ARM32-specific hacks that can come in handy.

Next we call early_security_init() that will call ->init() on any Linux Security Modules (LSMs) that are enabled. If CONFIG_SECURITY is not enabled, this means nothing happens.

Setting Up the Architecture

The next thing that happens in start_kernel() is more interesting: we call setup_arch() passing a pointer to the command line. setup_arch() is defined in <linux/init.h> and will be implemented by each architecture as they seem fit. The architecture initializes itself and passes any command line from whatever mechanism it has to provide a command line. In the case of ARM32 this is implemented in arch/arm/kernel/setup.c and the command line can be passed from ATAGs or the device tree blob (DTB).

Everything in this article from this point on will concern what happens in setup_arch(). Indeed this function has given the name to this whole article. We will conclude the article once we return out of setup_arch(). The main focus will be memory management in the ARM32 MMUs.

Setting Up the CPU

ARM32 first sets up the processor. This section details what happens in the setup_processor() call which is the first thing the setup_arch() calls. This code will identify the CPU we are running on, determine its capabilities such as cache type and prepare the exception stacks. It will however not enable the caches or any other MMU functionality such as paging, that comes later.

We call read_cpuid_id() which in most cases results in a CP15 assembly instruction to read out the CPU ID: mrc p15, 0, <Rd>, c0, c0, 0 (some silicon such as the v7m family require special handling).

Using this ID we cross reference a struct with information about the CPU, struct proc_info_list. This is actually the implementation in assembly in head-common.S that we have seen earlier, so we know that this will retrieve information about the CPU from the files in arch/arm/mm/proc-*.S for example arch/arm/mm/proc-v7.S for all the ARMv7 processors. With some clever linkage the information about the CPU is now assigned into struct proc_info_list from arch/arm/include/asm/procinfo.h, so that any low-level information about the CPU and a whole set of architecture-specific assembly functions for the CPU can be readily accessed from there. A pointer to access these functions is set up in a global vector table named, very clearly, processor.

We print the CPU banner which can look something like this (Qualcomm APQ8060, a dual-core Cortex A9 SMP system):

CPU: ARMv7 Processor [510f02d2] revision 2 (ARMv7), cr=10c5787d

From the per-processor information struct we also set up elf_hwcap and elf_hwcap2. These flags, which can be found in arch/arm/include/uapi/asm/hwcap.h, tell the ELF (Executable and Linkable Format) parser in the kernel what kind of executable files we can deal with. For example if the executable is using hardware floating point operations or NEON instructions, a flag is set in the ELF header, and it is compared to these capabilities when we load the file so that we can determine if we can even execute the file.

Just reading that out isn’t enough though so we call in succession cpuid_init_hwcaps() to make a closer inspection of the CPU, disable execution of thumb binaries if the kernel was not compiled with thumb support, and we later also call elf_hwcap_fixup() to get these flags right for different fine-granular aspects of the CPU.

Next we check what cache type this CPU has in cacheid_init(). You will find these in arch/arm/include/asm/cachetype.h with the following funny names tagged on: VIVT, VIPT, ASID, PIPT. Naturally, this isn’t very helpful. These things are however clearly defined in Wikipedia, so go and read.

We then call cpu_init() which calls cpu_proc_init() which will execute the per-cputype callback cpu_*_proc_init from the CPU information containers in arch/arm/mm/proc-*.S. For example, the Faraday FA526 ARMv4 type CPU cpu_fa526_proc_init() in arch/arm/mm/proc-fa526.S will be executed.

Lastly there is a piece of assembly: 5 invocations of the msr cpsr_c assembly instruction. msr cpsr_c can switch the CPU into the different exception contexts of an ARM CPU and set up the hardware-specific sp for each of them: we switch in turn into IRQ, ABT, UND and FIQ mode and load a small per-mode stack pointer, and the fifth invocation switches back to SVC mode where the kernel normally executes. The CPU stores these copies of the sp internally in banked registers.

The first assembly instruction loads the address of the variable stk, which is then offset to reach the struct members irq[0], abt[0], und[0], and fiq[0]. The struct stack where this is all stored was set up in the beginning of cpu_init() with these two lines of code:

unsigned int cpu = smp_processor_id();
struct stack *stk = &stacks[cpu];

stacks is just a file-local struct in a cacheline-aligned variable:

struct stack {
    u32 irq[3];
    u32 abt[3];
    u32 und[3];
    u32 fiq[3];
} ____cacheline_aligned;

static struct stack stacks[NR_CPUS];

In the ARM32 case cacheline-aligned means this structure will start on an even 32, 64 or 128-byte boundary. Most commonly a 32-byte boundary, as this is the most common cacheline size on ARM32.

So we define as many exception callstacks as there are CPUs in the system. Interrupts and other exceptions thus have 3 words of callstack. We will later in this article describe how we set up the exception vectors for these exceptions as well.

Setting up the machine

We have managed to set up the CPU per se and we are back in the setup_arch() function. We now need to know what kind of machine we are running on to get further.

To do this we first look for a flattened device tree (DTB or device tree blob) by calling setup_machine_fdt(). If this fails we will fall back to ATAGs by calling setup_machine_tags() which was the mechanism used for boardfiles before we had device trees.

We have touched upon this machine characteristic before: the most important difference is that in device trees the hardware on the system is described using an abstract hierarchical tree structure and in ATAGs the hardware is defined in C code in a so-called boardfile, that will get called later during this initialization. There are actually to this day three ways that hardware is described in the ARM systems:

  1. Compile-time hardcoded: you will find that some systems such as the RISC PC or StrongARM EBSA110 (evaluation board for StrongARM 110) will have special sections even in head.S using #ifdef CONFIG_ARCH_RPC and similar constructs to kick in machine-specific code at certain stages of the boot. This was how the ARM32 kernel was started in 1994-1998 and at the time very few machines were supported. This was the state of the ARM32 Linux kernel as it was merged into the mainline in kernel v2.1.80.
  2. Machine number + ATAGs-based: as the number of ARM machines grew quickly following the successful Linux port, the kernel had to stop relying on compile-time constants and needed a way for the kernel to identify which machine it was running on at boot time rather than at compile time. For this reason ATAGs were introduced in 2002. The ATAGs are also called the kernel tagged list and they are a simple linked list in memory passed to the kernel at boot, and most importantly provides information about the machine type, physical memory location and size. The machine type is what we need to call into the right boardfile and perform the rest of the device population at runtime. The ATAGs are however not used to identify the very machine itself. To identify the machine the boot loader also passes a 32bit number in r1, then a long list of numbers identifying each machine, called mach-types, is compiled into the kernel to match this number.
  3. Device Tree-based: the Android heist in 2007 and onward started to generate a constant influx of new machine types to the ARM kernel, and the board files were growing wild. In 2010 Grant Likely proposed that ARM follow PowerPC and switch to using device trees to describe the hardware, rather than ATAGs and boardfiles. In 2011, following an outburst from Torvalds in March, the situation was becoming unmaintainable and the ARM community started to accelerate the consolidation of the kernel around device trees rather than boardfiles to cut down on kernel churn. Since then, the majority of ARM32 boards use the device tree to describe the hardware. This pattern has been followed by new architectures: Device Tree is established as the current most viable system description model for new machines.

All of the approaches use the kernel device and driver model to eventually represent the devices in the kernel – however during the very early stages of boot the most basic building blocks of the kernel may need to shortcut this to some extent, as we will see.

Whether we use ATAGs or a DTB, a pointer to this data structure is passed in register r2 when booting the kernel and stored away in the global variable __atags_pointer. The name is a bit confusing since it can also point to a DTB. When using ATAGs a second global variable named __machine_arch_type is also used, and this contains register r1 as passed from the boot loader. (This value is unused when booting from a device tree.)

Using the ATAGs or the DTB, a machine descriptor is located. This is found either from the numerical value identifying the machine in __machine_arch_type or by parsing the DTB to inspect the .compatible value of the machine, at the very top node of the device tree.

While ARM32 device tree systems mainly rely on the device tree to boot, they also still need to match against a machine descriptor in some file under arch/arm/mach-*/*.c, defined with the macro DT_MACHINE_START. A simple grep for DT_MACHINE_START gives you an idea of how many basic ARM32 machine types boot from the device tree. The ARM64 (Aarch64) kernel, by contrast, has been engineered from the start not to require any such custom machine descriptors and that is why none are found in that part of the kernel. On ARM64, the device tree is the sole machine description.

If you inspect the arch/arm/kernel/devtree.c file you will see that a default device tree machine named Generic DT based system is defined if we are booting a multiplatform image. If no machine in any subdirectory matches the compatible-string of the device tree, this one will kick in. This way it is possible to rely on defaults and boot an ARM32 system with nothing but a device tree, provided you do not need any fix-ups of any kind. Currently quite a lot of ARM32 machines have some quirks though. ARM64, again, has no quirks on the machine level: if quirks are needed for some hardware, these will normally go into the drivers for the machine, and get detected on a more fine-granular basis using the compatible-string or data found in hardware registers. Sometimes the ARM64 machines use firmware calls.

In either case the physical address pointing to the data structure for ATAGs or DTB is converted to a virtual address using __phys_to_virt() which involves a bit of trickery as we have described in an earlier article about phys_to_virt patching.

Memory Blocks – Part 1

One of the more important effects of calling either setup_machine_fdt() or setup_machine_tags(), depending on machine, is the population of a list of memory blocks, memblocks or regions, which is the basic boot-time unit of physical RAM in Linux. Memblocks are the early memory abstraction in Linux and each of them should be a contiguous (strictly consecutive) chunk of physical memory. The implementation as used by all Linux architectures can be found in mm/memblock.c.

Either parser (device tree or ATAGs) identifies blocks of physical memory that are added to a list of memory blocks using the function arm_add_memory(). The memory blocks (regions) are identified with start and size, for example a common case is 128MB of memory starting at 0x00000000 so that will be a memory block with start = 0x00000000 and size = 0x08000000 (128 MB).

The memory blocks are actually stored in struct memblock_region which has members .base and .size for these two numbers, plus a flag field.
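For reference, the region structure in include/linux/memblock.h looks roughly like this (paraphrased; there is also a NUMA node member in some configurations, and the details vary a bit between kernel versions):

struct memblock_region {
    phys_addr_t base;           /* physical start address */
    phys_addr_t size;           /* size in bytes */
    enum memblock_flags flags;  /* hotplug, mirror, nomap ... */
};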

The ATAG parser calls arm_add_memory() which will adjust the memory block a bit if it has odd size: the memory block must start at a page boundary, be inside a 32bit physical address space (if we are not using LPAE), and it must certainly be on or above PHYS_OFFSET.

When arm_add_memory() has aligned a memory block it will call a facility in the generic memory manager of the Linux kernel with memblock_add() which will in effect store the list of memory blocks in a kernel-wide global list variable with the helpful name memory.

Why must the memory blocks be on or above (i.e. at higher address) than PHYS_OFFSET?

This is because of what has been said earlier about boot: we must load the kernel into the first block of physical memory. However the effect of this check is usually nothing. PHYS_OFFSET is, on all modern platforms, set to 0x00000000. This is because they all use physical-to-virtual patching at runtime. So on modern systems, as far as memory blocks are concerned: anything goes as long as they fit inside a 32-bit physical memory address space.

Consequently, the device tree parser we enter through setup_machine_fdt() does not care about adjusting the memory blocks to PHYS_OFFSET: it will end up in early_init_dt_scan_memory() in drivers/of/fdt.c which calls early_init_dt_add_memory_arch() which just aligns the memory block to a page and add it to the list with memblock_add(). This device tree parsing code is shared among other Linux architectures such as ARC or ARM64 (Aarch64).

When we later want to inspect these memory blocks, we just use the iterator for_each_mem_range() to loop over them. The vast majority of ARM32 systems have exactly one memory block/region, starting at 0x00000000, but oddities exist and these are handled by this code. An example I will use later involves two memory blocks 0x00000000-0x20000000 and 0x20000000-0x40000000 comprising 2 x 512 MB of memory resulting in 1 GB of physical core memory.
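A minimal usage sketch, assuming the current form of the iterator that hands back the physical start and end of each region:

phys_addr_t start, end;
u64 i;

for_each_mem_range(i, &start, &end)
    pr_info("memory block: %pa .. %pa\n", &start, &end);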

Initial Virtual Memory init_mm

After some assigning of variables for the machine, and setting up the reboot mode from the machine descriptor we then hit this:

init_mm.start_code = (unsigned long) _text;
init_mm.end_code   = (unsigned long) _etext;
init_mm.end_data   = (unsigned long) _edata;
init_mm.brk       = (unsigned long) _end;

init_mm is the initial memory management context and this is a compile-time prepared global variable in mm/init-mm.c.

As you can see we assign some variables to the (virtual) addresses of the kernel .text and .data segments. Actually this also covers the BSS section, which we have initialized to zero earlier in the boot process, as _end follows after the BSS section. The initial memory management context needs to know these things to properly set up the virtual memory. This will not be used until much later, after we have exited setup_arch() and get to the call to the function mm_init(). (Remember that we are currently still running in a rough 1MB-chunk-section mapping.)

init MM set-up Here we can see the area where the kernel memory starts at PAGE_OFFSET and how we assign the different section boundaries into init_mm. The swapper_pg_dir is actually the page global directory of the init_mm structure as we will see later.

The Early Fixmap and early_mm_init()

Next we initialize the early fixmap. The fixmap is a virtual memory area from 0xFFC00000-0xFFF00000 where some fixed physical-to-virtual mappings can be specified. They are used for remapping some crucial parts of the memory as well as some I/O memory before the proper paging is up, so we poke around in the page table in a simplified manner and we can do a few necessary I/O operations in the virtual memory even before the proper set-up of the virtual memory.

There are four types of fixmaps on ARM32, all found in arch/arm/include/asm/fixmap.h where enum fixed_addresses define slots to be used for different early I/O maps:

  1. FIX_EARLYCON: this is a slot used by the early console TTY driver. Some serial line drivers have a special early console callback that can be used to get an early console before the actual serial driver framework has started. This is supported on ARM32 but it has a very limited value, because on ARM32 we have CONFIG_DEBUG_LL which provides a hardcoded serial port at compile time, which makes it possible to get debug output on the serial port even before we are running in virtual memory, as we have seen earlier. CONFIG_DEBUG_LL cannot be used on a multiplatform image however, whereas the early console has the upside of using the standard serial port driver callbacks: the serial port driver defines an EARLYCON_DECLARE() callback and assigns functions to ->con->write/read to get both read and write support on the early console.
  2. FIX_KMAP: As can be seen from the code, this is KM_TYPE_NR * NR_CPUS. KM_TYPE_NR is tied down to 16 for ARM32, so this will be 16 maps for each CPU. This area is used for high memory “highmem”. Highmem on ARM32 is a whole story on its own and we will detail it later, but it relates to the way that the kernel uses a linear map of memory (by patching physical to virtual memory mapping as we have seen), but for now it will suffice to say that this is where memory that cannot be accessed by using the linear map (a simple addition or subtraction to get between physical and virtual memory) is temporarily mapped in so that the kernel can make use of it. The typical use case will be page cache.
  3. FIX_TEXT_POKE[0|1]: These two slots are used by debug code to make some kernel text segments writable, such as when inserting breakpoints into the code. The fixmap will open a “window” over the code that it needs to patch, modify the code and close the window again.
  4. FIX_BTMAPS: This is parameterization for 32 * 7 slots of mappings used by early ioremap, see below.

Those early fixmaps are set up during boot as we call:

pmd_t *pmd;

pmd = fixmap_pmd(FIXADDR_TOP);
pmd_populate_kernel(&init_mm, pmd, bm_pte);

The abbreviations used here include kernel page table idiosyncrasies so it might be a good time to read my article on how the ARM32 page tables work.

This will create an entry in the page middle directory (PMD) and sufficient page table entries (PTE) referring to the memory at FIXADDR_TOP, and populate it with the finer granular page table entries found in bm_pte. The page table entries in bm_pte use PAGE_SIZE granularity so usually this is a number of 0x1000 sized windows from virtual to physical memory. FIXADDR_TOP points to the last page in the FIXADDR address space, so this will be at 0xFFF00000 - PAGE_SIZE, typically at 0xFFEFF000..0xFFEFFFFF. The &init_mm parameter is actually ignored, we just set up this PMD using the PTEs in bm_pte. Finally the memory storing the PMD itself is flushed in the translation lookaside buffer so we know the MMU has the right picture of the world.

Did we initialize the init_mm->pgd member somewhere along the way? Nope. init_mm is a global variable declared in <linux/mm_types.h> as extern struct mm_struct init_mm, and the struct mm_struct itself is also defined in this header. To find init_mm you need to look into the core kernel virtual memory management code in mm/init-mm.c and there it is:

struct mm_struct init_mm = {
        .mm_rb          = RB_ROOT,
        .pgd            = swapper_pg_dir,
        (...)
};

So this is assigned at compile time to point to swapper_pg_dir, which we know already to contain our crude 1MB section mappings. We cross our fingers, hope that our fixmap will not collide with any of the existing 1MB section mappings, and just push a more complex, proper “PMD” entry into this PGD area called swapper_pg_dir. It will work fine.

So we still have our initial 1MB-sized section mappings, and to this we have added an entry to this new “PMD”, which in turn point to the page table entry bm_pte which stores our fixmaps.

Init MM context This illustrates the world of the init_mm memory management context. It is getting crowded in this picture. You can see the fixmap entries piling up around 0xFFF00000 and the serial port mapped in as one of the fixmaps.

Code will have to assign the physical base address (aligned to a page boundary) to use for each slot in the fixmap, and the kernel will then map a suitable virtual address to that physical address. For example the early console does this:

set_fixmap(FIX_EARLYCON, <physical address>);

This just loops back to __set_fixmap() in arch/arm/mm/mmu.c that looks up which virtual address has been assigned for this index (in this case the index is FIX_EARLYCON) by using __fix_to_virt() from the generic part of the fixmap in include/asm-generic/fixmap.h. There are two functions for cross referencing virtual to physical memory and vice versa that look like this:

#define __fix_to_virt(x)        (FIXADDR_TOP - ((x) << PAGE_SHIFT))
#define __virt_to_fix(x)        ((FIXADDR_TOP - ((x)&PAGE_MASK)) >> PAGE_SHIFT)

As you can see, the fixmap remappings are done one page at a time (so each remapping is one single page), backwards from FIXADDR_TOP (on ARM32 0xFFEFF000) so the first index for FIX_EARLYCON will be a page at 0xFFEFE000, the second index at 0xFFEFD000 etc.
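To make the counting concrete, here is the same macro dropped into a tiny standalone C program (plain userspace code, just to show the arithmetic) using the ARM32 values quoted above:

#include <stdio.h>

#define PAGE_SHIFT   12
#define FIXADDR_TOP  0xffeff000UL

#define __fix_to_virt(x) (FIXADDR_TOP - ((x) << PAGE_SHIFT))

int main(void)
{
    for (unsigned long idx = 0; idx < 4; idx++)
        printf("fixmap index %lu -> 0x%08lx\n", idx, __fix_to_virt(idx));
    return 0;
}

Each successive index lands exactly one page (0x1000) below the previous one, counting down from FIXADDR_TOP.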

It then looks up the PTE for this virtual address using pte_offset_fixmap():

pte_t *pte = pte_offset_fixmap(pmd_off_k(vaddr), vaddr);

Which at this point is a pointer to pte_offset_early_fixmap():

static pte_t * __init pte_offset_early_fixmap(pmd_t *dir, unsigned long addr)
{
    return &bm_pte[pte_index(addr)];
}

So we simply get a pointer to the PTE inside the bm_pte page table, which makes sense since this is the early fixmap page table where these page table entries physically reside.

Then set_fixmap() will modify the page table entry in the already registered and populated bm_pte page table by calling set_pte_at() which calls set_pte_ext() which will eventually call down into the per-CPU assembly symbol set_pte_ext in arch/arm/mm/proc-*.S. This will conjure the right value for the PTE and write that back to the page table in the right slot, so that the virtual-to-physical mapping actually happens.

After this we call local_flush_tlb_kernel_range() just to make sure we don’t have this entry stored in the translation lookaside buffer for the CPU. We can now access the just remapped physical memory for the early console in the virtual address space. It better not be more than one page, but it’s cool: there is no memory mapped serial port that uses more than one page of memory. It will “just work”.

The early fixmaps will eventually be converted to proper (non-fixed) mappings once we call early_fixmap_shutdown() inside paging_init() which will be described later. Curiously only I/O memory is supported here. We better not use any fixmaps before early_fixmap_shutdown() that are not I/O memory and expect them to still be around after this point. This should be safe: the patching in the poke windows and the highmem business should not happen until later anyways.

As I noted the FIX_BTMAPS inside the early fixmaps are used for I/O memory. So this comes next, as we call early_ioremap_init(). As ARM32 is using the generic early ioremap code this just calls early_ioremap_setup() in mm/early_ioremap.c. This makes it possible to use early calls to ioremap(). As noted, NR_FIX_BTMAPS (32) and the 7 slots are what we use to parameterize the generic early ioremap code: up to 7 early ioremap areas can be active at the same time, each covering up to 32 pages.

For drivers it will transparently provide a back-end for ioremap() so that a piece of I/O memory can be remapped to a virtual address already at this point, and stay there, usually for the uptime of the system. A driver requesting an ioremap at this point will get a virtual-to-physical mapping in the assigned virtual memory area somewhere in the range 0xFFC00000-0xFFF00000 and it stays there.
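From the driver's point of view this looks like any other ioremap() call; a minimal sketch with a made-up physical address, not taken from any real driver:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/io.h>
#include <linux/sizes.h>

#define MY_UART_PHYS 0x80007000 /* hypothetical memory-mapped I/O address */

static void __iomem *my_uart_base;

static int __init my_uart_early_map(void)
{
    /* As described above, at this stage the mapping can be backed by
     * the FIX_BTMAPS early ioremap slots. */
    my_uart_base = ioremap(MY_UART_PHYS, SZ_4K);
    if (!my_uart_base)
        return -ENOMEM;
    /* readl()/writel() on my_uart_base can now be used */
    return 0;
}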

Which device drivers will use these early ioremaps to get to the memory-mapped I/O? The early console is using its own fixmap, so not that one. Well, actually not much, as it looks. But it is available.

Later, at runtime, the fixmaps will find another good use: they are used to map in highmem: memory that the kernel cannot handle due to being outside of the linear kernel map. Such areas will be mapped in here one piece at a time. This is causing complexities in the highmem handling, and is why we are investigating an end to highmem. If you do not know what highmem means in this context – do not worry! – it will be explained in more detail below.

Early Parameters

We have just set up some early mappings for the early console, so next we call parse_early_param() in init/main.c to read a few command line parameters that just cannot wait. As the documentation in <linux/init.h> says: “only for really core code”.

Before we set up the early fixmap we actually copied the boot_command_line. This was already present since the call to setup_machine_fdt() or setup_machine_tags(). During the parsing of ATAGs or the DTB, boot_command_line is assigned from either source. So a kernel command line can be passed in from each of these two facilities. Parameters to the kernel can naturally be passed in on this command line.

The normal way to edit and pass these command line arguments is by using a facility in U-Boot, UEFI or GRUB to set them up in the boot loader before booting the kernel. U-Boot for example will either pass them in a special ATAG (old way) or by modifying the chosen node of the device tree in memory before passing a pointer to the device tree to the kernel in r2 when booting.

The early params are defined all over the compiled-in kernel code (not in modules, naturally) using the macro early_param(). Each of these results in a struct obs_kernel_param with a callback ->setup_func() associated with it, which gets stored in a table section named .init.setup by the linker, and these will be called one by one at this point.
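As a reference, this is roughly what declaring such a parameter looks like in compiled-in code. The parameter name and handler below are made up for illustration; real examples, such as the handler for mem=, live in arch/arm/kernel/setup.c:

#include <linux/init.h>
#include <linux/printk.h>

/* Hypothetical example: handle "myparam=..." from the kernel command line */
static int __init myparam_setup(char *str)
{
    pr_info("myparam passed on the command line: %s\n", str);
    return 0;
}
early_param("myparam", myparam_setup);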

Examples of things that get set up from early params are cache policy, extra memory segments passed with mem=..., parameters to the IRQ controllers such as noapic to turn off the x86 APIC, initrd to point out the initial RAM disk.

mem is an interesting case: this is parsed by an early_param() inside setup.c itself, and if the user passed some valid mem= options on the command line, these will be added to the list of available memory blocks that we constructed from the ATAGs or DTB.

earlycon is also parsed, which will activate the early console if this string is passed on the command line. So we just enabled the special FIX_EARLYCON early fixmap, and now the early console can use this and dump out the kernel log so far. All this happens in drivers/tty/serial/earlycon.c by utilizing the early console callbacks of the currently active serial port.

If you are using device tree, early_init_dt_scan_chosen_stdout() will be called, which will call of_setup_earlycon() on the serial driver selected in the stdout-path in the chosen section in the root of the device tree, such as this:

/ {
    chosen {
         stdout-path = "uart0:19200n8";
    };
(...)

Or like this:

/ {
    chosen {
           stdout-path = &serial2;
    };
(...)
    serial2: uart@80007000 {
        compatible = "arm,pl011", "arm,primecell";
        reg = <0x80007000 0x1000>;
    };
};

The device tree parsing code will follow the phandle or use the string to locate the serial port. For this reason, to get an early console on device tree systems, all you need to pass on the command line is earlycon. The rest will be figured out by the kernel by simply inspecting the device tree.

On older systems it is possible to pass the name of the driver and a memory address of a serial port, for example earlycon=pl011,0x80007000, but this is seldom used these days. Overall the earlycon, as noted, is seldom used on ARM32. It is because we have the even more powerful DEBUG_LL that can always hammer out something on the serial port. If you need really early debugging information, my standard goto solution is DEBUG_LL and using printascii() until the proper kernel prints have come up. If earlycon is early enough, putting the stdout-path into your device tree and specifying the earlycon parameter should do the trick on modern systems.

Early Memory Management

As we reach early_mm_init() we issue build_mem_type_table() and early_paging_init() in sequence. This is done so that we can map the Linux memory the way it is supposed to be mapped using all features of the MMU to the greatest extent possible. In reality, this involves setting up the “extra bits” in the level-2/3 page descriptors corresponding to the page table pointer entries (PTE:s) so the MMU knows what to do.

build_mem_type_table() will fill in the mem_types[] array of memory types with the appropriate protection and access settings for the CPU we are running on. mem_types[] is an array of some 16 different memory types that can be found inside arch/arm/mm/mmu.c and list all the memory types that may exist in an ARM32 system, for example MT_DEVICE for memory-mapped I/O or MT_MEMORY_RWX for RAM that can be read, written and executed. The memory type will determine how the page descriptors for a certain type of memory gets set up. At the end of building this array, the kernel will print out some basic information about the current main memory policy, such as:

Memory policy: ECC enabled, Data cache, writeback

This tells us that ECC was turned on by passing ecc=on on the command line, along with the writeback data cache policy, which is the default. The message is a bit misleading since it actually just talks about a few select aspects of the memory policy regarding the use of ECC and the data cache. There are many other aspects to the cache policy that can be seen by inspecting the mmu.c file. It is not normal to pass in ecc=on, so this is just an example.

early_paging_init() then, is a function that calls ->pv_fixup() on the machine, and this is currently only used on the Texas Instruments Keystone 2 and really just does something useful when the symbol CONFIG_ARM_PV_FIXUP is defined. What happens then is that an address space bigger than 4GB is enabled using LPAE. This is especially complicated since on this machine even the physical addresses changes as part of the process and PHYS_OFFSET is moved upwards. The majority of ARM32 machines never do this.

We do some minor initialization calls that are required before kicking in the proper paging such as setting up the DMA zone, early calls to Xen (virtualization) and EFI. Then we get to the core of the proper Linux paging.

Paging Initialization: Lowmem and Highmem

We now reach the point where the kernel will initialize the virtual memory handling proper, using all the bells and whistles of the MMU.

The first thing we do is to identify the bounds of lowmem. ARM32 differentiates between lowmem and highmem like this:

  • Lowmem is the physical memory used by the kernel linear mapping. This is typically the virtual memory between 0xC0000000-0xF0000000, 768 MB. With an alternative virtual memory split, such as VMSPLIT_1G which gives userspace 1 GB and kernelspace 3 GB (PAGE_OFFSET at 0x40000000), lowmem would instead be 0x40000000-0xF0000000, 2.75 GB. But this is uncommon: almost everyone and their dog uses the default VMSPLIT with 768 MB lowmem.
  • Highmem is any memory at higher physical addresses, that we cannot fit inside the lowmem linear physical-to-virtual map.

You might remember that this linear map between physical and virtual memory was important for ARM32. It is achieved by physical-to-virtual runtime patching as explained in a previous article, with the goal of being as efficient as possible.

As you can see, on ARM32 both lowmem and highmem have a very peculiar definition and those are just conventions, they have very little to do with any hardware limitations of the architecture or any other random definitions of “lowmem” and “highmem” that are out there.

A system can certainly have less than 768 MB (0x30000000) physical memory as well: then all is fine. The kernel can map in and access any memory and everyone is happy. For 14 years all ARM32 systems were like this, and “highmem” did not even exist as a concept. Highmem was added to ARM32 by Nicolas Pitre in September 2008. The use case was a Marvell DB-78x00-BP development board that happened to have 2 GB of RAM. Highmem requires fixmap support so that was added at the same time.

The kernel currently relies on being able to map all of the core memory – the memory used by the kernel itself, and all the userspace memory on the machine – into its own virtual memory space. This means that the kernel cannot readily handle a physical memory bigger than 768 MB, with the standard VMSPLIT at 0xC0000000. To handle any memory outside of these 768 MB, the FIX_KMAP windows in the fixmaps we discussed earlier are used.

Let’s inspect how this 768 MB limitation comes about.

To calculate the end of lowmem we call adjust_lowmem_bounds(), but first notice this compiled-in constant a little bit above that function:

static void * __initdata vmalloc_min =
    (void *)(VMALLOC_END - (240 << 20) - VMALLOC_OFFSET);

vmalloc is shorthand for the virtual memory allocation area and indicates the memory area where the kernel allocates memory windows. It has nothing to do with the physical memory of the kernel at this point: it is an area of virtual addresses that the kernel can use to place mappings of RAM or memory-mapped I/O. It will be used by SLAB and other kernel memory allocators as well as by ioremap(): it is a range of addresses in the kernel’s virtual memory that it will use to access random stuff in physical memory. As we noted, we had to use early fixmaps up to this point, and the reason is exactly this: there is no vmalloc area to map stuff into yet.

The variables used for this pointer can be found in arch/arm/include/asm/pgtable.h:

#define VMALLOC_OFFSET          (8*1024*1024)
#define VMALLOC_START           (((unsigned long)high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1))
#define VMALLOC_END             0xff800000UL

Never mind the high_memory variable in VMALLOC_START: it has not been assigned yet. We have no idea about where the virtual memory allocation will start at this point.

VMALLOC_END is hardcoded to 0xFF800000, (240 << 20) is an interesting way of writing 0x0F000000 and VMALLOC_OFFSET is 0x00800000. So vmalloc_min will be a pointer to the virtual address 0xF0000000. This lower bound has been chosen for historical reasons, which can be studied in the commit. Above this, all the way to 0xFF800000 we will find our vmalloc area.

From this we calculate the vmalloc_limit as the physical address corresponding to vmalloc_min, i.e. what address in the physical memory the virtual address 0xF0000000 corresponds to in the linear map of physical-to-virtual memory. If our kernel was loaded/decompressed at 0x10008000 for example, that usually corresponds to 0xC0008000 and so the offset is 0xB0000000 and the corresponding physical address for 0xF0000000 would be 0xF0000000 - 0xB0000000 = 0x40000000. So this is our calculated vmalloc_limit.

We arrive at the following conclusion: since the kernel starts at 0x10000000 (in this example) and the linear map of the kernel stretches from 0x10000000-0x40000000 (in virtual address space 0xC0000000-0xF0000000), that is the equivalent of 0x30000000 bytes of memory, i.e. 768 MB. This holds in general: the linear map of the kernel memory is 768 MB. The kernel and all memory it directly references must fit in this area.
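The same arithmetic written out as a small standalone C program (plain userspace code, using the numbers of the running example):

#include <stdio.h>

int main(void)
{
    unsigned long page_offset = 0xc0000000UL; /* default VMSPLIT */
    unsigned long phys_offset = 0x10000000UL; /* example physical load address */
    unsigned long vmalloc_min = 0xf0000000UL; /* VMALLOC_END - 240 MB - 8 MB */

    unsigned long p2v_offset = page_offset - phys_offset;   /* 0xB0000000 */
    unsigned long vmalloc_limit = vmalloc_min - p2v_offset; /* 0x40000000 */

    printf("vmalloc_limit = 0x%08lx\n", vmalloc_limit);
    printf("lowmem size   = %lu MB\n", (vmalloc_limit - phys_offset) >> 20);
    return 0;
}

The lowmem size comes out as 768 MB whatever the physical start address is, since both ends of the linear map move together.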

In adjust_lowmem_bounds() we first loop over all memblocks to skip over those that are not “PMD-aligned” which means they do not start at a physical address evenly divisible by 0x00200000 (2 MB) on ARM systems. These are marked nomap in the memblock mechanism, so they will not be considered for kernel mappings.

The patch-phys-to-virt mechanism and the way we map page tables over the kernel rely on the kernel being at a PMD boundary, so we will not be able to use memory blocks that do not start at a PMD-aligned memory address. If the memory block is such an off thing, such as when a machine has some small memories at random locations for things like graphics or other buffers, these will not be considered as a candidate for the core of the system. You can think about this as a heuristic that says: “small memories are probably not supposed to be used as main RAM memory”.

The next loop is more interesting. Here we again loop over the memblocks, but skip over those we just marked as unfit for the core memory. For those that remain we check if they start below vmalloc_limit (in our example at 0x40000000). As the memory we are executing inside certainly must start below vmalloc_limit we have a core memory candidate and inspect it further: if the block ends above the current lowmem_limit we bump lowmem_limit to the end of the block, unless it would go past vmalloc_limit, in which case we truncate it at vmalloc_limit. This will in effect fit the core memory candidate into the 768 MB available for lowmem, and make sure that lowmem_limit is at the end of the highest memblock we can use for the kernel, or alternatively at the hard vmalloc_limit if that is lower than the end of the memblock. vmalloc_limit in turn is calculated by subtracting the vmalloc reservation from VMALLOC_END, and that reservation can be given as a command line argument, so if someone passed in a minimum size of vmalloc reservation space, that will be respected.

In most cases the lowmem limit will end up at the end of the highest located memblock. Very often there is just one big enough memblock on a given system.

The loop is necessary: there could be two blocks of physical memory adjacent to each other. Say one memblock at 0x00000000-0x20000000 and another memblock at 0x20000000-0x40000000 giving 2 x 512 MB = 1 GB of core memory. Both must be mapped to access the maximum of core memory. The latter will however be truncated by the code so that it ends at vmalloc_limit, which will be at 0x30000000 as we can only map 768 MB of memory. Oops.

The remainder of the second memblock will be called highmem if we have enabled highmem support, else we will just call memblock_remove() to delete the remainder from the system, and in this case the remaining memory will be unused so we print a warning about this: “Consider using a HIGHMEM enabled kernel.”

Lowmem and highmem on 1 GB This shows how a 1 GB physical memory in two memblocks of 512 MB each starting at physical address 0x00000000 gets partitioned into a lowmem of 768 MB and highmem of 256 MB. The upper 256 MB cannot fit into the linear kernel memory map!

Lowmem and highmem on 2 GB If we instead have four memblocks of 512 MB physical memory comprising a total of 2 GB, the problem becomes ever more pronounced: 1.2 GB of memory is now in highmem. This reflects the situation in the first Marvell board with 2 GB of memory that initiated the work on highmem support for ARM32.

After exiting the function we set arm_lowmem_limit to the lowmem_limit we found. In our example with these two 512 MB banks, it will be set at physical address 0x30000000. This is aligned to the PMD granularity (0x00200000) which will again be 0x30000000, and set as a limit for memory blocks using memblock_set_current_limit(). The default limit is 0xFFFFFFFF so this puts an upper bound on where in physical memory we can make allocations with memblock for the kernel: the kernel can now only allocate in physical memory between 0x00000000-0x30000000.

We immediately also assign high_memory to the virtual memory address right above arm_lowmem_limit, in this case 0xF0000000 of course. That means that this definition from arch/arm/include/asm/pgtable.h is now resolved:

#define VMALLOC_START (((unsigned long)high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1))

And we find that VMALLOC_START is set to 0xF0800000. The VMALLOC area is now defined to be 0xF0800000-0xFF800000, 0x0F000000 bytes (240 MB). This matches the (240 << 20) expression in the vmalloc_min definition.
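Plugging the example numbers into the VMALLOC_START definition, as a quick standalone check of the arithmetic (plain userspace C):

#include <stdio.h>

#define VMALLOC_OFFSET (8 * 1024 * 1024UL)
#define VMALLOC_END    0xff800000UL

int main(void)
{
    unsigned long high_memory = 0xf0000000UL; /* just above arm_lowmem_limit */
    unsigned long vmalloc_start =
        (high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET - 1);

    printf("VMALLOC area: 0x%08lx..0x%08lx (%lu MB)\n",
           vmalloc_start, VMALLOC_END, (VMALLOC_END - vmalloc_start) >> 20);
    return 0;
}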

Next

This exploration will be continued in Setting Up the ARM32 Architecture, part 2 where we will take a second round around the memory blocks and set up the paging.

As I continue to describe in different postings how the ARM32 start-up sequence works, it becomes necessary to explain in-depth the basic kernel concepts around page tables and how they are implemented on ARM32 platforms.

To understand the paging setup, we need to repeat and extend some Linux paging lingo. Some good background is to read Mel Gorman's description of the Linux page tables from his book “Understanding the Linux Virtual Memory Manager”. This book was published in 2007 and is based on Mel's PhD thesis from 2003. Some stuff has happened in the 13 years since then, but the basics still hold. It is necessary to also understand the new layers in the page tables such as the five layers of page tables currently used in the Linux kernel.

First a primer: the ARM32 architecture with a classic MMU has 2 levels of page tables and the more recent LPAE (Large Physical Address Extension) MMU has 3 levels of page tables.

Only some of the ARMv7 implementations have LPAE, and it is only conditionally enabled, i.e. those machines can also use the classic MMU if they want: they have both. It is not enabled by default in the multi_v7 configuration: your machine has to explicitly turn it on during compilation. The layout is so different that the same kernel image can never support both the classic and the LPAE MMU.

Early implementations of ARMv7-A such as Cortex A8 and Cortex A9 do not support LPAE, rather it was introduced during the lifetime of this architecture. Since this is a compile-time setting, the default configuration for ARMv7 cannot enable it by default, or the older implementations would break. ARMv8 implementations all have LPAE enabled by default.

But let’s first discuss how Linux sees the world of paging.

The Linux Kernel View of Page Tables

The generic paging management mechanism in the Linux kernel sees the world like this:

Paging directories Abstract page tables in the Linux kernel proceed from the PGD (page global directory) thru the P4D (fourth level directory), PUD (page upper directory), PMD (page middle directory) and down to the actual page table entries (PTEs) that map individual pages of memory from virtual to physical address space.

Abbreviations:

  • pgd, pgd_t, pgdval_t = Page Global Directory – the Linux kernel main page table; the PGD handling the kernel memory is found in the symbol swapper_pg_dir
  • p4d, p4d_t, p4dval_t = Page Level 4 Directory was introduced to handle 5-level page tables after the pud was introduced. By then it was clear that we needed to indicate the directory level with a number rather than keep inventing natural names. This is unused on ARM32.
  • pud, pud_t, pudval_t = Page Upper Directory was introduced after the other abstractions to deal with 4-level page tables. Like p4d, this is unused on ARM32.
  • pmd, pmd_t, pmdval_t = Page Middle Directory, ARM32 uses this.
  • pte, pte_t, pteval_t = Page Table Entry, ARM32 uses this.
  • pfn = Page Frame Number – each page in physical memory has a unique number: address 0x00000000 is page 0, address 0x00001000 is page 1 etc (see include/asm-generic/memory_model.h)

First and foremost notice that all of these except the PTE are named something-directory. This is because they all contain several pointers down to objects of the next level, like any directory. So each of the entities PGD, P4D … etc are arrays of pointers. The PTE, despite having a singular form, also contains several pointers.

The Linux kernel will act as if 5 levels of page tables exist. This is of course grossly over-engineered for ARM32 which has 2 or 3 levels of page tables, but we need to cater for the rest of the world. One size fits all. In practice, the code is organized such that these page tables “fold” and we mostly skip over the intermediate translation steps when possible.

The other thing you need to know about the page table hierarchy is that on the lowest level the PTEs contain a number of pointers, which are 1-to-1 translation chunks. On ARM32, on the PTE level we always have 512 pointers per PTE, and each pointer translates one 4KB page (0x1000) of memory.

Page Table Entry A PTE on ARM32 contains 512 pointers, i.e. translations between physical and virtual pages. From the generic kernel virtual memory management point of view it is not important how this translation actually happens in hardware, all it needs to know is that a page of size 0x1000 (4KB) is translated by a single PTE pointer entry. This way a PTE is a directory, just like everything else in the page hierarchy. In the example the first pointer in the PTE translates physical address 0x10000000 to virtual address 0xC0000000.

On the PMD level the classic ARM32 MMU has one pointer per PMD and 2048 pointers per PGD. The P4D and PUD levels are “folded” i.e. unused. Having one pointer per PMD seems vaguely pointless: we have level-1 in the 2048 pointers in the PGD and level-2 in the 512 pointers in the PTE. You rightfully ask what the point is to have a “three level” hierarchy with the middle directory having one pointer per PMD. This is a way of fitting the Linux idea about page hierarchy with the actual ARM32 architecture, and will be explained shortly.

Classic ARM MMU page table The classic ARM32 paging setup in Linux folds P4D and PUD, providing 2048 pointers per PGD, 1 pointer per PMD and 512 pointers per PTE. It uses 3 levels of the hierarchy while ARM32 hardware only has two. This is however a good fit, as we shall see. To the left the object relations, to the right an illustration of the tables.
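To see what this means for an individual address, here is a small standalone C sketch (plain userspace code) that splits a virtual address the way the Linux view of the classic ARM32 tables does, using the numbers from the discussion above:

#include <stdio.h>

#define PAGE_SHIFT    12
#define PGDIR_SHIFT   21  /* each of the 2048 PGD/PMD entries covers 2 MB */
#define PTRS_PER_PTE  512

int main(void)
{
    unsigned long va = 0xc0123456UL; /* some example kernel address */
    unsigned long pgd_index = va >> PGDIR_SHIFT;
    unsigned long pte_index = (va >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
    unsigned long offset = va & ((1UL << PAGE_SHIFT) - 1);

    printf("va 0x%08lx: pgd index %lu, pte index %lu, page offset 0x%03lx\n",
           va, pgd_index, pte_index, offset);
    return 0;
}

For 0xC0123456 this gives PGD index 1536 and PTE index 291: the folded PMD level does not show up in the address at all.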

On LPAE the story is simpler: each PGD has 4 pointers covering 1GB each (for a total of 4GB of memory) and 512 pointers per PMD dividing each 1GB into 2MB chunks, then 512 pointers per PTE dividing the 2MB chunks into 4KB chunks (pages). The math should match up. The LPAE MMU can of course put more pointers into the PGD to cover up to 1TB of memory. Currently no ARM32 architectures need this so we just hammer it down to 4GB maximum for kernelspace. 4GB of kernelspace memory should be enough for everyone. Userspace is another story.

LPAE MMU page table The LPAE page table also folds P4D and PUD but has something meaningful on the PMD level: 4 PGD entries covering 1 GB each cover the whole 32bit address space, then there are 512 pointers per PMD and 512 pointers per PTE. Each PTE entry translates 4KB of physical memory to virtual memory. If we were to fill the PGD with all 512 available entries it would span 512 GB of address space.
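The corresponding split for the LPAE case, again as a standalone sketch using the shifts implied by the 1 GB / 2 MB / 4 KB levels described above:

#include <stdio.h>

int main(void)
{
    unsigned long va = 0xc0123456UL;        /* same example kernel address */
    unsigned long l1 = va >> 30;            /* PGD index, 0..3 (1 GB each) */
    unsigned long l2 = (va >> 21) & 0x1ff;  /* PMD index, 0..511 (2 MB each) */
    unsigned long l3 = (va >> 12) & 0x1ff;  /* PTE index, 0..511 (4 KB each) */

    printf("va 0x%08lx: level-1 %lu, level-2 %lu, level-3 %lu\n",
           va, l1, l2, l3);
    return 0;
}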

So when we say that the classic ARM32 MMU has 2 levels of page tables, we are presenting this to Linux in a peculiar way as “3 levels” where the middle one is a single pointer, whereas the LPAE MMU actually has 3 levels. We are sorry for the confusion.

Obtaining Pointers to Directory Entries

The pointer to an index in a directory is obtained with special inlined accessors that take a virtual address as a parameter, named for example pmd_off(), and these traverse the whole hierarchy to get to the offset of the right element in the hierarchy:

static inline pmd_t *pmd_off(struct mm_struct *mm, unsigned long va)
{
        return pmd_offset(pud_offset(p4d_offset(pgd_offset(mm, va), va), va), va);
}

On ARM32 the P4D and PUD parts of this ladder will resolve to nothing, get “folded” and optimized out at compile-time. The struct mm_struct *mm argument is the actual memory manager context (for the kernel this is init_mm) which stores the pointer to the memory where the page global directory (PGD) actually resides. And that will be the symbol swapper_pg_dir if we are running kernel code. This is how it actually happens in include/linux/pgtable.h:

#define pgd_offset(mm, address)         pgd_offset_pgd((mm)->pgd, (address))

(...)

static inline pgd_t *pgd_offset_pgd(pgd_t *pgd, unsigned long address)
{
        return (pgd + pgd_index(address));
};

In kernelspace mode the mm argument is init_mm and the ->pgd pointer points to swapper_pg_dir, the kernel memory space root page table and equal to the location in virtual memory of the actual physical global page table. All index pointer accessors operate on this principle.

The ARM32 View of the Page Tables

The virtual address for the kernelspace global page directory is PAGE_OFFSET+0x3000 or PAGE_OFFSET+0x4000, where PAGE_OFFSET depends on the kernel VMSPLIT but is typically 0xC0000000, kernel memory from 0xC0000000..0xFFFFFFFF. On ARM32 this base address, in physical memory, is set in TTBR0 (Translation Table Base Register). Linux uses the symbol swapper_pg_dir for this address. This is 0x5000 bytes in size for LPAE and 0x4000 bytes in size for the classic ARM MMU. If the kernel starts at the physical address 0xnnnn8000 it is at 0xnnnn8000-PG_DIR_SIZE, so at 0xnnnn3000..0xnnnn7FFF for LPAE and 0xnnnn4000..0xnnnn7FFF for any other ARM. The most common location is at 0xC0004000..0xC0007FFF in virtual memory.

PTE page table entries and the associated types pte_t and pteval_t are of course purely Linux concepts. And that shows. Because what ARM32 MMUs are thinking about is not PTEs but coarse 2- or 3-level page tables of 4KB pages. It would be neat if Linux was engineered in a more flexible way to allow a coarse page table to be ONE PTE, but that isn't always the case, and for old ARM32 systems it definitely is not.

“Coarse” in this context means we start at a level-1 descriptor and see that for this 1MB of virtual memory we are using some smaller 4K pages so we have to look further: go and walk the page table ladder using the modified virtual address I give you here. ARM MMUs also have a concept of “fine” pages of just 1K. Linux does not use these. Yet.

We are here dealing with PAGE_SIZE granularity, and the Linux memory manager expects a PTE to be one page big and full of pointers, so we will for example call arm_pte_alloc() to allocate a new level-2 page table (or level-3 on LPAE) of 0x1000 (4096) bytes. That much is simple.

The world is also reasonably simple on LPAE systems, because there the PTE is indeed one page, and it is populated by 512 entries of 64 bits / 8 bytes each, meaning we have 512 * 8 = 4096 bytes and it is a perfect match. Also these 64 bits fulfil the contract between Linux’ memory manager and the architecture of providing some MMU facilities such as a “dirty” bit. One PTE on an LPAE MMU is one page full of these coarse third level descriptors (remember that LPAE has three levels of pages). This nice fit will also be the case on the ARM64 / Aarch64 platform. End of story. Good for them!

What is not so simple is how Linux utilizes these 4096 bytes on the classic oldschool ARM MMU, which is described in include/asm/pgtable-2level.h.

What is done on the classic MMU is that we put the coarse page table index into the modified virtual address (MVA) field of the level-1 descriptor. This “coarse page table index” is in bits 10-31 of the level-1 descriptor (which is half a PGD entry, more about that later) and corresponds to the highest 22 bits of the physical address of the level-2 page table. What is peculiar about the coarse page table index is that it does not correspond to a page in memory. Instead it corresponds to a quarter of a page. This is logical because level-2 page table entries are 32 bits / 4 bytes each and we need 256 of them to cover 1MB, which is the coarseness of the level-1 descriptor, thus we need 256 * 4 = 1024 = 0x400 bytes. That’s one quarter of a 4KB page. So the coarse page table index points to a location in memory indexed by the 0x400:th chunk. Which is mildly confusing.

Further we have the following problem: ARM’s classic MMU does not have all the “dirty” and “accessed”/”young” bits that the generic Linux virtual memory manager presupposes to exist in the architecture.

One could imagine telling the Linux virtual memory manager that we have only 256 pointers per PTE and putting some metadata in the remainder, but that would end up using only half a page and waste lots of memory for the PTEs.

The actual solution, as implemented on ARM32 (I think this was invented by Russell King), is to squeeze in as much as possible in a page as follows:

  • Tell the Linux virtual memory manager that we have 512 page descriptors, which Linux calls “pointers” in a PTE, even though we know very well that the hardware has 256 of them. This is done by setting PTRS_PER_PTE to 512 in include/asm/pgtable-2level.h
  • Tell it that we map 2 MB of memory in a PMD. This is done by defining PMD_SHIFT to 21 in include/asm/pgtable-2level.h. This is done despite the fact that we know very well that a level-1 descriptor maps 1 MB and not 2 MB of memory. We also make the page middle directory into a 1-to-1 directory with just one pointer per PMD: PTRS_PER_PMD is set to 1.
  • Set __PAGETABLE_PMD_FOLDED so we don't allocate any real backing data structures for the PMDs, as these will be stored inside the PGD.
  • Tell it that the PGD contains 2048 pointers (setting PTRS_PER_PGD to 2048), while in fact the 0x4000 bytes of level-1 descriptors have 4096 pointers, that we just group in pairs.
  • Whenever the generic memory manager asks for a PGD entry we reference a PMD pointer with two level-1 page table descriptors.
  • Whenever the generic memory manager asks for a PMD entry we reference a 1-to-1 entity.
  • Whenever the generic memory manager asks for a PTE entry we reference two level-2 page table descriptors and two metadata entries of 32 bits each.

We then occupy half a page with these two sets of level-2 descriptors. To solve the mapping problem for different “accessed” or “young” bits, we use the half page that is left to hold some metadata about what Linux has requested so we can do a bit of back-and-forward emulation of the features that Linux virtual memory manager wants.

We have effectively posed to the kernel virtual memory manager that we have an MMU with three levels of page tables: PGD (2048 pointers), PMD (one pointer) and PTE (512 pointers), while in fact we have two, and then behind the back of the generic virtual memory manger we do this optimization to adjust the map to reality.

This is not so bad: the virtual memory manager is well aware that not the whole world uses all five levels of page tables it supports, so it has been written to “fold” levels like our artificial PMD already. This will work just fine.

Alas, this is not a very simple solution. But it is very efficient, counterintuitive and complicated. Since operating system development can be pretty hard in the details, this is what we can expect.
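The numbers involved can be written out as a small standalone C program (plain userspace code); the constants mirror those quoted above from include/asm/pgtable-2level.h:

#include <stdio.h>

#define PAGE_SHIFT    12
#define PMD_SHIFT     21   /* Linux is told that a PMD entry covers 2 MB */
#define PTRS_PER_PTE  512  /* Linux view: 512 pointers per PTE */
#define PTRS_PER_PMD  1
#define PTRS_PER_PGD  2048

int main(void)
{
    unsigned long pte_span = (unsigned long)PTRS_PER_PTE << PAGE_SHIFT;
    unsigned long pmd_span = 1UL << PMD_SHIFT;
    unsigned long long pgd_span = (unsigned long long)PTRS_PER_PGD * pmd_span;

    printf("one PTE covers %lu MB\n", pte_span >> 20);         /* 2 MB */
    printf("one PMD entry covers %lu MB\n", pmd_span >> 20);   /* 2 MB */
    printf("the whole PGD covers %llu GB\n", pgd_span >> 30);  /* 4 GB */
    return 0;
}

Everything adds up to the same 4 GB address space that the 4096 hardware level-1 descriptors cover with 1 MB each.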

Classic page table The classic ARM32 MMU page table layout uses 32bit descriptors in two levels: two level-1 descriptors in the global page table (TTB, what Linux calls PGD, page global directory) are grouped into one PMD (page middle directory). There are 4096 level-1 descriptors grouped in pairs to form a PMD so there are 2048 PMDs. Then 256 + 256 = 512 level-2 page table pointers are grouped into one PTE (page table entry) with some metadata right above it. What Linux sees is a PMD managing 2MB and a PTE managing 512 4K pages which is also 2 MB. This way a PTE fills exactly one page of memory, which simplifies things. The level-2 tables of 4K “small page” descriptors are what ARM refers to as “coarse” page tables.

Level 1 and Level 2 classic descriptors The rough format of the 32bit Level-1 and Level-2 page descriptors used in the classic ARM32 MMU, which is what most ARM32 Linux kernels use. A “coarse” level-1 descriptor points to a table of “small page” (4KB) level-2 descriptors. As can be seen, the memory area covered at each level is mapped by simply using the high bits of the physical address, as is common in MMUs. Some domain and access control is stuffed into the remaining bits, and the bits explicitly set to 0 or 1 should be like so.

An LPAE page table set-up on the other hand is simpler and deeper than the classic MMU: the descriptors are 64bit wide and include all the bits that Linux needs. Four 64bit level-1 descriptors/PGD pointers covering the whole 32bit address space (0x00000000-0xFFFFFFFF), typically located at 0xC0003000, point to the 512 level-2 descriptors/PMD pointers at 0xC0004000, and these are 64bit but cover 2MB each, so they cover the same amount of virtual memory space as the classic MMU: instead of 1024 32bit “coarse” entries covering 1MB each, we have 512 64bit entries covering 2MB each. This makes 1 PGD pointer correspond exactly to 1 level-1 page descriptor and 1 PMD pointer correspond exactly to one level-2 page descriptor.

The level-2 descriptor points to a level-3 descriptor which is also 64bits wide and covers 4KB. At the third level, 512 4KB descriptors cover 2MB and perfectly fill out exactly one 4KB page of memory. Further, the descriptor layout in the level-3 descriptors fits hand-in-glove with Linux’ idea about what special bits it wants like “dirty”, “young” and “accessed”. PGD, PMD and PTE have identical concepts in hardware and this makes things much more intuitive. ARM64/Aarch64 also uses this style of page descriptors.

LPAE page table The LPAE page table layout is simpler and more intuitive than the classic MMU. 64bit pointers on each level in three levels, where one descriptor corresponds to one PGD, PMD or PTE entry. We use 4 consecutive PGD pointers at 0xC0003000; one PMD has 512 64-bit pointers to PTEs in each of the 4 pages at 0xC0004000, 0xC0005000, 0xC0006000 and 0xC0007000; and one PTE is exactly one page (0x1000 bytes) and consists of exactly 512 64-bit translation pointers.

My previous article on how the kernel decompresses generated a lot of traffic and commentary, much to my surprise. I suppose that this may be because musings of this kind fill the same niche as the original Lions' Commentary on UNIX 6th Edition, with Source Code which was a major hit in the late 1970s. Operating system developers simply like to read expanded code comments, which is what this is.

When I’m talking about “ARM32” the proper ARM name for this is Aarch32 and what is implemented physically in the ARMv4 thru ARMv7 ARM architectures.

In this post I will discuss how the kernel bootstraps itself from executing in physical memory after decompression/boot loader and all the way to executing generic kernel code written in C from virtual memory.

This Is Where It All Begins

After decompression and after augmenting and passing the device tree blob (DTB) the ARM32 kernel is invoked by putting the program counter pc at the physical address of the symbol stext(), start of text segment. This code can be found in arch/arm/kernel/head.S.

A macro __HEAD places the code here in a linker section called .head.text and if you inspect the linker file for the ARM architecture at arch/arm/kernel/vmlinux.lds.S you will see that this means that the object code in this section will be, well, first.

This will be at a physical address evenly divisible by 16MB, plus an additional 32KB TEXT_OFFSET (which will be explained more later), so you will find stext() at an address such as 0x10008000, which is the address we will use in our example.

head.S contains a small forest of exceptions for different old ARM platforms making it hard to tell the trees from the forest. The standards for ATAGs and device tree boots came about since it was written so this special code has become increasingly complex over the years.

To understand the following you need a basic understanding of paged virtual memory. If Wikipedia is too terse, refer to Hennesy & Patterson’s book “Computer Architecture: A Quantitative Approach”. Knowing a bit of ARM assembly language and Linux kernel basics is implied.

ARM’s Virtual Memory Split

Let’s first figure out where in virtual memory the kernel will actually execute. The kernel RAM base is defined in the PAGE_OFFSET symbol, which is something you can configure. The name PAGE_OFFSET shall be understood as the virtual memory offset for the first page of kernel RAM.

You choose 1 out of 4 memory split alternatives, which reminds me of fast food restaurants. It is currently defined like this in arch/arm/Kconfig:

config PAGE_OFFSET
        hex
        default PHYS_OFFSET if !MMU
        default 0x40000000 if VMSPLIT_1G
        default 0x80000000 if VMSPLIT_2G
        default 0xB0000000 if VMSPLIT_3G_OPT
        default 0xC0000000

First notice that if we don’t have an MMU (such as when running on ARM Cortex-R class devices or old ARM7 silicon) we will create a 1:1 mapping between physical and virtual memory. The page table will then only be used to populate the cache and addresses are not rewritten. PAGE_OFFSET could very typically be at address 0x00000000 for such set-ups. Using the Linux kernel without virtual memory is referred to as “uClinux” and was for some years a fork of the Linux kernel before being brought in as part of the mainline kernel.

Not using virtual memory is considered an oddity when using Linux or indeed any POSIX-type system. Let us from here on assume a boot using virtual memory.

The PAGE_OFFSET virtual memory split symbol creates a virtual memory space at the addresses above it, for the kernel to live in. So the kernel keeps all its code, state and data structures (including the virtual-to-physical memory translation table) at one of these locations in virtual memory:

  • 0x40000000-0xFFFFFFFF
  • 0x80000000-0xFFFFFFFF
  • 0xB0000000-0xFFFFFFFF
  • 0xC0000000-0xFFFFFFFF

Of these four, the last location at 0xC0000000-0xFFFFFFFF is by far the most common. So the kernel has 1GB of address space to live in.

The memory below the kernel, from 0x00000000-PAGE_OFFSET-1, i.e. typically at address 0x00000000-0xBFFFFFFF (3 GB) is used for userspace code. This is combined with the Unix habit to overcommit, which means that you optimistically offer programs more virtual memory space than what you have physical memory available for. Every time a new userspace process starts up it thinks that it has 3 gigabytes of memory to use! This type of overcommit has been a characteristic of Unix systems since their inception in the 1970s.

Why are there four different splits?

This is pretty straightforward to answer: ARM is heavily used for embedded systems and these can be either userspace-heavy (such as an ordinary tablet or phone, or even a desktop computer) or kernel-heavy (such as a router). Most systems are userspace-heavy or have so little memory that the split doesn’t really matter (neither will be very occupied), so the most common split is to use PAGE_OFFSET 0xC0000000.

Kernelspace userspace split The most common virtual memory split between kernelspace and userspace memory is at 0xC0000000. A notice on these illustrations: when I say memory is “above” something I mean lower in the picture, along the arrow, toward higher addresses. I know that some people perceive this as illogical and turn the figure upside-down with 0xFFFFFFFF on the top, but it is my personal preference and also the convention used in most hardware manuals.

It could happen that you have a lot of memory and a kernel-heavy use case, such as a router or NAS with lots of memory, say a full 4GB of RAM. You would then like the kernel to be able to use some of that memory for page cache and network cache to speed up your most typical use case, so you will select a split that gives you more kernel memory, such as in the extreme case PAGE_OFFSET 0x40000000.

This virtual memory map is always there, also when the kernel is executing userspace code. The idea is that by keeping the kernel constantly mapped in, context switches from userspace to kernelspace become really quick: when a userspace process wants to ask the kernel for something, you do not need to replace any page tables. You just issue a software trap to switch to supervisor mode and execute kernel code, and the virtual memory set-up stays the same.

Context switches between different userspace processes also become quicker: you only need to replace the lower part of the page table. The kernel mapping, being simple, linearly mapped and backed by a predetermined chunk of physical RAM (the physical RAM where it was loaded to begin with), typically even stays in the translation lookaside buffer, the “special fast translation tables in silicon”, which makes it even faster to get into kernelspace. These addresses are always present, always linearly mapped, and will never generate a page fault.

Where Are We Executing?

We continue looking at arch/arm/kernel/head.S, symbol stext().

The next step is to deal with the fact that we are running in some unknown memory location. The kernel can be loaded anywhere (provided it is a reasonably even address) and just executed, so now we need to deal with that. Since the kernel code is not position-independent, it has been linked to execute at a certain address by the linker right after compilation. We don’t know that address yet.

The kernel first checks for some special features like virtualization extensions and LPAE (large physical address extension) then does this little dance:

    adr        r3, 2f
    ldmia      r3, {r4, r8}
    sub        r4, r3, r4            @ (PHYS_OFFSET - PAGE_OFFSET)
    add        r8, r8, r4            @ PHYS_OFFSET
    (...)
    2:         .long        .
               .long      PAGE_OFFSET

.long . is link-time assigned to the address of the label 2: itself, so . is resolved to the address that the label 2: is actually linked to, where the linker imagines that it will be located in memory. This location will be in the virtual memory assigned to the kernel, i.e. typically somewhere above 0xC0000000.

After that is the compiled-in constant PAGE_OFFSET, and we already know that this will be something like 0xC0000000.

We load the compile-time address of 2: into r4 and the constant PAGE_OFFSET into r8. Then we subtract r4 from the actual address of 2:, which we have obtained in r3 using a PC-relative instruction. Remember that ARM assembly argument order is like a pocket calculator: sub ra, rb, rc => ra = rb - rc.

The result is that we have obtained in r4 the offset between the address the kernel was compiled to run at, and the address it is actually running at. So the comment @ (PHYS_OFFSET - PAGE_OFFSET) means we have this offset. If the kernel symbol 2: was compiled to execute at 0xC0001234 in virtual memory but we are now executing at 0x10001234, r4 will contain 0x10001234 - 0xC0001234 = 0x50000000. This can be understood as “-0xB0000000” because the 32-bit arithmetic wraps around: 0xC0001234 + 0x50000000 = 0x10001234, QED.

Next we add this offset to the compile-time assigned PAGE_OFFSET that we know will be something like 0xC0000000. With wrap-around arithmetic, if we are again executing at 0x10000000 we get 0xC0000000 + 0x50000000 = 0x10000000 and we obtained the base physical address where the kernel is currently actually executing in r8 – hence the comment @PHYS_OFFSET. This value in r8 is what we are actually going to use.
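The same wrap-around arithmetic in a few lines of standalone C (unsigned 32-bit math behaves just like the ARM sub/add instructions here), using the example addresses from the text:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t linked_addr  = 0xc0001234; /* where "2:" was linked to run */
    uint32_t running_addr = 0x10001234; /* where "2:" actually executes */
    uint32_t page_offset  = 0xc0000000; /* compile-time PAGE_OFFSET */

    uint32_t delta = running_addr - linked_addr; /* 0x50000000 */
    uint32_t phys_offset = page_offset + delta;  /* 0x10000000 */

    printf("delta       = 0x%08x\n", delta);
    printf("PHYS_OFFSET = 0x%08x\n", phys_offset);
    return 0;
}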

Older ARM kernels had a symbol called PLAT_PHYS_OFFSET that contained exactly this offset (such as 0x10000000), albeit compile-time assigned. We don’t do that anymore: we assign this dynamically as we shall see. If you are working with operating systems less sophisticated than Linux you will find that the developers usually just assume something like this in order to make things simple: the physical offset is a constant. Linux has evolved to this because we need to handle booting of one single kernel image on all kinds of memory layouts.

Physical virtual split 1 I do not know if this picture makes things easier or harder to understand. It illustrates physical to virtual memory mapping in our example.

There are some rules around PHYS_OFFSET: it needs to adhere to some basic alignment requirements. When we determine the location of the first block of physical memory in the uncompress code we do so by performing PHYS = pc & 0xF8000000, which means that the physical RAM must start at an even 128 MB boundary. If it starts at 0x00000000 that’s great for example.

There are some special considerations around this code for the XIP “execute in place” case when the kernel is executing from ROM (read only memory) but we leave that aside, it is another oddity, even less common than not using virtual memory.

Notice one more thing: you might have attempted to load an uncompressed kernel and boot it and noticed that the kernel is especially picky about where you place it: you better load it to a physical address like 0x00008000 or 0x10008000 (assuming that your TEXT_OFFSET is 0x8000). If you use a compressed kernel you get away from this problem because the decompressor will decompress the kernel to a suitable location (very often 0x00008000) and solve this problem for you. This is another reason why people might feel that compressed kernels “just work” and is kind of the norm.

Patching Physical to Virtual (P2V)

Now that we have the offset between virtual memory where we will be executing and the physical memory where we are actually executing, we reach the first sign of the Kconfig symbol CONFIG_ARM_PATCH_PHYS_VIRT.

This symbol was created because developers were creating kernels that needed to boot on systems with different memory configurations without recompiling the kernel. The kernel would be compiled to execute at a certain virtual address, like 0xC0000000, but can be loaded into memory at 0x10000000 like in our case, or maybe at 0x40000000, or some other address.

Most of the symbols in the kernel will not need to care of course: they are executed in virtual memory at an address to which they were linked and to them we are always running at 0xC0000000. But now we are not writing some userspace program: things are not easy. We have to know about the physical memory where we are executing, because we are the kernel, which means that among other things we need to set up the physical-to-virtual mapping in the page tables, and regularly update these page tables.

Also, since we don’t know where in the physical memory we will be running, we cannot rely on any cheap tricks like compile-time constants, that is cheating and will create hard to maintain code full of magic numbers.

For converting between physical and virtual addresses the kernel has two functions: __virt_to_phys() and __phys_to_virt(), converting kernel addresses in each direction (not any other addresses than those used by the kernel memory). This conversion is linear in the memory space (uses an offset in each direction) so it should be possible to achieve this with a simple addition or subtraction. And this is what we set out to do, and we give it the name “P2V runtime patching”. This scheme was invented by Nicolas Pitre, Eric Miao and Russell King in 2011 and in 2013 Santosh Shilimkar extended the scheme to also apply to LPAE systems, specifically the TI Keystone SoC.

The main observation is that if it holds that, for a kernel physical address PHY and a kernel virtual address VIRT (you can convince yourself of these two by looking at the last illustration):

PHY = VIRT - PAGE_OFFSET + PHYS_OFFSET
VIRT = PHY - PHYS_OFFSET + PAGE_OFFSET

Then by pure laws of arithmetics it also holds that:

PHY = VIRT + (PHYS_OFFSET - PAGE_OFFSET)
VIRT = PHY - (PHYS_OFFSET - PAGE_OFFSET)

So it is possible to always calculate the physical address from a virtual address by an addition of a constant and to calculate the virtual address from a physical address by a subtraction, QED. That is why the initial stubs looked like this:

static inline unsigned long __virt_to_phys(unsigned long x)
{
    unsigned long t;
    __pv_stub(x, t, "add");
    return t;
}

static inline unsigned long __phys_to_virt(unsigned long x)
{
    unsigned long t;
    __pv_stub(x, t, "sub");
    return t;
}

The __pv_stub() would contain an assembly macro to do add or sub. Since then the LPAE support for more than 32-bit addresses made this code a great deal more complex, but the general idea is the same.

Whenever __virt_to_phys() or __phys_to_virt() is called in the kernel, it is replaced with a piece of inline assembly code from arch/arm/include/asm/memory.h, then the linker switches section to a section called .pv_table, and then adds an entry to that section with a pointer back to the assembly instruction it just added. This means that the section .pv_table will expand into a table of pointers to any instance of these assembly inlines.

During boot, we will go through this table, take each pointer, inspect the instruction it points to, and patch it using the offset between the virtual memory where the kernel was linked and the physical memory where it was actually loaded.

Patching phys to virt Each and every location with an assembly macro doing a translation from physical to virtual memory is patched in the early boot process.

Why are we doing this complex manoeuvre instead of just storing the offset in a variable? This is for efficiency reasons: it is on the hot data path of the kernel. Calls updating page tables and cross-referencing physical to virtual kernel memory are extremely performance-critical: all use cases exercising the kernel virtual memory, whether block layer or network layer operations, or user-to-kernelspace translations, in principle any data passing through the kernel, will at some point call these functions. They must be fast.

It would be wrong to call this a simple solution to the problem. Alas it is a really complicated solution to the problem. But it works and it is very efficient!

Looping through the patch table

We perform the actual patching by figuring out the offset as illustrated above, and iteratively patching all the assembly stubs. This is done in the call to the symbol __fixup_pv_table, where our just calculated offset in r8 comes into play: a table of 5 words holding the addresses we need in physical memory is read into registers r3..r7 and adjusted using the same method described above (that is why the table is preceded by a .long .):

__fixup_pv_table:
	adr	r0, 1f
	ldmia	r0, {r3-r7}
	mvn	ip, #0
	subs	r3, r0, r3	@ PHYS_OFFSET - PAGE_OFFSET
	add	r4, r4, r3	@ adjust table start address
	add	r5, r5, r3	@ adjust table end address
	add	r6, r6, r3	@ adjust __pv_phys_pfn_offset address
	add	r7, r7, r3	@ adjust __pv_offset address
	mov	r0, r8, lsr #PAGE_SHIFT	@ convert to PFN
	str	r0, [r6]	@ save computed PHYS_OFFSET to __pv_phys_pfn_offset
	(...)
	b	__fixup_a_pv_table

1:	.long	.
	.long	__pv_table_begin
	.long	__pv_table_end
2:	.long	__pv_phys_pfn_offset
	.long	__pv_offset

The code uses the first value loaded into r3 to calculate the offset to physical memory, then adds that to each of the other registers so that r4 thru r7 now point directly to the physical locations of these labels. So r4 is a pointer to the physical memory storing __pv_table_begin, r5 points to __pv_table_end, r6 points to __pv_phys_pfn_offset and r7 points to __pv_offset. In C these would be u32 *, so pointers to some 32 bit numbers.

__pv_phys_pfn_offset is especially important: the name means “patch physical-to-virtual, page frame number offset”. We calculate it by making a logical shift right with mov r0, r8, lsr #PAGE_SHIFT using the value we earlier calculated in r8 (the offset from 0 to the kernel memory, in our case 0x10000000), and write it right into whatever location actually stores that variable with str r0, [r6]. It isn’t used at this early stage of the kernel start-up but it will be used by the virtual memory manager later.
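The arithmetic is simple enough to write out; a minimal sketch with the example values (not kernel code):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12    /* 4KB pages */

int main(void)
{
    uint32_t phys_offset = 0x10000000;    /* the value held in r8 */
    uint32_t pfn_offset = phys_offset >> PAGE_SHIFT;

    /* prints: __pv_phys_pfn_offset = 0x10000 */
    printf("__pv_phys_pfn_offset = 0x%x\n", pfn_offset);
    return 0;
}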

Next we call __fixup_a_pv_table which will loop over the addresses from r4 thru r5 (the table of pointers to instructions to be patched) and patch each of them in order, using a custom binary patcher that converts the instructions, whether ARM or Thumb2 (the instruction set is known at compile time), into ones with the immediate offset between physical and virtual memory encoded right into them. The code is pretty complex and contains quirks to accommodate big endian byte order as well.

Notice that this also has to happen every time the kernel loads a module: who knows if the new module needs to convert between physical and virtual addresses! For this reason all module ELF files will contain a .pv_table section with the same kind of table, and the very same assembly loop is called every time we load a module.

Setting Up the Initial Page Table

Before we can start to execute in this virtual memory we have just chiseled out, we need to set up an MMU translation table for mapping the physical memory to virtual memory. This is commonly known as a page table, albeit what we will use for the initial map is not pages but sections. The ARM architecture mandates that this must be placed on an even 16KB boundary in physical memory. Since it is also always 16KB in size this makes sense.

The location of the initial page table is defined in a symbol named swapper_pg_dir, “swapper page directory”, which is an idiomatic name for the initial page table for the kernel. As you realize it will later on be replaced (swapped, hence “swapper”) by a more elaborate page table.

The name “page table” is a bit misleading because what the initial mapping is using in ARM terminology is actually called sections, not pages. But the term is used a bit fuzzily to refer to “that thing which translates physical addresses to virtual addresses at boot”.

The symbol swapper_pg_dir is defined to be at KERNEL_RAM_VADDR - PG_DIR_SIZE. Let’s examine each of these.

KERNEL_RAM_VADDR is just what you would suspect: the address in the virtual memory where the kernel is located. It is the address to which the kernel was linked during compile.

The KERNEL_RAM_VADDR is defined to (PAGE_OFFSET + TEXT_OFFSET). PAGE_OFFSET can be one of four locations from the Kconfig symbol we just investigated, typically 0xC0000000. The TEXT_OFFSET is typically 0x8000, so KERNEL_RAM_VADDR will typically be 0xC0008000, but with a different virtual memory split setting or an exotic TEXT_OFFSET it will be something different. TEXT_OFFSET comes from textofs-y in arch/arm/Makefile and is usually 0x8000, but it can also be 0x00208000 on some Qualcomm platforms and 0x00108000 on some Broadcom platforms, making KERNEL_RAM_VADDR 0xC0208000 or similar.

What we know for sure is that this address was passed to the linker when linking the kernel: if you inspect the linker file for the ARM architecture at arch/arm/kernel/vmlinux.lds.S, as we did in the beginning of the article, you can see that the linker is indeed instructed to place this at . = PAGE_OFFSET + TEXT_OFFSET. The kernel has been compiled to execute at the address given in KERNEL_RAM_VADDR, even if the very earliest code, that we are analyzing right now, has been written in assembly to be position-independent.

TEXT OFFSET illustration The TEXT_OFFSET is a small, typically 32KB, area at the start of the kernel RAM in both physical and virtual memory space. The location of the kernel in the physical RAM at 0x10000000 is just an example; this can be on any even 16MB boundary.

Notice the little gap between PAGE_OFFSET and the kernel proper, most commonly at 0xC0000000-0xC0007FFF (the typical 32KB of TEXT_OFFSET). While part of the kernelspace memory this is not memory that the kernel was compiled into: it contains the initial page table, potentially ATAGs from the boot loader and some scratch area, and could be used for interrupt vectors if the kernel is located at 0x00000000. The initial page table here is certainly programmed and used by the kernel, but it is abandoned afterwards.

Notice that we add the same TEXT_OFFSET to physical and virtual memory alike. This is even done in the decompression code, which will decompress the first byte of the kernel to PHYS_OFFSET + TEXT_OFFSET. The linear delta between a kernel location in physical memory and the same location in virtual memory can therefore always be expressed in a few high bits of a 32bit word, such as bits 24 thru 31 (8 bits), making it possible to use immediate arithmetic and include the offset directly in the instruction when facilitating ARM_PATCH_PHYS_VIRT. From this comes the restriction that kernel RAM must begin at an address evenly divisible by 16MB (0x01000000).

So we find the location of swapper_pg_dir in virtual memory by adding TEXT_OFFSET to PAGE_OFFSET and then backing off PG_DIR_SIZE from there. In the common case this will be 0xC0000000 + 0x8000 - 0x4000, so the initial page table will be at 0xC0004000 in virtual memory and at the corresponding offset in physical memory; in my example with PHYS_OFFSET being 0x10000000 it will be at 0x10004000.

Inital page translation table in swapper pg dir The swapper_pg_dir symbol will in our example reside 16KB (0x4000 bytes) before the .text segment in physical memory, at 0x10004000 given that our PHYS_OFFSET is 0x10000000 and TEXT_OFFSET is 0x8000 while using classic ARM MMU.

If you use LPAE, PG_DIR_SIZE is 0x5000, so we end up at 0xC0003000 virtual and 0x10003000 physical. The assembly macro pgtbl calculates this for us: we take the physical address we have calculated in r8, add TEXT_OFFSET and subtract PG_DIR_SIZE, and we end up on the physical address of the initial translation table.
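The same calculation in C, a minimal sketch using the example values from this article (not kernel code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t page_offset = 0xC0000000;    /* virtual start of kernel RAM */
    uint32_t phys_offset = 0x10000000;    /* physical start of kernel RAM */
    uint32_t text_offset = 0x8000;
    uint32_t pg_dir_size = 0x4000;        /* 0x5000 for LPAE */

    /* swapper_pg_dir sits PG_DIR_SIZE below the kernel start */
    printf("swapper_pg_dir: virt 0x%08x phys 0x%08x\n",
           page_offset + text_offset - pg_dir_size,
           phys_offset + text_offset - pg_dir_size);
    /* prints: swapper_pg_dir: virt 0xc0004000 phys 0x10004000 */
    return 0;
}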

The initial page table is constructed in a very simple way: we first fill the page table with zeroes, then build the initial page table. This happens at symbol __create_page_tables.

The ARM32 Page Table Format

The page table layout on ARM32 consists of 2 or 3 levels. The 2-level page table is referred to as the short format page table and the 3-level format is called the long format page table in ARM documentation. The long format is a feature of the large physical address extension, LPAE, and as the name says it is there to handle larger physical memories requiring up to 40 bits of physical address.

These translation tables can translate memory in sections of 1MB (this is 2MB on LPAE but let’s leave that aside for now) or in pages of 4kB. The initial page table uses sections, so memory translation in the initial page table is exclusively done in sections of 1MB.

This simplifies things for the initial mapping: sections can be encoded directly in the first level of the page table. This way there is in general no need to deal with the complex 2-level hierarchy, and we are essentially treating the machine as something with 1-level 1MB section tables.

However if we are running on LPAE there is another level in between, so we need to deal with the first two levels and not just one level. That is why you find some elaborate LPAE code here. All it does is insert a 64bit pointer to our crude 2MB-sections translation table.

Page table format In the classic MMU we just point the MMU to a 1st level table with 1MB sections while for LPAE we need an intermediate level to get to the similar format with 2MB. The entries are called section descriptors.

Linux Page Table Lingo

At this point the code starts to contain some three-letter acronyms. It is dubious whether these are actually meaningful when talking about this crude initial page table, but developers refer to these short-hands out of habit.

  • PGD page global directory is Linux lingo for the uppermost translation table, where the whole MMU traversal to resolve sections and pages starts. In our case this is at 0x10004000 in physical memory. In the ARM32 world, this is also the address that we write into the special CP15 translation table register (sometimes called TTBR0) to tell the MMU where to find the translations. If we were using LPAE this would be at 0x10003000, just to make space for the extra 0x1000 bytes holding a single 64bit pointer to the next level, called the PMD.
  • P4D page 4th level directory and PUD page upper directory are concepts in the Linux VMM (virtual memory manager) unused on ARM as these deal with translation table hierarchies of 4 or 5 levels and we use only 2 or 3.
  • PMD page middle directory is the name for the middle translation level, used only in the 3-level LPAE set-up. This is the reason we need to reserve another 0x1000 bytes for the LPAE initial page table. For classic ARM MMU (non-LPAE) the PMD and the PGD are the same thing. This is why the code refers to PMD_ORDER for classic MMU and LPAE alike. Because of Linux VMM terminology this is understood as the “format of the table right above the PTE:s”, and since we’re not using PTEs but section mappings, the “PMD section table” is the final product of our mappings.
  • PTE:s, page table entries maps pages of RAM from physical to virtual memory. The initial boot translation table does not use these. We use sections.

Notice again: for the classic ARM MMU, PGD and PMD are the same thing. For LPAE they are two different things. This is sometimes referred to as “folding” PMD into PGD. Since in a way we are “folding” P4D and PUD as well, it becomes increasingly confusing.

If you work with virtual memory you will need to repeat the above from time to time. For our purposes of constructing the initial “page” table none of this really matters. What we are constructing is a list of 1MB sections mapping virtual memory to physical memory. We are not even dealing with pages at this point, so the entire “pages” terminology is mildly confusing.

The Binary Format of the Translation Table

Let us consider only the classic ARM MMU in the following.

From physical address 0x10004000 to 0x10007FFF we have 0x1000 (4096) section descriptors of 32 bits (4 bytes) each. How are these used?

The program counter and all CPU access will be working on virtual addresses after we turn on the MMU, so the translation will work this way: a virtual address is translated into a physical address before we access the bus.

The way the physical address is determined from the virtual address is as follows:

We take bits 31..20 of the virtual address and use this as an index into the translation table to look up a 32bit section descriptor.

  • Addresses 0x00000000-0x000FFFFF (the first 1MB) in virtual memory will be translated by the index at 0x10004000-0x10004003, the first four bytes of the translation table, we call this index 0.
  • Virtual address 0x00100000-0x001FFFFF will be translated by index 1 at 0x10004004-0x10004007, so index times four is the byte address of the descriptor.
  • Virtual address 0xFFF00000-0xFFFFFFFF will be translated by index 0xFFF at 0x10007FFC-0x10007FFF

Oh. Clever. 0x4000 (16KB) of section descriptors exactly covers the 32bit, i.e. 4GB, memory space in 1MB chunks. So using this MMU table we can map any 1MB chunk of virtual memory to any 1MB chunk of physical memory. This is not a coincidence.

This means for example that for the kernel virtual base 0xC0000000 the index into the table will be 0xC0000000 >> 20 = 0xC00, and since the index needs to be multiplied by 4 to get the actual byte address we get 0xC00 * 4 = 0x3000, so at physical address 0x10004000 + 0x3000 = 0x10007000 we find the section descriptor for the first 1MB of the kernelspace memory.
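A minimal sketch of this index arithmetic in C, using the example values (not kernel code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t table_phys = 0x10004000;    /* physical base of swapper_pg_dir */
    uint32_t vaddr = 0xC0000000;         /* kernel virtual base */

    uint32_t index = vaddr >> 20;                   /* bits 31..20: 1MB section index */
    uint32_t desc_phys = table_phys + index * 4;    /* 4 bytes per descriptor */

    /* prints: index 0xc00, descriptor at 0x10007000 */
    printf("index 0x%x, descriptor at 0x%08x\n", index, desc_phys);
    return 0;
}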

The 32-bit 1MB section descriptors we will use for our maps may have something like this format:

Mock section descriptor

The MMU will look at bits 1,0 and determine that “10” means that this is a section mapping. We set up some default values for the other bits. Other than that we only really care about setting bits 31..20 to the right physical address and we have a section descriptor that will work. And this is what the code does.
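To make the descriptor format concrete, here is a mock of how such a section descriptor could be put together in C. The flags value is deliberately left as a parameter; the real bits come from the per-CPU mm_mmuflags that we will meet below:

#include <stdint.h>

/* Build a classic MMU 1MB section descriptor: bits 31..20 are the physical
 * section base, bits 1..0 are "10" to mark a section mapping, and the rest
 * come from the per-CPU MMU flags. */
static uint32_t make_section_desc(uint32_t phys, uint32_t mmuflags)
{
    return (phys & 0xFFF00000) | mmuflags | 0x2;
}

Mapping the first megabyte of kernelspace in the example would then amount to writing make_section_desc(0x10000000, mmuflags) into the slot at 0x10007000.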

For LPAE this story is a bit different: we use 64bit section descriptors (8 bytes) but at the same time, the sections are bigger: 2 MB instead of 1 MB, so in the end the translation table will have exactly the same size of 0x4000 bytes.

Identity mapping around MMU enablement code

First of all we create an identity mapping around the symbol __turn_mmu_on, meaning this snippet of code will be executing in memory that is mapped 1:1 between physical and virtual addresses. If the code is at 0x10009012 then the virtual address for this code will also be at 0x10009012. If we inspect the code we see that it is put into a separate section called .idmap.text. Creating a separate section means that this gets linked into a separate physical page with nothing else in it, so we map an entire 1MB section for just this code (maybe even two, if it happens to align just across a section boundary in memory), so that an identity mapping is done for this and this code only.

If we think about it this is usually cool, even if it spans say 2MB of memory: if we loaded the kernel at 0x10000000 as in our example, this code will be at something like 0x10000120, and an identity mapping at that address is never going to disturb the kernel at 0xC0000000, or even at 0x40000000 which is an extreme case of kernel memory split. If someone were to place the start of physical memory at say 0xE0000000 we would be in serious trouble. Let’s pray that this never happens.

Mapping the rest

Next we create the physical-to-virtual memory mapping for the main matter, starting at PHYS_OFFSET in the physical memory (the value we calculated in r8 in the section “Where Are We Executing”) and PAGE_OFFSET (a compile-time constant) in the virtual memory, then we move along one section at a time all the way until we reach the symbol _end in virtual memory, which is set to the end of the .bss section of the kernel object itself.

Initial virtual memory mapping The initial page table swapper_pg_dir and the 1:1 mapped section around __turn_mmu_on alongside the physical to virtual memory mapping at early boot. In this example we are not using LPAE so the initial page table is -0x4000 from the kernel start (PHYS_OFFSET + TEXT_OFFSET) and memory ends at 0xFFFFFFFF.

BSS is an idiomatic name for the section at the very end of the binary kernel footprint in memory, and this is where the C compiler assigns locations for all uninitialized runtime variables. The addresses for this section are well defined, but there is no binary data for them: this memory has undefined contents, i.e. whatever was in it when we mapped it in.

It is worth studying this loop of assembly in detail to understand how the mapping code works:

    ldr    r7, [r10, #PROCINFO_MM_MMUFLAGS] @ mm_mmuflags
    (...)
    add    r0, r4, #PAGE_OFFSET >> (SECTION_SHIFT - PMD_ORDER)
    ldr    r6, =(_end - 1)
    orr    r3, r8, r7
    add    r6, r4, r6, lsr #(SECTION_SHIFT - PMD_ORDER)
1:  str    r3, [r0], #1 << PMD_ORDER
    add    r3, r3, #1 << SECTION_SHIFT
    cmp    r0, r6
    bls    1b

Let us take this step by step. We assume that we use non-LPAE classic ARM MMU in this example (you can convince yourself that the same holds for LPAE).

add    r0, r4, #PAGE_OFFSET >> (SECTION_SHIFT - PMD_ORDER)

r4 contains the physical address of the page table (PGD or PMD) where we will set up our sections. (SECTION_SHIFT - PMD_ORDER) resolves to (20 – 2) = 18, so we take PAGE_OFFSET 0xC0000000 >> 18 = 0x3000 which is incidentally the byte offset into the translation table for 0xC0000000 as we saw earlier: aha. Well that makes sense, since each table entry is 4 bytes. So whenever we see (SECTION_SHIFT - PMD_ORDER) that means “convert this virtual address to a byte offset into the translation table”; added to the table base in r4 this gives 0x10007000 in our example case.

So the first statement generates the physical address for the first 32bit section descriptor for the kernelspace memory in r0.

ldr    r6, =(_end - 1)

r6 is obviously set to the very end of the kernelspace memory.

    ldr    r7, [r10, #PROCINFO_MM_MMUFLAGS] @ mm_mmuflags
    (...)
    orr    r3, r8, r7

r8 contains the PHYS_OFFSET, in our case 0x10000000 (we count on bits 19..0 to be zero) and we OR that with r7, which holds what is jokingly indicated as “who cares” in the illustration above: a per-CPU setting for the MMU flags defined for each CPU in arch/arm/mm/proc-*.S. Each of these files contains a special section named .proc.info.init and at index PROCINFO_MM_MMUFLAGS (which will be something like 0x08) there is the right value to OR into the section descriptor for the specific CPU we are using. The struct itself is named struct proc_info_list and can be found in arch/arm/include/asm/procinfo.h. Since assembly cannot really handle C structs some indexing trickery is used to get at this magic number.

So the section descriptor has a physical address in bits 31-20 and this value in r7 sets some more bits, like the two lowest, so that the MMU can handle this section descriptor right.

add    r6, r4, r6, lsr #(SECTION_SHIFT - PMD_ORDER)

This will construct the absolute physical address of the section descriptor for the last megabyte of memory that we map. We are not OR:ing on the value from r7: we will just use this for a loop comparison, not actually write it into the translation table, so it is not needed.

So now r0 is the physical address of the first section descriptor we are setting up and r6 is the physical address of the last section descriptor we will set up. We enter the loop:

1:  str    r3, [r0], #1 << PMD_ORDER
    add    r3, r3, #1 << SECTION_SHIFT
    cmp    r0, r6
    bls    1b

This writes the first section descriptor into the MMU table, to the address in r0 which will start at 0x10007000 in our example. Then we post-increment the address in r0 with (1 << PMD_ORDER) which will be 4. Then we increase the descriptor physical address portion (bit 20 and upward) with 1 MB (1 << SECTION_SHIFT), and check whether we have reached the last descriptor, else loop to 1:.

This creates a virtual-to-physical map of the entire kernel including all segments and .bss in 1MB chunks.
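The same loop expressed in C, as a sketch (the real code is the assembly above; the register each variable corresponds to is noted in the comments):

#include <stdint.h>

#define SECTION_SHIFT    20
#define SECTION_SIZE     (1U << SECTION_SHIFT)

static void map_kernel_sections(uint32_t *pgd, uint32_t page_offset,
                                uint32_t phys_offset, uint32_t virt_end,
                                uint32_t mmuflags)
{
    uint32_t *desc = pgd + (page_offset >> SECTION_SHIFT);     /* r0 */
    uint32_t *last = pgd + ((virt_end - 1) >> SECTION_SHIFT);  /* r6 */
    uint32_t val = phys_offset | mmuflags;                     /* r3 */

    while (desc <= last) {
        *desc++ = val;          /* str r3, [r0], #1 << PMD_ORDER */
        val += SECTION_SIZE;    /* add r3, r3, #1 << SECTION_SHIFT */
    }
}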

Final mappings

Next we map some other things: we map the boot parameters specifically: this can be ATAGs or an appended device tree blob (DTB). ATAGs are usually in the very first page of memory (at PHYS_OFFSET + 0x0100), while contemporary DTBs are usually somewhere below the kernel. You will notice that it would probably be unwise to place the DTB too far above the kernel, lest it wrap around to lower addresses.

If we are debugging we also map the serial port specifically from/to predefined addresses in physical and virtual memory. This is done so that debugging can go on as we execute in virtual memory.

There is again an exception for “execute in place”: if we are executing from ROM we surely need to map the kernel from that specific address rather than the compile-time memory locations.

Jump to Virtual Memory

We are nearing the end of the procedure stext where we started executing the kernel.

First we call the per-cpu-type “procinit” function. This is some C/assembly clever low-level CPU management code that is collected per-CPU in arch/arm/mm/proc-*.S. For example most v7 CPUs have the initialization code in proc-v7.S and the ARM920 has its initialization code in proc-arm920.S. This will be useful later, but the “procinit” call is usually empty: only XScale really does anything here and it is about bugs pertaining to the boot loader initial state.

The way the procinit code returns is by the conventional ret lr which actually means that the value of the link register (lr) is assigned to the program counter (pc).

Before entering the procinit function we set lr to the physical address of label 1: which will make a relative branch to the symbol __enable_mmu. We also assign r13 the address of __mmap_switched, which is the compile-time-assigned, non-relative virtual address of the next execution point after the MMU has been enabled. We are nearing the end of relative code constructions.

We jump to __enable_mmu. r4 contains the address of the initial page table. We load this using a special CP15 instruction designed to load the page table pointer (in physical memory) into the MMU:

mcr    p15, 0, r4, c2, c0, 0

Nothing happens so far. The page table address is set up in the MMU but it is not yet translating between physical and virtual addresses. Next we jump to __turn_mmu_on. This is where the magic happens. __turn_mmu_on as we said is cleverly compiled into the section .idmap.text which means it has the same address in physical and virtual memory. Next we enable the MMU:

    mcr    p15, 0, r0, c1, c0, 0        @ write control reg
    mrc    p15, 0, r3, c0, c0, 0        @ read id reg

BAM! The MMU is on. The next instruction (which is incidentally an instruction cache flush) will be executed from virtual memory. We don’t notice anything at first, but we are executing in virtual memory. When we return by jumping to the address passed in r13, we enter __mmap_switched at the virtual memory address of this function, somewhere above PAGE_OFFSET (typically 0xC0nnnnnn). We can now use absolute addressing: the kernel is executing as intended.

Executing in virtual memory The little “dance” of the program counter as we switch execution from physical to virtual memory.

We have successfully bootstrapped the initial page table and we are now finally executing the kernel at the location the C compiler thought the kernel was going to execute at.

Now Let’s Get Serious

__mmap_switched can be found in the file arch/arm/kernel/head-common.S and will do a few specific things before we call the C runtime-enabled kernel proper.

First there is an exceptional clause, again for execute-in-place (XIP) kernels: while the .text segment of the kernel can certainly continue executing from ROM, we cannot store any kind of variables from the .data segment there. So this is first set up, either by simply copying the segment to RAM or, using some fancy extra code, by decompressing it to RAM (saving even more silicon memory).

Next we zero out the .bss segment we got to know earlier: in the Linux kernel we rely on static variables being initialized to zero. Other C runtime environments may not do this, but when you run the Linux kernel you can rely on static variables being zero the first time you enter a function.

The machine has now switched to virtual memory and fully prepared the C runtime environment. We have also patched all cross-references between physical and virtual memory. We are ready to go.

So next we save the processor ID, machine type and ATAG or DTB pointer and branch to the symbol start_kernel(). This symbol resolves to an absolute address and is a C function found a bit down in the init/main.c file. It is fully generic: it is the same point that all other Linux architectures call, so we have now reached the level of generic kernel code written in C.

Let’s see where this is: to disassemble the kernel I just use the objdump tool from the toolchain and pipe it to less:

arm-linux-gnueabihf-objdump -D vmlinux |less

By searching for start_kernel by /start_kernel in less and skipping to the second hit we find:

c088c9d8 <start_kernel>:
c088c9d8:       e92d4ff0        push    {r4, r5, r6, r7, r8, r9, sl, fp, lr}
c088c9dc:       e59f53e8        ldr     r5, [pc, #1000] ; c088cdcc
c088c9e0:       e59f03e8        ldr     r0, [pc, #1000] ; c088cdd0
c088c9e4:       e5953000        ldr     r3, [r5]
c088c9e8:       e24dd024        sub     sp, sp, #36     ; 0x24
c088c9ec:       e58d301c        str     r3, [sp, #28]
c088c9f0:       ebde25e8        bl      c0016198 <set_task_stack_end_magic>

Oh this is nice! We are executing C at address 0xC088C9D8, and we can now disassemble and debug the kernel as we like. When I have random crash dumps of the kernel I usually use exactly this method of combining objdump and less to disassemble the kernel and just search for the symbol that crashed me to figure out what is going on.

Another common trick among kernel developers is to enable low-level kernel debugging and putting a print at start_kernel() so that we know that we reach this point. My own favourite construction looks like this (I just insert these lines first in start_kernel()):

#if defined(CONFIG_ARM) && defined(CONFIG_DEBUG_LL)
{
    extern void printascii(char *);
    printascii("start_kernel\n");
}
#endif

As can be seen, getting low-level debug prints like this to work involves enabling CONFIG_DEBUG_LL, and then you should be able to see a life sign from the kernel before even the kernel banner “Linux …” is printed.

This file and function should be well known to Linux kernel developers and a good thing to do on a boring day is to read through the code in this file, as this is how the generic parts of Linux bootstrap themselves.

The happiness of generic code will not last as we soon call setup_arch() and we are then back down in arch/arm again. We know for sure that our crude initial translation table will be replaced by a more elaborate proper translation table. There is not yet any such thing as a virtual-to-physical mapping of the userspace memory for example. But this is the topic of another discussion.

ARM traditionally uses compressed kernels. This is done for two major reasons:

  • It saves space on the flash memory or other storage media holding the kernel, and memory is money. For example for the Gemini platform that I work on, the vmlinux uncompressed kernel is 11.8 MB while the compressed zImage is a mere 4.8 MB, so we save more than 50%
  • It is faster to load because the time it takes for the decompression to run is shorter than the time that it takes to transfer an uncompressed image from the storage media, such as flash. For NAND flash controllers this can easily be the case.

This is intended as a comprehensive rundown of how the Linux kernel self-decompresses on ARM 32-bit legacy systems. All machines under arch/arm/* use this method if they are booted using a compressed kernel, and most of them are using compressed kernels.

Bootloader

The bootloader, whether RedBoot, U-Boot or EFI, places the kernel image somewhere in physical memory and executes it, passing some parameters in the lower registers.

Russell King defined the ABI for booting the Linux kernel from a bootloader in 2002 in the Booting ARM Linux document. The boot loader puts 0 into register r0, an architecture ID into register r1 and a pointer to the ATAGs in register r2. The ATAGs would contain the location and size of the physical memory. The kernel would be placed somewhere in this memory. It can be executed from any address as long as the decompressed kernel fits. The boot loader then jumps to the kernel in supervisor mode, with all interrupts, MMUs and caches disabled.

On contemporary device tree kernels, r2 is repurposed as a pointer to the device tree blob (DTB) in physical memory. (In this case r1 is ignored.) A DTB can also be appended to the kernel image, and optionally amended using the ATAGs from r2. We will discuss this more below.
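From the boot loader’s point of view the hand-off boils down to something like the following sketch (the names are made up for illustration; real boot loaders such as U-Boot do the equivalent in their own code):

#include <stdint.h>

/* The kernel entry point takes r0 = 0, r1 = machine type, r2 = ATAGs or DTB */
typedef void (*kernel_entry_fn)(uint32_t zero, uint32_t machine_id,
                                uint32_t params_phys);

static void boot_linux(uint32_t kernel_load_phys, uint32_t machine_id,
                       uint32_t dtb_or_atags_phys)
{
    kernel_entry_fn entry = (kernel_entry_fn)(uintptr_t)kernel_load_phys;

    /* MMU, caches and interrupts are off, we are in supervisor mode */
    entry(0, machine_id, dtb_or_atags_phys);    /* never returns */
}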

Decompression of the zImage

If the kernel is compressed, execution begins in arch/arm/boot/compressed/head.S in the symbol start: a little bit down the file. (This is not immediately evident.) It begins with 8 or 7 NOP instructions for legacy reasons. It jumps over some magic numbers and saves the pointer to the ATAGs. So now the kernel decompression code is executing from the physical address where it was loaded.

The decompression code then locates the start of physical memory. On most modern platforms this is done with the Kconfig-selected code AUTO_ZRELADDR, which means a logical AND between the program counter and 0xf8000000. This means that the kernel readily assumes that it has been loaded and executed in the first part of the first block of physical memory.

There are patches being made that would instead attempt to get this information from the device tree.

Then the TEXT_OFFSET is added to the pointer to the start of physical memory. As the name says, this is where the kernel .text segment (as output from the compiler) should be located. The .text segment contains the executable code so this is the actual starting address of the kernel after decompression. The TEXT_OFFSET is usually 0x8000 so the kernel will be located 0x8000 bytes into the physical memory. This is defined in arch/arm/Makefile.

The 0x8000 (32KB) offset is a convention, because usually there is some immobile architecture-specific data placed at 0x00000000 such as interrupt vectors, and many elder systems place the ATAGs at 0x00000100. There also must be some space, because when the kernel finally boots, it will subtract 0x4000 (or 0x5000 for LPAE) from this address and store the initial kernel page table there.

For some specific platforms the TEXT_OFFSET will be pushed downwards in memory, notably some Qualcomm platforms will push it to 0x00208000 because the first 0x00200000 (2 MB) of the physical memory is used for shared memory communication with the modem CPU.
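A minimal sketch of the AUTO_ZRELADDR arithmetic, using the example values from this article (TEXT_OFFSET 0x8000; on the Qualcomm platforms just mentioned it would be 0x00208000 instead):

#include <stdint.h>
#include <stdio.h>

#define TEXT_OFFSET 0x8000

int main(void)
{
    uint32_t pc = 0x10300000;                  /* wherever the decompressor runs */
    uint32_t ram_start = pc & 0xf8000000;      /* assumed start of physical RAM */
    uint32_t text_start = ram_start + TEXT_OFFSET;

    /* prints: RAM at 0x10000000, kernel .text at 0x10008000 */
    printf("RAM at 0x%08x, kernel .text at 0x%08x\n", ram_start, text_start);
    return 0;
}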

Next the decompression code sets up a page table, if it is possible to fit one over the whole uncompressed+compressed kernel image. The page table is not for virtual memory, but for enabling cache, which is then turned on. The decompression will for natural reasons be much faster if we can use cache.

Next the kernel sets up a local stack pointer and malloc() area so we can handle subroutine calls and small memory allocation going forward, executing code written in C. This is set to point right after the end of the kernel image.

Memory set-up Compressed kernel in memory with an attached DTB.

Next we check for an appended DTB blob enabled by the ARM_APPENDED_DTB symbol. This is a DTB that is added to the zImage during build, often with the simple cat foo.dtb >> zImage. The DTB is identified using a magic number, 0xD00DFEED.
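Checking for that magic is straightforward; a sketch in C (the flattened device tree stores its magic big-endian):

#include <stdint.h>

/* Return nonzero if the four bytes at p look like the start of a DTB */
static int looks_like_fdt(const uint8_t *p)
{
    return p[0] == 0xd0 && p[1] == 0x0d && p[2] == 0xfe && p[3] == 0xed;
}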

If an appended DTB is found, and CONFIG_ARM_ATAG_DTB_COMPAT is set, we first expand the DTB by 50% and call atags_to_fdt(), which will augment the DTB with information from the ATAGs, such as memory blocks and sizes.

Next, the DTB pointer (what was passed in as r2 in the beginning) is overwritten with a pointer to the appended DTB. We also save the size of the DTB, and set the end of the kernel image after the DTB so the appended DTB (optionally modified with the ATAGs) is included in the total size of the compressed kernel. If an appended DTB was found, we also bump the stack and the malloc() location so we don’t destroy the DTB.

Notice: if a device tree pointer was passed in in r2, and an appended DTB was also supplied, the appended DTB “wins” and is what the system will use. This can sometimes be used to override a default DTB passed by a boot loader.

Notice: if ATAGs were passed in in r2, there certainly was no DTB passed in through that register. You almost always want the CONFIG_ARM_ATAG_DTB_COMPAT symbol if you use an elder boot loader that you do not want to replace, as the ATAGs properly define the memory on elder platforms. It is possible to define the memory in the device tree, but more often than not, people skip this and rely on the boot loader to provide it, one way (the bootloader alters the DTB) or another (the ATAGs augment the appended DTB at boot).

Memory overlap The decompressed kernel may overlap the compressed kernel.

Next we check if we would overwrite the compressed kernel with the uncompressed kernel. That would be unfortunate. If this would happen, we check where in the memory the uncompressed kernel would end, and then we copy ourselves (the compressed kernel) past that location.

Then the code simply does a trick to jump back to the relocated address of a label called restart: which is the start of the code to set up the stack pointer and malloc() area, but now executing at the new physical address.

This means it will again set up the stack and malloc() area and look for the appended DTB and everything will look like the kernel was loaded in this location to begin with. (With one difference though: we have already augmented the DTB with ATAGs, so that will not be done again.) This time the uncompressed kernel will not overwrite the compressed kernel.

Moving the compressed kernel We move the compressed kernel down so the decompressed kernel can fit.

There is no check for whether the memory runs out, i.e. if we would happen to copy the kernel beyond the end of the physical memory. If this happens, the result is unpredictable. This can happen if the memory is 8MB or less; in these situations, do not use compressed kernels.

Moving the compressed kernel below the decompressed kernel The compressed kernel is moved below the decompressed kernel.

Now we know that the kernel can be decompressed into a memory that is below the compressed image and that they will not collide during decompression and we execute at the label wont_overwrite:.

We check if we are executing on the address the decompressor was linked to, and possibly alter some pointer tables. This is for the C runtime environment executing the decompressor.

We make sure that the caches are turned on (it is not certain that there was space for a page table).

We clear the BSS area (so all uninitialized variables will be 0), also for the C runtime environment.

Next we call the decompress_kernel() symbol in boot/compressed/misc.c which in turn calls do_decompress() which calls __decompress() which will perform the actual decompression.

This is implemented in C and the type of decompression is different depending on Kconfig options: the same decompressor as the compression selected when building the kernel will be linked into the image and executed from physical memory. All architectures share the same decompression library. The __decompress() function called will depend on which of the decompressors in lib/decompress_*.c that was linked into the image. The selection of decompressor happens in arch/arm/boot/compressed/decompress.c by simply including the whole decompressor into the file.

All the variables the decompressor needs about the location of the compressed kernel are set up in the registers before calling the decompressor.

After decompression After decompression, the decompressed kernel is at TEXT_OFFSET and the appended DTB (if any) remains where the compressed kernel was.

After the decompression, we call get_inflated_image_size() to get the size of the final, decompressed kernel. We then flush and turn off the caches again.

We then jump to the symbol __enter_kernel which sets r0, r1 and r2 as the boot loader would have left them, unless we have an attached device tree blob, in which case r2 now points to that DTB. We then set the program counter to the start of the kernel, which will be the start of physical memory plus TEXT_OFFSET, typically 0x00008000 on a very conventional system, maybe 0x20008000 on some Qualcomm systems.

We are now at the same point as if we had loaded an uncompressed kernel, the vmlinux file, into memory at TEXT_OFFSET, passing (typically) a device tree in r2.

Kernel startup: executing vmlinux

The uncompressed kernel begins executing at the symbol stext(), start of text segment. This code can be found in arch/arm/kernel/head.S.

This is a subject of another discussion. However notice that the code here does not look for an appended device tree! If an appended device tree should be used, you must use a compressed kernel. The same goes for augmenting any device tree with ATAGs. That must also use a compressed kernel image, for the code to do this is part of the assembly that bootstraps a compressed kernel.

Looking closer at a kernel uncompress

Let us look closer at a Qualcomm APQ8060 decompression.

First you need to enable CONFIG_DEBUG_LL, which enables you to hammer out characters on the UART console without any intervention of any higher printing mechanisms. All it does is to provide a physical address to the UART and routines to poll for pushing out characters. It sets up DEBUG_UART_PHYS so that the kernel knows where the physical UART I/O area is located. Make sure these definitions are correct.

Then enable a Kconfig option called CONFIG_DEBUG_UNCOMPRESS. All this does is print the short message “Uncompressing Linux…” before decompressing the kernel and “done, booting the kernel” after the decompression. It is a nice smoke test to show that CONFIG_DEBUG_LL is set up, DEBUG_UART_PHYS is correct and decompression is working, but not much more. This does not provide any low-level debug.

The actual kernel decompression can be debugged and inspected by enabling the DEBUG define in arch/arm/boot/compressed/head.S, this is easiest done by tagging on -DDEBUG to the AFLAGS (assembler flags) for head.S in the arch/arm/boot/compressed/Makefile like this:

AFLAGS_head.o += -DTEXT_OFFSET=$(TEXT_OFFSET) -DDEBUG

We then get this message when booting:

C:0x403080C0-0x40DF0CC0->0x41801D00-0x422EA900

This means that as we were booting, I had loaded the kernel to 0x40300000, which would collide with the uncompressed kernel. Therefore the kernel was copied to 0x41801D00, which is where the uncompressed kernel will end. Adding some further debug prints we can see that an appended DTB is first found at 0x40DEBA68 and after moving the kernel it is found at 0x422E56A8, which is where it remains when the kernel is booted.

Recently my less-used desktop computer became sluggish, and would randomly crash. It seemed to be fully occupied with disk activity and quickly became uninteractive to the point that not even ssh login would work. This is easily identified as thrashing: constantly swapping to disk because of short core memory.

When Linux runs out of memory, processes will of course be killed by the OOM killer, but if you have ample swap space, you will instead get thrashing. In this case the OOM killer would have been better: the system was so uninteractive that there was no point in trying to use swap. The swap was on a flash drive, but it would still just thrash.

Normally you would interact with the machine through the UI or a terminal to shut down some processes, but that would not work: the memory used by the interactive processes, like the desktop itself or even an SSH terminal, was subject to swap!

After an update to the latest Fedora distribution the thrashing was the same but with one difference: the UI did not become completely uninteractive, making it possible to close down e.g. the web browser and recover the system.

The thrashing was caused by one of the DIMMs in the computer starting to malfunction reducing the core memory to a mere 4GB. (I have since replaced the memory.)

Something happened in Fedora that made it cope better with thrashing.

The most likely improvement in Fedora is BFQ, the Budget Fair Queueing block scheduler, which will use heuristics to keep the interactive processes higher in priority. This was recently made the default for single-queue devices in Fedora using a udev ruleset.

My flash drive was a single queue elder device – no fancy NVME – so it would become the bottleneck while constantly swapping, but with BFQ in between, the interactive processes got a priority boost and the system remained interactive under this heavy stress, and swapping was again a better alternative than the OOM killer.

Having worked a bit with BFQ over the years this is a welcome surprise: the user-perceived stability of the system is better.

This might also illustrate the rule of thumb to make swap space around double the physical memory: now that my swap space suddenly became 4 times the physical memory, the OOM killer would never step in; if it had been just 2 times the physical memory, maybe it would have. (I do not know if this holds or if the thrashing would be the same.)