people.kernel.org

Reader

Read the latest posts from people.kernel.org.

from Konstantin Ryabitsev

Message-ID's are used to identify and retrieve messages from the public-inbox archive on lore.kernel.org, so it's only natural to want to use memorable ones. Or maybe it's just me.

Regardless, here's what I do with neomutt and coolname:

  1. If coolname isn't yet packaged for your distro, you can install it with pip:

    pip install --user coolname
    
  2. Create this file as ~/bin/my-msgid.py:

    #!/usr/bin/python3
    import sys
    import random
    import string
    import datetime
    import platform
    
    from coolname import generate_slug
    
    parts = []
    parts.append(datetime.datetime.now().strftime('%Y%m%d'))
    parts.append(generate_slug(3))
    parts.append(''.join(random.choices(string.hexdigits, k=6)).lower())
    
    sys.stdout.write('-'.join(parts) + '@' + platform.node().split('.')[0])
    
  3. Create this file as ~/.mutt-fix-msgid:

    my_hdr Message-ID: <`/path/to/my/bin/my-msgid.py`>
    
  4. Add this to your .muttrc (works with mutt and neomutt):

    send-hook . "source ~/.mutt-fix-msgid"
    
  5. Enjoy funky message-id's like 20240227-flawless-capybara-of-drama-e09653@lemur. :)

 
Read more...

from Jakub Kicinski

Developments in Linux kernel networking accomplished by many excellent developers and as remembered by Andew L, Eric D, Jakub K and Paolo A.

Intro

The end of the Linux v6.2 merge coincided with the end of 2022, and the v6.8 window had just begun, meaning that during 2023 we developed for 6 kernel releases (v6.3 – v6.8). Throughout those releases netdev patch handlers (DaveM, Jakub, Paolo) applied 7243 patches, and the resulting pull requests to Linus described the changes in 6398 words. Given the volume of work we cannot go over every improvement, or even cover networking sub-trees in much detail (BPF enhancements… wireless work on WiFi 7…). We instead try to focus on major themes, and developments we subjectively find interesting.

Core and protocol stack

Some kernel-wide winds of development have blown our way in 2023. In v6.5 we saw an addition of SCM_PIDFD and SO_PEERPIDFD APIs for credential passing over UNIX sockets. The APIs duplicate existing ones but are using pidfds rather than integer PIDs. We have also seen a number of real-time related patches throughout the year.

v6.5 has brought a major overhaul of the socket splice implementation. Instead of feeding data into sockets page by page via a .sendpage callback, the socket .sendmsg handlers were extended to allow taking a reference on the data in struct msghdr. Continuing with the category of “scary refactoring work” we have also merged overhaul of locking in two subsystems – the wireless stack and devlink.

Early in the year we saw a tail end of the BIG TCP development (the ability to send chunks of more than 64kB of data through the stack at a time). v6.3 added support for BIG TCP over IPv4, the initial implementation in 2021 supported only IPv6, as the IPv4 packet header has no way of expressing lengths which don’t fit on 16 bits. v6.4 release also made the size of the “page fragment” array in the skb configurable at compilation time. Larger array increases the packet metadata size, but also increases the chances of being able to use BIG TCP when data is scattered across many pages.

Networking needs to allocate (and free) packet buffers at a staggering rate, and we see a continuous stream of improvements in this area. Most of the work these days centers on the page_pool infrastructure. v6.5 enabled recycling freed pages back to the pool without using any locks or atomic operations (when recycling happens in the same softirq context in which we expect the allocator to run). v6.7 reworked the API making allocation of arbitrary-size buffers (rather than pages) easier, also allowing removal of PAGE_SIZE-dependent logic from some drivers (16kB pages on ARM64 are increasingly important). v6.8 added uAPI for querying page_pool statistics over Netlink. Looking forward – there’s ongoing work to allow page_pools to allocate either special (user-mapped, or huge page backed) pages or buffers without struct page (DMABUF memory). In the non-page_pool world – a new slab cache was also added to avoid having to read struct page associated with the skb heads at freeing time, avoiding potential cache misses.

Number of key networking data structures (skb, netdevice, page_pool, sock, netns, mibs, nftables, fq scheduler) had been reorganized to optimize cacheline consumption and avoid cache misses. This reportedly improved TCP RPC performance with many connections on some AMD systems by as much as 40%.

In v6.7 the commonly used Fair Queuing (FQ) packet scheduler has gained built-in support for 3 levels of priority and ability to bypass queuing completely if the packet can be sent immediately (resulting in a 5% speedup for TCP RPCs).

Notable TCP developments this year include TCP Auth Option (RFC 5925) support, support for microsecond resolution of timestamps in the TimeStamp Option, and ACK batching optimizations.

Multi-Path TCP (MPTCP) is slowly coming to maturity, with most development effort focusing on reducing the features gap with plain TCP in terms of supported socket options, and increasing observability and introspection via native diag interface. Additionally, MPTCP has gained eBPF support to implement custom packet schedulers and simplify the migration of existing TCP applications to the multi-path variant.

Transport encryption continues to be very active as well. Increasing number of NICs support some form of crypto offload (TLS, IPsec, MACsec). This year notably we gained in-kernel users (NFS, NVMe, i.e. storage) of TLS encryption. Because kernel doesn’t have support for performing the TLS handshake by itself, a new mechanism was developed to hand over kernel-initiated TCP sockets to user space temporarily, where a well-tested user space library like OpenSSL or GnuTLS can perform a TLS handshake and negotiation, and then hand the connection back over to the kernel, with the keys installed.

The venerable bridge implementation has gained a few features. Majority of bridge development these days is driven by offloads (controlling hardware switches), and in case of data center switches EVPN support. Users can now limit the number of FDB and MDB auto-learned entries and selectively flush them in both bridge and VxLAN tunnels. v6.5 added the ability to selectively forward packets to VxLAN tunnels depending on whether they had missed the FDB in the lower bridge.

Among changes which may be more immediately visible to users – starting from v6.5 the IPv6 stack no longer prints the “link becomes ready” message when interface is brought up.

The AF_XDP zero-copy sockets have gained two major features in 2023. In v6.6 we gained multi-buffer support which allows transferring packets which do not fit in a single buffer (scatter-gather). v6.8 added Tx metadata support, enabling NIC Tx offloads on packets sent on AF_XDP sockets (checksumming, segmentation) as well as timestamping.

Early in the year we merged specifications and tooling for describing Netlink messages in YAML format. This work has grown to cover most major Netlink families (both legacy and generic). The specs are used to generate kernel ops/parsers, the uAPI headers, and documentation. User space can leverage the specs to serialize/deserialize Netlink messages without having to manually write parsers (C and Python have the support so far).

Device APIs

Apart from describing existing Netlink families, the YAML specs were put to use in defining new APIs. The “netdev” family was created to expose network device internals (BPF/XDP capabilities, information about device queues, NAPI instances, interrupt mapping etc.)

In the “ethtool” family – v6.3 brough APIs for configuring Ethernet Physical Layer Collision Avoidance (PLCA) (802.3cg-2019, a modern version of shared medium Ethernet) and MAC Merge layer (IEEE 802.3-2018 clause 99, allowing preemption of low priority frames by high priority frames).

After many attempts we have finally gained solid integration between the networking and the LED subsystems, allowing hardware-driven blinking of LEDs on Ethernet ports and SFPs to be configured using Linux LED APIs. Driver developers are working through the backlog of all devices which need this integration.

In general, automotive Ethernet-related contributions grew significantly in 2023, and with it, more interest in “slow” networking like 10Mbps over a single pair. Although the Data Center tends to dominate Linux networking events, the community as a whole is very diverse.

Significant development work went into refactoring and extending time-related networking APIs. Time stamping and time-based scheduling of packets has wide use across network applications (telcos, industrial networks, data centers). The most user visible addition is likely the DPLL subsystem in v6.7, used to configure and monitor atomic clocks and machines which need to forward clock phase between network ports.

Last but not least, late in the year the networking subsystem gained the first Rust API, for writing PHY drivers, as well as a driver implementation (duplicating an existing C driver, for now).

Removed

Inspired by the returning discussion about code removal at the Maintainer Summit let us mention places in the networking subsystem where code was retired this year. First and foremost in v6.8 wireless maintainers removed a lot of very old WiFi drivers, earlier in v6.3 they have also retired parts of WEP security. In v6.7 some parts of AppleTalk have been removed. In v6.3 (and v6.8) we have retired a number of packet schedulers and packet classifiers from the TC subsystem (act_ipt, act_rsvp, act_tcindex, sch_atm, sch_cbq, sch_dsmark). This was partially driven by an influx of syzbot and bug-bounty-driven security reports (there are many ways to earn money with Linux, turns out 🙂) Finally, the kernel parts of the bpfilter experiment were removed in v6.8, as the development effort had moved to user space.

Community & process

The maintainers, developers and community members had a chance to meet at the BPF/netdev track at Linux Plumbers in Richmond, and the netdev.conf 0x17 conference in Vancouver. 2023 was also the first time since the COVID pandemic when we organized the small netconf gathering – thanks to Meta for sponsoring and Kernel Recipes for hosting us in Paris!

We have made minor improvements to the mailing list development process by allowing a wider set of folks to update patch status using simple “mailbot commands”. Patch authors and anyone listed in MAINTAINERS for file paths touched by a patch series can now update the submission state in patchwork themselves.

The per-release development statistics, started late in the previous year, are now an established part of the netdev process, marking the end of each development cycle. They proved to be appreciated by the community and, more importantly, to somewhat steer some of the less participatory citizens towards better and more frequent contributions, especially on the review side.

A small but growing number of silicon vendors have started to try to mainline drivers without having the necessary experience, or mentoring needed to effectively participate in the upstream process. Some without consulting any of our documentation, others without consulting teams within their organization with more upstream experience. This has resulted in poor quality patch sets, taken up valuable time from the reviewers and led to reviewer frustration.

Much like the kernel community at large, we have been steadily shifting the focus on kernel testing, or integrating testing into our development process. In the olden days the kernel tree did not carry many tests, and testing had been seen as something largely external to the kernel project. The tools/testing/selftests directory was only created in 2012, and lib/kunit in 2019! We have accumulated a number of selftest for networking over the years, in 2023 there were multiple large selftest refactoring and speed up efforts. Our netdev CI started running all kunit tests and networking selftests on posted patches (although, to be honest, selftest runner only started working in January 2024 🙂).

syzbot stands out among “external” test projects which are particularly valuable for networking. We had fixed roughly 200 syzbot-reported bugs. This took a significant amount of maintainer work but in general we find syzbot bug reports to be useful, high quality and a pleasure to work on.

6.3: https://lore.kernel.org/all/20230221233808.1565509-1-kuba@kernel.org/ 6.4: https://lore.kernel.org/all/20230426143118.53556-1-pabeni@redhat.com/ 6.5: https://lore.kernel.org/all/20230627184830.1205815-1-kuba@kernel.org/ 6.6: https://lore.kernel.org/all/20230829125950.39432-1-pabeni@redhat.com/ 6.7: https://lore.kernel.org/all/20231028011741.2400327-1-kuba@kernel.org/ 6.8: https://lore.kernel.org/all/20240109162323.427562-1-pabeni@redhat.com/

 
Read more...

from arnd

Most compilers have an option to warn about a function that has a global definition but no declaration, gcc has had -Wmissing-prototypes as far back as the 1990s, and the sparse checker introduced -Wdecl back in 2005. Ensuring that each function has a declaration helps validate that the caller and the callee expect the same argument types, it can help find unused functions and it helps mark functions as static where possible to improve inter-function optimizations.

The warnings are not enabled in a default build, but are part of both make W=1 and make C=1 build, and in fact this used to cause most of the output of the former. As a number of subsystems have moved to eliminating all the W=1 warnings in their code, and the 0-day bot warns about newly introduced warnings, the amount of warning output from this has gone down over time.

After I saw a few patches addressing individual warnings in this area, I had a look at what actually remains. For my soc tree maintenance, I already run my own build bot that checks the output of “make randconfig” builds for 32-bit and 64-bit arm as well as x86, and apply local bugfixes to address any warning or error I get. I then enabled -Wmissing-prototypes unconditionally and added patches to address every single new bug I found, around 140 in total.

I uploaded the patches to https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=missing-prototypes and am sending them to the respective maintainers separately. Once all of these, or some other way to address each warning, can be merged into the mainline kernel, the warning option can be moved from W=1 to the default set.

The patches are all independent of one another, so I hope that most of them can get applied to subsytems directly as soon as I post them.

Some of the remaining architectures are already clean, while others will need follow-up patches for this. Another possible follow-up is to also address -Wmissing-variable-declarations warnings. This option is understood by clang but not enabled by the kernel build system, and not implemented by gcc, with the feature request being open since 2017.

 
Read more...

from Jakub Kicinski

The LWN's development statistics are being published at end of each release cycle for as long as I can remember (Linux 6.3 stats). Thinking back, I can divide the stages of my career based on my relationship with those stats. Fandom; aspiring; success; cynicism; professionalism (showing the stats to my manager). The last one gave me the most pause.

Developers will agree (I think) that patch count is not a great metric for the value of the work. Yet, most of my managers had a distinct spark in their eye when I shared the fact that some random API refactoring landed me in the top 10.

Understanding the value of independently published statistics and putting in the necessary work to calculate them release after release is one of many things we should be thankful for to LWN.

Local stats

With that in mind it's only logical to explore calculating local subsystem statistics. Global kernel statistics can only go so far. The top 20 can only, by definition, highlight the work of 20 people, and we have thousands of developers working on each release. The networking list alone sees around 700 people participate in discussions for each release.

Another relatively recent development which opens up opportunities is the creation of the lore archive. Specifically how easy it is now to download and process any mailing list's history. LWN stats are generated primarily based on git logs. Without going into too much of a sidebar – if we care about the kernel community not how much code various corporations can ship into the kernel – mailing list data mining is a better approach than git data mining. Global mailing list stats would be a challenge but subsystems are usually tied to a single list.

netdev stats

During the 6.1 merge window I could no longer resist the temptation and I threw some Python and the lore archive of netdev into a blender. My initial goal was to highlight the work of people who review patches, rather than only ship code, or bombard the mailing list with trivial patches of varying quality. I compiled stats for the last 4 release cycles (6.1, 6.2, 6.3, and 6.4), each with more data and metrics. Kernel developers are, outside of matters relating to their code, generally quiet beasts so I haven't received a ton of feedback. If we trust the statistics themselves, however — the review tags on patches applied directly by networking maintainers have increased from around 30% to an unbelievable 65%.

We've also seen a significant decrease in the number of trivial patches sent by semi-automated bots (possibly to game the git-based stats). It may be a result of other push back against such efforts, so I can't take all the full credit :)

Random example

I should probably give some more example stats. The individual and company stats generated for netdev are likely not that interesting to a reader outside of netdev, but perhaps the “developer tenure” stats will be. I calculated those to see whether we have a healthy number of new members.

Time since first commit in the git history for reviewers
 0- 3mo   |   2 | *
 3- 6mo   |   3 | **
6mo-1yr   |   9 | *******
 1- 2yr   |  23 | ******************
 2- 4yr   |  33 | ##########################
 4- 6yr   |  43 | ##################################
 6- 8yr   |  36 | #############################
 8-10yr   |  40 | ################################
10-12yr   |  31 | #########################
12-14yr   |  33 | ##########################
14-16yr   |  31 | #########################
16-18yr   |  46 | #####################################
18-20yr   |  49 | #######################################

Time since first commit in the git history for authors
 0- 3mo   |  40 | **************************
 3- 6mo   |  15 | **********
6mo-1yr   |  23 | ***************
 1- 2yr   |  49 | ********************************
 2- 4yr   |  47 | ###############################
 4- 6yr   |  50 | #################################
 6- 8yr   |  31 | ####################
 8-10yr   |  33 | #####################
10-12yr   |  19 | ############
12-14yr   |  25 | ################
14-16yr   |  22 | ##############
16-18yr   |  32 | #####################
18-20yr   |  31 | ####################

As I shared on the list – the “recent” buckets are sparse for reviewers and more filled for authors, as expected. What I haven't said is that if one steps away from the screen to look at the general shape of the histograms, however, things are not perfect. The author and the reviewer histograms seem to skew in the opposite directions. I'll leave to the reader pondering what the perfect shape of such a graph should be for a project, I have my hunch. Regardless, I'm hoping we can learn something by tracking its changes over time.

Fin

To summarize – I think that spending a day in each release cycle to hack on/generate development stats for the community is a good investment of maintainer's time. They let us show appreciation, check our own biases and by carefully selecting the metrics – encourage good behavior. My hacky code is available on GitHub, FWIW, but using mine may go against the benefits of locality? LWN's code is also available publicly (search for gitdm, IIRC).

 
Read more...

from linusw

As of recent I needed to understand how the ARM32 architecture switches control of execution between normal, userspace processes and the kernel processes, such as the init task and the kernel threads. Understanding this invariably involves understanding two aspects of the ARM32 kernel:

  • How tasks are actually scheduled on ARM32
  • How the kernelspace and userspace are actually separated, and thus how we move from one to the other

This is going to require knowledge from some other (linked) articles and a good understanding of ARM32 assembly.

Terminology

With tasks we mean processes, threads and kernel threads. The kernel scheduler see no major difference between these, they are schedulable entities that live on a certain CPU.

Kernel threads are the easiest to understand: in the big computer program that is the kernel, different threads execute on behalf of managing the kernel. They are all instantiated by a special thread called kthreadd — the kernel thread daemon. They exist for various purposes, one is to provide process context to interrupt threads, another to run workqueues such as delayed work and so on. It is handy for e.g. kernel drivers to be able to hand over execution to a process context that can churn on in the background.

Processes in userspace are in essence executing computer programs, or objects with an older terminology, giving the origin of expressions such as object file format. The kernel will start very few such processes, but modprobe and init (which always has process ID 1) are notable exceptions. Any other userspace processes are started by init. Processes can fork new processes, and it can also create separate threads of execution within itself, and these will become schedulable entities as well, so a certain process (executing computer program) can have concurrency within itself. POSIX threads is usually the way this happens and further abstractions such as the GLib GThread etc exist.

Task pie chart A pie chart of tasks according to priority on a certain system produced using CGFreak shows that from a scheduler point of view there are just tasks, any kernel threads or threads spawn from processes just become schedulable task entities.

The userspace is the commonplace name given to a specific context of execution where we execute processes. What defines this context is that it has its own memory context, a unique MMU table, which in the ARM32 case gives each process a huge virtual memory to live in. Its execution is isolated from the kernel and also from other processes, but not from its own threads (typically POSIX threads). To communicate with either the kernel or other userspace processes, it needs to use system calls “syscalls” or emit or receive signals. Both mechanisms are realized as software interrupts. (To communicate with its own spawn threads, shortcuts are available.)

The kernelspace conversely is the context of execution of the operating system, in our case Linux. It has its own memory context (MMU table) but some of the kernel memory is usually also accessible by the userspace processes, and the virtual memory space is shared, so that exceptions can jump directly into kernel code in virtual memory, and the kernel can directly read and write into userspace memory. This is done like so to facilitate quick communication between the kernel and userspace. Depending on the architecture we are executing Linux on, executing in kernelspace is associated with elevated machine privileges, and means the operating system can issue certain privileged instructions or otherwise access some certain restricted resources. The MMU table permissions protects kernel code from being inspected or overwritten by userspace processes.

Background

This separation, along with everything else we take for granted in modern computers and operating systems was created in the first time-sharing systems such as the CTSS running on the IBM 700/7000 series computers in the late 1950ies. The Ferranti Atlas Computer in 1962-1967 and its supervisor program followed shortly after these. The Atlas invented nifty features such as virtual memory and memory-mapped I/O, and was of course also using time-sharing. As can be easily guessed, these computers and operating systems (supervisors) designs inspired the hardware design and operating system designs of later computers such as the PDP-11, where Unix began. This is why Unix-like operating systems such as Linux more or less take all of these features and concepts for granted.

The idea of a supervisor or operating system goes deep into the design of CPUs, so for example the Motorola 68000 CPU had three function code pins routed out on the package, FC2, FC1 and FC0 comprising three bits of system mode, four of these bit combinations representing user data, user program, supervisor data and supervisor program. (These modes even reflect the sectioning of program and supervisor objects into program code or TEXT segments and a DATA segments.) In the supervisor mode, FC2 was always asserted. This way physical access to memory-mapped peripherals could be electronically constrained to access only from supervisor mode. Machines such as the Atari ST exploited this possibility, while others such as the Commodore Amiga did not.

All this said to give you a clear idea why the acronym SVC as in Supervisor Call is used rather than e.g. operating system call or kernel call which would have been more natural. This naming is historical.

Execution Modes or Levels

We will restrict the following discussion to the ARMv4 and later ARM32 architectures which is what Linux supports.

When it comes to the older CPUs in the ARMv4, ARMv5 and ARMv6 range these have a special supervisor mode (SVC mode) and a user mode, and as you could guess these two modes are mapped directly to kernelspace and userspace in Linux. In addition to this there are actually 5 additional exception modes for FIQ, IRQ, system mode, abort and undefined, so 7 modes in total! To cut a long story short, all of the modes except the user mode belong to kernelspace.

Apart from restricting certain instructions, the only thing actually separating the kernelspace from userspace is the MMU, which is protecting kernelspace from userspace in the same way that different userspace processes are protected from each other: by using virtual memory to hide physical memory, and in the cases where it is not hidden: using protection bits in the page table to restrict access to certain memory areas. The MMU table can naturally only be altered from supervisor mode and this way it is clear who is in control.

The later versions of the ARM32 CPU, the ARMv7, add some further and an even deeper secure monitor or just monitor mode.

For reference, these modes in the ARMv8 architecture correspond to “privilege levels”. Here the kernelspace execute at exception level EL1, and userspace at exception level EL0, then there are further EL2 and EL3 “higher” privilege levels. EL2 is used for hypervisor (virtualization) and EL3 is used for a secure monitor that oversee the switch back and forth to the trusted execution environment (TEE), which is a parallel and different operating environment, essentially like a different computer: Linux can interact with it (as can be seen in drivers/tee in the kernel) but it is a different thing than Linux entirely.

These higher privilege levels and the secure mode with its hypervisor and TEE are not always used and may be dormant. Strictly speaking, the security and virtualization functionality is optional, so it is perfectly fine to fabricate ARMv7 silicon without them. To accompany the supervisor call (SVC) on ARMv7 a hypervisor call (HVC) and a secure monitor call (SMC) instruction was added.

Exceptional Events

We discussed that different execution modes pertain to certain exceptions. So let's recap ARM32 exceptions.

As exceptions go, these happen both in kernelspace and userspace, but they are always handled in kernelspace. If that userspace process for example divides by zero, an exception occurs that take us into the kernel, all the time pushing state onto the stack, and resuming execution inside the kernel, which will simply terminate the process over this. If the kernel itself divides by zero we get a kernel crash since there is no way out.

The most natural exception is of course a hardware interrupt, such as when a user presses a key or a hard disk signals that a sector of data has been placed in a buffer, or a network card indicates that an ethernet packet is available from the interface.

Additionally, as mentioned previously, most architectures support a special type of software exception that is initiated for carrying out system calls, and on ARM and Aarch64 that is what is these days called the SVC (supervisor call) instruction. This very same instruction — i.e. with the same binary operation code — was previously called SWI (software interrupt) which makes things a bit confusing at times, especially when reading old documentation and old code, but the assembly mnemonics SVC and SWI have the same semantic. For comparison on m68k this instruction is named TRAP, on x86 there is the INT instruction and RISC-V has the SBI (supervisor binary interface) call.

In my article about how the ARM32 architecture is set up I talk about the exception vector table which is 8 32bit pointers stored in virtual memory from 0xFFFF0000 to 0xFFFF0020 and it corresponds roughly to everything that can take us from kernelspace to userspace and back.

The transitions occurs at these distinct points:

  • A hardware RESET occurs. This is pretty obvious: we need to abort all user program execution, return to the kernel and take everything offline.
  • An undefined instruction is encountered. The program flow cannot continue if this happens and the kernel has to do something about it. The most typical use for this is to implement software fallback for floating-point arithmetic instructions that some hardware may be lacking. These fallbacks will in that case be implemented by the kernel. (Since doing this with a context switch and software fallback in the kernel is expensive, you would normally just compile the program with a compiler that replace the floating point instructions with software fallbacks to begin with, but not everyone has the luxury of source code and build environment available and have to run pre-compiled binaries with floating point instructions.)
  • A software interrupt occurs. This is the most common way that a userspace application issues a system call (supervisor call) into the operating system. As mentioned, on ARM32 this is implemented by the special SVC (aka SWI) instruction that passes a 1-byte parameter to the software interrupt handler.
  • A prefetch abort occurs. This happens when the instruction pointer runs into unpaged memory, and the virtual memory manager (mm) needs to page in new virtual memory to continue execution. Naturally this is a kernel task.
  • A data abort occurs. This is essentially the same as the prefetch abort but the program is trying to access unpaged data rather than unpaged instructions.
  • An address exception occurs. This doesn't happen on modern ARM32 CPUs, because the exception is for when the CPU moves outside the former 26bit address space on ARM26 architectures that Linux no longer supports.
  • A hardware interrupt occurs – since the operating system handles all hardware, naturally whenever one of these occur, we have to switch to kernel context. The ARM CPUs have two hardware interrupt lines: IRQ and FIQ. Each can be routed to an external interrupt controller, the most common being the GIC (Global Interrupt Controller) especially for multicore systems, but many ARM systems use their own, custom interrupt controllers.
  • A fault occurs such as through division by zero or other arithmetic fault – the CPU runs into an undefined state and has no idea how to recover and continue. This is also called a processor abort.

That's all. But these are indeed exceptions. What is the rule? The computer programs that correspond to the kernel and each userspace process have to start somewhere, and then they are excecuted in time slices, which means that somehow they get interrupted by one of these exceptions and preempted, a procedure that in turn invariably involves transitions back and forth from userspace to kernelspace and back into userspace again.

So how does that actually happen? Let's look at that next.

Entering Kernelspace

Everything has a beginning. I have explained in a previous article how the kernel bootstraps from the function start_kernel() in init/main.c and sets up the architecture including virtual memory to a point where the architecture-neutral parts of the kernel starts executing.

Further down start_kernel() we initialize the timers, start the clocksource (the Linux system timeline) and initialize the scheduler so that process scheduling can happen. But nothing really happens, because there are no processes. Then the kernel reaches the end of the start_kernel() function where arch_call_rest_init() is called. This is in most cases a call to rest_init() in the same file (only S390 does anything different) and that in turn actually initializes some processes:

pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
(...)
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);

We create separate threads running the in-kernel functions kernel_init and kthreadd, which is the kernel thread daemon which in turn spawns all kernel threads.

The user_mode_thread() or kernel_thread() calls create a new processing context: they both call kernel_clone() which calls copy_process() with NULL as first argument, meaning it will not actually copy any process but instead create a new one. It will create a new task using dup_task_struct() passing current as argument, which is the init task and thus any new task is eventually derived from the compiled-in init task. Then there is a lot of cloning going on, and we reach copy_thread() which calls back into the architecture to initialize struct thread_info for the new task. This is a struct we will look at later, but notice one thing, and that is that when a new kernel or user mode thread is created like this (with a function such as kernel_init passed instead of just forking), the following happens:

memset(childregs, 0, sizeof(struct pt_regs));
thread->cpu_context.r4 = (unsigned long)args->fn_arg;
thread->cpu_context.r5 = (unsigned long)args->fn;
childregs->ARM_cpsr = SVC_MODE;
(...)
thread->cpu_context.pc = (unsigned long)ret_from_fork;

fn_arg will be NULL in this case but fn is kernel_init or kthreadd. And we execute in SVC_MODE which is the supervisor mode: as the kernel. Also user mode threads are initialized as supervisor mode tasks to begin with, but it will eventually modify itself into a userspace task. Setting the CPU context to ret_from_fork will be significant, so notice this!

Neither of the functions kernel_init or kthreadd will execute at this point! We will just return. The threads are initialized but nothing is scheduled yet: we have not yet called schedule() a single time, which means nothing happens, because nothing is yet scheduled.

kernel_init is a function in the same file that is as indicated will initialize the first userspace process. If you inspect this function you will see that it keeps executing some kernel code for quite a while: it waits for kthreadd to finish initalization so the kernel is ready for action, then it will actually do some housekeeping such as freeing up the kernel initmem (functions tagged __init) and only then proceed to run_init_process(). As indicated, this will start the init process using kernel_execve(), usually /sbin/init which will then proceed to spawn all usermode processes/tasks. kernel_execve() will check for supported binary formats and most likely call the ELF loader to process the binary and page in the file into memory from the file system etc. If this goes well, it will end with a call to the macro START_THREAD() which in turn wraps the ARM32-specific start_thread() which will, amongst other things, do this:

regs->ARM_cpsr = USR_MODE;
(...)
regs->ARM_pc = pc & ~1;

So the new userspace process will get pushed into userspace mode by the ELF loader, and that will also set the program counter to wherever the ELF file is set to execute. regs->ARM_cpsr will be pushed into the CPSR register when the task is scheduled, and we start the first task executing in userspace.

kthreadd on the other hand will execute a perpetual loop starting other kernel daemons as they are placed on a creation list.

But as said: neither is executing.

In order to actually start the scheduling we call schedule_preempt_disabled() which will issue schedule() with preemption disabled: we can schedule tasks, and they will not interrupt each other (preempt) in fine granular manner, so the scheduling is more “blocky” at this point. However: we already have the clockevent timer running so that the operating system is now ticking, and new calls to the main scheduler callbacks scheduler_tick() and schedule() will happen from different points in future time, at least at the system tick granularity (HZ) if nothing else happens. We will explain more about this further on in the article.

Until this point we have been running in the context of the Linux init task which is a elusive hard-coded kernel thread with PID 0 that is defined in init/init_task.c and which I have briefly discussed in a previous article. This task does not even appear in procfs in /proc.

As we call schedule(), the kernel init task will preempt and give way to kthreadd and then to the userspace init process. However when the scheduler again schedules the init task with PID 0, we return to rest_init(), and we will call cpu_startup_entry(CPUHP_ONLINE) and that function is in kernel/sched/idle.c and looks like this:

void cpu_startup_entry(enum cpuhp_state state)
{
        arch_cpu_idle_prepare();
        cpuhp_online_idle(state);
        while (1)
                do_idle();
}

That's right: this function never returns. Nothing ever breaks out of the while(1) loop. All that do_idle() does is to wait until no tasks are scheduling, and then call down into the cpuidle subsystem. This will make the CPU “idle” i.e. sleep, since nothing is going on. Then the loop repeats. The kernel init task, PID 0 or “main() function” that begins at start_kernel() and ends here, will just try to push down the system to idle, forever. So this is the eventual fate of the init task. The kernel has some documentation of the inner loop that assumes that you know this context.

Let's look closer at do_idle() in the same file, which has roughly this look (the actual code is more complex, but this is the spirit of it):

while (!need_resched()) {
    local_irq_disable();
    enter_arch_idle_code();
    /* here a considerable amount of wall-clock time can pass */
    exit_arch_idle_code();
    local_irq_enable();
}
(...)
schedule_idle();

This will spin here until something else needs to be scheduled, meaning the init task has the TIF_NEED_RESCHED bit set, and should be preempted. The call to schedule_idle() soon after exiting this loop makes sure that this rescheduling actually happens: this calls right into the scheduler to select a new task and is a variant of the more generic schedule() call which we will see later.

We will look into the details soon, but we see the basic pattern of this perpetual task: see if someone else needs to run else idle and when someone else wants to run, stop idling and explicitly yield to whatever task was waiting.

Scheduling the first task

So we know that schedule() has been called once on the primary CPU, and we know that this will set the memory management context to the first task, set the program counter to it and execute it. This is the most brutal approach to having a process scheduled, and we will detail what happens further down.

We must however look at the bigger picture of kernel preemtion to get the full picture of what happens here.

Scheduler model A mental model of the scheduler: scheduler_tick() sets the flag TIF_NEED_RESCHED and a later call to schedule() will actually call out to check_and_switch_context() that does the job of switching task.

Scheduler tick and TIF_NEED_RESCHED

As part of booting the kernel in start_kernel() we first initialized the scheduler with a call to sched_init() and the system tick with a call to tick_init() and then the timer drivers using time_init(). The time_init() call will go through some loops and hoops and end up initializing and registering the clocksource driver(s) for the system, such as those that can be found in drivers/clocksource.

There will sometimes be only a broadcast timer to be used by all CPU:s on the system (the interrupts will need to be broadcast to all the CPU:s using IPC interrupts) and sometimes more elaborate architectures have timers dedicated to each CPU so these can be used invidually by each core to plan events and drive the system tick on that specific CPU.

The most suitable timer will also be started as part of the clockevent device(s) being registered. However, it's interrupt will not be able to fire until local_irq_enable() is called further down in start_kernel(). After this point the system has a running scheduling tick.

As scheduling happens separately on each CPU, scheduler timer interrupts and rescheduling calls needs to be done separately on each CPU as well.

The clockevent drivers can provide a periodic tick and then the process will be interrupted after an appropriate number of ticks, or the driver can provide oneshot interrupts, and then it can plan an event further on, avoiding to fire interrupts while the task is running just for ticking and switching itself (a shortcut known as NO_HZ).

What we know for sure is that this subsystem always has a new tick event planned for the system. It can happen in 1/HZ seconds if periodic ticks are used, or it can happen several minutes into the future if nothing happens for a while in the system.

When the clockevent eventually fires, in the form of an interrupt from the timer, it calls its own ->event_handler() which is set up by the clockevent subsystem code. When the interrupt happens it will fast-forward the system tick by repetitive calls to do_timer() followed by a call to scheduler_tick(). (We reach this point through different paths depending on whether HRTimers and other kernel features are enabled or not.)

As a result of calling scheduler_tick(), some scheduler policy code such as deadline, CFS, etc (this is explained by many others elsewhere) will decide that the current task needs to be preempted, “rescheduled” and calls resched_curr(rq) on the runqueue for the CPU, which in turn will call set_tsk_need_resched(curr) on the current task, which flags it as ready to be rescheduled.

set_tsk_need_resched() will set the flag TIF_NEED_RESCHED for the task. The flag is implemented as an arch-specific bitfield, in the ARM32 case in arch/arm/include/asm/thread_info.h and ARM32 has a bitmask version of this flag helpfully named _TIF_NEED_RESCHED that can be used by assembly snippets to check it quickly with a logical AND operation.

This bit having been set does not in any way mean that a new process will start executing immediately. The flag semantically means “at your earliest convenience, yield to another task”. So the kernel waits until it finds an appropriate time to preempt the task, and that time is when schedule() is called.

The Task State and Stack

We mentioned the architecture-specific struct thread_info so let's hash out where that is actually stored. It is a simpler story than it used to be, because these days, the the ARM32 thread_info is simply part of the task_struct. The struct task_struct is the central per-task information repository that the generic parts of the Linux kernel holds for a certain task, and paramount to keeping the task state. Here is a simplified view that gives you an idea about how much information and pointers it actually contains:

struct task_struct {
    struct thread_info thread_info;
    (...)
    unsigned int state;
    (...)
    void *stack;
    (...)
    struct mm_struct *mm;
    (...)
    pid_t pid;
    (...)
};

The struct thread_info which in our case is a member of task_struct contains all the architecture-specific aspects of the state.

The task_struct refers to thread_info, but also to a separate piece of memory void *stack called the task stack, which is where the task will store its activation records when executing code. The task stack is of size THREAD_SIZE, usually 8KB (2 * PAGE_SIZE). These days, in most systems, the task stack is mapped into the VMALLOC area.

The last paragraph deserves some special mentioning with regards to ARM32 because things changed. Ard Biesheuvel recently first enabled THREAD_INFO_IN_TASK which enabled thread info to be contained in the task_struct and then enabled CONFIG_VMAP_STACK for all systems in the ARM32 kernel. This means that the VMALLOC memory area is used to map and access the task stack. This is good for security reasons: the task stack is a common target for kernel security exploits, and by moving this to the VMALLOC area, which is simply a huge area of virtual memory addresses, and surrounding it below and above with unmapped pages, we will get a page violation if a the kernel tries to access memory outside the current task stack!

Task struct The task_struct in the Linux kernel is where the kernel keeps a nexus of all information about a certain task, i.e. a certain processing context. It contains .mm the memory context where all the virtual memory mappings live for the task. The thread_info is inside it, and inside the thread_info is a cpu_context_save. It has a task stack of size THREAD_SIZE for ARM32 which is typically twice the PAGE_SIZE, i.e. 8KB, surrounded by unmapped memory for protection. Again this memory is mapped in the memory context of the process. The split between task_struct and thread_info is such that task_struct is Linux-generic and thread_info is architecture-specific and they correspond 1-to-1.

Actual Preemption

In my mind, preemption happens when the program counter is actually set to a code segment in a different process, and this will happen at different points depending on how the kernel is configured. This happens as a result of schedule() getting called, and will in essence be a call down to the architecture to switch memory management context and active task. But where and when does schedule() get called?

schedule() can be called for two reasons:

  • Voluntary preemption: such as when a kernel thread want to give up it's time slice because it knows it cannot proceed for a while. This is the case for most instances of this call that you find in the kernel. The special case when we start the kernel and call schedule_preempt_disabled() the very first time, we voluntarily preempt the kernel execution of the init task with PID 0 to instead execute whatever is queued and prioritized in the scheduler, and that will be the kthreadd process. Other places can be found by git grep:ing for calls to cond_resched() or just an explicit call to schedule().
  • Forced preemption: this happens when a task is simply scheduled out. This happens to kernelthreads and userspace processes alike. This happens when a process has used up its' timeslice, and schedule_tick() has set the TIF_NEED_RESCHED flag. And we described in the previous section how this flag gets set from the scheduler tick.

Places where forced preemption happens:

The short answer to the question “where does forced preemption happen?” is “at the end of exception handlers”. Here are the details.

The most classical place for preemption of userspace processes is on the return path of a system call. This happens from arch/arm/kernel/entry-common.S in the assembly snippets for ret_slow_syscall() and ret_fast_syscall(), where the ARM32 kernel makes an explicit call to do_work_pending() in arch/arm/kernel/signal.c. This will issue a call to schedule() if the flag _TIF_NEED_RESCHED is set for the thread, and the the kernel will handle over execution to whichever task is prioritized next, no matter whether it is a userspace or kernelspace task. A special case is ret_from_fork which means a new userspace process has been forked and in many cases the parent gets preempted immediately in favor of the new child through this path.

The most common place for preemption is however when returning from a hardware interrupt. Interrupts on ARM32 are handled in assembly in arch/arm/kernel/entry-armv.S with a piece of assembly that saves the processor state for the current CPU into a struct pt_regs and from there just calls the generic interrupt handling code in kernel/irq/handle.c named generic_handle_arch_irq(). This code is used by other archs than ARM32 and will nominally just store the system state and registers in a struct pt_regs record on entry and restore it on exit. However when the simplistic code in generic_handle_arch_irq() is done, it exits through the same routines in arch/arm/kernel/entry-common.S as fast and slow syscalls, and we can see that in ret_to_user_from_irq the code will explicitly check for the resched and other flags with ldr r1, [tsk, #TI_FLAGS] and branch to the handler doing do_work_pending(), and consequently preempt to another task instead of returning from an interrupt.

Now study do_work_pending():

do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
{
        /*
         * The assembly code enters us with IRQs off, (...)
         */

        do {
                if (likely(thread_flags & _TIF_NEED_RESCHED)) {
                        schedule();
                } else {
                        (...)
                }
                local_irq_disable();
                thread_flags = read_thread_flags();
        } while (...);
        return 0;
}

Notice the comment: we enter do_work_pending() with local IRQs disabled so we can't get interrupted in an interrupt (other exceptions can still happen though). Then we likely call schedule() and another thread needs to start to run. When we return after having scheduled another thread we are supposed proceed to exit the exception handler with interrupts disabled, so that is why the first instruction after the if/else-clause is local_irq_disable() – we might have come back from a kernel thread which was happily executing with interrupts enabled. So disable them. In fact, if you grep for do_work_pending you will see that this looks the same on other architectures with similar setup.

In reality do_work_pending() does a few more things than preemption: it also handles signals between processes and process termination etc. But for this exercise we only need to know that it calls schedule() followed by local_irq_disable().

The struct pt_regs should be understood as “processor trace registers” which is another historical naming, much due to its use in tracing. On ARM32 it is in reality 18 32-bit words representing all the registers and status bits of the CPU for a certain task, i.e. the CPU state, including the program counter pc, which is the place where the task was supposed to resume execution, unless it got preempted by schedule(). This way, if we preempt and leave a task behind, the CPU state contains all we need to know to continue where we left off. These pt_regs are stored in the task stack during the call to generic_handle_arch_irq().

The assembly in entry-common.S can be a bit hard to follow, here is a the core essentials for a return path from an interrupt that occurs while we are executing in userspace:

	(...)
slow_work_pending:
	mov	r0, sp				@ 'regs'
	mov	r2, why				@ 'syscall'
	bl	do_work_pending
	cmp	r0, #0
	beq	no_work_pending
	(...)

ENTRY(ret_to_user_from_irq)
	ldr	r1, [tsk, #TI_FLAGS]
	movs	r1, r1, lsl #16
	bne	slow_work_pending
no_work_pending:
	asm_trace_hardirqs_on save = 0
	ct_user_enter save = 0
	restore_user_regs fast = 0, offset = 0

We see that when we return from an IRQ, we check the flags in the thread and if any bit is set we branch to execute slow work, which is done by do_work_pending() which will potentially call schedule(), then return, possibly much later, and if all went fine branch back to no_work_pending and restore the usersmode registers and continue execution.

Notice that the exception we are returning from here can be the timer interrupt that was handled by the Linux clockevent and driving the scheduling by calling scheduler_tick()! This means we can preempt directly on the return path of the interrupt that was triggered by the timer tick. This way the slicing of task time is as precise as it can get: scheduler_tick() gets called by the timer interrupt, and if it sets TIF_NEED_RESCHED a different thread will start to execute on our way out of the exception handler!

The same path will be taken by SVC/SWI software exceptions, so these will also lead to rescheduling of necessary. The routine named restore_user_regs can be found in entry-header.S and it will pretty much do what it says, ending with the following instructions (if we remove quirks and assume slowpath):

	mov	r2, sp
	(...)
	ldmdb	r2, {r0 - lr}^			@ get calling r0 - lr
	add	sp, sp, #\offset + PT_REGS_SIZE
	movs	pc, lr				@ return & move spsr_svc into cp

r2 is set to the stack pointer, where pt_regs are stored, these are 17 registers and CPSR (current program status register). We pull the registers from the stack (including r2 which gets overwritten) — NOTE: the little caret (^) after the ldmdb instruction means “also load CPSR from the stack” — then moves the stackpointer past the saved registers and returns.

Using the exceptions as a point for preemption is natural: exceptions by their very nature are designed to store the processor state before jumping to the exception handler, and it is strictly defined how to store this state into memory such as onto the per-task task stack, and how to reliably restore it at the end of an exception. So this is a good point to do something else, such as switch to something completely different.

Also notice that this must happen in the end of the interrupt (exception) handler. You can probably imagine what would happen on a system with level-triggered interrupts if we would say preempt in the beginning of the interrupt instead of the end: we would not reach the hardware interrupt handler, and the interrupt would not be cleared. Instead, we handle the exception, and then when we are done we optionally check if preemption should happen right before returning to the interrupted task.

But let's not skip the last part of what schedule() does.

Setting the Program Counter

So we now know a few places where the system can preempt and on ARM32 we see that this mostly happens in the function named do_work_pending() which in turn will call schedule() for us.

The schedulers schedule() call is supposed to very quickly select a process to run next. Eventually it will call context_switch() in kernel/sched/core.c, which in turn will do essentially two things:

  • Check if the next task has a unique memory management context (next->mm is not NULL) and in that case switch the memory management context to the next task. This means updating the MMU to use a different MMU table. Kernel threads do not have any unique memory management context so for those we can just keep the previous context (the kernel virtual memory is mapped into all processes on ARM32 so we can just go on). If the memory management context does switch, we call switch_mm_irqs_off() which in the ARM32 case is just defined to the ARM32-specific switch_mm() which will call the ARM32-specific check_and_switch_context()NOTE that this function for any system with MMU is hidden in the arch/arm/include/asm/mmu_context.h header file — which in turn does one of two things:
    • If interrupts are disabled, we will just set mm->context.switch_pending = 1 so that the memory management context switch will happen at a later time when we are running with interrupts enabled, because it will be very costly to switch task memory context on ARM32 if interrupts are disabled on certain VIVT (virtually indexed, virtually tagged) cache types, and this in turn would cause unpredictable IRQ latencies on these systems. This concerns some ARMv6 cores. The reason why interrupts would be disabled in a schedule() call is that it will be holding a runqueue lock, which in turn disables interrupts. Just like the comment in the code says, this will be done later in the arch-specific finish_arch_post_lock_switch() which is implemented right below and gets called right after dropping the runqueue lock.
    • If interrupts are not disabled, we will immediately call cpu_switch_mm(). This is a per-cpu callback witch is written in assembly for each CPU as cpu_NNNN_switch_mm() inside arch/arm/mm/proc-NNNN.S. For example, all v7 CPUs have the cpu_v7_switch_mm() in arch/arm/mm/proc-v7.S.
  • Switch context (such as the register states and stack) to the new task by calling switch_to() with the new task and the previous one as parameter. In most cases this latches to an architecture-specific __switch_to(). In the ARM32 case, this routine is written in assembly and can be found in arch/arm/kernel/entry-armv.S.

Now the final details happens in __switch_to() which is supplied the struct thread_info (i.e. the architecture-specific state) for both the current and the previous task:

  • We store the registers of the current task in the task stack, at the TI_CPU_SAVE index of struct thread_info, which corresponds to the .cpu_context entry in the struct, which is in turn a struct cpu_context_save, which is 12 32-bit values to store r4-r9, sl, fp, sp and pc. This is everything needed to continue as if nothing has happened when we “return” after the schedule() call. I put “return” in quotation marks, because a plethora of other tasks may have run before we actually get back there. You may ask why r0, r1, r2 and r3 are not stored. This will be addressed shortly.
  • Then the TLS (Thread Local Storage) settings for the new task are obtained and we issue switch_tls(). On v6 CPUs this has special implications, but in most cases we end up using switch_tls_software() which sets TLS to 0xffff0ff0 for the task. This is a hard-coded value in virtual memory used by the kernel-provided user helpers, which in turn are a few kernel routines “similar to but different from VDSO” that are utilized by the userspace C library. On ARMv7 CPUs that support the thread ID register (TPIDRURO) this will be used to store the struct thread_info pointer, so it cannot be used for TLS on ARMv7. (More on this later.)
  • We then broadcast THREAD_NOTIFY_SWITCH using kernel notifiers. These are usually written i C but called from the assembly snippet __switch_to() here. A notable use case is that if the task is making use of VFP (the Vectored Floating Point unit) then the state of the VFP gets saved here, so that will be cleanly restored when the task resumes as well.

Then we reach the final step in __switch_to(), which is a bit different depending on whether we use CONFIG_VMAP_STACK or not.

The simple path when we are not using VMAP:ed stacks looks like this:

	set_current r7, r8
	ldmia	r4, {r4 - sl, fp, sp, pc}	@ Load all regs saved previously

Here r7 contains a pointer to the next tasks thread_info (which will somewhere the kernel virtual memory map), and set_current() will store the pointer to that task in such a way that the CPU can look it up with a few instructions at any point in time. On older non-SMP ARMv4 and ARMv5 CPU:s this will simply be the memory location pointed out by the label __current but ARMv7 and SMP systems have a dedicated special CP15 TPIDRURO thread ID register to store this in the CPU so that the thread_info can be located very quickly. (The only user of this information is, no surprise, the get_current() assembly snippet, but that is in turn called from a lot of places and contexts.)

The next ldmia instruction does the real trick: it loads registers r4 thru sl (r10), fp (r11), sp(r13) and pc(r15) from the location pointed out by r4, which again is the .cpu_context entry in the struct thread_info, the struct cpu_context_save, which is all the context there is including pc so the next instruction after this will be whatever pc was inside the struct cpu_context_save. We have switched to the new task and preemption is complete.

But wait a minute. r4 and up you say. Exept some registers, so what about r0, r1, r2, r3, r12 (ip) and r14 (lr)? Isn't the task we're switching to going to miss those registers?

For r0-r3 the short answer is that when we call schedule() explicitly (which only happens inside the kernel) then r0 thru r3 are scratch registers that are free to be “clobbered” during any function call. So since we call schedule() the caller should be prepared that those registers are clobbered anyway. The same goes for the status register CPSR. It's a function call to inline assembly and not an exception.

And even if we look around the context after a call to schedule(), since we were either (A) starting a brand new task or (B) on our way out of an exception handler for a software or hardware interrupt or (C) explicitly called schedule() when this happened, this just doesn't matter.

Then r12 is a scratch register and we are not calling down the stack using lr at this point either (we just jump to pc!) so these two do not need to be saved or restored. (On the ARM or VMAP exit path you will find ip and lr being used.)

When starting a completely new task all the contents of struct cpu_context_save will be zero, and the return address will be set to ret_from_fork or and then the new task will bootstrap itself in userspace or as a kernel thread anyway.

If we're on the exit path of an exception handler, we call various C functions and r0 thru r3 are used as scratch registers, meaning that their content doesn't matter. At the end of the exception (which we are close to when we call schedule()) all registers and the CPSR will be restored from the kernel exception stacks record for pt_regs before the exception returns anyway, which is another good reason to use exceptions handlers as preemption points.

This is why r0 thru r3 are missing from struct cpu_context_save and need not be preserved.

When the scheduler later on decides to schedule in the task that was interrupted again, we will return to execution right after the schedule(); call. If we were on our way out of an exception in do_work_pending() we will proceed to return from the exception handler, and to the process it will “feel” like it just returned from a hardware or sofware interrupt, and execution will go on from that point like nothing happened.

Running init

So how does /sbin/init actually come to execute?

We saw that after start_kernel we get to rest_init which creates the thread with pid = user_mode_thread(kernel_init, NULL, CLONE_FS).

Then kernel_init calls on kernel_execve() to execute /sbin/init. It locates an ELF parser to read and page in the file. Then it will eventually issue start_thread() which will set regs->ARM_cpsr = USR_MODE and regs->ARM_pc to the start of the executable.

Then this tasks task_struct including memory context etc will be selected after a call to schedule().

But every call to schedule() will return to the point right after a schedule() call, and the only place a userspace task is ever preempted to get schedule() called on it is in the exception handlers, such as when a timer interrupt occurs. Well, this is where we “cheat”:

When we initialized the process in arch/arm/kernel/process.c, we set the program counter to ret_from_fork so we are not going back after any schedule() call: we are going back to ret_from_fork! And this is just an exception return path, so this will restore regs->ARM_cpsr to USR_MODE, and “return from an exception” into whatever is in regs->ARM_pc, which is the start of the binary program from the ELF file!

So /sbin/init is executed as a consequence of returning from a fake exception through ret_from_fork. From that point on, only real exceptions, such as getting interrupted by the IRQ, will happen to the process.

This is how ARM32 schedules and executes processes.

 
Read more...

from Christian Brauner

The original blogpost is at https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html

Early on when the LXD project was started we were clear that we wanted to make it possible to change settings while the container is running. On of the very first things that came to our mind was making it possible to insert new mounts into a running container. When I was still at Canonical working on LXD we quickly realized that inserting mounts into a running container would require a lot of creativity given the limitations of the api.

Back then the only way to create mounts or change mount option was by using the mount(2) system call. The mount system call multiplexes a lot of different operations. For example, it doesn't just allow the creation of new filesystem mounts but also handles bind mounts and mount option changes. Mounting is overall a pretty complex operation as it doesn't just involve path lookup but also needs to handle mount propagation and filesystem specific and generic mount options.

I want to take a look at our legacy solution to this problem and a new approach that I've used and that has existed for a while but never talked about widely.

Creative uses of mount(2)

Before openat2(2) came along adding mounts to a container during startup was difficult because there was always the danger of symlink attacks. A mount source or target path could be specified containing symlinks that would allow processes in the container to escape to the host filesystem. These attacks used to be quite common and there was no straightforward solution available; at least not before the RESOLVE_* flag namespace of openat2(2) improved things so considerably that symlink attacks on new kernels can be effectively blocked.

But before openat2() symlink attacks when mounting could only be prevented with very careful coding and a rather elaborate algorithm. I won't go into too much detail but it is roughly done by verifying each path component in userspace using O_PATH file descriptors making sure that the paths point into the container's rootfs.

But even if you verified that the path is sane and you hold a file descriptor to the last component you still need to solve the problem that mount(2) only operates on paths. So you are still susceptible to symlink attacks as soon as you call mount(source, target, ...).

The way we solved this problem was by realizing that mount(2) was perfectly happy to operate on /proc/self/fd/<nr> paths (This is similar to how fexecve() used to work before the addition of the execveat() system call.). So we could verify the whole path and then open the last component of the source and target paths at which point we could call mount("/proc/self/fd/1234", "/proc/self/fd/5678", ...).

We immediately thought that if mount(2) allows you to do that then we could easily use this to mount into namespaces. So if the container is running it its mount namespace we could just create a bind mount on the host, open the newly created bind mount and then change to the container's mount namespace (and it's owning user namespace) and then simply call mount("/proc/self/fd/1234", "/mnt", ...). In pseudo C code it would look roughly:

fd_mnt = openat(-EBADF, "/opt", O_PATH, ...);
setns(fd_userns, CLONE_NEWUSER);
setns(fd_mntns, CLONE_NEWNS);
mount("/proc/self/fd/fd_mnt", "/mnt", ...);

However, this isn't possible as the kernel will enforce that the mounts that the source and target paths refer to are located in the caller's mount namespace. Since the caller will be located in the container's mount namespace after the setns() call but the source file descriptors refers to a mount located in the host's mount namespace this check fails. The semantics behind this are somewhat sane and straightforward to understand so there was no need to change them even though we were tempted. Back then it would've also meant that adding mounts to containers would've only worked on newer kernels and we were quite eager to enable this feature for kernels that were already released.

Mount namespace tunnels

So we came up with the idea of mount namespace tunnels. Since we spearheaded this idea it has been picked up by various projects such as systemd for system services and it's own systemd-nspawn container runtime.

The general idea as based on the observation that mount propagation can be used to function like a tunnel between mount namespaces:

mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt
# Create new mount namespace with all mounts turned into dependent mounts.
unshare --mount --propagation=slave

and then create a mount on or beneath the shared /opt mount on the host:

mkdir /opt/a
mount --bind /tmp /opt/a

then the new mount of /tmp on the dentry /opt/a will propagate into the mount namespace we created earlier. Since the /opt mount at the /opt dentry in the new mount namespace is a dependent mount we can now move the mount to its final location:

mount --move /opt/a /mnt

As a last step we can unmount /opt/a in the host mount namespace. And as long as the /mnt dentry doesn't reside on a mount that is a dependent mount of /opt's peer group the unmount of /opt/a we just performed on the host will only unmount the mount in the host mount namespace.

There are various problems with this solution:

  • It's complex.
  • The container manager needs to set up the mount tunnel when the container starts. In other words, it needs to part of the architecture of the container which is always unfortunate.
  • The mount at the endpoint of the tunnel in the container needs to be protected from being unmounted. Otherwise the container payload can just unmount the mount at its end of the mount tunnel and prevent the insertion of new mounts into the container.

Mounting into mount namespaces

A few years ago a new mount api made it into the kernel. Shortly after I've also added the mount_setattr(2) system call. Since then I've been expanding the abilities of this api and to put it to its full use.

Unfortunately the adoption of the new mount api has been slow. Mostly, because people don't know about it or because they don't yet see the many advantages it offers over the old one. But with the next release of the mount(8) binary a lot of us use the new mount api will be used whenever possible.

I won't be covering all the features that the mount api offers. This post just illustrates how the new mount api makes it possible to mount into mount namespaces and let's us get rid of the complex mount propagation scheme.

Luckily, the new mount api is designed around file descriptors.

Filesystem Mounts

To create a new filesystem mount using the old mount api is simple:

mount("/dev/sda", "/mnt", "xfs", ...);

We pass the source, target, and filesystem type and potentially additional mount options. This single system call does a lot behind the scenes. A new superblock will be allocated for the filesystem, mount options will be set, a new mount will be created and attached to a mountpoint in the caller's mount namespace.

In the new mount api the various steps are split into separate system calls. While this makes mounting more complex it allows allows for greater flexibility. Mounting doesn't have to be a fast operation and never has been.

So in the new mount api we would create a new filesystem mount with the following steps:

/* Create a new filesystem context. */
fd_fs = fsopen("xfs");

/*
 * Set the source of the filsystem mount. Whether or not this is required
 * depends on the type of filesystem of course. For example, mounting a tmpfs
 * filesystem would not require us to set the "source" property as it's not
 * backed by a block device. 
 */
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda", 0);

/* Actually create the superblock and prepare to allocate a mount. */
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

The fd_fs file descriptor refers to VFS context object that doesn't concern us here. Let it suffice that it is an opaque object that can only be used to configure the superblock and the filesystem until fsmount() is called:

/* Create a new detached mount and return an O_PATH file descriptor refering to the mount. */
fd_mnt = fsmount(fd_fs, 0, 0);

The fsmount() call will turn the context file descriptor into an O_PATH file descriptor that refers to a detached mount. A detached mount is a mount that isn't attached to any mount namespace.

Bind Mounts

The old mount api created bind mounts via:

mount("/opt", "/mnt", MNT_BIND, ...)

and recursive bind mounts via:

mount("/opt", "/mnt", MNT_BIND | MS_REC, ...)

Most people however will be more familiar with mount(8):

mount --bind /opt /mnt
mount --rbind / /mnt

Bind mounts play a major role in container runtimes and system services as run by systemd.

The new mount api supports bind mounts through the open_tree() system call. Calling open_tree() on an existing mount will just return an O_PATH file descriptor referring to that mount. But if OPEN_TREE_CLONE is specified open_tree() will create a detached mount and return an O_PATH file descriptor. That file descriptor is indistinguishable from an O_PATH file descriptor returned from the earlier fsmount() example:

fd_mnt = open_tree(-EBADF, "/opt", OPEN_TREE_CLONE, ...)

creates a new detached mount of /opt and:

fd_mnt = open_tree(-EBADF, "/", OPEN_TREE_CLONE | AT_RECURSIVE, ...)

would create a new detached copy of the whole rootfs mount tree.

Attaching detached mounts

As mentioned before the file descriptor returned from fsmount() and open_tree(OPEN_TREE_CLONE) refers to a detached mount in both cases. The mount it refers to doesn't appear anywhere in the filesystem hierarchy. Consequently, the mount can't be found by lookup operations going through the filesystem hierarchy. The new mount api thus provides an elegant mechanism for:

mount("/opt", "/mnt", MS_BIND, ...);
fd_mnt = openat(-EABDF, "/mnt", O_PATH | O_DIRECTORY | O_CLOEXEC, ...);
umount2("/mnt", MNT_DETACH);

and with the added benefit that the mount never actually had to appear anywhere in the filesystem hierarchy and thus never had to belong to any mount namespace. This alone is already a very powerful tool but we won't go into depth today.

Most of the time a detached mount isn't wanted however. Usually we want to make the mount visible in the filesystem hierarchy so other user or programs can access it. So we need to attach them to the filesystem hierarchy.

In order to attach a mount we can use the move_mount() system call. For example, to attach the detached mount fd_mnt we create before we can use:

move_mount(fd_mnt, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

This will attach the detached mount of /opt at the /mnt dentry on the / mount. What this means is that the /opt mount will be inserted into the mount namespace that the caller is located in at the time of calling move_mount(). (The kernel has very tight semantics here. For example, it will enforce that the caller has CAP_SYS_ADMIN in the owning user namespace of its mount namespace. It will also enforce that the mount the /mnt dentry is located on belongs to the same mount namespace as the caller.)

After move_mount() returns the mount is permanently attached. Even if it is unmounted while still pinned by a file descriptor will it still belong to the mount namespace it was attached to. In other words, move_mount() is an irreversible operation.

The main point is that before move_mount() is called a detached mount doesn't belong to any mount namespace and can thus be freely moved around.

Mounting a new filesystem into a mount namespace

To mount a filesystem into a new mount namespace we can make use of the split between configuring a filesystem context and creating a new superblock and actually attaching the mount to the filesystem hiearchy:

fd_fs = fsopen("xfs");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda", 0);
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
fd_mnt = fsmount(fd_fs, 0, 0);

For filesystems that require host privileges such as xfs, ext4, or btrfs (and many others) these steps can be performed by a privileged container or pod manager with sufficient privileges. However, once we have created a detached mounts we are free to attach to whatever mount and mountpoint we have privilege over in the target mount namespace. So we can simply attach to the user namespace and mount namespace of the container:

setns(fd_userns);
setns(fd_mntns);

and then use

move_mount(fd_mnt, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

to attach the detached mount anywhere we like in the container.

Mounting a new bind mount into a mount namespace

A bind mount is even simpler. If we want to share a specific host directory with the container we can just have the container manager call:

fd_mnt = open_tree(-EBADF, "/opt", OPEN_TREE_CLOEXEC | OPEN_TREE_CLONE);

to allocate a new detached copy of the mount and then attach to the user and mount namespace of the container:

setns(fd_userns);
setns(fd_mntns);

and as above we are free to attach the detached mount anywhere we like in the container.

Conclusion

This is really it and as simple as it sounds. It is a powerful delegation mechanism making it possible to inject mounts into lesser privileged mount namespace or unprivileged containers. We've making heavy use of this LXD and it is general the proper way to insert mounts into mount namespaces on newer kernels.

 
Read more...

from Konstantin Ryabitsev

At some point in the recent past, mutt changed the way it generates Message-ID header values. Instead of the perfectly good old way of doing it, the developers switched to using base64-encoded random bytes. The base64 dictionary contains the / character, which causes unnecessary difficulties when linking to these messages on lore.kernel.org, since the / character needs to be escaped as %2F for everything to work properly.

Mutt developers seem completely uninterested in changing this, so please save everyone a lot of trouble and do the following if you're using mutt for your kernel development needs (should work for all mutt versions):

  1. Create a ~/.mutt-hook-fix-msgid file with the following contents (change “mylaptop.local” to whatever you like):

    my_hdr Message-ID: <`uuidgen -r`@mylaptop.local>
    
  2. Add the following to your ~/.muttrc:

    send-hook . "source ~/.mutt-hook-fix-msgid"
    

UPDATE: if you have mutt 2.1 or later you can alternatively set the $message_id_format variable to restore the pre-mutt-2.0 behaviour:

# mutt-2.1+ only
set message_id_format = "<%Y%02m%02d%02H%02M%02S.G%c%p@%f>"

Thanks to Thomas Weißschuh for the suggestion!

 
Read more...

from Jakub Kicinski

NIC drivers pre-allocate memory for received packets. Once the packets arrive NIC can DMA them into the buffers, potentially hundreds of them, before host processing kicks in.

For efficiency reasons each packet-processing CPU (in extreme cases every CPU on the system) will have its own set of packet queues, including its own set of pre-allocated buffers.

The amount of memory pre-allocated for Rx is a product of:

  • buffer size
  • number of queues
  • queue depth

A reasonable example in data centers would be:

8k * 32 queues * 4k entries = 1GB

Buffer size is roughly dictated by the MTU of the network, for a modern datacenter network 8k (2 pages) is likely a right ballpark figure. Number of queues depends on the number of cores on the system and the request-per-second rate of the workload. 32 queues is a reasonable choice for the example (either 100+ threads or a network-heavy workload).

Last but not least – the queue depth. Because networking is bursty, and NAPI processing is at the whim of the scheduler (the latter is more of an issue in practice) the queue depths of 4k or 8k entries are not uncommon.

Can we do better?

Memory is not cheap, having 1GB of memory sitting around unused 99% of the time has a real cost. If we were to advise a NIC design (or had access to highly flexible devices like the Netronome/Corigine NICs) we could use the following scheme to save memory:

Normal processing rarely requires queue depth of more than 512 entries. We could therefore have smaller dedicated queues, and a larger “reserve” – a queue from which every Rx queue can draw, but which requires additional synchronization on the host side. To achieve the equivalent of 4k entries we'd only need:

8k * 32 queues * 512 entries + 8k * 1 reserve * 4k entries = 160MB

The NIC would try to use the 512 entries dedicated to each queue first, but if they run out (due to a packet burst or a scheduling delay) it could use the entries from the reserve. Bursts and latency spikes are rarely synchronized across the queues.

Can we do worse?

In practice memory savings are rarely top-of-mind for NIC vendors. Multiple drivers in Linux allocate a set of rings for each thread of the CPU. I can only guess that this is to make sure iperf tests run without a hitch...

As we wait for vendors to improve their devices – double check the queue count and queue size you use are justified (ethtool -g / ethtool -l).

 
Read more...

from kees

How to modernize C arrays for greater memory safety: a case-study in refactoring the Linux kernel and a look to the future

Kees Cook

C is not just a fancy assembler any more

Large projects written in C, especially those written close to the hardware layer like Linux, have long treated the language as a high-level assembler. Using C allowed for abstracting away much of the difficulty of writing directly in machine code while still providing easy low-level access to memory, registers, and CPU features. However, C has matured over the last half century, and many language features that improve robustness go unused in older codebases. This is especially true for arrays, where the historical lack of bounds checking has been a consistent source of security flaws.

Converting such codebases to use “modern” language features, like those in C99 (still from the prior millennium), can be a major challenge, but it is an entirely tractable problem. This post is a deep dive into an effort underway in the Linux kernel to make array index overflows (and more generally, buffer overflows) a thing of the past, where they belong. Our success hinges on replacing anachronistic array definitions with well-defined C99 flexible arrays. This approach can be used by developers to refactor C code, making it possible to leverage 21st century mitigations (like -fsanitize=bounds and FORTIFY_SOURCE), since such things can finally be cleanly applied to the modernized codebase.

The fraught history of arrays in C

For the compiler to successfully apply array index bounds checking, array sizes must be defined unambiguously, which is not always easy in C. Depending on the array definition, bounds checking falls roughly into three categories: fixed-sized arrays, dynamically-sized arrays, and pointer offsets. Each category of array definitions must be made unambiguous before the next, as they mostly build on top of each other. For example, if the compiler cannot protect a fixed-sized array, it certainly cannot protect a dynamically-sized array, and array indexing is just a specialized case of calculating a memory pointer offset.

Properly defined dynamically-sized arrays were introduced in C99 (int foo[]), and called “flexible arrays”. Before that, many C projects used the GNU extension of zero-length arrays (int foo[0]), which is not recognized by the C standard. This was done because, before the GNU extension, C projects would use single-element arrays (int foo[1]) which had several frustrating characteristics. (Using sizeof() on such a structure would include a single element as well, which would require additional handling to get allocation sizes to be accurate. This is not a problem for zero-element or true flexible arrays.)

However, due to yet more historical situations (e.g. struct sockaddr, which has a fixed-size trailing array that is not supposed to actually be treated as fixed-size), GCC and Clang actually treat all trailing arrays as flexible arrays. This behavior makes things even more problematic, since it becomes impossible to limit a flexible array heuristic to only 1-element or 0-element (i.e. zero-length) arrays. For example, a compiler can't tell the intent of variable's use here:

struct obj {
        ...
        unsigned char bytes;
        int variable[4];
};

Is it actually a 4 element array, or is it sized by the bytes member? As such, compilers have had to assume that trailing arrays must be intended to be dynamically sized (even though most are intended to be fixed-size).

To clear the way for sensible protection of fixed-size arrays, and to have a common framework for handling dynamically-sized arrays, Linux must have all the “fake” flexible array members replaced with actual C99 flexible array members so that the programmer's intent can actually be represented in an unambiguous way. With this done, -Warray-bounds (and similar things like __builtin_object_size()) will catch compile-time problems, and -fsanitize=bounds (and similar things like __builtin_dynamic_object_size()) can catch run-time problems.

Once fixed-sized arrays are protected, dynamically sized arrays can be protected as well, though this requires introducing a way to annotate structures that contain flexible arrays. Nearly all such structs also contain the count of allocated elements present in the flexible array:

struct obj {
        ...
        unsigned short count;
        struct foo items[]; /* Has "count" many "struct foo"s */
} *ptr;

Such structs therefore fully describe their contents at runtime (and are called “flexible array structures” from here on). In other words, their size can be determined at run-time as:

sizeof(*ptr) + sizeof(*ptr->items) * ptr->count

Teaching the compiler which struct member is associated with the count of a given flexible array member will allow -fsanitize=bounds and __builtin_dynamic_object_size() to reason about flexible array structure usage as well, covering all arrays in Linux with “known bounds”.

(Not covered here is the closely related work to tighten the FORTIFY_SOURCE implementation for the memcpy()-family of functions which also depends on making flexible array sizes unambiguous.)

Replacing “fake” flexible arrays

Compile-time diagnostics about the size of arrays use either internal value range checking or things similar to the FORTIFY_SOURCE macros (which use __builtin_object_size() for their implementations). This works well for arrays not at the end of the structure, but gets disabled for trailing arrays since the compiler must treat trailing arrays as flexible arrays (see struct sockaddr above). And for everything treated as a flexible array (i.e. dynamically sized), the compiler cannot know the array length at compile time, since it will be only known at runtime. To make such array declarations unambiguous (and therefore able to gain sane runtime bounds checking), compilers must gain an option to disable all “fake” flexible array heuristics, and treat only true flexible arrays as flexible arrays.

The creation of -fstrict-flex-arrays is now available in recent GCC and Clang builds, but any project using it will need to replace all fake flexible arrays with true flexible arrays first (to separate them from any fixed-size trailing arrays). This comes with several challenges.

Replace 0-length arrays

Most replacement of 0-length arrays with flexible arrays requires no special handling. Simply removing the “0” in the array declaration is sufficient. For example,

struct obj {
        ...
        int flex[0];
};

becomes:

struct obj {
        ...
        int flex[];

};

However, there are a few things of note that can go wrong with these conversions:

Changes to sizeof()

While sizeof(instance->flex) for a 0-length array returns 0, it becomes a compile-time failure once it becomes a true flexible array. This usually manifests within other complex macros that are examining the details of a given struct, and are usually hidden bugs that switching to a flexible array helps expose.

Pass by value

Converting to a true flexible array will expose any strange cases of trying to pass a flexible array struct by value. These are almost always a bug, so it's another case where a problem is exposed by cleaning up fake flexible arrays. For example:

net/core/flow_dissector.c: In function 'is_pppoe_ses_hdr_valid':
net/core/flow_dissector.c:898:13: note: the ABI of passing struct with a flexible array member has changed in GCC 4.4

898 | static bool is_pppoe_ses_hdr_valid(struct pppoe_hdr hdr)
    |                                   ^~~~~~~~~~~~~~~~~~~~~~

Flexible arrays in unions

C99 6.7.2.1 “Structure and union specifiers” #16 declares true flexible arrays may not be in unions nor otherwise empty structures: “As a special case, the last element of a structure with more than one named member may have an incomplete array type; this is called a flexible array member.”

However, this situation is allowed by the GNU “trailing array” extension, where such arrays are treated as flexible arrays. More importantly, flexible arrays (via the GNU extension) are used in unions in many places throughout Linux code. The C99 treatment of true flexible arrays appears to be only a definitional limitation (and likely just an oversight) since the restriction can be worked around with creative use of anonymous structs. For example, this will build:

struct obj {
        ...
        union {
                struct foo name1[0];
                struct bar name2[0];
        };
};

but this will not:

struct obj {
        ...
        union {
                struct foo name1[];
                struct bar name2[];
        };
};
<source>:5:22: error: flexible array member in union
  5 | struct foo name1[];
    |            ^~~~~

But in both cases, the compiler treats name1 and name2 as flexible arrays. What will happily compile, though, is wrapping true flexible arrays in a struct that has at least 1 other non-true-flexible array, including an empty anonymous struct (i.e. taking up no size):

struct obj {
        ...
        union {
                struct {
                        struct { } __unused_member1;
                        struct foo name1[];
                };
                struct {
                        struct { } __unused_member2;
                        struct bar name2[];
                };
        };
};

Thankfully, this was wrapped in Linux with the DECLARE_FLEX_ARRAY() macro:

struct obj {
        ...
        union {
                DECLARE_FLEX_ARRAY(struct foo, name1);
                DECLARE_FLEX_ARRAY(struct bar, name2);
        };
};

which makes this much more readable. I hope to see future C standards eliminate this restriction.

Overlapping composite structure members

This is another case of a real bug being exposed by true flexible array usage, as it is possible to create an implicit union of a flexible array and something else by including a flexible array structure in another struct. For example:

struct inner {
        ...
        int flex[0];
};

struct outer {
        ...
        struct inner header;
        int overlap;
        ...
} *instance;

Here, instance->overlap and instance->header.flex[0] share the same memory location. Whether or not this is intentional cannot be understood by the compiler. If it is a bug, then using a true flexible array will trigger a warning. If it's not a bug, rearranging the structures to use an actual union is needed (see above).

struct definition parsed by something other than a C compiler

If the converted struct is part of a source file that is parsed by something that is not a C compiler, it may not be prepared to handle empty square braces on arrays. For example, SWIG broke when the Linux Userspace API headers got converted. This is a known issue in SWIG, and can be worked around in various ways.

Replace 1-element arrays

Most 1-element array conversions are similar to 0-length array conversions, but with the effect that the surrounding structure's sizeof() changes. This leads to a few additional significant issues:

Size calculations

If a struct is used entirely internally to Linux, it is generally sufficient to make changes to both the struct and all size calculations, which will result in identical binary output. For example:

struct object {
        ...
        int flex[1];
} *p;

p = kmalloc(sizeof(*p) + sizeof(p->flex[0]) * (count - 1)),
            GFP_KERNEL);

the above count - 1 becomes just count now:

struct object {
        ...
        int flex[];
} *p;

p = kmalloc(sizeof(*p) + sizeof(p->flex[0]) * count),
            GFP_KERNEL);

If all size calculations are correctly adjusted, there should be no differences in the resulting allocation size, etc. If a discrepancy is found, it is going to be either a bug introduced by the conversion, or the discovery of an existing bug in the original size calculations.

Note that depending on the sizes of the structure, its flexible array element, and count, there is also the risk associated with arithmetic overflow. Linux uses the struct_size() macro to perform these calculations so that the result saturates to at most SIZE_MAX, which will cause an allocation failure rather than wrapping around. So the best way to perform this allocation would be:

p = kmalloc(struct_size(p, flex, count), GFP_KERNEL);

Padding and interface sizes

When a structure definition is also used by a codebase we don't control (e.g. firmware, userspace, virtualization), changing its layout or sizeof() may break such code. Specifically, it may break its ability to communicate correctly with the kernel across the shared interface. Such structures cannot suddenly lose the single element of its trailing array. In these cases, a new member needs to be used for kernel code, explicitly keeping the original member for backward compatibility. For example:

struct object {
        ...
        int flex[1];
};

becomes:

struct object {
        ...
        union {
                int flex[1];
                DECLARE_FLEX_ARRAY(int, data);
        };
};

Now the kernel will only use the newly named data member (and gain any potential bounds checking protections from the compiler), and external code that shares this structure definition can continue to use the flex member, all without changing the size of the structure.

This has the downside of needing to change the member name throughout Linux. However, if the other side of the interface doesn't actually use the original member, we can avoid this. We can convert the member to a flexible array and add explicit padding instead. This would mean no collateral changes with the member name in Linux are needed:

struct object {
        ...
        union {
                int __padding;
                DECLARE_FLEX_ARRAY(int, flex);
        };
};

Replace multi-element arrays

In the cases of trailing arrays with larger element counts, the usage needs to be even more carefully studied. Most problems end up looking very similar to 1-element interface conversions above. For example, if there is some hardware interface that returns at least 4 bytes for an otherwise dynamically sized array, the conversion would start from here:

struct object {
        ...
        unsigned char data[4];
};

which becomes:

struct object {
        ...
        union {
                unsigned char __padding[4];
                DECLARE_FLEX_ARRAY(unsigned char, data);
        };
};

Enable -Warray-bounds

With all fixed-size array bounds able to be determined at build time, -Warray-bounds can actually perform the checking, keeping provably bad code out of Linux. (This option is already part of -Wall, which Linux isn't quite able to use itself yet, but is strongly recommended for other C projects.) As a reminder, optimization level will impact this option. The kernel is built with -O2, which is likely the right choice for most C projects.

Enable -Wzero-length-array

If all zero length arrays have been removed from the code, future uses can be kept out of the code by using -Wzero-length-array. This option is currently only available in Clang, and will warn when finding the definition of such structure members, rather than warning when they are accessed in code. Because of this, it is unlikely to ever be enabled in Linux since some array sizes are constructed from build configurations, and may drop to 0 when they are unused (i.e. they were never used as flexible arrays). As such, it is sufficient to use -fstrict-flex-arrays (see below) and -Warray-bounds.

Enable -fstrict-flex-arrays

Once all the fake flexible arrays have been converted to true flexible arrays, the remaining fixed-sized trailing arrays can start being treated as actually fixed-size by enabling -fstrict-flex-arrays. Future attempts to add fake flexible arrays to the code will then elicit warnings as part of the existing diagnostics from -Warray-bounds, since all fake flexible arrays are now treated as fixed-size arrays. (Note that this option sees the subset of 0-length arrays caught by -Wzero-length-array when they are actually used in the code, so -Wzero-length-array may be redundant.)

Coming soon: annotate bounds of flexible arrays

With flexible arrays now a first-class citizen in Linux and the compilers, it becomes possible to extend their available diagnostics. What the compiler is missing is knowledge of how the length of a given flexible array is tracked. For well-described flexible array structs, this means associating the member holding the element count with the flexible array member. This idea is not new, though prior implementation proposals have wanted to make changes to the C language syntax. A simpler approach is the addition of struct member attributes, and is under discussion and early development by both the GCC and Clang developer communities.

Add __attribute__((__counted_by__(member)))

In order to annotate flexible arrays, a new attribute could be used to describe the relationship between struct members. For example:

struct object {
        ...
        signed char items;
        ...
        int flex[];
} *p;

becomes:

struct object {
        ...
        signed char items;
        ...
        int flex[] __attribute__((__counted_by__(items)));
} *p;

This would allow -fsanitize=bounds to check for out-of-bounds accesses. For example, given the above annotation, each of the marked access into p->flex should trap:

sum += p->flex[-1];  // trap all negative indexes
sum += p->flex[128]; // trap when index larger than bounds type
sum += p->flex[0];   // trap when p->items <= 0
sum += p->flex[5];   // trap when p->items <= 5
sum += p->flex[idx]; // trap when p->items <= idx || idx < 0

The type associated with the bounds check (signed char in the example above) should perhaps be required to be an unsigned type, but Linux has so many counters implemented as int that it becomes an additional refactoring burden to change these to unsigned, especially since sometimes they are sneakily being used with negative values in some other part of the code. Better to leave them as-is (though perhaps emit a warning), and just add a negativity check at access time. Switching the counter to unsigned then potentially becomes a small performance improvement.

Similar to -fsanitize=bounds above, __builtin_dynamic_object_size() will perform the expected calculations with the items member as the basis for the resulting size (and where values less than 0 are considered to be 0 to avoid pathological calculations):

p->items = 5;
assert(__builtin_dynamic_object_size(p, 1) ==
        sizeof(*p) + 5 * sizeof(*p->flex));
assert(__builtin_dynamic_object_size(p->flex, 1) ==
        5 * sizeof(*p->flex));
assert(__builtin_dynamic_object_size(&p->flex[0], 1) ==
        sizeof(*p->flex));
assert(__builtin_dynamic_object_size(&p->flex[2], 0) ==
        3 * sizeof(*p->flex));

p->items = -10;
assert(__builtin_dynamic_object_size(p, 0) == sizeof(*p));
assert(__builtin_dynamic_object_size(p, 1) == sizeof(*p));
assert(__builtin_dynamic_object_size(p->flex, 1) == 0);
assert(__builtin_dynamic_object_size(&p->flex[2], 1) == 0);

Additional attributes may be needed if structures explicitly use byte counts rather than element counts.

Scope considerations

Composite structures need to be able to define __counted_by__ across struct boundaries:

struct object {
        ...
        char items;
        ...
        struct inner {
                ...
                int flex[] __attribute__((__counted_by__(.items)));
        };
} *ptr;

This may mean passing &ptr->inner to a function will lose the bounds knowledge, but it may be possible to automatically include a bounds argument as an invisible function argument, as any function able to understand the layout of struct inner must by definition have visibility into the definition of struct object. For example, with this:

struct object instance;
...
func(&instance.inner);
...
void func(struct inner *ptr) {
        ...
        ptr->flex[foo]; /* "items" is not scope */
        ...
}

The prototype could either be rejected due to lack of available scope, or could be automatically converted into passing the outer object pointer with an injected scope:

void func(struct object *__ptr) {
        struct inner *ptr = &__ptr->inner;
        ...
        ptr->flex[foo]; /* __ptr->items is in scope */
        ...
}

Annotate kernel flexible array structs

With the compiler attribute available, all of Linux's flexible arrays can be updated to include the annotation, and CONFIG_FORTIFY_SOURCE can be expanded to use __builtin_dynamic_object_size().

Replace DECLARE_FLEX_ARRAY with DECLARE_BOUNDED_ARRAY

Most uses of DECLARE_FLEX_ARRAY() can be replaced with DECLARE_BOUNDED_ARRAY(), explicitly naming the expected flex array bounds member. For example, if we had:

struct obj {
        ...
        int items;
        ...
        union {
                DECLARE_FLEX_ARRAY(struct foo, name1);
                DECLARE_FLEX_ARRAY(struct bar, name2);
        };
};

it would become:

struct obj {
        ...
        int items;
        ...
        union {
                DECLARE_BOUNDED_ARRAY(struct foo, name1, items);
                DECLARE_BOUNDED_ARRAY(struct bar, name2, items);
        };
};

Add manual annotations

Any flexible array structures not already using DECLARE_BOUNDED_ARRAY() can be annotated manually with the new attribute. For example, assuming the proposed __attribute__((__counted_by__(member))) is wrapped in a macro named __counted_by():

struct obj {
        ...
        int items;
        ...
        int flex[];
};

becomes:

struct obj {
        ...
        int items;
        ...
        int flex[] __counted_by(items);
};

Future work: expand attribute beyond arrays

It will also be possible to use the new attribute on pointers and function arguments as well as flexible arrays. All the same details are available, though there would be the obvious differences for enclosing structure sizes, as the pointers are aimed (usually) outside the struct itself. Regardless, having it be possible to check offsets and inform __builtin_dynamic_object_size() would allow for several more places where runtime checking could be possible. For example, given this:

struct object {
        ...
        unsigned char items;
        ...
        int *data __attribute__((__counted_by__(items)));
        ...
} *p;

It should be possible to detect sizing information:

p->items = 5;
assert(__builtin_dynamic_object_size(p->data, 1) ==
        5 * sizeof(*p->data));
assert(__builtin_dynamic_object_size(*p->data, 1) ==
        sizeof(*p->data));
assert(__builtin_dynamic_object_size(*p->data, 0) ==
        5 * sizeof(*p->data));

And it should be possible to trap on the following bad accesses:

int *ptr = p->data;
sum += ptr[-1];  // trap all negative indexes
sum += ptr[500]; // trap when index larger than bounds type
sum += ptr[0];   // trap when p->items <= 0
sum += ptr[5];   // trap when p->items <= 5
ptr += 5;        // don't trap yet: allow ptr++ in a for loop
sum += *ptr;     // trap when p->items <= 5

A safer code base

A C codebase that has refactored all of its arrays into proper flexible arrays can now finally build by using:

        -Warray-bounds
        -fstrict-flex-arrays
        -fsanitize=bounds
        -fsanitize-undefined-trap-on-error
        -D_FORTIFY_SOURCE=3

With this, the burdens of C array index bounds checking will have been shifted to the toolchain, and array index overflow flaw exploitation can be a thing of the past, reducing severity to a simple denial of service (assuming the traps aren't handled gracefully). For the next trick, new code can be written in a language that is memory safe to start with (e.g. Rust).

Acknowledgements

Thanks to many people who gave me feedback on this post: Nick Desaulniers, Gustavo A. R. Silva, Bill Wendling, Qing Zhao, Kara Olive, Chris Palmer, Steven Rostedt, Allen Webb, Julien Voisin, Guenter Roeck, Evan Benn, Seth Jenkins, Alexander Potapenko, Ricardo Ribalda, and Kevin Chowski.

Discussion

Please join this thread with your thoughts, comments, and corrections. :)

 
Read more...

from mcgrof

kdevops logo

After 3 years since the announcement of the first release of kdevops I'd like to announce the release of v6.2-rc1 of kdevops!

kdevops is designed to help with automation of Linux kernel development workflows. At first, is was not clear how and if kdevops could be used outside of filesystems testing easily. In fact my last post about it 3 years ago explained how one could only use kdevops in an odd way for other things, one had to fork it to use it for different workflows. That's old nonsense now. kdevops has grown to adopt kconfig and so in one single tree different workflows are now possible. Embracing other things such as using jinja2 for file templating with ansible and having to figure out a way to add PCI-E passthrough support through kconfig has made me realize that the growth component to the project is no longer a concern, it is actually a feature now. It is clear now that new technologies and very complex workflows can easily be added to kdevops.

But it is easy to say unless you have proof, and fortunately I have it. There are two new technologies that go well supported in kdevops that folks who are curious can start mucking around with, which otherwise may take a bit of time to ramp up with. The technologies are: Zoned storage and CXL. Supporting new technologies also means ensuring you get whatever tooling you might need to want to test or work with such technologies.

So for instance, getting a full Linux kernel development workflow going for CXL with the meson unit tests, even by enabling PCI-E passthrough, with the latest linux-next kernel is now reduced to just a few basic commands, in a Linux distribution / cloud provider agnostic manner:

make dynconfig
make
make bringup
make linux
make cxl
make cxl-test-meson

Just ask around a typical CXL Linux kernel developer how long it took them to get a CXL Linux kernel development & test environment up and running that they were happy with. And ask if it was reproducible. This is all now just reduced to 6 commands.

As for the details, it has been 8 months since the last release, and over that time the project has received 680 commits. I'd like to thank the developers who contributed:

Adam Manzanares
Amir Goldstein
Chandan Babu R
Jeff Layton
Joel Granados
Josef Bacik
Luis Chamberlain
Pankaj Raghav

I'd also like to thank my employer for trusting in this work, and allowing me to share a big iron server to help the community with Linux kernel stable work and general kernel technology enablement.

As for the exact details of changes merged, there so many! So I've tried to provide a nice terse summary on highlights on the git tag for v6.2-rc1. 8 months was certainly a long time to wait for a new release, so my hope is we'll try to bake a release now in tandem with the Linux kernel, in cadence with the same Linux kernel versioning and release timeline.

Based on feedback at LSFMM this year the project is now under the github linux-kdevops organization. This enables other developers to push into the tree. This let's us scale, specially as different workflows are supported.

If you see value in enabling rapid ramp up with Linux kernel development through kdevops for your subsystem / technology / feel free to join the party and either send a pull request to the group or just send patches.

 
Read more...

from Konstantin Ryabitsev

While b4 started out as a way for maintainers to retrieve patches from mailing lists, it also has contributor-oriented features. Starting with version 0.10 b4 can:

  • create and manage patch series and cover letters
  • track and auto-reroll series revisions
  • display range-diffs between revisions
  • apply trailers received from reviewers and maintainers
  • submit patches without needing a valid SMTP gateway

These features are still considered experimental, but they should be stable for most work and I'd be happy to receive further feedback from occasional contributors.

In this article, we'll go through the process of submitting an actual typo fix patch to the upstream kernel. This bug was identified a few years ago and submitted via bugzilla, but never fixed:

Accompanying video

This article has an accompanying video where I go through all the steps and submit the actual patch at the end:

Installing the latest b4 version

Start by installing b4. The easiest is to do it via pip, as this would grab the latest stable version:

$ pip install --user b4
[...]
$ b4 --version
0.11.1

If you get an error or an older version of b4, please check that your $PATH contains $HOME/.local/bin where pip installs the binaries.

Preparing the tree

  • b4 prep -n [name-of-branch] -f [nearest-tag]

Next, prepare a topical branch where you will be doing your work. We'll be fixing a typo in arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts, and we'll base this work on tag v6.1:

$ b4 prep -n lanyang-dts-typo -f v6.1
Created new branch b4/lanyang-dts-typo
Created the default cover letter, you can edit with --edit-cover.

This is just a regular branch prepended with “b4/”:

$ git branch
* b4/lanyang-dts-typo
  master

You can do all the normal operations with it, and the only special thing about it is that it has an “empty commit” at the start of the series containing the template of our cover letter.

Editing the cover letter

  • b4 prep --edit-cover

If you plan to submit a single patch, then the cover letter is not that necessary and will only be used to track the destination addresses and changelog entries. You can delete most of the template content and leave just the title and sign-off. The tracking information json will always be appended to the end automatically — you don't need to worry about it.

Here's what the commit looks like after I edited it:

$ git cat-file -p HEAD
tree c7c1b7db9ced3eba518cfc1f711e9d89f73f8667
parent 830b3c68c1fb1e9176028d02ef86f3cf76aa2476
author Konstantin Ryabitsev <icon@mricon.com> 1671656701 -0500
committer Konstantin Ryabitsev <icon@mricon.com> 1671656701 -0500

Simple typo fix for the lanyang dts

Signed-off-by: Konstantin Ryabitsev <icon@mricon.com>

--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{
  "series": {
    "revision": 1,
    "change-id": "20221221-lanyang-dts-typo-8509e8ffccd4",
    "base-branch": "master",
    "prefixes": []
  }
}

Committing your work

You can add commits to this branch as you normally would with any other git work. I am going to fix two obvious typos in a single file and make a single commit:

$ git show HEAD
commit 820ce2d9bc7c88e1515642cf3fc4005a52e4c490 (HEAD -> b4/lanyang-dts-typo)
Author: Konstantin Ryabitsev <icon@mricon.com>
Date:   Wed Dec 21 16:17:21 2022 -0500

    arm: lanyang: fix lable->label typo for lanyang dts

    Fix an obvious spelling error in the dts file for Lanyang BMC.
    This was reported via bugzilla a few years ago but never fixed.

    Reported-by: Jens Schleusener <Jens.Schleusener@fossies.org>
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=205891
    Signed-off-by: Konstantin Ryabitsev <icon@mricon.com>

diff --git a/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts b/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts
index c0847636f20b..e72e8ef5bff2 100644
--- a/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts
+++ b/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts
@@ -52,12 +52,12 @@ hdd_fault {
                        gpios = <&gpio ASPEED_GPIO(B, 3) GPIO_ACTIVE_HIGH>;
                };
                bmc_err {
-                       lable = "BMC_fault";
+                       label = "BMC_fault";
                        gpios = <&gpio ASPEED_GPIO(H, 6) GPIO_ACTIVE_HIGH>;
                };

                sys_err {
-                       lable = "Sys_fault";
+                       label = "Sys_fault";
                        gpios = <&gpio ASPEED_GPIO(H, 7) GPIO_ACTIVE_HIGH>;
                };
        };

Collecting To: and Cc: addresses

  • b4 prep --auto-to-cc

After you've committed your work, you will want to collect the addresses of people who should be the ones reviewing it. Running b4 prep --auto-to-cc will invoke scripts/get_maintainer.pl with the default recommended flags to find out who should go into the To: and Cc: headers:

$ b4 prep --auto-to-cc
Will collect To: addresses using get_maintainer.pl
Will collect Cc: addresses using get_maintainer.pl
Collecting To/Cc addresses
    + To: Rob Herring <...>
    + To: Krzysztof Kozlowski <...>
    + To: Joel Stanley <...>
    + To: Andrew Jeffery <...>
    + Cc: devicetree@vger.kernel.org
    + Cc: linux-arm-kernel@lists.infradead.org
    + Cc: linux-aspeed@lists.ozlabs.org
    + Cc: linux-kernel@vger.kernel.org
    + Cc: Jens Schleusener <...>
---
You can trim/expand this list with: b4 prep --edit-cover
Invoking git-filter-repo to update the cover letter.
New history written in 0.06 seconds...
Completely finished after 0.33 seconds.

These addresses will be added to the cover letter and you can edit them to add/remove destinations using the usual b4 prep --edit-cover command.

Creating your patatt keypair for web endpoint submission

(This needs to be done only once.)

  • patatt genkey

Note: if you already have a PGP key and it's set as user.signingKey, then you can skip this section entirely.

Before we submit the patch, let's set up the keypair to sign our contributions. This is not strictly necessary if you are going to be using your own SMTP server to submit the patches, but it's a required step if you will use the kernel.org patch submission endpoint (which is what b4 will use in the absence of any [sendemail] sections in your git config).

The process is very simple. Run patatt genkey and add the resulting [patatt] section to your ~/.gitconfig as instructed by the output.

NOTE: You will want to back up the contents of your ~/.local/share/patatt so you don't lose access to your private key.

Dry-run and checkpatch

  • b4 send -o /tmp/tosend
  • ./scripts/checkpatch.pl /tmp/tosend/*

Next, generate the patches and look at their contents to make sure that everything is looking sane. Good things to check are:

  • the From: address
  • the To: and Cc: addresses
  • general patch formatting
  • cover letter formatting (if more than 1 patch in the series)

If everything looks sane, one more recommended step is to run checkpatch.pl from the top of the kernel tree:

$ ./scripts/checkpatch.pl /tmp/tosend/*
total: 0 errors, 0 warnings, 14 lines checked

/tmp/tosend/0001-arm-lanyang-fix-lable-label-typo-for-lanyang-dts.eml has no obvious style problems and is ready for submission.

Register your key with the web submission endpoint

(This needs to be done only once, unless you change your keys.)

  • b4 send --web-auth-new
  • b4 send --web-auth-verify [challenge]

If you're not going to use your own SMTP server to send the patch, you should register your new keypair with the endpoint:

$ b4 send --web-auth-new
Will submit a new email authorization request to:
  Endpoint: https://lkml.kernel.org/_b4_submit
      Name: Konstantin Ryabitsev
  Identity: icon@mricon.com
  Selector: 20221221
    Pubkey: ed25519:24L8+ejW6PwbTbrJ/uT8HmSM8XkvGGtjTZ6NftSSI6I=
---
Press Enter to confirm or Ctrl-C to abort
Submitting new auth request to https://lkml.kernel.org/_b4_submit
---
Challenge generated and sent to icon@mricon.com
Once you receive it, run b4 send --web-auth-verify [challenge-string]

The challenge is a UUID4 string and this step is a simple verification that you are able to receive email at the address you want associated with this key. Once you receive the challenge, complete the process as described:

$ b4 send --web-auth-verify 897851db-9b84-4117-9d82-1d970f9df5f8
Signing challenge
Submitting verification to https://lkml.kernel.org/_b4_submit
---
Challenge successfully verified for icon@mricon.com
You may now use this endpoint for submitting patches.

OR, set up your [sendemail] section

You don't have to use the web endpoint — it exists primarily for people who are not able or not willing to set up their SMTP information with git. Setting up a SMTP gateway is not a straightforward process for many:

  • platforms using OAuth require setting up “application-specific passwords”
  • some companies only provide Exchange or browser-based access to email and don't offer any other way to send mail
  • some company SMTP gateways rewrite messages to add lengthy disclaimers or rewrite links to quarantine them

However, if you have access to a functional SMTP gateway, then you are encouraged to use it instead of submitting via the web endpoint, as this ensures that the development process remains distributed and not dependent on any central services. Just follow instructions in man git-send-email and add a valid [sendemail] section to your git config. If b4 finds it, it will use it instead of relying on the web endpoint.

[sendemail]
    smtpEncryption = tls
    smtpServer = smtp.gmail.com
    smtpServerPort = 465
    smtpEncryption = ssl
    smtpUser = yourname@gmail.com
    smtpPass = your-gmail-app-password

Reflect the email to yourself

  • b4 send --reflect

This is the last step to use before sending off your contribution. Note, that it will fill out the To: and Cc: headers of all messages with actual recipients, but it will NOT actually send mail to them, just to yourself. Mail servers don't actually pay any attention to those headers — the only thing that matters to them is what was specified in the RCPT TO outer envelope of the negotiation.

This step is particularly useful if you're going to send your patches via the web endpoint. Unless your email address is from one of the following domains, the From: header will be rewritten in order to not violate DMARC policies:

  • @kernel.org
  • @linuxfoundation.org
  • @linux.dev

If your email domain doesn't match the above, the From: header will be rewritten to be a kernel.org dummy address. Your actual From: will be added to the body of the message where git expects to find it, and the Reply-To: header will be set so anyone replying to your message will be sending it to the right place.

Send it off!

  • b4 send

If all your tests are looking good, then you are ready to send your work. Fire off “b4 send”, review the “Ready to:” section for one final check and either Ctrl-C to get out of it, or hit Enter to submit your work upstream.

Coming up next

In the next post, I will go over:

  • making changes to your patches using: git rebase -i
  • retrieving and applying follow-up trailers using: b4 trailers -u
  • comparing v2 and v1 to see what changes you made using: b4 prep --compare-to v1
  • adding changelog entries using: b4 prep --edit-cover

Documentation

All contributor-oriented features of b4 are documented on the following site:

 
Read more...

from joelfernandes

Below are some notes I wrote while studying hrtimer slack behavior (range timers), which was added to reduce wakeups and save power, in the commit below. The idea is that: 1. Normal hrtimers will have both a soft and hard expiry which are equal to each other. 2. But hrtimers with timer slack will have a soft expiry and a hard expiry which is the soft expiry + delta.

The slack/delay effect is achieved by splitting the execution of the timer function, and the programming of the next timer event into 2 separate steps. That is, we execute the timer function as soon as we notice that its soft expiry has passed (hrtimer_run_queues()). However, for programming the next timer interrupt, we only look at the hard expiry (hrtimer_update_next_event() –> __hrtimer_get_next_event() –> __hrtimer_next_event_base()–>hrtimer_get_expires()). As a result, the only way a slack-based timer will execute before its slack time elapses, is, if another timer without any slack time gets queued such that it hard-expires before the slack time of the slack-based timer passes.

The commit containing the original code added for range timers is:

commit 654c8e0b1c623b156c5b92f28d914ab38c9c2c90
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Mon Sep 1 15:47:08 2008 -0700

    hrtimer: turn hrtimers into range timers
   
    this patch turns hrtimers into range timers;
    they have 2 expire points
    1) the soft expire point
    2) the hard expire point
   
    the kernel will do it's regular best effort attempt to get the timer run at the hard expire point. However, if some other time fires after the soft expire point, the kernel now has the freedom to fire this timer at this point, and thus grouping the events and preventing a power-expensive wakeup in the future.

The original code seems a bit buggy. I got a bit confused about how/where we handle the case in hrtimer_interrupt() where other normal timers that expire before the slack time elapses, have their next timer interrupt programmed correctly such that the interrupt goes off before the slack time passes.

To see the issue, consider the case where we have 2 timers queued:

  1. The first one soft expires at t = 10, and say it has a slack of 50, so it hard expires at t = 60.

  2. The second one is a normal timer, so the soft/hard expiry of it is both at t = 30.

Now say, an hrtimer interrupt happens at t=5 courtesy of an unrelated expiring timer. In the below code, we notice that the next expiring timer is (the one with slack one), which has not soft-expired yet. So we have no reason to run it. However, we reprogram the next timer interrupt to be t=60 which is its hard expiry time (this is stored in expires_next to use as the value to program the next timer interrupt with).  Now we have a big problem, because the timer expiring at t=30 will not run in time and run much later.

As shown below, the loop in hrtimer_interrupt() goes through all the active timers in the timerqueue, _softexpires is made to be the real expiry, and the old _expires now becomes _softexpires + slack.

       while((node = timerqueue_getnext(&base->active))) {
              struct hrtimer *timer;

              timer = container_of(node, struct hrtimer, node);

              /*
               * The immediate goal for using the softexpires is
               * minimizing wakeups, not running timers at the
               * earliest interrupt after their soft expiration.
               * This allows us to avoid using a Priority Search
               * Tree, which can answer a stabbing querry for
               * overlapping intervals and instead use the simple
               * BST we already have.
               * We don't add extra wakeups by delaying timers that
               * are right-of a not yet expired timer, because that
               * timer will have to trigger a wakeup anyway.
               */

              if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer)) {
                      ktime_t expires;

                      expires = ktime_sub(hrtimer_get_expires(timer),
                                          base->offset);
                      if (expires.tv64 < expires_next.tv64)
                              expires_next = expires;
                      break;
              }

              __run_hrtimer(timer, &basenow);
      }

However, this seems to be an old kernel issue, as, in upstream v6.0, I believe the next hrtimer interrupt will be programmed correctly because __hrtimer_next_event_base() calls hrtimer_get_expires() which correctly use the “hard expiry” times to do the programming.

As of v6.2, the __hrtimer_run_queues() function looks like this:

static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
				 unsigned long flags, unsigned int active_mask)
{
	struct hrtimer_clock_base *base;
	unsigned int active = cpu_base->active_bases & active_mask;

	for_each_active_base(base, cpu_base, active) {
		struct timerqueue_node *node;
		ktime_t basenow;

		basenow = ktime_add(now, base->offset);

		while ((node = timerqueue_getnext(&base->active))) {
			struct hrtimer *timer;

			timer = container_of(node, struct hrtimer, node);

			/*
			 * The immediate goal for using the softexpires is
			 * minimizing wakeups, not running timers at the
			 * earliest interrupt after their soft expiration.
			 * This allows us to avoid using a Priority Search
			 * Tree, which can answer a stabbing query for
			 * overlapping intervals and instead use the simple
			 * BST we already have.
			 * We don't add extra wakeups by delaying timers that
			 * are right-of a not yet expired timer, because that
			 * timer will have to trigger a wakeup anyway.
			 */
			if (basenow < hrtimer_get_softexpires_tv64(timer))
				break;

			__run_hrtimer(cpu_base, base, timer, &basenow, flags);
			if (active_mask == HRTIMER_ACTIVE_SOFT)
				hrtimer_sync_wait_running(cpu_base, flags);
		}
	}
}

The utilization of hrtimer_get_softexpires_tv64() might be perplexing, as it may raise the question of how this loop expires non-slack timers that possess only a hard expiry time. To clarify, it's important to note that what was once referred to as expiry is now considered soft expiry for non-slack timers. Consequently, the condition basenow < hrtimer_get_softexpires_tv64(timer) is capable of expiring both slack and non-slack timers effectively.

 
Read more...

from Jakub Kicinski

Kernel TLS implements the record encapsulation and cryptography of the TLS protocol. There are four areas where implementing (a portion of) TLS in the kernel helps:

  • enabling seamless acceleration (NIC or crypto accelerator offload)
  • enabling sendfile on encrypted connections
  • saving extra data copies (data can be encrypted as it is copied into the kernel)
  • enabling the use of TLS on kernel sockets (nbd, NFS etc.)

Kernel TLS handles only data records turning them into a cleartext data stream, all the control records (TLS handshake etc.) get sent to the application via a side channel for user space (OpenSSL or such) to process. The first implementation of kTLS was designed in the good old days of TLS 1.2. When TLS 1.3 came into the picture the interest in kTLS had slightly diminished and the implementation, although functional, was rather simple and did not retain all the benefits. This post covers developments in the Linux 5.20 implementation of TLS which claws back the performance lost moving to TLS 1.3. One of the features we lost in TLS 1.3 was the ability to decrypt data as it was copied into the user buffer during read. TLS 1.3 hides the true type of the record. Recall that kTLS wants to punt control records to a different path than data records. TLS 1.3 always populates the TLS header with application_data as the record type and the real record type is appended at the end, before record padding. This means that the data has to be decrypted for the true record type to be known.

Problem 1 – CoW on big GRO segments is inefficient

kTLS was made to dutifully decrypt the TLS 1.3 records first before copying the data to user space. Modern CPUs are relatively good at copying data, so the copy is not a huge problem in itself. What’s more problematic is how the kTLS code went about performing the copy. The data queued on TCP sockets is considered read-only by the kernel. The pages data sits in may have been zero-copy-sent and for example belong to a file. kTLS tried to decrypt “in place” because it didn’t know how to deal with separate input/output skbs. To decrypt “in place” it calls skb_cow_data(). As the name suggests this function makes a copy of the memory underlying an skb, to make it safe for writing. This function, however, is intended to be run on MTU-sized skbs (individual IP packets), not skbs from the TCP receive queue. The skbs from the receive queue can be much larger than a single TLS record (16kB). As a result TLS would CoW a 64kB skb 4 times to extract the 4 records inside it. Even worse if we consider that the last record will likely straddle skbs so we need to CoW two 64kB skbs to decrypt it “in place”. The diagram below visualizes the problem and the solution. SKB CoW The possible solutions are quite obvious – either create a custom version of skb_cow_data() or teach TLS to deal with different input and output skbs. I opted for the latter (due to further optimizations it enables). Now we use a fresh buffer for the decrypted data and there is no need to CoW the big skbs TCP produces. This fix alone results in ~25-45% performance improvement (depending on the exact CPU SKU and available memory bandwidth). A jump in performance from abysmal to comparable with the user space OpenSSL.

Problem 2 – direct decrypt

Removing pointless copies is all well and good, but as mentioned we also lost the ability to decrypt directly to the user space buffer. We still need to copy the data to user space after it has been decrypted (A in the diagram below, here showing just a single record not full skb). SKB direct decrypt We can’t regain the full efficiency of TLS 1.2 because we don’t know the record type upfront. In practice, however, most of the records are data/application records (records carrying the application data rather than TLS control traffic like handshake messages or keys), so we can optimize for that case. We can optimistically decrypt to the user buffer, hoping the record contains data, and then check if we were right. Since decrypt to a user space buffer does not destroy the original encrypted record if we turn out to be wrong we can decrypting again, this time to a kernel skb (which we can then direct to the control message queue). Obviously this sort of optimization would not be acceptable in the Internet wilderness, as attackers could force us to waste time decrypting all records twice. The real record type in TLS 1.3 is at the tail of the data. We must either trust that the application will not overwrite the record type after we place it in its buffer (B in the diagram below), or assume there will be no padding and use a kernel address as the destination of that chunk of data (C). Since record padding is also rare – I chose option (C). It improves the single stream performance by around 10%.

Problem 3 – latency

Applications tests have also showed that kTLS performs much worse than user space TLS in terms of the p99 RPC response latency. This is due to the fact that kTLS holds the socket lock for very long periods of time, preventing TCP from processing incoming packets. Inserting periodic TCP processing points into the kTLS code fixes the problem. The following graph shows the relationship between the TCP processing frequency (on the x axis in kB of consumed data, 0 = inf), throughput of a single TLS flow (“data”) and TCP socket state. TCP CWND SWND The TCP-perceived RTT of the connection grows the longer TLS hogs the socket lock without letting TCP process the ingress backlog. TCP responds by growing the congestion window. Delaying the TCP processing will prevent TCP from responding to network congestion effectively, therefore I decided to be conservative and use 128kB as the TCP processing threshold. Processing the incoming packets has the additional benefit of TLS being able to consume the data as it comes in from the NIC. Previously TLS had access to the data already processed by TCP when the read operation began. Any packets coming in from the NIC while TLS was decrypting would be backlogged at TCP input. On the way to user space TLS would release the socket lock, allowing the TCP backlog processing to kick in. TCP processing would schedule a TLS worker. TLS worker would tell the application there is more data.

 
Read more...

from linusw

We are discussing and working toward adding the language Rust as a second implementation language in the Linux kernel. A year ago Jake Edge made an excellent summary of the discussions so far on Rust for the Linux kernel and we (or rather Miguel and Wedson) have made further progress since then. For the record I think this is overall a good idea and worth a try. I wanted to add some background that was sketched in a mail thread for the kernel summit.

TL;DR: my claim is that Rust is attempting to raise the abstraction in the programming language and ultimately to join computer science and software engineering into one single discipline, an ambition that has been around since these disciplines were created.

Beginning with ALGOL

The first general high-level language was FORTRAN, which is still in use for some numerical analysis tasks around the world. Then came ALGOL, which attracted a wider audience.

The first “real” operating system (using virtual memory etc) for the Atlas Machine supervisor in 1962 was as far as I can tell implemented in Atlas autocode which was a dialect of ALGOL, which was the lingua franca at the time. Pure ALGOL could not be used because ALGOL 60 had no input/output primitives, so every real-world application of ALGOL, i.e. any application not solely relying on compiled-in constants, required custom I/O additions.

Algol specifications Copies of the first specifications of ALGOL 60, belonging at one time to Carl-Erik Fröberg at Lund University.

ALGOL inspired CPL that inspired BCPL that inspired the B programming language that inspired the C programming language, which we use for the Linux kernel.

Between 1958 and 1968 ALGOL was the nexus in a wide attempt to join computer languages with formal logic. In this timespan we saw the ALGOL 58, ALGOL 60 and ALGOL 68 revisions come out. The outcome was that it established computer science as a discipline and people could start building their academic careers on that topic. One notable outcome was the BNF form for describing syntax in languages. This time was in many ways formative for computer science: the first three volumes of Donald Knuths The Art of Computer Programming were published in close proximity to these events.

To realize that ALGOL was popular and widespread at the time that Unix was born, and that C was in no way universally accepted, it would suffice to read a piece of the original Bourne Shell source code tree for example:

setlist(arg,xp)
	REG ARGPTR	arg;
	INT		xp;
{
	WHILE arg
	DO REG STRING	s=mactrim(arg->argval);
	   setname(s, xp);
	   arg=arg->argnxt;
	   IF flags&execpr
	   THEN prs(s);
		IF arg THEN blank(); ELSE newline(); FI
	   FI
	OD
}

This doesn't look much like C as we know it, it looks much more like ALGOL 68. The ALGOL 68 definition added constructions such as IF/FI, DO/OD etc, which were not present in ALGOL 60. The reason is that Stephen Bourne was an influential contributor to ALGOL 68 and created a set of macros so that the C preprocessor would turn his custom dialect of ALGOL into C, for which I think someone on Reddit suggested to nominate bash for the obfuscated C contest.

This is just one of the instances where we can see that the C programming language was not universally loved. The Bourne Shell scripting language that we all love and use is also quite close to ALGOL 68, so the descendants of this language is used more than we may think.

Around 1970 Niklaus Wirth was working to improve ALGOL68 with what he called ALGOL W. Tired of the slowness of the language committee process he forked ALGOL and created the programming language Pascal which was a success in its own right. In his very interesting IEEE article named A Brief History of Software Engineering Professor Wirth gives his perspective on some of the events around that time: first he writes about the very influential NATO conference on software engineering 1968 in Garmisch, Germany which served to define software engineering as a distinct discipline. To counter the so-called software crisis – the problems presented by emerging large complex systems – the suggestion was to raise the abstraction in new languages.

To raise the abstraction means to use more mathematical, machine independent constructs in the language. First consider the difference between low-level and high-level languages: a simple operation such as x = x + 1 is not high level, and just a fancy assembly instruction; if we compile it we can readily observe the resulting code in some kind of ADD instruction in the resulting object code. However a[i] = x + 1 raises abstraction past the point of high-level languages. This is because indexing into an array requires knowledge of the target machine specifics: base addresses, memory layout, etc. This makes the instruction more high-level and thus raises the abstraction of the language. The assumption is that several further higher levels of abstraction exist. We will look into some of these languages in the following sections.

The Garmisch conference is famous in Unix circles because Douglas McIlroy was present and presented his idea of componentized software as a remedy against rising complexity, an idea that was later realized in the form of Unix's pipes and filters mechanism. D-Bus and similar component interoperation mechanisms are contemporary examples of such software componentry — another way to counter complexity and make software less fragile, but not the focus in this article.

Wirth makes one very specific and very important observation about the Garmisch conference:

Ultimately, analytic verification and correctness proofs were supposed to replace testing.

This means exactly what it says: with formally verified programming languages, all the features and constructs that are formally proven need not be tested for. Software engineering is known for advocating test-driven development (TDD) to this day, and the ambition was to make large chunks of TDD completely unnecessary. Software testing has its own chapter in the mentioned report from the Garmisch NATO conference where the authors A.I. Llewelyn and R.F. Wickens conclude:

There are, fundamentally, two different methods of determining whether a product meets its specification. One can analyse the product in great detail and from this determine if it is in accordance with its specification, or one can measure its performance experimentally and see if the results are in accord with the specification; the number and sophistication of the experiments can be varied to provide the degree of confidence required of the results.

The first part of this paragraph i.e. “analyze in great detail” is what Wirth calls analytic verification and is today called formal verification. The latter part of this paragraph is what we call test-driven development, TDD. Also: the former is a matter of computer science, while the latter is a matter of software engineering. So here is a fork in the road.

Wirth also claims the discussions in Garmisch had a distinct influence on Pascal. This can be easily spotted in Pascal strings, which was one of his principal improvements over ALGOL: Pascal strings are arrays of char, but unlike C char, a Pascal char is not the same as a byte; instead it is defined as belonging to an “ordered character set”, which can very well be ISO8859-1 or Unicode, less, more or equal to 255 characters in size. Strings stored in memory begin with an positive integer array length which defines how long the string is, but this is none of the programmer's business, this shall be altered by the language runtime and not by any custom code. Indexing out of bounds is therefore not possible and can be trivially prohibited during compilation and at runtime. This raises the abstraction of strings: they are set-entities, they have clear boundaries, they need special support code to handle the length field in memory. Further Pascal also has set types, such as:

var
    JanuaryDays : set of 1..31;

Perhaps Pascal's application to real-world problems didn't work out as expected, as it has since also defined PChar as a NULL-terminated pointer to a sequence of characters, akin to C strings. However it should be noted that Pascal pointers are persistently typed and cannot be converted: casting is not possible in Pascal. A Pascal pointer to an integer is always a pointer to an integer.

From Wirth's perspective, C “presented a great leap backward” and he claims “it revealed that the community at large had hardly grasped the true meaning of the term 'high-level language' which became an ill-understood buzzword”. He attributes the problem to Unix which he says “acted like a Trojan horse for C”. He further details the actual technical problems with C:

C offers abstractions which it does not in fact support: Arrays remain without index checking, data types without consistency check, pointers are merely addresses where addition and subtraction are applicable. One might have classified C as being somewhere between misleading and even dangerous.

His point about C lacking index checking is especially important: it can be brought into question if C is really a high-level language. It is not fully abstracting away the machine specifics of handling an array. Language theorists can occasionally refer to C as a “big macro assembler”, the only thing abstracted away is really the raw instruction set.

Wirth however also goes on to state the appealing aspects of the C programming language:

people at large, particularly in academia, found it intriguing and “better than assembly code” (...) its rules could easily be broken, exactly what many programmers cherished. It was possible to manage access to all of a computer’s idiosyncracies, to items that a high-level language would properly hide. C provided freedom, where high-level languages were considered as straight-jackets enforcing unwanted discipline. It was an invitation to use tricks which had been necessary to achieve efficiency in the early days of computers.

We can see why an efficiency-oriented operating system kernel such as Linux will tend toward C.

It's not like these tricks stopped after the early days of computing. Just the other day I wrote a patch for Linux with two similar code paths, which could be eliminated by cast:ing a (const void *) into a (void *) which I then quipped about in the commit message of the revised patch. The reason for violating formal rules in this case — is that of a choice between two evils, and chosing the lesser evil: in a choice between formal correctness and code reuse I chose code reuse. And C enables that kind of choice. The languages presented later in this article absolutely do not allow that kind of choice, and C casts are seen as nothing less than an abomination.

The language family including C and also Pascal is referred to as imperative programming languages. The defining character is that the programmer “thinks like a computer” or imagine themselves as the program counter to be exact. “First I do this, next I do this, then I do this” – a sequence of statements executed in order, keeping the computer state (such as registers, memory locations and stacks) in the back of your head.

The immediate appeal to operating system programmers should be evident: this closely models what an OS developer needs to keep in mind, such as registers, stacks, cache frames, MMU tables, state transitions in hardware and so on. It is possible to see the whole family of imperative languages as domain specific languages for the domain of writing operating systems, so it would be for operating system developers what OpenGL is for computer graphics software developers.

Lambda Calculus for Defining Languages

In 1966 one of the early adopters and contributors to ALGOL (alongside Peter Naur, Tony Hoare and Niklaus Wirth), Peter Landin, published two articles in the Journal of the ACM titled Correspondence between ALGOL 60 and Church's Lambda-notation part I and part II. In the first article he begins with a good portion of dry humour:

Anyone familiar with both Church's λ-calculi and ALGOL 60 will have noticed a superficial resemblance between the way variables tie up with the λ's in a nest of λ-expressions, and the way that identifiers tie up with the headings in a nest of procedures and blocks.

He is of course aware that no-one beside himself had been in the position to realize this: the overlap between people familiar with Alonzo Church's λ-calculus and with ALGOL 60 was surprisingly down to one person on the planet. What is surprising is that it was even one person.

Alonzo Church was a scholar of mathematical logic and computability, the supervisor of Alan Turing's doctoral thesis and active in the same field as Kurt Gödel (those men quoted each other in their respective articles). The lambda calculus ties into the type set theory created by Bertrand Russell and the logical-mathematical programme, another universe of history we will not discuss here.

What λ-calculus (Lambda-calculus) does for a programming language definition is analogous to what regular expressions does for a languages syntax, but for it's semantics. While regular expressions can express how to parse a body of text in a language with regular grammar, expressions in λ-calculus can go on from the abstract syntax tree and express what an addition is, what a subtraction is, or what a bitwise OR is. This exercise is seldomly done in e.g. compiler construction courses, but defining semantics is an inherent part of a programming language definition.

Perhaps the most remembered part of Landin's papers is his humorous term syntactic sugar which denotes things added to a language to make the life of the programmer easier, but which has no semantic content that cannot be expressed by the basic features of the language. The basic mathematical features of the language, on the other hand, are best expressed with λ-calculus.

A notable invention in Landin's first article about defining ALGOL in terms of λ-calculus are the keywords let and where chosen to correspond to λ-calculus' Applicable Expressions. These keywords do not exist in ALGOL: they are part of a language to talk about a language, or in more complicated terms: a meta-language. So here we see the first steps toward a new language derived from λ-calculus. Landin does not give this language a name in this article, but just refers to it as “AE”. The AE executes in a theoretical machine called SECD, which is another trick of the trade, like Alan Turings “turing machine”: rather close to a mathematicians statement “let's assume we have...” The complete framework for defining ALGOL in λ-calculus is called AE/SECD.

Functional Programming

Functional programming languages then, implements lambda calculus. The central idea after some years of experience with defining languages such as ALGOL in terms of lambda calculus, is to just make the language resemble lambda calculus expressions to begin with, and the verification of the semantics will be simple and obvious.

In 1966 Peter Landin followed up his articles using λ-calculus to describe ALGOL with his article The Next 700 Programming Languages. Here he invents functional programming in the form of an invented language called ISWIM (If You See What I Mean), as you can see again with a good dry humour. The language is λ-calculus with “syntactic sugar” on top, so a broad family of languages are possible to create using the framework as a basis. Landin's article was popular, and people did invent languages. Maybe not 700 of them. Yet.

In section 10 of his article, named Eliminating explicit sequencing, Landin starts speculating and talks about a game that can be played with ALGOL: by removing any goto statements and labels, the program get a less sequential nature, i.e. the program counter is just advancing to the next line or iterating a loop. He quips:

What other such features are there? This question is considered because, not surprisingly, it turns out that an emphasis on describing things in terms of other things leads to the same kind of requirements as an emphasis against explicit sequencing.

He then goes on to show how to transform an ALGOL program into a purely functional ISWIM program and concludes:

The special claim of ISWlM is that it grafts procedural notions onto a purely functional base without disturbing many of the desirable properties. (...) This paper can do no more than begin the task of explaining their practical significance.

This reads as a call to action: we need to create functional programming languages akin to ISWIM, and we need to get rid of the J operator (the program control flow operator). Landin never did that himself.

The Meta Language ML

A few years later, in 1974, computer scientist Robin Milner, inspired by ISWIM and as a response to Landin's challenge, created the language ML, short for Meta Language. This is one of the 700 next languages and clearly recognized Landin's ideas about a language for defining languages, a grammar for defining grammar: a meta language with a meta grammar.

He implemented the language on the DEC10 computer with the help of Malcolm Newey, Lockwood Morris, Mike Gordon and Chris Wadswort. The language was later ported to the VAX architectures.

The language was based on ISWIM and dropped the so-called J operator (program point operator). It is domain-specific, and intended for authoring a tool for theorem proving called LCF. Standard ML has been fully semantically specified and formally verified. This language became widely popular, both in academia and industry.

Removing the J operator made ML a declarative language, i.e. it does not specify the order of execution of statements, putting it in the same class of languages as Prolog or for that matter: Makefiles: there is no control flow in a Makefile, just a number of conditions that need to be evaluated to arrive at a complete target.

ML still has one imperative language feature: assignment. Around this time, some scholars thought both the J operator and assignment were unnecessary and went on to define purely functional languages such as Haskell. We will not consider them here, they are outside the scope of this article. ML and everything else we discuss can be labelled as impure: a pejorative term invented by people who like purely functional languages. These people dislike not only the sequencing nature of imperative languages but also the assignment (such as happens with the keyword let) and prefer to think about evaluating relationships between abstract entities.

ML can be grasped intuitively. For example this expression in ML evaluates to the integer 64:

let
    val m : int = 4
    val n : int = m*m
in
    m*n
end

Here we see some still prominent AE/SECD, ISIWM features such as the keyword let for binding variables, or rather, associate names with elements such as integers and functions (similar to := assignment in some languages). The we see an implementation section in. We can define functions in ML, like this to compute the square root of five times x:

val rootfivex : real -> real =
    fn x : real => Math.sqrt (5.0 * x)

Notice absence of constructs such as BEGIN/END or semicolons. ML, like Python and other languages use whitespace to find beginning and end of basic blocks. The notation real –> real clearly states that the function takes a real number as input and produces a real number as output. The name real reflects some kind of mathematical ambition. The language cannot handle the mathematical set of real numbers — the ML real is what other languages call a float.

ML has more syntactic sugar, so the following is equivalent using the keyword fun (fun-notation):

fun rootfivex (x:real):real = Math.sqrt (5.0 * x)

The syntax should be possible to grasp intuitively. Another feature of ML and other functional languages is that they easily operate on tuples i.e. an ordered sequence of variables, and tuples can also be returned from functions. For example you can calculate the distance between origin and two coordinates in a x/y-oriented plane like this:

fun dist (x:real, y:real):real = Math.sqrt (x*x + y*y)

This function can then be called elsewhere like this:

val coor (x:real, y:real)
val d = dist(coor)

The type real of d will be inferred from the fact that the dist() function returns a real.

ML gets much more complex than this. One of the upsides of the language that is universally admired is that ML programs, like most programs written in functional languages can be proven correct in the computational sense. This can be done within certain ramifications: for example input/output operations need to specify exactly which values are input or an undefined behaviour will occur.

CAML and OCaml

In 1987 Ascánder Suárez at the French Institute for Research in Computer Science and Automation (INRIA) reimplemented a compiler and runtime system for ML in LISP and called the result CAML for Categorical Abstract Machine Language, a pun on the fact that it ran on a virtual machine (Category Abstract Machine) and the heritage from ML proper. The abstract machine used was the LLM3 abstract LISP machine, which in turn ran on another computer. It was not fast.

CAML was reimplemented in C in 1990-91 by Xavier Leroy, creating Caml Light, which was faster, because it was not written in a virtual machine running a virtual machine. Caml Light was more like Java and used a bytecode interpreter for its virtual machine.

In 1995, Caml Special Light introduced a native compiler, so the bytecode produced from the Caml compiler could be compiled to object code and executed with no virtual machine overhead, using a native runtime environment. Didier Rémy, Jérôme Vouillon and Jacques Garrigue continued the development of Caml.

Objective Caml arrived in 1996 and added some object oriented features to Caml. In 2011 the extended Caml Special Light compiler, and language derivative (dialect) of ML was renamed OCaml. In essence the compiler and language has a symbiotic relationship. There is no second implementation of OCaml.

From the 1990s and forward, what is now the OCaml language and implementation has gained traction. It is a very popular functional programming language, or rather, popular as far as functional programming goes. It has optimized implementations for most architectures. The compiler itself is now written mostly in OCaml, but the runtime in C is still around, to hook into each operating system where the program will eventually run. The language and compiler has been used for a variety of applications. Every major Linux distribution carries packages with the OCaml compiler and libraries. There is even a GTK+ 3 OCaml library binding, so OCaml GUI programs can be created.

OCaml simplifies binding labels to numbers etc, here is bubblesort implemented in OCaml:

(* Bubblesort in OCaml, Linus Walleij 2022 *)
let sort v =
  let newv = Array.make (Array.length v) 0 in
  for i = 1 to (Array.length v) - 1 do
    if v.(i - 1) > v.(i) then begin
      newv.(i - 1) <- v.(i);
      newv.(i) <- v.(i - 1);
      (* Copy back so we are working on the same thing *)
      v.(i - 1) <- newv.(i - 1);
      v.(i) <- newv.(i);
    end else begin
      newv.(i - 1) <- v.(i - 1);
      newv.(i) <- v.(i);
    end
  done;
  newv

let rec ordered v =
  if Array.length v = 0 then true
  else if Array.length v = 1 then true
  (* ... or if the rest of the array is ordered *)
  else if v.(0) < v.(1) && ordered (Array.sub v 1 (Array.length v - 1)) then true
  else false;;

let plist v =
  print_string "V = ";
  for i = 0 to (Array.length v) - 1 do begin
    print_int v.(i);
    if i < (Array.length v - 1) then print_string ",";
    end
  done;
  print_endline "";;

let rec sortme v =
  if ordered v then v
  else sortme (sort v);;

let v = [| 14 ; 4 ; 55 ; 100 ; 11 ; 29 ; 76 ; 19 ; 6 ; 82 ; 99 ; 0 ; 57 ; 36 ; 61 ; 30 |];;
plist v;;
plist (sortme v);;

My experience with working with this example is that OCaml makes a “bit of resistance” to changing contents of things like arrays by indexing. It “dislikes” any imperative constructs and kind of nudges you in the direction of purely logical constructs such as the ordered function above. This is just my personal take.

OCaml is still a dialect of ML. The file ending used on all files is .ml as well. OCaml – like Pythons pip or Perls CPAN has its own package system and library called opam. The prime application is still the OCaml Ergo Library, a library for automatic theorem proving. If your first and foremost use of computers is theorem proving, ML and OCaml continue to deliver since 1974. The more recent and widely popular Coq theorem prover is also written in OCaml.

Rust then

Rust was initially developed in 2006 as a hobby project by Graydon Hoare who was at the time working at Mozilla. OCaml and ML is mentioned as the biggest influence on the language, apart from C/C++. A typical sign of this influence would be that the first compiler for Rust was written in OCaml. A notable contributor to this codebase, apart from Hoare, is Brendan Eich, one of the founders of the Mozilla project and the inventor of JavaScript. While Brendan did not contribute much code he was at the time CTO of Mozilla, and this shows that when Mozilla started supporting the project in 2009 Rust was certainly well anchored in the organization, and Eich's early contributions to the language should be noted. (It may be commonplace that people in the CTO position at middle sized companies make commits to complex code bases, but I am not aware in that case.)

Despite the OCaml codebase the first documentation of the language talks more about other functional or declarative languages such as NIL, Hermes, Erlang, Sather, Newsqueak, Limbo and Napier. These origins with extensive quotes from e.g. Joe Armstrong (the inventor of Erlang) have been toned down in contemporary Rust documentation. It is however very clear that Graydon has a deep interest in historical computer languages and is convinced that they have something to teach us, and the expressed ambition is to draw on these languages to pick the best parts. In his own words:

I've always been a language pluralist — picture my relationship towards languages like a kid enjoying a wide variety of building blocks, musical instruments or plastic dinosaurs — and I don't think evangelism or single-language puritanism is especially helpful.

What is unique about Rust is that it fuses “impure” functional programming with imperative programming, bringing several concepts from ML and OCaml over into the language.

Another characteristic is that Rust compiled to target machine code from day one, rather than using any kind of virtual machine as did Peter Landins ISWIM, or the ML and OCaml languages (and as does say Java, or Python). Graydon probably did this intuitively, but a post he made in 2019 underscores the point: that virtual machines, even as an intermediate step, is bad language engineering and just generally a bad idea.

In 2013 Graydon stepped down as main lead for Rust for personal reasons which he has detailed in a posting on Reddit.

Rust has had the same symbiotic relationship between language and a single compiler implementation as OCaml, but this is changing, as there is now a second, GCC-based implementation in the works.

Here is bubblesort implemented in Rust:

/* Bubblesort in Rust, Linus Walleij 2022 */
fn sort(array : &mut [i32]) {
   let mut x : i32;
   if array.len() == 1 {
      return;
   }
   for i in 1..array.len() {
      if array[i - 1] > array[i] {
      	 x = array[i - 1];
	 array[i - 1] = array[i];
	 array[i] = x;
      }
   }
}

fn is_ordered(array : &[i32]) -> bool {
   if array.len() <= 1 {
     return true;
   }
   for i in 1..array.len() {
     if array[i - 1] > array[i] {
       return false;
     }
   }
   return true;
}

fn parray(array : &[i32]) {
   let mut x : i32;
   print!("V = ");
   for i in 0..array.len() {
       x = array[i];
       print!("{x}");
       if i != (array.len() - 1) {
       	  print!(",");
       }
   }
   println!("");
}

fn main() {
   let mut array: [i32; 16] = [14, 4, 55, 100, 11, 29, 76, 19, 6, 82, 99, 0, 57, 36, 61, 30];
   parray(&array);
   while !is_ordered(&array) {
     sort(&mut array);
   }
   parray(&array);
}

Rust leaves itself to easier imperative programming than OCaml: the keyword mut becomes quite similar to C:s const correctness tagging in this example. Since is_ordered and parray isn't altering the contents of the array these functions do not need to be marked with mut. You see some familiar virtues from Pascal: arrays “know” their length, and we use a method to obtain it: array.len().

The stated ambition is improved memory safety, data-race safety (concurrency) and type safety. The article Safe Systems Programming in Rust certainly presents the ambition in a straight-forward manner. Graydon also underscores the focus on memory and concurrency safety in a 2016 blog post.

But make no mistake. The current underlying ambition is definitely nothing different from the ambition of the ALGOL committee between 1958 and 1968: to raise the abstraction of the language through the ambition to join computer programming with formal logic. This comes from the arrival of strong academic support for the language.

A typical indication of this ambition is the well-funded RustBelt project involving a large amount of academic researchers, all familiar with formal logic, and resulting in such artefacts as Ralf Jung's PhD thesis Understanding and Evolving the Rust Programming Language. Here, formal logic in Rust Belt and the Coq proof assistant is used and concludes (from the abstract):

Together, these proofs establish that, as long as the only unsafe code in a well-typed λRust program is confined to libraries that satisfy their verification conditions, the program is safe to execute.

What is meant by “safe to execute” is that no use-after-free, dangling pointers, stale references, NULL pointer exceptions etc can ever occur in safe Rust code, because it is proven by formal logic: QED. It does not stop you from e.g. dividing by zero however, that problem is out-of-scope for the exercise.

To me personally the most astonishing fact about Jung's thesis is that it manages to repeatedly cite and reference the computer scientist Tony Hoare without quoting the inventor of the Rust language, Graydon Hoare, a single time. In a way it confirms Graydon's own statement that Rust “contains nothing new” from a language point of view.

The C programming language cannot be subject to the same scrutiny as Rust, simply because of all the (ab)use it allows, and which was mentioned by Wirth in his historical perspective: if a type can be changed by a cast and array indexing is not even part of the language, there is nothing much to prove. What has been interesting for scholars to investigate is a well-defined subset of C, such as the eBPF subset, which also partly explains the strong interest in eBPF: like with Rust, the build environment and language runtime has been defined with much stricter constraints and thus can be subject to formal verification.

The ambition of Rust is, as I perceieve it, and whether the people driving it even knows it or not, to finish what the ALGOL committe as primus motor started in 1958, and what the Garmisch NATO conference concluded was necessary in 1968: to develop a language for systems programming that rely on formal logic proof, and to fulfil what ALGOL never could, what Pascal never could, and what the whole maybe-not-700 functional programming languages never could: a language that joins the disciplines of computer science and software Engineering into ONE discipline, where the scholars of each can solve problems together.

That is the ambition of Rust as an implementation language for operating systems, such as Linux: provide a language backed by current top-of-the-line computer science research, for immediate application to software engineering developing the top-of-the-line operating system.

What it offers Linux is raised abstraction to counter the problems of complexity identified in the 1968 Garmisch NATO conference and now bleeding obvious given the recurring security incidents, and thereby would bring the engineering project Linux closer to computer science.

Other approaches to increased Linux (memory- concurrency-) safety are possible: notably increased testing, which is the engineering go-to panacea. And automated testing of Linux has indeed increased a lot in recent years. Raising the abstraction of the implementation language and proving it formally comes with the ambition to make testing less important.

[Mathieu Poirer and Jesper Jansson has helped out in reviewing this blog post, for which I am forever grateful: remaining errors, bugs and biased opinions are my own.]

 
Read more...

from Jakub Kicinski

In light of ongoing work to improve the TCP Tx zero-copy efficiency [1] one begins to wonder what can be done on the Rx side. Tx zero-copy is generally easier to implement because it requires no extra HW support. It's primarily a SW exercise in keeping references to user data rather than owning a copy of it.

[1] https://lore.kernel.org/all/cover.1653992701.git.asml.silence@gmail.com/

There had been efforts to support direct Rx to user space buffers by performing header-data splitting in the HW and sending headers (for kernel consumption) to a different buffer than the data. There are two known implementations which depend on header-data splitting (HDS).

First one, which is upstream today [2] depends on data payloads being received in page-sized chunks and mapping (mmap'ing?) the data into the process's virtual address space. Even though modifying the virtual address map is not cheap this scheme works well for large transfers. Importantly applications which use zero-copy are often more DRAM bandwidth constrained than CPU bound. Meaning that even for smaller transfers they may prefer to burn extra cycles modifying page tables than copying data and using up the precious DRAM transfers.

[2] https://lwn.net/Articles/752188/

The second approach is to pre-register user memory and let the NIC DMA the data directly there [3]. The memory in this case can be main DRAM or accelerator memory, that's not really important for us here. The model is similar to that of AF_XDP UMEMs. It still depends on header-data split as we don't want the application to be able to modify the headers, as they flow thru the networking stack and crash the kernel. Additionally the device must provide sufficient flow steering to be able to direct traffic for an application to its buffers / queues – we don't want the data to end up in memory of the wrong application. Once the TCP stack is done processing the headers, it simply tells the application where the data is.

[3] https://lore.kernel.org/netdev/20200727224444.2987641-1-jonathan.lemon@gmail.com/

Apart from the ability to direct the data to memory other than DRAM (compute accelerators, disks) the second approach has the advantage of not requiring page table changes and neatly page-sized payloads. It is harder to implement in SW because the stack does not necessarily have access to the payload memory. Although netgpu/zctap patches have stalled the idea may have sufficient merit to eventually come to fruition.

Looking ahead

Both of the approaches described above deal with packet-sized chunks of data. Admittedly some HW supports data coalescing (GRO-HW/LRO) but it's limited in size, best effort, latency inducing and generally unproven at scale.

Given that we already modify the HW to support zero-copy Rx (HDS for the former case, HDS+steering for the latter) what modifications would help us get to the next level? The data coalescing can certainly be improved.

LRO depends on intelligence in the NIC and maintaining state about connections which always poses scaling challenges. At the same time modern applications most often know the parameters of the transfer upfront, so they can put processing hints in the packet.

The receiver needs to know (1) which memory pool / region to place the data into and (2) at what offset. We can assume the sender gets these parameters thru the RPC layer. Sender can insert the placement information into the packet it sends. Either wrapped in a UDP header in front of the TCP header, or as a TCP option.

One useful modification may be to express the offset in terms of the TCP sequence number. Instead of providing the absolute offset where the packet data has to land (i.e. “data of this packet has to land in memory X at offset Y”) provide a base TCP sequence number and offset. The destination address would then be computed as

address = mem_pool[packet.dma.mem_id].base + 
          packet.dma.offset +
          packet.tcp.seq_no - packet.dma.base_tcp_seq

This simplifies the sender's job as it no longer has to change the offset as it breaks up a TSO super-frame into MTU-sized segments.

Last thing – we must protect applications from rogue writes. This can be done by programming an allow list of flow to memory pool pairs into the NIC. Unfortunately the full list scales linearly with the number of flows. A better approach would be to depend on a security protocol like Google's PSP [4]. PSP hands out association IDs to the senders, and only after authentication. The PSP IDs are much smaller (4B) than a IPv6 flow key (36B) if we have to keep a full list. We can try to be clever about how we hand them out – for instance allocating the “direct write” IDs from high numbers and less trusted connections from low numbers. We can also modify PSP to maintain a key per memory pool rather than per device, or incorporate the memory region ID in the key derivation algorithm. I'm not sufficiently crypto-savvy to know if the latter would weaken the protection too much.

[4] https://raw.githubusercontent.com/google/psp/main/doc/PSP_Arch_Spec.pdf

To sum up in the scheme I'm dreaming up we add the following fields after the PSP header:

  • 16b – memory id – ID of the region;
  • 64b – base offset – absolute offset within the memory region;
  • 32b – TCP sequence number base for the transfer.

Since we have 16 extra bits to round up to full 128b header we can consider adding a generation number for the memory region. This could be useful if memory region configuration / page table update is asynchronous so that we can post the read request before NIC confirmed the configuration is complete. In most cases configuration should be done by the time the data arrives, if it's not we can fall back to non-zero-copy TCP.

I don't work for Netronome/Corigine any more but it's certainly something their HW can easily support with a FW update. Much like PSP itself. What an amazing piece of HW that is...

 
Read more...

from Vlastimil Babka

In this post I would like to raise the awareness a bit about an effort to reduce the limitations of anonymous VMA merging, in the form of an ongoing master thesis by Jakub Matena, which I'm supervising. I suspect there might be userspace projects that would benefit and maybe their authors are aware of the current limitations and would welcome if they were relaxed, but they don't read the linux-mm mailing list – the last version of the RFC posted there is here

In a high-level summary, merging of anonymous VMAs in Linux generally happens as soon as they become adjacent in the address space and have compatible access protection bits (and also mempolicies etc.). However due to internal implementation details (involving VMA and page offsets) some operations such as mremap() that moves VMAs around the address space can cause anonymous VMAs not to merge even if everything else is compatible. This is then visible as extra entries in /proc/pid/maps that could be in theory be one larger entry, the associated larger memory and CPU overhead of VMA operations, or even hitting the limit of VMAs per process, set by the vm.max_map_count sysctl. A related issue is that mremap() syscall itself cannot currently process multiple VMAs, so a process that needs to further mremap() the non-merged areas would need to somehow learn the extra boundaries first and perform a sequence of multiple mremap()'s to achieve its goal.

Does any of the above sound familiar because you found that out already while working on a Linux application? Then we would love your feedback on the RFC linked above (or privately). The issue is that while in many scenarios the merging limitations can be lifted by the RFC, it doesn't come for free in both of some overhead of e.g. mremap(), and especially the extra complexity of an already complex code. Thus identifying workloads that would benefit a lot would be helpful. Thanks!

 
Read more...