from paulmck

This is part of the Kernel Recipes 2025 blog series.

The other posts in this series help you make small improvements over a long period of time. But what do you do if you only have a few weeks until your presentation? Yes, it is best to avoid procrastination, but sometimes you simply don't have all that much notice.

First, have a very clear picture of what you want the audience to gain from your presentation. A carefully chosen and tight focus will save you time that might otherwise have been wasted on irrelevant details.

Second, do dry-run presentations, preferably to people who won't be shy about giving you honest feedback. If your dry-run audience has shy people, you can ask them questions to see if they picked up on the key points of your presentation. If you cannot scare up a human audience on short notice, record your presentation (on your smartphone if nothing else) and review it. In the old pre-smartphone days, we would do our audience-free dry runs in front of a mirror, which can still be useful, for example, if your smartphone's battery is empty.

Third, repeat the important portions of your presentation, which usually includes the opening, the conclusion, and any surprise “reveals” in the middle of the presentation. If it is an important presentation (but aren't they all?), do about 20 repetitions of the important portions. If it is an extremely important presentation, dry-run the entire presentation about 20 times. Yes, this can take time, but on the other hand, most of my extremely important presentations were quite short, on the order of 3-5 minutes.

Fourth and finally, get a good night's sleep before the day of the presentation.


from paulmck

This is part of the Kernel Recipes 2025 blog series.

I have been consciously working on speaking skills for more than half a century.  This section lists a few of my experiences along the way.  My hope is that this motivates you to take the easier and faster approaches laid out in the rest of this blog series.

Comic Relief

A now-disgraced comedian who was immensely popular in the 1960s was said to have learned his craft at school.  They said that he discovered that if he could make the schoolyard bullies laugh, they would often forget about roughing him up.  I tried the same approach, though with just barely enough success to persist.  Part of my problem was that I spent most of my time focusing on academic skills, which certainly proved to be a wise choice longer term, but did limit the time available to improve my comedic capabilities.  I was also limited by my not-so-wise insistence on taking myself too seriously.  Choices, choices!

My classmates often told very funny jokes, and I firmly believed that making up jokes was a cognitive skill, and I just as firmly believed (and with some reason) that I was a cognitive standout.  If they could do it, so could I!!!

But for a very long time, my jokes were extremely weak compared to theirs.

Until one day, I told a joke that everyone laughed at.  Hard.  For a long time.  (And no, I do not remember that joke, but then again, it was a joke targeted towards seventh graders and you most likely are not in seventh grade.)

Once they recovered, one of them asked “What show did you see that on?”

Suddenly the awful truth dawned on me.  My classmates were not making up these jokes.  They were seeing them on television, and rushing to be the first to repeat them the next day.  Why was this not obvious to me?  Because my family did not have a television.

My surprise did not prevent me from replying “The Blank Wall”.  Which was the honest truth: I had in fact been staring at a blank wall the previous evening while composing my first successful joke.

The next day, my classmates asked me what channel “The Blank Wall” was on.  I of course gave evasive answers, but in a few minutes they figured out that I meant a literal blank wall.  They were not impressed with my attitude.  You saw jokes on television, after all, and no one in their right mind would even try to make one up!

I also did some grade-school acting, though my big role was Jonathan Brewster in a seventh-grade production of “Arsenic and Old Lace” rather than anything comedic.  The need to work prevented my acting in any high-school plays, though to be fair it is not clear that my acting abilities would have kept up with those of my classmates in any case.

Besides, those working in retail can attest that carefully deployed humor can be extremely useful.  So my high-school grocery-store job likely provided me with more and better experience than the high-school plays could possibly have done.  At least that is what I keep telling myself!

Speech Team

For reasons that were never quite clear to me, the high-school speech-team coach asked me to try out. I probably would have ignored her, but I well recalled my father telling me that those who have nothing to say, but can say it well, will often do better than those who have something to say but cannot say it. So, against my better 13-year-old judgment, I signed up.

I did quite well in extemporaneous speech during my first year due to my relatively deep understanding of the science behind the hot topic of that time, namely the energy crisis. During later years, the hot topics reverted to the usual political and evening-news fare, so the remaining three years were good practice, but did not result in wins. Until the end of my senior year, when the coach suggested that I try radio commentary, which had the great advantage of hiding my horribly geeky teenaged face from the judges. I did quite well, qualifying for district-level competition on the strength of my first-ever radio-commentary speech.

But I can only be thankful that my 17-year-old self decided to go to an engineering university as opposed to seeking employment at a local radio station.

University Coursework

I tested out of Freshman English Composition, but I did take a couple of courses on technical writing and technical presentation. A ca. 1980 mechanical-engineering presentation on ground-loop heat pumps featured my first use of cartoons in a technical presentation, courtesy of a teammate who knew a professional cartoonist. The four of us were quite proud of having kept the class’s attention during the full duration of our talk, which took place only a few days before the start of Christmas holidays.

1980s and 1990s Presentations

I did impromptu work-related presentations for my contract-programming work in the early 1980s. In the late 1980s, I joined a research institute where I was expected to do formal presentations, including at academic venues. I joined a startup in 1990, where I continued academic presentations, but focused mainly on internal training presentations.

Toastmasters

I became a founding member of a local Toastmasters club in 1993, and during the next seven years received CTM (“Competent Toastmaster”) and ATM (“Advanced Toastmaster”) certifications. There is very likely a Toastmasters club near you, and you can search here: https://www.toastmasters.org/.

The purpose of Toastmasters is to help people develop public-speaking skills in a friendly environment. The members of the club help each other, evaluating each other's short speeches and providing topics for even shorter impromptu speeches. The CTM and ATM certifications each have a manual that guides the member through a series of different types of speeches. For example, the 1990s CTM manual starts with a 4-6-minute speech in which the member introduces themselves. This has the benefit of ensuring that the speaker is expert on the topic, though I have come across an amnesiac who was an exception that proves this rule.

For me, the best part of Toastmasters was “table topics”, in which someone is designated to bring a topic to the next meeting. The topic is called out, and people are expected to volunteer to give a short speech (a minute or two) on that topic. This is excellent preparation for those times when someone calls you out during a meeting.

Benchmarking

By the year 2000, I felt very good about my speaking ability. I was aware of some shortcomings, for example, I had difficulty with audiences larger than about 100 people, but was doing quite well, both in my own estimation and that of others. In short, it was time to benchmark myself against a professional speaker.

In that year, I attended an event whose keynote was given by none other than one of the least articulate of the US Presidents, George H. W. Bush. Now, Bush’s speaking abilities might have been unfairly compared to the larger-than-life capabilities of his predecessor (Ronald Reagan, AKA “The Great Communicator”) and his successor (Bill Clinton, whose command of people skills is the stuff of legends). In contrast, here is Ann Richards’s assessment of Bush’s skills: “born with a silver foot in his mouth”.

As noted above, I had just completed seven years in Toastmasters, so I was more than ready to do a Toastmasters-style evaluation of Bush's keynote. I would record all the defects in his speech and email the list to my Toastmasters group for their amusement.

Except that it didn’t turn out that way.

Bush gave a one-hour speech during which he did everything that I knew how to do, and did it effortlessly. Not only that, there were instances where he clearly expected a reaction from the audience, and got that reaction. I was watching him like a hawk the whole time and had absolutely no idea how he had made it happen.

Bush might well have been the most inarticulate of the US Presidents, but he was incomparably better than this software developer will ever be.

But that does not mean that I cannot continue to improve. In fact, I can now do a better job of presenting than Bush can. Not just due to my having spent the intervening decades practicing (practice makes perfect!), but mostly due to the fact that Bush has since passed away.

Linux Community

I joined the Linux community in 2001, where I faced large and diverse audiences. It quickly became obvious that I needed to apply my youthful Warner Brothers lessons, especially given that I was presenting things like RCU to audiences that were mostly innocent of any knowledge of or experience in concurrency.

This experience also gave me much-needed practice dealing with larger audiences, in a few cases, on the order of 1,000.

So I continue to improve, but there is much more for me to learn.


from paulmck

This is part of the Kernel Recipes 2025 blog series.

This blog series has covered why public speaking is important, ways and means, building bridges from your audience to where they need to go, who owns your words, telling stories, knowing your destination, use of humor, and speaking on short notice.

But if you would rather learn about what I actually did rather than what I advise you to do, please see here.

I close this series by reiterating the value and ubiquity of Toastmasters and the usefulness of both dry runs and reviewing videos of your past talks.

Best of everything in your presentations!

Acknowledgments

And last, but definitely not least, a big “thank you” (in chronological order) to Anne Nicolas, Willy Tarreau, Steven Rostedt, Gregory Price, and Michael Opdenacker for their careful review of early versions of this series.


from paulmck

This is part of the Kernel Recipes 2025 blog series.

Humor is both difficult and dangerous, especially in a large and diverse group such as the audience for Kernel Recipes. My advice is to do many formal presentations before attempting much in the way of humor.

This section will nevertheless talk about use of humor in technical presentations.

One issue is that audience members have a wide range of languages and dialects, and a given joke in (say) American English might not go over well with (say) Welsh English speakers. And it might be completely mangled in translation to another language. For example, during a 1980s visit to China, George Bush Senior is said to have quipped “We are oriented to the Orient.” This translates to something like ”我们面向东方”, which translates back to something like “We face East”, completely destroying Bush's oriented/Orient pun. So what did the poor translator say? “是笑话,笑吧”, which translates to something like “It is a joke, laugh.”

So if you tell jokes, keep translations to other cultures and languages firmly in mind. (To be fair, this is advice that I could do well to better heed myself!)

In addition, jokes make fun of some person or group or are based on what is considered to be abnormal, excessive, or unacceptable, all of which differ greatly across cultures. Besides which, given a large and diverse audience such as that of Kernel Recipes, there will almost certainly be someone in attendance who identifies with the person or group in question or who has strong feelings about the joke’s implications about abnormality, excessiveness, or unacceptability. That someone just might have a strong negative reaction. And this should be absolutely no surprise, given that humor is used with great effect as a weapon in social conflicts.

In my youth, there were outgroups that were frequently the butt of jokes. These were often groups that were not represented in my small community, but were just as often a single-person outgroup made up of some hapless fellow student. Then as now, the most cruel jokes all too often get the best laughs.

Yet humor can also make a speech much more enjoyable. So what is a speaker to do?

Outgroups are often used, with technical talks making jokes at the expense of managers, salespeople, marketing departments, lawyers, users, and occasionally even an especially incompetent techie. But these jokes always eventually find their way to the outgroup in question, sometimes with devastating consequences to the hapless speaker.

It is better to tell jokes where you yourself are the butt of the joke. This can be difficult at first: Let’s face it, most of us would prefer to be taken seriously. However, becoming comfortable with this is well worth the effort. For one thing, once you have demonstrated a willingness to make a joke at your own expense, the audience will usually be much more willing to accept their own shortcomings and need for improvement. Such an audience will usually also be more willing to learn, and the best technical talks are after all those that audiences learn from.

What jokes should you tell on yourself? I paraphrase advice from the late humorist Patrick McManus: The worst day of your life will make the audience laugh the hardest.

That said, you need to make sure that the audience can relate to the challenges you faced on that day. For example, my interactions with the legal profession would likely seem strange and irrelevant to a general audience. However, almost all members of a Kernel Recipes audience will have chased down a difficult bug, so a story about some idiotic mistake I made while chasing down an RCU bug will likely resonate. And this might be one way of entertaining a general audience while providing needed information to those wanting an RCU deep dive.

Or maybe you can figure out how to work some bathroom humor into your talk. Who is the butt of this joke? You decide! ;–)

Adding humor to your talk often does not come for free. Time spent telling jokes is not available for presenting on technology. This tradeoff can be tricky: Too much humor makes for a lightweight talk, and too little for a dry talk. Especially if you are just starting out, I strongly advise you to err in the direction of dryness. Instead, make your technical content be the source of your audience’s excitement.

Use of humor in technical talks is both difficult and dangerous, but careful use of humor can be a very powerful public-speaking tool.

Perhaps some day I, too, will master the use of humor.


from paulmck

This is part of the Kernel Recipes 2025 blog series.

An earlier section expounded on the importance of building your bridge starting from where your target audience already is.  The previous section talked about using stories to build your bridge.

However, it is just as important to understand where your bridge’s destination lies.  This might seem blindingly obvious, but suppose that you were just invited to speak.  That is right, a conference is going to give you a precious 30 minutes in front of their audience.  But where do you want to take them?  Where do they need to go?

I cannot decide this for you.  Instead, you must decide, based on your experiences and those of the audience.

But I can list the destinations that I chose for the example talks from the previous section:

  1. “What Happens When 4096 Cores All Do synchronize_rcu_expedited()?”: Demonstrate extreme scalability is possible, some techniques for scaling, and exposition of portions of Linux-kernel RCU.
  2. “RCU's First-Ever CVE, and How I Lived to Tell the Tale”: Show that ease of use is important even for low-level synchronization primitives, “a year in the life of the RCU maintainer”.
  3. “Bare-Metal Multicore Performance in a General-Purpose Operating System (Adventures in Ubiquity)”: Describe how extreme stress testing proves to not be all that extreme, introduction to RCU callback offloading and NOHZFULL, exposition of portions of Linux-kernel RCU.
  4. “Cautionary Tales on Implementing the Software That People Want”: “My users don’t know what they want” is not a valid excuse and never has been, connection between validation and natural selection, hazards of refusing to fix irrelevant bugs (never mind that we all too often have no choice).

Of course, the fact that we choose a particular destination does not necessarily mean that the audience will arrive there!


from linusw

As I was working my way toward a mergeable version of generic entry for ARM32, there was an especially nasty bug that I could not for the life of me iron out: when booting Debian for armhf I just kept running into a boot splat, while everything else worked fine. It would look something like this:

8<--- cut here ---
Unable to handle kernel paging request at virtual address eaffff76 when execute
[eaffff76] *pgd=eae1141e(bad)
Internal error: Oops: 8000000d [#1] SMP ARM
CPU: 0 UID: 997 PID: 304 Comm: sd-resolve Not tainted 6.13.0-rc1+ #22
Hardware name: ARM-Versatile Express
PC is at 0xeaffff76
LR is at __invoke_syscall_ret+0x0/0x18
pc : [<eaffff76>]    lr : [<80100a68>]    psr: a0030013
sp : fbc11f68  ip : fbc11e78  fp : 76539420
r10: 10c5387d  r9 : 841f4ec0  r8 : 80100284
r7 : ffffffff  r6 : 7653941c  r5 : 76cb6000  r4 : 00000000
r3 : 00000000  r2 : 00000000  r1 : 00080003  r0 : ffffff9f
Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 8222006a  DAC: 00000051
Register r0 information: non-paged memory
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: NULL pointer
Register r5 information: non-paged memory
Register r6 information: non-paged memory
Register r7 information: non-paged memory
Register r8 information: non-slab/vmalloc memory
Register r9 information: slab task_struct start 841f4ec0 pointer offset 0 size 2240
Register r10 information: non-paged memory
Register r11 information: non-paged memory
Register r12 information: 2-page vmalloc region starting at 0xfbc10000 allocated at copy_process+0x150/0xd88
Process sd-resolve (pid: 304, stack limit = 0xbab1c12b)
Stack: (0xfbc11f68 to 0xfbc12000)
1f60:                   00000000 76cb6000 fbc11fb0 ffffffff 80100284 80cdd330
1f80: 80100284 841f4ec0 10c5387d 80111eac 00000000 76cb6000 7653941c 00000119
1fa0: 76539420 80100280 00000000 76cb6000 ffffff9f 00080003 00000000 00000000
1fc0: 00000000 76cb6000 7653941c 00000119 76539c48 76539c44 76539b6c 76539420
1fe0: 76f3a450 765392c4 76c72a4d 76c60108 20030030 00000010 00000000 00000000
Call trace: 
Code: 00000000 00000000 00000000 00000000 (00000000) 
---[ end trace 0000000000000000 ]---

The paging request means that we are in kernel mode, and we have tried to page in a page that does not exist, such as reading from random uninitialized memory somewhere. If this was userspace, we would get a “segmentation fault”. So this is a pretty common error in C programs.

Notice the following: there is no call trace. This always happens when you least want it: how am I supposed to know how we got here?

But the die() splat has this helpful information: PID: 304 Comm: sd-resolve which reads: this was caused by the process initiated by the command sd-resolve executing as PID 304. But sd-resolve is just something systemd fires temporarily when bringing up some other service, so it must be part of a service. Luckily we also have dmesg:

       Starting systemd-timesyncd… - Network Time Synchronization...
8<--- cut here ---
Unable to handle kernel paging request at virtual address eaffff76 when execute

Aha, it's the NTP service. We can verify that this process is causing the mess by issuing:

systemctl restart systemd-timesyncd

And indeed, we get a second reproducible splat. OK great, let's use ftrace to tell us what happened.

The excellent article Secrets of the Ftrace function tracer tells us that it's as simple as echoing the PID of the process into set_ftrace_pid in the kernel debug/tracing sysfs filetree.

There is a problem though: this process is by its very nature transient. I don't know the PID! After some googling it turns out you can ask systemd what PID a certain service is running as:

systemctl show --property MainPID --value systemd-timesyncd

So ... we can echo this into set_ftrace_pid and then start the trace. But the service is transient: how can I restart the service, obtain the PID, start tracing, wait for the service to finish restarting, and then end the trace, all by hand? No, that's a tall order.

After some fooling around with trying to restart the service in one window, then quickly switching to another window and starting the trace while the restart was happening, I realized what everyone should already know: never put a person to do a machine's job.

I had to write a script that would restart the service, start the trace, wait for the restart to finish and then stop the trace. It ended up looking like this:

#!/bin/bash
# Trace the restart of a transient systemd service with ftrace.
TRACEDIR=/sys/kernel/debug/tracing
SERVICE=systemd-timesyncd
TRACEFILE=/root/trace.dat

echo 0 > ${TRACEDIR}/tracing_on            # make sure tracing is off
echo "function" > ${TRACEDIR}/current_tracer
(systemctl restart ${SERVICE})& PID=$!     # restart in the background, save the PID
echo 1 > ${TRACEDIR}/tracing_on            # start tracing
echo "Waiting for the restart to complete"
wait "${PID}"                              # block until the restart has finished
echo 0 > ${TRACEDIR}/tracing_on            # stop tracing
echo "Restarted"
trace-cmd extract -o ${TRACEFILE}          # pull the trace buffer out to a file
scp ${TRACEFILE} linus@169.254.1.2:/tmp/trace.dat

This does what we want: turn off the tracing, activate the function tracer, restart the systemd-timesyncd service and capture its PID, start tracing, wait for the restart to complete, then extract the trace and copy it to the development system. No need to figure out the PID of the forked sd-resolve, just capture everything: this window will be small enough that we can capture all the relevant trace information.

After this I brought up kernelshark to inspect the resulting tracefile trace.dat:

Kernelshark looking at logs

We search for the die() invocation (here I had a pr_info() added with the word “CRASH” as well). Sure enough there it is and the task is indeed sd-resolve. But what happens before? We need to know why we crashed here.

For that we need to re-run the trace, but now with the function_graph tracer so we can see the program flow; the indentation helps us follow the messiness:

Kernelshark looking at logs for function graph

So we switch to the function_graph view and start off from die(), then we just move upward: first we find a prefetch abort, and that is what we already knew: we are getting a page fault, in kernel mode, for a page that does not have any backing storage.

So browse backwards, and:

Kernelshark looking at logs for function graph

Aha. We are invoking some syscall, and that doesn't really work. So what kind of odd syscall can we invoke that just crashes on us like that? We instrument invoke_syscall() to print that out for us, not for all invocations but just for the task we are interested in, which we know is sd-resolve:

if (!strcmp(current->comm, "sd-resolve"))
   pr_info("%s invoke syscall %d\n", current->comm, scno);

We take advantage of the fact that the global variable current in the kernel always points to the currently active task. And the field .comm should contain the command name that it was invoked with, such as “sd-resolve”.

Then we run this and oh:

[  OK  ] Started systemd-timesyncd.…0m - Network Time Synchronization.
sd-resolve invoke syscall 291
sd-resolve invoke syscall -1
8<--- cut here ---
Unable to handle kernel paging request at virtual address e7f001f2 when execute
[e7f001f2] *pgd=e7e1141e(bad)

So system call -1 was invoked. This may seem weird to you, but it is actually a legal value: when system calls are being traced (such as with strace), the kernel will filter them, and the filter will return -1 to indicate that the system call should not be taken at all, i.e. “skipped”. My new generic entry code was not taking this properly into account, so the low-level assembly tried to vector -1 into a table and failed miserably, vectoring us out into the unknown.
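To make this concrete, below is a minimal sketch of the kind of guard that was missing. This is illustrative pseudo-kernel code, not the actual ARM32 entry code:

/* A syscall filter (seccomp, ptrace/strace) may rewrite the syscall
 * number to -1, meaning "skip this system call entirely", so the
 * dispatcher must range-check the number before using it as an index
 * into the syscall table. */
static long invoke_syscall_checked(int scno, struct pt_regs *regs)
{
	if (scno < 0 || scno >= NR_syscalls)	/* catches the -1 "skipped" case */
		return -ENOSYS;			/* do not vector into the table */
	return sys_call_table[scno](regs);
}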

At this point I could quickly patch up the code and call it a day.

I have no idea why sd-resolve turns on system call tracing by default, but it obviously does. It could be related to seccomp security features being invoked from BPF programs prior to every system call in the same code path; those need to intercept the system calls anyway. Not particularly efficient, but I suppose quite secure.


from Jakub Kicinski

Developments in Linux kernel networking accomplished by many excellent developers and as remembered by Andrew L, Eric D, Jakub K and Paolo A.

Another busy year has passed, so let us punctuate the never-ending stream of development with a retrospective of our accomplishments over the last 12 months. The previous, 2023 retrospective covered changes from Linux v6.3 to v6.8; for 2024 we will cover Linux v6.9 to v6.13, one fewer, as Linux releases don’t align with calendar years. We will focus on the work happening directly on the netdev mailing list, having neither space nor expertise to do justice to developments within sub-subsystems like WiFi, Bluetooth, BPF etc.

Core

After months of work and many patch revisions we have finally merged support for Device Memory TCP, which allows TCP payloads to be placed directly in accelerator (GPU, TPU, etc.) or user space memory while still using the kernel stack for all protocol (header) processing (v6.12). The immediate motivation for this work is obviously the GenAI boom, but some of the components built to enable Device Memory TCP, for example queue control API (v6.10), should be more broadly applicable.

The second notable area of development was busy polling. Additions to the epoll API allow enabling and configuring network busy polling on a per-epoll-instance basis, making the feature far easier to deploy in a single application (v6.9). Even more significant was the addition of a NAPI suspension mechanism which allows for efficient and automatic switching between busy polling and IRQ-driven operation, as most real life applications are not constantly running under highest load (v6.12). Once again the work was preceded by paying off technical debt, it is now possible to configure individual NAPI instances rather than an entire network interface (v6.13).
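As an illustration, here is a minimal user-space sketch of the per-epoll-instance configuration (struct and ioctl names as merged in v6.9; double-check against your <linux/eventpoll.h>):

#include <sys/ioctl.h>
#include <linux/eventpoll.h>

/* Enable busy polling on one epoll instance rather than system-wide. */
int enable_busy_poll(int epfd)
{
	struct epoll_params params = {
		.busy_poll_usecs = 64,	/* spin up to 64 us before sleeping */
		.busy_poll_budget = 8,	/* max packets per busy-poll round */
		.prefer_busy_poll = 1,
	};

	return ioctl(epfd, EPIOCSPARAMS, &params);
}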

Work on relieving the rtnl_lock pressure has continued throughout the year. The rtnl_lock is often mentioned as one of the biggest global locks in the kernel, as it protects all of the network configuration and state. The efforts can be divided into two broad categories – converting read operations to rely on RCU protection or other fine grained locking (v6.9, v6.10), and splitting the lock into per-network namespace locks (preparations for which started in v6.13).

Following discussions during last year’s LPC, the Real Time developers have contributed changes which make network processing more RT-friendly by allowing all packet processing to be executed in dedicated threads, instead of the softirq thread (v6.10). They also replaced implicit Bottom Half protection (the fact that code in BH context can’t be preempted, or migrated between CPUs) with explicit local locks (v6.11).

The routing stack has seen a number of small additions for ECMP forwarding, which underpins all modern datacenter network fabrics. ECMP routing can now maintain per-path statistics to allow detecting unbalanced use of paths (v6.9), and to reseed the hashing key to remediate the poor traffic distribution (v6.11). The weights used in ECMP’s consistent hashing have been widened from 8 bits to 16 bits (v6.12).

The ability to schedule sending packets at a particular time in the future has been extended to survive network namespace traversal (v6.9), and now supports using the TAI clock as a reference (v6.11). We also gained the ability to explicitly supply the timestamp ID via a cmsg during a sendmsg call (v6.13).

The number of “drop reasons”, helping to easily identify and trace packet loss in the stack is steadily increasing. Reason codes are now also provided when TCP RST packets are generated (v6.10).

Protocols

The protocol development wasn’t particularly active in 2024. As we close off the year, 3 large protocol patch sets are being actively reviewed, but let us not steal 2025’s thunder, and limit ourselves to changes present in Linus’s tree by the end of 2024.

AF_UNIX socket family has a new garbage collection algorithm (v6.10). Since AF_UNIX supports file descriptor passing, sockets can hold references to each other, forming reference cycles etc. The old home grown algorithm which was a constant source of bugs has been replaced by one with more theoretical backing (Tarjan’s algorithm).

TCP SYN cookie generation and validation can now be performed from the TC subsystem hooks, enabling scaling out SYN flood handling across multiple machines (v6.9). User space can peek into data queued to a TCP socket at a specified offset (v6.10). It is also now possible to set min_rto for all new sockets using a sysctl, a patch which was reportedly maintained downstream by multiple hyperscalers for years (v6.11).

UDP segmentation now works even if the underlying device doesn’t support checksum offload, e.g. TUN/TAP (v6.11). A new hash table was added for connected UDP sockets (4-tuple based), significantly speeding-up connected socket lookup (v6.13).

MPTCP gained TCP_NOTSENT_LOWAT support (v6.9), and automatic tracking of destinations which blackhole MPTCP traffic (v6.12).

IPsec stack now adheres to RFC 4301 when it comes to forwarding ICMP Error messages (v6.9).

Bonding driver supports independent control state machine in addition to the traditional coupled one, per IEEE 802.1AX-2008 5.4.15 (v6.9).

The GTP protocol gained IPv6 support (v6.10).

The High-availability Seamless Redundancy (HSR) protocol implementation gained the ability to work as a proxy node connecting non-HSR capable node to an HSR network (RedBOX mode) (v6.11).

The netconsole driver can attach arbitrary metadata to the log messages (v6.9).

The work on making Netlink easier to interface with in modern languages continued. The Netlink protocol descriptions in YAML can now express Netlink “polymorphism” (v6.9), i.e. a situation where parsing of one attribute depends on the value of another attribute (e.g. link type determines how link attributes are parsed). 7 new specs have been added, as well as a lot of small spec and code generation improvements. Sadly we still only have bindings/codegen for C, C++ and Python.

Device APIs

The biggest addition to the device-facing APIs in 2024 was the HW traffic shaping interface (v6.13). Over the years we have accumulated a plethora of single-vendor, single-use case rate control APIs. The new API promises to express most use cases, ultimately unifying the configuration from the user perspective. The immediate use for the new API is rate limiting traffic from a group of Tx queues. Somewhat related to this work was the revamp of the RSS context API which allows directing Rx traffic to a group of queues (v6.11, v6.12, v6.13). Together the HW rate limiting and RSS context APIs will hopefully allow container networking to leverage HW capabilities, without the need for complex full offloads.

A new API for reporting device statistics has been created (qstat) within the netdev netlink family (v6.9). It allows reporting more detailed driver-level stats than old interfaces, and breaking down the stats by Rx/Tx queue.

Packet processing in presence of TC classifier offloads has been sped up, the software processing is now fully skipped if all rules are installed in HW-only mode (v6.10).

Ethtool gained support for flashing firmware to SFP modules, and configuring thresholds used by automatic IRQ moderation (v6.11). The most significant change to ethtool APIs in 2024 was, however, the ability to interact with multiple PHYs for a single network interface (v6.12).

Work continues on adding configuration interfaces for supplying power over network wiring. Ethtool APIs have been extended with Power over Ethernet (PoE) support (v6.10). The APIs have been extended to allow reporting more information about the devices and failure reasons, as well as setting power limits (v6.11).

Configuration of Energy Efficient Ethernet is being reworked because the old API did not have enough bits to cover new link modes (2.5GE, 5GE), but we also used this as an opportunity to share more code between drivers (especially those using phylib), and encourage more uniform behavior (v6.9).

Testing

2024 was the year of improving our testing. We spent the previous winter break building out an automated testing system, and have been running the full suite of networking selftests on all code merged since January. The pre-merge tests are catching roughly one bug a day.

We added a handful of simple libraries and infrastructure for writing tests in Python, crucially allowing easy use of Netlink YAML bindings, and supporting tests for NIC drivers (v6.10).

Later in the year we added native integration of packetdrill tests into kselftest, and started importing batches of tests from the packetdrill library (v6.12).

Community and process

The maintainers, developers and community members have met at two conferences, the netdev track at Linux Plumbers and netconf in Vienna, and the netdev.conf 0x18 conference in Santa Clara.

We have removed the historic requirement for special formatting of multi-line comments in netdev (although it is still the preferred style), documented our guidance on the use of automatic resource cleanup, as well as sending cleanup patches (such as “fixing” checkpatch warnings in existing code).

In April, we announced the redefinition of the “Supported” status for NIC drivers, to try to nudge vendors towards more collaboration and better testing. Whether this change has the desired effect remains to be seen.

Last but not least Andrew Lunn and Simon Horman have joined the netdev maintainer group.

6.9: https://lore.kernel.org/20240312042504.1835743-1-kuba@kernel.org
6.10: https://lore.kernel.org/20240514231155.1004295-1-kuba@kernel.org
6.11: https://lore.kernel.org/20240716152031.1288409-1-kuba@kernel.org
6.12: https://lore.kernel.org/20240915172730.2697972-1-kuba@kernel.org
6.13: https://lore.kernel.org/20241119161923.29062-1-pabeni@redhat.com


from kees

Or, how to break all the tools that parse the “Fixes:” tag

Kees Cook

There was a recent discussion about how Linux's “Fixes” tag, which traditionally uses the 12-character commit SHA prefix, has an ever-increasing chance of collisions. There are already 11-character collisions, and Geert wanted to raise the minimum short id to 16 characters. This was met with push-back for various reasons. One aspect that bothered me was some people still treating this like a theoretical “maybe in the future” problem. To clear up that problem, I generated a 12-character prefix collision against the start of Git history, commit 1da177e4c3f4 (“Linux-2.6.12-rc2”), which shows up in “Fixes” tags very often:

$ git log --no-merges --oneline --grep 'Fixes: 1da177e4c3f4' | wc -l
590

Tools like linux-next's “Fixes tag checker”, the Linux CNA's commit parser, and my own CVE lifetime analysis scripts do programmatic analysis of the “Fixes” tag, and none of them had support for collisions (even the shorter collisions that already exist).

So, in an effort to fix these tools, I broke them with commit 1da177e4c3f4 (“docs: git SHA prefixes are for humans”):

$ git show 1da177e4c3f4
error: short object ID 1da177e4c3f4 is ambiguous
hint: The candidates are:
hint:   1da177e4c3f41 commit 2005-04-16 - Linux-2.6.12-rc2
hint:   1da177e4c3f47 commit 2024-12-14 - docs: git SHA prefixes are for humans

This is not yet in the upstream Linux tree, for fear of breaking countless other tools out in the wild. But it can serve as a test commit for those that want to get this fixed ahead of any future collisions (or this commit actually landing).

Lots of thanks to the lucky-commit project, which will grind trailing commit message whitespace in an attempt to find collisions. Doing the 12-character prefix collision took about 6 hours on my OpenCL-enabled RTX 3080 GPU.

For any questions, comments, etc, see this thread.


from linusw

In kernel v6.10 we managed to merge two security hardening patches to the ARM32 architecture:

  • PAN for LPAE CONFIG_CPU_TTBR0_PAN
  • KCFI on ARM32 CONFIG_CFI_CLANG

As of kernel v6.12 these seem sufficiently stable for users such as distributions and embedded systems to look closer at. Below are the technical details!

A good rundown of these and other historically interesting security features can be found in Russell Currey's abridged history of kernel hardening which sums up what has been done up to now in a very approachable form.

PAN for LPAE

PAN is an abbreviation for the somewhat grammatically incorrect Privileged Access Never.

The fundamental idea with PAN on different architectures is to disable any access from kernelspace to userspace memory, unless explicitly requested using the dedicated uaccess helpers such as get_user()/put_user() and copy_from_user()/copy_to_user(). Attackers may want to compromise userspace from the kernel to access things such as keys, and we want to make this hard for them; in general it protects userspace memory from corruption from kernelspace.

In some architectures such as S390 the userspace memory is completely separate from the kernel memory, but most simpler CPUs will just map the userspace into low memory (address 0x00000000 and up), where it is always accessible from the kernel.

The ARM32 kernel has for a few years had a config option named CONFIG_CPU_SW_DOMAIN_PAN which uses a hardware feature (memory domains) whereby userspace memory is made inaccessible from kernelspace. There is a special bit in the page descriptors saying that a certain page or segment etc belongs to userspace, so it is possible for the hardware to deduce this.

For modern ARM32 systems with large memories configured to use LPAE, nothing like PAN was available: this version of the MMU simply did not implement a PAN option.

As of the patch originally developed by Catalin Marinas, we deploy a scheme that will use the fact that LPAE has two separate translation table base registers (TTBR:s): one for userspace (TTBR0) and one for kernelspace (TTBR1).

By simply disabling the use of any translations (page walks) on TTBR0 when executing in kernelspace – unless explicitly enabled in the uaccess helpers – we achieve the same effect as PAN. This is now turned on by default for LPAE configurations.
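Conceptually, the mechanism looks something like the sketch below. This is illustrative only (the real code lives in the ARM32 uaccess and context-switch assembly): the TTBCR.EPD0 bit tells the MMU not to walk the translation tables pointed to by TTBR0, so user addresses simply fault when touched from kernelspace:

#define TTBCR_EPD0	(1u << 7)	/* disable page-table walks via TTBR0 */

static inline unsigned int read_ttbcr(void)
{
	unsigned int val;

	asm volatile("mrc p15, 0, %0, c2, c0, 2" : "=r" (val));
	return val;
}

static inline void write_ttbcr(unsigned int val)
{
	asm volatile("mcr p15, 0, %0, c2, c0, 2" : : "r" (val));
	asm volatile("isb");
}

static inline void uaccess_disable_sketch(void)
{
	/* Normal kernel state: user mappings unreachable, PAN in effect */
	write_ttbcr(read_ttbcr() | TTBCR_EPD0);
}

static inline void uaccess_enable_sketch(void)
{
	/* Only around the uaccess helpers: briefly allow user accesses */
	write_ttbcr(read_ttbcr() & ~TTBCR_EPD0);
}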

KCFI on ARM32

Kernel Control Flow Integrity (KCFI) is a “forward edge control flow checker”. In practice this means that the compiler stores a hash of the function prototype in memory right before every indirectly callable function, and verifies that hash at every indirect call site, so that an attacker cannot easily redirect a call to an arbitrary target.
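In rough pseudo-C, the check that the compiler emits at every indirect call site behaves like the sketch below. The names and the hash value are made up, and the real instrumentation is generated by Clang rather than written by hand:

typedef int (*handler_fn)(int);

static int call_checked(handler_fn fn, int arg)
{
	const u32 expected = 0x9f3c2a17;	/* hash of "int (int)", made up */
	u32 found = *((const u32 *)fn - 1);	/* hash stored just before target */

	if (found != expected)
		BUG();		/* prototype mismatch: control-flow violation */
	return fn(arg);
}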

KCFI is currently only implemented in the LLVM Clang compiler, so the kernel needs to be compiled using Clang. This is typically achieved by passing the build flag LLVM=1 to the kernel build. As the Clang compiler is universal for all targets, the build system will figure out the rest.

Further, to support KCFI a fairly recent version of Clang is needed. The kernel build will check whether the compiler is new enough to support the option -fsanitize=kcfi; if not, the option will be disabled.

The patch set is pretty complex but gives you an overview of how the feature was implemented on ARM32. It involved patching the majority of functions written in assembly and called from C with the special SYM_TYPED_FUNC_START() and SYM_FUNC_END() macros, inserting KCFI hashes also before functions written in assembly.

The overhead of this feature seems to be small, so I recommend checking it out if you are able to use the Clang compiler.


from Gustavo A. R. Silva

The counted_by attribute

The counted_by attribute was introduced in Clang-18 and will soon be available in GCC-15. Its purpose is to associate a flexible-array member with a struct member that will hold the number of elements in this array at some point at run-time. This association is critical for enabling runtime bounds checking via the array bounds sanitizer and the __builtin_dynamic_object_size() built-in function. In user-space, this extra level of security is enabled by -D_FORTIFY_SOURCE=3. Therefore, using this attribute correctly enhances C codebases with runtime bounds-checking coverage on flexible-array members.

Here is an example of a flexible array annotated with this attribute:

struct bounded_flex_struct {
    ...
    size_t count;
    struct foo flex_array[] __attribute__((__counted_by__(count)));
};

In the above example, count is the struct member that will hold the number of elements of the flexible array at run-time. We will call this struct member the counter.

In the Linux kernel, this attribute facilitates bounds-checking coverage through fortified APIs such as the memcpy() family of functions, which internally use __builtin_dynamic_object_size() (CONFIG_FORTIFY_SOURCE), as well as through the array-bounds sanitizer (CONFIG_UBSAN_BOUNDS).
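As a minimal illustration of what the annotation enables, consider the user-space sketch below (assuming Clang 18 or GCC 15): __builtin_dynamic_object_size() can derive the array size from the counter at run-time.

#include <stdio.h>
#include <stdlib.h>

struct bounded {
	size_t count;
	int data[] __attribute__((__counted_by__(count)));
};

int main(void)
{
	struct bounded *b = malloc(sizeof(*b) + 8 * sizeof(int));

	if (!b)
		return 1;
	b->count = 8;	/* initialize the counter before touching data[] */

	/* Prints 32 (8 * sizeof(int)); without the attribute the compiler
	 * has no way to know the size and returns (size_t)-1 instead. */
	printf("%zu\n", __builtin_dynamic_object_size(b->data, 1));
	free(b);
	return 0;
}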

The __counted_by() macro

In the kernel we wrap the counted_by attribute in the __counted_by() macro, as shown below.

#if __has_attribute(__counted_by__)
# define __counted_by(member)  __attribute__((__counted_by__(member)))
#else
# define __counted_by(member)
#endif
  • c8248faf3ca27 (“Compiler Attributes: counted_by: Adjust name...“)

And with this we have been annotating flexible-array members across the whole kernel tree over the last year.

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.h b/drivers/net/ethernet/chelsio/cxgb4/sched.h
index 5f8b871d79afac..6b3c778815f09e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sched.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.h
@@ -82,7 +82,7 @@ struct sched_class {
 
 struct sched_table {      /* per port scheduling table */
 	u8 sched_size;
-	struct sched_class tab[];
+	struct sched_class tab[] __counted_by(sched_size);
 };
  • ceba9725fb45 (“cxgb4: Annotate struct sched_table with ...“)

However, as we are about to see, not all __counted_by() annotations are as straightforward as the one above.

__counted_by() annotations in the kernel

There are a number of requirements to properly use the counted_by attribute. One crucial requirement is that the counter must be initialized before the first reference to the flexible-array member. Another requirement is that the array must always contain at least as many elements as indicated by the counter. Below you can see an example of a kernel patch addressing these requirements.

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
index dac7eb77799bd1..68960ae9898713 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
@@ -33,7 +33,7 @@ struct brcmf_fweh_queue_item {
 	u8 ifaddr[ETH_ALEN];
 	struct brcmf_event_msg_be emsg;
 	u32 datalen;
-	u8 data[];
+	u8 data[] __counted_by(datalen);
 };
 
 /*
@@ -418,17 +418,17 @@ void brcmf_fweh_process_event(struct brcmf_pub *drvr,
 	    datalen + sizeof(*event_packet) > packet_len)
 		return;
 
-	event = kzalloc(sizeof(*event) + datalen, gfp);
+	event = kzalloc(struct_size(event, data, datalen), gfp);
 	if (!event)
 		return;
 
+	event->datalen = datalen;
 	event->code = code;
 	event->ifidx = event_packet->msg.ifidx;
 
 	/* use memcpy to get aligned event message */
 	memcpy(&event->emsg, &event_packet->msg, sizeof(event->emsg));
 	memcpy(event->data, data, datalen);
-	event->datalen = datalen;
 	memcpy(event->ifaddr, event_packet->eth.h_dest, ETH_ALEN);
 
 	brcmf_fweh_queue_event(fweh, event);
  • 62d19b358088 (“wifi: brcmfmac: fweh: Add __counted_by...“)

In the patch above, datalen is the counter for the flexible-array member data. Notice how the assignment to the counter event->datalen = datalen had to be moved to before the call to memcpy(event->data, data, datalen); this ensures the counter is initialized before the first reference to the flexible array. Otherwise, the compiler would complain about trying to write into a flexible array of size zero, due to datalen being zeroed out by a previous call to kzalloc(). This assignment-after-memcpy pattern has been quite common in the Linux kernel. However, when dealing with counted_by annotations, this pattern should be changed. Therefore, we have to be careful when doing these annotations. We should audit all instances of code that reference both the counter and the flexible array and ensure they meet the proper requirements.
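In generic form, the required ordering looks like the following sketch (illustrative user-space code, not taken from the kernel):

#include <stdlib.h>
#include <string.h>

struct item {
	size_t len;
	unsigned char payload[] __attribute__((__counted_by__(len)));
};

static struct item *item_new(const unsigned char *src, size_t n)
{
	struct item *it = malloc(sizeof(*it) + n);

	if (!it)
		return NULL;
	it->len = n;			/* initialize the counter first... */
	memcpy(it->payload, src, n);	/* ...then reference the flexible array */
	return it;
}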

In the kernel, we've been learning from our mistakes and have fixed some buggy annotations we made in the beginning. Here are a couple of bugfixes to make you aware of these issues:

  • 6dc445c19050 (“clk: bcm: rpi: Assign ->num before accessing...“)

  • 9368cdf90f52 (“clk: bcm: dvp: Assign ->num before accessing...“)

Another common issue is when the counter is updated inside a loop. See the patch below.

diff --git a/drivers/net/wireless/ath/wil6210/cfg80211.c b/drivers/net/wireless/ath/wil6210/cfg80211.c
index 8993028709ecfb..e8f1d30a8d73c5 100644
--- a/drivers/net/wireless/ath/wil6210/cfg80211.c
+++ b/drivers/net/wireless/ath/wil6210/cfg80211.c
@@ -892,10 +892,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	struct wil6210_priv *wil = wiphy_to_wil(wiphy);
 	struct wireless_dev *wdev = request->wdev;
 	struct wil6210_vif *vif = wdev_to_vif(wil, wdev);
-	struct {
-		struct wmi_start_scan_cmd cmd;
-		u16 chnl[4];
-	} __packed cmd;
+	DEFINE_FLEX(struct wmi_start_scan_cmd, cmd,
+		    channel_list, num_channels, 4);
 	uint i, n;
 	int rc;
 
@@ -977,9 +975,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	vif->scan_request = request;
 	mod_timer(&vif->scan_timer, jiffies + WIL6210_SCAN_TO);
 
-	memset(&cmd, 0, sizeof(cmd));
-	cmd.cmd.scan_type = WMI_ACTIVE_SCAN;
-	cmd.cmd.num_channels = 0;
+	cmd->scan_type = WMI_ACTIVE_SCAN;
+	cmd->num_channels = 0;
 	n = min(request->n_channels, 4U);
 	for (i = 0; i < n; i++) {
 		int ch = request->channels[i]->hw_value;
@@ -991,7 +988,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 			continue;
 		}
 		/* 0-based channel indexes */
-		cmd.cmd.channel_list[cmd.cmd.num_channels++].channel = ch - 1;
+		cmd->num_channels++;
+		cmd->channel_list[cmd->num_channels - 1].channel = ch - 1;
 		wil_dbg_misc(wil, "Scan for ch %d  : %d MHz\n", ch,
 			     request->channels[i]->center_freq);
 	}
@@ -1007,16 +1005,15 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	if (rc)
 		goto out_restore;
 
-	if (wil->discovery_mode && cmd.cmd.scan_type == WMI_ACTIVE_SCAN) {
-		cmd.cmd.discovery_mode = 1;
+	if (wil->discovery_mode && cmd->scan_type == WMI_ACTIVE_SCAN) {
+		cmd->discovery_mode = 1;
 		wil_dbg_misc(wil, "active scan with discovery_mode=1\n");
 	}
 
 	if (vif->mid == 0)
 		wil->radio_wdev = wdev;
 	rc = wmi_send(wil, WMI_START_SCAN_CMDID, vif->mid,
-		      &cmd, sizeof(cmd.cmd) +
-		      cmd.cmd.num_channels * sizeof(cmd.cmd.channel_list[0]));
+		      cmd, struct_size(cmd, channel_list, cmd->num_channels));
 
 out_restore:
 	if (rc) {
diff --git a/drivers/net/wireless/ath/wil6210/wmi.h b/drivers/net/wireless/ath/wil6210/wmi.h
index 71bf2ae27a984f..b47606d9068c8b 100644
--- a/drivers/net/wireless/ath/wil6210/wmi.h
+++ b/drivers/net/wireless/ath/wil6210/wmi.h
@@ -474,7 +474,7 @@ struct wmi_start_scan_cmd {
 	struct {
 		u8 channel;
 		u8 reserved;
-	} channel_list[];
+	} channel_list[] __counted_by(num_channels);
 } __packed;
 
 #define WMI_MAX_PNO_SSID_NUM	(16)
  • 34c34c242a1b (“wifi: wil6210: cfg80211: Use __counted_by...“)

The patch above does a bit more than merely annotating the flexible array with the __counted_by() macro, but that's material for a future post. For now, let's focus on the following excerpt.

-	cmd.cmd.scan_type = WMI_ACTIVE_SCAN;
-	cmd.cmd.num_channels = 0;
+	cmd->scan_type = WMI_ACTIVE_SCAN;
+	cmd->num_channels = 0;
 	n = min(request->n_channels, 4U);
 	for (i = 0; i < n; i++) {
 		int ch = request->channels[i]->hw_value;
@@ -991,7 +988,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 			continue;
 		}
 		/* 0-based channel indexes */
-		cmd.cmd.channel_list[cmd.cmd.num_channels++].channel = ch - 1;
+		cmd->num_channels++;
+		cmd->channel_list[cmd->num_channels - 1].channel = ch - 1;
 		wil_dbg_misc(wil, "Scan for ch %d  : %d MHz\n", ch,
 			     request->channels[i]->center_freq);
 	}
 ...
--- a/drivers/net/wireless/ath/wil6210/wmi.h
+++ b/drivers/net/wireless/ath/wil6210/wmi.h
@@ -474,7 +474,7 @@ struct wmi_start_scan_cmd {
 	struct {
 		u8 channel;
 		u8 reserved;
-	} channel_list[];
+	} channel_list[] __counted_by(num_channels);
 } __packed;

Notice that in this case, num_channels is our counter, and it's set to zero before the for loop. Inside the for loop, the original code used this variable as an index to access the flexible array, then updated it via a post-increment, all in one line: cmd.cmd.channel_list[cmd.cmd.num_channels++]. The issue is that once channel_list was annotated with the __counted_by() macro, the compiler enforces dynamic array indexing of channel_list to stay below num_channels. Since num_channels holds a value of zero at the moment of the array access, this leads to undefined behavior and may trigger a compiler warning.

As shown in the patch, the solution is to increment num_channels before accessing the array, and then access the array at index num_channels - 1, which stays below the counter.

Another option is to avoid using the counter as an index for the flexible array altogether. This can be done by using an auxiliary variable instead. See an excerpt of a patch below.

diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h
index 38eb7ec86a1a65..21ebd70f3dcc97 100644
--- a/include/net/bluetooth/hci.h
+++ b/include/net/bluetooth/hci.h
@@ -2143,7 +2143,7 @@ struct hci_cp_le_set_cig_params {
 	__le16  c_latency;
 	__le16  p_latency;
 	__u8    num_cis;
-	struct hci_cis_params cis[];
+	struct hci_cis_params cis[] __counted_by(num_cis);
 } __packed;

@@ -1722,34 +1717,33 @@ static int hci_le_create_big(struct hci_conn *conn, struct bt_iso_qos *qos)
 
 static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 {
 ...

+	u8 aux_num_cis = 0;
 	u8 cis_id;
 ...

 	for (cis_id = 0x00; cis_id < 0xf0 &&
-	     pdu.cp.num_cis < ARRAY_SIZE(pdu.cis); cis_id++) {
+	     aux_num_cis < pdu->num_cis; cis_id++) {
 		struct hci_cis_params *cis;
 
 		conn = hci_conn_hash_lookup_cis(hdev, NULL, 0, cig_id, cis_id);
@@ -1758,7 +1752,7 @@ static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 
 		qos = &conn->iso_qos;
 
-		cis = &pdu.cis[pdu.cp.num_cis++];
+		cis = &pdu->cis[aux_num_cis++];
 		cis->cis_id = cis_id;
 		cis->c_sdu  = cpu_to_le16(conn->iso_qos.ucast.out.sdu);
 		cis->p_sdu  = cpu_to_le16(conn->iso_qos.ucast.in.sdu);
@@ -1769,14 +1763,14 @@ static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 		cis->c_rtn  = qos->ucast.out.rtn;
 		cis->p_rtn  = qos->ucast.in.rtn;
 	}
+	pdu->num_cis = aux_num_cis;
 
 ...
  • ea9e148c803b (“Bluetooth: hci_conn: Use __counted_by() and...“)

Again, the entire patch does more than merely annotate the flexible-array member, but let's just focus on how aux_num_cis is used to access the flexible array pdu->cis[].

In this case, the counter is num_cis. As in our previous example, originally, the counter is used to directly access the flexible array: &pdu.cis[pdu.cp.num_cis++]. However, the patch above introduces a new variable aux_num_cis to be used instead of the counter: &pdu->cis[aux_num_cis++]. The counter is then updated after the loop: pdu->num_cis = aux_num_cis.

Both solutions are acceptable, so use whichever is convenient for you. :)

Here, you can see a recent bugfix for some buggy annotations that missed the details discussed above:

  • [PATCH] wifi: iwlwifi: mvm: Fix _counted_by usage in cfg80211_wowlan_nd*

In a future post, I'll address the issue of annotating flexible arrays of flexible structures. Spoiler alert: don't do it!

Latest version: How to use the new counted_by attribute in C (and Linux)


from Konstantin Ryabitsev

Message-ID's are used to identify and retrieve messages from the public-inbox archive on lore.kernel.org, so it's only natural to want to use memorable ones. Or maybe it's just me.

Regardless, here's what I do with neomutt and coolname:

  1. If coolname isn't yet packaged for your distro, you can install it with pip:

    pip install --user coolname
    
  2. Create this file as ~/bin/my-msgid.py:

    #!/usr/bin/python3
    import sys
    import random
    import string
    import datetime
    import platform
    
    from coolname import generate_slug
    
    parts = []
    parts.append(datetime.datetime.now().strftime('%Y%m%d'))
    parts.append(generate_slug(3))
    parts.append(''.join(random.choices(string.hexdigits, k=6)).lower())
    
    sys.stdout.write('-'.join(parts) + '@' + platform.node().split('.')[0])
    
  3. Create this file as ~/.mutt-fix-msgid:

    my_hdr Message-ID: <`/path/to/my/bin/my-msgid.py`>
    
  4. Add this to your .muttrc (works with mutt and neomutt):

    send-hook . "source ~/.mutt-fix-msgid"
    
  5. Enjoy funky message-id's like 20240227-flawless-capybara-of-drama-e09653@lemur. :)


from Jakub Kicinski

Developments in Linux kernel networking accomplished by many excellent developers and as remembered by Andrew L, Eric D, Jakub K and Paolo A.

Intro

The end of the Linux v6.2 merge window coincided with the end of 2022, and the v6.8 window had just begun, meaning that during 2023 we developed for 6 kernel releases (v6.3 – v6.8). Throughout those releases netdev patch handlers (DaveM, Jakub, Paolo) applied 7243 patches, and the resulting pull requests to Linus described the changes in 6398 words. Given the volume of work we cannot go over every improvement, or even cover networking sub-trees in much detail (BPF enhancements… wireless work on WiFi 7…). We instead try to focus on major themes, and developments we subjectively find interesting.

Core and protocol stack

Some kernel-wide winds of development have blown our way in 2023. In v6.5 we saw an addition of SCM_PIDFD and SO_PEERPIDFD APIs for credential passing over UNIX sockets. The APIs duplicate existing ones but use pidfds rather than integer PIDs. We have also seen a number of real-time related patches throughout the year.

v6.5 brought a major overhaul of the socket splice implementation. Instead of feeding data into sockets page by page via a .sendpage callback, the socket .sendmsg handlers were extended to allow taking a reference on the data in struct msghdr. Continuing with the category of “scary refactoring work”, we have also merged an overhaul of locking in two subsystems – the wireless stack and devlink.

Early in the year we saw the tail end of the BIG TCP development (the ability to send chunks of more than 64kB of data through the stack at a time). v6.3 added support for BIG TCP over IPv4; the initial implementation in 2021 supported only IPv6, as the IPv4 packet header has no way of expressing lengths which don’t fit in 16 bits. The v6.4 release also made the size of the “page fragment” array in the skb configurable at compilation time. A larger array increases the packet metadata size, but also increases the chances of being able to use BIG TCP when data is scattered across many pages.

Networking needs to allocate (and free) packet buffers at a staggering rate, and we see a continuous stream of improvements in this area. Most of the work these days centers on the page_pool infrastructure. v6.5 enabled recycling freed pages back to the pool without using any locks or atomic operations (when recycling happens in the same softirq context in which we expect the allocator to run). v6.7 reworked the API making allocation of arbitrary-size buffers (rather than pages) easier, also allowing removal of PAGE_SIZE-dependent logic from some drivers (16kB pages on ARM64 are increasingly important). v6.8 added uAPI for querying page_pool statistics over Netlink. Looking forward – there’s ongoing work to allow page_pools to allocate either special (user-mapped, or huge page backed) pages or buffers without struct page (DMABUF memory). In the non-page_pool world – a new slab cache was also added to avoid having to read struct page associated with the skb heads at freeing time, avoiding potential cache misses.

A number of key networking data structures (skb, netdevice, page_pool, sock, netns, mibs, nftables, fq scheduler) have been reorganized to optimize cacheline consumption and avoid cache misses. This reportedly improved TCP RPC performance with many connections on some AMD systems by as much as 40%.

In v6.7 the commonly used Fair Queuing (FQ) packet scheduler gained built-in support for 3 levels of priority and the ability to bypass queuing completely if the packet can be sent immediately (resulting in a 5% speedup for TCP RPCs).

Notable TCP developments this year include TCP Auth Option (RFC 5925) support, support for microsecond resolution of timestamps in the TimeStamp Option, and ACK batching optimizations.

Multi-Path TCP (MPTCP) is slowly coming to maturity, with most development effort focusing on reducing the features gap with plain TCP in terms of supported socket options, and increasing observability and introspection via native diag interface. Additionally, MPTCP has gained eBPF support to implement custom packet schedulers and simplify the migration of existing TCP applications to the multi-path variant.

Transport encryption continues to be very active as well. An increasing number of NICs support some form of crypto offload (TLS, IPsec, MACsec). Notably, this year we gained in-kernel users (NFS, NVMe, i.e. storage) of TLS encryption. Because the kernel doesn’t have support for performing the TLS handshake by itself, a new mechanism was developed to hand over kernel-initiated TCP sockets to user space temporarily, where a well-tested user space library like OpenSSL or GnuTLS can perform the TLS handshake and negotiation, and then hand the connection back over to the kernel, with the keys installed.

The venerable bridge implementation has gained a few features. The majority of bridge development these days is driven by offloads (controlling hardware switches) and, in the case of data center switches, EVPN support. Users can now limit the number of FDB and MDB auto-learned entries and selectively flush them in both bridge and VxLAN tunnels. v6.5 added the ability to selectively forward packets to VxLAN tunnels depending on whether they had missed the FDB in the lower bridge.

Among changes which may be more immediately visible to users – starting from v6.5 the IPv6 stack no longer prints the “link becomes ready” message when an interface is brought up.

The AF_XDP zero-copy sockets have gained two major features in 2023. In v6.6 we gained multi-buffer support which allows transferring packets which do not fit in a single buffer (scatter-gather). v6.8 added Tx metadata support, enabling NIC Tx offloads on packets sent on AF_XDP sockets (checksumming, segmentation) as well as timestamping.

Early in the year we merged specifications and tooling for describing Netlink messages in YAML format. This work has grown to cover most major Netlink families (both legacy and generic). The specs are used to generate kernel ops/parsers, the uAPI headers, and documentation. User space can leverage the specs to serialize/deserialize Netlink messages without having to manually write parsers (C and Python are supported so far).

Device APIs

Apart from describing existing Netlink families, the YAML specs were put to use in defining new APIs. The “netdev” family was created to expose network device internals (BPF/XDP capabilities, information about device queues, NAPI instances, interrupt mapping, etc.).

In the “ethtool” family – v6.3 brought APIs for configuring Ethernet Physical Layer Collision Avoidance (PLCA) (802.3cg-2019, a modern version of shared medium Ethernet) and the MAC Merge layer (IEEE 802.3-2018 clause 99, allowing preemption of low priority frames by high priority frames).

After many attempts we have finally gained solid integration between the networking and the LED subsystems, allowing hardware-driven blinking of LEDs on Ethernet ports and SFPs to be configured using Linux LED APIs. Driver developers are working through the backlog of all devices which need this integration.

In general, automotive Ethernet-related contributions grew significantly in 2023, and with it, more interest in “slow” networking like 10Mbps over a single pair. Although the Data Center tends to dominate Linux networking events, the community as a whole is very diverse.

Significant development work went into refactoring and extending time-related networking APIs. Time stamping and time-based scheduling of packets has wide use across network applications (telcos, industrial networks, data centers). The most user visible addition is likely the DPLL subsystem in v6.7, used to configure and monitor atomic clocks and machines which need to forward clock phase between network ports.

Last but not least, late in the year the networking subsystem gained the first Rust API, for writing PHY drivers, as well as a driver implementation (duplicating an existing C driver, for now).

Removed

Inspired by the recurring discussion about code removal at the Maintainer Summit, let us mention places in the networking subsystem where code was retired this year. First and foremost, in v6.8 wireless maintainers removed a lot of very old WiFi drivers; earlier, in v6.3, they also retired parts of WEP security. In v6.7 some parts of AppleTalk were removed. In v6.3 (and v6.8) we retired a number of packet schedulers and packet classifiers from the TC subsystem (act_ipt, act_rsvp, act_tcindex, sch_atm, sch_cbq, sch_dsmark). This was partially driven by an influx of syzbot and bug-bounty-driven security reports (there are many ways to earn money with Linux, it turns out 🙂). Finally, the kernel parts of the bpfilter experiment were removed in v6.8, as the development effort had moved to user space.

Community & process

The maintainers, developers and community members had a chance to meet at the BPF/netdev track at Linux Plumbers in Richmond, and the netdev.conf 0x17 conference in Vancouver. 2023 was also the first time since the COVID pandemic when we organized the small netconf gathering – thanks to Meta for sponsoring and Kernel Recipes for hosting us in Paris!

We have made minor improvements to the mailing list development process by allowing a wider set of folks to update patch status using simple “mailbot commands”. Patch authors and anyone listed in MAINTAINERS for file paths touched by a patch series can now update the submission state in patchwork themselves.

The per-release development statistics, started late in the previous year, are now an established part of the netdev process, marking the end of each development cycle. They proved to be appreciated by the community and, more importantly, to somewhat steer some of the less participatory citizens towards better and more frequent contributions, especially on the review side.

A small but growing number of silicon vendors have started to try to mainline drivers without having the necessary experience, or mentoring needed to effectively participate in the upstream process. Some without consulting any of our documentation, others without consulting teams within their organization with more upstream experience. This has resulted in poor quality patch sets, taken up valuable time from the reviewers and led to reviewer frustration.

Much like the kernel community at large, we have been steadily shifting focus onto kernel testing, integrating it into our development process. In the olden days the kernel tree did not carry many tests, and testing was seen as something largely external to the kernel project. The tools/testing/selftests directory was only created in 2012, and lib/kunit in 2019! We have accumulated a number of selftests for networking over the years, and in 2023 there were multiple large selftest refactoring and speed-up efforts. Our netdev CI started running all kunit tests and networking selftests on posted patches (although, to be honest, the selftest runner only started working in January 2024 🙂).

syzbot stands out among “external” test projects as particularly valuable for networking. We fixed roughly 200 syzbot-reported bugs this year. This took a significant amount of maintainer work, but in general we find syzbot bug reports to be useful, high quality and a pleasure to work on.

6.3: https://lore.kernel.org/all/20230221233808.1565509-1-kuba@kernel.org/
6.4: https://lore.kernel.org/all/20230426143118.53556-1-pabeni@redhat.com/
6.5: https://lore.kernel.org/all/20230627184830.1205815-1-kuba@kernel.org/
6.6: https://lore.kernel.org/all/20230829125950.39432-1-pabeni@redhat.com/
6.7: https://lore.kernel.org/all/20231028011741.2400327-1-kuba@kernel.org/
6.8: https://lore.kernel.org/all/20240109162323.427562-1-pabeni@redhat.com/

 
Read more...

from arnd

Most compilers have an option to warn about a function that has a global definition but no declaration: gcc has had -Wmissing-prototypes as far back as the 1990s, and the sparse checker introduced -Wdecl back in 2005. Ensuring that each function has a declaration helps validate that the caller and the callee expect the same argument types; it can help find unused functions, and it helps mark functions as static where possible to improve inter-function optimizations.
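
As a hypothetical illustration (the file and function names here are made up), the warning fires on a global definition with no prototype in scope, and either a declaration in a shared header or the static keyword silences it:

/* foo.c -- hypothetical example. Building with gcc -Wmissing-prototypes
 * warns on a global definition with no prior declaration, roughly:
 *
 *   warning: no previous prototype for 'foo_compute' [-Wmissing-prototypes]
 *
 * The fix is a declaration in a header shared with all callers (sketched
 * inline here), or marking the function static if it has no external users.
 */

static int foo_helper(int x)	/* static: no prototype needed */
{
	return x + 1;
}

int foo_compute(int x);		/* would normally live in foo.h */

int foo_compute(int x)
{
	return foo_helper(x) * 2;
}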

The warnings are not enabled in a default build, but are part of both make W=1 and make C=1 builds, and in fact they used to make up most of the former's output. As a number of subsystems have moved to eliminating all the W=1 warnings in their code, and the 0-day bot warns about newly introduced warnings, the amount of warning output from this has gone down over time.

After I saw a few patches addressing individual warnings in this area, I had a look at what actually remains. For my soc tree maintenance, I already run my own build bot that checks the output of “make randconfig” builds for 32-bit and 64-bit arm as well as x86, and apply local bugfixes to address any warning or error I get. I then enabled -Wmissing-prototypes unconditionally and added patches to address every single new bug I found, around 140 in total.

I uploaded the patches to https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=missing-prototypes and am sending them to the respective maintainers separately. Once all of these, or some other way to address each warning, can be merged into the mainline kernel, the warning option can be moved from W=1 to the default set.

The patches are all independent of one another, so I hope that most of them can get applied to subsystems directly as soon as I post them.

Some of the remaining architectures are already clean, while others will need follow-up patches for this. Another possible follow-up is to also address -Wmissing-variable-declarations warnings. This option is understood by clang but not enabled by the kernel build system, and not implemented by gcc, with the feature request being open since 2017.

 
Read more...

from Jakub Kicinski

LWN's development statistics have been published at the end of each release cycle for as long as I can remember (Linux 6.3 stats). Thinking back, I can divide the stages of my career based on my relationship with those stats: fandom; aspiring; success; cynicism; professionalism (showing the stats to my manager). The last one gave me the most pause.

Developers will agree (I think) that patch count is not a great metric for the value of the work. Yet, most of my managers had a distinct spark in their eye when I shared the fact that some random API refactoring landed me in the top 10.

Understanding the value of independently published statistics and putting in the necessary work to calculate them release after release is one of many things we should be thankful for to LWN.

Local stats

With that in mind it's only logical to explore calculating local subsystem statistics. Global kernel statistics can only go so far. The top 20 can only, by definition, highlight the work of 20 people, and we have thousands of developers working on each release. The networking list alone sees around 700 people participate in discussions for each release.

Another relatively recent development which opens up opportunities is the creation of the lore archive, specifically how easy it is now to download and process any mailing list's history. LWN stats are generated primarily based on git logs. Without going into too much of a sidebar – if we care about the kernel community, not how much code various corporations can ship into the kernel – mailing list data mining is a better approach than git data mining. Global mailing list stats would be a challenge, but subsystems are usually tied to a single list.

netdev stats

During the 6.1 merge window I could no longer resist the temptation and I threw some Python and the lore archive of netdev into a blender. My initial goal was to highlight the work of people who review patches, rather than only ship code, or bombard the mailing list with trivial patches of varying quality. I compiled stats for the last 4 release cycles (6.1, 6.2, 6.3, and 6.4), each with more data and metrics. Kernel developers are, outside of matters relating to their code, generally quiet beasts, so I haven't received a ton of feedback. If we trust the statistics themselves, however, the share of patches applied directly by networking maintainers that carry review tags has increased from around 30% to an unbelievable 65%.

We've also seen a significant decrease in the number of trivial patches sent by semi-automated bots (possibly to game the git-based stats). It may be a result of other push-back against such efforts, so I can't take full credit :)

Random example

I should probably give some more example stats. The individual and company stats generated for netdev are likely not that interesting to a reader outside of netdev, but perhaps the “developer tenure” stats will be. I calculated those to see whether we have a healthy number of new members.

Time since first commit in the git history for reviewers
 0- 3mo   |   2 | *
 3- 6mo   |   3 | **
6mo-1yr   |   9 | *******
 1- 2yr   |  23 | ******************
 2- 4yr   |  33 | ##########################
 4- 6yr   |  43 | ##################################
 6- 8yr   |  36 | #############################
 8-10yr   |  40 | ################################
10-12yr   |  31 | #########################
12-14yr   |  33 | ##########################
14-16yr   |  31 | #########################
16-18yr   |  46 | #####################################
18-20yr   |  49 | #######################################

Time since first commit in the git history for authors
 0- 3mo   |  40 | **************************
 3- 6mo   |  15 | **********
6mo-1yr   |  23 | ***************
 1- 2yr   |  49 | ********************************
 2- 4yr   |  47 | ###############################
 4- 6yr   |  50 | #################################
 6- 8yr   |  31 | ####################
 8-10yr   |  33 | #####################
10-12yr   |  19 | ############
12-14yr   |  25 | ################
14-16yr   |  22 | ##############
16-18yr   |  32 | #####################
18-20yr   |  31 | ####################

As I shared on the list – the “recent” buckets are sparse for reviewers and more filled for authors, as expected. What I haven't said is that if one steps away from the screen to look at the general shape of the histograms, things are not perfect: the author and the reviewer histograms seem to skew in opposite directions. I'll leave it to the reader to ponder what the perfect shape of such a graph should be for a project; I have my hunch. Regardless, I'm hoping we can learn something by tracking its changes over time.

Fin

To summarize – I think that spending a day in each release cycle to hack on/generate development stats for the community is a good investment of a maintainer's time. They let us show appreciation, check our own biases and, by carefully selecting the metrics, encourage good behavior. My hacky code is available on GitHub, FWIW, but using mine may go against the benefits of locality? LWN's code is also available publicly (search for gitdm, IIRC).

 
Read more...

from linusw

Recently I needed to understand how the ARM32 architecture switches control of execution between normal userspace processes and kernel processes, such as the init task and the kernel threads. Understanding this invariably involves understanding two aspects of the ARM32 kernel:

  • How tasks are actually scheduled on ARM32
  • How the kernelspace and userspace are actually separated, and thus how we move from one to the other

This is going to require knowledge from some other (linked) articles and a good understanding of ARM32 assembly.

Terminology

With tasks we mean processes, threads and kernel threads. The kernel scheduler sees no major difference between these: they are schedulable entities that live on a certain CPU.

Kernel threads are the easiest to understand: in the big computer program that is the kernel, different threads execute on behalf of managing the kernel. They are all instantiated by a special thread called kthreadd — the kernel thread daemon. They exist for various purposes, one is to provide process context to interrupt threads, another to run workqueues such as delayed work and so on. It is handy for e.g. kernel drivers to be able to hand over execution to a process context that can churn on in the background.

Processes in userspace are in essence executing computer programs, or objects with an older terminology, giving the origin of expressions such as object file format. The kernel will start very few such processes, but modprobe and init (which always has process ID 1) are notable exceptions. Any other userspace processes are started by init. Processes can fork new processes, and it can also create separate threads of execution within itself, and these will become schedulable entities as well, so a certain process (executing computer program) can have concurrency within itself. POSIX threads is usually the way this happens and further abstractions such as the GLib GThread etc exist.

[Figure: task pie chart] A pie chart of tasks according to priority on a certain system, produced using CGFreak, shows that from a scheduler point of view there are just tasks: any kernel threads or threads spawned from processes simply become schedulable task entities.

The userspace is the commonplace name given to a specific context of execution where we execute processes. What defines this context is that it has its own memory context, a unique MMU table, which in the ARM32 case gives each process a huge virtual memory to live in. Its execution is isolated from the kernel and also from other processes, but not from its own threads (typically POSIX threads). To communicate with either the kernel or other userspace processes, it needs to use system calls “syscalls” or emit or receive signals. Both mechanisms are realized as software interrupts. (To communicate with its own spawned threads, shortcuts are available.)

The kernelspace conversely is the context of execution of the operating system, in our case Linux. It has its own memory context (MMU table), but some of the kernel memory is usually also accessible by the userspace processes, and the virtual memory space is shared, so that exceptions can jump directly into kernel code in virtual memory, and the kernel can directly read and write into userspace memory. This is done to facilitate quick communication between the kernel and userspace. Depending on the architecture we are executing Linux on, executing in kernelspace is associated with elevated machine privileges, and means the operating system can issue certain privileged instructions or otherwise access certain restricted resources. The MMU table permissions protect kernel code from being inspected or overwritten by userspace processes.

Background

This separation, along with everything else we take for granted in modern computers and operating systems, was created in the first time-sharing systems such as the CTSS running on the IBM 700/7000 series computers in the late 1950s. The Ferranti Atlas Computer in 1962-1967 and its supervisor program followed shortly after. The Atlas invented nifty features such as virtual memory and memory-mapped I/O, and was of course also using time-sharing. As can easily be guessed, these computer and operating system (supervisor) designs inspired the hardware design and operating system designs of later computers such as the PDP-11, where Unix began. This is why Unix-like operating systems such as Linux more or less take all of these features and concepts for granted.

The idea of a supervisor or operating system goes deep into the design of CPUs, so for example the Motorola 68000 CPU had three function code pins routed out on the package, FC2, FC1 and FC0, comprising three bits of system mode, four of these bit combinations representing user data, user program, supervisor data and supervisor program. (These modes even reflect the sectioning of program and supervisor objects into program code or TEXT segments and DATA segments.) In the supervisor mode, FC2 was always asserted. This way physical access to memory-mapped peripherals could be electronically constrained to access only from supervisor mode. Machines such as the Atari ST exploited this possibility, while others such as the Commodore Amiga did not.

All this is said to give you a clear idea why the acronym SVC, as in Supervisor Call, is used rather than e.g. “operating system call” or “kernel call”, which would have been more natural: the naming is historical.

Execution Modes or Levels

We will restrict the following discussion to the ARMv4 and later ARM32 architectures, which are what Linux supports.

When it comes to the older CPUs in the ARMv4, ARMv5 and ARMv6 range these have a special supervisor mode (SVC mode) and a user mode, and as you could guess these two modes are mapped directly to kernelspace and userspace in Linux. In addition to this there are actually 5 additional exception modes for FIQ, IRQ, system mode, abort and undefined, so 7 modes in total! To cut a long story short, all of the modes except the user mode belong to kernelspace.

Apart from restricting certain instructions, the only thing actually separating the kernelspace from userspace is the MMU, which is protecting kernelspace from userspace in the same way that different userspace processes are protected from each other: by using virtual memory to hide physical memory, and in the cases where it is not hidden: using protection bits in the page table to restrict access to certain memory areas. The MMU table can naturally only be altered from supervisor mode and this way it is clear who is in control.

The later versions of the ARM32 CPU, the ARMv7, add a further hypervisor mode, and an even deeper secure monitor (or just “monitor”) mode.

For reference, these modes in the ARMv8 architecture correspond to “privilege levels”. Here the kernelspace executes at exception level EL1, and userspace at exception level EL0; then there are the further EL2 and EL3 “higher” privilege levels. EL2 is used for hypervisors (virtualization) and EL3 is used for a secure monitor that oversees the switch back and forth to the trusted execution environment (TEE), which is a parallel and different operating environment, essentially like a different computer: Linux can interact with it (as can be seen in drivers/tee in the kernel) but it is a different thing than Linux entirely.

These higher privilege levels and the secure mode with its hypervisor and TEE are not always used and may be dormant. Strictly speaking, the security and virtualization functionality is optional, so it is perfectly fine to fabricate ARMv7 silicon without them. To accompany the supervisor call (SVC) on ARMv7, a hypervisor call (HVC) and a secure monitor call (SMC) instruction were added.

Exceptional Events

We discussed that different execution modes pertain to certain exceptions. So let's recap ARM32 exceptions.

As exceptions go, these happen both in kernelspace and userspace, but they are always handled in kernelspace. If a userspace process for example divides by zero, an exception occurs that takes us into the kernel, all the time pushing state onto the stack, and resuming execution inside the kernel, which will simply terminate the process over this. If the kernel itself divides by zero we get a kernel crash, since there is no way out.

The most natural exception is of course a hardware interrupt, such as when a user presses a key or a hard disk signals that a sector of data has been placed in a buffer, or a network card indicates that an ethernet packet is available from the interface.

Additionally, as mentioned previously, most architectures support a special type of software exception that is initiated for carrying out system calls, and on ARM and AArch64 that is what is these days called the SVC (supervisor call) instruction. This very same instruction — i.e. with the same binary operation code — was previously called SWI (software interrupt), which makes things a bit confusing at times, especially when reading old documentation and old code, but the assembly mnemonics SVC and SWI have the same semantics. For comparison, on m68k this instruction is named TRAP, on x86 there is the INT instruction, and on RISC-V the ECALL instruction serves this role (used, among other things, for the SBI, the supervisor binary interface).
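
To make the mechanism concrete, here is a minimal userspace sketch that issues a system call through a raw SVC instruction, assuming the ARM EABI convention (syscall number in r7, arguments in r0 and up, return value in r0; 20 is __NR_getpid on ARM EABI):

#include <stdio.h>

int main(void)
{
	/* Local register variables bind the values to the registers that
	 * the EABI syscall convention expects. */
	register long r7 asm("r7") = 20;	/* __NR_getpid */
	register long r0 asm("r0");

	/* The svc instruction raises the software interrupt that takes us
	 * from user mode into the kernel's SWI/SVC exception handler. */
	asm volatile("svc #0"
		     : "=r" (r0)
		     : "r" (r7)
		     : "memory");

	printf("getpid() via raw svc: %ld\n", r0);
	return 0;
}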

In my article about how the ARM32 architecture is set up I talk about the exception vector table, which is eight 32-bit pointers stored in virtual memory from 0xFFFF0000 to 0xFFFF0020, and it corresponds roughly to everything that can take us from kernelspace to userspace and back.

The transitions occurs at these distinct points:

  • A hardware RESET occurs. This is pretty obvious: we need to abort all user program execution, return to the kernel and take everything offline.
  • An undefined instruction is encountered. The program flow cannot continue if this happens and the kernel has to do something about it. The most typical use for this is to implement software fallback for floating-point arithmetic instructions that some hardware may be lacking. These fallbacks will in that case be implemented by the kernel. (Since doing this with a context switch and software fallback in the kernel is expensive, you would normally just compile the program with a compiler that replace the floating point instructions with software fallbacks to begin with, but not everyone has the luxury of source code and build environment available and have to run pre-compiled binaries with floating point instructions.)
  • A software interrupt occurs. This is the most common way that a userspace application issues a system call (supervisor call) into the operating system. As mentioned, on ARM32 this is implemented by the special SVC (aka SWI) instruction, which passes a small immediate operand (one byte in Thumb state, 24 bits in ARM state) to the software interrupt handler.
  • A prefetch abort occurs. This happens when the instruction pointer runs into unpaged memory, and the virtual memory manager (mm) needs to page in new virtual memory to continue execution. Naturally this is a kernel task.
  • A data abort occurs. This is essentially the same as the prefetch abort but the program is trying to access unpaged data rather than unpaged instructions.
  • An address exception occurs. This doesn't happen on modern ARM32 CPUs, because the exception is for when the CPU moves outside the former 26bit address space on ARM26 architectures that Linux no longer supports.
  • A hardware interrupt occurs – since the operating system handles all hardware, naturally whenever one of these occur, we have to switch to kernel context. The ARM CPUs have two hardware interrupt lines: IRQ and FIQ. Each can be routed to an external interrupt controller, the most common being the GIC (Global Interrupt Controller) especially for multicore systems, but many ARM systems use their own, custom interrupt controllers.
  • A fault occurs such as through division by zero or other arithmetic fault – the CPU runs into an undefined state and has no idea how to recover and continue. This is also called a processor abort.

That's all. But these are indeed exceptions. What is the rule? The computer programs that correspond to the kernel and each userspace process have to start somewhere, and then they are executed in time slices, which means that somehow they get interrupted by one of these exceptions and preempted, a procedure that in turn invariably involves transitions back and forth from userspace to kernelspace and back into userspace again.

So how does that actually happen? Let's look at that next.

Entering Kernelspace

Everything has a beginning. I have explained in a previous article how the kernel bootstraps from the function start_kernel() in init/main.c and sets up the architecture including virtual memory to a point where the architecture-neutral parts of the kernel starts executing.

Further down start_kernel() we initialize the timers, start the clocksource (the Linux system timeline) and initialize the scheduler so that process scheduling can happen. But nothing really happens, because there are no processes. Then the kernel reaches the end of the start_kernel() function where arch_call_rest_init() is called. This is in most cases a call to rest_init() in the same file (only S390 does anything different) and that in turn actually initializes some processes:

pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
(...)
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);

We create separate threads running the in-kernel functions kernel_init and kthreadd, the latter being the kernel thread daemon, which in turn spawns all kernel threads.

The user_mode_thread() or kernel_thread() calls create a new processing context: they both call kernel_clone() which calls copy_process() with NULL as first argument, meaning it will not actually copy any process but instead create a new one. It will create a new task using dup_task_struct() passing current as argument, which is the init task and thus any new task is eventually derived from the compiled-in init task. Then there is a lot of cloning going on, and we reach copy_thread() which calls back into the architecture to initialize struct thread_info for the new task. This is a struct we will look at later, but notice one thing, and that is that when a new kernel or user mode thread is created like this (with a function such as kernel_init passed instead of just forking), the following happens:

memset(childregs, 0, sizeof(struct pt_regs));
thread->cpu_context.r4 = (unsigned long)args->fn_arg;
thread->cpu_context.r5 = (unsigned long)args->fn;
childregs->ARM_cpsr = SVC_MODE;
(...)
thread->cpu_context.pc = (unsigned long)ret_from_fork;

fn_arg will be NULL in this case but fn is kernel_init or kthreadd. And we execute in SVC_MODE, which is the supervisor mode: as the kernel. Also user mode threads are initialized as supervisor mode tasks to begin with, but each will eventually modify itself into a userspace task. Setting the CPU context to ret_from_fork will be significant, so notice this!

Neither of the functions kernel_init or kthreadd will execute at this point! We will just return. The threads are initialized but nothing is scheduled yet: we have not yet called schedule() a single time, which means nothing happens, because nothing is yet scheduled.

kernel_init is a function in the same file that, as indicated, will initialize the first userspace process. If you inspect this function you will see that it keeps executing kernel code for quite a while: it waits for kthreadd to finish initialization so the kernel is ready for action, then it will actually do some housekeeping such as freeing up the kernel initmem (functions tagged __init), and only then proceed to run_init_process(). As indicated, this will start the init process using kernel_execve(), usually /sbin/init, which will then proceed to spawn all usermode processes/tasks. kernel_execve() will check for supported binary formats and most likely call the ELF loader to process the binary and page in the file into memory from the file system etc. If this goes well, it will end with a call to the macro START_THREAD(), which in turn wraps the ARM32-specific start_thread(), which will, amongst other things, do this:

regs->ARM_cpsr = USR_MODE;
(...)
regs->ARM_pc = pc & ~1;

So the new userspace process will get pushed into userspace mode by the ELF loader, and that will also set the program counter to wherever the ELF file is set to execute. regs->ARM_cpsr will be pushed into the CPSR register when the task is scheduled, and we start the first task executing in userspace.

kthreadd on the other hand will execute a perpetual loop starting other kernel daemons as they are placed on a creation list.

But as said: neither is executing.

In order to actually start the scheduling we call schedule_preempt_disabled(), which will issue schedule() with preemption disabled: we can schedule tasks, and they will not interrupt each other (preempt) in a fine-grained manner, so the scheduling is more “blocky” at this point. However: we already have the clockevent timer running so that the operating system is now ticking, and new calls to the main scheduler callbacks scheduler_tick() and schedule() will happen from different points in future time, at least at the system tick granularity (HZ) if nothing else happens. We will explain more about this further on in the article.
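
For reference, schedule_preempt_disabled() is a thin wrapper in kernel/sched/core.c; slightly simplified, it looks like this:

/* Simplified from kernel/sched/core.c: called with preemption disabled,
 * it enables preemption just around the schedule() call and returns with
 * preemption disabled again. */
void __sched schedule_preempt_disabled(void)
{
	sched_preempt_enable_no_resched();
	schedule();
	preempt_disable();
}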

Until this point we have been running in the context of the Linux init task, which is an elusive hard-coded kernel thread with PID 0 that is defined in init/init_task.c and which I have briefly discussed in a previous article. This task does not even appear in procfs in /proc.

As we call schedule(), the kernel init task will preempt and give way to kthreadd and then to the userspace init process. However when the scheduler again schedules the init task with PID 0, we return to rest_init(), and we will call cpu_startup_entry(CPUHP_ONLINE) and that function is in kernel/sched/idle.c and looks like this:

void cpu_startup_entry(enum cpuhp_state state)
{
        arch_cpu_idle_prepare();
        cpuhp_online_idle(state);
        while (1)
                do_idle();
}

That's right: this function never returns. Nothing ever breaks out of the while(1) loop. All that do_idle() does is wait until no tasks need to be scheduled, and then call down into the cpuidle subsystem. This will make the CPU “idle”, i.e. sleep, since nothing is going on. Then the loop repeats. The kernel init task, PID 0 or “main() function”, that begins at start_kernel() and ends here, will just try to push the system down to idle, forever. So this is the eventual fate of the init task. The kernel has some documentation of the inner loop that assumes that you know this context.

Let's look closer at do_idle() in the same file, which has roughly this look (the actual code is more complex, but this is the spirit of it):

while (!need_resched()) {
    local_irq_disable();
    enter_arch_idle_code();
    /* here a considerable amount of wall-clock time can pass */
    exit_arch_idle_code();
    local_irq_enable();
}
(...)
schedule_idle();

This will spin here until something else needs to be scheduled, meaning the init task has the TIF_NEED_RESCHED bit set, and should be preempted. The call to schedule_idle() soon after exiting this loop makes sure that this rescheduling actually happens: this calls right into the scheduler to select a new task and is a variant of the more generic schedule() call which we will see later.

We will look into the details soon, but we see the basic pattern of this perpetual task: see if someone else needs to run, otherwise idle; and when someone else wants to run, stop idling and explicitly yield to whatever task was waiting.

Scheduling the first task

So we know that schedule() has been called once on the primary CPU, and we know that this will set the memory management context to the first task, set the program counter to it and execute it. This is the most brutal approach to having a process scheduled, and we will detail what happens further down.

We must however look at the bigger picture of kernel preemption to fully understand what happens here.

[Figure: scheduler model] A mental model of the scheduler: scheduler_tick() sets the flag TIF_NEED_RESCHED and a later call to schedule() will actually call out to check_and_switch_context(), which does the job of switching tasks.

Scheduler tick and TIF_NEED_RESCHED

As part of booting the kernel in start_kernel() we first initialized the scheduler with a call to sched_init() and the system tick with a call to tick_init() and then the timer drivers using time_init(). The time_init() call will go through some loops and hoops and end up initializing and registering the clocksource driver(s) for the system, such as those that can be found in drivers/clocksource.

There will sometimes be only a broadcast timer to be used by all CPUs on the system (the interrupts will then need to be broadcast to all the CPUs using inter-processor interrupts, IPIs), and sometimes more elaborate architectures have timers dedicated to each CPU, so these can be used individually by each core to plan events and drive the system tick on that specific CPU.

The most suitable timer will also be started as part of the clockevent device(s) being registered. However, its interrupt will not be able to fire until local_irq_enable() is called further down in start_kernel(). After this point the system has a running scheduling tick.

As scheduling happens separately on each CPU, scheduler timer interrupts and rescheduling calls need to be done separately on each CPU as well.

The clockevent drivers can provide a periodic tick, in which case the process will be interrupted after an appropriate number of ticks, or the driver can provide oneshot interrupts, in which case it can plan an event further on, avoiding firing interrupts while the task is running just for the sake of ticking (a shortcut known as NO_HZ).

What we know for sure is that this subsystem always has a new tick event planned for the system. It can happen in 1/HZ seconds if periodic ticks are used, or it can happen several minutes into the future if nothing happens for a while in the system.

When the clockevent eventually fires, in the form of an interrupt from the timer, it calls its own ->event_handler() which is set up by the clockevent subsystem code. When the interrupt happens it will fast-forward the system tick by repetitive calls to do_timer() followed by a call to scheduler_tick(). (We reach this point through different paths depending on whether HRTimers and other kernel features are enabled or not.)

As a result of calling scheduler_tick(), some scheduler policy code such as deadline, CFS, etc (this is explained by many others elsewhere) will decide that the current task needs to be preempted, “rescheduled” and calls resched_curr(rq) on the runqueue for the CPU, which in turn will call set_tsk_need_resched(curr) on the current task, which flags it as ready to be rescheduled.

set_tsk_need_resched() will set the flag TIF_NEED_RESCHED for the task. The flag is implemented as an arch-specific bitfield, in the ARM32 case in arch/arm/include/asm/thread_info.h and ARM32 has a bitmask version of this flag helpfully named _TIF_NEED_RESCHED that can be used by assembly snippets to check it quickly with a logical AND operation.
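
Slightly simplified, the generic helpers involved are plain bit operations on that flags word; a sketch based on include/linux/sched.h:

/* Simplified sketch of the generic helpers: the flag is just a bit in
 * thread_info->flags, set and tested with atomic bitops. */
static inline void set_tsk_need_resched(struct task_struct *tsk)
{
	set_tsk_thread_flag(tsk, TIF_NEED_RESCHED);
}

static inline int need_resched(void)
{
	return unlikely(test_thread_flag(TIF_NEED_RESCHED));
}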

This bit having been set does not in any way mean that a new process will start executing immediately. The flag semantically means “at your earliest convenience, yield to another task”. So the kernel waits until it finds an appropriate time to preempt the task, and that time is when schedule() is called.

The Task State and Stack

We mentioned the architecture-specific struct thread_info, so let's hash out where that is actually stored. It is a simpler story than it used to be, because these days the ARM32 thread_info is simply part of the task_struct. The struct task_struct is the central per-task information repository that the generic parts of the Linux kernel hold for a certain task, and paramount to keeping the task state. Here is a simplified view that gives you an idea about how much information and pointers it actually contains:

struct task_struct {
    struct thread_info thread_info;
    (...)
    unsigned int state;
    (...)
    void *stack;
    (...)
    struct mm_struct *mm;
    (...)
    pid_t pid;
    (...)
};

The struct thread_info which in our case is a member of task_struct contains all the architecture-specific aspects of the state.

The task_struct refers to thread_info, but also to a separate piece of memory void *stack called the task stack, which is where the task will store its activation records when executing code. The task stack is of size THREAD_SIZE, usually 8KB (2 * PAGE_SIZE). These days, in most systems, the task stack is mapped into the VMALLOC area.
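
The constants come from the architecture; for ARM32 the definitions amount to the following (simplified from arch/arm/include/asm/thread_info.h):

/* Simplified from arch/arm/include/asm/thread_info.h: the task stack is
 * two pages, i.e. 8KB with the usual 4KB pages. */
#define THREAD_SIZE_ORDER	1
#define THREAD_SIZE		(PAGE_SIZE << THREAD_SIZE_ORDER)
#define THREAD_START_SP		(THREAD_SIZE - 8)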

The last paragraph deserves some special mention with regards to ARM32, because things changed. Ard Biesheuvel recently first enabled THREAD_INFO_IN_TASK, which enabled thread info to be contained in the task_struct, and then enabled CONFIG_VMAP_STACK for all systems in the ARM32 kernel. This means that the VMALLOC memory area is used to map and access the task stack. This is good for security reasons: the task stack is a common target for kernel security exploits, and by moving this to the VMALLOC area, which is simply a huge area of virtual memory addresses, and surrounding it below and above with unmapped pages, we will get a page violation if the kernel tries to access memory outside the current task stack!

[Figure: task_struct] The task_struct in the Linux kernel is where the kernel keeps a nexus of all information about a certain task, i.e. a certain processing context. It contains .mm, the memory context where all the virtual memory mappings live for the task. The thread_info is inside it, and inside the thread_info is a cpu_context_save. It has a task stack of size THREAD_SIZE, which for ARM32 is typically twice the PAGE_SIZE, i.e. 8KB, surrounded by unmapped memory for protection. Again this memory is mapped in the memory context of the process. The split between task_struct and thread_info is such that task_struct is Linux-generic and thread_info is architecture-specific, and they correspond 1-to-1.

Actual Preemption

In my mind, preemption happens when the program counter is actually set to a code segment in a different process, and this will happen at different points depending on how the kernel is configured. This happens as a result of schedule() getting called, and will in essence be a call down to the architecture to switch memory management context and active task. But where and when does schedule() get called?

schedule() can be called for two reasons:

  • Voluntary preemption: such as when a kernel thread wants to give up its time slice because it knows it cannot proceed for a while (a minimal sketch of this case follows the list). This is the case for most instances of this call that you find in the kernel. In the special case when we start the kernel and call schedule_preempt_disabled() the very first time, we voluntarily preempt the kernel execution of the init task with PID 0 to instead execute whatever is queued and prioritized in the scheduler, and that will be the kthreadd process. Other places can be found by git grep:ing for calls to cond_resched() or just an explicit call to schedule().
  • Forced preemption: this happens when a task is simply scheduled out. This happens to kernel threads and userspace processes alike, when a process has used up its timeslice and scheduler_tick() has set the TIF_NEED_RESCHED flag. We described in the previous section how this flag gets set from the scheduler tick.
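
Here is the promised minimal sketch of the voluntary case, with made-up helpers (more_work(), process_one_item()) standing in for real driver logic:

/* Hypothetical kernel-thread work loop: more_work() and process_one_item()
 * are made-up names standing in for real driver logic. */
static int worker_fn(void *data)
{
	while (!kthread_should_stop()) {
		if (more_work())
			process_one_item();
		/* Voluntary preemption point: calls schedule() if
		 * TIF_NEED_RESCHED has been set on this task. */
		cond_resched();
	}
	return 0;
}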

Places where forced preemption happens:

The short answer to the question “where does forced preemption happen?” is “at the end of exception handlers”. Here are the details.

The most classical place for preemption of userspace processes is on the return path of a system call. This happens from arch/arm/kernel/entry-common.S in the assembly snippets for ret_slow_syscall() and ret_fast_syscall(), where the ARM32 kernel makes an explicit call to do_work_pending() in arch/arm/kernel/signal.c. This will issue a call to schedule() if the flag _TIF_NEED_RESCHED is set for the thread, and the kernel will hand over execution to whichever task is prioritized next, no matter whether it is a userspace or kernelspace task. A special case is ret_from_fork, which means a new userspace process has been forked; in many cases the parent gets preempted immediately in favor of the new child through this path.

The most common place for preemption is however when returning from a hardware interrupt. Interrupts on ARM32 are handled in assembly in arch/arm/kernel/entry-armv.S with a piece of assembly that saves the processor state for the current CPU into a struct pt_regs and from there just calls the generic interrupt handling code in kernel/irq/handle.c named generic_handle_arch_irq(). This code is used by other archs than ARM32 and will nominally just store the system state and registers in a struct pt_regs record on entry and restore it on exit. However when the simplistic code in generic_handle_arch_irq() is done, it exits through the same routines in arch/arm/kernel/entry-common.S as fast and slow syscalls, and we can see that in ret_to_user_from_irq the code will explicitly check for the resched and other flags with ldr r1, [tsk, #TI_FLAGS] and branch to the handler doing do_work_pending(), and consequently preempt to another task instead of returning from an interrupt.

Now study do_work_pending():

do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
{
        /*
         * The assembly code enters us with IRQs off, (...)
         */

        do {
                if (likely(thread_flags & _TIF_NEED_RESCHED)) {
                        schedule();
                } else {
                        (...)
                }
                local_irq_disable();
                thread_flags = read_thread_flags();
        } while (...);
        return 0;
}

Notice the comment: we enter do_work_pending() with local IRQs disabled, so we can't get interrupted in an interrupt (other exceptions can still happen though). Then we likely call schedule() and another thread needs to start to run. When we return after having scheduled another thread, we are supposed to proceed to exit the exception handler with interrupts disabled; that is why the first instruction after the if/else-clause is local_irq_disable() – we might have come back from a kernel thread which was happily executing with interrupts enabled. So disable them. In fact, if you grep for do_work_pending you will see that this looks the same on other architectures with a similar setup.

In reality do_work_pending() does a few more things than preemption: it also handles signals between processes and process termination etc. But for this exercise we only need to know that it calls schedule() followed by local_irq_disable().

The struct pt_regs should be understood as “processor trace registers”, which is another historical name, much due to its use in tracing. On ARM32 it is in reality 18 32-bit words representing all the registers and status bits of the CPU for a certain task, i.e. the CPU state, including the program counter pc, which is the place where the task was supposed to resume execution, unless it got preempted by schedule(). This way, if we preempt and leave a task behind, the CPU state contains all we need to know to continue where we left off. These pt_regs are stored in the task stack during the call to generic_handle_arch_irq().
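
On ARM32 the record really is just an array of words, with macros indexing into it (simplified from arch/arm/include/asm/ptrace.h):

/* Simplified from arch/arm/include/asm/ptrace.h: 18 words of CPU state,
 * r0..r15 plus the CPSR plus the original r0 of a syscall (kept around
 * for syscall restarting). */
struct pt_regs {
	unsigned long uregs[18];
};

#define ARM_pc		uregs[15]
#define ARM_cpsr	uregs[16]
#define ARM_ORIG_r0	uregs[17]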

The assembly in entry-common.S can be a bit hard to follow; here are the core essentials of a return path from an interrupt that occurs while we are executing in userspace:

	(...)
slow_work_pending:
	mov	r0, sp				@ 'regs'
	mov	r2, why				@ 'syscall'
	bl	do_work_pending
	cmp	r0, #0
	beq	no_work_pending
	(...)

ENTRY(ret_to_user_from_irq)
	ldr	r1, [tsk, #TI_FLAGS]
	movs	r1, r1, lsl #16
	bne	slow_work_pending
no_work_pending:
	asm_trace_hardirqs_on save = 0
	ct_user_enter save = 0
	restore_user_regs fast = 0, offset = 0

We see that when we return from an IRQ, we check the flags in the thread and if any bit is set we branch to execute the slow work, which is done by do_work_pending(), which will potentially call schedule(), then return, possibly much later, and if all went fine branch back to no_work_pending and restore the usermode registers and continue execution.

Notice that the exception we are returning from here can be the timer interrupt that was handled by the Linux clockevent and driving the scheduling by calling scheduler_tick()! This means we can preempt directly on the return path of the interrupt that was triggered by the timer tick. This way the slicing of task time is as precise as it can get: scheduler_tick() gets called by the timer interrupt, and if it sets TIF_NEED_RESCHED a different thread will start to execute on our way out of the exception handler!

The same path will be taken by SVC/SWI software exceptions, so these will also lead to rescheduling if necessary. The routine named restore_user_regs can be found in entry-header.S, and it will pretty much do what it says, ending with the following instructions (if we remove quirks and assume the slow path):

	mov	r2, sp
	(...)
	ldmdb	r2, {r0 - lr}^			@ get calling r0 - lr
	add	sp, sp, #\offset + PT_REGS_SIZE
	movs	pc, lr				@ return & move spsr_svc into cpsr

r2 is set to the stack pointer, where the pt_regs are stored; these are 17 registers plus the CPSR (current program status register). We pull the registers from the stack (including r2, which gets overwritten). NOTE: since pc is not in the register list, the little caret (^) after the ldmdb instruction means that the loads target the user-mode (banked) registers. We then move the stack pointer past the saved registers and return with movs pc, lr, where the s suffix also moves the saved SPSR back into the CPSR.

Using the exceptions as a point for preemption is natural: exceptions by their very nature are designed to store the processor state before jumping to the exception handler, and it is strictly defined how to store this state into memory such as onto the per-task task stack, and how to reliably restore it at the end of an exception. So this is a good point to do something else, such as switch to something completely different.

Also notice that this must happen at the end of the interrupt (exception) handler. You can probably imagine what would happen on a system with level-triggered interrupts if we were to preempt at the beginning of the interrupt instead of the end: we would not reach the hardware interrupt handler, and the interrupt would not be cleared. Instead, we handle the exception, and then when we are done we optionally check if preemption should happen right before returning to the interrupted task.

But let's not skip the last part of what schedule() does.

Setting the Program Counter

So we now know a few places where the system can preempt and on ARM32 we see that this mostly happens in the function named do_work_pending() which in turn will call schedule() for us.

The scheduler's schedule() call is supposed to very quickly select a process to run next. Eventually it will call context_switch() in kernel/sched/core.c, which in turn will do essentially two things (a condensed sketch follows the list):

  • Check if the next task has a unique memory management context (next->mm is not NULL) and in that case switch the memory management context to the next task. This means updating the MMU to use a different MMU table. Kernel threads do not have any unique memory management context, so for those we can just keep the previous context (the kernel virtual memory is mapped into all processes on ARM32, so we can just go on). If the memory management context does switch, we call switch_mm_irqs_off(), which in the ARM32 case is just defined to the ARM32-specific switch_mm(), which will call the ARM32-specific check_and_switch_context() (NOTE that for any system with an MMU this function is hidden in the arch/arm/include/asm/mmu_context.h header file), which in turn does one of two things:
    • If interrupts are disabled, we will just set mm->context.switch_pending = 1 so that the memory management context switch will happen at a later time when we are running with interrupts enabled, because it will be very costly to switch task memory context on ARM32 if interrupts are disabled on certain VIVT (virtually indexed, virtually tagged) cache types, and this in turn would cause unpredictable IRQ latencies on these systems. This concerns some ARMv6 cores. The reason why interrupts would be disabled in a schedule() call is that it will be holding a runqueue lock, which in turn disables interrupts. Just like the comment in the code says, this will be done later in the arch-specific finish_arch_post_lock_switch() which is implemented right below and gets called right after dropping the runqueue lock.
    • If interrupts are not disabled, we will immediately call cpu_switch_mm(). This is a per-cpu callback which is written in assembly for each CPU as cpu_NNNN_switch_mm() inside arch/arm/mm/proc-NNNN.S. For example, all v7 CPUs have the cpu_v7_switch_mm() in arch/arm/mm/proc-v7.S.
  • Switch context (such as the register states and stack) to the new task by calling switch_to() with the new task and the previous one as parameter. In most cases this latches to an architecture-specific __switch_to(). In the ARM32 case, this routine is written in assembly and can be found in arch/arm/kernel/entry-armv.S.
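
Here is the promised condensed sketch; the real context_switch() in kernel/sched/core.c handles more cases (lazy-TLB bookkeeping in particular), so treat this as the spirit of the code rather than the letter:

/* Heavily condensed sketch of context_switch() in kernel/sched/core.c. */
static void context_switch_sketch(struct task_struct *prev,
				  struct task_struct *next)
{
	if (!next->mm) {
		/* Kernel thread: no memory context of its own, keep
		 * running on the previous task's mm. */
		next->active_mm = prev->active_mm;
	} else {
		/* On ARM32 this ends up in check_and_switch_context(),
		 * pointing the MMU at the next task's tables. */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);
	}

	/* Register state and stack: latches to the arch-specific
	 * __switch_to() described below. */
	switch_to(prev, next, prev);
}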

Now the final details happen in __switch_to(), which is supplied the struct thread_info (i.e. the architecture-specific state) for both the current and the previous task:

  • We store the registers of the current task in the task stack, at the TI_CPU_SAVE index of struct thread_info, which corresponds to the .cpu_context entry in the struct, which is in turn a struct cpu_context_save, which is 12 32-bit values to store r4-r9, sl, fp, sp and pc (its layout is sketched after this list). This is everything needed to continue as if nothing has happened when we “return” after the schedule() call. I put “return” in quotation marks, because a plethora of other tasks may have run before we actually get back there. You may ask why r0, r1, r2 and r3 are not stored. This will be addressed shortly.
  • Then the TLS (Thread Local Storage) settings for the new task are obtained and we issue switch_tls(). On v6 CPUs this has special implications, but in most cases we end up using switch_tls_software() which sets TLS to 0xffff0ff0 for the task. This is a hard-coded value in virtual memory used by the kernel-provided user helpers, which in turn are a few kernel routines “similar to but different from VDSO” that are utilized by the userspace C library. On ARMv7 CPUs that support the thread ID register (TPIDRURO) this will be used to store the struct thread_info pointer, so it cannot be used for TLS on ARMv7. (More on this later.)
  • We then broadcast THREAD_NOTIFY_SWITCH using kernel notifiers. These are usually written in C but called from the assembly snippet __switch_to() here. A notable use case is that if the task is making use of VFP (the Vectored Floating Point unit), then the state of the VFP gets saved here, so that it will be cleanly restored when the task resumes as well.
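
Here is the layout promised above (simplified from arch/arm/include/asm/thread_info.h):

/* Simplified from arch/arm/include/asm/thread_info.h: the 12 words that
 * __switch_to() stores at the TI_CPU_SAVE offset. */
struct cpu_context_save {
	__u32	r4, r5, r6, r7, r8, r9;
	__u32	sl;		/* r10 */
	__u32	fp;		/* r11 */
	__u32	sp;
	__u32	pc;
	__u32	extra[2];	/* Xscale 'acc' register, etc. */
};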

Then we reach the final step in __switch_to(), which is a bit different depending on whether we use CONFIG_VMAP_STACK or not.

The simple path, when we are not using VMAP'ed stacks, looks like this:

	set_current r7, r8
	ldmia	r4, {r4 - sl, fp, sp, pc}	@ Load all regs saved previously

Here r7 contains a pointer to the next task's thread_info (which will be somewhere in the kernel virtual memory map), and set_current() will store the pointer to that task in such a way that the CPU can look it up with a few instructions at any point in time. On older non-SMP ARMv4 and ARMv5 CPUs this will simply be the memory location pointed out by the label __current, but ARMv7 and SMP systems have a dedicated special CP15 TPIDRURO thread ID register to store this in the CPU, so that the thread_info can be located very quickly. (The only user of this information is, no surprise, the get_current() assembly snippet, but that is in turn called from a lot of places and contexts.)

The next ldmia instruction does the real trick: it loads registers r4 thru sl (r10), fp (r11), sp (r13) and pc (r15) from the location pointed out by r4, which again is the .cpu_context entry in struct thread_info, i.e. the struct cpu_context_save, which is all the context there is, including pc. The next instruction executed will therefore be whatever pc was inside the struct cpu_context_save. We have switched to the new task and preemption is complete.

But wait a minute. r4 and up, you say? So what about r0, r1, r2, r3, r12 (ip) and r14 (lr)? Isn't the task we're switching to going to miss those registers?

For r0-r3 the short answer is that when we call schedule() explicitly (which only happens inside the kernel), r0 thru r3 are scratch registers that are free to be “clobbered” during any function call. Since we call schedule(), the caller has to be prepared for those registers to be clobbered anyway. The same goes for the status register CPSR: this is a function call (ending up in assembly), not an exception.
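
A small userspace illustration of the same AAPCS rule (generic compiler behavior, not kernel code):

/* Per the AAPCS, r0-r3 and r12 (ip) are caller-saved: any call may
 * clobber them, so a callee such as schedule() need not preserve
 * them. r4-r11 are callee-saved. */
extern int work(int a);

int caller(int x)
{
	/* x typically arrives in r0, but r0 is dead across the call:
	 * to reuse x afterwards the compiler must park it in a
	 * callee-saved register (r4-r11) or spill it to the stack. */
	int y = work(x);

	return y + x;
}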

And even if we look around the context after a call to schedule(), since we were either (A) starting a brand new task or (B) on our way out of an exception handler for a software or hardware interrupt or (C) explicitly called schedule() when this happened, this just doesn't matter.

Then, r12 (ip) is a scratch register, and we are not calling down the stack using lr at this point either (we just jump to pc!), so these two do not need to be saved or restored. (On the ARM or VMAP exit path you will find ip and lr being used.)

When starting a completely new task, all the contents of struct cpu_context_save will be zero, the return address will be set to ret_from_fork, and the new task will bootstrap itself in userspace or as a kernel thread from there.

If we're on the exit path of an exception handler, we call various C functions and r0 thru r3 are used as scratch registers, meaning that their content doesn't matter. At the end of the exception (which we are close to when we call schedule()) all registers and the CPSR will be restored from the pt_regs record on the kernel exception stack before the exception returns anyway, which is another good reason to use exception handlers as preemption points.

This is why r0 thru r3 are missing from struct cpu_context_save and need not be preserved.

When the scheduler later on decides to schedule in the task that was interrupted again, we will return to execution right after the schedule(); call. If we were on our way out of an exception in do_work_pending() we will proceed to return from the exception handler, and to the process it will “feel” like it just returned from a hardware or software interrupt, and execution will go on from that point like nothing happened.

Running init

So how does /sbin/init actually come to execute?

We saw that after start_kernel we get to rest_init which creates the thread with pid = user_mode_thread(kernel_init, NULL, CLONE_FS).

Then kernel_init calls kernel_execve() to execute /sbin/init. It locates an ELF parser to read and page in the file. Then it will eventually issue start_thread(), which will set regs->ARM_cpsr = USR_MODE and regs->ARM_pc to the start of the executable.
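
Conceptually, start_thread() boils down to something like this (a simplified sketch of the macro in arch/arm/include/asm/processor.h; Thumb entry and similar details elided):

/* Simplified sketch of start_thread() from
 * arch/arm/include/asm/processor.h. */
#define start_thread(regs, pc, sp)				\
({								\
	memset(regs->uregs, 0, sizeof(regs->uregs));		\
	regs->ARM_cpsr = USR_MODE;	/* drop to userspace */	\
	regs->ARM_pc = pc;		/* ELF entry point */	\
	regs->ARM_sp = sp;		/* new user stack */	\
})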

Then this task's task_struct, including memory context and so on, will be selected after a call to schedule().

But every call to schedule() will return to the point right after a schedule() call, and the only place a userspace task is ever preempted to get schedule() called on it is in the exception handlers, such as when a timer interrupt occurs. Well, this is where we “cheat”:

When we initialized the process in arch/arm/kernel/process.c, we set the program counter to ret_from_fork so we are not going back after any schedule() call: we are going back to ret_from_fork! And this is just an exception return path, so this will restore regs->ARM_cpsr to USR_MODE, and “return from an exception” into whatever is in regs->ARM_pc, which is the start of the binary program from the ELF file!
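
In code, the relevant lines of copy_thread() in arch/arm/kernel/process.c look roughly like this (heavily trimmed):

/* Heavily trimmed from copy_thread() in arch/arm/kernel/process.c:
 * a new task's saved context points at ret_from_fork, so its first
 * "return" from schedule() is really an exception return. */
struct thread_info *thread = task_thread_info(p);
struct pt_regs *childregs = task_pt_regs(p);

thread->cpu_context.pc = (unsigned long)ret_from_fork;
thread->cpu_context.sp = (unsigned long)childregs;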

So /sbin/init is executed as a consequence of returning from a fake exception through ret_from_fork. From that point on, only real exceptions, such as getting interrupted by an IRQ, will happen to the process.

This is how ARM32 schedules and executes processes.

 
Read more...

from Christian Brauner

The original blogpost is at https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html

Early on when the LXD project was started we were clear that we wanted to make it possible to change settings while the container is running. One of the very first things that came to our mind was making it possible to insert new mounts into a running container. When I was still at Canonical working on LXD we quickly realized that inserting mounts into a running container would require a lot of creativity given the limitations of the api.

Back then the only way to create mounts or change mount options was by using the mount(2) system call. The mount system call multiplexes a lot of different operations. For example, it doesn't just allow the creation of new filesystem mounts but also handles bind mounts and mount option changes. Mounting is overall a pretty complex operation as it doesn't just involve path lookup but also needs to handle mount propagation and filesystem-specific and generic mount options.

I want to take a look at our legacy solution to this problem and at a newer approach that I've been using, one that has existed for a while but has never been talked about widely.

Creative uses of mount(2)

Before openat2(2) came along, adding mounts to a container during startup was difficult because there was always the danger of symlink attacks. A mount source or target path could be specified containing symlinks that would allow processes in the container to escape to the host filesystem. These attacks used to be quite common and there was no straightforward solution available; at least not before the RESOLVE_* flag namespace of openat2(2) improved things so considerably that symlink attacks on new kernels can be effectively blocked.
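
For example, with openat2(2) an untrusted path can be resolved strictly below the container's rootfs. A minimal sketch (rootfs_fd is an assumed file descriptor to the container's root directory; glibc provides no wrapper, so we go through syscall(2)):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/openat2.h>	/* struct open_how, RESOLVE_* */
#include <sys/syscall.h>
#include <unistd.h>

/* Resolve an untrusted path as if rootfs_fd were the root: ".." and
 * symlinks cannot escape it, and magic links (/proc/...) are refused. */
static int open_in_rootfs(int rootfs_fd, const char *path)
{
	struct open_how how = {
		.flags = O_PATH | O_CLOEXEC,
		.resolve = RESOLVE_IN_ROOT | RESOLVE_NO_MAGICLINKS,
	};

	return syscall(SYS_openat2, rootfs_fd, path, &how, sizeof(how));
}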

But before openat2(), symlink attacks when mounting could only be prevented with very careful coding and a rather elaborate algorithm. I won't go into too much detail, but roughly it is done by verifying each path component in userspace using O_PATH file descriptors, making sure that the paths point into the container's rootfs.
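
A stripped-down sketch of that idea (the real algorithm is considerably more elaborate; walk_below() is a hypothetical helper and error handling is minimal):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: resolve path one component at a time below
 * rootfs_fd, refusing symlinks and "..", so the resulting fd cannot
 * point outside the container's rootfs. Modifies path in place. */
static int walk_below(int rootfs_fd, char *path)
{
	int fd = dup(rootfs_fd);
	struct stat st;
	char *comp;

	for (comp = strtok(path, "/"); comp && fd >= 0;
	     comp = strtok(NULL, "/")) {
		int next;

		if (!strcmp(comp, "..")) {	/* no escaping upwards */
			close(fd);
			return -1;
		}
		/* O_PATH|O_NOFOLLOW opens a symlink itself instead of
		 * following it, so it can be detected and refused: */
		next = openat(fd, comp, O_PATH | O_NOFOLLOW | O_CLOEXEC);
		close(fd);
		fd = next;
		if (fd >= 0 && !fstat(fd, &st) && S_ISLNK(st.st_mode)) {
			close(fd);
			return -1;
		}
	}
	return fd;
}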

But even if you verified that the path is sane and you hold a file descriptor to the last component you still need to solve the problem that mount(2) only operates on paths. So you are still susceptible to symlink attacks as soon as you call mount(source, target, ...).

The way we solved this problem was by realizing that mount(2) was perfectly happy to operate on /proc/self/fd/<nr> paths (This is similar to how fexecve() used to work before the addition of the execveat() system call.). So we could verify the whole path and then open the last component of the source and target paths at which point we could call mount("/proc/self/fd/1234", "/proc/self/fd/5678", ...).

We immediately thought that if mount(2) allows you to do that, then we could easily use this to mount into namespaces. So if the container is running in its mount namespace we could just create a bind mount on the host, open the newly created bind mount, then change to the container's mount namespace (and its owning user namespace) and simply call mount("/proc/self/fd/1234", "/mnt", ...). In pseudo C code it would look roughly like this:

fd_mnt = openat(-EBADF, "/opt", O_PATH, ...);
setns(fd_userns, CLONE_NEWUSER);
setns(fd_mntns, CLONE_NEWNS);
snprintf(path, sizeof(path), "/proc/self/fd/%d", fd_mnt);
mount(path, "/mnt", ...);

However, this isn't possible as the kernel will enforce that the mounts the source and target paths refer to are located in the caller's mount namespace. Since the caller will be located in the container's mount namespace after the setns() call, but the source file descriptor refers to a mount located in the host's mount namespace, this check fails. The semantics behind this are somewhat sane and straightforward to understand, so there was no need to change them even though we were tempted. Back then it would've also meant that adding mounts to containers would've only worked on newer kernels, and we were quite eager to enable this feature for kernels that were already released.

Mount namespace tunnels

So we came up with the idea of mount namespace tunnels. Since we spearheaded this idea it has been picked up by various projects such as systemd, for system services and for its own systemd-nspawn container runtime.

The general idea is based on the observation that mount propagation can be made to function like a tunnel between mount namespaces:

mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt
# Create new mount namespace with all mounts turned into dependent mounts.
unshare --mount --propagation=slave

and then create a mount on or beneath the shared /opt mount on the host:

mkdir /opt/a
mount --bind /tmp /opt/a

then the new mount of /tmp on the dentry /opt/a will propagate into the mount namespace we created earlier. Since the /opt mount at the /opt dentry in the new mount namespace is a dependent mount we can now move the mount to its final location:

mount --move /opt/a /mnt

As a last step we can unmount /opt/a in the host mount namespace. And as long as the /mnt dentry doesn't reside on a mount that is a dependent mount of /opt's peer group the unmount of /opt/a we just performed on the host will only unmount the mount in the host mount namespace.

There are various problems with this solution:

  • It's complex.
  • The container manager needs to set up the mount tunnel when the container starts. In other words, it needs to be part of the architecture of the container, which is always unfortunate.
  • The mount at the endpoint of the tunnel in the container needs to be protected from being unmounted. Otherwise the container payload can just unmount the mount at its end of the mount tunnel and prevent the insertion of new mounts into the container.

Mounting into mount namespaces

A few years ago a new mount api made it into the kernel. Shortly after, I also added the mount_setattr(2) system call. Since then I've been expanding the abilities of this api and putting it to full use.

Unfortunately the adoption of the new mount api has been slow, mostly because people don't know about it or because they don't yet see the many advantages it offers over the old one. But with the next release of the mount(8) binary that a lot of us use, the new mount api will be used whenever possible.

I won't be covering all the features that the mount api offers. This post just illustrates how the new mount api makes it possible to mount into mount namespaces and lets us get rid of the complex mount propagation scheme.

Luckily, the new mount api is designed around file descriptors.

Filesystem Mounts

To create a new filesystem mount using the old mount api is simple:

mount("/dev/sda", "/mnt", "xfs", ...);

We pass the source, target, and filesystem type and potentially additional mount options. This single system call does a lot behind the scenes. A new superblock will be allocated for the filesystem, mount options will be set, a new mount will be created and attached to a mountpoint in the caller's mount namespace.

In the new mount api the various steps are split into separate system calls. While this makes mounting more complex, it allows for greater flexibility. Mounting doesn't have to be a fast operation and never has been.

So in the new mount api we would create a new filesystem mount with the following steps:

/* Create a new filesystem context. */
fd_fs = fsopen("xfs");

/*
 * Set the source of the filesystem mount. Whether or not this is required
 * depends on the type of filesystem of course. For example, mounting a tmpfs
 * filesystem would not require us to set the "source" property as it's not
 * backed by a block device. 
 */
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda", 0);

/* Actually create the superblock and prepare to allocate a mount. */
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

The fd_fs file descriptor refers to a VFS filesystem context object that doesn't concern us here. Suffice it to say that it is an opaque object that can only be used to configure the superblock and the filesystem until fsmount() is called:

/* Create a new detached mount and return an O_PATH file descriptor referring to the mount. */
fd_mnt = fsmount(fd_fs, 0, 0);

The fsmount() call will turn the context file descriptor into an O_PATH file descriptor that refers to a detached mount. A detached mount is a mount that isn't attached to any mount namespace.

Bind Mounts

The old mount api created bind mounts via:

mount("/opt", "/mnt", MNT_BIND, ...)

and recursive bind mounts via:

mount("/opt", "/mnt", MNT_BIND | MS_REC, ...)

Most people however will be more familiar with mount(8):

mount --bind /opt /mnt
mount --rbind / /mnt

Bind mounts play a major role in container runtimes and system services as run by systemd.

The new mount api supports bind mounts through the open_tree() system call. Calling open_tree() on an existing mount will just return an O_PATH file descriptor referring to that mount. But if OPEN_TREE_CLONE is specified open_tree() will create a detached mount and return an O_PATH file descriptor. That file descriptor is indistinguishable from an O_PATH file descriptor returned from the earlier fsmount() example:

fd_mnt = open_tree(-EBADF, "/opt", OPEN_TREE_CLONE, ...)

creates a new detached mount of /opt and:

fd_mnt = open_tree(-EBADF, "/", OPEN_TREE_CLONE | AT_RECURSIVE, ...)

would create a new detached copy of the whole rootfs mount tree.

Attaching detached mounts

As mentioned before, the file descriptor returned from fsmount() and open_tree(OPEN_TREE_CLONE) refers to a detached mount in both cases. The mount it refers to doesn't appear anywhere in the filesystem hierarchy. Consequently, the mount can't be found by lookup operations going through the filesystem hierarchy. The new mount api thus provides an elegant mechanism for what previously required a sequence like:

mount("/opt", "/mnt", MS_BIND, ...);
fd_mnt = openat(-EBADF, "/mnt", O_PATH | O_DIRECTORY | O_CLOEXEC, ...);
umount2("/mnt", MNT_DETACH);

and with the added benefit that the mount never actually had to appear anywhere in the filesystem hierarchy and thus never had to belong to any mount namespace. This alone is already a very powerful tool but we won't go into depth today.

Most of the time a detached mount isn't what we want, however. Usually we want to make the mount visible in the filesystem hierarchy so other users or programs can access it. So we need to attach it to the filesystem hierarchy.

In order to attach a mount we can use the move_mount() system call. For example, to attach the detached mount fd_mnt we created before, we can use:

move_mount(fd_mnt, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

This will attach the detached mount of /opt at the /mnt dentry on the / mount. What this means is that the /opt mount will be inserted into the mount namespace that the caller is located in at the time of calling move_mount(). (The kernel has very tight semantics here. For example, it will enforce that the caller has CAP_SYS_ADMIN in the owning user namespace of its mount namespace. It will also enforce that the mount the /mnt dentry is located on belongs to the same mount namespace as the caller.)

After move_mount() returns the mount is permanently attached. Even if it is unmounted while still pinned by a file descriptor, it will still belong to the mount namespace it was attached to. In other words, move_mount() is an irreversible operation.

The main point is that before move_mount() is called a detached mount doesn't belong to any mount namespace and can thus be freely moved around.

Mounting a new filesystem into a mount namespace

To mount a filesystem into a mount namespace we can make use of the split between configuring a filesystem context and creating a new superblock on the one hand, and actually attaching the mount to the filesystem hierarchy on the other:

fd_fs = fsopen("xfs");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda", 0);
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
fd_mnt = fsmount(fd_fs, 0, 0);

For filesystems that require host privileges, such as xfs, ext4, or btrfs (and many others), these steps can be performed by a privileged container or pod manager. However, once we have created a detached mount we are free to attach it to whatever mount and mountpoint we have privilege over in the target mount namespace. So we can simply attach to the user namespace and mount namespace of the container:

setns(fd_userns, CLONE_NEWUSER);
setns(fd_mntns, CLONE_NEWNS);

and then use

move_mount(fd_mnt, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

to attach the detached mount anywhere we like in the container.

Mounting a new bind mount into a mount namespace

A bind mount is even simpler. If we want to share a specific host directory with the container we can just have the container manager call:

fd_mnt = open_tree(-EBADF, "/opt", OPEN_TREE_CLOEXEC | OPEN_TREE_CLONE);

to allocate a new detached copy of the mount and then attach to the user and mount namespace of the container:

setns(fd_userns, CLONE_NEWUSER);
setns(fd_mntns, CLONE_NEWNS);

and as above we are free to attach the detached mount anywhere we like in the container.
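
Putting the pieces together, a minimal sketch of the whole bind-mount flow (assuming we already hold fd_userns and fd_mntns, for example opened from the container init's /proc/<pid>/ns/user and /proc/<pid>/ns/mnt, and that glibc >= 2.36 provides the open_tree() and move_mount() wrappers):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <sys/mount.h>	/* open_tree(), move_mount() with glibc >= 2.36 */

/* Share the host's /opt with a container by attaching a detached
 * copy of the mount at /mnt inside the container's namespaces. */
static int share_host_opt(int fd_userns, int fd_mntns)
{
	int fd_mnt;

	/* New detached copy of the /opt mount; belongs to no mount namespace. */
	fd_mnt = open_tree(-EBADF, "/opt", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (fd_mnt < 0)
		return -1;

	/* Enter the container's user and mount namespaces. */
	if (setns(fd_userns, CLONE_NEWUSER) || setns(fd_mntns, CLONE_NEWNS))
		return -1;

	/* Attach the detached mount at the /mnt dentry inside the container. */
	if (move_mount(fd_mnt, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH))
		return -1;

	return 0;
}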

Conclusion

This is really it, and it is as simple as it sounds. It is a powerful delegation mechanism making it possible to inject mounts into less privileged mount namespaces or unprivileged containers. We've been making heavy use of this in LXD and it is in general the proper way to insert mounts into mount namespaces on newer kernels.

 
Read more...