people.kernel.org


from paulmck

There is a proposal making its way through the C++ Standards Committee to defang and deprecate memory_order::consume, and a similar proposal is likely to make its way through the C Standards Committee. This is somewhat annoying from a Linux-kernel-RCU perspective, because there was some reason to hope for language-level support for the address dependencies headed by calls to rcu_dereference().

So what is a Linux-kernel community to do?

The first thing is to review the current approach to rcu_dereference() address dependencies, which uses a combination of standard language features and code standards.

Standard Language Features

Actual Implementations of memory_order::consume

All known implementations of memory_order::consume simply promote it to memory_order::acquire. This is correct from a functional perspective, but leaves performance on the table for PowerPC and for some hardware implementations of ARM. Of less concern for the Linux kernel, on many GPGPUs, memory_order::acquire has significant overhead.

There are those who claim that in the future, accesses using memory_order::acquire will be no more expensive than those using memory_order::consume. And who knows, maybe they are correct. But in the meantime, we must deal with the hardware that we actually have.

Your Friend in Need: volatile

Some might argue that the volatile keyword is underspecified by the C and C++ standards, but a huge body of device-driver code does constrain the compilers, albeit sometimes rather controversially. And rcu_dereference() uses a volatile load to fetch the pointer, which prevents the compiler from spoiling the fun by (say) reloading from the same pointer, whose value might well have changed in the meantime.
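
To make this concrete, here is a minimal sketch of the idea (the name my_rcu_dereference() is made up for illustration; the kernel's real rcu_dereference() additionally layers on lockdep and sparse checking):

#define my_rcu_dereference(p)                                   \
({                                                              \
        typeof(p) ___p = READ_ONCE(p);  /* volatile load */     \
        ___p;                                                   \
})

/* Because READ_ONCE() is a volatile access, the compiler must use the
 * value it actually fetched: it can neither reload the pointer later
 * nor substitute a "known" value, so the address dependency from the
 * load to later uses of ___p is preserved. */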

A Happy Consequence of Data-Race Prohibition

The C language forbids implementations (in our case, compilers) from creating data races, that is, situations where the object code has concurrent C-language accesses to a given variable, at least one of which is a store.

So how does this help?

For one thing, it constrains compiler-based speculation. To see this, suppose that one thread does this:

p = kmalloc(sizeof(*p), GFP_KERNEL);
p->a = 42;
rcu_assign_pointer(gp, p);

And another thread does this:

p = rcu_dereference(gp);
do_something_with(p->a);

At the source-code level, there is clearly no data race. But suppose the compiler uses profile-directed optimization, and learns that the value returned by rcu_dereference() is almost always 0x12345678. Such a compiler might be tempted to emit code to cause the hardware to concurrently execute the rcu_dereference() while also loading 0x12345678->a. If the rcu_dereference() returned the expected value of 0x12345678, the compiler could use the value loaded from 0x12345678->a; otherwise, it could load p->a.

The problem is that the two threads might execute concurrently as follows:

p = kmalloc(sizeof(*p), GFP_KERNEL);

// These two concurrently:
r1 = 0x12345678->a;
p->a = 0xdead1eafbadfab1e;

rcu_assign_pointer(gp, p);
p = rcu_dereference(gp); // Returns 0x12345678!!!

Because the value of p is 0x12345678, it appears that the speculation has succeeded. But the second thread's load into r1 ran concurrently with the first thread's store into p->a, which might result in user-visible torn loads and stores, or just plain pre-initialization garbage.

This sort of software speculation is therefore forbidden.

Yes, hardware can get away with this sort of thing because it tracks cache state. If a compiler wishes to generate code that executes speculatively, it must use something like hardware transactional memory, which typically has overhead that overwhelms any possible benefit.

Code Standards

The Documentation/RCU/rcu_dereference.rst file presents the Linux-kernel's code standards for the address dependencies headed by members of the rcu_dereference() API family. A summary of the most widely applicable of these standards is as follows:

  1. An address dependency must be headed by an appropriate member of the rcu_dereference() API family. The variables holding the return value from a member of this API family are said to be carrying a dependency.
  2. In the special case where data is added and never removed, READ_ONCE() can be substituted for one of the rcu_dereference() APIs.
  3. Address dependencies are carried by pointers only, and specifically not by integers. (With the exception that integer operations may be used to set, clear, and XOR bits in the pointers, which requires those pointers to be translated to integers, have their bits manipulated, and then translated immediately back to pointers.)
  4. Operations that cancel out all the bits in the original pointer break the address dependency.
  5. Comparing a dependency-carrying pointer to the address of a statically allocated variable can break the dependency chain. (Though there are special rules that allow such comparisons to be carried out safely in some equally special cases.)
  6. Special operations on hardware instruction caches may be required when using pointers to JITed functions.

The Documentation/RCU/rcu_dereference.rst file provides much more detail, which means that scanning the above list is not a substitute for reading the full file.
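
To make rules 3 and 4 above concrete, here is a hedged illustration (gp, the tag mask, and global_foo are invented for this example):

struct foo *p, *q;

p = rcu_dereference(gp);

/* Permitted by rule 3: integer operations that preserve the pointer's bits. */
q = (struct foo *)((uintptr_t)p & ~0x3UL);  /* clear low-order tag bits */
do_something_with(q->a);                    /* still ordered after the load of gp */

/* Broken by rule 4: XORing p with itself cancels out all of its bits, so
 * even though q ends up pointing at a valid structure, the address
 * dependency is gone and the access below is no longer ordered. */
q = (struct foo *)(((uintptr_t)p ^ (uintptr_t)p) + (uintptr_t)&global_foo);
do_something_with(q->a);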

Enforcing Code Standards

My fond hope in the past was that compilers would have facilities that disable the optimizations requiring the code standards, but that effort seems likely to require greater life expectancy than I can bring to bear. That said, I definitely encourage others to try their hands at this.

But in the meantime, we need a way to enforce these code standards.

One approach is obviously code review, but it would be good to have automated help for this.

And Paul Heidekreuger presented a prototype tool at the 2022 Linux Plumbers Conference. This tool located several violations of the rule against comparing dependency-carrying pointers against the addresses of statically allocated variables.

Which suggests that continued work on such tooling could be quite productive.

Summary

So memory_order::consume is likely to go away, as is its counterpart in the C standard. This is not an immediate problem because all known implementations simply map memory_order::consume to memory_order::acquire, with those who care using other means to head address dependencies. (In the case of the Linux kernel, volatile loads.)

However, this does leave those who care with the issue of checking code using things like rcu_dereference(), given that the language standards are unlikely to provide any help any time soon.

Continued work on tooling that checks the handling of dependency chains in the object code therefore seems like an eminently reasonable step forward.

 

from Jakub Kicinski

Developments in Linux kernel networking accomplished by many excellent developers and as remembered by Andrew L, Eric D, Jakub K and Paolo A.

Another busy year has passed so let us punctuate the never ending stream of development with a retrospective of our accomplishments over the last 12 months. The previous, 2023 retrospective has covered changes from Linux v6.3 to v6.8, for 2024 we will cover Linux v6.9 to v6.13, one fewer, as Linux releases don’t align with calendar years. We will focus on the work happening directly on the netdev mailing list, having neither space nor expertise to do justice to developments within sub-subsystems like WiFi, Bluetooth, BPF etc.

Core

After months of work and many patch revisions we have finally merged support for Device Memory TCP, which allows TCP payloads to be placed directly in accelerator (GPU, TPU, etc.) or user space memory while still using the kernel stack for all protocol (header) processing (v6.12). The immediate motivation for this work is obviously the GenAI boom, but some of the components built to enable Device Memory TCP, for example queue control API (v6.10), should be more broadly applicable.

The second notable area of development was busy polling. Additions to the epoll API allow enabling and configuring network busy polling on a per-epoll-instance basis, making the feature far easier to deploy in a single application (v6.9). Even more significant was the addition of a NAPI suspension mechanism which allows for efficient and automatic switching between busy polling and IRQ-driven operation, as most real-life applications are not constantly running under the highest load (v6.12). Once again the work was preceded by paying off technical debt: it is now possible to configure individual NAPI instances rather than an entire network interface (v6.13).
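
As a concrete illustration, here is a hedged user-space sketch of the per-epoll-instance knob, assuming the epoll_params structure and EPIOCSPARAMS ioctl added by this work (check include/uapi/linux/eventpoll.h on your kernel for the exact definitions; the numeric values are arbitrary):

#include <sys/epoll.h>
#include <sys/ioctl.h>

static int enable_busy_poll(int epfd)
{
        struct epoll_params params = {
                .busy_poll_usecs  = 64,  /* spin up to 64 us before sleeping */
                .busy_poll_budget = 16,  /* packets to process per poll attempt */
                .prefer_busy_poll = 1,
        };

        /* Applies only to this epoll instance, not the whole interface. */
        return ioctl(epfd, EPIOCSPARAMS, &params);
}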

Work on relieving the rtnl_lock pressure has continued throughout the year. The rtnl_lock is often mentioned as one of the biggest global locks in the kernel, as it protects all of the network configuration and state. The efforts can be divided into two broad categories – converting read operations to rely on RCU protection or other fine grained locking (v6.9, v6.10), and splitting the lock into per-network namespace locks (preparations for which started in v6.13).

Following discussions during last year’s LPC, the Real Time developers have contributed changes which make network processing more RT-friendly by allowing all packet processing to be executed in dedicated threads, instead of the softirq thread (v6.10). They also replaced implicit Bottom Half protection (the fact that code in BH context can’t be preempted, or migrated between CPUs) with explicit local locks (v6.11).

The routing stack has seen a number of small additions for ECMP forwarding, which underpins all modern datacenter network fabrics. ECMP routing can now maintain per-path statistics to allow detecting unbalanced use of paths (v6.9), and to reseed the hashing key to remediate the poor traffic distribution (v6.11). The weights used in ECMP’s consistent hashing have been widened from 8 bits to 16 bits (v6.12).

The ability to schedule sending packets at a particular time in the future has been extended to survive network namespace traversal (v6.9), and now supports using the TAI clock as a reference (v6.11). We also gained the ability to explicitly supply the timestamp ID via a cmsg during a sendmsg call (v6.13).

The number of “drop reasons”, which help to easily identify and trace packet loss in the stack, is steadily increasing. Reason codes are now also provided when TCP RST packets are generated (v6.10).

Protocols

The protocol development wasn’t particularly active in 2024. As we close off the year, 3 large protocol patch sets are being actively reviewed, but let us not steal 2025’s thunder, and limit ourselves to changes present in Linus’s tree by the end of 2024.

The AF_UNIX socket family has a new garbage collection algorithm (v6.10). Since AF_UNIX supports file descriptor passing, sockets can hold references to each other, forming reference cycles etc. The old home-grown algorithm, which was a constant source of bugs, has been replaced by one with more theoretical backing (Tarjan’s algorithm).

TCP SYN cookie generation and validation can now be performed from the TC subsystem hooks, enabling scaling out SYN flood handling across multiple machines (v6.9). User space can peek into data queued to a TCP socket at a specified offset (v6.10). It is also now possible to set min_rto for all new sockets using a sysctl, a patch which was reportedly maintained downstream by multiple hyperscalers for years (v6.11).

UDP segmentation now works even if the underlying device doesn’t support checksum offload, e.g. TUN/TAP (v6.11). A new hash table was added for connected UDP sockets (4-tuple based), significantly speeding-up connected socket lookup (v6.13).

MPTCP gained TCP_NOTSENT_LOWAT support (v6.9), and automatic tracking of destinations which blackhole MPTCP traffic (v6.12).

IPsec stack now adheres to RFC 4301 when it comes to forwarding ICMP Error messages (v6.9).

Bonding driver supports independent control state machine in addition to the traditional coupled one, per IEEE 802.1AX-2008 5.4.15 (v6.9).

The GTP protocol gained IPv6 support (v6.10).

The High-availability Seamless Redundancy (HSR) protocol implementation gained the ability to work as a proxy node connecting non-HSR-capable nodes to an HSR network (RedBox mode) (v6.11).

The netconsole driver can attach arbitrary metadata to the log messages (v6.9).

The work on making Netlink easier to interface with in modern languages continued. The Netlink protocol descriptions in YAML can now express Netlink “polymorphism” (v6.9), i.e. a situation where parsing of one attribute depends on the value of another attribute (e.g. link type determines how link attributes are parsed). 7 new specs have been added, as well as a lot of small spec and code generation improvements. Sadly we still only have bindings/codegen for C, C++ and Python.

Device APIs

The biggest addition to the device-facing APIs in 2024 was the HW traffic shaping interface (v6.13). Over the years we have accumulated a plethora of single-vendor, single-use case rate control APIs. The new API promises to express most use cases, ultimately unifying the configuration from the user perspective. The immediate use for the new API is rate limiting traffic from a group of Tx queues. Somewhat related to this work was the revamp of the RSS context API which allows directing Rx traffic to a group of queues (v6.11, v6.12, v6.13). Together the HW rate limiting and RSS context APIs will hopefully allow container networking to leverage HW capabilities, without the need for complex full offloads.

A new API for reporting device statistics has been created (qstat) within the netdev netlink family (v6.9). It allows reporting more detailed driver-level stats than old interfaces, and breaking down the stats by Rx/Tx queue.

Packet processing in the presence of TC classifier offloads has been sped up: the software processing is now fully skipped if all rules are installed in HW-only mode (v6.10).

Ethtool gained support for flashing firmware to SFP modules, and configuring thresholds used by automatic IRQ moderation (v6.11). The most significant change to ethtool APIs in 2024 was, however, the ability to interact with multiple PHYs for a single network interface (v6.12).

Work continues on adding configuration interfaces for supplying power over network wiring. Ethtool APIs have been extended with Power over Ethernet (PoE) support (v6.10). The APIs have been extended to allow reporting more information about the devices and failure reasons, as well as setting power limits (v6.11).

Configuration of Energy Efficient Ethernet is being reworked because the old API did not have enough bits to cover new link modes (2.5GE, 5GE), but we also used this as an opportunity to share more code between drivers (especially those using phylib), and encourage more uniform behavior (v6.9).

Testing

2024 was the year of improving our testing. We spent the previous winter break building out an automated testing system, and have been running the full suite of networking selftests on all code merged since January. The pre-merge tests are catching roughly one bug a day.

We added a handful of simple libraries and infrastructure for writing tests in Python, crucially allowing easy use of Netlink YAML bindings, and supporting tests for NIC drivers (v6.10).

Later in the year we added native integration of packetdrill tests into kselftest, and started importing batches of tests from the packetdrill library (v6.12).

Community and process

The maintainers, developers and community members have met at two conferences, the netdev track at Linux Plumbers and netconf in Vienna, and the netdev.conf 0x18 conference in Santa Clara.

We have removed the historic requirement for special formatting of multi-line comments in netdev (although it is still the preferred style), documented our guidance on the use of automatic resource cleanup, as well as sending cleanup patches (such as “fixing” checkpatch warnings in existing code).

In April, we announced the redefinition of the “Supported” status for NIC drivers, to try to nudge vendors towards more collaboration and better testing. Whether this change has the desired effect remains to be seen.

Last but not least Andrew Lunn and Simon Horman have joined the netdev maintainer group.

6.9: https://lore.kernel.org/20240312042504.1835743-1-kuba@kernel.org
6.10: https://lore.kernel.org/20240514231155.1004295-1-kuba@kernel.org
6.11: https://lore.kernel.org/20240716152031.1288409-1-kuba@kernel.org
6.12: https://lore.kernel.org/20240915172730.2697972-1-kuba@kernel.org
6.13: https://lore.kernel.org/20241119161923.29062-1-pabeni@redhat.com

 

from paulmck

I recently noted that there are 11 ways to wait for RCU, and this note was duly noted by others. Please note that this is just vanilla RCU, not counting SRCU and the various flavors of Tasks RCU.

But what exactly are these ways?

Let Me List The Ways

The easy way to answer this is to look at rcutorture's module parameters:

  1. gp_normal waits asynchronously for normal grace periods, as in call_rcu().
  2. gp_sync waits synchronously for normal grace periods, as in synchronize_rcu().
  3. gp_exp waits synchronously for expedited grace periods, as in synchronize_rcu_expedited().
  4. gp_cond waits conditionally and synchronously for normal grace periods, as in: c = get_state_synchronize_rcu(); do_something(); cond_synchronize_rcu(c);
  5. gp_cond_exp waits conditionally and synchronously for expedited grace periods, as in: c = get_state_synchronize_rcu(); do_something(); cond_synchronize_rcu_expedited(c);
  6. gp_cond_full waits conditionally and synchronously with a full-sized “cookie” for normal grace periods, as in: get_state_synchronize_rcu_full(&c); do_something(); cond_synchronize_rcu_full(&c);
  7. gp_cond_exp_full waits conditionally and synchronously with a full-sized “cookie” for expedited grace periods, as in: get_state_synchronize_rcu_full(&c); do_something(); cond_synchronize_rcu_expedited_full(&c);
  8. gp_poll poll-waits for normal grace periods, as in: c = start_poll_synchronize_rcu(); do { do_something(); } while (!poll_state_synchronize_rcu(c));
  9. gp_poll_exp poll-waits for expedited grace periods, as in: c = start_poll_synchronize_rcu_expedited(); do { do_something(); } while (!poll_state_synchronize_rcu(c));
  10. gp_poll_full poll-waits with a full-sized “cookie” for normal grace periods, as in: start_poll_synchronize_rcu_full(&c); do { do_something(); } while (!poll_state_synchronize_rcu_full(&c));
  11. gp_poll_exp_full poll-waits with a full-sized “cookie” for expedited grace periods, as in: start_poll_synchronize_rcu_expedited_full(&c); do { do_something(); } while (!poll_state_synchronize_rcu_full(&c));

And that makes 11 of them, just as advertised!

In case you are curious, by default, rcutorture uses all eleven, making a random choice on each grace period. These module parameters can be used to restrict rcutorture's attention to the specified grace-period methods.
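
For a sense of how the polled APIs are typically used, here is a hedged sketch of item 8's pattern without the busy-wait loop (gp, old_p, new_p, do_other_work(), and defer_free() are made-up names for this example):

/* old_p is the previously published structure, new_p its replacement. */
unsigned long cookie;

rcu_assign_pointer(gp, new_p);          /* publish the new version */
cookie = start_poll_synchronize_rcu();  /* snapshot/start a grace period */
do_other_work();
if (poll_state_synchronize_rcu(cookie))
        kfree(old_p);                   /* a full grace period has elapsed */
else
        defer_free(old_p);              /* check the cookie again later */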

But Wait! There Are More!!!

There is a call_rcu_hurry() for kernels built with both CONFIG_RCU_LAZY=y and CONFIG_RCU_NOCB_CPU=y. These Kconfig options are used in kernels built for some battery-powered devices, and they make call_rcu() be lazy, as in a callback might wait up to 10 seconds before its grace period starts. Code that is in more of a hurry invokes call_rcu_hurry() in order to avoid this wait.

But shouldn't rcutorture also test call_rcu_hurry()?

It should and it does, for example, when doing callback flooding where a 10-second delay might result in OOM.

One might argue that queue_rcu_work() is yet another way of waiting asynchronously for a normal grace period, and from the user's point of view it most certainly is. But it invokes call_rcu_hurry() internally, and furthermore is implemented within workqueues.

One might further argue that kfree_rcu() and kfree_rcu_mightsleep() are also ways of waiting for a normal grace period. This argument quickly becomes philosophical: Is the user really waiting for that kfree() to happen? Maybe? Besides, Uladzislau Rezki recently posted a patch, pulled in by Vlastimil Babka, that moves these two functions to the slab allocator. Not only that, but these two functions use the aforementioned queue_rcu_work() function, which in turn uses call_rcu_hurry().

Therefore, rcutorture's testing of call_rcu_hurry() already covers all three of these functions.

How Many Are There Really???

Twelve!!!

Eleven that must be tested separately, and a twelfth in the form of call_rcu_hurry() that is exercised by rcutorture's callback flooding, queueing of callbacks from timer handlers, self-propagating callbacks, rcu_barrier() testing, and debug-objects testing.

Counting up the ways to wait for SRCU and the various Tasks RCU flavors is left as an exercise for the interested reader.

 

from kees

Or, how to break all the tools that parse the “Fixes:” tag

Kees Cook

There was a recent discussion about how Linux's “Fixes” tag, which traditionally uses the 12 character commit SHA prefix, has an ever increasing chance of collisions. There are already 11-character collisions, and Geert wanted to raise the minimum short id to 16 characters. This was met with push-back for various reasons. One aspect that bothered me was some people still treating this like a theoretical “maybe in the future” problem. To clear up that problem, I generated a 12-character prefix collision against the start of Git history, commit 1da177e4c3f4 (“Linux-2.6.12-rc2”), which shows up in “Fixes” tags very often:

$ git log --no-merges --oneline --grep 'Fixes: 1da177e4c3f4' | wc -l
590

Tools like linux-next's “Fixes tag checker”, the Linux CNA's commit parser, and my own CVE lifetime analysis scripts do programmatic analysis of the “Fixes” tag and had no support for collisions (even shorter existing collisions).

So, in an effort to fix these tools, I broke them with commit 1da177e4c3f4 (“docs: git SHA prefixes are for humans”):

$ git show 1da177e4c3f4
error: short object ID 1da177e4c3f4 is ambiguous
hint: The candidates are:
hint:   1da177e4c3f41 commit 2005-04-16 - Linux-2.6.12-rc2
hint:   1da177e4c3f47 commit 2024-12-14 - docs: git SHA prefixes are for humans

This is not yet in the upstream Linux tree, for fear of breaking countless other tools out in the wild. But it can serve as a test commit for those that want to get this fixed ahead of any future collisions (or this commit actually landing).
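
For tool and workflow authors on the consuming side, one low-tech mitigation (not part of the original discussion, just a standard git knob) is to have git emit longer abbreviations by default:

$ git config --global core.abbrev 16

This only changes what your local git prints going forward; it does nothing about the millions of existing 12-character references already recorded in commit messages.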

Lots of thanks to the lucky-commit project, which will grind trailing commit message whitespace in an attempt to find collisions. Doing the 12-character prefix collision took about 6 hours on my OpenCL-enabled RTX 3080 GPU.

For any questions, comments, etc, see this thread.

 

from paulmck

In July of 2023, a routine rcutorture run reported a too-short grace period while running one of 15 instances of the TREE03 scenario, each of which ran for 12 hours. This of course represents a serious bug in RCU. I therefore spent significant time attempting to reproduce this bug, with absolutely no success. Inspecting the diagnostics and the code was no more helpful. I therefore recorded my experiences here and got on with life.

This bug did not happen again.

Until August of 2024. Additional attempts to reproduce this bug once again proved fruitless, as documented here.

Until the very next month, when my attempts to track down some non-RCU bugs in a particularly troublesome -next release led me to modify TREE03's configuration. This effort was quite successful, so much so that it eventually (and unintentionally) reproduced the TREE03 bug. This situation illustrates the fact that tests have no loyalty: They will just as happily find your bugs as they will someone else's, especially in this case, where rcutorture was expressly designed to find my bugs.

The failure rate was about one instance per two hours across 400 TREE03 instances, each of which used four CPUs rather than the usual 16. Although this is way better than once per year, validating a fix would require many hundreds of hours of testing. Increasing the failure rate was thus of great importance, and to this end I undertook a series of experiments over two months' time. These experiments are summarized below, with additional details available here.

Increasing Failure Rate

When rcutorture encounters a too-short grace period, it prints out the segments of the RCU reader that detected it:

Failure/close-call rcutorture reader segments:
 0: 0x8
 1: 0x1
 2: 0x60
 3: 0x9
 4: 0x60
 5: 0x8
 6: 0x8
 7: 0x60 300ms
 8: 0x38

The 0x60 expands to RCUTORTURE_RDR_RCU_1 | RCUTORTURE_RDR_RCU_2, which indicates an RCU reader segment protected by nested calls to rcu_read_lock(), this particular one containing a 300-millisecond delay. Such an RCU reader would of course be prone to preemption, which suggested creating an rcutorture facility to randomly preempt rcutorture's kthreads. The new preempt_duration module parameter controls the duration of the preemption, which by default occurs every second. Specifying rcutorture.preempt_duration=10 doubled the failure rate to about once per hour per 400 TREE03 guest OSes. A nice improvement, but much more is needed to de-heisenbug this heisenbug.

Runs without CPU hotplug never reproduced the too-short grace periods. Reducing the interval between successive CPU-hotplug operations had little effect, perhaps due to the fact that the default interval is only 200 milliseconds and that CPU-hotplug operations can take some time to complete.

Running a non-preemptible kernel resulted in about the same rate of too-short grace periods as did the TREE03 default preemptible kernel.

It turns out that RCU has no fewer than 11 ways to wait for a grace period, a fact that would have shocked my year-2000 self. But here we are. And further experimentation showed that restricting the grace-period type to expedited polled grace periods with full-sized state (using the rcutorture.gp_poll_exp_full=1 kernel boot parameter) in a non-preemptible kernel resulted in about eight failures per hour per 400 TREE03 guest OSes. Furthermore, enabling preemption increased this to about 13 failures per hour. These experiments also decreased the polling interval from a random number of nanoseconds ranging up to 128 microseconds to a random number of nanoseconds ranging up to 16 microseconds.

Code Inspections

This suggested focusing code-inspection efforts on expedited grace periods, but this proved fruitless. Furthermore, runs using the rcutorture.gp_exp=1 kernel parameter (which forces rcutorture to use only the synchronize_rcu_expedited() function) failed to reproduce the too-short grace periods. Runs using rcutorture.fwd_progress=0 to disable rcutorture's call_rcu() flooding increased the failure rate to 17 failures per hour per 400 TREE03 guest OSes, which is a failure rate that just might be evidence of successful de-heisenbugization.

All of the evidence thus far clearly pointed the finger of suspicion at the interaction of full-sized polling and expedited grace periods. Code inspections resulted in several changes, all but one of which were legitimate improvements in the code, but none of which decreased the failure rate significantly. (The illegitimate non-improvement was a case of me noting that initialization of expedited grace periods must complete fully before the first quiescent state may be reported safely. But my earlier self had noticed this same requirement and had fulfilled it using workqueue flushes, a subtlety that initially escaped my current self. Perhaps comments should be upgraded? Or perhaps my memory?)

Added Diagnostics Provide Contradictory Evidence

Having come up dry, but having increased the failure rate enough to support additional diagnostics, it was time to add more information to rcutorture's RCU reader-segment printout. This resulted in some entertaining blind alleys resulting from bugs in my added instrumentation, the most entertaining of which made it look like the scheduler was preempting non-preemptible regions of code. Once these were found and fixed, the scheduler was (of course) exonerated and rcutorture was shown to be correctly creating the chains of RCU readers.

The reader detecting the too-short grace period then produced end-of-test output like this:

Failure/close-call rcutorture reader segments:
 0:  0x2 CPU  3 ... ga1:e60:p08-ga1:e60:p08 IRQ
 1: 0x20 CPU  3 ... ga1:e60:p08-ga1:e60:p08 RCU_1
 2:  0x8 CPU  3 ... ga1:e60:p08-ga1:e60:p08 RBH
 3:  0x4 CPU  3 ... ga1:e60:p08-ga1:e60:p08 PREEMPT
 4: 0x10 CPU  3 ... ga1:e60:p08-ga1:e60:p08 SCHED
 5:  0x1 CPU  3 ... ga1:e60:p08-ga1:e60:p08 BH
 6:  0x8 CPU  3 ... ga1:e60:p08-ga1:e60:p08 RBH
 7: 0x60 CPU  3->0  ga1:e60:p08-ga5:e61:p09 300ms preempted RCU_1 RCU_2
 Reader was preempted.

This enhanced output has new columns that track:

  1. The CPU the reader runs on, starting with CPU 3 and at the end migrating to CPU 0.
  2. The bottom bits of the normal, expedited, and small-size polled grace-period sequence numbers. One normal grace period ended and a new one started, and a new expedited and small-size-polled grace period started. No full grace period elapsed, yet this reader still detected a too-short grace period.
  3. A human-readable expansion of the initial hexadecimal number showing reader state, which progresses from:
    • interrupts disabled to
    • a single level of rcu_read_lock() to
    • rcu_read_lock_bh() to
    • preempt_disable() to
    • rcu_read_lock_sched() to
    • local_bh_disable() to
    • another round of rcu_read_lock_bh() and finally to
    • a pair of nested instances of rcu_read_lock().

The final line indicates that the reader was preempted, but we already knew that given the migration from CPU 3 to CPU 0.

This nonsensical contradiction bears repeating: The reader claims that the updater saw a too-short grace period, but the reader itself did not see a full grace period elapse. Adding delays in the updater after it detects the end of the grace period did not help, at least not until these delays were many hundreds of milliseconds. Which means nothing, because then the delay itself is longer than the readers. Note well that direct use of synchronize_rcu_expedited() never resulted in too-short grace periods, which provides an additional level of confusion.

This situation clearly calls for additional debugging, but adding debugging slows down the failure rate for this bug. For but one example, adding raw_smp_processor_id() to the reader in order to record the current CPU slowed down failure rate by about a factor of two.

It is therefore clearly necessary to extract every shred of information from the debugging that is already present.

Smoking CPU-Hotplug Operations

To that end, when the reader detects the too-short grace period, it dumps the trace buffer to the console, even in the case where nothing is being traced. This means that there will be a Dumping ftrace buffer message on the console at about the time that the too-short grace period was detected. The last few lines of console output prior to this are as follows, this time with timestamps:

[ 1091.544531] rcu-torture: rcu_torture_boost started
[ 1092.100819] rcu-torture: Stopping rcu_torture_boost task
[ 1092.101659] rcu-torture: rcu_torture_boost is stopping
[ 1092.115219] smpboot: CPU 3 is now offline
[ 1092.536276] rcu-torture: Stopping rcu_torture_boost task
[ 1092.536938] rcu-torture: rcu_torture_boost is stopping
[ 1092.569261] smpboot: CPU 1 is now offline
[ 1092.828577] smpboot: Booting Node 0 Processor 3 APIC 0x3
[ 1092.842975] rcu-torture: Creating rcu_torture_boost task
[ 1092.843895] rcu-torture: rcu_torture_boost started
[ 1093.201384] Dumping ftrace buffer:

Note that the CPU that the reader is running on, CPU 3, came online just before the start of the reader that inexplicably reported the too-short grace period. Of course, each rcutorture guest OS has only four CPUs, and only three of them can be taken offline (CPUs 1-3), so this could be random chance.

Except that this was always the case, across more than 100 too-short grace periods. And the value of one-third to the 100th power represents a very low probability indeed, so it seems safe to assume that this correlation represents some form of causation, obscure though that causation might be. Especially given that the bug does not appear in the absence of CPU-hotplug operations.

This realization prompted another inspection of RCU's expedited grace period code and its relation to CPU hotplug, pointing the finger of suspicion on the sync_sched_exp_online_cleanup() function. I spent an embarrassingly large amount of time on this before reminding myself that this function is in fact a no-op in the preemptible kernels that I was testing.

Further Increasing Failure Rate

I also got annoyed at the RCU CPU stall warnings that were instigated by the combination of systems whose guest OSes booted slowly and the testing of RCU priority boosting. I therefore disabled testing of RCU priority boosting. To my delight this actually increased the failure rate of these phantom too-short grace periods.

Thus encouraged, I did more experiments, learning that building the kernel with CONFIG_RCU_FANOUT_LEAF=4 increased the failure rate by more than a factor of two. This was a surprise, given that the effect was to move from a three-rcu_node-structure combining tree to a singleton tree having only the one rcu_node structure. In the past, combining trees having fewer rcu_node structures have reduced race-condition failure rates, but this was clearly an exception to that rule.

At this point, with instrumentation disabled, the bug was reproducing about 50 times per hour across 400 instances of TREE03 guest OSes, which is an increase of more than five orders of magnitude over the initial once-per-year failure rate. This meant that I could use a lot more debugging code!

Smoking Polled Grace Periods

Which resulted in the following trace output, which consists of three lines, some or all of which your browser is likely to have line-wrapped:

rcu_pree-15        0d..1. 3170054244us : rcu_grace_period: rcu_preempt 389665 start
rcu_tort-60        0..... 3170054253us : rcu_torture_writer: rcu_torture_writer Polled grace period started 0x5f224/0xcc368.
rcu_tort-60        0..... 3170054276us : rcu_torture_writer: rcu_torture_writer Polled grace period ended 0x5f224/0xcc368.

The first line (rcu_pree-15) marks the start of grace period 389665, also known as 0x5f221, whose end is 0x5f224. This matches the value produced by the trace_printk() statements producing the two Polled grace period messages (both rcu_tort-60). Yet there is no trace of the end of this grace period, nor is there any sign of an expedited grace period. Therefore, the updater is somehow ending the grace period too soon.

This points the finger of suspicion at the poll_state_synchronize_rcu_full() function that the updater uses to detect the end of the grace period. This function is as follows:

 1 bool poll_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 2 {
 3   struct rcu_node *rnp = rcu_get_root();
 4
 5   smp_mb(); 
 6   if (rgosp->rgos_norm == RCU_GET_STATE_COMPLETED || 
 7       rcu_seq_done_exact(&rnp->gp_seq, rgosp->rgos_norm) || 
 8       rgosp->rgos_exp == RCU_GET_STATE_COMPLETED || 
 9       rcu_seq_done_exact(&rcu_state.expedited_sequence, rgosp->rgos_exp)) {
10     smp_mb(); 
11     return true;
12   }
13   return false;
14 }

Line 7 looks just plain wrong. To see why, recall that the ->gp_seq field is replicated in the rcu_state, rcu_node, and rcu_data structures in order to keep both lock and memory contention down to a dull roar. There is one rcu_node structure in this configuration and rcu_data is a per-CPU structure, so there is a total of six copies of ->gp_seq. The source of truth is the one in the rcu_state structure, that is, rcu_state.gp_seq, which is updated first at the beginning of each grace period and last at each grace period's end. So why doesn't line 7 use rcu_state.gp_seq? Sure, there is only one copy of this field, but the same is true of the root rcu_node structure's ->gp_seq field accessed by line 7.

The same question could be asked about the just-as-wrong-looking line 6 of get_state_synchronize_rcu_full():

 1 void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 2 {
 3   struct rcu_node *rnp = rcu_get_root();
 4
 5   smp_mb();
 6   rgosp->rgos_norm = rcu_seq_snap(&rnp->gp_seq);
 7   rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence);
 8 }

In fact both lines looked so wrong that I changed them both (without even bothering to read either function's header comment) and reran the test. There were no too-short grace periods.

Success?

Well, not until there is an explanation of why this makes a difference. After all, there is strong ordering between the updates of the rcu_node structures' ->gp_seq fields and all the quiescent states, so why should it really make a difference? And perhaps it is not a true fix, but rather an unrelated change that just happens to greatly reduce the probability of too-short grace periods, in which case, the change is doing nothing but impeding progress.

Turn to Tracing

I reverted the change to the get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions, and then enabled yet more tracing, resulting in the following redacted trace messages:

rcu_pree-15        0d..1. 816764514us : rcu_grace_period: rcu_preempt 90289 start
 cpuhp/3-34        3..... 816772336us : rcu_exp_grace_period: rcu_preempt 0x2f0e8 online
rcu_tort-59        0..... 816807835us : rcu_torture_writer: rcu_torture_writer Polled grace period started 0x160b4/0x2f0ec.
rcu_pree-15        0d..1. 816810514us : rcu_grace_period: rcu_preempt 90289 startroot
rcu_pree-15        0d..1. 816810515us : rcu_grace_period: rcu_preempt 90289 cpustart
rcu_pree-15        0d..1. 816810516us : rcu_grace_period: rcu_preempt 90289 cpuqs
rcu_pree-15        0d..1. 816811522us : rcu_grace_period: rcu_preempt 90289 endroot 
rcu_pree-15        0d..1. 816811523us : rcu_grace_period: rcu_preempt 90289 cpuend
rcu_tort-59        0..... 816811529us : rcu_torture_writer: rcu_torture_writer Polled grace period ended 0x160b4/0x2f0ec.
rcu_pree-15        0d..1. 816842554us : rcu_grace_period: rcu_preempt 90289 end

The first trace message (rcu_pree-15) marks the official start of grace period 90289 (also known as 0x160b1), recorded in rcu_state.gp_seq. The ->gp_seq value for the end of this grace period is 0x160b4.

Some time later, the second trace message (cpuhp/3-34) records CPU 3 coming online. This CPU came online after the grace period started, so the grace period is not obligated to wait on it.

Later still, the third trace message (rcu_tort-59) marks the start of the polled grace period from the updater's perspective. The next five trace messages (rcu_pree-15) record:

  1. The update of the root rcu_node structure's ->gp_seq field to reflect the start of the grace period.
  2. CPU 0 noticing that the grace period had started.
  3. CPU 0 reporting its quiescent state.
  4. The update of the root rcu_node structure's ->gp_seq field to reflect the end of the grace period.
  5. CPU 0 noticing that the grace period had ended.

The second-to-last trace message (rcu_tort-59) marks the end of the polled grace period from the updater's perspective. And the final trace message (rcu_pree-15) marks the official end of this grace period, recorded in rcu_state.gp_seq.

Anatomy of a Normal RCU Grace Period

The above trace messages report on important transitions in RCU's grace-period processing, and this section puts those transitions into the context of the entire grace period. The relevant phases of grace-period processing are as follows:

  1. The very beginning of a grace-period, which updates rcu_state.gp_seq, reported by the rcu_pree-15 trace message above ending in start.
  2. Accounting for recent CPU-hotplug operations. Again, please note that an RCU grace period need not wait for readers running on a CPU that came online after that grace period started, so this accounting need be done only at the start of each grace period.
  3. Initializing the rcu_node combining tree to prepare for quiescent states from any CPUs and tasks that this grace period must wait on, which updates each rcu_node structure's ->gp_seq field. This phase is reported by the rcu_pree-15 trace message above ending in startroot. (But note that this trace message is not in mainline, and probably won't ever be.)
  4. The note_gp_changes() function notices the starts and ends of grace periods from the viewpoint of each CPU that was online at the start of the grace period (in this case, only CPU 0). These events are reported by the rcu_pree-15 trace messages above ending in cpustart and cpuend.
  5. Quiescent states are reported by a chain of events initiated by rcu_qs(), and are reported by the rcu_pree-15 trace messages above ending in cpuqs.
  6. Idle CPUs are handled specially, but this did not happen in the above trace.
  7. Once all relevant CPUs and tasks have reported quiescent states, the cleanup phase starts which updates each rcu_node structure's ->gp_seq field. This is reported by the rcu_pree-15 trace message above ending in endroot. (But note that this trace message is not in mainline, and probably won't ever be.)
  8. Finally, the entire grace period comes to an end, which updates rcu_state.gp_seq, reported by the rcu_pree-15 trace message above ending in end.

The full story is a bit more complex, but this provides enough context for the bug at hand.

Turning Back to Tracing

As noted earlier, all the quiescent states happened between the two updates of the root rcu_node structure's ->gp_seq field, so all should be well, right?

Unfortunately, no. A reader noted a too-short grace period, and here is the corresponding console log, again with lines likely wrapped:

Failure/close-call rcutorture reader segments:
 816805685us 0: 0x10 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  SCHED
 816805685us 1: 0x10 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  SCHED
 816805685us 2:  0x4 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  PREEMPT
 816805685us 3:  0x8 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  RBH
 816805685us 4:  0x4 CPU  3 ... g60b1:e02f0e4:p22f4-g60b5:e02f0e5:p22f5 300ms PREEMPT
 817105825us 5: 0x60 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted RCU_1 RCU_2
 817105851us 6:  0x4 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted PREEMPT
 817105852us 7:  0x4 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted PREEMPT
 817105852us 8: 0x20 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted RCU_1
 Reader was preempted.

Note that it was CPU 3 that came online, and it was also CPU 3 that noticed the too-short grace period. This is the key clue: CPU 3 came online between the time that the grace period officially started and the start of the updater, so CPU 3 might well have started a reader before the updater started. Although the grace period did not need to wait for that reader, the updater most certainly did.

Reaching Root Cause

And this is why the aforementioned update to the get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions suppressed the bug. With the get_state_synchronize_rcu_full() function reading from rcu_state.gp_seq, the updater would see that the current grace period had already started, and would thus wait until the end of the next grace period, which could not end until some time after CPU 3's reader finished.

Finding a Fix

Unfortunately, the corresponding change to the poll_state_synchronize_rcu_full() function breaks this guarantee:

get_state_synchronize_rcu_full(&rgos);
synchronize_rcu();
WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));

The problem is that callbacks can be invoked (including those that awaken tasks blocked in synchronize_rcu()) immediately after the root rcu_node structure's ->gp_seq field has been updated, and thus long before rcu_state.gp_seq has been updated to record the grace period's end. Therefore, if the poll_state_synchronize_rcu_full() function is changed to instead check rcu_state.gp_seq, the above WARN_ON_ONCE() can trigger.

Which would have been immediately obvious had I read the poll_state_synchronize_rcu_full() function's header comment. So if I wasn't going to read these comments, why did I bother writing them? ;–)

Implementing a Fix

The first step is to fix line 6 of get_state_synchronize_rcu_full() to access rcu_state.gp_seq (instead of the root rcu_node structure's ->gp_seq field):

 1 void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 2 {
 3   struct rcu_node *rnp = rcu_get_root();
 4
 5   smp_mb();
 6   rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
 7   rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence);
 8 }

In real life, we must of course also get rid of the now-unused rnp local variable, but that would mess up the line numbering.

But running with this patch results in too-short grace periods, and not just once per minute per 400 TREE03 instances, but instead once every few seconds in each and every instance!

The problem is that both get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() might run after the grace period starts (thus having incremented rcu_state.gp_seq) but before the grace period has finished accounting for recent CPU-hotplug operations (thus having not yet incremented the root rcu_node structure's ->gp_seq field). This easy-to-trigger situation will cause poll_state_synchronize_rcu_full() to conclude that the grace period ended an extremely long time ago, resulting in a too-short grace period from the perspective of any concurrent readers.

The fix is to adjust the rcu_seq_done_exact() function to take a more liberal view of what constitutes an ongoing grace period, as in the following diff:

 1 diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
 2 index 2f9c9272cd486..d2a91f705a4ab 100644
 3 --- a/kernel/rcu/rcu.h
 4 +++ b/kernel/rcu/rcu.h
 5 @@ -162,9 +162,9 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
 6  {
 7    unsigned long cur_s = READ_ONCE(*sp);
 8
 9 -  return ULONG_CMP_GE(cur_s, s) ||
10 -    ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
11 +  return ULONG_CMP_GE(cur_s, s) ||
12 +    ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
13  }
14
15  /*

The change from 2 to 3 does the trick. The corresponding RFC patch may be found here.
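
To see why, here is a hedged sketch of the arithmetic, assuming the usual two low-order state bits, so that RCU_SEQ_STATE_MASK is 3 and each grace period advances ->gp_seq by 4. Suppose the previous grace period ended with the root rcu_node structure's ->gp_seq equal to 4X, and a new grace period has just started, so that rcu_state.gp_seq is 4X+1 but the root rcu_node copy is still 4X. The cookie computed from rcu_state.gp_seq is then 4X+8, the end of the next grace period. When poll_state_synchronize_rcu_full() reads the root rcu_node value of 4X, the old check compares it against 4X+8-7 = 4X+1, concludes that the counter must have wrapped long ago, and falsely reports the grace period as complete. With 3 * RCU_SEQ_STATE_MASK + 1, the threshold becomes 4X-2, the comparison fails, and the still-running grace period is correctly reported as still in progress.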

Evaluation

With this fixed, many tens of hours of 400 instances of TREE03 ran without any too-short grace periods. Now on to the bugs that I was chasing a couple of months ago...

Lessons (Re)Learned

This effort took a couple of months, so it is well worth reflecting on the experience to see what might be done better next time.

First, much time was consumed increasing the failure rate (de-heisenbugging) from about once per year to about once per few hours, then to about once per three minutes, and yet again to about once per minute. Things would have gone faster had I put more time in during the first phase of de-heisenbugging. For example, once I noted that CPU-hotplug operations were involved, I should have experimented with rcutree's gp_preinit_delay, gp_init_delay, and gp_cleanup_delay module parameters, which insert delays that can increase the probability of races between CPU-hotplug operations and grace-period processing. On the other hand, it is all too easy to spend forever de-heisenbugging and never get around to actually finding the bug, and in this particular case, a large leap of imagination would have been required to get from the expedited grace periods that seemed to be causing trouble to the normal grace periods that were actually doing so. Either way, it was necessary to greatly increase the failure rate in order to fix the failure.

Second, much time was wasted due to my erroneous assumption that the bug involved the expedited grace period, when it was really the interaction between the normal grace period and the full-sized polled API. Unfortunately, it would have been difficult to check that assumption much before the last stage of de-heisenbugging.

Third, I could have saved a few days by more carefully checking my initial instrumentation. Debug code is also code, and thus can also contain bugs!

Fourth, I could have saved a day or two by reading the poll_state_synchronize_rcu_full() function's header comment.

Fifth, although symmetry plays an important role in human aesthetics and perception of elegance, in this case, the symmetry of the ->gp_seq accesses in the get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions was in fact the bug.

Sixth and finally, I created this bug in July of 2022. Those promoting various validation and verification techniques therefore had more than two years to find this bug, which was in a very public code base. Perhaps next time they will find an RCU bug before I do. ;–)

For More Heisenbug-Hunting Information...

Many people have held forth on this topic, but these references should get you started:

  1. Section 11.6 (“Probability and Heisenbugs”) of Is Parallel Programming Hard, And, If So, What Can You Do About It?, with emphasis on Section 11.6.4 (“Hunting Heisenbugs”).
  2. The “Hunting Heisenbugs” talk at 2023 Linux Plumbers Conference.
  3. The similar “Hunting Heisenbugs” talk at 2024 Everything Open.
 

from linusw

In kernel v6.10 we managed to merge two security hardening patches to the ARM32 architecture:

  • PAN for LPAE CONFIG_CPU_TTBR0_PAN
  • KCFI on ARM32 CONFIG_CFI_CLANG

As of kernel v6.12 these seem sufficiently stable for users such as distributions and embedded systems to look closer at. Below are the technical details!

A good rundown of these and other historically interesting security features can be found in Russell Currey's abridged history of kernel hardening which sums up what has been done up to now in a very approachable form.

PAN for LPAE

PAN is an abbreviation for the somewhat grammatically incorrect Privileged Access Never.

The fundamental idea with PAN on different architectures is to disable any access from kernelspace to the userspace memory, unless explicitly requested using the dedicated accessor functions such as get_user()/copy_from_user() and put_user()/copy_to_user(). Attackers may want to compromise userspace from the kernel to access things such as keys, and we want to make this hard for them, and in general it protects userspace memory from corruption from kernelspace.

On some architectures, such as S390, the userspace memory is completely separate from the kernel memory, but most simpler CPUs will just map the userspace into low memory (address 0x00000000 and up) and there it is always accessible from the kernel.

The ARM32 kernel has for a few years had a config option named CONFIG_CPU_SW_DOMAIN_PAN which uses a hardware feature whereby userspace memory is made inaccessible from kernelspace. There is a special bit in the page descriptors saying that a certain page or segment belongs to userspace, so this is possible for the hardware to deduce.

For modern ARM32 systems with large memories configured to use LPAE nothing like PAN was available: this version of the MMU simply did not implement a PAN option.

As of the patch originally developed by Catalin Marinas, we deploy a scheme that will use the fact that LPAE has two separate translation table base registers (TTBR:s): one for userspace (TTBR0) and one for kernelspace (TTBR1).

By simply disabling the use of any translations (page walks) on TTBR0 when executing in kernelspace – unless explicitly enabled in the uaccess routines – we achieve the same effect as PAN. This is now turned on by default for LPAE configurations.
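
Here is a hedged sketch of what this means for driver code (my_args and the ioctl handler are invented for this example):

static long my_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
        struct my_args args;
        void __user *uptr = (void __user *)arg;

        /* Allowed: the uaccess helper briefly re-enables TTBR0 page walks. */
        if (copy_from_user(&args, uptr, sizeof(args)))
                return -EFAULT;

        /* A direct dereference such as *(int *)arg would fault here,
         * because TTBR0 translations are off while running in kernelspace. */
        return 0;
}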

KCFI on ARM32

The Kernel Control Flow Integrity is a “forward edge control flow checker”, which in practice means that the compiler stores a hash of the function prototype in memory right before each function that can be called indirectly, and emits a check of this hash at every indirect call site, so that an attacker cannot easily redirect execution to an arbitrary function.
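
Here is a hedged illustration of what the check catches (all names are invented for this example):

typedef int (*handler_fn)(int);

static int good_handler(int x)
{
        return x + 1;
}

static long wrong_prototype(void)
{
        return 0;
}

static int dispatch(handler_fn h, int arg)
{
        return h(arg);  /* the compiler emits a prototype-hash check here */
}

/* dispatch(good_handler, 1): the hashes match and the call proceeds.
 * dispatch((handler_fn)wrong_prototype, 1): hash mismatch, KCFI traps. */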

KCFI is currently only implemented in the LLVM CLANG compiler, so the kernel needs to be compiled using CLANG. This is typically achieved by passing the build flag LLVM=1 to the kernel build. As the CLANG compiler is universal for all targets, the build system will figure out the rest.

Further, to support KCFI a fairly recent version of CLANG is needed. The kernel build will check if the compiler is new enough to support the option -fsanitize=kcfi else the option will be disabled.

The patch set is pretty complex but gives you an overview of how the feature was implemented on ARM32. It involved patching the majority of functions written in assembly and called from C with the special SYM_TYPED_FUNC_START() and SYM_FUNC_END() macros, inserting KCFI hashes also before functions written in assembly.

The overhead of this feature seems to be small so I recommend checking it out if you are able to use the CLANG compiler.

 

from Gustavo A. R. Silva

The counted_by attribute

The counted_by attribute was introduced in Clang-18 and will soon be available in GCC-15. Its purpose is to associate a flexible-array member with a struct member that will hold the number of elements in this array at some point at run-time. This association is critical for enabling runtime bounds checking via the array bounds sanitizer and the __builtin_dynamic_object_size() built-in function. In user-space, this extra level of security is enabled by -D_FORTIFY_SOURCE=3. Therefore, using this attribute correctly enhances C codebases with runtime bounds-checking coverage on flexible-array members.

Here is an example of a flexible array annotated with this attribute:

struct bounded_flex_struct {
    ...
    size_t count;
    struct foo flex_array[] __attribute__((__counted_by__(count)));
};

In the above example, count is the struct member that will hold the number of elements of the flexible array at run-time. We will call this struct member the counter.

In the Linux kernel, this attribute facilitates bounds-checking coverage through fortified APIs such as the memcpy() family of functions, which internally use __builtin_dynamic_object_size() (CONFIG_FORTIFY_SOURCE), as well as through the array-bounds sanitizer (CONFIG_UBSAN_BOUNDS).
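
Here is a hedged sketch, reusing the bounded_flex_struct example from above, of a correctly ordered kernel-side allocation (n and src are assumed to be in scope):

struct bounded_flex_struct *p;

p = kzalloc(struct_size(p, flex_array, n), GFP_KERNEL);
if (!p)
        return -ENOMEM;

p->count = n;  /* the counter must be set before the array is referenced */
memcpy(p->flex_array, src, n * sizeof(*p->flex_array));  /* now bounds-checked */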

The __counted_by() macro

In the kernel we wrap the counted_by attribute in the __counted_by() macro, as shown below.

#if __has_attribute(__counted_by__)
# define __counted_by(member)  __attribute__((__counted_by__(member)))
#else
# define __counted_by(member)
#endif
  • c8248faf3ca27 (“Compiler Attributes: counted_by: Adjust name...“)

And with this we have been annotating flexible-array members across the whole kernel tree over the last year.

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.h b/drivers/net/ethernet/chelsio/cxgb4/sched.h
index 5f8b871d79afac..6b3c778815f09e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sched.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.h
@@ -82,7 +82,7 @@ struct sched_class {
 
 struct sched_table {      /* per port scheduling table */
 	u8 sched_size;
-	struct sched_class tab[];
+	struct sched_class tab[] __counted_by(sched_size);
 };
  • ceba9725fb45 (“cxgb4: Annotate struct sched_table with ...“)

However, as we are about to see, not all __counted_by() annotations are always as straightforward as the one above.

__counted_by() annotations in the kernel

There are a number of requirements to properly use the counted_by attribute. One crucial requirement is that the counter must be initialized before the first reference to the flexible-array member. Another requirement is that the array must always contain at least as many elements as indicated by the counter. Below you can see an example of a kernel patch addressing these requirements.

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
index dac7eb77799bd1..68960ae9898713 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
@@ -33,7 +33,7 @@ struct brcmf_fweh_queue_item {
 	u8 ifaddr[ETH_ALEN];
 	struct brcmf_event_msg_be emsg;
 	u32 datalen;
-	u8 data[];
+	u8 data[] __counted_by(datalen);
 };
 
 /*
@@ -418,17 +418,17 @@ void brcmf_fweh_process_event(struct brcmf_pub *drvr,
 	    datalen + sizeof(*event_packet) > packet_len)
 		return;
 
-	event = kzalloc(sizeof(*event) + datalen, gfp);
+	event = kzalloc(struct_size(event, data, datalen), gfp);
 	if (!event)
 		return;
 
+	event->datalen = datalen;
 	event->code = code;
 	event->ifidx = event_packet->msg.ifidx;
 
 	/* use memcpy to get aligned event message */
 	memcpy(&event->emsg, &event_packet->msg, sizeof(event->emsg));
 	memcpy(event->data, data, datalen);
-	event->datalen = datalen;
 	memcpy(event->ifaddr, event_packet->eth.h_dest, ETH_ALEN);
 
 	brcmf_fweh_queue_event(fweh, event);
  • 62d19b358088 (“wifi: brcmfmac: fweh: Add __counted_by...“)

In the patch above, datalen is the counter for the flexible-array member data. Notice how the assignment to the counter, event->datalen = datalen, had to be moved to before the call to memcpy(event->data, data, datalen); this ensures the counter is initialized before the first reference to the flexible array. Otherwise, the compiler would complain about trying to write into a flexible array of size zero, due to datalen being zeroed out by the previous call to kzalloc(). This assignment-after-memcpy pattern has been quite common in the Linux kernel. However, when dealing with counted_by annotations, this pattern should be changed. Therefore, we have to be careful when adding these annotations: we should audit all instances of code that reference both the counter and the flexible array and ensure they meet the requirements.
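
For reference, here is a minimal user-space sketch of the required ordering (with made-up names, and calloc()/memcpy() standing in for the kernel helpers): allocate enough space for the flexible array, assign the counter, and only then write to the array.

#include <stdlib.h>
#include <string.h>

struct event {
    unsigned int datalen;
    unsigned char data[] __attribute__((__counted_by__(datalen)));
};

struct event *event_alloc(const void *src, unsigned int datalen)
{
    /* A simplified stand-in for struct_size(ev, data, datalen). */
    struct event *ev = calloc(1, sizeof(*ev) + datalen);

    if (!ev)
        return NULL;

    ev->datalen = datalen;          /* the counter first...        */
    memcpy(ev->data, src, datalen); /* ...then the flexible array. */
    return ev;
}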

In the kernel, we've been learning from our mistakes and have fixed some buggy annotations we made in the beginning. Here are a couple of bugfixes to make you aware of these issues:

  • 6dc445c19050 (“clk: bcm: rpi: Assign ->num before accessing...“)

  • 9368cdf90f52 (“clk: bcm: dvp: Assign ->num before accessing...“)

Another common issue is when the counter is updated inside a loop. See the patch below.

diff --git a/drivers/net/wireless/ath/wil6210/cfg80211.c b/drivers/net/wireless/ath/wil6210/cfg80211.c
index 8993028709ecfb..e8f1d30a8d73c5 100644
--- a/drivers/net/wireless/ath/wil6210/cfg80211.c
+++ b/drivers/net/wireless/ath/wil6210/cfg80211.c
@@ -892,10 +892,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	struct wil6210_priv *wil = wiphy_to_wil(wiphy);
 	struct wireless_dev *wdev = request->wdev;
 	struct wil6210_vif *vif = wdev_to_vif(wil, wdev);
-	struct {
-		struct wmi_start_scan_cmd cmd;
-		u16 chnl[4];
-	} __packed cmd;
+	DEFINE_FLEX(struct wmi_start_scan_cmd, cmd,
+		    channel_list, num_channels, 4);
 	uint i, n;
 	int rc;
 
@@ -977,9 +975,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	vif->scan_request = request;
 	mod_timer(&vif->scan_timer, jiffies + WIL6210_SCAN_TO);
 
-	memset(&cmd, 0, sizeof(cmd));
-	cmd.cmd.scan_type = WMI_ACTIVE_SCAN;
-	cmd.cmd.num_channels = 0;
+	cmd->scan_type = WMI_ACTIVE_SCAN;
+	cmd->num_channels = 0;
 	n = min(request->n_channels, 4U);
 	for (i = 0; i < n; i++) {
 		int ch = request->channels[i]->hw_value;
@@ -991,7 +988,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 			continue;
 		}
 		/* 0-based channel indexes */
-		cmd.cmd.channel_list[cmd.cmd.num_channels++].channel = ch - 1;
+		cmd->num_channels++;
+		cmd->channel_list[cmd->num_channels - 1].channel = ch - 1;
 		wil_dbg_misc(wil, "Scan for ch %d  : %d MHz\n", ch,
 			     request->channels[i]->center_freq);
 	}
@@ -1007,16 +1005,15 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	if (rc)
 		goto out_restore;
 
-	if (wil->discovery_mode && cmd.cmd.scan_type == WMI_ACTIVE_SCAN) {
-		cmd.cmd.discovery_mode = 1;
+	if (wil->discovery_mode && cmd->scan_type == WMI_ACTIVE_SCAN) {
+		cmd->discovery_mode = 1;
 		wil_dbg_misc(wil, "active scan with discovery_mode=1\n");
 	}
 
 	if (vif->mid == 0)
 		wil->radio_wdev = wdev;
 	rc = wmi_send(wil, WMI_START_SCAN_CMDID, vif->mid,
-		      &cmd, sizeof(cmd.cmd) +
-		      cmd.cmd.num_channels * sizeof(cmd.cmd.channel_list[0]));
+		      cmd, struct_size(cmd, channel_list, cmd->num_channels));
 
 out_restore:
 	if (rc) {
diff --git a/drivers/net/wireless/ath/wil6210/wmi.h b/drivers/net/wireless/ath/wil6210/wmi.h
index 71bf2ae27a984f..b47606d9068c8b 100644
--- a/drivers/net/wireless/ath/wil6210/wmi.h
+++ b/drivers/net/wireless/ath/wil6210/wmi.h
@@ -474,7 +474,7 @@ struct wmi_start_scan_cmd {
 	struct {
 		u8 channel;
 		u8 reserved;
-	} channel_list[];
+	} channel_list[] __counted_by(num_channels);
 } __packed;
 
 #define WMI_MAX_PNO_SSID_NUM	(16)
  • 34c34c242a1b (“wifi: wil6210: cfg80211: Use __counted_by...“)

The patch above does a bit more than merely annotating the flexible array with the __counted_by() macro, but that's material for a future post. For now, let's focus on the following excerpt.

-	cmd.cmd.scan_type = WMI_ACTIVE_SCAN;
-	cmd.cmd.num_channels = 0;
+	cmd->scan_type = WMI_ACTIVE_SCAN;
+	cmd->num_channels = 0;
 	n = min(request->n_channels, 4U);
 	for (i = 0; i < n; i++) {
 		int ch = request->channels[i]->hw_value;
@@ -991,7 +988,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 			continue;
 		}
 		/* 0-based channel indexes */
-		cmd.cmd.channel_list[cmd.cmd.num_channels++].channel = ch - 1;
+		cmd->num_channels++;
+		cmd->channel_list[cmd->num_channels - 1].channel = ch - 1;
 		wil_dbg_misc(wil, "Scan for ch %d  : %d MHz\n", ch,
 			     request->channels[i]->center_freq);
 	}
 ...
--- a/drivers/net/wireless/ath/wil6210/wmi.h
+++ b/drivers/net/wireless/ath/wil6210/wmi.h
@@ -474,7 +474,7 @@ struct wmi_start_scan_cmd {
 	struct {
 		u8 channel;
 		u8 reserved;
-	} channel_list[];
+	} channel_list[] __counted_by(num_channels);
 } __packed;

Notice that in this case, num_channels is our counter, and it's set to zero before the for loop. Inside the for loop, the original code used this variable as an index to access the flexible array, then updated it via a post-increment, all in one line: cmd.cmd.channel_list[cmd.cmd.num_channels++]. The issue is that once channel_list was annotated with the __counted_by() macro, the compiler enforces dynamic array indexing of channel_list to stay below num_channels. Since num_channels holds a value of zero at the moment of the array access, this leads to undefined behavior and may trigger a compiler warning.

As shown in the patch, the solution is to increment num_channels first, and then access the array at index num_channels - 1, so that the index stays below the updated counter.

Another option is to avoid using the counter as an index for the flexible array altogether. This can be done by using an auxiliary variable instead. See an excerpt of a patch below.

diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h
index 38eb7ec86a1a65..21ebd70f3dcc97 100644
--- a/include/net/bluetooth/hci.h
+++ b/include/net/bluetooth/hci.h
@@ -2143,7 +2143,7 @@ struct hci_cp_le_set_cig_params {
 	__le16  c_latency;
 	__le16  p_latency;
 	__u8    num_cis;
-	struct hci_cis_params cis[];
+	struct hci_cis_params cis[] __counted_by(num_cis);
 } __packed;

@@ -1722,34 +1717,33 @@ static int hci_le_create_big(struct hci_conn *conn, struct bt_iso_qos *qos)
 
 static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 {
 ...

+	u8 aux_num_cis = 0;
 	u8 cis_id;
 ...

 	for (cis_id = 0x00; cis_id < 0xf0 &&
-	     pdu.cp.num_cis < ARRAY_SIZE(pdu.cis); cis_id++) {
+	     aux_num_cis < pdu->num_cis; cis_id++) {
 		struct hci_cis_params *cis;
 
 		conn = hci_conn_hash_lookup_cis(hdev, NULL, 0, cig_id, cis_id);
@@ -1758,7 +1752,7 @@ static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 
 		qos = &conn->iso_qos;
 
-		cis = &pdu.cis[pdu.cp.num_cis++];
+		cis = &pdu->cis[aux_num_cis++];
 		cis->cis_id = cis_id;
 		cis->c_sdu  = cpu_to_le16(conn->iso_qos.ucast.out.sdu);
 		cis->p_sdu  = cpu_to_le16(conn->iso_qos.ucast.in.sdu);
@@ -1769,14 +1763,14 @@ static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 		cis->c_rtn  = qos->ucast.out.rtn;
 		cis->p_rtn  = qos->ucast.in.rtn;
 	}
+	pdu->num_cis = aux_num_cis;
 
 ...
  • ea9e148c803b (“Bluetooth: hci_conn: Use __counted_by() and...“)

Again, the entire patch does more than merely annotate the flexible-array member, but let's just focus on how aux_num_cis is used to access the flexible array pdu->cis[].

In this case, the counter is num_cis. As in our previous example, the counter was originally used to directly index the flexible array: &pdu.cis[pdu.cp.num_cis++]. The patch above instead introduces a new variable, aux_num_cis, to be used as the index: &pdu->cis[aux_num_cis++]. The counter is then updated after the loop: pdu->num_cis = aux_num_cis.
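
As a compact sketch of this pattern (with made-up struct and function names; assuming the buffer was allocated with room for capacity elements), the counter initially covers the allocated elements, an auxiliary variable is used as the loop index, and the counter is trimmed afterwards:

struct msg {
    unsigned int num;
    unsigned int vals[] __attribute__((__counted_by__(num)));
};

void msg_fill(struct msg *m, const unsigned int *src,
              unsigned int capacity, unsigned int n)
{
    unsigned int aux = 0;   /* auxiliary index, not the counter */
    unsigned int i;

    m->num = capacity;      /* counter covers the allocated elements */

    for (i = 0; i < n && aux < capacity; i++)
        m->vals[aux++] = src[i];

    m->num = aux;           /* trim the counter to what was filled */
}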

Both solutions are acceptable, so use whichever is convenient for you. :)

Here, you can see a recent bugfix for some buggy annotations that missed the details discussed above:

  • [PATCH] wifi: iwlwifi: mvm: Fix _counted_by usage in cfg80211_wowlan_nd*

In a future post, I'll address the issue of annotating flexible arrays of flexible structures. Spoiler alert: don't do it!

Latest version: How to use the new counted_by attribute in C (and Linux)

 
Read more...

from Konstantin Ryabitsev

Message-ID's are used to identify and retrieve messages from the public-inbox archive on lore.kernel.org, so it's only natural to want to use memorable ones. Or maybe it's just me.

Regardless, here's what I do with neomutt and coolname:

  1. If coolname isn't yet packaged for your distro, you can install it with pip:

    pip install --user coolname
    
  2. Create this file as ~/bin/my-msgid.py:

    #!/usr/bin/python3
    import sys
    import random
    import string
    import datetime
    import platform
    
    from coolname import generate_slug
    
    parts = []
    parts.append(datetime.datetime.now().strftime('%Y%m%d'))
    parts.append(generate_slug(3))
    parts.append(''.join(random.choices(string.hexdigits, k=6)).lower())
    
    sys.stdout.write('-'.join(parts) + '@' + platform.node().split('.')[0])
    
  3. Create this file as ~/.mutt-fix-msgid:

    my_hdr Message-ID: <`/path/to/my/bin/my-msgid.py`>
    
  4. Add this to your .muttrc (works with mutt and neomutt):

    send-hook . "source ~/.mutt-fix-msgid"
    
  5. Enjoy funky message-id's like 20240227-flawless-capybara-of-drama-e09653@lemur. :)

 
Read more...

from Jakub Kicinski

Developments in Linux kernel networking accomplished by many excellent developers and as remembered by Andrew L, Eric D, Jakub K and Paolo A.

Intro

The end of the Linux v6.2 merge window coincided with the end of 2022, and the v6.8 window had just begun, meaning that during 2023 we developed for 6 kernel releases (v6.3 – v6.8). Throughout those releases netdev patch handlers (DaveM, Jakub, Paolo) applied 7243 patches, and the resulting pull requests to Linus described the changes in 6398 words. Given the volume of work we cannot go over every improvement, or even cover networking sub-trees in much detail (BPF enhancements… wireless work on WiFi 7…). We instead try to focus on major themes, and developments we subjectively find interesting.

Core and protocol stack

Some kernel-wide winds of development have blown our way in 2023. In v6.5 we saw an addition of SCM_PIDFD and SO_PEERPIDFD APIs for credential passing over UNIX sockets. The APIs duplicate existing ones but are using pidfds rather than integer PIDs. We have also seen a number of real-time related patches throughout the year.

v6.5 brought a major overhaul of the socket splice implementation. Instead of feeding data into sockets page by page via a .sendpage callback, the socket .sendmsg handlers were extended to allow taking a reference on the data in struct msghdr. Continuing with the category of “scary refactoring work”, we have also merged an overhaul of locking in two subsystems – the wireless stack and devlink.

Early in the year we saw the tail end of the BIG TCP development (the ability to send chunks of more than 64kB of data through the stack at a time). v6.3 added support for BIG TCP over IPv4; the initial implementation in 2021 supported only IPv6, as the IPv4 packet header has no way of expressing lengths which don’t fit in 16 bits. The v6.4 release also made the size of the “page fragment” array in the skb configurable at compilation time. A larger array increases the packet metadata size, but also increases the chances of being able to use BIG TCP when data is scattered across many pages.

Networking needs to allocate (and free) packet buffers at a staggering rate, and we see a continuous stream of improvements in this area. Most of the work these days centers on the page_pool infrastructure. v6.5 enabled recycling freed pages back to the pool without using any locks or atomic operations (when recycling happens in the same softirq context in which we expect the allocator to run). v6.7 reworked the API making allocation of arbitrary-size buffers (rather than pages) easier, also allowing removal of PAGE_SIZE-dependent logic from some drivers (16kB pages on ARM64 are increasingly important). v6.8 added uAPI for querying page_pool statistics over Netlink. Looking forward – there’s ongoing work to allow page_pools to allocate either special (user-mapped, or huge page backed) pages or buffers without struct page (DMABUF memory). In the non-page_pool world – a new slab cache was also added to avoid having to read struct page associated with the skb heads at freeing time, avoiding potential cache misses.

A number of key networking data structures (skb, netdevice, page_pool, sock, netns, mibs, nftables, fq scheduler) have been reorganized to optimize cacheline consumption and avoid cache misses. This reportedly improved TCP RPC performance with many connections on some AMD systems by as much as 40%.

In v6.7 the commonly used Fair Queuing (FQ) packet scheduler has gained built-in support for 3 levels of priority and ability to bypass queuing completely if the packet can be sent immediately (resulting in a 5% speedup for TCP RPCs).

Notable TCP developments this year include TCP Auth Option (RFC 5925) support, support for microsecond resolution of timestamps in the TimeStamp Option, and ACK batching optimizations.

Multi-Path TCP (MPTCP) is slowly coming to maturity, with most development effort focusing on reducing the features gap with plain TCP in terms of supported socket options, and increasing observability and introspection via native diag interface. Additionally, MPTCP has gained eBPF support to implement custom packet schedulers and simplify the migration of existing TCP applications to the multi-path variant.

Transport encryption continues to be very active as well. An increasing number of NICs support some form of crypto offload (TLS, IPsec, MACsec). This year notably we gained in-kernel users (NFS, NVMe, i.e. storage) of TLS encryption. Because the kernel doesn’t have support for performing the TLS handshake by itself, a new mechanism was developed to hand over kernel-initiated TCP sockets to user space temporarily, where a well-tested user space library like OpenSSL or GnuTLS can perform a TLS handshake and negotiation, and then hand the connection back over to the kernel, with the keys installed.

The venerable bridge implementation has gained a few features. The majority of bridge development these days is driven by offloads (controlling hardware switches) and, in the case of data center switches, EVPN support. Users can now limit the number of FDB and MDB auto-learned entries and selectively flush them in both bridge and VxLAN tunnels. v6.5 added the ability to selectively forward packets to VxLAN tunnels depending on whether they had missed the FDB in the lower bridge.

Among changes which may be more immediately visible to users – starting from v6.5 the IPv6 stack no longer prints the “link becomes ready” message when interface is brought up.

The AF_XDP zero-copy sockets have gained two major features in 2023. In v6.6 we gained multi-buffer support which allows transferring packets which do not fit in a single buffer (scatter-gather). v6.8 added Tx metadata support, enabling NIC Tx offloads on packets sent on AF_XDP sockets (checksumming, segmentation) as well as timestamping.

Early in the year we merged specifications and tooling for describing Netlink messages in YAML format. This work has grown to cover most major Netlink families (both legacy and generic). The specs are used to generate kernel ops/parsers, the uAPI headers, and documentation. User space can leverage the specs to serialize/deserialize Netlink messages without having to manually write parsers (C and Python have the support so far).

Device APIs

Apart from describing existing Netlink families, the YAML specs were put to use in defining new APIs. The “netdev” family was created to expose network device internals (BPF/XDP capabilities, information about device queues, NAPI instances, interrupt mapping etc.)

In the “ethtool” family – v6.3 brought APIs for configuring Ethernet Physical Layer Collision Avoidance (PLCA) (802.3cg-2019, a modern version of shared medium Ethernet) and the MAC Merge layer (IEEE 802.3-2018 clause 99, allowing preemption of low priority frames by high priority frames).

After many attempts we have finally gained solid integration between the networking and the LED subsystems, allowing hardware-driven blinking of LEDs on Ethernet ports and SFPs to be configured using Linux LED APIs. Driver developers are working through the backlog of all devices which need this integration.

In general, automotive Ethernet-related contributions grew significantly in 2023, and with it, more interest in “slow” networking like 10Mbps over a single pair. Although the Data Center tends to dominate Linux networking events, the community as a whole is very diverse.

Significant development work went into refactoring and extending time-related networking APIs. Time stamping and time-based scheduling of packets has wide use across network applications (telcos, industrial networks, data centers). The most user visible addition is likely the DPLL subsystem in v6.7, used to configure and monitor atomic clocks and machines which need to forward clock phase between network ports.

Last but not least, late in the year the networking subsystem gained the first Rust API, for writing PHY drivers, as well as a driver implementation (duplicating an existing C driver, for now).

Removed

Inspired by the returning discussion about code removal at the Maintainer Summit, let us mention places in the networking subsystem where code was retired this year. First and foremost, in v6.8 wireless maintainers removed a lot of very old WiFi drivers; earlier, in v6.3, they also retired parts of WEP security. In v6.7 some parts of AppleTalk have been removed. In v6.3 (and v6.8) we retired a number of packet schedulers and packet classifiers from the TC subsystem (act_ipt, act_rsvp, act_tcindex, sch_atm, sch_cbq, sch_dsmark). This was partially driven by an influx of syzbot and bug-bounty-driven security reports (there are many ways to earn money with Linux, turns out 🙂). Finally, the kernel parts of the bpfilter experiment were removed in v6.8, as the development effort had moved to user space.

Community & process

The maintainers, developers and community members had a chance to meet at the BPF/netdev track at Linux Plumbers in Richmond, and the netdev.conf 0x17 conference in Vancouver. 2023 was also the first time since the COVID pandemic when we organized the small netconf gathering – thanks to Meta for sponsoring and Kernel Recipes for hosting us in Paris!

We have made minor improvements to the mailing list development process by allowing a wider set of folks to update patch status using simple “mailbot commands”. Patch authors and anyone listed in MAINTAINERS for file paths touched by a patch series can now update the submission state in patchwork themselves.

The per-release development statistics, started late in the previous year, are now an established part of the netdev process, marking the end of each development cycle. They proved to be appreciated by the community and, more importantly, to somewhat steer some of the less participatory citizens towards better and more frequent contributions, especially on the review side.

A small but growing number of silicon vendors have started to try to mainline drivers without having the necessary experience, or mentoring needed to effectively participate in the upstream process. Some without consulting any of our documentation, others without consulting teams within their organization with more upstream experience. This has resulted in poor quality patch sets, taken up valuable time from the reviewers and led to reviewer frustration.

Much like the kernel community at large, we have been steadily shifting the focus onto kernel testing, or rather integrating testing into our development process. In the olden days the kernel tree did not carry many tests, and testing had been seen as something largely external to the kernel project. The tools/testing/selftests directory was only created in 2012, and lib/kunit in 2019! We have accumulated a number of selftests for networking over the years; in 2023 there were multiple large selftest refactoring and speed-up efforts. Our netdev CI started running all kunit tests and networking selftests on posted patches (although, to be honest, the selftest runner only started working in January 2024 🙂).

syzbot stands out among “external” test projects which are particularly valuable for networking. We had fixed roughly 200 syzbot-reported bugs. This took a significant amount of maintainer work but in general we find syzbot bug reports to be useful, high quality and a pleasure to work on.

  • 6.3: https://lore.kernel.org/all/20230221233808.1565509-1-kuba@kernel.org/
  • 6.4: https://lore.kernel.org/all/20230426143118.53556-1-pabeni@redhat.com/
  • 6.5: https://lore.kernel.org/all/20230627184830.1205815-1-kuba@kernel.org/
  • 6.6: https://lore.kernel.org/all/20230829125950.39432-1-pabeni@redhat.com/
  • 6.7: https://lore.kernel.org/all/20231028011741.2400327-1-kuba@kernel.org/
  • 6.8: https://lore.kernel.org/all/20240109162323.427562-1-pabeni@redhat.com/

 
Read more...

from arnd

Most compilers have an option to warn about a function that has a global definition but no declaration: gcc has had -Wmissing-prototypes as far back as the 1990s, and the sparse checker introduced -Wdecl back in 2005. Ensuring that each function has a declaration helps validate that the caller and the callee expect the same argument types; it can also help find unused functions, and it helps mark functions as static where possible to improve inter-function optimizations.
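
As a hedged illustration (file and function names are made up), this is the kind of definition the warning flags, together with the two usual ways to address it:

/* foo.c: gcc -Wmissing-prototypes warns here, because helper() has
 * external linkage but no prototype is visible at its definition. */
int helper(int x)
{
        return x * 2;
}

/* Fix 1: if the function is only used in this file, make it static
 * (which also enables better inter-function optimization). */
static int local_helper(int x)
{
        return x * 2;
}

/* Fix 2: if the function is used from other files, declare it in a
 * header included both here and by the callers, e.g. in foo.h:
 *
 *     int helper(int x);
 */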

The warnings are not enabled in a default build, but are part of both the make W=1 and make C=1 builds, and in fact this used to cause most of the output of the former. As a number of subsystems have moved to eliminating all the W=1 warnings in their code, and the 0-day bot warns about newly introduced warnings, the amount of warning output from this has gone down over time.

After I saw a few patches addressing individual warnings in this area, I had a look at what actually remains. For my soc tree maintenance, I already run my own build bot that checks the output of “make randconfig” builds for 32-bit and 64-bit arm as well as x86, and apply local bugfixes to address any warning or error I get. I then enabled -Wmissing-prototypes unconditionally and added patches to address every single new bug I found, around 140 in total.

I uploaded the patches to https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=missing-prototypes and am sending them to the respective maintainers separately. Once all of these, or some other way to address each warning, can be merged into the mainline kernel, the warning option can be moved from W=1 to the default set.

The patches are all independent of one another, so I hope that most of them can get applied to subsystems directly as soon as I post them.

Some of the remaining architectures are already clean, while others will need follow-up patches for this. Another possible follow-up is to also address -Wmissing-variable-declarations warnings. This option is understood by clang but not enabled by the kernel build system, and not implemented by gcc, with the feature request being open since 2017.

 
Read more...

from Jakub Kicinski

LWN's development statistics have been published at the end of each release cycle for as long as I can remember (Linux 6.3 stats). Thinking back, I can divide the stages of my career based on my relationship with those stats. Fandom; aspiring; success; cynicism; professionalism (showing the stats to my manager). The last one gave me the most pause.

Developers will agree (I think) that patch count is not a great metric for the value of the work. Yet, most of my managers had a distinct spark in their eye when I shared the fact that some random API refactoring landed me in the top 10.

Understanding the value of independently published statistics and putting in the necessary work to calculate them release after release is one of many things we should be thankful for to LWN.

Local stats

With that in mind it's only logical to explore calculating local subsystem statistics. Global kernel statistics can only go so far. The top 20 can only, by definition, highlight the work of 20 people, and we have thousands of developers working on each release. The networking list alone sees around 700 people participate in discussions for each release.

Another relatively recent development which opens up opportunities is the creation of the lore archive. Specifically how easy it is now to download and process any mailing list's history. LWN stats are generated primarily based on git logs. Without going into too much of a sidebar – if we care about the kernel community not how much code various corporations can ship into the kernel – mailing list data mining is a better approach than git data mining. Global mailing list stats would be a challenge but subsystems are usually tied to a single list.

netdev stats

During the 6.1 merge window I could no longer resist the temptation and I threw some Python and the lore archive of netdev into a blender. My initial goal was to highlight the work of people who review patches, rather than only ship code, or bombard the mailing list with trivial patches of varying quality. I compiled stats for the last 4 release cycles (6.1, 6.2, 6.3, and 6.4), each with more data and metrics. Kernel developers are, outside of matters relating to their code, generally quiet beasts so I haven't received a ton of feedback. If we trust the statistics themselves, however — the review tags on patches applied directly by networking maintainers have increased from around 30% to an unbelievable 65%.

We've also seen a significant decrease in the number of trivial patches sent by semi-automated bots (possibly to game the git-based stats). It may be a result of other push-back against such efforts, so I can't take all the credit :)

Random example

I should probably give some more example stats. The individual and company stats generated for netdev are likely not that interesting to a reader outside of netdev, but perhaps the “developer tenure” stats will be. I calculated those to see whether we have a healthy number of new members.

Time since first commit in the git history for reviewers
 0- 3mo   |   2 | *
 3- 6mo   |   3 | **
6mo-1yr   |   9 | *******
 1- 2yr   |  23 | ******************
 2- 4yr   |  33 | ##########################
 4- 6yr   |  43 | ##################################
 6- 8yr   |  36 | #############################
 8-10yr   |  40 | ################################
10-12yr   |  31 | #########################
12-14yr   |  33 | ##########################
14-16yr   |  31 | #########################
16-18yr   |  46 | #####################################
18-20yr   |  49 | #######################################

Time since first commit in the git history for authors
 0- 3mo   |  40 | **************************
 3- 6mo   |  15 | **********
6mo-1yr   |  23 | ***************
 1- 2yr   |  49 | ********************************
 2- 4yr   |  47 | ###############################
 4- 6yr   |  50 | #################################
 6- 8yr   |  31 | ####################
 8-10yr   |  33 | #####################
10-12yr   |  19 | ############
12-14yr   |  25 | ################
14-16yr   |  22 | ##############
16-18yr   |  32 | #####################
18-20yr   |  31 | ####################

As I shared on the list – the “recent” buckets are sparse for reviewers and more filled for authors, as expected. What I haven't said is that if one steps away from the screen to look at the general shape of the histograms, things are not perfect. The author and the reviewer histograms seem to skew in opposite directions. I'll leave it to the reader to ponder what the perfect shape of such a graph should be for a project; I have my hunch. Regardless, I'm hoping we can learn something by tracking its changes over time.

Fin

To summarize – I think that spending a day in each release cycle to hack on/generate development stats for the community is a good investment of maintainer's time. They let us show appreciation, check our own biases and by carefully selecting the metrics – encourage good behavior. My hacky code is available on GitHub, FWIW, but using mine may go against the benefits of locality? LWN's code is also available publicly (search for gitdm, IIRC).

 
Read more...

from linusw

Recently I needed to understand how the ARM32 architecture switches control of execution between normal userspace processes and the kernel processes, such as the init task and the kernel threads. Understanding this invariably involves understanding two aspects of the ARM32 kernel:

  • How tasks are actually scheduled on ARM32
  • How the kernelspace and userspace are actually separated, and thus how we move from one to the other

This is going to require knowledge from some other (linked) articles and a good understanding of ARM32 assembly.

Terminology

With tasks we mean processes, threads and kernel threads. The kernel scheduler sees no major difference between these: they are all schedulable entities that live on a certain CPU.

Kernel threads are the easiest to understand: in the big computer program that is the kernel, different threads execute on behalf of managing the kernel. They are all instantiated by a special thread called kthreadd — the kernel thread daemon. They exist for various purposes, one is to provide process context to interrupt threads, another to run workqueues such as delayed work and so on. It is handy for e.g. kernel drivers to be able to hand over execution to a process context that can churn on in the background.

Processes in userspace are in essence executing computer programs, or objects with an older terminology, giving the origin of expressions such as object file format. The kernel will start very few such processes, but modprobe and init (which always has process ID 1) are notable exceptions. Any other userspace processes are started by init. Processes can fork new processes, and it can also create separate threads of execution within itself, and these will become schedulable entities as well, so a certain process (executing computer program) can have concurrency within itself. POSIX threads is usually the way this happens and further abstractions such as the GLib GThread etc exist.

Task pie chart: A pie chart of tasks according to priority on a certain system, produced using CGFreak, shows that from a scheduler point of view there are just tasks; any kernel threads or threads spawned from processes simply become schedulable task entities.

The userspace is the commonplace name given to a specific context of execution where we execute processes. What defines this context is that it has its own memory context, a unique MMU table, which in the ARM32 case gives each process a huge virtual memory to live in. Its execution is isolated from the kernel and also from other processes, but not from its own threads (typically POSIX threads). To communicate with either the kernel or other userspace processes, it needs to use system calls “syscalls” or emit or receive signals. Both mechanisms are realized as software interrupts. (To communicate with its own spawn threads, shortcuts are available.)

The kernelspace conversely is the context of execution of the operating system, in our case Linux. It has its own memory context (MMU table) but some of the kernel memory is usually also accessible by the userspace processes, and the virtual memory space is shared, so that exceptions can jump directly into kernel code in virtual memory, and the kernel can directly read and write into userspace memory. This is done like so to facilitate quick communication between the kernel and userspace. Depending on the architecture we are executing Linux on, executing in kernelspace is associated with elevated machine privileges, and means the operating system can issue certain privileged instructions or otherwise access some certain restricted resources. The MMU table permissions protects kernel code from being inspected or overwritten by userspace processes.

Background

This separation, along with everything else we take for granted in modern computers and operating systems, was created in the first time-sharing systems such as the CTSS running on the IBM 700/7000 series computers in the late 1950s. The Ferranti Atlas Computer in 1962-1967 and its supervisor program followed shortly after these. The Atlas invented nifty features such as virtual memory and memory-mapped I/O, and was of course also using time-sharing. As can be easily guessed, these computer and operating system (supervisor) designs inspired the hardware design and operating system designs of later computers such as the PDP-11, where Unix began. This is why Unix-like operating systems such as Linux more or less take all of these features and concepts for granted.

The idea of a supervisor or operating system goes deep into the design of CPUs. For example, the Motorola 68000 CPU had three function code pins routed out on the package, FC2, FC1 and FC0, comprising three bits of system mode, with four of these bit combinations representing user data, user program, supervisor data and supervisor program. (These modes even reflect the sectioning of program and supervisor objects into program code or TEXT segments and DATA segments.) In the supervisor mode, FC2 was always asserted. This way physical access to memory-mapped peripherals could be electronically constrained to supervisor mode only. Machines such as the Atari ST exploited this possibility, while others such as the Commodore Amiga did not.

All this said to give you a clear idea why the acronym SVC as in Supervisor Call is used rather than e.g. operating system call or kernel call which would have been more natural. This naming is historical.

Execution Modes or Levels

We will restrict the following discussion to the ARMv4 and later ARM32 architectures which is what Linux supports.

When it comes to the older CPUs in the ARMv4, ARMv5 and ARMv6 range these have a special supervisor mode (SVC mode) and a user mode, and as you could guess these two modes are mapped directly to kernelspace and userspace in Linux. In addition to this there are actually 5 additional exception modes for FIQ, IRQ, system mode, abort and undefined, so 7 modes in total! To cut a long story short, all of the modes except the user mode belong to kernelspace.

Apart from restricting certain instructions, the only thing actually separating the kernelspace from userspace is the MMU, which is protecting kernelspace from userspace in the same way that different userspace processes are protected from each other: by using virtual memory to hide physical memory, and in the cases where it is not hidden: using protection bits in the page table to restrict access to certain memory areas. The MMU table can naturally only be altered from supervisor mode and this way it is clear who is in control.

Later versions of the ARM32 CPU, such as ARMv7, add some further modes: a hypervisor mode for virtualization and an even deeper secure monitor, or just monitor, mode.

For reference, these modes in the ARMv8 architecture correspond to “privilege levels”. Here the kernelspace executes at exception level EL1, and userspace at exception level EL0, then there are further EL2 and EL3 “higher” privilege levels. EL2 is used for the hypervisor (virtualization) and EL3 is used for a secure monitor that oversees the switch back and forth to the trusted execution environment (TEE), which is a parallel and different operating environment, essentially like a different computer: Linux can interact with it (as can be seen in drivers/tee in the kernel) but it is a different thing than Linux entirely.

These higher privilege levels and the secure mode with its hypervisor and TEE are not always used and may be dormant. Strictly speaking, the security and virtualization functionality is optional, so it is perfectly fine to fabricate ARMv7 silicon without them. To accompany the supervisor call (SVC), on ARMv7 a hypervisor call (HVC) and a secure monitor call (SMC) instruction were added.

Exceptional Events

We discussed that different execution modes pertain to certain exceptions. So let's recap ARM32 exceptions.

As exceptions go, these happen both in kernelspace and userspace, but they are always handled in kernelspace. If a userspace process, for example, divides by zero, an exception occurs that takes us into the kernel, all the time pushing state onto the stack, and resuming execution inside the kernel, which will simply terminate the process. If the kernel itself divides by zero we get a kernel crash since there is no way out.

The most natural exception is of course a hardware interrupt, such as when a user presses a key or a hard disk signals that a sector of data has been placed in a buffer, or a network card indicates that an ethernet packet is available from the interface.

Additionally, as mentioned previously, most architectures support a special type of software exception that is initiated for carrying out system calls, and on ARM and AArch64 that is what these days is called the SVC (supervisor call) instruction. This very same instruction — i.e. with the same binary operation code — was previously called SWI (software interrupt), which makes things a bit confusing at times, especially when reading old documentation and old code, but the assembly mnemonics SVC and SWI have the same semantics. For comparison, on m68k this instruction is named TRAP, on x86 there is the INT instruction and RISC-V has the SBI (supervisor binary interface) call.

In my article about how the ARM32 architecture is set up I talk about the exception vector table, which is eight 32-bit pointers stored in virtual memory from 0xFFFF0000 to 0xFFFF0020, and it corresponds roughly to everything that can take us from kernelspace to userspace and back.

The transitions occurs at these distinct points:

  • A hardware RESET occurs. This is pretty obvious: we need to abort all user program execution, return to the kernel and take everything offline.
  • An undefined instruction is encountered. The program flow cannot continue if this happens and the kernel has to do something about it. The most typical use for this is to implement software fallback for floating-point arithmetic instructions that some hardware may be lacking. These fallbacks will in that case be implemented by the kernel. (Since doing this with a context switch and software fallback in the kernel is expensive, you would normally just compile the program with a compiler that replace the floating point instructions with software fallbacks to begin with, but not everyone has the luxury of source code and build environment available and have to run pre-compiled binaries with floating point instructions.)
  • A software interrupt occurs. This is the most common way that a userspace application issues a system call (supervisor call) into the operating system. As mentioned, on ARM32 this is implemented by the special SVC (aka SWI) instruction that passes a 1-byte parameter to the software interrupt handler.
  • A prefetch abort occurs. This happens when the instruction pointer runs into unpaged memory, and the virtual memory manager (mm) needs to page in new virtual memory to continue execution. Naturally this is a kernel task.
  • A data abort occurs. This is essentially the same as the prefetch abort but the program is trying to access unpaged data rather than unpaged instructions.
  • An address exception occurs. This doesn't happen on modern ARM32 CPUs, because the exception is for when the CPU moves outside the former 26bit address space on ARM26 architectures that Linux no longer supports.
  • A hardware interrupt occurs – since the operating system handles all hardware, naturally whenever one of these occurs, we have to switch to kernel context. The ARM CPUs have two hardware interrupt lines: IRQ and FIQ. Each can be routed to an external interrupt controller, the most common being the GIC (Generic Interrupt Controller), especially for multicore systems, but many ARM systems use their own, custom interrupt controllers.
  • A fault occurs such as through division by zero or other arithmetic fault – the CPU runs into an undefined state and has no idea how to recover and continue. This is also called a processor abort.

That's all. But these are indeed exceptions. What is the rule? The computer programs that correspond to the kernel and each userspace process have to start somewhere, and then they are executed in time slices, which means that somehow they get interrupted by one of these exceptions and preempted, a procedure that in turn invariably involves transitions back and forth from userspace to kernelspace and back into userspace again.

So how does that actually happen? Let's look at that next.

Entering Kernelspace

Everything has a beginning. I have explained in a previous article how the kernel bootstraps from the function start_kernel() in init/main.c and sets up the architecture including virtual memory to a point where the architecture-neutral parts of the kernel starts executing.

Further down start_kernel() we initialize the timers, start the clocksource (the Linux system timeline) and initialize the scheduler so that process scheduling can happen. But nothing really happens, because there are no processes. Then the kernel reaches the end of the start_kernel() function where arch_call_rest_init() is called. This is in most cases a call to rest_init() in the same file (only S390 does anything different) and that in turn actually initializes some processes:

pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
(...)
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);

We create separate threads running the in-kernel functions kernel_init and kthreadd, which is the kernel thread daemon which in turn spawns all kernel threads.

The user_mode_thread() or kernel_thread() calls create a new processing context: they both call kernel_clone() which calls copy_process() with NULL as first argument, meaning it will not actually copy any process but instead create a new one. It will create a new task using dup_task_struct() passing current as argument, which is the init task and thus any new task is eventually derived from the compiled-in init task. Then there is a lot of cloning going on, and we reach copy_thread() which calls back into the architecture to initialize struct thread_info for the new task. This is a struct we will look at later, but notice one thing, and that is that when a new kernel or user mode thread is created like this (with a function such as kernel_init passed instead of just forking), the following happens:

memset(childregs, 0, sizeof(struct pt_regs));
thread->cpu_context.r4 = (unsigned long)args->fn_arg;
thread->cpu_context.r5 = (unsigned long)args->fn;
childregs->ARM_cpsr = SVC_MODE;
(...)
thread->cpu_context.pc = (unsigned long)ret_from_fork;

fn_arg will be NULL in this case but fn is kernel_init or kthreadd. And we execute in SVC_MODE which is the supervisor mode: as the kernel. Also, user mode threads are initialized as supervisor mode tasks to begin with, but they will eventually modify themselves into userspace tasks. Setting the CPU context to ret_from_fork will be significant, so notice this!

Neither of the functions kernel_init or kthreadd will execute at this point! We will just return. The threads are initialized but nothing is scheduled yet: we have not yet called schedule() a single time, which means nothing happens, because nothing is yet scheduled.

kernel_init is a function in the same file that, as indicated, will initialize the first userspace process. If you inspect this function you will see that it keeps executing some kernel code for quite a while: it waits for kthreadd to finish initialization so the kernel is ready for action, then it will actually do some housekeeping such as freeing up the kernel initmem (functions tagged __init) and only then proceed to run_init_process(). As indicated, this will start the init process using kernel_execve(), usually /sbin/init, which will then proceed to spawn all usermode processes/tasks. kernel_execve() will check for supported binary formats and most likely call the ELF loader to process the binary and page in the file into memory from the file system etc. If this goes well, it will end with a call to the macro START_THREAD() which in turn wraps the ARM32-specific start_thread() which will, amongst other things, do this:

regs->ARM_cpsr = USR_MODE;
(...)
regs->ARM_pc = pc & ~1;

So the new userspace process will get pushed into userspace mode by the ELF loader, and that will also set the program counter to wherever the ELF file is set to execute. regs->ARM_cpsr will be pushed into the CPSR register when the task is scheduled, and we start the first task executing in userspace.

kthreadd on the other hand will execute a perpetual loop starting other kernel daemons as they are placed on a creation list.

But as said: neither is executing.

In order to actually start the scheduling we call schedule_preempt_disabled() which will issue schedule() with preemption disabled: we can schedule tasks, and they will not interrupt each other (preempt) in fine granular manner, so the scheduling is more “blocky” at this point. However: we already have the clockevent timer running so that the operating system is now ticking, and new calls to the main scheduler callbacks scheduler_tick() and schedule() will happen from different points in future time, at least at the system tick granularity (HZ) if nothing else happens. We will explain more about this further on in the article.

Until this point we have been running in the context of the Linux init task, which is an elusive hard-coded kernel thread with PID 0 that is defined in init/init_task.c and which I have briefly discussed in a previous article. This task does not even appear in procfs in /proc.

As we call schedule(), the kernel init task will preempt and give way to kthreadd and then to the userspace init process. However when the scheduler again schedules the init task with PID 0, we return to rest_init(), and we will call cpu_startup_entry(CPUHP_ONLINE) and that function is in kernel/sched/idle.c and looks like this:

void cpu_startup_entry(enum cpuhp_state state)
{
        arch_cpu_idle_prepare();
        cpuhp_online_idle(state);
        while (1)
                do_idle();
}

That's right: this function never returns. Nothing ever breaks out of the while(1) loop. All that do_idle() does is to wait until no tasks are scheduling, and then call down into the cpuidle subsystem. This will make the CPU “idle” i.e. sleep, since nothing is going on. Then the loop repeats. The kernel init task, PID 0 or “main() function” that begins at start_kernel() and ends here, will just try to push down the system to idle, forever. So this is the eventual fate of the init task. The kernel has some documentation of the inner loop that assumes that you know this context.

Let's look closer at do_idle() in the same file, which has roughly this look (the actual code is more complex, but this is the spirit of it):

while (!need_resched()) {
    local_irq_disable();
    enter_arch_idle_code();
    /* here a considerable amount of wall-clock time can pass */
    exit_arch_idle_code();
    local_irq_enable();
}
(...)
schedule_idle();

This will spin here until something else needs to be scheduled, meaning the init task has the TIF_NEED_RESCHED bit set, and should be preempted. The call to schedule_idle() soon after exiting this loop makes sure that this rescheduling actually happens: this calls right into the scheduler to select a new task and is a variant of the more generic schedule() call which we will see later.

We will look into the details soon, but we see the basic pattern of this perpetual task: check whether someone else needs to run, otherwise idle; when someone else wants to run, stop idling and explicitly yield to whatever task was waiting.

Scheduling the first task

So we know that schedule() has been called once on the primary CPU, and we know that this will set the memory management context to the first task, set the program counter to it and execute it. This is the most brutal approach to having a process scheduled, and we will detail what happens further down.

We must however look at the bigger picture of kernel preemption to get the full picture of what happens here.

Scheduler model: A mental model of the scheduler: scheduler_tick() sets the flag TIF_NEED_RESCHED and a later call to schedule() will actually call out to check_and_switch_context(), which does the job of switching tasks.

Scheduler tick and TIF_NEED_RESCHED

As part of booting the kernel in start_kernel() we first initialized the scheduler with a call to sched_init() and the system tick with a call to tick_init() and then the timer drivers using time_init(). The time_init() call will go through some loops and hoops and end up initializing and registering the clocksource driver(s) for the system, such as those that can be found in drivers/clocksource.

There will sometimes be only a broadcast timer to be used by all CPU:s on the system (the interrupts will need to be broadcast to all the CPU:s using IPIs, inter-processor interrupts), and sometimes more elaborate architectures have timers dedicated to each CPU, so these can be used individually by each core to plan events and drive the system tick on that specific CPU.

The most suitable timer will also be started as part of the clockevent device(s) being registered. However, its interrupt will not be able to fire until local_irq_enable() is called further down in start_kernel(). After this point the system has a running scheduling tick.

As scheduling happens separately on each CPU, scheduler timer interrupts and rescheduling calls need to be done separately on each CPU as well.

The clockevent drivers can provide a periodic tick, in which case the process will be interrupted after an appropriate number of ticks, or the driver can provide oneshot interrupts, in which case it can plan an event further on, avoiding firing interrupts just for ticking and switching while a task is running (a shortcut known as NO_HZ).

What we know for sure is that this subsystem always has a new tick event planned for the system. It can happen in 1/HZ seconds if periodic ticks are used, or it can happen several minutes into the future if nothing happens for a while in the system.

When the clockevent eventually fires, in the form of an interrupt from the timer, it calls its own ->event_handler() which is set up by the clockevent subsystem code. When the interrupt happens it will fast-forward the system tick by repetitive calls to do_timer() followed by a call to scheduler_tick(). (We reach this point through different paths depending on whether HRTimers and other kernel features are enabled or not.)

As a result of calling scheduler_tick(), some scheduler policy code such as deadline, CFS, etc (this is explained by many others elsewhere) will decide that the current task needs to be preempted, “rescheduled” and calls resched_curr(rq) on the runqueue for the CPU, which in turn will call set_tsk_need_resched(curr) on the current task, which flags it as ready to be rescheduled.

set_tsk_need_resched() will set the flag TIF_NEED_RESCHED for the task. The flag is implemented as an arch-specific bitfield, in the ARM32 case in arch/arm/include/asm/thread_info.h and ARM32 has a bitmask version of this flag helpfully named _TIF_NEED_RESCHED that can be used by assembly snippets to check it quickly with a logical AND operation.

This bit having been set does not in any way mean that a new process will start executing immediately. The flag semantically means “at your earliest convenience, yield to another task”. So the kernel waits until it finds an appropriate time to preempt the task, and that time is when schedule() is called.
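
As a rough sketch (simplified, with made-up names; the real helpers are set_tsk_need_resched() and the need_resched()/thread-flag accessors in the scheduler and arch headers), the mechanism boils down to setting and testing one bit in the thread_info flags word:

#define TIF_NEED_RESCHED    1
#define _TIF_NEED_RESCHED   (1UL << TIF_NEED_RESCHED)

struct sketch_thread_info {
        unsigned long flags;
};

/* Called on the scheduler_tick() path when the current task should yield. */
static inline void sketch_set_need_resched(struct sketch_thread_info *ti)
{
        ti->flags |= _TIF_NEED_RESCHED;
}

/* Tested (in C, or with a single AND in assembly) before exception return;
 * if the bit is set, schedule() is called instead of simply returning. */
static inline int sketch_need_resched(const struct sketch_thread_info *ti)
{
        return (ti->flags & _TIF_NEED_RESCHED) != 0;
}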

The Task State and Stack

We mentioned the architecture-specific struct thread_info so let's hash out where that is actually stored. It is a simpler story than it used to be, because these days the ARM32 thread_info is simply part of the task_struct. The struct task_struct is the central per-task information repository that the generic parts of the Linux kernel hold for a certain task, and it is paramount to keeping the task state. Here is a simplified view that gives you an idea about how much information and pointers it actually contains:

struct task_struct {
    struct thread_info thread_info;
    (...)
    unsigned int state;
    (...)
    void *stack;
    (...)
    struct mm_struct *mm;
    (...)
    pid_t pid;
    (...)
};

The struct thread_info which in our case is a member of task_struct contains all the architecture-specific aspects of the state.

The task_struct refers to thread_info, but also to a separate piece of memory void *stack called the task stack, which is where the task will store its activation records when executing code. The task stack is of size THREAD_SIZE, usually 8KB (2 * PAGE_SIZE). These days, in most systems, the task stack is mapped into the VMALLOC area.

The last paragraph deserves some special mention with regards to ARM32 because things changed. Ard Biesheuvel recently first enabled THREAD_INFO_IN_TASK, which enabled thread info to be contained in the task_struct, and then enabled CONFIG_VMAP_STACK for all systems in the ARM32 kernel. This means that the VMALLOC memory area is used to map and access the task stack. This is good for security reasons: the task stack is a common target for kernel security exploits, and by moving this to the VMALLOC area, which is simply a huge area of virtual memory addresses, and surrounding it below and above with unmapped pages, we will get a page violation if the kernel tries to access memory outside the current task stack!

Task struct: The task_struct in the Linux kernel is where the kernel keeps a nexus of all information about a certain task, i.e. a certain processing context. It contains .mm, the memory context where all the virtual memory mappings live for the task. The thread_info is inside it, and inside the thread_info is a cpu_context_save. It has a task stack of size THREAD_SIZE for ARM32, which is typically twice the PAGE_SIZE, i.e. 8KB, surrounded by unmapped memory for protection. Again, this memory is mapped in the memory context of the process. The split between task_struct and thread_info is such that task_struct is Linux-generic and thread_info is architecture-specific, and they correspond 1-to-1.

Actual Preemption

In my mind, preemption happens when the program counter is actually set to a code segment in a different process, and this will happen at different points depending on how the kernel is configured. This happens as a result of schedule() getting called, and will in essence be a call down to the architecture to switch memory management context and active task. But where and when does schedule() get called?

schedule() can be called for two reasons:

  • Voluntary preemption: such as when a kernel thread wants to give up its time slice because it knows it cannot proceed for a while. This is the case for most instances of this call that you find in the kernel. In the special case when we start the kernel and call schedule_preempt_disabled() the very first time, we voluntarily preempt the kernel execution of the init task with PID 0 to instead execute whatever is queued and prioritized in the scheduler, and that will be the kthreadd process. Other places can be found by git grep:ing for calls to cond_resched() or just an explicit call to schedule().
  • Forced preemption: this happens when a task is simply scheduled out. This happens to kernel threads and userspace processes alike, when a process has used up its timeslice and scheduler_tick() has set the TIF_NEED_RESCHED flag. We described in the previous section how this flag gets set from the scheduler tick.

Places where forced preemption happens:

The short answer to the question “where does forced preemption happen?” is “at the end of exception handlers”. Here are the details.

The most classical place for preemption of userspace processes is on the return path of a system call. This happens from arch/arm/kernel/entry-common.S in the assembly snippets for ret_slow_syscall() and ret_fast_syscall(), where the ARM32 kernel makes an explicit call to do_work_pending() in arch/arm/kernel/signal.c. This will issue a call to schedule() if the flag _TIF_NEED_RESCHED is set for the thread, and the kernel will hand over execution to whichever task is prioritized next, no matter whether it is a userspace or kernelspace task. A special case is ret_from_fork, which means a new userspace process has been forked; in many cases the parent gets preempted immediately in favor of the new child through this path.

The most common place for preemption is however when returning from a hardware interrupt. Interrupts on ARM32 are handled in assembly in arch/arm/kernel/entry-armv.S with a piece of assembly that saves the processor state for the current CPU into a struct pt_regs and from there just calls the generic interrupt handling code in kernel/irq/handle.c named generic_handle_arch_irq(). This code is used by other archs than ARM32 and will nominally just store the system state and registers in a struct pt_regs record on entry and restore it on exit. However when the simplistic code in generic_handle_arch_irq() is done, it exits through the same routines in arch/arm/kernel/entry-common.S as fast and slow syscalls, and we can see that in ret_to_user_from_irq the code will explicitly check for the resched and other flags with ldr r1, [tsk, #TI_FLAGS] and branch to the handler doing do_work_pending(), and consequently preempt to another task instead of returning from an interrupt.

Now study do_work_pending():

asmlinkage int
do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
{
        /*
         * The assembly code enters us with IRQs off, (...)
         */

        do {
                if (likely(thread_flags & _TIF_NEED_RESCHED)) {
                        schedule();
                } else {
                        (...)
                }
                local_irq_disable();
                thread_flags = read_thread_flags();
        } while (...);
        return 0;
}

Notice the comment: we enter do_work_pending() with local IRQs disabled so we can't get interrupted by an interrupt (other exceptions can still happen though). Then we likely call schedule() and another thread needs to start to run. When we return after having scheduled another thread we are supposed to proceed to exit the exception handler with interrupts disabled, so that is why the first instruction after the if/else-clause is local_irq_disable() – we might have come back from a kernel thread which was happily executing with interrupts enabled. So disable them. In fact, if you grep for do_work_pending you will see that this looks the same on other architectures with a similar setup.

In reality do_work_pending() does a few more things than preemption: it also handles signals between processes and process termination etc. But for this exercise we only need to know that it calls schedule() followed by local_irq_disable().

The struct pt_regs should be understood as “processor trace registers” which is another historical naming, much due to its use in tracing. On ARM32 it is in reality 18 32-bit words representing all the registers and status bits of the CPU for a certain task, i.e. the CPU state, including the program counter pc, which is the place where the task was supposed to resume execution, unless it got preempted by schedule(). This way, if we preempt and leave a task behind, the CPU state contains all we need to know to continue where we left off. These pt_regs are stored in the task stack during the call to generic_handle_arch_irq().

The assembly in entry-common.S can be a bit hard to follow, so here are the core essentials for a return path from an interrupt that occurs while we are executing in userspace:

	(...)
slow_work_pending:
	mov	r0, sp				@ 'regs'
	mov	r2, why				@ 'syscall'
	bl	do_work_pending
	cmp	r0, #0
	beq	no_work_pending
	(...)

ENTRY(ret_to_user_from_irq)
	ldr	r1, [tsk, #TI_FLAGS]
	movs	r1, r1, lsl #16
	bne	slow_work_pending
no_work_pending:
	asm_trace_hardirqs_on save = 0
	ct_user_enter save = 0
	restore_user_regs fast = 0, offset = 0

We see that when we return from an IRQ, we check the flags in the thread and if any bit is set we branch to execute slow work, which is done by do_work_pending(), which will potentially call schedule(), then return, possibly much later, and if all went fine branch back to no_work_pending and restore the usermode registers and continue execution.

Notice that the exception we are returning from here can be the timer interrupt that was handled by the Linux clockevent and driving the scheduling by calling scheduler_tick()! This means we can preempt directly on the return path of the interrupt that was triggered by the timer tick. This way the slicing of task time is as precise as it can get: scheduler_tick() gets called by the timer interrupt, and if it sets TIF_NEED_RESCHED a different thread will start to execute on our way out of the exception handler!

The same path will be taken by SVC/SWI software exceptions, so these will also lead to rescheduling if necessary. The routine named restore_user_regs can be found in entry-header.S and it will pretty much do what it says, ending with the following instructions (if we remove quirks and assume slowpath):

	mov	r2, sp
	(...)
	ldmdb	r2, {r0 - lr}^			@ get calling r0 - lr
	add	sp, sp, #\offset + PT_REGS_SIZE
	movs	pc, lr				@ return & move spsr_svc into cpsr

r2 is set to the stack pointer, where the pt_regs are stored; these are 17 registers and the CPSR (current program status register). We pull the registers from the stack (including r2, which gets overwritten) — NOTE: the little caret (^) after the ldmdb instruction means that the user mode versions of these registers are loaded, since we are headed back to userspace — then we move the stack pointer past the saved registers and return with movs, which also copies spsr_svc back into the CPSR.

Using the exceptions as a point for preemption is natural: exceptions by their very nature are designed to store the processor state before jumping to the exception handler, and it is strictly defined how to store this state into memory such as onto the per-task task stack, and how to reliably restore it at the end of an exception. So this is a good point to do something else, such as switch to something completely different.

Also notice that this must happen at the end of the interrupt (exception) handler. You can probably imagine what would happen on a system with level-triggered interrupts if we were to, say, preempt at the beginning of the interrupt instead of at the end: we would not reach the hardware interrupt handler, and the interrupt would not be cleared. Instead, we handle the exception, and then when we are done we optionally check if preemption should happen right before returning to the interrupted task.

But let's not skip the last part of what schedule() does.

Setting the Program Counter

So we now know a few places where the system can preempt and on ARM32 we see that this mostly happens in the function named do_work_pending() which in turn will call schedule() for us.

The scheduler's schedule() call is supposed to very quickly select a process to run next. Eventually it will call context_switch() in kernel/sched/core.c, which in turn will do essentially two things (sketched in simplified form right after this list):

  • Check if the next task has a unique memory management context (next->mm is not NULL) and in that case switch the memory management context to the next task. This means updating the MMU to use a different MMU table. Kernel threads do not have any unique memory management context so for those we can just keep the previous context (the kernel virtual memory is mapped into all processes on ARM32 so we can just go on). If the memory management context does switch, we call switch_mm_irqs_off(), which in the ARM32 case is just defined to the ARM32-specific switch_mm(), which will call the ARM32-specific check_and_switch_context() (NOTE that this function for any system with MMU is hidden in the arch/arm/include/asm/mmu_context.h header file), which in turn does one of two things:
    • If interrupts are disabled, we will just set mm->context.switch_pending = 1 so that the memory management context switch will happen at a later time when we are running with interrupts enabled, because it will be very costly to switch task memory context on ARM32 if interrupts are disabled on certain VIVT (virtually indexed, virtually tagged) cache types, and this in turn would cause unpredictable IRQ latencies on these systems. This concerns some ARMv6 cores. The reason why interrupts would be disabled in a schedule() call is that it will be holding a runqueue lock, which in turn disables interrupts. Just like the comment in the code says, this will be done later in the arch-specific finish_arch_post_lock_switch() which is implemented right below and gets called right after dropping the runqueue lock.
    • If interrupts are not disabled, we will immediately call cpu_switch_mm(). This is a per-cpu callback which is written in assembly for each CPU as cpu_NNNN_switch_mm() inside arch/arm/mm/proc-NNNN.S. For example, all v7 CPUs have the cpu_v7_switch_mm() in arch/arm/mm/proc-v7.S.
  • Switch context (such as the register states and stack) to the new task by calling switch_to() with the new task and the previous one as parameters. In most cases this ends up in an architecture-specific __switch_to(). In the ARM32 case, this routine is written in assembly and can be found in arch/arm/kernel/entry-armv.S.
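Stripped of locking, statistics and corner cases, a heavily simplified sketch of context_switch() (not the literal kernel code) looks like this:

static struct rq *context_switch(struct rq *rq, struct task_struct *prev,
                                 struct task_struct *next)
{
        if (!next->mm) {
                /* Kernel thread: it has no memory context of its own,
                 * so keep running on the previous task's mm. */
                next->active_mm = prev->active_mm;
        } else {
                /* User task: point the MMU at the next task's page tables,
                 * on ARM32 via switch_mm()/check_and_switch_context(). */
                switch_mm_irqs_off(prev->active_mm, next->mm, next);
        }

        /* Save the current register state and load the next task's;
         * this ends up in the architecture-specific __switch_to(). */
        switch_to(prev, next, prev);

        return finish_task_switch(prev);
}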

Now the final details happen in __switch_to(), which is supplied the struct thread_info (i.e. the architecture-specific state) for both the task we are switching from and the task we are switching to:

  • We store the registers of the current task in the task stack, at the TI_CPU_SAVE index of struct thread_info, which corresponds to the .cpu_context entry in the struct, which is in turn a struct cpu_context_save (see the sketch right after this list), which is 12 32-bit values to store r4-r9, sl, fp, sp and pc. This is everything needed to continue as if nothing has happened when we “return” after the schedule() call. I put “return” in quotation marks, because a plethora of other tasks may have run before we actually get back there. You may ask why r0, r1, r2 and r3 are not stored. This will be addressed shortly.
  • Then the TLS (Thread Local Storage) settings for the new task are obtained and we issue switch_tls(). On v6 CPUs this has special implications, but in most cases we end up using switch_tls_software() which sets TLS to 0xffff0ff0 for the task. This is a hard-coded value in virtual memory used by the kernel-provided user helpers, which in turn are a few kernel routines “similar to but different from VDSO” that are utilized by the userspace C library. On ARMv7 CPUs that support the thread ID register (TPIDRURO) this will be used to store the struct thread_info pointer, so it cannot be used for TLS on ARMv7. (More on this later.)
  • We then broadcast THREAD_NOTIFY_SWITCH using kernel notifiers. These are usually written in C but called from the assembly snippet __switch_to() here. A notable use case is that if the task is making use of VFP (the Vector Floating Point unit) then the state of the VFP gets saved here, so that it will be cleanly restored when the task resumes as well.
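For reference, struct cpu_context_save looks roughly like this in arch/arm/include/asm/thread_info.h (the two extra slots are what brings the count to 12 32-bit values):

struct cpu_context_save {
        __u32   r4;
        __u32   r5;
        __u32   r6;
        __u32   r7;
        __u32   r8;
        __u32   r9;
        __u32   sl;
        __u32   fp;
        __u32   sp;
        __u32   pc;
        __u32   extra[2];               /* Xscale 'acc' register, etc */
};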

Then we reach the final step in __switch_to(), which is a bit different depending on whether we use CONFIG_VMAP_STACK or not.

The simple path when we are not using VMAP'ed stacks looks like this:

	set_current r7, r8
	ldmia	r4, {r4 - sl, fp, sp, pc}	@ Load all regs saved previously

Here r7 contains a pointer to the next task's thread_info (which will be somewhere in the kernel virtual memory map), and set_current() will store the pointer to that task in such a way that the CPU can look it up with a few instructions at any point in time. On older non-SMP ARMv4 and ARMv5 CPUs this will simply be the memory location pointed out by the label __current, but ARMv7 and SMP systems have a dedicated special CP15 TPIDRURO thread ID register to store this in the CPU so that the thread_info can be located very quickly. (The only user of this information is, no surprise, the get_current() assembly snippet, but that is in turn called from a lot of places and contexts.)

The next ldmia instruction does the real trick: it loads registers r4 thru sl (r10), fp (r11), sp (r13) and pc (r15) from the location pointed out by r4, which again is the .cpu_context entry in the struct thread_info, the struct cpu_context_save, which is all the context there is, including pc, so the next instruction after this will be whatever pc was inside the struct cpu_context_save. We have switched to the new task and preemption is complete.

But wait a minute. r4 and up, you say. So what about the other registers: r0, r1, r2, r3, r12 (ip) and r14 (lr)? Isn't the task we're switching to going to miss those registers?

For r0-r3 the short answer is that when we call schedule() explicitly (which only happens inside the kernel) then r0 thru r3 are scratch registers that are free to be “clobbered” during any function call. So since we call schedule() the caller should be prepared that those registers are clobbered anyway. The same goes for the status register CPSR. It's a function call to inline assembly and not an exception.

And even if we look around the context after a call to schedule(), since we were either (A) starting a brand new task or (B) on our way out of an exception handler for a software or hardware interrupt or (C) explicitly called schedule() when this happened, this just doesn't matter.

Then r12 is a scratch register and we are not calling down the stack using lr at this point either (we just jump to pc!) so these two do not need to be saved or restored. (On the ARM or VMAP exit path you will find ip and lr being used.)

When starting a completely new task all the contents of struct cpu_context_save will be zero, and the return address will be set to ret_from_fork, and then the new task will bootstrap itself in userspace or as a kernel thread anyway.

If we're on the exit path of an exception handler, we call various C functions and r0 thru r3 are used as scratch registers, meaning that their content doesn't matter. At the end of the exception (which we are close to when we call schedule()) all registers and the CPSR will be restored from the kernel exception stack's pt_regs record before the exception returns anyway, which is another good reason to use exception handlers as preemption points.

This is why r0 thru r3 are missing from struct cpu_context_save and need not be preserved.

When the scheduler later on decides to schedule in the task that was interrupted again, we will return to execution right after the schedule(); call. If we were on our way out of an exception in do_work_pending() we will proceed to return from the exception handler, and to the process it will “feel” like it just returned from a hardware or software interrupt, and execution will go on from that point like nothing happened.

Running init

So how does /sbin/init actually come to execute?

We saw that after start_kernel we get to rest_init which creates the thread with pid = user_mode_thread(kernel_init, NULL, CLONE_FS).

Then kernel_init() calls kernel_execve() to execute /sbin/init. The kernel locates an ELF parser to read and page in the file. Then it will eventually issue start_thread(), which will set regs->ARM_cpsr = USR_MODE and regs->ARM_pc to the start of the executable.
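As a rough sketch (not the literal ARM32 macro), start_thread() primes the new task's pt_regs so that the eventual exception return lands in userspace at the ELF entry point:

static void start_thread_sketch(struct pt_regs *regs, unsigned long pc,
                                unsigned long sp)
{
        memset(regs->uregs, 0, sizeof(regs->uregs)); /* clean register state */
        regs->ARM_cpsr = USR_MODE;                   /* "return" into user mode */
        regs->ARM_pc = pc;                           /* entry point of the ELF binary */
        regs->ARM_sp = sp;                           /* initial userspace stack pointer */
}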

Then this task's task_struct, including its memory context etc., will be selected after a call to schedule().

But every call to schedule() will return to the point right after a schedule() call, and the only place a userspace task is ever preempted to get schedule() called on it is in the exception handlers, such as when a timer interrupt occurs. Well, this is where we “cheat”:

When we initialized the process in arch/arm/kernel/process.c, we set the program counter to ret_from_fork so we are not going back after any schedule() call: we are going back to ret_from_fork! And this is just an exception return path, so this will restore regs->ARM_cpsr to USR_MODE, and “return from an exception” into whatever is in regs->ARM_pc, which is the start of the binary program from the ELF file!

So /sbin/init is executed as a consequence of returning from a fake exception through ret_from_fork. From that point on, only real exceptions, such as getting interrupted by the IRQ, will happen to the process.

This is how ARM32 schedules and executes processes.

 

from Christian Brauner

The original blogpost is at https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html

Early on when the LXD project was started we were clear that we wanted to make it possible to change settings while the container is running. One of the very first things that came to our minds was making it possible to insert new mounts into a running container. When I was still at Canonical working on LXD we quickly realized that inserting mounts into a running container would require a lot of creativity given the limitations of the api.

Back then the only way to create mounts or change mount options was by using the mount(2) system call. The mount system call multiplexes a lot of different operations. For example, it doesn't just allow the creation of new filesystem mounts but also handles bind mounts and mount option changes. Mounting is overall a pretty complex operation as it doesn't just involve path lookup but also needs to handle mount propagation and filesystem specific and generic mount options.

I want to take a look at our legacy solution to this problem and at a new approach that I've used, and that has existed for a while, but has never been talked about widely.

Creative uses of mount(2)

Before openat2(2) came along adding mounts to a container during startup was difficult because there was always the danger of symlink attacks. A mount source or target path could be specified containing symlinks that would allow processes in the container to escape to the host filesystem. These attacks used to be quite common and there was no straightforward solution available; at least not before the RESOLVE_* flag namespace of openat2(2) improved things so considerably that symlink attacks on new kernels can be effectively blocked.

But before openat2() symlink attacks when mounting could only be prevented with very careful coding and a rather elaborate algorithm. I won't go into too much detail but it is roughly done by verifying each path component in userspace using O_PATH file descriptors making sure that the paths point into the container's rootfs.
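A sketch of the basic building block of such verification (a hypothetical helper; the surrounding loop and most error handling omitted): open one path component at a time relative to the previous one, never following symlinks, and verify that each component is a directory before descending:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

static int walk_component(int dirfd, const char *name)
{
        struct stat st;
        /* O_NOFOLLOW means a symlink is opened itself rather than
         * followed, so a malicious symlink cannot redirect the walk. */
        int fd = openat(dirfd, name, O_PATH | O_NOFOLLOW | O_CLOEXEC);

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0 || !S_ISDIR(st.st_mode)) {
                close(fd);
                return -1;
        }
        return fd;
}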

But even if you verified that the path is sane and you hold a file descriptor to the last component you still need to solve the problem that mount(2) only operates on paths. So you are still susceptible to symlink attacks as soon as you call mount(source, target, ...).

The way we solved this problem was by realizing that mount(2) was perfectly happy to operate on /proc/self/fd/<nr> paths (This is similar to how fexecve() used to work before the addition of the execveat() system call.). So we could verify the whole path and then open the last component of the source and target paths at which point we could call mount("/proc/self/fd/1234", "/proc/self/fd/5678", ...).
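In code the trick looks roughly like this (a sketch; bind_mount_by_fd is a made-up helper and fd_src/fd_tgt are assumed to be the carefully verified O_PATH descriptors):

#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/mount.h>

static int bind_mount_by_fd(int fd_src, int fd_tgt)
{
        char src[PATH_MAX], tgt[PATH_MAX];

        /* mount(2) resolves the /proc/self/fd/<nr> symlinks itself, so the
         * original, possibly symlink-riddled paths are never re-walked. */
        snprintf(src, sizeof(src), "/proc/self/fd/%d", fd_src);
        snprintf(tgt, sizeof(tgt), "/proc/self/fd/%d", fd_tgt);
        return mount(src, tgt, NULL, MS_BIND, NULL);
}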

We immediately thought that if mount(2) allows you to do that then we could easily use this to mount into namespaces. So if the container is running in its mount namespace we could just create a bind mount on the host, open the newly created bind mount and then change to the container's mount namespace (and its owning user namespace) and then simply call mount("/proc/self/fd/1234", "/mnt", ...). In pseudo C code it would look roughly like this:

fd_mnt = openat(-EBADF, "/opt", O_PATH, ...);
setns(fd_userns, CLONE_NEWUSER);
setns(fd_mntns, CLONE_NEWNS);
mount("/proc/self/fd/fd_mnt", "/mnt", ...);

However, this isn't possible as the kernel will enforce that the mounts that the source and target paths refer to are located in the caller's mount namespace. Since the caller will be located in the container's mount namespace after the setns() call, but the source file descriptor refers to a mount located in the host's mount namespace, this check fails. The semantics behind this are somewhat sane and straightforward to understand so there was no need to change them even though we were tempted. Back then it would've also meant that adding mounts to containers would've only worked on newer kernels and we were quite eager to enable this feature for kernels that were already released.

Mount namespace tunnels

So we came up with the idea of mount namespace tunnels. Since we spearheaded this idea it has been picked up by various projects such as systemd, both for system services and for its own systemd-nspawn container runtime.

The general idea is based on the observation that mount propagation can be used to function like a tunnel between mount namespaces:

mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt
# Create new mount namespace with all mounts turned into dependent mounts.
unshare --mount --propagation=slave

and then create a mount on or beneath the shared /opt mount on the host:

mkdir /opt/a
mount --bind /tmp /opt/a

then the new mount of /tmp on the dentry /opt/a will propagate into the mount namespace we created earlier. Since the /opt mount at the /opt dentry in the new mount namespace is a dependent mount we can now move the mount to its final location:

mount --move /opt/a /mnt

As a last step we can unmount /opt/a in the host mount namespace. And as long as the /mnt dentry doesn't reside on a mount that is a dependent mount of /opt's peer group the unmount of /opt/a we just performed on the host will only unmount the mount in the host mount namespace.

There are various problems with this solution:

  • It's complex.
  • The container manager needs to set up the mount tunnel when the container starts. In other words, it needs to be part of the architecture of the container, which is always unfortunate.
  • The mount at the endpoint of the tunnel in the container needs to be protected from being unmounted. Otherwise the container payload can just unmount the mount at its end of the mount tunnel and prevent the insertion of new mounts into the container.

Mounting into mount namespaces

A few years ago a new mount api made it into the kernel. Shortly after, I also added the mount_setattr(2) system call. Since then I've been expanding the abilities of this api and putting it to its full use.

Unfortunately the adoption of the new mount api has been slow. Mostly, because people don't know about it or because they don't yet see the many advantages it offers over the old one. But with the next release of the mount(8) binary that a lot of us use, the new mount api will be used whenever possible.

I won't be covering all the features that the mount api offers. This post just illustrates how the new mount api makes it possible to mount into mount namespaces and lets us get rid of the complex mount propagation scheme.

Luckily, the new mount api is designed around file descriptors.

Filesystem Mounts

To create a new filesystem mount using the old mount api is simple:

mount("/dev/sda", "/mnt", "xfs", ...);

We pass the source, target, and filesystem type and potentially additional mount options. This single system call does a lot behind the scenes. A new superblock will be allocated for the filesystem, mount options will be set, a new mount will be created and attached to a mountpoint in the caller's mount namespace.

In the new mount api the various steps are split into separate system calls. While this makes mounting more complex it allows for greater flexibility. Mounting doesn't have to be a fast operation and never has been.

So in the new mount api we would create a new filesystem mount with the following steps:

/* Create a new filesystem context. */
fd_fs = fsopen("xfs");

/*
 * Set the source of the filesystem mount. Whether or not this is required
 * depends on the type of filesystem of course. For example, mounting a tmpfs
 * filesystem would not require us to set the "source" property as it's not
 * backed by a block device. 
 */
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda", 0);

/* Actually create the superblock and prepare to allocate a mount. */
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

The fd_fs file descriptor refers to a VFS context object that doesn't concern us here. Let it suffice to say that it is an opaque object that can only be used to configure the superblock and the filesystem until fsmount() is called:

/* Create a new detached mount and return an O_PATH file descriptor referring to the mount. */
fd_mnt = fsmount(fd_fs, 0, 0);

The fsmount() call will turn the context file descriptor into an O_PATH file descriptor that refers to a detached mount. A detached mount is a mount that isn't attached to any mount namespace.

Bind Mounts

The old mount api created bind mounts via:

mount("/opt", "/mnt", MNT_BIND, ...)

and recursive bind mounts via:

mount("/opt", "/mnt", MNT_BIND | MS_REC, ...)

Most people however will be more familiar with mount(8):

mount --bind /opt /mnt
mount --rbind / /mnt

Bind mounts play a major role in container runtimes and system services as run by systemd.

The new mount api supports bind mounts through the open_tree() system call. Calling open_tree() on an existing mount will just return an O_PATH file descriptor referring to that mount. But if OPEN_TREE_CLONE is specified open_tree() will create a detached mount and return an O_PATH file descriptor. That file descriptor is indistinguishable from an O_PATH file descriptor returned from the earlier fsmount() example:

fd_mnt = open_tree(-EBADF, "/opt", OPEN_TREE_CLONE, ...)

creates a new detached mount of /opt and:

fd_mnt = open_tree(-EBADF, "/", OPEN_TREE_CLONE | AT_RECURSIVE, ...)

would create a new detached copy of the whole rootfs mount tree.

Attaching detached mounts

As mentioned before the file descriptor returned from fsmount() and open_tree(OPEN_TREE_CLONE) refers to a detached mount in both cases. The mount it refers to doesn't appear anywhere in the filesystem hierarchy. Consequently, the mount can't be found by lookup operations going through the filesystem hierarchy. The new mount api thus provides an elegant mechanism for what previously required a dance like:

mount("/opt", "/mnt", MS_BIND, ...);
fd_mnt = openat(-EBADF, "/mnt", O_PATH | O_DIRECTORY | O_CLOEXEC, ...);
umount2("/mnt", MNT_DETACH);

and with the added benefit that the mount never actually had to appear anywhere in the filesystem hierarchy and thus never had to belong to any mount namespace. This alone is already a very powerful tool but we won't go into depth today.

Most of the time a detached mount isn't wanted however. Usually we want to make the mount visible in the filesystem hierarchy so other users or programs can access it. So we need to attach it to the filesystem hierarchy.

In order to attach a mount we can use the move_mount() system call. For example, to attach the detached mount fd_mnt we created before, we can use:

move_mount(fd_mnt, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

This will attach the detached mount of /opt at the /mnt dentry on the / mount. What this means is that the /opt mount will be inserted into the mount namespace that the caller is located in at the time of calling move_mount(). (The kernel has very tight semantics here. For example, it will enforce that the caller has CAP_SYS_ADMIN in the owning user namespace of its mount namespace. It will also enforce that the mount the /mnt dentry is located on belongs to the same mount namespace as the caller.)

After move_mount() returns the mount is permanently attached. Even if it is unmounted while still pinned by a file descriptor, it will still belong to the mount namespace it was attached to. In other words, move_mount() is an irreversible operation.

The main point is that before move_mount() is called a detached mount doesn't belong to any mount namespace and can thus be freely moved around.

Mounting a new filesystem into a mount namespace

To mount a filesystem into a new mount namespace we can make use of the split between configuring a filesystem context and creating a new superblock, and actually attaching the mount to the filesystem hierarchy:

fd_fs = fsopen("xfs");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda", 0);
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
fd_mnt = fsmount(fd_fs, 0, 0);

For filesystems that require host privileges such as xfs, ext4, or btrfs (and many others) these steps can be performed by a privileged container or pod manager with sufficient privileges. However, once we have created a detached mount we are free to attach it to whatever mount and mountpoint we have privilege over in the target mount namespace. So we can simply attach to the user namespace and mount namespace of the container:

setns(fd_userns);
setns(fd_mntns);

and then use

move_mount(fd_mnt, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

to attach the detached mount anywhere we like in the container.
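Putting the pieces together, a minimal end-to-end sketch could look like this (assuming a libc that exposes the new mount api wrappers and constants, e.g. glibc >= 2.36; fd_userns and fd_mntns are assumed to have been opened from the container's /proc/<pid>/ns/ files; most error handling trimmed):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mount.h>
#include <unistd.h>

static int mount_into_container(int fd_userns, int fd_mntns)
{
        int fd_fs, fd_mnt;

        /* Create and configure the superblock with host privileges. */
        fd_fs = fsopen("xfs", FSOPEN_CLOEXEC);
        if (fd_fs < 0)
                return -1;
        fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda", 0);
        fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

        /* A detached mount: not yet part of any mount namespace. */
        fd_mnt = fsmount(fd_fs, FSMOUNT_CLOEXEC, 0);
        close(fd_fs);
        if (fd_mnt < 0)
                return -1;

        /* Switch into the container's user and mount namespace... */
        if (setns(fd_userns, CLONE_NEWUSER) < 0 ||
            setns(fd_mntns, CLONE_NEWNS) < 0) {
                close(fd_mnt);
                return -1;
        }

        /* ...and attach the detached mount wherever we have privilege. */
        if (move_mount(fd_mnt, "", -EBADF, "/mnt",
                       MOVE_MOUNT_F_EMPTY_PATH) < 0) {
                close(fd_mnt);
                return -1;
        }
        close(fd_mnt);
        return 0;
}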

Mounting a new bind mount into a mount namespace

A bind mount is even simpler. If we want to share a specific host directory with the container we can just have the container manager call:

fd_mnt = open_tree(-EBADF, "/opt", OPEN_TREE_CLOEXEC | OPEN_TREE_CLONE);

to allocate a new detached copy of the mount and then attach to the user and mount namespace of the container:

setns(fd_userns);
setns(fd_mntns);

and as above we are free to attach the detached mount anywhere we like in the container.

Conclusion

This is really it and it is as simple as it sounds. It is a powerful delegation mechanism making it possible to inject mounts into less privileged mount namespaces or unprivileged containers. We've been making heavy use of this in LXD and it is in general the proper way to insert mounts into mount namespaces on newer kernels.

 

from Konstantin Ryabitsev

At some point in the recent past, mutt changed the way it generates Message-ID header values. Instead of the perfectly good old way of doing it, the developers switched to using base64-encoded random bytes. The base64 dictionary contains the / character, which causes unnecessary difficulties when linking to these messages on lore.kernel.org, since the / character needs to be escaped as %2F for everything to work properly.

Mutt developers seem completely uninterested in changing this, so please save everyone a lot of trouble and do the following if you're using mutt for your kernel development needs (should work for all mutt versions):

  1. Create a ~/.mutt-hook-fix-msgid file with the following contents (change “mylaptop.local” to whatever you like):

    my_hdr Message-ID: <`uuidgen -r`@mylaptop.local>
    
  2. Add the following to your ~/.muttrc:

    send-hook . "source ~/.mutt-hook-fix-msgid"
    

UPDATE: if you have mutt 2.1 or later you can alternatively set the $message_id_format variable to restore the pre-mutt-2.0 behaviour:

# mutt-2.1+ only
set message_id_format = "<%Y%02m%02d%02H%02M%02S.G%c%p@%f>"

Thanks to Thomas Weißschuh for the suggestion!

 

from Jakub Kicinski

NIC drivers pre-allocate memory for received packets. Once the packets arrive the NIC can DMA them into the buffers, potentially hundreds of them, before host processing kicks in.

For efficiency reasons each packet-processing CPU (in extreme cases every CPU on the system) will have its own set of packet queues, including its own set of pre-allocated buffers.

The amount of memory pre-allocated for Rx is a product of:

  • buffer size
  • number of queues
  • queue depth

A reasonable example in data centers would be:

8k * 32 queues * 4k entries = 1GB

Buffer size is roughly dictated by the MTU of the network, for a modern datacenter network 8k (2 pages) is likely a right ballpark figure. Number of queues depends on the number of cores on the system and the request-per-second rate of the workload. 32 queues is a reasonable choice for the example (either 100+ threads or a network-heavy workload).

Last but not least – the queue depth. Because networking is bursty, and NAPI processing is at the whim of the scheduler (the latter is more of an issue in practice), queue depths of 4k or 8k entries are not uncommon.

Can we do better?

Memory is not cheap, having 1GB of memory sitting around unused 99% of the time has a real cost. If we were to advise a NIC design (or had access to highly flexible devices like the Netronome/Corigine NICs) we could use the following scheme to save memory:

Normal processing rarely requires queue depth of more than 512 entries. We could therefore have smaller dedicated queues, and a larger “reserve” – a queue from which every Rx queue can draw, but which requires additional synchronization on the host side. To achieve the equivalent of 4k entries we'd only need:

8k * 32 queues * 512 entries + 8k * 1 reserve * 4k entries = 160MB

The NIC would try to use the 512 entries dedicated to each queue first, but if they run out (due to a packet burst or a scheduling delay) it could use the entries from the reserve. Bursts and latency spikes are rarely synchronized across the queues.

Can we do worse?

In practice memory savings are rarely top-of-mind for NIC vendors. Multiple drivers in Linux allocate a set of rings for each thread of the CPU. I can only guess that this is to make sure iperf tests run without a hitch...

As we wait for vendors to improve their devices – double check the queue count and queue size you use are justified (ethtool -g / ethtool -l).

 

from kees

How to modernize C arrays for greater memory safety: a case-study in refactoring the Linux kernel and a look to the future

Kees Cook

C is not just a fancy assembler any more

Large projects written in C, especially those written close to the hardware layer like Linux, have long treated the language as a high-level assembler. Using C allowed for abstracting away much of the difficulty of writing directly in machine code while still providing easy low-level access to memory, registers, and CPU features. However, C has matured over the last half century, and many language features that improve robustness go unused in older codebases. This is especially true for arrays, where the historical lack of bounds checking has been a consistent source of security flaws.

Converting such codebases to use “modern” language features, like those in C99 (still from the prior millennium), can be a major challenge, but it is an entirely tractable problem. This post is a deep dive into an effort underway in the Linux kernel to make array index overflows (and more generally, buffer overflows) a thing of the past, where they belong. Our success hinges on replacing anachronistic array definitions with well-defined C99 flexible arrays. This approach can be used by developers to refactor C code, making it possible to leverage 21st century mitigations (like -fsanitize=bounds and FORTIFY_SOURCE), since such things can finally be cleanly applied to the modernized codebase.

The fraught history of arrays in C

For the compiler to successfully apply array index bounds checking, array sizes must be defined unambiguously, which is not always easy in C. Depending on the array definition, bounds checking falls roughly into three categories: fixed-sized arrays, dynamically-sized arrays, and pointer offsets. Each category of array definitions must be made unambiguous before the next, as they mostly build on top of each other. For example, if the compiler cannot protect a fixed-sized array, it certainly cannot protect a dynamically-sized array, and array indexing is just a specialized case of calculating a memory pointer offset.

Properly defined dynamically-sized arrays were introduced in C99 (int foo[]), and called “flexible arrays”. Before that, many C projects used the GNU extension of zero-length arrays (int foo[0]), which is not recognized by the C standard. This was done because, before the GNU extension, C projects would use single-element arrays (int foo[1]) which had several frustrating characteristics. (Using sizeof() on such a structure would include a single element as well, which would require additional handling to get allocation sizes to be accurate. This is not a problem for zero-element or true flexible arrays.)
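As a minimal illustration (hypothetical struct and helper names), this is the kind of extra handling a 1-element trailing array forces on allocation sizes:

#include <stdlib.h>

struct old_style {
        size_t count;
        int data[1];            /* really "count" many elements */
};

static struct old_style *alloc_old_style(size_t count)
{
        /* sizeof(*p) already covers data[0], so subtract one element --
         * a subtraction that is easy to forget or get wrong. */
        struct old_style *p = malloc(sizeof(*p) + sizeof(int) * (count - 1));

        if (p)
                p->count = count;
        return p;
}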

However, due to yet more historical situations (e.g. struct sockaddr, which has a fixed-size trailing array that is not supposed to actually be treated as fixed-size), GCC and Clang actually treat all trailing arrays as flexible arrays. This behavior makes things even more problematic, since it becomes impossible to limit a flexible array heuristic to only 1-element or 0-element (i.e. zero-length) arrays. For example, a compiler can't tell the intent of the variable member's use here:

struct obj {
        ...
        unsigned char bytes;
        int variable[4];
};

Is it actually a 4 element array, or is it sized by the bytes member? As such, compilers have had to assume that trailing arrays must be intended to be dynamically sized (even though most are intended to be fixed-size).

To clear the way for sensible protection of fixed-size arrays, and to have a common framework for handling dynamically-sized arrays, Linux must have all the “fake” flexible array members replaced with actual C99 flexible array members so that the programmer's intent can actually be represented in an unambiguous way. With this done, -Warray-bounds (and similar things like __builtin_object_size()) will catch compile-time problems, and -fsanitize=bounds (and similar things like __builtin_dynamic_object_size()) can catch run-time problems.

Once fixed-sized arrays are protected, dynamically sized arrays can be protected as well, though this requires introducing a way to annotate structures that contain flexible arrays. Nearly all such structs also contain the count of allocated elements present in the flexible array:

struct obj {
        ...
        unsigned short count;
        struct foo items[]; /* Has "count" many "struct foo"s */
} *ptr;

Such structs therefore fully describe their contents at runtime (and are called “flexible array structures” from here on). In other words, their size can be determined at run-time as:

sizeof(*ptr) + sizeof(*ptr->items) * ptr->count

Teaching the compiler which struct member is associated with the count of a given flexible array member will allow -fsanitize=bounds and __builtin_dynamic_object_size() to reason about flexible array structure usage as well, covering all arrays in Linux with “known bounds”.

(Not covered here is the closely related work to tighten the FORTIFY_SOURCE implementation for the memcpy()-family of functions which also depends on making flexible array sizes unambiguous.)

Replacing “fake” flexible arrays

Compile-time diagnostics about the size of arrays use either internal value range checking or things similar to the FORTIFY_SOURCE macros (which use __builtin_object_size() for their implementations). This works well for arrays not at the end of the structure, but gets disabled for trailing arrays since the compiler must treat trailing arrays as flexible arrays (see struct sockaddr above). And for everything treated as a flexible array (i.e. dynamically sized), the compiler cannot know the array length at compile time, since it will be only known at runtime. To make such array declarations unambiguous (and therefore able to gain sane runtime bounds checking), compilers must gain an option to disable all “fake” flexible array heuristics, and treat only true flexible arrays as flexible arrays.

The creation of -fstrict-flex-arrays is now available in recent GCC and Clang builds, but any project using it will need to replace all fake flexible arrays with true flexible arrays first (to separate them from any fixed-size trailing arrays). This comes with several challenges.

Replace 0-length arrays

Most replacement of 0-length arrays with flexible arrays requires no special handling. Simply removing the “0” in the array declaration is sufficient. For example,

struct obj {
        ...
        int flex[0];
};

becomes:

struct obj {
        ...
        int flex[];

};

However, there are a few things of note that can go wrong with these conversions:

Changes to sizeof()

While sizeof(instance->flex) for a 0-length array returns 0, it becomes a compile-time failure once it becomes a true flexible array. This usually manifests within other complex macros that are examining the details of a given struct, and are usually hidden bugs that switching to a flexible array helps expose.
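A minimal illustration with hypothetical structs: sizeof() on the 0-length member is well-defined (and 0), while applying sizeof() to a true flexible array member is a hard error:

struct with_zero { int count; int flex[0]; };  /* GNU zero-length array */
struct with_flex { int count; int flex[]; };   /* C99 flexible array */

/* Fine with the GNU extension: evaluates to 0. */
_Static_assert(sizeof(((struct with_zero *)0)->flex) == 0, "zero-sized");

/* A compile-time error with a true flexible array, typically reported as
 * applying 'sizeof' to an incomplete type:
 *
 *     sizeof(((struct with_flex *)0)->flex);
 */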

Pass by value

Converting to a true flexible array will expose any strange cases of trying to pass a flexible array struct by value. These are almost always a bug, so it's another case where a problem is exposed by cleaning up fake flexible arrays. For example:

net/core/flow_dissector.c: In function 'is_pppoe_ses_hdr_valid':
net/core/flow_dissector.c:898:13: note: the ABI of passing struct with a flexible array member has changed in GCC 4.4

898 | static bool is_pppoe_ses_hdr_valid(struct pppoe_hdr hdr)
    |                                   ^~~~~~~~~~~~~~~~~~~~~~

Flexible arrays in unions

C99 6.7.2.1 “Structure and union specifiers” #16 declares true flexible arrays may not be in unions nor otherwise empty structures: “As a special case, the last element of a structure with more than one named member may have an incomplete array type; this is called a flexible array member.”

However, this situation is allowed by the GNU “trailing array” extension, where such arrays are treated as flexible arrays. More importantly, flexible arrays (via the GNU extension) are used in unions in many places throughout Linux code. The C99 treatment of true flexible arrays appears to be only a definitional limitation (and likely just an oversight) since the restriction can be worked around with creative use of anonymous structs. For example, this will build:

struct obj {
        ...
        union {
                struct foo name1[0];
                struct bar name2[0];
        };
};

but this will not:

struct obj {
        ...
        union {
                struct foo name1[];
                struct bar name2[];
        };
};
<source>:5:22: error: flexible array member in union
  5 | struct foo name1[];
    |            ^~~~~

But in both cases, the compiler treats name1 and name2 as flexible arrays. What will happily compile, though, is wrapping each true flexible array in a struct that has at least 1 other member that is not a flexible array, such as an empty anonymous struct (i.e. taking up no size):

struct obj {
        ...
        union {
                struct {
                        struct { } __unused_member1;
                        struct foo name1[];
                };
                struct {
                        struct { } __unused_member2;
                        struct bar name2[];
                };
        };
};

Thankfully, this was wrapped in Linux with the DECLARE_FLEX_ARRAY() macro:

struct obj {
        ...
        union {
                DECLARE_FLEX_ARRAY(struct foo, name1);
                DECLARE_FLEX_ARRAY(struct bar, name2);
        };
};

which makes this much more readable. I hope to see future C standards eliminate this restriction.

Overlapping composite structure members

This is another case of a real bug being exposed by true flexible array usage, as it is possible to create an implicit union of a flexible array and something else by including a flexible array structure in another struct. For example:

struct inner {
        ...
        int flex[0];
};

struct outer {
        ...
        struct inner header;
        int overlap;
        ...
} *instance;

Here, instance->overlap and instance->header.flex[0] share the same memory location. Whether or not this is intentional cannot be understood by the compiler. If it is a bug, then using a true flexible array will trigger a warning. If it's not a bug, rearranging the structures to use an actual union is needed (see above).

struct definition parsed by something other than a C compiler

If the converted struct is part of a source file that is parsed by something that is not a C compiler, it may not be prepared to handle empty square braces on arrays. For example, SWIG broke when the Linux Userspace API headers got converted. This is a known issue in SWIG, and can be worked around in various ways.

Replace 1-element arrays

Most 1-element array conversions are similar to 0-length array conversions, but with the effect that the surrounding structure's sizeof() changes. This leads to a few additional significant issues:

Size calculations

If a struct is used entirely internally to Linux, it is generally sufficient to make changes to both the struct and all size calculations, which will result in identical binary output. For example:

struct object {
        ...
        int flex[1];
} *p;

p = kmalloc(sizeof(*p) + sizeof(p->flex[0]) * (count - 1),
            GFP_KERNEL);

the above count - 1 becomes just count now:

struct object {
        ...
        int flex[];
} *p;

p = kmalloc(sizeof(*p) + sizeof(p->flex[0]) * count,
            GFP_KERNEL);

If all size calculations are correctly adjusted, there should be no differences in the resulting allocation size, etc. If a discrepancy is found, it is going to be either a bug introduced by the conversion, or the discovery of an existing bug in the original size calculations.

Note that depending on the sizes of the structure, its flexible array element, and count, there is also the risk associated with arithmetic overflow. Linux uses the struct_size() macro to perform these calculations so that the result saturates to at most SIZE_MAX, which will cause an allocation failure rather than wrapping around. So the best way to perform this allocation would be:

p = kmalloc(struct_size(p, flex, count), GFP_KERNEL);

Padding and interface sizes

When a structure definition is also used by a codebase we don't control (e.g. firmware, userspace, virtualization), changing its layout or sizeof() may break such code. Specifically, it may break its ability to communicate correctly with the kernel across the shared interface. Such structures cannot suddenly lose the single element of its trailing array. In these cases, a new member needs to be used for kernel code, explicitly keeping the original member for backward compatibility. For example:

struct object {
        ...
        int flex[1];
};

becomes:

struct object {
        ...
        union {
                int flex[1];
                DECLARE_FLEX_ARRAY(int, data);
        };
};

Now the kernel will only use the newly named data member (and gain any potential bounds checking protections from the compiler), and external code that shares this structure definition can continue to use the flex member, all without changing the size of the structure.

This has the downside of needing to change the member name throughout Linux. However, if the other side of the interface doesn't actually use the original member, we can avoid this. We can convert the member to a flexible array and add explicit padding instead. This would mean no collateral changes with the member name in Linux are needed:

struct object {
        ...
        union {
                int __padding;
                DECLARE_FLEX_ARRAY(int, flex);
        };
};

Replace multi-element arrays

In the cases of trailing arrays with larger element counts, the usage needs to be even more carefully studied. Most problems end up looking very similar to 1-element interface conversions above. For example, if there is some hardware interface that returns at least 4 bytes for an otherwise dynamically sized array, the conversion would start from here:

struct object {
        ...
        unsigned char data[4];
};

which becomes:

struct object {
        ...
        union {
                unsigned char __padding[4];
                DECLARE_FLEX_ARRAY(unsigned char, data);
        };
};

Enable -Warray-bounds

With all fixed-size array bounds able to be determined at build time, -Warray-bounds can actually perform the checking, keeping provably bad code out of Linux. (This option is already part of -Wall, which Linux isn't quite able to use itself yet, but is strongly recommended for other C projects.) As a reminder, optimization level will impact this option. The kernel is built with -O2, which is likely the right choice for most C projects.
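As a trivial illustration (hypothetical code) of the kind of thing the option catches at build time:

int fixed[4];

int oops(void)
{
        /* With -O2 and -Warray-bounds the compiler can prove this index is
         * past the end of the 4-element array and warns at compile time. */
        return fixed[4];
}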

Enable -Wzero-length-array

If all zero length arrays have been removed from the code, future uses can be kept out of the code by using -Wzero-length-array. This option is currently only available in Clang, and will warn when finding the definition of such structure members, rather than warning when they are accessed in code. Because of this, it is unlikely to ever be enabled in Linux since some array sizes are constructed from build configurations, and may drop to 0 when they are unused (i.e. they were never used as flexible arrays). As such, it is sufficient to use -fstrict-flex-arrays (see below) and -Warray-bounds.

Enable -fstrict-flex-arrays

Once all the fake flexible arrays have been converted to true flexible arrays, the remaining fixed-sized trailing arrays can start being treated as actually fixed-size by enabling -fstrict-flex-arrays. Future attempts to add fake flexible arrays to the code will then elicit warnings as part of the existing diagnostics from -Warray-bounds, since all fake flexible arrays are now treated as fixed-size arrays. (Note that this option sees the subset of 0-length arrays caught by -Wzero-length-array when they are actually used in the code, so -Wzero-length-array may be redundant.)

Coming soon: annotate bounds of flexible arrays

With flexible arrays now a first-class citizen in Linux and the compilers, it becomes possible to extend their available diagnostics. What the compiler is missing is knowledge of how the length of a given flexible array is tracked. For well-described flexible array structs, this means associating the member holding the element count with the flexible array member. This idea is not new, though prior implementation proposals have wanted to make changes to the C language syntax. A simpler approach is the addition of struct member attributes, and is under discussion and early development by both the GCC and Clang developer communities.

Add __attribute__((__counted_by__(member)))

In order to annotate flexible arrays, a new attribute could be used to describe the relationship between struct members. For example:

struct object {
        ...
        signed char items;
        ...
        int flex[];
} *p;

becomes:

struct object {
        ...
        signed char items;
        ...
        int flex[] __attribute__((__counted_by__(items)));
} *p;

This would allow -fsanitize=bounds to check for out-of-bounds accesses. For example, given the above annotation, each of the marked access into p->flex should trap:

sum += p->flex[-1];  // trap all negative indexes
sum += p->flex[128]; // trap when index larger than bounds type
sum += p->flex[0];   // trap when p->items <= 0
sum += p->flex[5];   // trap when p->items <= 5
sum += p->flex[idx]; // trap when p->items <= idx || idx < 0

The type associated with the bounds check (signed char in the example above) should perhaps be required to be an unsigned type, but Linux has so many counters implemented as int that it becomes an additional refactoring burden to change these to unsigned, especially since sometimes they are sneakily being used with negative values in some other part of the code. Better to leave them as-is (though perhaps emit a warning), and just add a negativity check at access time. Switching the counter to unsigned then potentially becomes a small performance improvement.

Similar to -fsanitize=bounds above, __builtin_dynamic_object_size() will perform the expected calculations with the items member as the basis for the resulting size (and where values less than 0 are considered to be 0 to avoid pathological calculations):

p->items = 5;
assert(__builtin_dynamic_object_size(p, 1) ==
        sizeof(*p) + 5 * sizeof(*p->flex));
assert(__builtin_dynamic_object_size(p->flex, 1) ==
        5 * sizeof(*p->flex));
assert(__builtin_dynamic_object_size(&p->flex[0], 1) ==
        sizeof(*p->flex));
assert(__builtin_dynamic_object_size(&p->flex[2], 0) ==
        3 * sizeof(*p->flex));

p->items = -10;
assert(__builtin_dynamic_object_size(p, 0) == sizeof(*p));
assert(__builtin_dynamic_object_size(p, 1) == sizeof(*p));
assert(__builtin_dynamic_object_size(p->flex, 1) == 0);
assert(__builtin_dynamic_object_size(&p->flex[2], 1) == 0);

Additional attributes may be needed if structures explicitly use byte counts rather than element counts.

Scope considerations

Composite structures need to be able to define __counted_by__ across struct boundaries:

struct object {
        ...
        char items;
        ...
        struct inner {
                ...
                int flex[] __attribute__((__counted_by__(.items)));
        };
} *ptr;

This may mean passing &ptr->inner to a function will lose the bounds knowledge, but it may be possible to automatically include a bounds argument as an invisible function argument, as any function able to understand the layout of struct inner must by definition have visibility into the definition of struct object. For example, with this:

struct object instance;
...
func(&instance.inner);
...
void func(struct inner *ptr) {
        ...
        ptr->flex[foo]; /* "items" is not in scope */
        ...
}

The prototype could either be rejected due to lack of available scope, or could be automatically converted into passing the outer object pointer with an injected scope:

void func(struct object *__ptr) {
        struct inner *ptr = &__ptr->inner;
        ...
        ptr->flex[foo]; /* __ptr->items is in scope */
        ...
}

Annotate kernel flexible array structs

With the compiler attribute available, all of Linux's flexible arrays can be updated to include the annotation, and CONFIG_FORTIFY_SOURCE can be expanded to use __builtin_dynamic_object_size().

Replace DECLARE_FLEX_ARRAY with DECLARE_BOUNDED_ARRAY

Most uses of DECLARE_FLEX_ARRAY() can be replaced with DECLARE_BOUNDED_ARRAY(), explicitly naming the expected flex array bounds member. For example, if we had:

struct obj {
        ...
        int items;
        ...
        union {
                DECLARE_FLEX_ARRAY(struct foo, name1);
                DECLARE_FLEX_ARRAY(struct bar, name2);
        };
};

it would become:

struct obj {
        ...
        int items;
        ...
        union {
                DECLARE_BOUNDED_ARRAY(struct foo, name1, items);
                DECLARE_BOUNDED_ARRAY(struct bar, name2, items);
        };
};

Add manual annotations

Any flexible array structures not already using DECLARE_BOUNDED_ARRAY() can be annotated manually with the new attribute. For example, assuming the proposed __attribute__((__counted_by__(member))) is wrapped in a macro named __counted_by():

struct obj {
        ...
        int items;
        ...
        int flex[];
};

becomes:

struct obj {
        ...
        int items;
        ...
        int flex[] __counted_by(items);
};

Future work: expand attribute beyond arrays

It will also be possible to use the new attribute on pointers and function arguments as well as flexible arrays. All the same details are available, though there would be the obvious differences for enclosing structure sizes, as the pointers are aimed (usually) outside the struct itself. Regardless, having it be possible to check offsets and inform __builtin_dynamic_object_size() would allow for several more places where runtime checking could be possible. For example, given this:

struct object {
        ...
        unsigned char items;
        ...
        int *data __attribute__((__counted_by__(items)));
        ...
} *p;

It should be possible to detect sizing information:

p->items = 5;
assert(__builtin_dynamic_object_size(p->data, 1) ==
        5 * sizeof(*p->data));
assert(__builtin_dynamic_object_size(*p->data, 1) ==
        sizeof(*p->data));
assert(__builtin_dynamic_object_size(*p->data, 0) ==
        5 * sizeof(*p->data));

And it should be possible to trap on the following bad accesses:

int *ptr = p->data;
sum += ptr[-1];  // trap all negative indexes
sum += ptr[500]; // trap when index larger than bounds type
sum += ptr[0];   // trap when p->items <= 0
sum += ptr[5];   // trap when p->items <= 5
ptr += 5;        // don't trap yet: allow ptr++ in a for loop
sum += *ptr;     // trap when p->items <= 5

A safer code base

A C codebase that has refactored all of its arrays into proper flexible arrays can now finally build by using:

        -Warray-bounds
        -fstrict-flex-arrays
        -fsanitize=bounds
        -fsanitize-undefined-trap-on-error
        -D_FORTIFY_SOURCE=3

With this, the burdens of C array index bounds checking will have been shifted to the toolchain, and array index overflow flaw exploitation can be a thing of the past, reducing severity to a simple denial of service (assuming the traps aren't handled gracefully). For the next trick, new code can be written in a language that is memory safe to start with (e.g. Rust).

Acknowledgements

Thanks to many people who gave me feedback on this post: Nick Desaulniers, Gustavo A. R. Silva, Bill Wendling, Qing Zhao, Kara Olive, Chris Palmer, Steven Rostedt, Allen Webb, Julien Voisin, Guenter Roeck, Evan Benn, Seth Jenkins, Alexander Potapenko, Ricardo Ribalda, and Kevin Chowski.

Discussion

Please join this thread with your thoughts, comments, and corrections. :)

 