
There is a proposal making its way through the C++ Standards Committee to Defang and deprecate memory_order::consume, and a similar proposal is likely to make its way through the C Standards Committee. This is somewhat annoying from a Linux-kernel-RCU perspective, because there was some reason to hope for language-level support for the address dependencies headed by calls to rcu_dereference().

So what is a Linux-kernel community to do?

The first thing is to review the current approach to rcu_dereference() address dependencies, which uses a combination of standard language features and code standards.

Standard Language Features

Actual Implementations of memory_order::consume

All known implementations of memory_order::consume simply promote it to memory_order::acquire. This is correct from a functional perspective, but leaves performance on the table for PowerPC and for some hardware implementations of ARM. Of less concern for the Linux kernel, on many GPGPUs, memory_order::acquire has significant overhead.
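For example, in C11 terms (a minimal sketch, not kernel code), current compilers emit the same acquire-flavored instructions for both of the loads below:

#include <stdatomic.h>

_Atomic(int *) gp;

int reader(void)
{
	/* All known implementations promote consume to acquire, so these
	 * two loads currently generate identical code. */
	int *p1 = atomic_load_explicit(&gp, memory_order_consume);
	int *p2 = atomic_load_explicit(&gp, memory_order_acquire);

	return (p1 ? *p1 : 0) + (p2 ? *p2 : 0);
}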

There are those who claim that in the future, accesses using memory_order::acquire will be no more expensive than those using memory_order::consume. And who knows, maybe they are correct. But in the meantime, we must deal with the hardware that we actually have.

Your Friend in Need: volatile

Some might argue that the volatile keyword is underspecified by the C and C++ standards, but a huge body of device-driver code does constrain the compilers, albeit sometimes rather controversially. And rcu_dereference() uses a volatile load to fetch the pointer, which prevents the compiler from spoiling the fun by (say) reloading from the same pointer, whose value might well have changed in the meantime.
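To illustrate, here is a minimal user-space sketch (my_rcu_dereference(), struct foo, and gp are hypothetical stand-ins; the real rcu_dereference() adds lockdep checks and sparse annotations on top of its volatile load):

struct foo { int a; };

#define my_rcu_dereference(p) (*(struct foo * volatile *)&(p))  /* volatile load */

struct foo *gp;  /* published elsewhere, for example by rcu_assign_pointer() */

int reader(void)
{
	struct foo *p = my_rcu_dereference(gp);  /* compiler may not refetch gp */

	return p ? p->a : 0;
}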

A Happy Consequence of Data-Race Prohibition

The C language forbids implementations (in our case, compilers) from creating data races, that is, situations where the object code has concurrent C-language accesses to a given variable, at least one of which is a store.

So how does this help?

For one thing, it constrains compiler-based speculation. To see this, suppose that one thread does this:

p = kmalloc(sizeof(*p), GFP_KERNEL);
p->a = 42;
rcu_assign_pointer(gp, p);

And another thread does this:

p = rcu_dereference(gp);
do_something_with(p->a);

At the source-code level, there is clearly no data race. But suppose the compiler uses profile-directed optimization, and learns that the value returned by rcu_dereference() is almost always 0x12345678. Such a compiler might be tempted to emit code to cause the hardware to concurrently execute the rcu_dereference() while also loading 0x12345678->a. If the rcu_dereference() returned the expected value of 0x12345678, the generated code could use the value loaded from 0x12345678->a; otherwise, it could load p->a.

The problem is that the two threads might execute concurrently as follows:

p = kmalloc(sizeof(*p), GFP_KERNEL);

// These two concurrently:
r1 = 0x12345678->a;
p->a = 0xdead1eafbadfab1e;

rcu_assign_pointer(gp, p);
p = rcu_dereference(gp); // Returns 0x12345678!!!

Because the value of p is 0x12345678, it appears that the speculation has succeeded. But the second thread's load into r1 ran concurrently with the first thread's store into p->a, which might result in user-visible torn loads and stores, or just plain pre-initialization garbage.

This sort of software speculation is therefore forbidden.

Yes, hardware can get away with this sort of thing because it tracks cache state. If a compiler wishes to generate code that executes speculatively, it must instead use something like hardware transactional memory, which typically has overhead that overwhelms any possible benefit.

Code Standards

The Documentation/RCU/rcu_dereference.rst file presents the Linux-kernel's code standards for the address dependencies headed by members of the rcu_dereference() API family. A summary of the most widely applicable of these standards is as follows:

  1. An address dependency must be headed by an appropriate member of the rcu_dereference() API family. The variables holding the return value from a member of this API family are said to be carrying a dependency.
  2. In the special case where data is added and never removed, READ_ONCE() can be substituted for one of the rcu_dereference() APIs.
  3. Address dependencies are carried by pointers only, and specifically not by integers. (The exception is that integer operations may be used to set, clear, and XOR bits in the pointers, which requires those pointers to be converted to integers, have their bits manipulated, and then be converted immediately back to pointers. A sketch of this idiom appears below this list.)
  4. Operations that cancel out all the bits in the original pointer break the address dependency.
  5. Comparing a dependency-carrying pointer to the address of a statically allocated variable can break the dependency chain. (Though there are special rules that allow such comparisons to be carried out safely in some equally special cases.)
  6. Special operations on hardware instruction caches may be required when using pointers to JITed functions.

The Documentation/RCU/rcu_dereference.rst file provides much more detail, which means that scanning the above list is not a substitute for reading the full file.
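To make rule 3 concrete, here is a hedged kernel-style sketch (struct foo, gp, do_something_with(), and the 0x1 tag bit are all hypothetical; only rcu_dereference() is the real API):

struct foo *p, *q;
unsigned long bits;

p = rcu_dereference(gp);      /* heads the address dependency */
bits = (unsigned long)p;      /* pointer to integer... */
bits |= 0x1;                  /* ...set a tag bit... */
bits &= ~0x1UL;               /* ...clear it again... */
q = (struct foo *)bits;       /* ...and immediately back to a pointer */
do_something_with(q->a);      /* still carries the dependency */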

Enforcing Code Standards

My fond hope in the past was that compilers would have facilities that disable the optimizations requiring the code standards, but that effort seems likely to require greater life expectancy than I can bring to bear. That said, I definitely encourage others to try their hands at this.

But in the meantime, we need a way to enforce these code standards.

One approach is obviously code review, but it would be good to have automated help for this.

And Paul Heidekrüger presented a prototype tool at the 2022 Linux Plumbers Conference. This tool located several violations of the rule against comparing dependency-carrying pointers against the addresses of statically allocated variables.

Which suggests that continued work on such tooling could be quite productive.

Summary

So memory_order::consume is likely to go away, as is its counterpart in the C standard. This is not an immediate problem because all known implementations simply map memory_order::consume to memory_order::acquire, with those who care using other means to head address dependencies. (In the case of the Linux kernel, volatile loads.)

However, this does leave those who care with the issue of checking code using things like rcu_dereference(), given that the language standards are unlikely to provide any help any time soon.

Continued work on tooling that checks the handling of dependency chains in the object code therefore seems like an eminently reasonable step forward.

I recently noted that there are 11 ways to wait for RCU, and this note was duly noted by others. Please note that this is just vanilla RCU, not counting SRCU and the various flavors of Tasks RCU.

But what exactly are these ways?

Let Me List The Ways

The easy way to answer this is to look at rcutorture's module parameters:

  1. gp_normal waits asynchronously for normal grace periods, as in call_rcu().
  2. gp_sync waits synchronously for normal grace periods, as in synchronize_rcu().
  3. gp_exp waits synchronously for expedited grace periods, as in synchronize_rcu_expedited().
  4. gp_cond waits conditionally and synchronously for normal grace periods, as in: c = get_state_synchronize_rcu(); do_something(); cond_synchronize_rcu(c);
  5. gp_cond_exp waits conditionally and synchronously for expedited grace periods, as in: c = get_state_synchronize_rcu(); do_something(); cond_synchronize_rcu_expedited(c);
  6. gp_cond_full waits conditionally and synchronously with a full-sized “cookie” for normal grace periods, as in: get_state_synchronize_rcu_full(&c); do_something(); cond_synchronize_rcu_full(&c);
  7. gp_cond_exp_full waits conditionally and synchronously with a full-sized “cookie” for expedited grace periods, as in: get_state_synchronize_rcu_full(&c); do_something(); cond_synchronize_rcu_expedited_full(&c);
  8. gp_poll poll-waits for normal grace periods, as in: c = start_poll_synchronize_rcu(); do { do_something(); } while (!poll_state_synchronize_rcu(c));
  9. gp_poll_exp poll-waits for expedited grace periods, as in: c = start_poll_synchronize_rcu_expedited(); do { do_something(); } while (!poll_state_synchronize_rcu(c));
  10. gp_poll_full poll-waits with a full-sized “cookie” for normal grace periods, as in: start_poll_synchronize_rcu_full(&c); do { do_something(); } while (!poll_state_synchronize_rcu_full(&c));
  11. gp_poll_exp_full poll-waits with a full-sized “cookie” for expedited grace periods, as in: start_poll_synchronize_rcu_expedited_full(&c); do { do_something(); } while (!poll_state_synchronize_rcu_full(&c));

And that makes 11 of them, just as advertised!

In case you are curious, by default, rcutorture uses all eleven, making a random choice on each grace period. These module parameters can be used to restrict rcutorture's attention to the specified grace-period methods.
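For those who would rather see this in code, here is a hedged kernel-context sketch corresponding to the gp_poll_full case above (do_other_work() is a hypothetical stand-in; the RCU APIs are the real ones from the list):

#include <linux/rcupdate.h>

static void do_other_work(void);  /* hypothetical */

static void wait_for_gp_by_polling(void)
{
	struct rcu_gp_oldstate cookie;

	start_poll_synchronize_rcu_full(&cookie);  /* start a grace period */
	do {
		do_other_work();
	} while (!poll_state_synchronize_rcu_full(&cookie));
	/* All readers that began before the cookie was taken have now finished. */
}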

But Wait! There Are More!!!

There is a call_rcu_hurry() for kernels built with both CONFIG_RCU_LAZY=y and CONFIG_RCU_NOCB_CPU=y. These Kconfig options are used in kernels built for some battery-powered devices, and they make call_rcu() lazy, meaning that a callback might wait up to 10 seconds before its grace period starts. Code that is in more of a hurry invokes call_rcu_hurry() in order to avoid this wait.
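As a hedged kernel-context sketch (struct foo, foo_reclaim_cb(), and foo_retire() are hypothetical), the only difference from ordinary asynchronous use is the name of the function queueing the callback:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	struct rcu_head rh;
	int a;
};

static void foo_reclaim_cb(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct foo, rh));
}

static void foo_retire(struct foo *p)
{
	/* On CONFIG_RCU_LAZY=y kernels, plain call_rcu() might defer this
	 * callback's grace period for up to 10 seconds. */
	call_rcu_hurry(&p->rh, foo_reclaim_cb);
}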

But shouldn't rcutorture also test call_rcu_hurry()?

It should and it does, for example, when doing callback flooding where a 10-second delay might result in OOM.

One might argue that queue_rcu_work() is yet another way of waiting asynchronously for a normal grace period, and from the user's point of view it most certainly is. But it invokes call_rcu_hurry() internally, and furthermore is implemented within workqueues.

One might further argue that kfree_rcu() and kfree_rcu_mightsleep() are also ways of waiting for a normal grace period. This argument quickly becomes philosophical: Is the user really waiting for that kfree() to happen? Maybe? Besides, Uladzislau Rezki recently posted a patch, pulled in by Vlastimil Babka, that moves these two functions to the slab allocator. Not only that, but these two functions use the aforementioned queue_rcu_work() function, which in turn uses call_rcu_hurry().

Therefore, rcutorture's testing of call_rcu_hurry() already covers all three of these functions.

How Many Are There Really???

Twelve!!!

Eleven that must be tested separately, and a twelfth in the form of call_rcu_hurry() that is exercised by rcutorture's callback flooding, queueing of callbacks from timer handlers, self-propagating callbacks, rcu_barrier() testing, and debug-objects testing.

Counting up the ways to wait for SRCU and the various Tasks RCU flavors is left as an exercise for the interested reader.

In July of 2023, a routine rcutorture run reported a too-short grace period while running one of 15 TREE03 scenarios, each of which ran for 12 hours. This of course represents a serious bug in RCU. I therefore spent significant time attempting to reproduce this bug, with absolutely no success. Inspecting the diagnostics and the code was no more helpful. I therefore recorded my experiences here and got on with life.

This bug did not happen again.

Until August of 2024. Additional attempts to reproduce this bug once again proved fruitless, as documented here.

Until the very next month, when my attempts to track down some non-RCU bugs in a particularly troublesome -next release led me to modify TREE03's configuration. This effort was quite successful, so much so that it eventually (and unintentionally) reproduced the TREE03 bug. This situation illustrates the fact that tests have no loyalty: They will just as happily find your bugs as they will someone else's, especially in this case, where rcutorture was expressly designed to find my bugs.

The failure rate was about one instance per two hours across 400 TREE03 instances, each of which used four CPUs rather than the usual 16. Although this is way better than once per year, validating a fix would require many hundreds of hours of testing. Increasing the failure rate was thus of great importance, and to this end I undertook a series of experiments over two months' time. These experiments are summarized below, with additional details available here.

Increasing Failure Rate

When rcutorture encounters a too-short grace period, it prints out the segments of the RCU reader that detected it:

Failure/close-call rcutorture reader segments:
 0: 0x8
 1: 0x1
 2: 0x60
 3: 0x9
 4: 0x60
 5: 0x8
 6: 0x8
 7: 0x60 300ms
 8: 0x38

The 0x60 expands to RCUTORTURE_RDR_RCU_1 | RCUTORTURE_RDR_RCU_2, which indicates an RCU reader segment protected by nested calls to rcu_read_lock(), this particular one containing a 300-millisecond delay. Such an RCU reader would of course be prone to preemption, which suggested creating an rcutorture facility to randomly preempt rcutorture's kthreads. The new preempt_duration module parameter controls the duration of the preemption, which by default occurs every second. Specifying rcutorture.preempt_duration=10 doubled the failure rate to about once per hour per 400 TREE03 guest OSes. A nice improvement, but much more is needed to de-heisenbug this heisenbug.

Runs without CPU hotplug never reproduced the too-short grace periods. Reducing the interval between successive CPU-hotplug operations had little effect, perhaps due to the fact that the default interval is only 200 milliseconds and that CPU-hotplug operations can take some time to complete.

Running a non-preemptible kernel resulted in about the same rate of too-short grace periods as did the TREE03 default preemptible kernel.

It turns out that RCU has no fewer than 11 ways to wait for a grace period, a fact that would have shocked my year-2000 self. But here we are. And further experimentation showed that restricting the grace-period type to expedited polled grace periods with full-sized state (using the rcutorture.gp_poll_exp_full=1 kernel boot parameter) in a non-preemptible kernel resulted in about eight failures per hour per 400 TREE03 guest OSes. Furthermore, enabling preemption increased this to about 13 failures per hour. These experiments also decreased the polling interval from a random number of nanoseconds ranging up to 128 microseconds to a random number of nanoseconds ranging up to 16 microseconds.

Code Inspections

This suggested focusing code-inspection efforts on expedited grace periods, but this proved fruitless. Furthermore, runs using the rcutorture.gp_exp=1 kernel parameter (which forces rcutorture to use only the synchronize_rcu_expedited() function) failed to reproduce the too-short grace periods. Runs using rcutorture.fwd_progress=0 to disable rcutorture's call_rcu() flooding increased the failure rate to 17 failures per hour per 400 TREE03 guest OSes, which is a failure rate that just might be evidence of successful de-heisenbugization.

All of the evidence thus far clearly pointed the finger of suspicion at the interaction of full-sized polling and expedited grace periods. Code inspections resulted in several changes, all but one of which was a legitimate improvement in the code, but none of which decreased the failure rate significantly. (The illegitimate non-improvement was a case of me noting that initialization of expedited grace periods must complete fully before the first quiescent state may be reported safely. But my earlier self had noticed this same requirement and had fulfilled it using workqueue flushes, a subtlety that initially escaped my current self. Perhaps comments should be upgraded? Or perhaps my memory?)

Added Diagnostics Provide Contradictory Evidence

Having come up dry, but having increased the failure rate enough to support additional diagnostics, it was time to add more information to rcutorture's RCU reader-segment printout. This resulted in some entertaining blind alleys resulting from bugs in my added instrumentation, the most entertaining of which made it look like the scheduler was preempting non-preemptible regions of code. Once these were found and fixed, the scheduler was (of course) exonerated and rcutorture was shown to be correctly creating the chains of RCU readers.

The reader detecting the too-short grace period then produced end-of-test output like this:

Failure/close-call rcutorture reader segments:
 0:  0x2 CPU  3 ... ga1:e60:p08-ga1:e60:p08 IRQ
 1: 0x20 CPU  3 ... ga1:e60:p08-ga1:e60:p08 RCU_1
 2:  0x8 CPU  3 ... ga1:e60:p08-ga1:e60:p08 RBH
 3:  0x4 CPU  3 ... ga1:e60:p08-ga1:e60:p08 PREEMPT
 4: 0x10 CPU  3 ... ga1:e60:p08-ga1:e60:p08 SCHED
 5:  0x1 CPU  3 ... ga1:e60:p08-ga1:e60:p08 BH
 6:  0x8 CPU  3 ... ga1:e60:p08-ga1:e60:p08 RBH
 7: 0x60 CPU  3->0  ga1:e60:p08-ga5:e61:p09 300ms preempted RCU_1 RCU_2
 Reader was preempted.

This enhanced output has new columns that track:

  1. The CPU the reader runs on, starting with CPU 3 and at the end migrating to CPU 0.
  2. The bottom bits of the normal, expedited, and small-size polled grace-period sequence numbers. One normal grace period ended and a new one started, and a new expedited and small-size-polled grace period started. No full grace period elapsed, yet this reader still detected a too-short grace period.
  3. A human-readable expansion of the initial hexadecimal number showing reader state, which progresses from:
    • interrupts disabled to
    • a single level of rcu_read_lock() to
    • rcu_read_lock_bh() to
    • preempt_disable() to
    • rcu_read_lock_sched() to
    • local_bh_disable() to
    • another round of rcu_read_lock_bh() and finally to
    • a pair of nested instances of rcu_read_lock().

The final line indicates that the reader was preempted, but we already knew that given the migration from CPU 3 to CPU 0.

This nonsensical contradiction bears repeating: The reader claims that the updater saw a too-short grace period, but the reader itself did not see a full grace period elapse. Adding delays in the updater after it detects the end of the grace period did not help, at least not until these delays were many hundreds of milliseconds. Which means nothing, because then the delay itself is longer than the readers. Note well that direct use of synchronize_rcu_expedited() never resulted in too-short grace periods, which provides an additional level of confusion.

This situation clearly calls for additional debugging, but adding debugging slows down the failure rate for this bug. For but one example, adding raw_smp_processor_id() to the reader in order to record the current CPU reduced the failure rate by about a factor of two.

It is therefore clearly necessary to extract every shred of information from the debugging that is already present.

Smoking CPU-Hotplug Operations

To that end, when the reader detects the too-short grace period, it dumps the trace buffer to the console, even in the case where nothing is being traced. This means that there will be a Dumping ftrace buffer message on the console at about the time that the too-short grace period was detected. The last few lines of console output prior to this are as follows, this time with timestamps:

[ 1091.544531] rcu-torture: rcu_torture_boost started
[ 1092.100819] rcu-torture: Stopping rcu_torture_boost task
[ 1092.101659] rcu-torture: rcu_torture_boost is stopping
[ 1092.115219] smpboot: CPU 3 is now offline
[ 1092.536276] rcu-torture: Stopping rcu_torture_boost task
[ 1092.536938] rcu-torture: rcu_torture_boost is stopping
[ 1092.569261] smpboot: CPU 1 is now offline
[ 1092.828577] smpboot: Booting Node 0 Processor 3 APIC 0x3
[ 1092.842975] rcu-torture: Creating rcu_torture_boost task
[ 1092.843895] rcu-torture: rcu_torture_boost started
[ 1093.201384] Dumping ftrace buffer:

Note that the CPU that the reader is running on, CPU 3, came online just before the start of the reader that inexplicably reported the too-short grace period. Of course, each rcutorture guest OS has only four CPUs, and only three of them can be taken offline (CPUs 1-3), so this could be random chance.

Except that this was always the case, across more than 100 too-short grace periods. And the value of one-third to the 100th power represents a very low probability indeed, so it seems safe to assume that this correlation represents some form of causation, obscure though that causation might be. Especially given that the bug does not appear in the absence of CPU-hotplug operations.

This realization prompted another inspection of RCU's expedited grace period code and its relation to CPU hotplug, pointing the finger of suspicion at the sync_sched_exp_online_cleanup() function. I spent an embarrassingly large amount of time on this before reminding myself that this function is in fact a no-op in the preemptible kernels that I was testing.

Further Increasing Failure Rate

I also got annoyed at the RCU CPU stall warnings that were instigated by the combination of systems whose guest OSes booted slowly and the testing of RCU priority boosting. I therefore disabled testing of RCU priority boosting. To my delight this actually increased the failure rate of these phantom too-short grace periods.

Thus encouraged, I did more experiments, learning that building the kernel with CONFIG_RCU_FANOUT_LEAF=4 increased the failure rate by more than a factor of two. This was a surprise, given that the effect was to move from a three-rcu_node-structure combining tree to a singleton tree having only the one rcu_node structure. In the past, combining trees having fewer rcu_node structures have reduced race-condition failure rates, but this was clearly an exception to that rule.

At this point, with instrumentation disabled, the bug was reproducing about 50 times per hour across 400 instances of TREE03 guest OSes, which is an increase of more than five orders of magnitude over the initial once-per-year failure rate. This meant that I could use a lot more debugging code!

Smoking Polled Grace Periods

Which resulted in the following trace output, which consists of three lines, some or all of which your browser is likely to have line-wrapped:

rcu_pree-15        0d..1. 3170054244us : rcu_grace_period: rcu_preempt 389665 start
rcu_tort-60        0..... 3170054253us : rcu_torture_writer: rcu_torture_writer Polled grace period started 0x5f224/0xcc368.
rcu_tort-60        0..... 3170054276us : rcu_torture_writer: rcu_torture_writer Polled grace period ended 0x5f224/0xcc368.

The first line (rcu_pree-15) marks the start of grace period 389665, also known as 0x5f221, whose end is 0x5f224. This matches the value produced by the trace_printk() statements producing the two Polled grace period messages (both rcu_tort-60). Yet there is no trace of the end of this grace period, nor is there any sign of an expedited grace period. Therefore, the updater is somehow ending the grace period too soon.
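In case you would like to decode such numbers yourself, the low-order bits of a ->gp_seq value record the grace period's phase (non-zero meaning one is in flight) and the remaining bits count grace periods. Here is a user-space sketch, assuming the mainline value of RCU_SEQ_CTR_SHIFT (2) and the usual end-of-grace-period arithmetic:

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT  2
#define RCU_SEQ_STATE_MASK ((1UL << RCU_SEQ_CTR_SHIFT) - 1)

int main(void)
{
	unsigned long gp_seq = 0x5f221;	/* grace period 389665, in progress */

	printf("counter = %lu, phase = %lu\n",
	       gp_seq >> RCU_SEQ_CTR_SHIFT, gp_seq & RCU_SEQ_STATE_MASK);
	/* Value recorded when this grace period ends: 0x5f224. */
	printf("end value = 0x%lx\n",
	       (gp_seq & ~RCU_SEQ_STATE_MASK) + RCU_SEQ_STATE_MASK + 1);
	return 0;
}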

This points the finger of suspicion at the poll_state_synchronize_rcu_full() function that the updater uses to detect the end of the grace period. This function is as follows:

 1 bool poll_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 2 {
 3   struct rcu_node *rnp = rcu_get_root();
 4
 5   smp_mb(); 
 6   if (rgosp->rgos_norm == RCU_GET_STATE_COMPLETED || 
 7       rcu_seq_done_exact(&rnp->gp_seq, rgosp->rgos_norm) || 
 8       rgosp->rgos_exp == RCU_GET_STATE_COMPLETED || 
 9       rcu_seq_done_exact(&rcu_state.expedited_sequence, rgosp->rgos_exp)) {
10     smp_mb(); 
11     return true;
12   }
13   return false;
14 }

Line 7 looks just plain wrong. To see why, recall that the ->gp_seq field is replicated in the rcu_state, rcu_node, and rcu_data structures in order to keep both lock and memory contention down to a dull roar. There is one rcu_node structure in this configuration and rcu_data is a per-CPU structure, so there is a total of six copies of ->gp_seq. The source of truth is the one in the rcu_state structure, that is, rcu_state.gp_seq, which is updated first at the beginning of each grace period and last at each grace period's end. So why doesn't line 7 use rcu_state.gp_seq? Sure, there is only one copy of this field, but the same is true of the root rcu_node structure's ->gp_seq field accessed by line 7.

The same question could be asked about the just-as-wrong-looking line 6 of get_state_synchronize_rcu_full():

 1 void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 2 {
 3   struct rcu_node *rnp = rcu_get_root();
 4
 5   smp_mb();
 6   rgosp->rgos_norm = rcu_seq_snap(&rnp->gp_seq);
 7   rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence);
 8 }

In fact both lines looked so wrong that I changed them both (without even bothering to read either function's header comment) and reran the test. There were no too-short grace periods.

Success?

Well, not until there is an explanation of why this makes a difference. After all, there is strong ordering between the updates of the rcu_node structures' ->gp_seq fields and all the quiescent states, so why should it really make a difference? And perhaps it is not a true fix, but rather an unrelated change that just happens to greatly reduce the probability of too-short grace periods, in which case, the change is doing nothing but impeding progress.

Turn to Tracing

I reverted the change to the get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions, and then enabled yet more tracing, resulting in the following redacted trace messages:

rcu_pree-15        0d..1. 816764514us : rcu_grace_period: rcu_preempt 90289 start
 cpuhp/3-34        3..... 816772336us : rcu_exp_grace_period: rcu_preempt 0x2f0e8 online
rcu_tort-59        0..... 816807835us : rcu_torture_writer: rcu_torture_writer Polled grace period started 0x160b4/0x2f0ec.
rcu_pree-15        0d..1. 816810514us : rcu_grace_period: rcu_preempt 90289 startroot
rcu_pree-15        0d..1. 816810515us : rcu_grace_period: rcu_preempt 90289 cpustart
rcu_pree-15        0d..1. 816810516us : rcu_grace_period: rcu_preempt 90289 cpuqs
rcu_pree-15        0d..1. 816811522us : rcu_grace_period: rcu_preempt 90289 endroot 
rcu_pree-15        0d..1. 816811523us : rcu_grace_period: rcu_preempt 90289 cpuend
rcu_tort-59        0..... 816811529us : rcu_torture_writer: rcu_torture_writer Polled grace period ended 0x160b4/0x2f0ec.
rcu_pree-15        0d..1. 816842554us : rcu_grace_period: rcu_preempt 90289 end

The first trace message (rcu_pree-15) marks the official start of grace period 90289 (also known as 0x160b1), recorded in rcu_state.gp_seq. The ->gp_seq value for the end of this grace period is 0x160b4.

Some time later, the second trace message (cpuhp/3-34) records CPU 3 coming online. This CPU came online after the grace period started, so the grace period is not obligated to wait on it.

Later still, the third trace message (rcu_tort-59) marks the start of the polled grace period from the updater's perspective. The next five trace messages (rcu_pree-15) record:

  1. The update of the root rcu_node structure's ->gp_seq field to reflect the start of the grace period.
  2. CPU 0 noticing that the grace period had started.
  3. CPU 0 reporting its quiescent state.
  4. The update of the root rcu_node structure's ->gp_seq field to reflect the end of the grace period.
  5. CPU 0 noticing that the grace period had ended.

The second-to-last trace message (rcu_tort-59) marks the end of the polled grace period from the updater's perspective. And the final trace message (rcu_pree-15) marks the official end of this grace period, recorded in rcu_state.gp_seq.

Anatomy of a Normal RCU Grace Period

The above trace messages report on important transitions in RCU's grace-period processing, and this section puts those transitions into the context of the entire grace period. The relevant phases of grace-period processing are as follows:

  1. The very beginning of a grace period, which updates rcu_state.gp_seq, reported by the rcu_pree-15 trace message above ending in start.
  2. Accounting for recent CPU-hotplug operations. Again, please note that an RCU grace period need not wait for readers running on a CPU that came online after that grace period started, so this accounting need be done only at the start of each grace period.
  3. Initializing the rcu_node combining tree to prepare for quiescent states from any CPUs and tasks that this grace period must wait on, which updates each rcu_node structure's ->gp_seq field. This phase is reported by the rcu_pree-15 trace message above ending in startroot. (But note that this trace message is not in mainline, and probably won't ever be.)
  4. The note_gp_changes() function notices the starts and ends of grace periods from the viewpoint of each CPU that was online at the start of the grace period (in this case, only CPU 0). These events are reported by the rcu_pree-15 trace messages above ending in cpustart and cpuend.
  5. Quiescent states are detected by a chain of events initiated by rcu_qs(), and are reported by the rcu_pree-15 trace messages above ending in cpuqs.
  6. Idle CPUs are handled specially, but this did not happen in the above trace.
  7. Once all relevant CPUs and tasks have reported quiescent states, the cleanup phase starts which updates each rcu_node structure's ->gp_seq field. This is reported by the rcu_pree-15 trace message above ending in endroot. (But note that this trace message is not in mainline, and probably won't ever be.)
  8. Finally, the entire grace period comes to an end, which updates rcu_state.gp_seq, reported by the rcu_pree-15 trace message above ending in end.

The full story is a bit more complex, but this provides enough context for the bug at hand.

Turning Back to Tracing

As noted earlier, all the quiescent states happened between the two updates of the root rcu_node structure's ->gp_seq field, so all should be well, right?

Unfortunately, no. A reader noted a too-short grace period, and here is the corresponding console log, again with lines likely wrapped:

Failure/close-call rcutorture reader segments:
 816805685us 0: 0x10 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  SCHED
 816805685us 1: 0x10 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  SCHED
 816805685us 2:  0x4 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  PREEMPT
 816805685us 3:  0x8 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  RBH
 816805685us 4:  0x4 CPU  3 ... g60b1:e02f0e4:p22f4-g60b5:e02f0e5:p22f5 300ms PREEMPT
 817105825us 5: 0x60 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted RCU_1 RCU_2
 817105851us 6:  0x4 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted PREEMPT
 817105852us 7:  0x4 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted PREEMPT
 817105852us 8: 0x20 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted RCU_1
 Reader was preempted.

Note that it was CPU 3 that came online, and it was also CPU 3 that noticed the too-short grace period. This is the key clue: CPU 3 came online between the time that the grace period officially started and the start of the updater, so CPU 3 might well have started a reader before the updater started. Although the grace period did not need to wait for that reader, the updater most certainly did.

Reaching Root Cause

And this is why the aforementioned update to the get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions suppressed the bug. With the get_state_synchronize_rcu_full() function reading from rcu_state.gp_seq, the updater would see that the current grace period had already started, and would thus wait until the end of the next grace period, which could not end until some time after CPU 3's reader finished.

Finding a Fix

Unfortunately, the corresponding change to the poll_state_synchronize_rcu_full() function breaks this guarantee:

get_state_synchronize_rcu_full(&rgos);
synchronize_rcu();
WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));

The problem is that callbacks can be invoked (including those that awaken tasks blocked in synchronize_rcu()) immediately after the root rcu_node structure's ->gp_seq field has been updated, and thus long before rcu_state.gp_seq has been updated to record the grace period's end. Therefore, if the poll_state_synchronize_rcu_full() function is changed to instead check rcu_state.gp_seq, the above WARN_ON_ONCE() can trigger.

Which would have been immediately obvious had I read the poll_state_synchronize_rcu_full() function's header comment. So if I wasn't going to read these comments, why did I bother writing them? ;–)

Implementing a Fix

The first step is to fix line 6 of get_state_synchronize_rcu_full() to access rcu_state.gp_seq (instead of the root rcu_node structure's ->gp_seq field):

 1 void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 2 {
 3   struct rcu_node *rnp = rcu_get_root();
 4
 5   smp_mb();
 6   rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
 7   rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence);
 8 }

In real life, we must of course also get rid of the now-unused rnp local variable, but that would mess up the line numbering.

But running with this patch results in too-short grace periods, and not just once per minute per 400 TREE03 instances, but instead once every few seconds in each and every instance!

The problem is that both get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() might run after the grace period starts (thus having incremented rcu_state.gp_seq) but before the grace period has finished accounting for recent CPU-hotplug operations (thus having not yet incremented the root rcu_node structure's ->gp_seq field). This easy-to-trigger situation will cause poll_state_synchronize_rcu_full() to conclude that the grace period ended an extremely long time ago, resulting in a too-short grace period from the perspective of any concurrent readers.

The fix is to adjust the rcu_seq_done_exact() function to take a more liberal view of what constitutes an ongoing grace period, as in the following diff:

 1 diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
 2 index 2f9c9272cd486..d2a91f705a4ab 100644
 3 --- a/kernel/rcu/rcu.h
 4 +++ b/kernel/rcu/rcu.h
 5 @@ -162,9 +162,9 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
 6  {
 7    unsigned long cur_s = READ_ONCE(*sp);
 8
 9 -  return ULONG_CMP_GE(cur_s, s) ||
10 -    ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
11 +  return ULONG_CMP_GE(cur_s, s) ||
12 +    ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
13  }
14
15  /*

The change from 2 to 3 does the trick. The corresponding RFC patch may be found here.
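To see the arithmetic in action, consider a user-space sketch with hypothetical values matching the scenario above: the previous grace period ended with ->gp_seq equal to 8, the new grace period has advanced rcu_state.gp_seq to 9, the root rcu_node structure's ->gp_seq still reads 8, and rcu_seq_snap() of 9 yields a cookie of 16. (The ULONG_CMP macros below are simplified stand-ins for the kernel's wraparound-safe versions.)

#include <stdio.h>

#define RCU_SEQ_STATE_MASK 0x3UL

/* Simplified stand-ins for the kernel's wraparound-safe comparisons. */
#define ULONG_CMP_GE(a, b) ((long)((a) - (b)) >= 0)
#define ULONG_CMP_LT(a, b) ((long)((a) - (b)) < 0)

static int seq_done_exact(unsigned long cur_s, unsigned long s, unsigned long n)
{
	return ULONG_CMP_GE(cur_s, s) ||
	       ULONG_CMP_LT(cur_s, s - (n * RCU_SEQ_STATE_MASK + 1));
}

int main(void)
{
	unsigned long root_gp_seq = 8;	/* root rcu_node ->gp_seq, not yet updated */
	unsigned long cookie = 16;	/* snap taken from rcu_state.gp_seq == 9 */

	printf("guard band of 2: done = %d (wrongly complete)\n",
	       seq_done_exact(root_gp_seq, cookie, 2));
	printf("guard band of 3: done = %d (correctly still pending)\n",
	       seq_done_exact(root_gp_seq, cookie, 3));
	return 0;
}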

Evaluation

With this fixed, many tens of hours of 400 instances of TREE03 ran without any too-short grace periods. Now on to the bugs that I was chasing a couple of months ago...

Lessons (Re)Learned

This effort took a couple of months, so it is well worth reflecting on the experience to see what might be done better next time.

First, much time was consumed increasing the failure rate (de-heisenbugging) from about once per year to about once every few hours, then to about once every three minutes, and finally to about once per minute. Things would have gone faster had I put more time in during the first phase of de-heisenbugging. For example, once I noted that CPU-hotplug operations were involved, I should have experimented with rcutree's gp_preinit_delay, gp_init_delay, and gp_cleanup_delay module parameters, which insert delays that can increase the probability of races between CPU-hotplug operations and grace-period processing. On the other hand, it is all too easy to spend forever de-heisenbugging and never get around to actually finding the bug, and in this particular case, a large leap of imagination would have been required to get from the expedited grace periods that seemed to be causing trouble to the normal grace periods that were actually doing so. Either way, it was necessary to greatly increase the failure rate in order to fix the failure.

Second, much time was wasted due to my erroneous assumption that the bug involved the expedited grace period, when it was really the interaction between the normal grace period and the full-sized polled API. Unfortunately, it would have been difficult to check that assumption much before the last stage of de-heisenbugging.

Third, I could have saved a few days by more carefully checking my initial instrumentation. Debug code is also code, and thus can also contain bugs!

Fourth, I could have saved a day or two by reading the poll_state_synchronize_rcu_full() function's header comment.

Fifth, although symmetry plays an important role in human aesthetics and perception of elegance, in this case, the symmetry of the ->gp_seq accesses in the get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions was in fact the bug.

Sixth and finally, I created this bug in July of 2022. Those promoting various validation and verification techniques therefore had more than two years to find this bug, which was in a very public code base. Perhaps next time they will find an RCU bug before I do. ;–)

For More Heisenbug-Hunting Information...

Many people have held forth on this topic, but these references should get you started:

  1. Section 11.6 (“Probability and Heisenbugs”) of Is Parallel Programming Hard, And, If So, What Can You Do About It?, with emphasis on Section 11.6.4 (“Hunting Heisenbugs”).
  2. The “Hunting Heisenbugs” talk at the 2023 Linux Plumbers Conference.
  3. The similar “Hunting Heisenbugs” talk at Everything Open 2024.

It is quite easy to make your email agent (mutt in my case) send directly to mail.kernel.org, but this can result in inconvenient delays when you have a low-bandwidth Internet connection, and, worse yet, abject failure when you have no Internet connection at all. (Yes, I was born before the turn of the millennium. Why do you ask?)

One way to avoid these problems is to set up an email server on your laptop. This email server will queue your email when Internet is slow or completely inaccessible, and will automatically transmit your email as appropriate. There are several email servers to choose from, but I chose postfix.

First, you need to install postfix, for example, sudo apt install postfix. I have generally selected the satellite option when it asks, but there seem to be a number of different opinions expressed on different web sites.

You need to tell postfix to talk to mail.kernel.org in your /etc/postfix/main.cf file:

relayhost = [mail.kernel.org]:587

I followed linode's advice on setting up encryption, passwords, and so on, by tacking the following onto the end of my /etc/postfix/main.cf file:

# enable SASL authentication
smtp_sasl_auth_enable = yes
# disallow methods that allow anonymous authentication.
smtp_sasl_security_options = noanonymous
# where to find sasl_passwd
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
# Enable STARTTLS encryption
smtp_use_tls = yes
# where to find CA certificates
smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt

But this requires setting up /etc/postfix/sasl_passwd with a single line:

[mail.kernel.org]:587 paulmck:fake-password

You will need to replace my kernel.org email and password with yours, obviously. Then you need to create the /etc/postfix/sasl_passwd.db file:

sudo postmap /etc/postfix/sasl_passwd

There was some difference of opinion across the Internet as to what the ownership and permissions of /etc/postfix/sasl_passwd and /etc/postfix/sasl_passwd.db should be, with some people arguing for maximum security via user and group both being set to root and permissions set to 0600. In my case, this was indeed secure, so much so that postfix failed to transmit any of my email, making some lame complaint about being unable to read /etc/postfix/sasl_passwd.db. Despite the compelling security benefits of this approach, I eventually elected to use user and group of postfix and permissions of 0640 for /etc/postfix/sasl_passwd.db, intentionally giving up a bit of security in favor of email actually being transmitted. :–)

I left /etc/postfix/sasl_passwd with user and group of root and mode of 0600. Somewhat pointlessly, given that its information can be easily extracted from /etc/postfix/sasl_passwd.db by anyone with permission to read that file.

And anytime you change any postfix configuration, you need to tell postfix about it, for example:

sudo postfix reload

You might well need to make other adjustments to your postfix configuration. To that end, I strongly suggest testing your setup by sending a test email or three! ;–)

The mailq command will list queued email along with the reason why it has not yet been transmitted.

The sudo postfix flush command gives postfix a hint that now would be an excellent time for it to attempt to transmit queued email, for example, when you have an Internet connection only for a short time.

Back in the day, CPU hotplug was an optional and rarely used feature of Linux kernels built for SMP systems. However, increasing numbers of popular CPU families prohibit building Linux kernels with both CONFIG_HOTPLUG_CPU=n and CONFIG_SMP=y. So much so that I no longer have access to a system that supports mainline kernels built with this combination of Kconfig options. This means that I have been putting up with rcutorture output containing :CONFIG_HOTPLUG_CPU: improperly set errors, which is of course annoying, and will soon motivate me to rework the rcutorture test scenarios so as to eliminate this false positive. A side effect of this expected change is that rcutorture will no longer even try to test SMP kernels built without CPU hotplug.

Why prohibit this combination of Kconfig options? It turns out that some of the defenses against hardware side-channel attacks rely on CPU hotplug, so the CPU families vulnerable to such attacks are ensuring that CPU hotplug is available on all systems having at least two CPUs. Security is great, thus so far so good!

Unfortunately, this is a problem for you if you use a CPU family that permits CONFIG_HOTPLUG_CPU=n and CONFIG_SMP=y and you actually need to build kernels this way. Please keep in mind that there really have been RCU bugs that only manifest in such kernels. Given that I no longer have the means to test for these bugs, Murphy asserts that such bugs will quietly but fatally accumulate.

So what are your options? Here are a few:

  1. Rework your CPU family's Kconfig options to force CONFIG_HOTPLUG_CPU=y in any kernel build that enables CONFIG_SMP=y.
  2. Start running rcutorture on your hardware frequently enough to catch any bugs that might find their way into CONFIG_HOTPLUG_CPU=n and CONFIG_SMP=y configurations. I am happy to help you get started with this, which will require updating the rcutorture scripting to support your CPU family, defining additional rcutorture scenarios, and having someone (not me!) do the actual rcutorture runs on your hardware on a regular basis.
  3. Just ignore the problem in the fond hope that no such bugs will ever appear.

I believe that the first option is best, given that RCU is not the only Linux-kernel subsystem vulnerable to bugs specific to kernels built with CONFIG_HOTPLUG_CPU=n and CONFIG_SMP=y. But at the end of the day, it is your CPU family, so it is your decision!

Recently, a formal-verification researcher gave the expected verdict on software: “It is surprising that any software works!” I could not resist replying “You should be surprised that you work.” The researcher muttered something about evolution.

And in fact it has been observed that “Linux is evolution, not intelligent design”. From this viewpoint, review, testing, and other validation processes provide the Darwinian fitness function, eliminating unfit mutations from the “gene pool”. Which is good reason to continue working to improve these validation processes.

But there is another implication of this viewpoint. The patches that we developers sweat so hard over are nothing more or less than the software counterpart of random mutations.

But don't take my word for it. Ask your compiler!