
Hunting a TREE03 Heisenbug

In July of 2023, a routine rcutorture run reported a too-short grace period while running one of 15 TREE03 scenarios, each of which ran for 12 hours. This of course represents a serious bug in RCU. I therefore spent significant time attempting to reproduce this bug, with absolutely no success. Inspecting the diagnostics and the code was no more helpful. I therefore recorded my experiences here and got on with life.

This bug did not happen again.

Until August of 2024. Additional attempts to reproduce this bug once again proved fruitless, as documented here.

Until the very next month, when my attempts to track down some non-RCU bugs in a particularly troublesome -next release led me to modify TREE03's configuration. This effort was quite successful, so much so that it eventually (and unintentionally) reproduced the TREE03 bug. This situation illustrates the fact that tests have no loyalty: They will just as happily find your bugs as they will someone else's, especially in this case, where rcutorture was expressly designed to find my bugs.

The failure rate was about one instance per two hours across 400 TREE03 instances, each of which used four CPUs rather than the usual 16. Although this is way better than once per year, validating a fix would require many hundreds of hours of testing. Increasing the failure rate was thus of great importance, and to this end I undertook a series of experiments over two months' time. These experiments are summarized below, with additional details available here.

Increasing Failure Rate

When rcutorture encounters a too-short grace period, it prints out the segments of the RCU reader that detected it:

Failure/close-call rcutorture reader segments:
 0: 0x8
 1: 0x1
 2: 0x60
 3: 0x9
 4: 0x60
 5: 0x8
 6: 0x8
 7: 0x60 300ms
 8: 0x38

The 0x60 expands to RCUTORTURE_RDR_RCU_1 | RCUTORTURE_RDR_RCU_2, which indicates an RCU reader segment protected by nested calls to rcu_read_lock(), this particular one containing a 300-millisecond delay. Such an RCU reader would of course be prone to preemption, which suggested creating an rcutorture facility to randomly preempt rcutorture's kthreads. The new preempt_duration module parameter controls the duration of the preemption, which by default occurs every second. Specifying rcutorture.preempt_duration=10 doubled the failure rate to about once per hour per 400 TREE03 guest OSes. A nice improvement, but much more is needed to de-heisenbug this heisenbug.
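
As a purely illustrative sketch (the function name here is made up, and rcutorture's actual readers are driven by randomized segment tables rather than being open-coded like this), a 0x60 reader segment has roughly the following shape:

#include <linux/rcupdate.h>
#include <linux/delay.h>

/* Hypothetical example of a reader segment described by 0x60, that is,
 * RCUTORTURE_RDR_RCU_1 | RCUTORTURE_RDR_RCU_2, with a long in-reader
 * delay such as the 300 milliseconds shown above. */
static void example_nested_reader_segment(void)
{
  rcu_read_lock();   /* RCUTORTURE_RDR_RCU_1 */
  rcu_read_lock();   /* RCUTORTURE_RDR_RCU_2 */
  mdelay(300);       /* Long delay, so preemption is likely in preemptible kernels. */
  rcu_read_unlock();
  rcu_read_unlock();
}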

Runs without CPU hotplug never reproduced the too-short grace periods. Reducing the interval between successive CPU-hotplug operations had little effect, perhaps due to the fact that the default interval is only 200 milliseconds and that CPU-hotplug operations can take some time to complete.

Running a non-preemptible kernel resulted in about the same rate of too-short grace periods as did the TREE03 default preemptible kernel.

It turns out that RCU has no fewer than 11 ways to wait for a grace period, a fact that would have shocked my year-2000 self. But here we are. And further experimentation showed that restricting the grace-period type to expedited polled grace periods with full-sized state (using the rcutorture.gp_poll_exp_full=1 kernel boot parameter) in a non-preemptible kernel resulted in about eight failures per hour per 400 TREE03 guest OSes. Furthermore, enabling preemption increased this to about 13 failures per hour. These experiments also decreased the polling interval from a random number of nanoseconds ranging up to 128 microseconds down to a random number of nanoseconds ranging up to 16 microseconds.
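
For concreteness, here is a hedged sketch of the writer-side pattern that rcutorture.gp_poll_exp_full=1 restricts testing to, assuming the kernel's full-state polled grace-period API and with the polling loop standing in for rcutorture's randomized sleeps:

#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Sketch only: start an expedited grace period, then poll for its
 * completion using full-sized grace-period state.  rcutorture instead
 * sleeps for a random interval (now up to 16 microseconds) between polls. */
static void example_gp_poll_exp_full(void)
{
  struct rcu_gp_oldstate rgos;

  start_poll_synchronize_rcu_expedited_full(&rgos);
  while (!poll_state_synchronize_rcu_full(&rgos))
    cond_resched();
  /* A full grace period is supposed to have elapsed at this point. */
}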

Code Inspections

This suggested focusing code-inspection efforts on expedited grace periods, but this proved fruitless. Furthermore, runs using the rcutorture.gp_exp=1 kernel parameter (which forces rcutorture to use only the synchronize_rcu_expedited() function) failed to reproduce the too-short grace periods. Runs using rcutorture.fwd_progress=0 to disable rcutorture's call_rcu() flooding increased the failure rate to 17 failures per hour per 400 TREE03 guest OSes, which is a failure rate that just might be evidence of successful de-heisenbugization.

All of the evidence thus far clearly pointed the finger of suspicion at the interaction of full-sized polling and expedited grace periods. Code inspections resulted in several changes, all but one of which was a legitimate improvement in the code, but none of which decreased the failure rate significantly. (The illegitimate non-improvement was a case of me noting that initialization of expedited grace periods must complete fully before the first quiescent state may be reported safely. But my earlier self had noticed this same requirement and had fulfilled it using workqueue flushes, a subtlety that initially escaped my current self. Perhaps comments should be upgraded? Or perhaps my memory?)

Added Diagnostics Provide Contradictory Evidence

Having come up dry, but having increased the failure rate enough to support additional diagnostics, it was time to add more information to rcutorture's RCU reader-segment printout. This resulted in some entertaining blind alleys caused by bugs in my added instrumentation, the most entertaining of which made it look like the scheduler was preempting non-preemptible regions of code. Once these were found and fixed, the scheduler was (of course) exonerated and rcutorture was shown to be correctly creating the chains of RCU readers.

The reader detecting the too-short grace period then produced end-of-test output like this:

Failure/close-call rcutorture reader segments:
 0:  0x2 CPU  3 ... ga1:e60:p08-ga1:e60:p08 IRQ
 1: 0x20 CPU  3 ... ga1:e60:p08-ga1:e60:p08 RCU_1
 2:  0x8 CPU  3 ... ga1:e60:p08-ga1:e60:p08 RBH
 3:  0x4 CPU  3 ... ga1:e60:p08-ga1:e60:p08 PREEMPT
 4: 0x10 CPU  3 ... ga1:e60:p08-ga1:e60:p08 SCHED
 5:  0x1 CPU  3 ... ga1:e60:p08-ga1:e60:p08 BH
 6:  0x8 CPU  3 ... ga1:e60:p08-ga1:e60:p08 RBH
 7: 0x60 CPU  3->0  ga1:e60:p08-ga5:e61:p09 300ms preempted RCU_1 RCU_2
 Reader was preempted.

This enhanced output has new columns that track:

  1. The CPU the reader runs on, starting with CPU 3 and at the end migrating to CPU 0.
  2. The bottom bits of the normal, expedited, and small-size polled grace-period sequence numbers. One normal grace period ended and a new one started, and a new expedited and small-size polled grace period started. No full grace period elapsed, yet this reader still detected a too-short grace period.
  3. A human-readable expansion of the initial hexadecimal number showing reader state, which progresses from:
    • interrupts disabled to
    • a single level of rcu_read_lock() to
    • rcu_read_lock_bh() to
    • preempt_disable() to
    • rcu_read_lock_sched() to
    • local_bh_disable() to
    • another round of rcu_read_lock_bh() and finally to
    • a pair of nested instances of rcu_read_lock().

The final line indicates that the reader was preempted, but we already knew that given the migration from CPU 3 to CPU 0.

This nonsensical contradiction bears repeating: The reader claims that the updater saw a too-short grace period, but the reader itself did not see a full grace period elapse. Adding delays in the updater after it detects the end of the grace period did not help, at least not until these delays were many hundreds of milliseconds. Which means nothing, because then the delay itself is longer than the readers. Note well that direct use of synchronize_rcu_expedited() never resulted in too-short grace periods, which provides an additional level of confusion.
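
To make this contradiction concrete, here is a greatly simplified conceptual model of how an updater/reader pair can detect a too-short grace period. All names are hypothetical, and rcutorture's real machinery is considerably more elaborate:

#include <linux/rcupdate.h>
#include <linux/printk.h>

struct example_obj {
  int still_valid;
};

static struct example_obj __rcu *example_cur;

/* Reader: if the grace-period primitive does its job, this reader can
 * never see ->still_valid equal to zero. */
static void example_reader(void)
{
  struct example_obj *p;

  rcu_read_lock();
  p = rcu_dereference(example_cur);
  if (p && !READ_ONCE(p->still_valid))
    pr_err("Too-short grace period detected by reader!\n");
  rcu_read_unlock();
}

/* Updater: publish a new object, wait for a grace period, then mark the
 * old object invalid.  Any reader still seeing the old object afterwards
 * is proof that the grace period was too short. */
static void example_updater(struct example_obj *newp)
{
  struct example_obj *oldp;

  WRITE_ONCE(newp->still_valid, 1);
  oldp = rcu_replace_pointer(example_cur, newp, true);
  synchronize_rcu();  /* Or any of RCU's other ten grace-period-wait APIs. */
  WRITE_ONCE(oldp->still_valid, 0);
}

Roughly speaking, the reader-segment printouts shown in this article are what rcutorture emits when its equivalent of this reader-side check fires.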

This situation clearly calls for additional debugging, but adding debugging reduces the failure rate for this bug. For but one example, adding raw_smp_processor_id() to the reader in order to record the current CPU reduced the failure rate by about a factor of two.

It is therefore clearly necessary to extract every shred of information from the debugging that is already present.

Smoking CPU-Hotplug Operations

To that end, when the reader detects the too-short grace period, it dumps the trace buffer to the console, even in the case where nothing is being traced. This means that there will be a Dumping ftrace buffer message on the console at about the time that the too-short grace period was detected. The last few lines of console output prior to this are as follows, this time with timestamps:

[ 1091.544531] rcu-torture: rcu_torture_boost started
[ 1092.100819] rcu-torture: Stopping rcu_torture_boost task
[ 1092.101659] rcu-torture: rcu_torture_boost is stopping
[ 1092.115219] smpboot: CPU 3 is now offline
[ 1092.536276] rcu-torture: Stopping rcu_torture_boost task
[ 1092.536938] rcu-torture: rcu_torture_boost is stopping
[ 1092.569261] smpboot: CPU 1 is now offline
[ 1092.828577] smpboot: Booting Node 0 Processor 3 APIC 0x3
[ 1092.842975] rcu-torture: Creating rcu_torture_boost task
[ 1092.843895] rcu-torture: rcu_torture_boost started
[ 1093.201384] Dumping ftrace buffer:

Note that the CPU that the reader is running on, CPU 3, came online just before the start of the reader that inexplicably reported the too-short grace period. Of course, each rcutorture guest OS has only four CPUs, and only three of them can be taken offline (CPUs 1-3), so this could be random chance.

Except that this was always the case, across more than 100 too-short grace periods. And the value of one-third to the 100th power (roughly 2x10^-48) represents a very low probability indeed, so it seems safe to assume that this correlation represents some form of causation, obscure though that causation might be. Especially given that the bug does not appear in the absence of CPU-hotplug operations.

This realization prompted another inspection of RCU's expedited grace period code and its relation to CPU hotplug, pointing the finger of suspicion on the sync_sched_exp_online_cleanup() function. I spent an embarrassingly large amount of time on this before reminding myself that this function is in fact a no-op in the preemptible kernels that I was testing.
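
For reference, in preemptible (CONFIG_PREEMPT_RCU=y) kernels this function is an empty stub along the following lines (paraphrased from kernel/rcu/tree_exp.h):

/* PREEMPT_RCU=y, so there is no PREEMPT_RCU=n expedited grace period
 * to clean up after on behalf of an incoming CPU. */
static void sync_sched_exp_online_cleanup(int cpu)
{
}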

Further Increasing Failure Rate

I also got annoyed at the RCU CPU stall warnings that were instigated by the combination of systems whose guest OSes booted slowly and the testing of RCU priority boosting. I therefore disabled testing of RCU priority boosting. To my delight this actually increased the failure rate of these phantom too-short grace periods.

Thus encouraged, I did more experiments, learning that building the kernel with CONFIG_RCU_FANOUT_LEAF=4 increased the failure rate by more than a factor of two. This was a surprise, given that the effect was to move from a three-rcu_node-structure combining tree to a singleton tree having only the one rcu_node structure. In the past, combining trees having fewer rcu_node structures have reduced race-condition failure rates, but this was clearly an exception to that rule.

At this point, with instrumentation disabled, the bug was reproducing about 50 times per hour across 400 instances of TREE03 guest OSes, which is an increase of more than five orders of magnitude over the initial once-per-year failure rate. This meant that I could use a lot more debugging code!

Smoking Polled Grace Periods

Which resulted in the following trace output, consisting of three lines, some or all of which your browser is likely to have line-wrapped:

rcu_pree-15        0d..1. 3170054244us : rcu_grace_period: rcu_preempt 389665 start
rcu_tort-60        0..... 3170054253us : rcu_torture_writer: rcu_torture_writer Polled grace period started 0x5f224/0xcc368.
rcu_tort-60        0..... 3170054276us : rcu_torture_writer: rcu_torture_writer Polled grace period ended 0x5f224/0xcc368.

The first line (rcu_pree-15) marks the start of grace period 389665, also known as 0x5f221, whose end is marked by the value 0x5f224. This end value matches the first of the two numbers printed by the trace_printk() statements that produced the two Polled grace period messages (both rcu_tort-60). Yet there is no trace of the end of this grace period, nor is there any sign of an expedited grace period. Therefore, the updater is somehow concluding that the grace period ended too soon.
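
To follow the hex arithmetic, recall that the low-order RCU_SEQ_STATE_MASK bits of each ->gp_seq value record the grace-period phase and that the remaining bits count grace periods. As a worked example (assuming the usual two state bits):

/*
 * 0x5f221 = 389665: counter 0x17c88, state 1 => grace period in progress.
 * 0x5f224:          counter 0x17c89, state 0 => that grace period has ended.
 *
 * So "grace period 389665" is over once ->gp_seq reaches 0x5f224, which is
 * exactly the first number in the two Polled grace period messages above.
 */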

This points the finger of suspicion at the poll_state_synchronize_rcu_full() function that the updater uses to detect the end of the grace period. This function is as follows:

 1 bool poll_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 2 {
 3   struct rcu_node *rnp = rcu_get_root();
 4
 5   smp_mb(); 
 6   if (rgosp->rgos_norm == RCU_GET_STATE_COMPLETED || 
 7       rcu_seq_done_exact(&rnp->gp_seq, rgosp->rgos_norm) || 
 8       rgosp->rgos_exp == RCU_GET_STATE_COMPLETED || 
 9       rcu_seq_done_exact(&rcu_state.expedited_sequence, rgosp->rgos_exp)) {
10     smp_mb(); 
11     return true;
12   }
13   return false;
14 }

Line 7 looks just plain wrong. To see why, recall that the ->gp_seq field is replicated in the rcu_state, rcu_node, and rcu_data structures in order to keep both lock and memory contention down to a dull roar. There is one rcu_node structure in this configuration and rcu_data is a per-CPU structure, so there is a total of six copies of ->gp_seq. The source of truth is the one in the rcu_state structure, that is, rcu_state.gp_seq, which is updated first at the beginning of each grace period and last at each grace period's end. So why doesn't line 7 use rcu_state.gp_seq? Sure, there is only one copy of this field, but the same is true of the root rcu_node structure's ->gp_seq field accessed by line 7.
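
For reference, here is the rcu_seq_done_exact() function invoked by line 7, as it stood at the time (reproduced from kernel/rcu/rcu.h):

static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
{
  unsigned long cur_s = READ_ONCE(*sp);

  return ULONG_CMP_GE(cur_s, s) ||
    ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
}

Its second clause is the one that the eventual fix adjusts, as shown in the diff near the end of this article.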

The same question could be asked about the just-as-wrong-looking line 6 of get_state_synchronize_rcu_full():

 1 void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 2 {
 3   struct rcu_node *rnp = rcu_get_root();
 4
 5   smp_mb();
 6   rgosp->rgos_norm = rcu_seq_snap(&rnp->gp_seq);
 7   rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence);
 8 }

In fact both lines looked so wrong that I changed them both (without even bothering to read either function's header comment) and reran the test. There were no too-short grace periods.

Success?

Well, not until there is an explanation of why this makes a difference. After all, there is strong ordering between the updates of the rcu_node structures' ->gp_seq fields and all the quiescent states, so why should it really make a difference? And perhaps it is not a true fix, but rather an unrelated change that just happens to greatly reduce the probability of too-short grace periods, in which case, the change is doing nothing but impeding progress.

Turning to Tracing

I reverted the change to the get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions, and then enabled yet more tracing, resulting in the following redacted trace messages:

rcu_pree-15        0d..1. 816764514us : rcu_grace_period: rcu_preempt 90289 start
 cpuhp/3-34        3..... 816772336us : rcu_exp_grace_period: rcu_preempt 0x2f0e8 online
rcu_tort-59        0..... 816807835us : rcu_torture_writer: rcu_torture_writer Polled grace period started 0x160b4/0x2f0ec.
rcu_pree-15        0d..1. 816810514us : rcu_grace_period: rcu_preempt 90289 startroot
rcu_pree-15        0d..1. 816810515us : rcu_grace_period: rcu_preempt 90289 cpustart
rcu_pree-15        0d..1. 816810516us : rcu_grace_period: rcu_preempt 90289 cpuqs
rcu_pree-15        0d..1. 816811522us : rcu_grace_period: rcu_preempt 90289 endroot 
rcu_pree-15        0d..1. 816811523us : rcu_grace_period: rcu_preempt 90289 cpuend
rcu_tort-59        0..... 816811529us : rcu_torture_writer: rcu_torture_writer Polled grace period ended 0x160b4/0x2f0ec.
rcu_pree-15        0d..1. 816842554us : rcu_grace_period: rcu_preempt 90289 end

The first trace message (rcu_pree-15) marks the official start of grace period 90289 (also known as 0x160b1), recorded in rcu_state.gp_seq. The ->gp_seq value for the end of this grace period is 0x160b4.

Some time later, the second trace message (cpuhp/3-34) records CPU 3 coming online. This CPU came online after the grace period started, so the grace period is not obligated to wait on it.

Later still, the third trace message (rcu_tort-59) marks the start of the polled grace period from the updater's perspective. The next five trace messages (rcu_pree-15) record:

  1. The update of the root rcu_node structure's ->gp_seq field to reflect the start of the grace period.
  2. CPU 0 noticing that the grace period had started.
  3. CPU 0 reporting its quiescent state.
  4. The update of the root rcu_node structure's ->gp_seq field to reflect the end of the grace period.
  5. CPU 0 noticing that the grace period had ended.

The second-to-last trace message (rcu_tort-59) marks the end of the polled grace period from the updater's perspective. And the final trace message (rcu_pree-15) marks the official end of this grace period, recorded in rcu_state.gp_seq.

Anatomy of a Normal RCU Grace Period

The above trace messages report on important transitions in RCU's grace-period processing, and this section puts those transitions into the context of the entire grace period. The relevant phases of grace-period processing are as follows:

  1. The very beginning of a grace period, which updates rcu_state.gp_seq, reported by the rcu_pree-15 trace message above ending in start.
  2. Accounting for recent CPU-hotplug operations. Again, please note that an RCU grace period need not wait for readers running on a CPU that came online after that grace period started, so this accounting need be done only at the start of each grace period.
  3. Initializing the rcu_node combining tree to prepare for quiescent states from any CPUs and tasks that this grace period must wait on, which updates each rcu_node structure's ->gp_seq field. This phase is reported by the rcu_pree-15 trace message above ending in startroot. (But note that this trace message is not in mainline, and probably won't ever be.)
  4. The note_gp_changes() function notices the starts and ends of grace periods from the viewpoint of each CPU that was online at the start of the grace period (in this case, only CPU 0). These events are reported by the rcu_pree-15 trace messages above ending in cpustart and cpuend.
  5. Quiescent states are reported by a chain of events initiated by rcu_qs(), and show up in the rcu_pree-15 trace messages above ending in cpuqs.
  6. Idle CPUs are handled specially, but this did not happen in the above trace.
  7. Once all relevant CPUs and tasks have reported quiescent states, the cleanup phase starts which updates each rcu_node structure's ->gp_seq field. This is reported by the rcu_pree-15 trace message above ending in endroot. (But note that this trace message is not in mainline, and probably won't ever be.)
  8. Finally, the entire grace period comes to an end, which updates rcu_state.gp_seq, reported by the rcu_pree-15 trace message above ending in end.

The full story is a bit more complex, but this provides enough context for the bug at hand.
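
The key ordering fact for what follows can be distilled from phases 1, 3, 7, and 8 above:

/*
 * 1. rcu_state.gp_seq        marks the official start of the grace period.
 * 3. each rcu_node ->gp_seq  then marks the start for that part of the tree.
 * 7. each rcu_node ->gp_seq  marks the end for that part of the tree.
 * 8. rcu_state.gp_seq        finally marks the official end.
 *
 * In other words, the rcu_state.gp_seq updates bracket the rcu_node
 * updates at both ends of the grace period.
 */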

Turning Back to Tracing

As noted earlier, all the quiescent states happened between the two updates of the root rcu_node structure's ->gp_seq field, so all should be well, right?

Unfortunately, no. A reader noted a too-short grace period, and here is the corresponding console log, again with lines likely wrapped:

Failure/close-call rcutorture reader segments:
 816805685us 0: 0x10 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  SCHED
 816805685us 1: 0x10 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  SCHED
 816805685us 2:  0x4 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  PREEMPT
 816805685us 3:  0x8 CPU  3 ... g60b1:e02f0e4:p22f4 ..................  RBH
 816805685us 4:  0x4 CPU  3 ... g60b1:e02f0e4:p22f4-g60b5:e02f0e5:p22f5 300ms PREEMPT
 817105825us 5: 0x60 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted RCU_1 RCU_2
 817105851us 6:  0x4 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted PREEMPT
 817105852us 7:  0x4 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted PREEMPT
 817105852us 8: 0x20 CPU  3 ... g60b5:e02f0e5:p22f5 ..................  preempted RCU_1
 Reader was preempted.

Note that it was CPU 3 that came online, and it was also CPU 3 that noticed the too-short grace period. This is the key clue: CPU 3 came online between the time that the grace period officially started and the start of the updater, so CPU 3 might well have started a reader before the updater started. Although the grace period did not need to wait for that reader, the updater most certainly did.

Reaching Root Cause

And this is why the aforementioned update to the get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions suppressed the bug. With the get_state_synchronize_rcu_full() function reading from rcu_state.gp_seq, the updater would see that the current grace period had already started, and would thus wait until the end of the next grace period, which could not end until some time after CPU 3's reader finished.

Finding a Fix

Unfortunately, the corresponding change to the poll_state_synchronize_rcu_full() function breaks this guarantee:

get_state_synchronize_rcu_full(&rgos);
synchronize_rcu();
WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));

The problem is that callbacks can be invoked (including those that awaken tasks blocked in synchronize_rcu()) immediately after the root rcu_node structure's ->gp_seq field has been updated, and thus long before rcu_state.gp_seq has been updated to record the grace period's end. Therefore, if the poll_state_synchronize_rcu_full() function is changed to instead check rcu_state.gp_seq, the above WARN_ON_ONCE() can trigger.

Which would have been immediately obvious had I read the poll_state_synchronize_rcu_full() function's header comment. So if I wasn't going to read these comments, why did I bother writing them? ;–)

Implementing a Fix

The first step is to fix line 6 of get_state_synchronize_rcu_full() to access rcu_state.gp_seq (instead of the root rcu_node structure's ->gp_seq field):

 1 void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 2 {
 3   struct rcu_node *rnp = rcu_get_root();
 4
 5   smp_mb();
 6   rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
 7   rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence);
 8 }

In real life, we must of course also get rid of the now-unused rnp local variable, but that would mess up the line numbering.

But running with this patch results in too-short grace periods, and not just once per minute per 400 TREE03 instances, but instead once every few seconds in each and every instance!

The problem is that both get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() might run after the grace period starts (thus having incremented rcu_state.gp_seq) but before the grace period has finished accounting for recent CPU-hotplug operations (thus having not yet incremented the root rcu_node structure's ->gp_seq field). This easy-to-trigger situation will cause poll_state_synchronize_rcu_full() to conclude that the grace period ended an extremely long time ago, resulting in a too-short grace period from the perspective of any concurrent readers.

The fix is to adjust the rcu_seq_done_exact() function to take a more liberal view of what constitutes an ongoing grace period, as in the following diff:

 1 diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
 2 index 2f9c9272cd486..d2a91f705a4ab 100644
 3 --- a/kernel/rcu/rcu.h
 4 +++ b/kernel/rcu/rcu.h
 5 @@ -162,9 +162,9 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
 6  {
 7    unsigned long cur_s = READ_ONCE(*sp);
 8
 9 -  return ULONG_CMP_GE(cur_s, s) ||
10 -    ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
11 +  return ULONG_CMP_GE(cur_s, s) ||
12 +    ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
13  }
14
15  /*

The change from 2 to 3 does the trick. The corresponding RFC patch may be found here.
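
A hedged worked example, reusing the ->gp_seq values from the earlier trace and assuming rcu_seq_snap()'s usual computation of (s + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK, shows why the wider window matters:

/*
 * Suppose rcu_state.gp_seq has just advanced to 0x160b1 (grace period
 * officially started), but the root rcu_node structure's ->gp_seq is
 * still 0x160b0 because CPU-hotplug accounting has not yet finished.
 *
 * get_state_synchronize_rcu_full() now snaps from rcu_state.gp_seq:
 *   s = (0x160b1 + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK = 0x160b8
 *
 * poll_state_synchronize_rcu_full() still reads the root's ->gp_seq:
 *   cur_s = 0x160b0
 *
 * Old check: ULONG_CMP_LT(0x160b0, 0x160b8 - 0x7) is true, so the grace
 *            period is wrongly reported as already complete.
 * New check: ULONG_CMP_LT(0x160b0, 0x160b8 - 0xa) is false, so the caller
 *            keeps waiting, as it should.
 */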

Evaluation

With this fixed, many tens of hours of 400 instances of TREE03 ran without any too-short grace periods. Now on to the bugs that I was chasing a couple of months ago...

Lessons (Re)Learned

This effort took a couple of months, so it is well worth reflecting on the experience to see what might be done better next time.

First, much time was consumed increasing the failure rate (de-heisenbugging) from about once per year, to about once every few hours, then to about once every three minutes, and finally to about once per minute. Things would have gone faster had I put more time in during the first phase of de-heisenbugging. For example, once I noted that CPU-hotplug operations were involved, I should have experimented with rcutree's gp_preinit_delay, gp_init_delay, and gp_cleanup_delay module parameters, which insert delays that can increase the probability of races between CPU-hotplug operations and grace-period processing. On the other hand, it is all too easy to spend forever de-heisenbugging and never get around to actually finding the bug, and in this particular case, a large leap of imagination would have been required to get from the expedited grace periods that seemed to be causing trouble to the normal grace periods that were actually doing so. Either way, it was necessary to greatly increase the failure rate in order to fix the failure.

Second, much time was wasted due to my erroneous assumption that the bug involved the expedited grace period, when it was really the interaction between the normal grace period and the full-sized polled API. Unfortunately, it would have been difficult to check that assumption much before the last stage of de-heisenbugging.

Third, I could have saved a few days by more carefully checking my initial instrumentation. Debug code is also code, and thus can also contain bugs!

Fourth, I could have saved a day or two by reading the poll_state_synchronize_rcu_full() function's header comment.

Fifth, although symmetry plays an important role in human aesthetics and perception of elegance, in this case, the symmetry of the ->gp_seq accesses in the get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions was in fact the bug.

Sixth and finally, I created this bug in July of 2022. Those promoting various validation and verification techniques therefore had more than two years to find this bug, which was in a very public code base. Perhaps next time they will find an RCU bug before I do. ;–)

For More Heisenbug-Hunting Information...

Many people have held forth on this topic, but these references should get you started:

  1. Section 11.6 (“Probability and Heisenbugs”) of Is Parallel Programming Hard, And, If So, What Can You Do About It?, with emphasis on Section 11.6.4 (“Hunting Heisenbugs”).
  2. The “Hunting Heisenbugs” talk at 2023 Linux Plumbers Conference.
  3. The similar “Hunting Heisenbugs” talk at 2024 Everything Open.