Email me at: joel _at_

GUS is a memory reclaim algorithm used in FreeBSD, similar to RCU. It is borrows concepts from Epoch and Parsec. A video of a presentation describing the integration of GUS with UMA (FreeBSD's slab implementation) is here:

The best description of GUS is in the FreeBSD code itself. It is based on the concept of global write clock, with readers catching up to writers.

Effectively, I see GUS as an implementation of light traveling from distant stars. When a photon leaves a star, it is no longer needed by the star and is ready to be reclaimed. However, on earth we can't see the photon yet, we can only see what we've been shown so far, and in a way, if we've not seen something because enough “time” has not passed, then we may not reclaim it yet. If we've not seen something, we will see it at some point in the future. Till then we need to sit tight.

Roughly, an implementation has 2+N counters (with N CPUs): 1. Global write sequence. 2. Global read sequence. 3. Per-cpu read sequence (read from #1 when a reader starts)

On freeing, the object is tagged with the write sequence. Only once global read sequence has caught up with global write sequence, the object is freed. Until then, the free'ing is deferred. The poll() operation updates #2 by referring to #3 of all CPUs. Whatever was tagged between the old read sequence and new read sequence can be freed. This is similar to synchronize_rcu() in the Linux kernel which waits for all readers to have finished observing the object being reclaimed.

Note the scalability drawbacks of this reclaim scheme:

  1. Expensive poll operation if you have 1000s of CPUs. (Note: Parsec uses a tree-based mechanism to improve the situation which GUS could consider)

  2. Heavy-weight memory barriers are needed (SRCU has a similar drawback) to ensure ordering properties of reader sections with respect to poll() operation.

  3. There can be a delay between reading the global write-sequence number and writing it into the per-cpu read-sequence number. This can cause the per-cpu read-sequence to advance past the global write-sequence. Special handling is needed.

One advantage of the scheme could be implementation simplicity.

RCU (not SRCU or Userspace RCU) doesn't suffer from these drawbacks. Reader-sections in Linux kernel RCU are extremely scalable and lightweight.

The SRCU flavor of RCU uses per-cpu counters to detect that every CPU has passed through a quiescent state for a particular SRCU lock instance (srcu_struct).

There's are total of 4 counters per-cpu. One pair for locks, and another for unlocks. You can think of the SRCU instance to be split into 2 parts. The readers sample srcu_idx and decided which part to use. Each part corresponds to one pair of lock and unlock counters. A reader increments a part's lock counter during locking and likewise for unlock.

During an update, the updater flips srcu_idx (thus attempting to force new readers to use the other part) and waits for the lock/unlock counters on the previous value of srcu_idx to match. Once the sum of the lock counters of all CPUs match that of unlock, the system knows all pre-existing read-side critical sections have completed.

Things are not that simple, however. It is possible that a reader samples the srcu_idx, but before it can increment the lock counter corresponding to it, it undergoes a long delay. We thus we end up in a situation where there are readers in both srcu_idx = 0 and srcu_idx = 1.

To prevent such a situation, a writer has to wait for readers corresponding to both srcu_idx = 0 and srcu_idx = 1 to complete. This depicted with 'A MUST' in the below pseudo-code:

        reader 1        writer                        reader 2
        // read_lock
        // enter
        Read: idx = 0;
        <long delay>    // write_lock
                        // enter
                        wait_for lock[1]==unlock[1]
                        idx = 1; /* flip */
                        wait_for lock[0]==unlock[0]
                                                      Read: idx = 1;
                        // write_lock
                        // return
        // read_lock
        // return
        /**** NOW BOTH lock[0] and lock[1] are non-zero!! ****/
                        // write_lock
                        // enter
                        wait_for lock[0]==unlock[0] <- A MUST!
                        idx = 0; /* flip */
                        wait_for lock[1]==unlock[1] <- A MUST!

NOTE: QRCU has a similar issue. However it overcomes such a race in the reader by retrying the sampling of its 'srcu_idx' equivalent.

Q: If you have to wait for readers of both srcu_idx = 0, and 1, then why not just have a single counter and do away with the “flipping” logic? Ans: Because of updater forward progress. If we had a single counter, then it is possible that new readers would constantly increment the lock counter, thus updaters would be waiting all the time. By using the 'flip' logic, we are able to drain pre-existing readers using the inactive part of srcu_idx to be drained in a bounded time. The number of readers of a 'flipped' part would only monotonically decrease since new readers go to its counterpart.

The Message Passing pattern (MP pattern) is shown in the snippet below (borrowed from LKMM docs). Here, P0 and P1 are 2 CPUs executing some code. P0 stores a message in buf and then signals to consumers like P1 that the message is available — by doing a store to flag. P1 reads flag and if it is set, knows that some data is available in buf and goes ahead and reads it. However, if flag is not set, then P1 does nothing else. Without memory barriers between P0's stores and P1's loads, the stores can appear out of order to P1 (on some systems), thus breaking the pattern. The condition r1 == 0 and r2 == 1 is a failure in the below code and would violate the condition. Only after the flag variable is updated, should P1 be allowed to read the buf (“message”).

        int buf = 0, flag = 0;

                WRITE_ONCE(buf, 1);
                WRITE_ONCE(flag, 1);

                int r1;
                int r2 = 0;

                r1 = READ_ONCE(flag);
                if (r1)
                        r2 = READ_ONCE(buf);

Below is a simple program in PlusCal to model the “Message passing” access pattern and check whether the failure scenario r1 == 0 and r2 == 1 could ever occur. In PlusCal, we can model the non deterministic out-of-order stores to buf and flag using an either or block. This makes PlusCal evaluate both scenarios of stores (store to buf first and then flag, or viceversa) during model checking. The technique used for modeling this non-determinism is similar to how it is done in Promela/Spin using an “if block” (Refer to Paul McKenney's perfbook for details on that).

(*--algorithm mp_pattern
    buf = 0,
    flag = 0;

process Writer = 1
e1:        buf := 1;
e2:        flag := 1;
e3:        flag := 1;
e4:        buf := 1;
        end either;
end process;

process Reader = 2
    r1 = 0,
    r2 = 0;  
e5:     r1 := flag;
e6:     if r1 = 1 then
e7:         r2 := buf;
        end if;
e8:     assert r1 = 0 \/ r2 = 1;
end process;

end algorithm;*)

Sure enough, the assert r1 = 0 \/ r2 = 1; fires when the PlusCal program is run through the TLC model checker.

I do find the either or block clunky, and wish I could just do something like:

non_deterministic {
        buf := 1;
        flag := 1;

And then, PlusCal should evaluate both store orders. In fact, if I wanted more than 2 stores, then it can get crazy pretty quickly without such a construct. I should try to hack the PlusCal sources soon if I get time, to do exactly this. Thankfully it is open source software.

Other notes:

  • PlusCal is a powerful language that translates to TLA+. TLA+ is to PlusCal what assembler is to C. I do find PlusCal's syntax to be non-intuitive but that could just be because I am new to it. In particular, I hate having to mark statements with labels if I don't want them to atomically execute with neighboring statements. In PlusCal, a label is used to mark a statement as an “atomic” entity. A group of statements under a label are all atomic. However, if you don't specific labels on every statement like I did above (eX), then everything goes under a neighboring label. I wish PlusCal had an option, where a programmer could add implict labels to all statements, and then add explicit atomic { } blocks around statements that were indeed atomic. This is similar to how it is done in Promela/Spin.

  • I might try to hack up my own compiler to TLA+ if I can find the time to, or better yet modify PlusCal itself to do what I want. Thankfully the code for the PlusCal translator is open source software.

Note: At the time of this writing, it is kernel v5.3 release. RCU moves fast and can change in the future, so some details in this article may be obsolete.

The RCU subsystem and the task scheduler are inter-dependent. They both depend on each other to function correctly. The scheduler has many data structures that are protected by RCU. And, RCU may need to wake up threads to perform things like completing grace periods and callback execution. One such case where RCU does a wake up and enters the scheduler is rcu_read_unlock_special().

Recently Paul McKenney consolidated RCU flavors. What does this mean?

Consider the following code executing in CPU 0:


And, consider the following code executing in CPU 1:

a = 1;
synchronize_rcu();  // Assume synchronize_rcu
                    // executes after CPU0's rcu_read_lock
b = 2;

CPU 0's execution path shows 2 flavors of RCU readers, one nested into another. The preempt_{disable,enable} pair is an RCU-sched flavor RCU reader section, while the rcu_read_{lock,unlock} pair is an RCU-preempt flavor RCU reader section.

In older kernels (before v4.20), CPU 1's synchronize_rcu() could return after CPU 0's rcu_read_unlock() but before CPU 0's preempt_enable(). This is because synchronize_rcu() only needs to wait for the “RCU-preempt” flavor of the RCU grace period to end.

In newer kernels (v4.20 and above), the RCU-preempt and RCU-sched flavors have been consolidated. This means CPU 1's synchronize_rcu() is guaranteed to wait for both of CPU 1's rcu_read_unlock() and preempt_enable() to complete.

Now, lets get a bit more detailed. That rcu_read_unlock() most likely does very little. However, there are cases where it needs to do more, by calling rcu_read_unlock_special(). One such case is if the reader section was preempted. A few more cases are:

  • The RCU reader is blocking an expedited grace period, so it needed to report a quiescent state quickly.
  • The RCU reader is blocking a grace period for too long (~100 jiffies on my system, that's the default but can be set with rcutree.jiffies_till_sched_qs parameter).

In all these cases, the rcu_read_unlock() needs to do more work. However, care must be taken when calling rcu_read_unlock() from the scheduler, that's why this article on scheduler deadlocks.

One of the reasons rcu_read_unlock_special() needs to call into the scheduler is priority de-boosting: A task getting preempted in the middle of an RCU read-side critical section results in blocking the completion of the critical section and hence could prevent current and future grace periods from ending. So the priority of the RCU reader may need to be boosted so that it gets enough CPU time to make progress, and have the grace period end soon. But it also needs to be de-boosted after the reader section completes. This de-boosting happens by calling of the rcu_read_unlock_special() function in the outer most rcu_read_unlock().

What could go wrong with the scheduler using RCU? Let us see this in action. Consider the following piece of code executed in the scheduler:

		do_something();     // Preemption happened
                /* Preempted task got boosted */
		task_rq_lock();     // Disables interrupts
                rcu_read_unlock();  // Need to de-boost
		task_rq_unlock();   // Re-enables interrupts

Assume that the rcu_read_unlock() needs to de-boost the task's priority. This may cause it to enter the scheduler and cause a deadlock due to recursive locking of RQ/PI locks.

Because of these kind of issues, there has traditionally been a rule that RCU usage in the scheduler must follow:

“Thou shall not hold RQ/PI locks across an rcu_read_unlock() if thou not holding it or disabling IRQ across both both the rcu_read_lock() + rcu_read_unlock().”

More on this rule can be read here as well.

Obviously, acquiring RQ/PI locks across the whole rcu_read_lock() and rcu_read_unlock() pair would resolve the above situation. Since preemption and interrupts are disabled across the whole rcu_read_lock() and rcu_read_unlock() pair; there is no question of task preemption.

Anyway, the point is rcu_read_unlock() needs to be careful about scheduler wake-ups; either by avoiding calls to rcu_read_unlock_special() altogether (as is the case if interrupts are disabled across the entire RCU reader), or by detecting situations where a wake up is unsafe. Peter Ziljstra says there's no way to know when the scheduler uses RCU, so “generic” detection of the unsafe condition is a bit tricky.

Now with RCU consolidation, the above situation actually improves. Even if the scheduler RQ/PI locks are not held across the whole read-side critical sectoin, but just across that of the rcu_read_unlock(), then that itself may be enough to prevent a scheduler deadlock. The reasoning is: during the rcu_read_unlock(), we cannot yet report a QS until the RQ/PI lock is itself released since the act of holding the lock itself means preemption is disabled and that would cause a QS deferral. As a result, the act of priority de-boosting would also be deferred and prevent a possible scheduler deadlock.

However, RCU consolidation introduces even newer scenarios where the rcu_read_unlock() has to enter the scheduler, if the “scheduler rules” above is not honored, as explained below:

Consider the previous code example. Now also assume that the RCU reader is blocking an expedited RCU grace period. That is just a fancy term for a grace period that needs to end fast. These grace periods have to complete much more quickly than normal grace period. An expedited grace period causes currently running RCU reader sections to receive IPIs that set a hint. Setting of this hint results in the outermost rcu_read_unlock() calling rcu_read_unlock_special(), which otherwise would not occur. When rcu_read_unlock_special() gets called in this scenario, it tries to get more aggressive once it notices that the reader has blocked an expedited RCU grace period. In particular, it notices that preemption is disabled and so the grace period cannot end due to RCU consolidation. Out of desperation, it raises a softirq (raise_softirq()) in the hope that the next time the softirq runs, the grace period could be ended quickly before the scheduler tick occurs. But that can cause a scheduler deadlock by way of entry into the scheduler due to a ksoftirqd-wakeup.

The cure for this problem is the same, holding the RQ/PI locks across the entire reader section results in no question of a scheduler related deadlock due to recursively acquiring of these locks; because there would be no question of expedited-grace-period IPIs, hence no question of setting of any hints, and hence no question of calling rcu_read_unlock_special() from scheduler code. For a twist of the IPI problem, see special note.

However, the RCU consolidation throws yet another curve ball. Paul McKenney explained on LKML that there is yet another situation now due to RCU consolidation that can cause scheduler deadlocks.

Consider the following code, where previous_reader() and current_reader() execute in quick succession in the context of the same task:

		do_something();      // Preemption or IPI happened
		local_irq_disable(); // Cannot be the scheduler
		rcu_read_unlock();  // As IRQs are off, defer QS report
                                    //but set deferred_qs bit in 

        // QS from previous_reader() is still deferred.
		local_irq_disable();  // Might be the scheduler.
		rcu_read_unlock();    // Must still defer reporting QS

Here previous_reader() had a preemption; even though the current_reader() did not – but the current_reader() still needs to call rcu_read_unlock_special() from the scheduler! This situation would not happen in the pre-consolidated-RCU world because previous_reader()'s rcu_read_unlock() would have taken care of it.

As you can see, just following the scheduler rule of disabling interrupts across the entire reader section does not help. To detect the above scenario; a new bitfield deferred_qs has been added to the task_struct::rcu_read_unlock_special union. Now what happens is, at rcu_read_unlock()-time, the previous reader() sets this bit, and the current_reader() checks this bit. If set, the call to raise_softirq() is avoided thus eliminating the possibility of a scheduler deadlock.

Hopefully no other scheduler deadlock issue is lurking!

Coming back to the scheduler rule, I have been running overnight rcutorture tests to detect if this rule is ever violated. Here is the test patch checking for the unsafe condition. So far I have not seen this condition occur which is a good sign.

I may need to check with Paul McKenney about whether proposing this checking for mainline is worth it. Thankfully, LPC 2019 is right around the corner! ;–)

Special Note

[1] The expedited IPI interrupting an RCU reader has a variation. For an example see below where the IPI was not received, but we still have a problem because the ->need_qs bit in the rcu_read_unlock_special union got set even though the expedited grace period started after IRQs were disabled. The start of the expedited grace period would set the rnp->expmask bit for the CPU. In the unlock path, because the ->need_qs bit is set, it will call rcu_read_unlock_special() and risk a deadlock by way of a ksoftirqd wakeup because exp in that function is true.

CPU 0                         CPU 1

// do something real long

// Scheduler-tick sets
// ->need_qs as reader is
// held for too long.

                              // Expedited GP started
// Exp IPI not received
// because IRQs are off.


// Here rcu_read_unlock will
// still call ..._special()
// as ->need_qs got set.


The fix for this issue is the same as described earlier, disabling interrupts across both rcu_read_lock() and rcu_read_unlock() in the scheduler path.