Below are some notes I wrote while studying hrtimer slack behavior (range timers), which was added to reduce wakeups and save power in the commit below. The idea is that: (1) normal hrtimers have a soft expiry and a hard expiry that are equal to each other, while (2) hrtimers with timer slack have a soft expiry, and a hard expiry which is the soft expiry + delta.

The slack/delay effect is achieved by splitting the execution of the timer function and the programming of the next timer event into 2 separate steps. That is, we execute the timer function as soon as we notice that its soft expiry has passed (hrtimer_run_queues()). However, for programming the next timer interrupt, we only look at the hard expiry (hrtimer_update_next_event() -> __hrtimer_get_next_event() -> __hrtimer_next_event_base() -> hrtimer_get_expires()). As a result, the only way a slack-based timer will execute before its slack time elapses is if another timer without any slack gets queued such that it hard-expires before the slack time of the slack-based timer passes.
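As a concrete usage sketch (my own illustration, not taken from the kernel documentation; the timer, callback and function names are made up), this is roughly how a timer with slack is armed via hrtimer_start_range_ns(): it soft-expires at expires and hard-expires at expires + slack_ns:

#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer my_timer;	/* hypothetical timer, for illustration only */

static enum hrtimer_restart my_timer_fn(struct hrtimer *t)
{
	/*
	 * Runs once the soft expiry has passed and some hrtimer
	 * interrupt fires; the interrupt itself is programmed for
	 * the hard expiry (soft expiry + slack).
	 */
	return HRTIMER_NORESTART;
}

static void arm_timer_with_slack(void)
{
	ktime_t expires = ktime_set(0, 10 * NSEC_PER_MSEC);	/* soft expiry at 10ms */
	u64 slack_ns = 50 * NSEC_PER_MSEC;			/* hard expiry at 60ms */

	hrtimer_init(&my_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	my_timer.function = my_timer_fn;
	hrtimer_start_range_ns(&my_timer, expires, slack_ns, HRTIMER_MODE_REL);
}

Plain hrtimer_start() is a wrapper that passes a delta of 0, which is how normal timers end up with soft expiry == hard expiry.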

The commit containing the original code added for range timers is:

commit 654c8e0b1c623b156c5b92f28d914ab38c9c2c90
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Mon Sep 1 15:47:08 2008 -0700

    hrtimer: turn hrtimers into range timers
   
    this patch turns hrtimers into range timers;
    they have 2 expire points
    1) the soft expire point
    2) the hard expire point
   
    the kernel will do it's regular best effort attempt to get the timer run at the hard expire point. However, if some other time fires after the soft expire point, the kernel now has the freedom to fire this timer at this point, and thus grouping the events and preventing a power-expensive wakeup in the future.

The original code seems a bit buggy. I got a bit confused about how/where hrtimer_interrupt() handles the case where other normal timers, which expire before the slack time elapses, have their next timer interrupt programmed correctly such that the interrupt goes off before the slack time passes.

To see the issue, consider the case where we have 2 timers queued:

  1. The first one soft expires at t = 10, and say it has a slack of 50, so it hard expires at t = 60.

  2. The second one is a normal timer, so the soft/hard expiry of it is both at t = 30.

Now say an hrtimer interrupt happens at t=5, courtesy of an unrelated expiring timer. In the code below, we notice that the next expiring timer is the slack timer (the first one), which has not soft-expired yet. So we have no reason to run it. However, we reprogram the next timer interrupt to be t=60, which is its hard expiry time (this is stored in expires_next to use as the value to program the next timer interrupt with). Now we have a big problem, because the timer expiring at t=30 will not run in time and will instead run much later.

As shown below, the loop in hrtimer_interrupt() goes through all the active timers in the timerqueue. With range timers, _softexpires holds the real (requested) expiry, and the old _expires now becomes _softexpires + slack.

       while((node = timerqueue_getnext(&base->active))) {
              struct hrtimer *timer;

              timer = container_of(node, struct hrtimer, node);

              /*
               * The immediate goal for using the softexpires is
               * minimizing wakeups, not running timers at the
               * earliest interrupt after their soft expiration.
               * This allows us to avoid using a Priority Search
               * Tree, which can answer a stabbing querry for
               * overlapping intervals and instead use the simple
               * BST we already have.
               * We don't add extra wakeups by delaying timers that
               * are right-of a not yet expired timer, because that
               * timer will have to trigger a wakeup anyway.
               */

              if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer)) {
                      ktime_t expires;

                      expires = ktime_sub(hrtimer_get_expires(timer),
                                          base->offset);
                      if (expires.tv64 < expires_next.tv64)
                              expires_next = expires;
                      break;
              }

              __run_hrtimer(timer, &basenow);
      }

However, this seems to be an issue only in older kernels. In upstream v6.0, I believe the next hrtimer interrupt will be programmed correctly, because __hrtimer_next_event_base() calls hrtimer_get_expires(), which correctly uses the “hard expiry” times to do the programming.

As of v6.2, the __hrtimer_run_queues() function looks like this:

static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
				 unsigned long flags, unsigned int active_mask)
{
	struct hrtimer_clock_base *base;
	unsigned int active = cpu_base->active_bases & active_mask;

	for_each_active_base(base, cpu_base, active) {
		struct timerqueue_node *node;
		ktime_t basenow;

		basenow = ktime_add(now, base->offset);

		while ((node = timerqueue_getnext(&base->active))) {
			struct hrtimer *timer;

			timer = container_of(node, struct hrtimer, node);

			/*
			 * The immediate goal for using the softexpires is
			 * minimizing wakeups, not running timers at the
			 * earliest interrupt after their soft expiration.
			 * This allows us to avoid using a Priority Search
			 * Tree, which can answer a stabbing query for
			 * overlapping intervals and instead use the simple
			 * BST we already have.
			 * We don't add extra wakeups by delaying timers that
			 * are right-of a not yet expired timer, because that
			 * timer will have to trigger a wakeup anyway.
			 */
			if (basenow < hrtimer_get_softexpires_tv64(timer))
				break;

			__run_hrtimer(cpu_base, base, timer, &basenow, flags);
			if (active_mask == HRTIMER_ACTIVE_SOFT)
				hrtimer_sync_wait_running(cpu_base, flags);
		}
	}
}

The use of hrtimer_get_softexpires_tv64() might be confusing: how does this loop expire non-slack timers, which have only a hard expiry time? The key is that what was once simply called the expiry is now the soft expiry, even for non-slack timers (their soft and hard expiry are equal). Consequently, the condition basenow < hrtimer_get_softexpires_tv64(timer) handles both slack and non-slack timers correctly.
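To make this concrete, here is roughly what the expiry setters look like, paraphrased from include/linux/hrtimer.h (field names are from the kernel versions discussed above and may differ slightly in others):

/* Non-slack timers: soft and hard expiry are identical. */
static inline void hrtimer_set_expires(struct hrtimer *timer, ktime_t time)
{
	timer->node.expires = time;	/* hard expiry */
	timer->_softexpires = time;	/* soft expiry */
}

/* Range (slack) timers: hard expiry = soft expiry + delta. */
static inline void hrtimer_set_expires_range_ns(struct hrtimer *timer,
						ktime_t time, u64 delta)
{
	timer->_softexpires = time;
	timer->node.expires = ktime_add_safe(time, ns_to_ktime(delta));
}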

The writer works in the ChromeOS kernel team, where most of the system libraries, low-level components and user space are written in C++. Thus the writer has no choice but to be familiar with C++. It is not that hard, but some things are confusing; rvalue references are definitely confusing.

In this post, I wish to document rvalue references with simple examples, before I forget them.

Refer to this article for in-depth coverage on rvalue references.

In a nutshell: an rvalue reference can be used to construct a C++ object efficiently using a “move constructor”. The move constructor achieves this efficiency by moving the underlying memory of the source object to the destination instead of doing a full copy. Typically, the move constructor copies the pointers within the source object into the destination object and then nulls out those pointers in the source object.

An rvalue reference variable is denoted by a double ampersand (&&).

For example, T &&y denotes a variable y which is an rvalue reference of type T. I have almost never seen an rvalue reference variable created this way in real code, and I have no idea when it can be useful. Almost always, rvalue references are created by either of the 2 methods in the next section. These methods create an “unnamed” rvalue reference which can be passed to a class's move constructor.

When is an rvalue reference created?

In the examples below, we create an rvalue reference to a vector and then construct another vector object from it.

This can happen in 2 ways (that I know of):

1. Using std::move

This casts its argument (an lvalue) to an rvalue reference.

Example:

#include <cstdio>
#include <vector>

int main()
{
    int *px, *py;
    std::vector<int> x = {4,3};
    px = &(x[0]);
 
    // Convert lvalue 'x' to rvalue reference and pass
    // it to vector's overloaded move constructor.
    std::vector<int> y(std::move(x)); 
    py = &(y[0]);

    // Confirm the new vector uses same storage
    printf("same vector? : %d\n", px == py); // prints 1
}

2. When returning something from a function

The object returned from the function can be caught as an rvalue reference to that object.

#include <cstdio>
#include <vector>

int *pret;
int *py;

std::vector<int> myf(int a)
{
    std::vector<int> ret;

    ret.push_back(a * a);

    pret = &(ret[0]);

    // Return is caught as an rvalue ref: vector<int> &&
    return ret;
}

int main()
{
    // Invoke vector's move constructor.
    std::vector<int> y(myf(4)); 
    py = &(y[0]);

    // Confirm the vectors share the same underlying storage
    printf("same vector? : %d\n", pret == py); // prints 1
}

Note on move assignment

Interestingly, if you construct vector 'y' using the assignment syntax, std::vector<int> y = myf(4);, the compiler may still use the move constructor even though this looks like an assignment. This is because the statement is copy-initialization, which invokes a constructor rather than vector's assignment operator.

Further, the compiler may not invoke a constructor at all and simply perform RVO (Return Value Optimization).

Quiz

Question:

If I create a named rvalue reference (here, by binding it directly to the return value of a function) and then use this reference to create a vector, the underlying storage of the new vector is different. Why?

#include <cstdio>
#include <vector>

int *pret;
int *py;

std::vector<int> myf(int a)
{
    std::vector<int> ret;

    ret.push_back(a * a);

    pret = &(ret[0]);

    // Return is caught as an rvalue ref: vector<int> &&
    return ret;
}

int main()
{
    // Is vector's move constructor invoked here?
    std::vector<int>&& ref = myf(4);
    std::vector<int> y(ref); 
    py = &(y[0]);

    // Check whether the vectors share the same underlying storage
    printf("same vector? : %d\n", pret == py); // prints 0
}

Answer

The answer is: because the value category of the id-expression 'ref' is lvalue (a named rvalue reference is itself an lvalue), the copy constructor is chosen. To use the move constructor, it has to be std::vector<int> y(std::move(ref));.

Conclusion

rvalue references are confusing, and the compiler can apply different optimizations that cause further confusion. It is best to follow well-known design patterns when designing your code. It may even be best to avoid rvalue references altogether, but hopefully this article helps you understand them a bit more when you come across them in large C++ code bases.

GUS (Global Unbounded Sequences) is a memory reclaim algorithm used in FreeBSD, similar to RCU. It borrows concepts from Epoch and Parsec. A video of a presentation describing the integration of GUS with UMA (FreeBSD's slab implementation) is here: https://www.youtube.com/watch?v=ZXUIFj4nRjk

The best description of GUS is in the FreeBSD code itself. It is based on the concept of global write clock, with readers catching up to writers.

Effectively, I see GUS as an implementation of light traveling from distant stars. When a photon leaves a star, it is no longer needed by the star and is ready to be reclaimed. However, on earth we can't see the photon yet, we can only see what we've been shown so far, and in a way, if we've not seen something because enough “time” has not passed, then we may not reclaim it yet. If we've not seen something, we will see it at some point in the future. Till then we need to sit tight.

Roughly, an implementation has 2+N counters (with N CPUs):

  1. A global write sequence.

  2. A global read sequence.

  3. A per-CPU read sequence (sampled from #1 when a reader starts).

On freeing, the object is tagged with the current write sequence. Only once the global read sequence has caught up with that write sequence is the object actually freed; until then, the freeing is deferred. The poll() operation updates #2 by referring to #3 of all CPUs: whatever was tagged between the old read sequence and the new read sequence can now be freed. This is similar to synchronize_rcu() in the Linux kernel, which waits for all readers to finish observing the object being reclaimed.
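Below is a minimal userspace sketch of this counter scheme. It is purely my own illustration, not FreeBSD's actual smr(9) code, all names are made up, and it ignores the memory-barrier and race subtleties listed in the drawbacks that follow:

#include <stdint.h>

#define NCPUS 4

static uint64_t wr_seq = 1;		/* 1. global write sequence */
static uint64_t rd_seq = 1;		/* 2. global read sequence */
static uint64_t cpu_rd_seq[NCPUS];	/* 3. per-CPU read sequence; 0 means no active reader */

/* Reader entry: sample the global write sequence. */
static void reader_enter(int cpu)
{
	cpu_rd_seq[cpu] = wr_seq;
}

/* Reader exit: this CPU no longer constrains reclamation. */
static void reader_exit(int cpu)
{
	cpu_rd_seq[cpu] = 0;
}

/* Free path: advance the write sequence and tag the object with it. */
static uint64_t retire_object(void)
{
	return ++wr_seq;
}

/*
 * Poll: advance the global read sequence to the oldest sequence
 * still observed by an active reader (or to wr_seq if none).
 */
static void poll_readers(void)
{
	uint64_t min = wr_seq;

	for (int cpu = 0; cpu < NCPUS; cpu++) {
		uint64_t s = cpu_rd_seq[cpu];

		if (s != 0 && s < min)
			min = s;
	}
	rd_seq = min;
}

/* An object is safe to free once the read sequence has caught up with its tag. */
static int can_free(uint64_t tag)
{
	return rd_seq >= tag;
}

In real code all of this would need atomics and barriers; the point is only to show the relationship between the 2+N counters and the poll() step.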

Note the scalability drawbacks of this reclaim scheme:

  1. Expensive poll operation if you have 1000s of CPUs. (Note: Parsec uses a tree-based mechanism to improve the situation, which GUS could consider adopting.)

  2. Heavy-weight memory barriers are needed (SRCU has a similar drawback) to ensure ordering properties of reader sections with respect to the poll() operation.

  3. There can be a delay between reading the global write-sequence number and writing it into the per-cpu read-sequence number. This can cause the per-cpu read-sequence to advance past the global write-sequence. Special handling is needed.

One advantage of the scheme could be implementation simplicity.

RCU (not SRCU or Userspace RCU) doesn't suffer from these drawbacks. Reader-sections in Linux kernel RCU are extremely scalable and lightweight.

The SRCU flavor of RCU uses per-cpu counters to detect that every CPU has passed through a quiescent state for a particular SRCU lock instance (srcu_struct).

There are a total of 4 counters per CPU: one pair for locks, and another for unlocks. You can think of the SRCU instance as being split into 2 parts. The readers sample srcu_idx and decide which part to use. Each part corresponds to one pair of lock and unlock counters. A reader increments a part's lock counter during locking, and likewise the unlock counter during unlocking.

During an update, the updater flips srcu_idx (thus attempting to force new readers to use the other part) and waits for the lock/unlock counters corresponding to the previous value of srcu_idx to match. Once the sum of the lock counters of all CPUs matches that of the unlock counters, the system knows all pre-existing read-side critical sections have completed.
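A very rough sketch of this counting scheme is below. It is my own simplification with made-up names (toy_*), not the kernel's SRCU implementation; real SRCU keeps these counters inside srcu_struct as per-CPU variables and adds the memory barriers and grace-period machinery this toy version omits:

#include <stdbool.h>

#define NCPUS 4

/* Two "parts", each with per-CPU lock and unlock counters. */
static unsigned long lock_count[2][NCPUS];
static unsigned long unlock_count[2][NCPUS];
static int srcu_idx;	/* which part new readers use */

static int toy_srcu_read_lock(int cpu)
{
	int idx = srcu_idx;		/* sample the current index */

	lock_count[idx][cpu]++;		/* enter part 'idx' on this CPU */
	return idx;			/* reader remembers the part it used */
}

static void toy_srcu_read_unlock(int cpu, int idx)
{
	unlock_count[idx][cpu]++;
}

/* All readers of part 'idx' are done when the summed counts match. */
static bool readers_done(int idx)
{
	unsigned long locks = 0, unlocks = 0;

	for (int cpu = 0; cpu < NCPUS; cpu++) {
		locks += lock_count[idx][cpu];
		unlocks += unlock_count[idx][cpu];
	}
	return locks == unlocks;
}

static void toy_synchronize_srcu(void)
{
	/*
	 * Wait for readers on both parts (the 'A MUST' steps in the
	 * pseudo-code further below), flipping srcu_idx in between so
	 * that pre-existing readers drain in bounded time.
	 */
	while (!readers_done(!srcu_idx))
		;
	srcu_idx = !srcu_idx;	/* flip */
	while (!readers_done(!srcu_idx))
		;
}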

Things are not that simple, however. It is possible that a reader samples srcu_idx, but before it can increment the lock counter corresponding to it, it undergoes a long delay. We thus end up in a situation where there are readers in both srcu_idx = 0 and srcu_idx = 1.

To prevent such a situation, a writer has to wait for readers corresponding to both srcu_idx = 0 and srcu_idx = 1 to complete. This is depicted with 'A MUST' in the pseudo-code below:

        reader 1        writer                        reader 2
        -------------------------------------------------------
        // read_lock
        // enter
        Read: idx = 0;
        <long delay>    // write_lock
                        // enter
                        wait_for lock[1]==unlock[1]
                        idx = 1; /* flip */
                        wait_for lock[0]==unlock[0]
                        done.
                                                      Read: idx = 1;
        lock[0]++;
                                                      lock[1]++;
                        // write_lock
                        // return
        // read_lock
        // return
        /**** NOW BOTH lock[0] and lock[1] are non-zero!! ****/
                        // write_lock
                        // enter
                        wait_for lock[0]==unlock[0] <- A MUST!
                        idx = 0; /* flip */
                        wait_for lock[1]==unlock[1] <- A MUST!

NOTE: QRCU has a similar issue. However it overcomes such a race in the reader by retrying the sampling of its 'srcu_idx' equivalent.

Q: If you have to wait for readers of both srcu_idx = 0 and 1, then why not just have a single counter and do away with the “flipping” logic? Ans: Because of updater forward progress. If we had a single counter, it is possible that new readers would constantly increment the lock counter, and updaters would be waiting all the time. With the 'flip' logic, pre-existing readers on the now-inactive part of srcu_idx can be drained in a bounded time, since the number of readers of a 'flipped' part only monotonically decreases: new readers go to its counterpart.

The Message Passing pattern (MP pattern) is shown in the snippet below (borrowed from the LKMM docs). Here, P0 and P1 are 2 CPUs executing some code. P0 stores a message in buf and then signals to consumers like P1 that the message is available, by doing a store to flag. P1 reads flag and, if it is set, knows that some data is available in buf and goes ahead and reads it. However, if flag is not set, then P1 does nothing else. Without memory barriers between P0's stores and P1's loads, the stores can appear out of order to P1 (on some systems), thus breaking the pattern. The failure scenario in the code below is r1 == 1 and r2 == 0, that is, P1 sees the flag set but misses the message. Only after the flag variable is updated should P1 be allowed to read buf (the “message”). A fixed variant using acquire/release ordering is sketched right after the snippet.

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2 = 0;

                r1 = READ_ONCE(flag);
                if (r1)
                        r2 = READ_ONCE(buf);
        }
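For reference, one common way to fix the pattern in kernel code is with store-release/load-acquire ordering, sketched below (my sketch, not copied from the LKMM docs; smp_store_release() and smp_load_acquire() are the kernel's real ordering primitives). The release on flag orders P0's store to buf before it, and the acquire on flag orders P1's load of buf after it, so r1 == 1 implies r2 == 1:

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                /* Release: orders the buf store before the flag store. */
                smp_store_release(&flag, 1);
        }

        P1()
        {
                int r1;
                int r2 = 0;

                /* Acquire: orders the flag load before the buf load. */
                r1 = smp_load_acquire(&flag);
                if (r1)
                        r2 = READ_ONCE(buf);
        }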

Below is a simple program in PlusCal to model the “Message passing” access pattern and check whether the failure scenario r1 == 1 and r2 == 0 could ever occur. In PlusCal, we can model the non-deterministic, possibly out-of-order stores to buf and flag using an either-or block. This makes PlusCal evaluate both orders of the stores (store to buf first and then flag, or vice versa) during model checking. The technique used for modeling this non-determinism is similar to how it is done in Promela/Spin using an “if block” (refer to Paul McKenney's perfbook for details on that).

EXTENDS Integers, TLC
(*--algorithm mp_pattern
variables
    buf = 0,
    flag = 0;

process Writer = 1
    begin
e0:
       either
e1:        buf := 1;
e2:        flag := 1;
        or
e3:        flag := 1;
e4:        buf := 1;
        end either;
end process;

process Reader = 2
variables
    r1 = 0,
    r2 = 0;  
    begin
e5:     r1 := flag;
e6:     if r1 = 1 then
e7:         r2 := buf;
        end if;
e8:     assert r1 = 0 \/ r2 = 1;
end process;

end algorithm;*)

Sure enough, the assert r1 = 0 \/ r2 = 1; fires when the PlusCal program is run through the TLC model checker.

I do find the either-or block clunky, and wish I could just do something like:

non_deterministic {
        buf := 1;
        flag := 1;
}

And then PlusCal would evaluate both store orders. In fact, if I wanted more than 2 stores, it could get unwieldy pretty quickly without such a construct. I should try to hack on the PlusCal sources to do exactly this if I get time; thankfully it is open source software.

Other notes:

  • PlusCal is a powerful language that translates to TLA+. TLA+ is to PlusCal what assembler is to C. I do find PlusCal's syntax to be non-intuitive, but that could just be because I am new to it. In particular, I hate having to mark statements with labels if I don't want them to execute atomically with neighboring statements. In PlusCal, a label marks a statement as an “atomic” entity; a group of statements under one label all execute atomically. However, if you don't specify labels on every statement like I did above (eX), then everything goes under a neighboring label. I wish PlusCal had an option where a programmer could add implicit labels to all statements, and then add explicit atomic { } blocks around statements that were indeed atomic. This is similar to how it is done in Promela/Spin.

  • I might try to hack up my own compiler to TLA+ if I can find the time to, or better yet modify PlusCal itself to do what I want. Thankfully the code for the PlusCal translator is open source software.

Note: At the time of this writing, the latest kernel release is v5.3. RCU moves fast and can change in the future, so some details in this article may become obsolete.

The RCU subsystem and the task scheduler are inter-dependent: they both rely on each other to function correctly. The scheduler has many data structures that are protected by RCU, and RCU may need to wake up threads to do things like complete grace periods and run callbacks. One such case where RCU does a wake-up and enters the scheduler is rcu_read_unlock_special().

Recently Paul McKenney consolidated RCU flavors. What does this mean?

Consider the following code executing in CPU 0:

preempt_disable();
rcu_read_lock();
rcu_read_unlock();
preempt_enable();

And, consider the following code executing in CPU 1:

a = 1;
synchronize_rcu();  // Assume synchronize_rcu
                    // executes after CPU0's rcu_read_lock
b = 2;

CPU 0's execution path shows 2 flavors of RCU readers, one nested into another. The preempt_{disable,enable} pair is an RCU-sched flavor RCU reader section, while the rcu_read_{lock,unlock} pair is an RCU-preempt flavor RCU reader section.

In older kernels (before v4.20), CPU 1's synchronize_rcu() could return after CPU 0's rcu_read_unlock() but before CPU 0's preempt_enable(). This is because synchronize_rcu() only needs to wait for the “RCU-preempt” flavor of the RCU grace period to end.

In newer kernels (v4.20 and above), the RCU-preempt and RCU-sched flavors have been consolidated. This means CPU 1's synchronize_rcu() is guaranteed to wait for both CPU 0's rcu_read_unlock() and CPU 0's preempt_enable() to complete.

Now, let's get a bit more detailed. That rcu_read_unlock() most likely does very little. However, there are cases where it needs to do more, by calling rcu_read_unlock_special(). One such case is if the reader section was preempted. A few more cases are:

  • The RCU reader is blocking an expedited grace period, so it needs to report a quiescent state quickly.
  • The RCU reader is blocking a grace period for too long (~100 jiffies on my system; that's the default, but it can be set with the rcutree.jiffies_till_sched_qs parameter).

In all these cases, rcu_read_unlock() needs to do more work. However, care must be taken when calling rcu_read_unlock() from the scheduler; that is the reason for this article on scheduler deadlocks.

One of the reasons rcu_read_unlock_special() needs to call into the scheduler is priority de-boosting: a task getting preempted in the middle of an RCU read-side critical section blocks the completion of that critical section and hence could prevent current and future grace periods from ending. So the priority of the RCU reader may need to be boosted so that it gets enough CPU time to make progress and let the grace period end soon. But it also needs to be de-boosted after the reader section completes. This de-boosting happens via the outermost rcu_read_unlock() calling rcu_read_unlock_special().

What could go wrong with the scheduler using RCU? Let us see this in action. Consider the following piece of code executed in the scheduler:

	reader()
	{
		rcu_read_lock();
		do_something();     // Preemption happened
		                    /* Preempted task got boosted */
		task_rq_lock();     // Disables interrupts
		rcu_read_unlock();  // Need to de-boost
		task_rq_unlock();   // Re-enables interrupts
	}

Assume that the rcu_read_unlock() needs to de-boost the task's priority. This may cause it to enter the scheduler and cause a deadlock due to recursive locking of RQ/PI locks.

Because of these kinds of issues, there has traditionally been a rule that RCU usage in the scheduler must follow:

“Thou shall not hold RQ/PI locks across an rcu_read_unlock() if thou art not holding it or disabling IRQs across both the rcu_read_lock() and the rcu_read_unlock().”

More on this rule can be read here as well.

Obviously, acquiring the RQ/PI locks across the whole rcu_read_lock()/rcu_read_unlock() pair would resolve the above situation: since preemption and interrupts are then disabled across the entire reader section, there is no question of task preemption, and hence no boosting or de-boosting to undo, as sketched below.
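For contrast with the earlier snippet, here is a sketch of the pattern that follows the rule, using the same simplified task_rq_lock()/task_rq_unlock() calls as the pseudo-code above (so, an illustration rather than real scheduler code):

	reader()
	{
		task_rq_lock();     // Disables interrupts, so no preemption
		rcu_read_lock();
		do_something();     // Cannot be preempted, so never boosted
		rcu_read_unlock();  // Nothing special to do, no de-boost needed
		task_rq_unlock();   // Re-enables interrupts
	}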

Anyway, the point is that rcu_read_unlock() needs to be careful about scheduler wake-ups: either by avoiding calls to rcu_read_unlock_special() altogether (as is the case if interrupts are disabled across the entire RCU reader), or by detecting situations where a wake-up is unsafe. Peter Zijlstra says there's no way to know when the scheduler uses RCU, so “generic” detection of the unsafe condition is a bit tricky.

Now with RCU consolidation, the above situation actually improves. Even if the scheduler RQ/PI locks are not held across the whole read-side critical section, but only across the rcu_read_unlock(), that by itself may be enough to prevent a scheduler deadlock. The reasoning is: during the rcu_read_unlock(), we cannot yet report a QS until the RQ/PI lock is itself released, since holding the lock means preemption is disabled, and that causes the QS report to be deferred. As a result, the priority de-boosting is also deferred, which prevents a possible scheduler deadlock.

However, RCU consolidation introduces even newer scenarios where rcu_read_unlock() has to enter the scheduler if the “scheduler rules” above are not honored, as explained below.

Consider the previous code example. Now also assume that the RCU reader is blocking an expedited RCU grace period, which is just a fancy term for a grace period that needs to end fast. These grace periods have to complete much more quickly than normal grace periods. An expedited grace period causes currently running RCU reader sections to receive IPIs that set a hint. Setting this hint results in the outermost rcu_read_unlock() calling rcu_read_unlock_special(), which otherwise would not occur. When rcu_read_unlock_special() gets called in this scenario, it gets more aggressive once it notices that the reader has blocked an expedited RCU grace period. In particular, it notices that preemption is disabled, so the grace period cannot end (due to RCU consolidation). Out of desperation, it raises a softirq (raise_softirq()) in the hope that the next time the softirq runs, the grace period could be ended quickly, before the next scheduler tick occurs. But that can cause a scheduler deadlock by way of entry into the scheduler due to a ksoftirqd wakeup.

The cure for this problem is the same: holding the RQ/PI locks across the entire reader section removes any question of a scheduler-related deadlock due to recursive acquisition of these locks, because there would be no expedited-grace-period IPIs, hence no hints being set, and hence no calls to rcu_read_unlock_special() from scheduler code. For a twist on the IPI problem, see the special note at the end.

However, the RCU consolidation throws yet another curve ball. Paul McKenney explained on LKML that there is yet another situation now due to RCU consolidation that can cause scheduler deadlocks.

Consider the following code, where previous_reader() and current_reader() execute in quick succession in the context of the same task:

	previous_reader()
	{
		rcu_read_lock();
		do_something();      // Preemption or IPI happened
		local_irq_disable(); // Cannot be the scheduler
		do_something_else();
		rcu_read_unlock();   // As IRQs are off, defer QS report
		                     // but set deferred_qs bit in
		                     // rcu_read_unlock_special
		do_some_other_thing();
		local_irq_enable();
	}

        // QS from previous_reader() is still deferred.
	current_reader() 
	{
		local_irq_disable();  // Might be the scheduler.
		do_whatever();
		rcu_read_lock();
		do_whatever_else();
		rcu_read_unlock();    // Must still defer reporting QS
		do_whatever_comes_to_mind();
		local_irq_enable();
	}

Here previous_reader() was preempted, even though current_reader() was not, yet current_reader()'s rcu_read_unlock() still needs to call rcu_read_unlock_special() from what might be scheduler context! This situation would not happen in the pre-consolidated-RCU world, because previous_reader()'s rcu_read_unlock() would have taken care of it.

As you can see, just following the scheduler rule of disabling interrupts across the entire reader section does not help. To detect the above scenario, a new bitfield, deferred_qs, has been added to the task_struct::rcu_read_unlock_special union. Now what happens is: at rcu_read_unlock() time, previous_reader() sets this bit, and current_reader() checks it. If the bit is set, the call to raise_softirq() is avoided, thus eliminating the possibility of a scheduler deadlock.
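For reference, around v5.3 the rcu_read_unlock_special union looked roughly like this (paraphrased from include/linux/sched.h; the exact layout and comments may differ between versions):

union rcu_special {
	struct {
		u8 blocked;	/* Reader was preempted while in the critical section. */
		u8 need_qs;	/* Reader is blocking a grace period (e.g. held too long). */
		u8 exp_hint;	/* Hint: reader is blocking an expedited grace period. */
		u8 deferred_qs;	/* QS report was deferred, e.g. because IRQs were off. */
	} b;			/* Bits. */
	u32 s;			/* Set of bits, for reading/clearing in one go. */
};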

Hopefully no other scheduler deadlock issue is lurking!

Coming back to the scheduler rule, I have been running overnight rcutorture tests to detect if this rule is ever violated. Here is the test patch checking for the unsafe condition. So far I have not seen this condition occur, which is a good sign.

I may need to check with Paul McKenney about whether proposing this check for mainline is worth it. Thankfully, LPC 2019 is right around the corner! ;-)


Special Note

[1] The expedited IPI interrupting an RCU reader has a variation. For an example, see below, where the IPI was not received (IRQs were off), but we still have a problem: the ->need_qs bit in the rcu_read_unlock_special union got set by the scheduler tick, and the expedited grace period that started after IRQs were disabled set the rnp->expmask bit for this CPU. In the unlock path, because the ->need_qs bit is set, rcu_read_unlock() will call rcu_read_unlock_special() and risk a deadlock by way of a ksoftirqd wakeup, because exp is true in that function.

CPU 0                         CPU 1
preempt_disable();
rcu_read_lock();

// do something real long

// Scheduler-tick sets
// ->need_qs as reader is
// held for too long.

local_irq_disable();
                              // Expedited GP started
// Exp IPI not received
// because IRQs are off.

local_irq_enable();

// Here rcu_read_unlock will
// still call ..._special()
// as ->need_qs got set.
rcu_read_unlock();

preempt_enable();

The fix for this issue is the same as described earlier: disabling interrupts across both rcu_read_lock() and rcu_read_unlock() in the scheduler path.