The Imminent Deprecation of memory_order::consume

There is a proposal making its way through the C++ Standards Committee to defang and deprecate memory_order::consume, and a similar proposal is likely to make its way through the C Standards Committee. This is somewhat annoying from a Linux-kernel-RCU perspective, because there was some reason to hope for language-level support for the address dependencies headed by calls to rcu_dereference().

So what is a Linux-kernel community to do?

The first thing is to review the current approach to rcu_dereference() address dependencies, which uses a combination of standard language features and code standards.

Standard Language Features

Actual Implementations of memory_order::consume

All known implementations of memory_order::consume simply promote it to memory_order::acquire. This is correct from a functional perspective, but leaves performance on the table for PowerPC and for some hardware implementations of ARM. Of less concern for the Linux kernel, on many GPGPUs, memory_order::acquire has significant overhead.
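For reference, here is a minimal C11 sketch of what a standard-language dependency-ordered load looks like. The names struct foo, gp, publish(), and read_a() are hypothetical, and this is not how the Linux kernel implements rcu_dereference(). Given the promotion described above, the load in read_a() would be expected to behave exactly like a memory_order_acquire load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo {
	int a;
};

/* Hypothetical global pointer standing in for an RCU-protected pointer. */
static _Atomic(struct foo *) gp;

/* Publisher: initialize the structure, then make it reachable. */
void publish(struct foo *p, int val)
{
	assert(p != NULL);
	p->a = val;
	atomic_store_explicit(&gp, p, memory_order_release);
}

/* Subscriber: returns -1 if nothing has been published yet. */
int read_a(void)
{
	/* Heads the dependency chain; compilers treat this as acquire. */
	struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

	return p ? p->a : -1;
}
```

The p->a access in read_a() is the dependent load that memory_order_consume was intended to order without needing the heavier acquire ordering.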

There are those who claim that in the future, accesses using memory_order::acquire will be no more expensive than those using memory_order::consume. And who knows, maybe they are correct. But in the meantime, we must deal with the hardware that we actually have.

Your Friend in Need: volatile

Some might argue that the volatile keyword is underspecified by the C and C++ standards, but a huge body of device-driver code does constrain the compilers, albeit sometimes rather controversially. And rcu_dereference() uses a volatile load to fetch the pointer, which prevents the compiler from spoiling the fun by (say) reloading from the same pointer, whose value might well have changed in the meantime.
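The effect of the volatile load can be sketched as follows. MY_READ_ONCE() is a hypothetical stand-in for the kernel's READ_ONCE() rather than the kernel's actual definition, it relies on the GNU C __typeof__ extension, and struct foo and read_a_once() are likewise made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct foo {
	int a;
};

/* Hypothetical stand-in for READ_ONCE(): forces a volatile access to x. */
#define MY_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Returns -1 if *gpp is NULL, otherwise the ->a field of the snapshot. */
int read_a_once(struct foo **gpp)
{
	struct foo *p;

	assert(gpp != NULL);
	p = MY_READ_ONCE(*gpp);	/* One volatile load from *gpp. */
	if (!p)
		return -1;
	return p->a;		/* Uses the snapshot in p, not a reload of *gpp. */
}
```

With a plain load, the compiler would be within its rights to fetch *gpp once for the NULL check and again for the dereference, and the two fetches could return different pointers.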

A Happy Consequence of Data-Race Prohibition

The C language forbids implementations (in our case, compilers) from creating data races, that is, situations where the object code has concurrent C-language accesses to a given variable, at least one of which is a store.

So how does this help?

For one thing, it constrains compiler-based speculation. To see this, suppose that one thread does this:

p = kmalloc(sizeof(*p), GFP_KERNEL);
p->a = 42;
rcu_assign_pointer(gp, p);

And another thread does this:

p = rcu_dereference(gp);
do_something_with(p->a);

At the source-code level, there is clearly no data race. But suppose the compiler uses profile-directed optimization, and learns that the value returned by rcu_dereference() is almost always 0x12345678. Such a compiler might be tempted to emit code to cause the hardware to concurrently execute the rcu_dereference() while also loading 0x12345678->a. If the rcu_dereference() returned the expected value of 0x12345678, the compiler could use the value loaded from 0x12345678->a; otherwise, it could load p->a.

The problem is that the two threads might execute concurrently as follows:

p = kmalloc(sizeof(*p), GFP_KERNEL);

// These two concurrently:
r1 = 0x12345678->a;
p->a = 0xdead1eafbadfab1e;

rcu_assign_pointer(gp, p);
p = rcu_dereference(gp); // Returns 0x12345678!!!

Because the value of p is 0x12345678, it appears that the speculation has succeeded. But the second thread's load into r1 ran concurrently with the first thread's store into p->a, which might result in user-visible torn loads and stores, or just plain pre-initialization garbage.

This sort of software speculation is therefore forbidden.

Yes, hardware can get away with this sort of thing because it tracks cache state. If a compiler wishes to generate code that executes speculatively, it must use something like hardware transactional memory, which typically has overhead that overwhelms any possible benefit.
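To make the forbidden transformation concrete, here is a sketch of what the value-speculating compiler would in effect be emitting for the reader, written back out as C. The function speculative_read() and its guess parameter (playing the role of 0x12345678) are hypothetical, and a volatile load stands in for rcu_dereference().

```c
#include <assert.h>
#include <stddef.h>

struct foo {
	int a;
};

/*
 * Illustration only: this is the shape of the transformation that
 * compilers must not apply to:
 *	p = rcu_dereference(gp);
 *	do_something_with(p->a);
 */
int speculative_read(struct foo *volatile *gpp, struct foo *guess)
{
	int r1;
	struct foo *p;

	assert(gpp != NULL && guess != NULL);
	r1 = guess->a;	/* Speculative load, before the pointer is known. */
	p = *gpp;	/* Volatile load standing in for rcu_dereference(). */
	if (p == guess)
		return r1;	/* "Successful" guess, but r1 might be stale or torn. */
	return p->a;		/* Wrong guess: fall back to the dependent load. */
}
```

Run single-threaded, this returns the right answer. The trouble arises only when the load into r1 overlaps an updater's initialization of the structure at the guessed address, which is exactly the data race that the compiler is forbidden from creating.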

Code Standards

The Documentation/RCU/rcu_dereference.rst file presents the Linux kernel's code standards for the address dependencies headed by members of the rcu_dereference() API family. A summary of the most widely applicable of these standards is as follows:

  1. An address dependency must be headed by an appropriate member of the rcu_dereference() API family. The variables holding the return value from a member of this API family are said to be carrying a dependency.
  2. In the special case where data is added and never removed, READ_ONCE() can be substituted for one of the rcu_dereference() APIs.
  3. Address dependencies are carried by pointers only, and specifically not by integers. (With the exception that integer operations may be used to set, clear, and XOR bits in the pointers, which requires those pointers to be translated to integers, have their bits manipulated, and then translated immediately back to pointers.)
  4. Operations that cancel out all the bits in the original pointer break the address dependency.
  5. Comparing a dependency-carrying pointer to the address of a statically allocated variable can break the dependency chain. (Though there are special rules that allow such comparisons to be carried out safely in some equally special cases.)
  6. Special operations on hardware instruction caches may be required when using pointers to JITed functions.
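The bit-manipulation exception in the third rule might look something like the following sketch. The TAG_MASK value and the tag_ptr(), untag_ptr(), and ptr_is_tagged() helpers are hypothetical, and the code assumes that struct foo is at least two-byte aligned and that the implementation's pointer-to-integer conversions round-trip in the usual way.

```c
#include <assert.h>
#include <stdint.h>

struct foo {
	int a;
};

#define TAG_MASK ((uintptr_t)0x1)	/* Hypothetical low-order tag bit. */

/* Set the tag bit, translating immediately back to a pointer. */
struct foo *tag_ptr(struct foo *p)
{
	assert(((uintptr_t)p & TAG_MASK) == 0);
	return (struct foo *)((uintptr_t)p | TAG_MASK);
}

/* Clear the tag bit, again translating immediately back to a pointer. */
struct foo *untag_ptr(struct foo *p)
{
	return (struct foo *)((uintptr_t)p & ~TAG_MASK);
}

/* Test the tag bit without dereferencing the pointer. */
int ptr_is_tagged(struct foo *p)
{
	return ((uintptr_t)p & TAG_MASK) != 0;
}
```

Each helper leaves almost all of the original pointer's bits intact. By contrast, something like masking the pointer with zero would cancel out every bit, which is the situation that the fourth rule warns against.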

The Documentation/RCU/rcu_dereference.rst file provides much more detail, which means that scanning the above list is not a substitute for reading the full file.
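As one illustration of why the rule about comparisons exists, consider the following sketch. The default_foo variable and both functions are hypothetical, a volatile load again stands in for rcu_dereference(), and the "careful" variant reflects my reading of the special cases rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

struct foo {
	int a;
};

/* Hypothetical statically allocated structure. */
struct foo default_foo = { .a = 0 };

/* Hazardous: after the equality check, p's value is known to the compiler. */
int read_a_hazard(struct foo *volatile *gpp)
{
	struct foo *p = *gpp;

	assert(p != NULL);
	if (p == &default_foo)
		return p->a;	/* May be compiled as default_foo.a: no dependency. */
	return p->a;
}

/* More careful: never dereference p on the branch where it compared equal. */
int read_a_careful(struct foo *volatile *gpp)
{
	struct foo *p = *gpp;

	assert(p != NULL);
	if (p == &default_foo)
		return -1;	/* Report "default" without touching *p. */
	return p->a;		/* Compiler knows only what p is not. */
}
```

In read_a_hazard(), a compiler is free to replace p with &default_foo on the equal branch, at which point the load of ->a no longer depends on the pointer load and can be reordered ahead of it on weakly ordered hardware.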

Enforcing Code Standards

My fond hope in the past was that compilers would have facilities that disable the optimizations requiring the code standards, but that effort seems likely to require greater life expectancy than I can bring to bear. That said, I definitely encourage others to try their hands at this.

But in the meantime, we need a way to enforce these code standards.

One approach is obviously code review, but it would be good to have automated help for this.

And Paul Heidekrüger presented a prototype tool at the 2022 Linux Plumbers Conference. This tool located several violations of the rule against comparing dependency-carrying pointers against the addresses of statically allocated variables.

Which suggests that continued work on such tooling could be quite productive.

Summary

So memory_order::consume is likely to go away, as is its counterpart in the C standard. This is not an immediate problem because all known implementations simply map memory_order::consume to memory_order::acquire, with those who care using other means to head address dependencies. (In the case of the Linux kernel, volatile loads.)

However, this does leave those who care with the issue of checking code using things like rcu_dereference(), given that the language standards are unlikely to provide any help any time soon.

Continued work on tooling that checks the handling of dependency chains in the object code therefore seems like an eminently reasonable step forward.