What on Earth Does Pointer Provenance Have to do With RCU?

TL;DR: Unless you are doing very strange things with RCU, not much!!!

So why has the guy most responsible for Linux-kernel RCU spent so much time over the past five years working on the provenance-related lifetime-end pointer-zap issue within the C++ Standards Committee?

But first...

What is Pointer Provenance?

Back in the old days, provenance was for objets d'art and the like, and we did not need it for our pointers, no sirree!!! Pointers had bits, those bits formed memory addresses, and as often as not we didn't even need to worry about these addresses being translated. But life is more complicated now. On the other hand, computing life is also much bigger, faster, more reliable, and (usually) more productive, so be extremely careful what you wish for from the Good Old Days!

These days, pointers have provenance as well as addresses, and this has consequences. The C++ Standard (recent draft) states that when an object's storage duration ends, any pointers to that object become invalid. For its part, the C Standard states that when an object's storage duration ends, any pointers to that object become indeterminate. In both standards, the wording is more precise, but this will serve for our purposes. (For the remainder of this document, we will follow C++ and say “invalid”, which is shorter than “indeterminate”. We will balance this out by using C-language example code.)

Neither standard places any constraints on what a compiler can do with an invalid pointer value, even if all you are doing is loading or storing that value.

Those of us who cut our teeth on assembly language might quite reasonably ask why anyone would even think to make pointers so invalid that you cannot even load or store them. Let's start by looking at pointer comparisons using this code fragment:

p = kmalloc(...);
might_kfree(p);         // Pointer might become invalid (AKA "zapped")
q = kmalloc(...);       // Assume that the addresses of p and q are equal.
if (p == q)             // Compiler can optimize as "if (false)"!!!
    do_something();

Both p and q contain addresses, but the compiler also keeps track of the fact that their values were obtained from different invocations of kmalloc(). This information forms part of each pointer's provenance. This means that p and q have different provenance, which in turn means that the compiler does not need to generate any code for the p == q comparison. The two pointers' provenance differs, so the result cannot be anything other than false.

And this is one motivation for pointer provenance and invalidity: The results of operations on invalid pointers are not guaranteed, which provides additional opportunities for optimization. This example perhaps seems a bit silly, but modern compilers can use pointer provenance and invalidity to carry out serious points-to and aliasing analysis.
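If you would like to experiment, here is a minimal userspace analogue of the above fragment, assuming malloc() and free() in place of kmalloc() and kfree(). Whether any given compiler actually folds the comparison depends on that compiler and its version; the point is only that the standards permit it, precisely because p is used after its object's storage duration has ended:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int *p = malloc(sizeof(*p));
    int *q;

    free(p);                // p is now invalid (AKA "zapped").
    q = malloc(sizeof(*q)); // Might well reuse p's old address...

    // ...but p's provenance is dead, so the compiler is permitted to
    // treat this comparison as always false, even when the two
    // addresses are bit-for-bit identical.
    if (p == q)
        printf("Compiler says the pointers are equal.\n");
    else
        printf("Compiler says the pointers differ.\n");

    free(q);
    return 0;
}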

Yes, you can have hardware provenance. Examples include ARM MTE, the CHERI research prototype, and the venerable IBM System i. Conventional systems provide pointer provenance of a sort via their page tables, which is used by a variety of memory-allocation debugging tools, one example being the efence library.

Of course, using invalid (AKA “dangling”) pointers is known to be a bad idea. So why are we even talking about it???

Why Would Anyone Use Invalid/Dangling Pointers?

Please allow me to introduce you to the famous and frequently re-invented LIFO Push algorithm. You can find this in many places, but let's focus on the Linux kernel's llist_add_batch() and llist_del_all() functions. The former atomically pushes a list of elements on a linked-list stack, and the latter just as atomically removes the entire contents of the stack:

static inline bool llist_add_batch(struct llist_node *new_first,
                                   struct llist_node *new_last,
                                   struct llist_head *head)
{
    struct llist_node *first = READ_ONCE(head->first);

    do {
        new_last->next = first;
    } while (!try_cmpxchg(&head->first, &first, new_first));

    return !first;
}

static inline struct llist_node *llist_del_all(struct llist_head *head)
{
    return xchg(&head->first, NULL);
}

As lockless concurrent algorithms go, this one is pretty straightforward. The llist_add_batch() function reads the list header, fills in the ->next pointer, then does a compare-and-exchange operation to point the list header at the new first element. The llist_del_all() function is even simpler, doing a single atomic exchange operation to NULL out the list header and returning the elements that were previously on the list. This algorithm also has excellent forward-progress properties: the llist_add_batch() function is lock-free and the llist_del_all() function is wait-free.
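As a usage sketch, a caller might embed an llist_node in its own structure and drain the stack with llist_for_each_entry_safe(). The struct my_work type and the my_push()/my_drain() helpers below are purely illustrative:

#include <linux/llist.h>
#include <linux/slab.h>

struct my_work {                        // Hypothetical work item.
    struct llist_node llnode;
    int payload;
};

static LLIST_HEAD(my_stack);

static void my_push(struct my_work *w)  // Push a one-element "batch".
{
    llist_add_batch(&w->llnode, &w->llnode, &my_stack);
}

static void my_drain(void)              // Atomically grab and process everything.
{
    struct llist_node *first = llist_del_all(&my_stack);
    struct my_work *w, *tmp;

    llist_for_each_entry_safe(w, tmp, first, llnode) {
        // ... consume w->payload ...
        kfree(w);
    }
}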

So what is not to like?

In assembly language, or with a simple compiler, not much. But to see the pointer-provenance issue with more heavily optimized languages, consider the following sequence of events:

  1. CPU 0 allocates an llist_node B and passes it via both the new_first and new_last parameters of llist_add_batch().
  2. CPU 0 picks up the head->first pointer and places it in the first local variable, then assigns it to new_last->next. This new_last->next pointer now references llist_node A.
  3. CPU 1 invokes llist_del_all(), which returns a list containing llist_node A. The caller of llist_del_all() processes A and passes it to kfree().
  4. CPU 0's new_last->next pointer is now invalid due to llist_node A having been freed. But CPU 0 does not know this.
  5. CPU 1 allocates an llist_node C that happens to have the same address as the old llist_node A. It passes C via both the new_first and new_last parameters of llist_add_batch(), which runs to completion. The head pointer now points to llist_node C, which happens to have the same address as the now storage-duration-ended llist_node A.
  6. CPU 0 finally gets around to executing its try_cmpxchg(), which, given the code generated by typical C compilers, will succeed. The llist now contains an llist_node B that contains an invalid pointer to dead llist_node A, but whose pointer address happens to reference the shiny new llist_node C. (We term this invalid pointer a “zombie pointer” because it has in some assembly-language sense come back from the dead.)
  7. Some CPU invokes llist_del_all() and gets back an llist containing an invalid pointer.

One could argue that the Linux-kernel implementation of LIFO Push is simply buggy and should be fixed. Except that there is no reasonable way to fix it. Which of course raises the question...

What Are Unreasonable Fixes?

We can protect pointers from invalidity by storing them as integers (see the sketch following this list), but:

  1. Suppose someone has an element that they are passing to a library function. They should not be required to convert all their ->next pointers to integer just because the library's developers decide to switch to the LIFO Push algorithm for some obscure internal operation.
  2. In addition, switching to integer defeats type-checking, because integers are integers no matter what type of pointer they came from.
  3. We could restore some type-checking capability by wrapping the integer into a differently named struct for each pointer type. Except that this requires a struct with some particular name to be treated as compatible with pointers of some type corresponding to that name, a notion that current language standards do not support.
  4. In C++, we could use template metaprogramming to wrap an integer into a class that converts automatically to and from compatibly typed pointers. But there would then be windows of time in which there was a real pointer, and at that time there would still be the possibility of pointer invalidity.
  5. All of the above hack-arounds put additional obstacles in the way of developers of concurrent software.
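To make the first two items concrete, here is a rough userspace C11 sketch of LIFO Push with the links stored as integers, using default (sequentially consistent) atomics for simplicity. The zf_ prefix and type names are illustrative only; note the casts at every boundary, and that nothing stops a caller from casting back to the wrong pointer type:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct zf_node {                // Hypothetical node: link stored as an integer.
    uintptr_t next;             // Really a struct zf_node *, honest!
};

struct zf_head {
    _Atomic uintptr_t first;    // Ditto.
};

static bool zf_push(struct zf_node *node, struct zf_head *head)
{
    uintptr_t first = atomic_load(&head->first);

    do {
        node->next = first;     // Integer copy, so no pointer to zap.
    } while (!atomic_compare_exchange_weak(&head->first, &first,
                                           (uintptr_t)node));
    return !first;
}

static struct zf_node *zf_del_all(struct zf_head *head)
{
    // Convert back to a pointer only at the last possible moment.
    return (struct zf_node *)atomic_exchange(&head->first, (uintptr_t)0);
}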

However, it is fair to ask...

Why Do We Care About Strange New Algorithms???

Let's take a look at the history, courtesy of Maged Michael's diligent software archaeology.

In 1986, R. K. Treiber presented an assembly language implementation of the LIFO Push algorithm in technical report RJ 5118 entitled “Systems Programming: Coping with Parallelism” while at the IBM Almaden Research Center.

In 1975, an assembly language implementation of this same algorithm (with pop() instead of popall(), but with the same ABA properties) was presented in the IBM System 370 Principles of Operation as a method for managing a concurrent freelist.

US Patent 3,886,525 was filed in June 1973, just a few months before I wrote my first line of code, and contains a prior-art reference to the LIFO Push algorithm (again with pop() instead of popall()) as follows: “Conditional swapping of a single address is sufficient to program a last-in, first-out single-user-at-a-time sequencing mechanism.” (If you were to ask a patent attorney, you would likely be told that this 50-year-old patent has long since expired. Which should be no surprise, given that it is even older than Dennis Ritchie's setuid Patent 4,135,240.)

All three of these references describe LIFO push as if it was straightforward and well known.

So we don’t know who first invented LIFO Push or when they invented it, but it was well known in 1973. Which is well over a decade before C was first standardized, more than two decades before C++ was first standardized, and even longer before work was started on Rust.

And its combination of (relative) simplicity and excellent forward-progress properties just might be why this algorithm was anonymously invented so long ago and why it is so persistently and repeatedly reinvented. This frequent reinvention puts paid to any notion that LIFO Push is strange.

So sorry, but LIFO Push is neither new nor strange.


More information:

  1. https://isocpp.org/files/papers/P2434R4.html
  2. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p2414r9.pdf
  3. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3347r3.pdf
  4. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3790r0.pdf

What About memory_order::consume?

There is a proposal making its way through the C++ Standards Committee to defang and deprecate memory_order::consume, and a similar proposal is likely to make its way through the C Standards Committee. This is somewhat annoying from a Linux-kernel-RCU perspective, because there was some reason to hope for language-level support for the address dependencies headed by calls to rcu_dereference().

For one thing, the dependency ordering provided by rcu_dereference() constrains compiler-based speculation. To see this, suppose that one thread does this:

p = kmalloc(sizeof(*p), GFP_KERNEL);
p->a = 42;
rcu_assign_pointer(gp, p);

And another thread does this:

p = rcu_dereference(gp);
do_something_with(p->a);

At the source-code level, there is clearly no data race. But suppose the compiler uses profile-directed optimization, and learns that the value returned by rcu_dereference() is almost always 0x12345678. Such a compiler might be tempted to emit code to cause the hardware to concurrently execute the rcu_dereference() while also loading 0x12345678->a. If the rcu_dereference() returned the expected value of 0x12345678, the compiler could use the value loaded from 0x12345678->a; otherwise, it could load p->a.
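For illustration only, the generated code might in effect behave as if the reader had been rewritten as follows, with struct foo, its ->a field, and the reader() wrapper being assumptions of this sketch, and with the transformation itself being one that, as the following discussion shows, must be forbidden:

struct foo { int a; };
struct foo __rcu *gp;

void reader(void)
{
    struct foo *guess = (struct foo *)0x12345678;   // Profile-predicted value.
    int r1 = guess->a;                    // Speculative load, which the hardware...
    struct foo *p = rcu_dereference(gp);  // ...may execute concurrently with this one.

    do_something_with(p == guess ? r1 : p->a);
}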

The problem is that the two threads might execute concurrently as follows:

p = kmalloc(sizeof(*p), GFP_KERNEL);     // Thread 1

// These two execute concurrently:
r1 = 0x12345678->a;                      // Thread 2's speculative load
p->a = 0xdead1eafbadfab1e;               // Thread 1's store to p->a

rcu_assign_pointer(gp, p);               // Thread 1
p = rcu_dereference(gp);                 // Thread 2: returns 0x12345678!!!

Because the value of p is 0x12345678, it appears that the speculation has succeeded. But the second thread's load into r1 ran concurrently with the first thread's store into p->a, which might result in user-visible torn loads and stores, or just plain pre-initialization garbage.

This sort of software speculation is therefore forbidden.

Yes, hardware can get away with this sort of thing because it tracks cache state. If a compiler wishes to generate code that executes speculatively, it must use something like hardware transactional memory, which typically has overhead that overwhelms any possible benefit.

Code Standards

The Documentation/RCU/rcu_dereference.rst file presents the Linux-kernel's code standards for the address dependencies headed by members of the rcu_dereference() API family. A summary of the most widely applicable of these standards is as follows:

  1. An address dependency must be headed by an appropriate member of the rcu_dereference() API family. The variables holding the return value from a member of this API family are said to be carrying a dependency.
  2. In the special case where data is added and never removed, READ_ONCE() can be substituted for one of the rcu_dereference() APIs.
  3. Address dependencies are carried by pointers only, and specifically not by integers. (With the exception that integer operations may be used to set, clear, and XOR bits in the pointers, which requires those pointers to be translated to integers, have their bits manipulated, and then translated immediately back to pointers.)
  4. Operations that cancel out all the bits in the original pointer break the address dependency.
  5. Comparing a dependency-carrying pointer to the address of a statically allocated variable can break the dependency chain. (Though there are special rules that allow such comparisons to be carried out safely in some equally special cases.)
  6. Special operations on hardware instruction caches may be required when using pointers to JITed functions.

The Documentation/RCU/rcu_dereference.rst file provides much more detail, which means that scanning the above list is not a substitute for reading the full file.
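To make items 3 through 5 concrete, here is a hedged sketch; struct foo, gp, default_foo, and the presumed low-order flag bit are all assumptions of this example rather than anything taken from that file:

struct foo {
    int a;
};
struct foo __rcu *gp;
struct foo default_foo;

void reader(void)
{
    struct foo *p;

    rcu_read_lock();
    p = rcu_dereference(gp);      // Heads the address dependency (item 1).

    // Item 3: a brief round trip through an integer to clear a presumed
    // low-order flag bit, converting back to a pointer immediately.
    p = (struct foo *)((uintptr_t)p & ~(uintptr_t)0x1);

    // Item 4: don't do this -- cancelling all the original bits
    // breaks the address dependency.
    // p = (struct foo *)((uintptr_t)p & 0);

    // Item 5: be careful here -- this comparison can break the
    // dependency chain except in the documented special cases.
    // if (p == &default_foo)
    //     return;

    do_something_with(p->a);      // Ordered after the rcu_dereference().
    rcu_read_unlock();
}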

Enforcing Code Standards

My fond hope in the past was that compilers would provide facilities for disabling the optimizations that make these code standards necessary, but that effort seems likely to require greater life expectancy than I can bring to bear. That said, I definitely encourage others to try their hands at this.

But in the meantime, we need a way to enforce these code standards.

One approach is obviously code review, but it would be good to have automated help for this.

And Paul Heidekreuger presented a prototype tool at the 2022 Linux Plumbers Conference. This tool located several violations of the rule against comparing dependency-carrying pointers against the addresses of statically allocated variables.

Which suggests that continued work on such tooling could be quite productive.

Summary

So memory_order::consume is likely to go away, as is its counterpart in the C standard. This is not an immediate problem because all known implementations simply map memory_order::consume to memory_order::acquire, with those who care using other means to head address dependencies. (In the case of the Linux kernel, volatile loads. In the case of quite a bit of userspace software, memory_order_relaxed.)
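For concreteness, here is a rough sketch of those two approaches; the my_ names are illustrative, and the kernel's real rcu_dereference() of course adds sparse annotations and debugging checks on top of its volatile load:

// Linux kernel style: head the address dependency with a volatile load.
#define my_rcu_dereference(p) READ_ONCE(p)

// Userspace C11 style: head it with a relaxed atomic load (the pointer being
// declared _Atomic), relying on the implementation rather than the standard
// to preserve the address dependency.
#define my_rcu_dereference_user(p) \
    atomic_load_explicit(&(p), memory_order_relaxed)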

However, this does leave those who care with the issue of checking code using things like rcu_dereference(), given that the language standards are unlikely to provide any help any time soon.

Continued work on tooling that checks the handling of dependency chains in the object code therefore seems like an eminently reasonable step forward.