Read the latest posts from

from mcgrof

kdevops logo

After 3 years since the announcement of the first release of kdevops I'd like to announce the release of v6.2-rc1 of kdevops!

kdevops is designed to help with automation of Linux kernel development workflows. At first, is was not clear how and if kdevops could be used outside of filesystems testing easily. In fact my last post about it 3 years ago explained how one could only use kdevops in an odd way for other things, one had to fork it to use it for different workflows. That's old nonsense now. kdevops has grown to adopt kconfig and so in one single tree different workflows are now possible. Embracing other things such as using jinja2 for file templating with ansible and having to figure out a way to add PCI-E passthrough support through kconfig has made me realize that the growth component to the project is no longer a concern, it is actually a feature now. It is clear now that new technologies and very complex workflows can easily be added to kdevops.

But it is easy to say unless you have proof, and fortunately I have it. There are two new technologies that go well supported in kdevops that folks who are curious can start mucking around with, which otherwise may take a bit of time to ramp up with. The technologies are: Zoned storage and CXL. Supporting new technologies also means ensuring you get whatever tooling you might need to want to test or work with such technologies.

So for instance, getting a full Linux kernel development workflow going for CXL with the meson unit tests, even by enabling PCI-E passthrough, with the latest linux-next kernel is now reduced to just a few basic commands, in a Linux distribution / cloud provider agnostic manner:

make dynconfig
make bringup
make linux
make cxl
make cxl-test-meson

Just ask around a typical CXL Linux kernel developer how long it took them to get a CXL Linux kernel development & test environment up and running that they were happy with. And ask if it was reproducible. This is all now just reduced to 6 commands.

As for the details, it has been 8 months since the last release, and over that time the project has received 680 commits. I'd like to thank the developers who contributed:

Adam Manzanares
Amir Goldstein
Chandan Babu R
Jeff Layton
Joel Granados
Josef Bacik
Luis Chamberlain
Pankaj Raghav

I'd also like to thank my employer for trusting in this work, and allowing me to share a big iron server to help the community with Linux kernel stable work and general kernel technology enablement.

As for the exact details of changes merged, there so many! So I've tried to provide a nice terse summary on highlights on the git tag for v6.2-rc1. 8 months was certainly a long time to wait for a new release, so my hope is we'll try to bake a release now in tandem with the Linux kernel, in cadence with the same Linux kernel versioning and release timeline.

Based on feedback at LSFMM this year the project is now under the github linux-kdevops organization. This enables other developers to push into the tree. This let's us scale, specially as different workflows are supported.

If you see value in enabling rapid ramp up with Linux kernel development through kdevops for your subsystem / technology / feel free to join the party and either send a pull request to the group or just send patches.


from Konstantin Ryabitsev

While b4 started out as a way for maintainers to retrieve patches from mailing lists, it also has contributor-oriented features. Starting with version 0.10 b4 can:

  • create and manage patch series and cover letters
  • track and auto-reroll series revisions
  • display range-diffs between revisions
  • apply trailers received from reviewers and maintainers
  • submit patches without needing a valid SMTP gateway

These features are still considered experimental, but they should be stable for most work and I'd be happy to receive further feedback from occasional contributors.

In this article, we'll go through the process of submitting an actual typo fix patch to the upstream kernel. This bug was identified a few years ago and submitted via bugzilla, but never fixed:

Accompanying video

This article has an accompanying video where I go through all the steps and submit the actual patch at the end:

Installing the latest b4 version

Start by installing b4. The easiest is to do it via pip, as this would grab the latest stable version:

$ pip install --user b4
$ b4 --version

If you get an error or an older version of b4, please check that your $PATH contains $HOME/.local/bin where pip installs the binaries.

Preparing the tree

  • b4 prep -n [name-of-branch] -f [nearest-tag]

Next, prepare a topical branch where you will be doing your work. We'll be fixing a typo in arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts, and we'll base this work on tag v6.1:

$ b4 prep -n lanyang-dts-typo -f v6.1
Created new branch b4/lanyang-dts-typo
Created the default cover letter, you can edit with --edit-cover.

This is just a regular branch prepended with “b4/”:

$ git branch
* b4/lanyang-dts-typo

You can do all the normal operations with it, and the only special thing about it is that it has an “empty commit” at the start of the series containing the template of our cover letter.

Editing the cover letter

  • b4 prep --edit-cover

If you plan to submit a single patch, then the cover letter is not that necessary and will only be used to track the destination addresses and changelog entries. You can delete most of the template content and leave just the title and sign-off. The tracking information json will always be appended to the end automatically — you don't need to worry about it.

Here's what the commit looks like after I edited it:

$ git cat-file -p HEAD
tree c7c1b7db9ced3eba518cfc1f711e9d89f73f8667
parent 830b3c68c1fb1e9176028d02ef86f3cf76aa2476
author Konstantin Ryabitsev <> 1671656701 -0500
committer Konstantin Ryabitsev <> 1671656701 -0500

Simple typo fix for the lanyang dts

Signed-off-by: Konstantin Ryabitsev <>

--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
  "series": {
    "revision": 1,
    "change-id": "20221221-lanyang-dts-typo-8509e8ffccd4",
    "base-branch": "master",
    "prefixes": []

Committing your work

You can add commits to this branch as you normally would with any other git work. I am going to fix two obvious typos in a single file and make a single commit:

$ git show HEAD
commit 820ce2d9bc7c88e1515642cf3fc4005a52e4c490 (HEAD -> b4/lanyang-dts-typo)
Author: Konstantin Ryabitsev <>
Date:   Wed Dec 21 16:17:21 2022 -0500

    arm: lanyang: fix lable->label typo for lanyang dts

    Fix an obvious spelling error in the dts file for Lanyang BMC.
    This was reported via bugzilla a few years ago but never fixed.

    Reported-by: Jens Schleusener <>
    Signed-off-by: Konstantin Ryabitsev <>

diff --git a/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts b/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts
index c0847636f20b..e72e8ef5bff2 100644
--- a/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts
+++ b/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts
@@ -52,12 +52,12 @@ hdd_fault {
                        gpios = <&gpio ASPEED_GPIO(B, 3) GPIO_ACTIVE_HIGH>;
                bmc_err {
-                       lable = "BMC_fault";
+                       label = "BMC_fault";
                        gpios = <&gpio ASPEED_GPIO(H, 6) GPIO_ACTIVE_HIGH>;

                sys_err {
-                       lable = "Sys_fault";
+                       label = "Sys_fault";
                        gpios = <&gpio ASPEED_GPIO(H, 7) GPIO_ACTIVE_HIGH>;

Collecting To: and Cc: addresses

  • b4 prep --auto-to-cc

After you've committed your work, you will want to collect the addresses of people who should be the ones reviewing it. Running b4 prep --auto-to-cc will invoke scripts/ with the default recommended flags to find out who should go into the To: and Cc: headers:

$ b4 prep --auto-to-cc
Will collect To: addresses using
Will collect Cc: addresses using
Collecting To/Cc addresses
    + To: Rob Herring <...>
    + To: Krzysztof Kozlowski <...>
    + To: Joel Stanley <...>
    + To: Andrew Jeffery <...>
    + Cc:
    + Cc:
    + Cc:
    + Cc:
    + Cc: Jens Schleusener <...>
You can trim/expand this list with: b4 prep --edit-cover
Invoking git-filter-repo to update the cover letter.
New history written in 0.06 seconds...
Completely finished after 0.33 seconds.

These addresses will be added to the cover letter and you can edit them to add/remove destinations using the usual b4 prep --edit-cover command.

Creating your patatt keypair for web endpoint submission

(This needs to be done only once.)

  • patatt genkey

Note: if you already have a PGP key and it's set as user.signingKey, then you can skip this section entirely.

Before we submit the patch, let's set up the keypair to sign our contributions. This is not strictly necessary if you are going to be using your own SMTP server to submit the patches, but it's a required step if you will use the patch submission endpoint (which is what b4 will use in the absence of any [sendemail] sections in your git config).

The process is very simple. Run patatt genkey and add the resulting [patatt] section to your ~/.gitconfig as instructed by the output.

NOTE: You will want to back up the contents of your ~/.local/share/patatt so you don't lose access to your private key.

Dry-run and checkpatch

  • b4 send -o /tmp/tosend
  • ./scripts/ /tmp/tosend/*

Next, generate the patches and look at their contents to make sure that everything is looking sane. Good things to check are:

  • the From: address
  • the To: and Cc: addresses
  • general patch formatting
  • cover letter formatting (if more than 1 patch in the series)

If everything looks sane, one more recommended step is to run from the top of the kernel tree:

$ ./scripts/ /tmp/tosend/*
total: 0 errors, 0 warnings, 14 lines checked

/tmp/tosend/0001-arm-lanyang-fix-lable-label-typo-for-lanyang-dts.eml has no obvious style problems and is ready for submission.

Register your key with the web submission endpoint

(This needs to be done only once, unless you change your keys.)

  • b4 send --web-auth-new
  • b4 send --web-auth-verify [challenge]

If you're not going to use your own SMTP server to send the patch, you should register your new keypair with the endpoint:

$ b4 send --web-auth-new
Will submit a new email authorization request to:
      Name: Konstantin Ryabitsev
  Selector: 20221221
    Pubkey: ed25519:24L8+ejW6PwbTbrJ/uT8HmSM8XkvGGtjTZ6NftSSI6I=
Press Enter to confirm or Ctrl-C to abort
Submitting new auth request to
Challenge generated and sent to
Once you receive it, run b4 send --web-auth-verify [challenge-string]

The challenge is a UUID4 string and this step is a simple verification that you are able to receive email at the address you want associated with this key. Once you receive the challenge, complete the process as described:

$ b4 send --web-auth-verify 897851db-9b84-4117-9d82-1d970f9df5f8
Signing challenge
Submitting verification to
Challenge successfully verified for
You may now use this endpoint for submitting patches.

OR, set up your [sendemail] section

You don't have to use the web endpoint — it exists primarily for people who are not able or not willing to set up their SMTP information with git. Setting up a SMTP gateway is not a straightforward process for many:

  • platforms using OAuth require setting up “application-specific passwords”
  • some companies only provide Exchange or browser-based access to email and don't offer any other way to send mail
  • some company SMTP gateways rewrite messages to add lengthy disclaimers or rewrite links to quarantine them

However, if you have access to a functional SMTP gateway, then you are encouraged to use it instead of submitting via the web endpoint, as this ensures that the development process remains distributed and not dependent on any central services. Just follow instructions in man git-send-email and add a valid [sendemail] section to your git config. If b4 finds it, it will use it instead of relying on the web endpoint.

    smtpEncryption = tls
    smtpServer =
    smtpServerPort = 465
    smtpEncryption = ssl
    smtpUser =
    smtpPass = your-gmail-app-password

Reflect the email to yourself

  • b4 send --reflect

This is the last step to use before sending off your contribution. Note, that it will fill out the To: and Cc: headers of all messages with actual recipients, but it will NOT actually send mail to them, just to yourself. Mail servers don't actually pay any attention to those headers — the only thing that matters to them is what was specified in the RCPT TO outer envelope of the negotiation.

This step is particularly useful if you're going to send your patches via the web endpoint. Unless your email address is from one of the following domains, the From: header will be rewritten in order to not violate DMARC policies:


If your email domain doesn't match the above, the From: header will be rewritten to be a dummy address. Your actual From: will be added to the body of the message where git expects to find it, and the Reply-To: header will be set so anyone replying to your message will be sending it to the right place.

Send it off!

  • b4 send

If all your tests are looking good, then you are ready to send your work. Fire off “b4 send”, review the “Ready to:” section for one final check and either Ctrl-C to get out of it, or hit Enter to submit your work upstream.

Coming up next

In the next post, I will go over:

  • making changes to your patches using: git rebase -i
  • retrieving and applying follow-up trailers using: b4 trailers -u
  • comparing v2 and v1 to see what changes you made using: b4 prep --compare-to v1
  • adding changelog entries using: b4 prep --edit-cover


All contributor-oriented features of b4 are documented on the following site:


from joelfernandes

Below are some notes I wrote while studying hrtimer slack behavior (range timers), which was added to reduce wakeups and save power, in the commit below. The idea is that: 1. Normal hrtimers will have both a soft and hard expiry which are equal to each other. 2. But hrtimers with timer slack will have a soft expiry and a hard expiry which is the soft expiry + delta.

The slack/delay effect is achieved by splitting the execution of the timer function, and the programming of the next timer event into 2 separate steps. That is, we execute the timer function as soon as we notice that its soft expiry has passed (hrtimer_run_queues()). However, for programming the next timer interrupt, we only look at the hard expiry (hrtimer_update_next_event() –> __hrtimer_get_next_event() –> __hrtimer_next_event_base()–>hrtimer_get_expires()). As a result, the only way a slack-based timer will execute before its slack time elapses, is, if another timer without any slack time gets queued such that it hard-expires before the slack time of the slack-based timer passes.

The commit containing the original code added for range timers is:

commit 654c8e0b1c623b156c5b92f28d914ab38c9c2c90
Author: Arjan van de Ven <>
Date:   Mon Sep 1 15:47:08 2008 -0700

    hrtimer: turn hrtimers into range timers
    this patch turns hrtimers into range timers;
    they have 2 expire points
    1) the soft expire point
    2) the hard expire point
    the kernel will do it's regular best effort attempt to get the timer run at the hard expire point. However, if some other time fires after the soft expire point, the kernel now has the freedom to fire this timer at this point, and thus grouping the events and preventing a power-expensive wakeup in the future.

The original code seems a bit buggy. I got a bit confused about how/where we handle the case in hrtimer_interrupt() where other normal timers that expire before the slack time elapses, have their next timer interrupt programmed correctly such that the interrupt goes off before the slack time passes.

To see the issue, consider the case where we have 2 timers queued:

  1. The first one soft expires at t = 10, and say it has a slack of 50, so it hard expires at t = 60.

  2. The second one is a normal timer, so the soft/hard expiry of it is both at t = 30.

Now say, an hrtimer interrupt happens at t=5 courtesy of an unrelated expiring timer. In the below code, we notice that the next expiring timer is (the one with slack one), which has not soft-expired yet. So we have no reason to run it. However, we reprogram the next timer interrupt to be t=60 which is its hard expiry time (this is stored in expires_next to use as the value to program the next timer interrupt with).  Now we have a big problem, because the timer expiring at t=30 will not run in time and run much later.

As shown below, the loop in hrtimer_interrupt() goes through all the active timers in the timerqueue, _softexpires is made to be the real expiry, and the old _expires now becomes _softexpires + slack.

       while((node = timerqueue_getnext(&base->active))) {
              struct hrtimer *timer;

              timer = container_of(node, struct hrtimer, node);

               * The immediate goal for using the softexpires is
               * minimizing wakeups, not running timers at the
               * earliest interrupt after their soft expiration.
               * This allows us to avoid using a Priority Search
               * Tree, which can answer a stabbing querry for
               * overlapping intervals and instead use the simple
               * BST we already have.
               * We don't add extra wakeups by delaying timers that
               * are right-of a not yet expired timer, because that
               * timer will have to trigger a wakeup anyway.

              if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer)) {
                      ktime_t expires;

                      expires = ktime_sub(hrtimer_get_expires(timer),
                      if (expires.tv64 < expires_next.tv64)
                              expires_next = expires;

              __run_hrtimer(timer, &basenow);

However, this seems to be an old kernel issue, as, in upstream v6.0, I believe the next hrtimer interrupt will be programmed correctly because __hrtimer_next_event_base() calls hrtimer_get_expires() which correctly use the “hard expiry” times to do the programming.


from joelfernandes

Condition variables can seldom be used in isolation and depend on proper usage from the users of it. This article is a gentle introduction to condition variable usage using TLA+ / PlusCal to formally model it: A parent and child process waiting on each other. During this, I actually discovered a bug in my understanding, causing a deadlock which I will share in the end.

Note that you may need to refer to PlusCal and TLA+ documentation. There's a ton of it written by people who know more than I. My interest is just as a practical user and this article does not explain PlusCal much. To be honest it is quite readable any way.

First we start with a simple C program which we'll model later, a parent waiting on its child after spawning it.

 volatile int done = 0;
 void *child(void *arg) {
     done = 1;
     return NULL;

 int main(int argc, char *argv[]) {
     printf("parent: begin\n");
     pthread_t c;
     pthread_create(&c, NULL, child, NULL); // create child
     while (done == 0); // spin
     printf("parent: end\n");
     return 0;

This trivial program works but wastes a lot of CPU due to the spin loop, especially, say if the child runs for a long time. But it is useful to write a formal model which we'll use as the basis for more advanced design.

Following is the PlusCal program to verify this:

(*--fair algorithm ThreadJoinSpin

    \* Spin variable used by child to signal to parent that it finished running.
    done = 0,
    \* Simulate the parent forking the child.
    childwait = 1,

    \* The below variables are for checking invariants.
    ChildAboutToExit = 0,
    ParentDone = 0;

    \* The invariant that has to hold true always.
    ExitChildBeforeParent == (ParentDone = 0) \/ ((ParentDone = 1) /\ (ChildAboutToExit = 1))
end define;

procedure thr_exit() begin
texit: done := 1;
ret2: return;
end procedure;

procedure thr_join() begin
check: await done = 1;
ret3: return;
end procedure;

process childproc \in {1} begin
c0:   await childwait = 0;
    \* child does something for a long time.
c1: ChildAboutToExit := 1;
c2: call thr_exit();
end process;

process parent \in {2} begin
c3: childwait := 0;
c4: call thr_join();
c5: ParentDone := 1;
end process;

There are 2 things that I make the model verify, 1. That the ExitChildBeforeParent invariant is always satisified. This invariant confirms that under no circumstance will the parent process terminate before the child does. More precisely, the parent's join returns only after the child calls thr_exit(). 2. That the program terminates without going into a deadlock.

Both of these are true with the trivial program. This gives us a foundation to replace thr_exit() and thr_join() with condition variables.

Following is a first attempt, I am only adding the “new changes” to the above program model to keep the article short.

Here is the first attempt at C program using CVs, that we will try to model check:

void thr_exit() {
    done = 1;

 void thr_join() {
    if (done == 0)

And the PlusCal model is as follows. We add a new set waitSet to keep track of all the waiters on the CV. Along with 2 new functions cvwait and cvsignal to emulate pthread_cond_wait() and pthread_cond_signal() APIs:

    waitSet = {};

procedure cvwait(p) begin
    c1: waitSet := waitSet \cup {p};
    c2: await p \notin waitSet;
    c4: return;
end procedure;

procedure cvsignal() begin
    \* This if cond is needed because otherwise
    \* the with statement waits forever if waitset is empty.
    if waitSet = {} then
        c7: return;
    end if;
    \* Non deterministically pick something to wake up.
    c8: with x \in waitSet do
        waitSet := waitSet \ {x};
        end with;
    c9: return;  
end procedure;

procedure thr_exit() begin
    c11:  call cvsignal();
    c13:  return;
end procedure;

procedure thr_join(p) begin
    c20: if done = 0 then
    c21:    call cvwait(p);
         end if;
    c23: return;
end procedure;

It is notable how we model wait/wake by just adding and removing the process number from a set. The TLC model checker tells us this program deadlocks, why? Because thr_exit() can be run first and signal the CV. But there's nothing waiting for it. The parent then calls thr_wait() and waits forever. Replacing the if with while also does not help. The problem is we need the child to signal only once there's someone waiting.

Seems like we need some synchronization to make sure the signal/wait don't race with each other, let us add a lock.

void thr_exit() {

 void thr_join() {
    pthread_cond_wait(&c, &m);

To implement the lock in PlusCal, we can just do a simple test-and-set lock:

procedure lock() begin
  cas:  if mutex = 0 then
        mutex := 1; return;
        goto cas;
        end if;
end procedure;    

procedure unlock() begin
  unlock_it:  mutex := 0;
  ret:        return;
end procedure;

Anything under a label in PlusCal is atomic and executed as one unit. This helps us to model a Compare-And-Swap operation pretty nicely.

We also need to modify our CV signal/wait functions to be callable under a lock. In particular, we cannot wait on the CV with the lock held as we would then deadlock.

procedure cvwait(p) begin
    c0: call unlock();
    c1: waitSet := waitSet \cup {p};
    c2: await p \notin waitSet;
    c3: call lock();
    c4: return;
end procedure;

procedure cvsignal() begin
    \* This if cond is needed because otherwise
    \* the with statement waits forever if waitset is empty.
    if waitSet = {} then
        c7: return;
    end if;
    \* Non deterministically pick something to wake up.
    c8: with x \in waitSet do
        waitSet := waitSet \ {x};
        end with;
    c9: return;  
end procedure;

Now armed with these modeled primitives, let us use them in our join/exit functions:

procedure thr_exit() begin
    c10:  call lock();
    c11:  call cvsignal();
    c12:  call unlock();
    c13:  return;
end procedure;

procedure thr_join(p) begin
    c20: call lock();
    c21: call cvwait(p);
    c22: call unlock();
    c23: return;
end procedure;

The model checker again complaints of a deadlock! Though the signaling and wait cannot be happen at the same time, there is still the original problem of the parent running much later than the child even though exclusively, and then waiting on the CV forever. The parent should not wait if no waiting is needed. Looks like we need both the done state variable and the locking.

Lets try to model this, the C program we will model uses both locking and CVs.

 void thr_exit() {
        done = 1;

 void thr_join() {
    while (done == 0) {
           pthread_cond_wait(&c, &m);

The PlusCal program now becomes:

procedure thr_exit() begin
    c9:   call lock();
    c10:  done := 1;
    c11:  call cvsignal();
    c12:  call unlock();
    c13:  return;
end procedure;

procedure thr_join(p) begin
    c19: call lock();
    c20: while done = 0 do
    c21:    call cvwait(p);
         end while;
    c22: call unlock();
    c23: return;
end procedure;

However, turns out the model checker fails even for this! After some staring, I discovered there is a bug in my condition variable implementation itself. The cvwait procedures needs a subtle change.

The change is that, the process waiting (in this case the parent) has to be added to the waitqueue while the lock is being held. A slight reordering of c0 and c1 lines fix the cvwait(). With this the model now passes, by satisfying the invariant while not locking up:

procedure cvwait(p) begin
    c1: waitSet := waitSet \cup {p};
    c0: call unlock();
    c2: await p \notin waitSet;
    c3: call lock();
    c4: return;
end procedure;

Model checking can be a powerful tool to clarify understanding of simple primitives, that can have subtle bugs. While it has its limitations, quick model checking can identify and rule out bugs before they show up in the wild and clarify your understanding.

As an exercise for the reader, replace the while in c20 with an if statement. You should see it fail as well.

Model checking also shows that in thread_exit(), we only need to hold the lock when setting the done variable, though care must be taken to make sure that the waitSet can be concurrently queued and dequeued into. Otherwise holding the lock may still be needed. Assuming that queue and dequeue are atomic, thread_exit() can rewritten as:

procedure thr_exit() begin
    c9:   call lock();
    c10:  done := 1;
    c12:  call unlock();
    c11:  call cvsignal();
    c13:  return;
end procedure;

Note: A few examples of the C code were borrowed from the Operating Systems: Three Easy Pieces text book. I am grateful to them.


from Jakub Kicinski

Kernel TLS implements the record encapsulation and cryptography of the TLS protocol. There are four areas where implementing (a portion of) TLS in the kernel helps:

  • enabling seamless acceleration (NIC or crypto accelerator offload)
  • enabling sendfile on encrypted connections
  • saving extra data copies (data can be encrypted as it is copied into the kernel)
  • enabling the use of TLS on kernel sockets (nbd, NFS etc.)

Kernel TLS handles only data records turning them into a cleartext data stream, all the control records (TLS handshake etc.) get sent to the application via a side channel for user space (OpenSSL or such) to process. The first implementation of kTLS was designed in the good old days of TLS 1.2. When TLS 1.3 came into the picture the interest in kTLS had slightly diminished and the implementation, although functional, was rather simple and did not retain all the benefits. This post covers developments in the Linux 5.20 implementation of TLS which claws back the performance lost moving to TLS 1.3. One of the features we lost in TLS 1.3 was the ability to decrypt data as it was copied into the user buffer during read. TLS 1.3 hides the true type of the record. Recall that kTLS wants to punt control records to a different path than data records. TLS 1.3 always populates the TLS header with application_data as the record type and the real record type is appended at the end, before record padding. This means that the data has to be decrypted for the true record type to be known.

Problem 1 – CoW on big GRO segments is inefficient

kTLS was made to dutifully decrypt the TLS 1.3 records first before copying the data to user space. Modern CPUs are relatively good at copying data, so the copy is not a huge problem in itself. What’s more problematic is how the kTLS code went about performing the copy. The data queued on TCP sockets is considered read-only by the kernel. The pages data sits in may have been zero-copy-sent and for example belong to a file. kTLS tried to decrypt “in place” because it didn’t know how to deal with separate input/output skbs. To decrypt “in place” it calls skb_cow_data(). As the name suggests this function makes a copy of the memory underlying an skb, to make it safe for writing. This function, however, is intended to be run on MTU-sized skbs (individual IP packets), not skbs from the TCP receive queue. The skbs from the receive queue can be much larger than a single TLS record (16kB). As a result TLS would CoW a 64kB skb 4 times to extract the 4 records inside it. Even worse if we consider that the last record will likely straddle skbs so we need to CoW two 64kB skbs to decrypt it “in place”. The diagram below visualizes the problem and the solution. SKB CoW The possible solutions are quite obvious – either create a custom version of skb_cow_data() or teach TLS to deal with different input and output skbs. I opted for the latter (due to further optimizations it enables). Now we use a fresh buffer for the decrypted data and there is no need to CoW the big skbs TCP produces. This fix alone results in ~25-45% performance improvement (depending on the exact CPU SKU and available memory bandwidth). A jump in performance from abysmal to comparable with the user space OpenSSL.

Problem 2 – direct decrypt

Removing pointless copies is all well and good, but as mentioned we also lost the ability to decrypt directly to the user space buffer. We still need to copy the data to user space after it has been decrypted (A in the diagram below, here showing just a single record not full skb). SKB direct decrypt We can’t regain the full efficiency of TLS 1.2 because we don’t know the record type upfront. In practice, however, most of the records are data/application records (records carrying the application data rather than TLS control traffic like handshake messages or keys), so we can optimize for that case. We can optimistically decrypt to the user buffer, hoping the record contains data, and then check if we were right. Since decrypt to a user space buffer does not destroy the original encrypted record if we turn out to be wrong we can decrypting again, this time to a kernel skb (which we can then direct to the control message queue). Obviously this sort of optimization would not be acceptable in the Internet wilderness, as attackers could force us to waste time decrypting all records twice. The real record type in TLS 1.3 is at the tail of the data. We must either trust that the application will not overwrite the record type after we place it in its buffer (B in the diagram below), or assume there will be no padding and use a kernel address as the destination of that chunk of data (C). Since record padding is also rare – I chose option (C). It improves the single stream performance by around 10%.

Problem 3 – latency

Applications tests have also showed that kTLS performs much worse than user space TLS in terms of the p99 RPC response latency. This is due to the fact that kTLS holds the socket lock for very long periods of time, preventing TCP from processing incoming packets. Inserting periodic TCP processing points into the kTLS code fixes the problem. The following graph shows the relationship between the TCP processing frequency (on the x axis in kB of consumed data, 0 = inf), throughput of a single TLS flow (“data”) and TCP socket state. TCP CWND SWND The TCP-perceived RTT of the connection grows the longer TLS hogs the socket lock without letting TCP process the ingress backlog. TCP responds by growing the congestion window. Delaying the TCP processing will prevent TCP from responding to network congestion effectively, therefore I decided to be conservative and use 128kB as the TCP processing threshold. Processing the incoming packets has the additional benefit of TLS being able to consume the data as it comes in from the NIC. Previously TLS had access to the data already processed by TCP when the read operation began. Any packets coming in from the NIC while TLS was decrypting would be backlogged at TCP input. On the way to user space TLS would release the socket lock, allowing the TCP backlog processing to kick in. TCP processing would schedule a TLS worker. TLS worker would tell the application there is more data.


from linusw

We are discussing and working toward adding the language Rust as a second implementation language in the Linux kernel. A year ago Jake Edge made an excellent summary of the discussions so far on Rust for the Linux kernel and we (or rather Miguel and Wedson) have made further progress since then. For the record I think this is overall a good idea and worth a try. I wanted to add some background that was sketched in a mail thread for the kernel summit.

TL;DR: my claim is that Rust is attempting to raise the abstraction in the programming language and ultimately to join computer science and software engineering into one single discipline, an ambition that has been around since these disciplines were created.

Beginning with ALGOL

The first general high-level language was FORTRAN, which is still in use for some numerical analysis tasks around the world. Then came ALGOL, which attracted a wider audience.

The first “real” operating system (using virtual memory etc) for the Atlas Machine supervisor in 1962 was as far as I can tell implemented in Atlas autocode which was a dialect of ALGOL, which was the lingua franca at the time. Pure ALGOL could not be used because ALGOL 60 had no input/output primitives, so every real-world application of ALGOL, i.e. any application not solely relying on compiled-in constants, required custom I/O additions.

Algol specifications Copies of the first specifications of ALGOL 60, belonging at one time to Carl-Erik Fröberg at Lund University.

ALGOL inspired CPL that inspired BCPL that inspired the B programming language that inspired the C programming language, which we use for the Linux kernel.

Between 1958 and 1968 ALGOL was the nexus in a wide attempt to join computer languages with formal logic. In this timespan we saw the ALGOL 58, ALGOL 60 and ALGOL 68 revisions come out. The outcome was that it established computer science as a discipline and people could start building their academic careers on that topic. One notable outcome was the BNF form for describing syntax in languages. This time was in many ways formative for computer science: the first three volumes of Donald Knuths The Art of Computer Programming were published in close proximity to these events.

To realize that ALGOL was popular and widespread at the time that Unix was born, and that C was in no way universally accepted, it would suffice to read a piece of the original Bourne Shell source code tree for example:

	INT		xp;
	WHILE arg
	DO REG STRING	s=mactrim(arg->argval);
	   setname(s, xp);
	   IF flags&execpr
	   THEN prs(s);
		IF arg THEN blank(); ELSE newline(); FI

This doesn't look much like C as we know it, it looks much more like ALGOL 68. The ALGOL 68 definition added constructions such as IF/FI, DO/OD etc, which were not present in ALGOL 60. The reason is that Stephen Bourne was an influential contributor to ALGOL 68 and created a set of macros so that the C preprocessor would turn his custom dialect of ALGOL into C, for which I think someone on Reddit suggested to nominate bash for the obfuscated C contest.

This is just one of the instances where we can see that the C programming language was not universally loved. The Bourne Shell scripting language that we all love and use is also quite close to ALGOL 68, so the descendants of this language is used more than we may think.

Around 1970 Niklaus Wirth was working to improve ALGOL68 with what he called ALGOL W. Tired of the slowness of the language committee process he forked ALGOL and created the programming language Pascal which was a success in its own right. In his very interesting IEEE article named A Brief History of Software Engineering Professor Wirth gives his perspective on some of the events around that time: first he writes about the very influential NATO conference on software engineering 1968 in Garmisch, Germany which served to define software engineering as a distinct discipline. To counter the so-called software crisis – the problems presented by emerging large complex systems – the suggestion was to raise the abstraction in new languages.

To raise the abstraction means to use more mathematical, machine independent constructs in the language. First consider the difference between low-level and high-level languages: a simple operation such as x = x + 1 is not high level, and just a fancy assembly instruction; if we compile it we can readily observe the resulting code in some kind of ADD instruction in the resulting object code. However a[i] = x + 1 raises abstraction past the point of high-level languages. This is because indexing into an array requires knowledge of the target machine specifics: base addresses, memory layout, etc. This makes the instruction more high-level and thus raises the abstraction of the language. The assumption is that several further higher levels of abstraction exist. We will look into some of these languages in the following sections.

The Garmisch conference is famous in Unix circles because Douglas McIlroy was present and presented his idea of componentized software as a remedy against rising complexity, an idea that was later realized in the form of Unix's pipes and filters mechanism. D-Bus and similar component interoperation mechanisms are contemporary examples of such software componentry — another way to counter complexity and make software less fragile, but not the focus in this article.

Wirth makes one very specific and very important observation about the Garmisch conference:

Ultimately, analytic verification and correctness proofs were supposed to replace testing.

This means exactly what it says: with formally verified programming languages, all the features and constructs that are formally proven need not be tested for. Software engineering is known for advocating test-driven development (TDD) to this day, and the ambition was to make large chunks of TDD completely unnecessary. Software testing has its own chapter in the mentioned report from the Garmisch NATO conference where the authors A.I. Llewelyn and R.F. Wickens conclude:

There are, fundamentally, two different methods of determining whether a product meets its specification. One can analyse the product in great detail and from this determine if it is in accordance with its specification, or one can measure its performance experimentally and see if the results are in accord with the specification; the number and sophistication of the experiments can be varied to provide the degree of confidence required of the results.

The first part of this paragraph i.e. “analyze in great detail” is what Wirth calls analytic verification and is today called formal verification. The latter part of this paragraph is what we call test-driven development, TDD. Also: the former is a matter of computer science, while the latter is a matter of software engineering. So here is a fork in the road.

Wirth also claims the discussions in Garmisch had a distinct influence on Pascal. This can be easily spotted in Pascal strings, which was one of his principal improvements over ALGOL: Pascal strings are arrays of char, but unlike C char, a Pascal char is not the same as a byte; instead it is defined as belonging to an “ordered character set”, which can very well be ISO8859-1 or Unicode, less, more or equal to 255 characters in size. Strings stored in memory begin with an positive integer array length which defines how long the string is, but this is none of the programmer's business, this shall be altered by the language runtime and not by any custom code. Indexing out of bounds is therefore not possible and can be trivially prohibited during compilation and at runtime. This raises the abstraction of strings: they are set-entities, they have clear boundaries, they need special support code to handle the length field in memory. Further Pascal also has set types, such as:

    JanuaryDays : set of 1..31;

Perhaps Pascal's application to real-world problems didn't work out as expected, as it has since also defined PChar as a NULL-terminated pointer to a sequence of characters, akin to C strings. However it should be noted that Pascal pointers are persistently typed and cannot be converted: casting is not possible in Pascal. A Pascal pointer to an integer is always a pointer to an integer.

From Wirth's perspective, C “presented a great leap backward” and he claims “it revealed that the community at large had hardly grasped the true meaning of the term 'high-level language' which became an ill-understood buzzword”. He attributes the problem to Unix which he says “acted like a Trojan horse for C”. He further details the actual technical problems with C:

C offers abstractions which it does not in fact support: Arrays remain without index checking, data types without consistency check, pointers are merely addresses where addition and subtraction are applicable. One might have classified C as being somewhere between misleading and even dangerous.

His point about C lacking index checking is especially important: it can be brought into question if C is really a high-level language. It is not fully abstracting away the machine specifics of handling an array. Language theorists can occasionally refer to C as a “big macro assembler”, the only thing abstracted away is really the raw instruction set.

Wirth however also goes on to state the appealing aspects of the C programming language:

people at large, particularly in academia, found it intriguing and “better than assembly code” (...) its rules could easily be broken, exactly what many programmers cherished. It was possible to manage access to all of a computer’s idiosyncracies, to items that a high-level language would properly hide. C provided freedom, where high-level languages were considered as straight-jackets enforcing unwanted discipline. It was an invitation to use tricks which had been necessary to achieve efficiency in the early days of computers.

We can see why an efficiency-oriented operating system kernel such as Linux will tend toward C.

It's not like these tricks stopped after the early days of computing. Just the other day I wrote a patch for Linux with two similar code paths, which could be eliminated by cast:ing a (const void *) into a (void *) which I then quipped about in the commit message of the revised patch. The reason for violating formal rules in this case — is that of a choice between two evils, and chosing the lesser evil: in a choice between formal correctness and code reuse I chose code reuse. And C enables that kind of choice. The languages presented later in this article absolutely do not allow that kind of choice, and C casts are seen as nothing less than an abomination.

The language family including C and also Pascal is referred to as imperative programming languages. The defining character is that the programmer “thinks like a computer” or imagine themselves as the program counter to be exact. “First I do this, next I do this, then I do this” – a sequence of statements executed in order, keeping the computer state (such as registers, memory locations and stacks) in the back of your head.

The immediate appeal to operating system programmers should be evident: this closely models what an OS developer needs to keep in mind, such as registers, stacks, cache frames, MMU tables, state transitions in hardware and so on. It is possible to see the whole family of imperative languages as domain specific languages for the domain of writing operating systems, so it would be for operating system developers what OpenGL is for computer graphics software developers.

Lambda Calculus for Defining Languages

In 1966 one of the early adopters and contributors to ALGOL (alongside Peter Naur, Tony Hoare and Niklaus Wirth), Peter Landin, published two articles in the Journal of the ACM titled Correspondence between ALGOL 60 and Church's Lambda-notation part I and part II. In the first article he begins with a good portion of dry humour:

Anyone familiar with both Church's λ-calculi and ALGOL 60 will have noticed a superficial resemblance between the way variables tie up with the λ's in a nest of λ-expressions, and the way that identifiers tie up with the headings in a nest of procedures and blocks.

He is of course aware that no-one beside himself had been in the position to realize this: the overlap between people familiar with Alonzo Church's λ-calculus and with ALGOL 60 was surprisingly down to one person on the planet. What is surprising is that it was even one person.

Alonzo Church was a scholar of mathematical logic and computability, the supervisor of Alan Turing's doctoral thesis and active in the same field as Kurt Gödel (those men quoted each other in their respective articles). The lambda calculus ties into the type set theory created by Bertrand Russell and the logical-mathematical programme, another universe of history we will not discuss here.

What λ-calculus (Lambda-calculus) does for a programming language definition is analogous to what regular expressions does for a languages syntax, but for it's semantics. While regular expressions can express how to parse a body of text in a language with regular grammar, expressions in λ-calculus can go on from the abstract syntax tree and express what an addition is, what a subtraction is, or what a bitwise OR is. This exercise is seldomly done in e.g. compiler construction courses, but defining semantics is an inherent part of a programming language definition.

Perhaps the most remembered part of Landin's papers is his humorous term syntactic sugar which denotes things added to a language to make the life of the programmer easier, but which has no semantic content that cannot be expressed by the basic features of the language. The basic mathematical features of the language, on the other hand, are best expressed with λ-calculus.

A notable invention in Landin's first article about defining ALGOL in terms of λ-calculus are the keywords let and where chosen to correspond to λ-calculus' Applicable Expressions. These keywords do not exist in ALGOL: they are part of a language to talk about a language, or in more complicated terms: a meta-language. So here we see the first steps toward a new language derived from λ-calculus. Landin does not give this language a name in this article, but just refers to it as “AE”. The AE executes in a theoretical machine called SECD, which is another trick of the trade, like Alan Turings “turing machine”: rather close to a mathematicians statement “let's assume we have...” The complete framework for defining ALGOL in λ-calculus is called AE/SECD.

Functional Programming

Functional programming languages then, implements lambda calculus. The central idea after some years of experience with defining languages such as ALGOL in terms of lambda calculus, is to just make the language resemble lambda calculus expressions to begin with, and the verification of the semantics will be simple and obvious.

In 1966 Peter Landin followed up his articles using λ-calculus to describe ALGOL with his article The Next 700 Programming Languages. Here he invents functional programming in the form of an invented language called ISWIM (If You See What I Mean), as you can see again with a good dry humour. The language is λ-calculus with “syntactic sugar” on top, so a broad family of languages are possible to create using the framework as a basis. Landin's article was popular, and people did invent languages. Maybe not 700 of them. Yet.

In section 10 of his article, named Eliminating explicit sequencing, Landin starts speculating and talks about a game that can be played with ALGOL: by removing any goto statements and labels, the program get a less sequential nature, i.e. the program counter is just advancing to the next line or iterating a loop. He quips:

What other such features are there? This question is considered because, not surprisingly, it turns out that an emphasis on describing things in terms of other things leads to the same kind of requirements as an emphasis against explicit sequencing.

He then goes on to show how to transform an ALGOL program into a purely functional ISWIM program and concludes:

The special claim of ISWlM is that it grafts procedural notions onto a purely functional base without disturbing many of the desirable properties. (...) This paper can do no more than begin the task of explaining their practical significance.

This reads as a call to action: we need to create functional programming languages akin to ISWIM, and we need to get rid of the J operator (the program control flow operator). Landin never did that himself.

The Meta Language ML

A few years later, in 1974, computer scientist Robin Milner, inspired by ISWIM and as a response to Landin's challenge, created the language ML, short for Meta Language. This is one of the 700 next languages and clearly recognized Landin's ideas about a language for defining languages, a grammar for defining grammar: a meta language with a meta grammar.

He implemented the language on the DEC10 computer with the help of Malcolm Newey, Lockwood Morris, Mike Gordon and Chris Wadswort. The language was later ported to the VAX architectures.

The language was based on ISWIM and dropped the so-called J operator (program point operator). It is domain-specific, and intended for authoring a tool for theorem proving called LCF. Standard ML has been fully semantically specified and formally verified. This language became widely popular, both in academia and industry.

Removing the J operator made ML a declarative language, i.e. it does not specify the order of execution of statements, putting it in the same class of languages as Prolog or for that matter: Makefiles: there is no control flow in a Makefile, just a number of conditions that need to be evaluated to arrive at a complete target.

ML still has one imperative language feature: assignment. Around this time, some scholars thought both the J operator and assignment were unnecessary and went on to define purely functional languages such as Haskell. We will not consider them here, they are outside the scope of this article. ML and everything else we discuss can be labelled as impure: a pejorative term invented by people who like purely functional languages. These people dislike not only the sequencing nature of imperative languages but also the assignment (such as happens with the keyword let) and prefer to think about evaluating relationships between abstract entities.

ML can be grasped intuitively. For example this expression in ML evaluates to the integer 64:

    val m : int = 4
    val n : int = m*m

Here we see some still prominent AE/SECD, ISIWM features such as the keyword let for binding variables, or rather, associate names with elements such as integers and functions (similar to := assignment in some languages). The we see an implementation section in. We can define functions in ML, like this to compute the square root of five times x:

val rootfivex : real -> real =
    fn x : real => Math.sqrt (5.0 * x)

Notice absence of constructs such as BEGIN/END or semicolons. ML, like Python and other languages use whitespace to find beginning and end of basic blocks. The notation real –> real clearly states that the function takes a real number as input and produces a real number as output. The name real reflects some kind of mathematical ambition. The language cannot handle the mathematical set of real numbers — the ML real is what other languages call a float.

ML has more syntactic sugar, so the following is equivalent using the keyword fun (fun-notation):

fun rootfivex (x:real):real = Math.sqrt (5.0 * x)

The syntax should be possible to grasp intuitively. Another feature of ML and other functional languages is that they easily operate on tuples i.e. an ordered sequence of variables, and tuples can also be returned from functions. For example you can calculate the distance between origin and two coordinates in a x/y-oriented plane like this:

fun dist (x:real, y:real):real = Math.sqrt (x*x + y*y)

This function can then be called elsewhere like this:

val coor (x:real, y:real)
val d = dist(coor)

The type real of d will be inferred from the fact that the dist() function returns a real.

ML gets much more complex than this. One of the upsides of the language that is universally admired is that ML programs, like most programs written in functional languages can be proven correct in the computational sense. This can be done within certain ramifications: for example input/output operations need to specify exactly which values are input or an undefined behaviour will occur.

CAML and OCaml

In 1987 Ascánder Suárez at the French Institute for Research in Computer Science and Automation (INRIA) reimplemented a compiler and runtime system for ML in LISP and called the result CAML for Categorical Abstract Machine Language, a pun on the fact that it ran on a virtual machine (Category Abstract Machine) and the heritage from ML proper. The abstract machine used was the LLM3 abstract LISP machine, which in turn ran on another computer. It was not fast.

CAML was reimplemented in C in 1990-91 by Xavier Leroy, creating Caml Light, which was faster, because it was not written in a virtual machine running a virtual machine. Caml Light was more like Java and used a bytecode interpreter for its virtual machine.

In 1995, Caml Special Light introduced a native compiler, so the bytecode produced from the Caml compiler could be compiled to object code and executed with no virtual machine overhead, using a native runtime environment. Didier Rémy, Jérôme Vouillon and Jacques Garrigue continued the development of Caml.

Objective Caml arrived in 1996 and added some object oriented features to Caml. In 2011 the extended Caml Special Light compiler, and language derivative (dialect) of ML was renamed OCaml. In essence the compiler and language has a symbiotic relationship. There is no second implementation of OCaml.

From the 1990s and forward, what is now the OCaml language and implementation has gained traction. It is a very popular functional programming language, or rather, popular as far as functional programming goes. It has optimized implementations for most architectures. The compiler itself is now written mostly in OCaml, but the runtime in C is still around, to hook into each operating system where the program will eventually run. The language and compiler has been used for a variety of applications. Every major Linux distribution carries packages with the OCaml compiler and libraries. There is even a GTK+ 3 OCaml library binding, so OCaml GUI programs can be created.

OCaml simplifies binding labels to numbers etc, here is bubblesort implemented in OCaml:

(* Bubblesort in OCaml, Linus Walleij 2022 *)
let sort v =
  let newv = Array.make (Array.length v) 0 in
  for i = 1 to (Array.length v) - 1 do
    if v.(i - 1) > v.(i) then begin
      newv.(i - 1) <- v.(i);
      newv.(i) <- v.(i - 1);
      (* Copy back so we are working on the same thing *)
      v.(i - 1) <- newv.(i - 1);
      v.(i) <- newv.(i);
    end else begin
      newv.(i - 1) <- v.(i - 1);
      newv.(i) <- v.(i);

let rec ordered v =
  if Array.length v = 0 then true
  else if Array.length v = 1 then true
  (* ... or if the rest of the array is ordered *)
  else if v.(0) < v.(1) && ordered (Array.sub v 1 (Array.length v - 1)) then true
  else false;;

let plist v =
  print_string "V = ";
  for i = 0 to (Array.length v) - 1 do begin
    print_int v.(i);
    if i < (Array.length v - 1) then print_string ",";
  print_endline "";;

let rec sortme v =
  if ordered v then v
  else sortme (sort v);;

let v = [| 14 ; 4 ; 55 ; 100 ; 11 ; 29 ; 76 ; 19 ; 6 ; 82 ; 99 ; 0 ; 57 ; 36 ; 61 ; 30 |];;
plist v;;
plist (sortme v);;

My experience with working with this example is that OCaml makes a “bit of resistance” to changing contents of things like arrays by indexing. It “dislikes” any imperative constructs and kind of nudges you in the direction of purely logical constructs such as the ordered function above. This is just my personal take.

OCaml is still a dialect of ML. The file ending used on all files is .ml as well. OCaml – like Pythons pip or Perls CPAN has its own package system and library called opam. The prime application is still the OCaml Ergo Library, a library for automatic theorem proving. If your first and foremost use of computers is theorem proving, ML and OCaml continue to deliver since 1974. The more recent and widely popular Coq theorem prover is also written in OCaml.

Rust then

Rust was initially developed in 2006 as a hobby project by Graydon Hoare who was at the time working at Mozilla. OCaml and ML is mentioned as the biggest influence on the language, apart from C/C++. A typical sign of this influence would be that the first compiler for Rust was written in OCaml. A notable contributor to this codebase, apart from Hoare, is Brendan Eich, one of the founders of the Mozilla project and the inventor of JavaScript. While Brendan did not contribute much code he was at the time CTO of Mozilla, and this shows that when Mozilla started supporting the project in 2009 Rust was certainly well anchored in the organization, and Eich's early contributions to the language should be noted. (It may be commonplace that people in the CTO position at middle sized companies make commits to complex code bases, but I am not aware in that case.)

Despite the OCaml codebase the first documentation of the language talks more about other functional or declarative languages such as NIL, Hermes, Erlang, Sather, Newsqueak, Limbo and Napier. These origins with extensive quotes from e.g. Joe Armstrong (the inventor of Erlang) have been toned down in contemporary Rust documentation. It is however very clear that Graydon has a deep interest in historical computer languages and is convinced that they have something to teach us, and the expressed ambition is to draw on these languages to pick the best parts. In his own words:

I've always been a language pluralist — picture my relationship towards languages like a kid enjoying a wide variety of building blocks, musical instruments or plastic dinosaurs — and I don't think evangelism or single-language puritanism is especially helpful.

What is unique about Rust is that it fuses “impure” functional programming with imperative programming, bringing several concepts from ML and OCaml over into the language.

Another characteristic is that Rust compiled to target machine code from day one, rather than using any kind of virtual machine as did Peter Landins ISWIM, or the ML and OCaml languages (and as does say Java, or Python). Graydon probably did this intuitively, but a post he made in 2019 underscores the point: that virtual machines, even as an intermediate step, is bad language engineering and just generally a bad idea.

In 2013 Graydon stepped down as main lead for Rust for personal reasons which he has detailed in a posting on Reddit.

Rust has had the same symbiotic relationship between language and a single compiler implementation as OCaml, but this is changing, as there is now a second, GCC-based implementation in the works.

Here is bubblesort implemented in Rust:

/* Bubblesort in Rust, Linus Walleij 2022 */
fn sort(array : &mut [i32]) {
   let mut x : i32;
   if array.len() == 1 {
   for i in 1..array.len() {
      if array[i - 1] > array[i] {
      	 x = array[i - 1];
	 array[i - 1] = array[i];
	 array[i] = x;

fn is_ordered(array : &[i32]) -> bool {
   if array.len() <= 1 {
     return true;
   for i in 1..array.len() {
     if array[i - 1] > array[i] {
       return false;
   return true;

fn parray(array : &[i32]) {
   let mut x : i32;
   print!("V = ");
   for i in 0..array.len() {
       x = array[i];
       if i != (array.len() - 1) {

fn main() {
   let mut array: [i32; 16] = [14, 4, 55, 100, 11, 29, 76, 19, 6, 82, 99, 0, 57, 36, 61, 30];
   while !is_ordered(&array) {
     sort(&mut array);

Rust leaves itself to easier imperative programming than OCaml: the keyword mut becomes quite similar to C:s const correctness tagging in this example. Since is_ordered and parray isn't altering the contents of the array these functions do not need to be marked with mut. You see some familiar virtues from Pascal: arrays “know” their length, and we use a method to obtain it: array.len().

The stated ambition is improved memory safety, data-race safety (concurrency) and type safety. The article Safe Systems Programming in Rust certainly presents the ambition in a straight-forward manner. Graydon also underscores the focus on memory and concurrency safety in a 2016 blog post.

But make no mistake. The current underlying ambition is definitely nothing different from the ambition of the ALGOL committee between 1958 and 1968: to raise the abstraction of the language through the ambition to join computer programming with formal logic. This comes from the arrival of strong academic support for the language.

A typical indication of this ambition is the well-funded RustBelt project involving a large amount of academic researchers, all familiar with formal logic, and resulting in such artefacts as Ralf Jung's PhD thesis Understanding and Evolving the Rust Programming Language. Here, formal logic in Rust Belt and the Coq proof assistant is used and concludes (from the abstract):

Together, these proofs establish that, as long as the only unsafe code in a well-typed λRust program is confined to libraries that satisfy their verification conditions, the program is safe to execute.

What is meant by “safe to execute” is that no use-after-free, dangling pointers, stale references, NULL pointer exceptions etc can ever occur in safe Rust code, because it is proven by formal logic: QED. It does not stop you from e.g. dividing by zero however, that problem is out-of-scope for the exercise.

To me personally the most astonishing fact about Jung's thesis is that it manages to repeatedly cite and reference the computer scientist Tony Hoare without quoting the inventor of the Rust language, Graydon Hoare, a single time. In a way it confirms Graydon's own statement that Rust “contains nothing new” from a language point of view.

The C programming language cannot be subject to the same scrutiny as Rust, simply because of all the (ab)use it allows, and which was mentioned by Wirth in his historical perspective: if a type can be changed by a cast and array indexing is not even part of the language, there is nothing much to prove. What has been interesting for scholars to investigate is a well-defined subset of C, such as the eBPF subset, which also partly explains the strong interest in eBPF: like with Rust, the build environment and language runtime has been defined with much stricter constraints and thus can be subject to formal verification.

The ambition of Rust is, as I perceieve it, and whether the people driving it even knows it or not, to finish what the ALGOL committe as primus motor started in 1958, and what the Garmisch NATO conference concluded was necessary in 1968: to develop a language for systems programming that rely on formal logic proof, and to fulfil what ALGOL never could, what Pascal never could, and what the whole maybe-not-700 functional programming languages never could: a language that joins the disciplines of computer science and software Engineering into ONE discipline, where the scholars of each can solve problems together.

That is the ambition of Rust as an implementation language for operating systems, such as Linux: provide a language backed by current top-of-the-line computer science research, for immediate application to software engineering developing the top-of-the-line operating system.

What it offers Linux is raised abstraction to counter the problems of complexity identified in the 1968 Garmisch NATO conference and now bleeding obvious given the recurring security incidents, and thereby would bring the engineering project Linux closer to computer science.

Other approaches to increased Linux (memory- concurrency-) safety are possible: notably increased testing, which is the engineering go-to panacea. And automated testing of Linux has indeed increased a lot in recent years. Raising the abstraction of the implementation language and proving it formally comes with the ambition to make testing less important.

[Mathieu Poirer and Jesper Jansson has helped out in reviewing this blog post, for which I am forever grateful: remaining errors, bugs and biased opinions are my own.]


from Jakub Kicinski

In light of ongoing work to improve the TCP Tx zero-copy efficiency [1] one begins to wonder what can be done on the Rx side. Tx zero-copy is generally easier to implement because it requires no extra HW support. It's primarily a SW exercise in keeping references to user data rather than owning a copy of it.


There had been efforts to support direct Rx to user space buffers by performing header-data splitting in the HW and sending headers (for kernel consumption) to a different buffer than the data. There are two known implementations which depend on header-data splitting (HDS).

First one, which is upstream today [2] depends on data payloads being received in page-sized chunks and mapping (mmap'ing?) the data into the process's virtual address space. Even though modifying the virtual address map is not cheap this scheme works well for large transfers. Importantly applications which use zero-copy are often more DRAM bandwidth constrained than CPU bound. Meaning that even for smaller transfers they may prefer to burn extra cycles modifying page tables than copying data and using up the precious DRAM transfers.


The second approach is to pre-register user memory and let the NIC DMA the data directly there [3]. The memory in this case can be main DRAM or accelerator memory, that's not really important for us here. The model is similar to that of AF_XDP UMEMs. It still depends on header-data split as we don't want the application to be able to modify the headers, as they flow thru the networking stack and crash the kernel. Additionally the device must provide sufficient flow steering to be able to direct traffic for an application to its buffers / queues – we don't want the data to end up in memory of the wrong application. Once the TCP stack is done processing the headers, it simply tells the application where the data is.


Apart from the ability to direct the data to memory other than DRAM (compute accelerators, disks) the second approach has the advantage of not requiring page table changes and neatly page-sized payloads. It is harder to implement in SW because the stack does not necessarily have access to the payload memory. Although netgpu/zctap patches have stalled the idea may have sufficient merit to eventually come to fruition.

Looking ahead

Both of the approaches described above deal with packet-sized chunks of data. Admittedly some HW supports data coalescing (GRO-HW/LRO) but it's limited in size, best effort, latency inducing and generally unproven at scale.

Given that we already modify the HW to support zero-copy Rx (HDS for the former case, HDS+steering for the latter) what modifications would help us get to the next level? The data coalescing can certainly be improved.

LRO depends on intelligence in the NIC and maintaining state about connections which always poses scaling challenges. At the same time modern applications most often know the parameters of the transfer upfront, so they can put processing hints in the packet.

The receiver needs to know (1) which memory pool / region to place the data into and (2) at what offset. We can assume the sender gets these parameters thru the RPC layer. Sender can insert the placement information into the packet it sends. Either wrapped in a UDP header in front of the TCP header, or as a TCP option.

One useful modification may be to express the offset in terms of the TCP sequence number. Instead of providing the absolute offset where the packet data has to land (i.e. “data of this packet has to land in memory X at offset Y”) provide a base TCP sequence number and offset. The destination address would then be computed as

address = mem_pool[packet.dma.mem_id].base + 
          packet.dma.offset +
          packet.tcp.seq_no - packet.dma.base_tcp_seq

This simplifies the sender's job as it no longer has to change the offset as it breaks up a TSO super-frame into MTU-sized segments.

Last thing – we must protect applications from rogue writes. This can be done by programming an allow list of flow to memory pool pairs into the NIC. Unfortunately the full list scales linearly with the number of flows. A better approach would be to depend on a security protocol like Google's PSP [4]. PSP hands out association IDs to the senders, and only after authentication. The PSP IDs are much smaller (4B) than a IPv6 flow key (36B) if we have to keep a full list. We can try to be clever about how we hand them out – for instance allocating the “direct write” IDs from high numbers and less trusted connections from low numbers. We can also modify PSP to maintain a key per memory pool rather than per device, or incorporate the memory region ID in the key derivation algorithm. I'm not sufficiently crypto-savvy to know if the latter would weaken the protection too much.


To sum up in the scheme I'm dreaming up we add the following fields after the PSP header:

  • 16b – memory id – ID of the region;
  • 64b – base offset – absolute offset within the memory region;
  • 32b – TCP sequence number base for the transfer.

Since we have 16 extra bits to round up to full 128b header we can consider adding a generation number for the memory region. This could be useful if memory region configuration / page table update is asynchronous so that we can post the read request before NIC confirmed the configuration is complete. In most cases configuration should be done by the time the data arrives, if it's not we can fall back to non-zero-copy TCP.

I don't work for Netronome/Corigine any more but it's certainly something their HW can easily support with a FW update. Much like PSP itself. What an amazing piece of HW that is...


from Vlastimil Babka

In this post I would like to raise the awareness a bit about an effort to reduce the limitations of anonymous VMA merging, in the form of an ongoing master thesis by Jakub Matena, which I'm supervising. I suspect there might be userspace projects that would benefit and maybe their authors are aware of the current limitations and would welcome if they were relaxed, but they don't read the linux-mm mailing list – the last version of the RFC posted there is here

In a high-level summary, merging of anonymous VMAs in Linux generally happens as soon as they become adjacent in the address space and have compatible access protection bits (and also mempolicies etc.). However due to internal implementation details (involving VMA and page offsets) some operations such as mremap() that moves VMAs around the address space can cause anonymous VMAs not to merge even if everything else is compatible. This is then visible as extra entries in /proc/pid/maps that could be in theory be one larger entry, the associated larger memory and CPU overhead of VMA operations, or even hitting the limit of VMAs per process, set by the vm.max_map_count sysctl. A related issue is that mremap() syscall itself cannot currently process multiple VMAs, so a process that needs to further mremap() the non-merged areas would need to somehow learn the extra boundaries first and perform a sequence of multiple mremap()'s to achieve its goal.

Does any of the above sound familiar because you found that out already while working on a Linux application? Then we would love your feedback on the RFC linked above (or privately). The issue is that while in many scenarios the merging limitations can be lifted by the RFC, it doesn't come for free in both of some overhead of e.g. mremap(), and especially the extra complexity of an already complex code. Thus identifying workloads that would benefit a lot would be helpful. Thanks!


from metan's blog

The new LTP release will include changes that have introduced concept of a test maximal runtime so let me briefly explain what exactly that is. To begin with let's make an observation about a LTP test duration. Most of the LTP tests do fall into two categories when duration of the test is considered. First type of tests is fast, generally under a second or two and most of the time even fraction of that. These tests mostly prepare simple environment, call a syscall or two, clean up and are done. The second type of tests runs for longer and their duration is usually counted in minutes. These tests include I/O stress test, various regression tests that are looping in order to hit a race, timer precision tests that have to sample time intervals and so on.

Historically in LTP the test duration was limited by a single value called timeout, that defaulted to a compromise of 5 minutes, which is the worst value for both classes of the tests. That is because it's clearly too long for short running tests and at the same time too short for significant fraction of the long running tests. This was clear just by checking the tests that actually adjusted the default timeout. Quite a few short running tests that were prone to deadlocks decreased the default timeout to a much shorter interval and at the same time quite a few long running tests did increase it as well.

But back at how the test duration was handled in the long running tests. The test duration for long running tests is usually bounded by a time limit as well as a limit on a number of iterations and the test exits on whichever is hit first. In order to exit the test before the timeout these tests watched the elapsed runtime and did exit the main loop if the runtime got close enough to the test timeout. The problem was that close enough was loosely defined and implemented in each test differently. That obviously leads to a different problems. For instance if test looped until there was 10 seconds left to the timeout and the test cleanup did take more than 10 seconds on a slower hardware, there was no way how to avoid triggering the timeout which resulted in test failure. If test timeout was increased the test simply run for longer duration and hit the timeout at the end either way. At the same time if the test did use proportion of the timeout left out for the test cleanup things didn't work out when the timeout was scaled down in order to shorten the test duration.

After careful analysis it became clear that the test duration has to be bound by a two distinct values. The new values are now called timeout and max_runtime and the test duration is bound by a sum of these two. The idea behind this should be clear to the reader at this point. The max_runtime limits the test active part, that is the part where the actual test loop is executed and the timeout covers the test setup and cleanup and all inaccuracies in the accounting. Each of them can be scaled separately which gives us enough flexibility to be able to scale from small embedded boards all the way up to the supercomputers. This change also allowed us to change the default test timeout to 30 seconds. And if you are asking yourself a question how max_runtime is set for short running tests the answer is simple it's set to zero since the default timeout is more than enough to cope with these.

All of this also helps to kill the misbehaving tests much faster since we have much better estimation for the expected test duration. And yes this is a big deal when you are running thousands of testcases, it may speed up the testrun quite significantly even with a few deadlocked tests.

But things does not end here, there is a bit of added complexity on the top of this. Some of the testcases will call the main test loop more than once. That is because we have a few “multipliers” flags that can increase test coverage quite a bit. For instance we have so called .all_filesystems flag, that when set, will execute the test on the top of the most commonly used filesystems. There is also flag that can run the test for a different variants, which is sometimes used to run the test for a more than one syscall variant, e.g. for clock_gettime() we run the same test for both syscall and VDSO. All these multipliers have to be taken into an account when overall test duration is computed. However we do have all these flags in the metadata file now hence we are getting really close to a state where we will have a tool that can compute an accurate upper bound for duration for a given test. However that is completely different story for a different short article.


from Konstantin Ryabitsev

Once every couple of years someone unfailingly takes advantage of the following two facts:

  1. most large git hosting providers set up object sharing between forks of the same repository in order to save both storage space and improve user experience
  2. git's loose internal structure allows any shared object to be accessed from any other repository

Thus, hilarity ensues on a fairly regular basis:

Every time this happens, many wonder how come this isn't treated like a nasty security bug, and the answer, inevitably, is “it's complicated.”

Blobs, trees, commits, oh my

Under the hood, git repositories are a bunch of objects — blobs, trees, and commits. Blobs are file contents, trees are directory listings that establish the relationship between file names and the blobs, and commits are like still frames in a movie reel that show where all the trees and blobs were at a specific point in time. Each next commit refers to the hash of the previous commit, which is how we know in what order these still frames should be put together to make a movie.

Each of these objects has a hash value, which is how they are stored inside the git directory itself (look in .git/objects). When git was originally designed, over a decade ago, it didn't really have a concept of “branches” — there was just a symlink HEAD pointing to the latest commit. If you wanted to work on several things at once, you simply cloned the repository and did it in a separate directory with its own HEAD. Cloning was a very efficient operation, as through the magic of hardlinking, hundreds of clones would take up about as much room on your disk as a single one.

Fast-forward to today

Git is a lot more complicated these days, but the basic concepts are the same. You still have blobs, trees, commits, and they are all still stored internally as hashes. Under the hood, git has developed quite a bit over the past decade to make it more efficient to store and retrieve millions and tens of millions of repository objects. Most of them are now stored inside special pack files, which are organized rather similar to compressed video clips — formats like webm don't really store each frame in a separate image, as there is usually very little difference between any two adjacent frames. It makes much more sense to store just the difference (“delta”) between two still images until you come to a designated “key frame”.

Similarly, when generating pack files, git will try to calculate the deltas between objects and only store their incremental differences — at least until it decides that it's time to start from a new “key frame” just so checking out a tag from a year ago doesn't require replaying a year worth of diffs. At the same time, there has been a lot of work to make the act of pushing/pulling objects more efficient. When someone sends you a pull request and you want to review their changes, you don't want to download their entire tree. Your git client and the remote git server compare what objects they already have on each end, with the goal to send you just the objects that you are lacking.

Optimizing public forks

If you look at the GitHub links above, check out how many forks torvalds/linux has on that hosting service. Right now, that number says “41.1k”. With the best kinds of optimizations in place, a bare linux.git repository takes up roughtly 3 GB on disk. Doing quick math, if each one of these 41.1k forks were completely standalone, that would require about 125 TB of disk storage. Throw in a few hundred terabytes for all the forks of Chromium, Android, and Gecko, and soon you're talking Real Large Numbers. Which is why nobody actually does it this way.

Remember how I said that git forks were designed to be extremely efficient and reuse the objects between clones? This is how forks are actually organized on GitHub (and, for that matter), except it's a bit more complicated these days than simply hardlinking the contents of .git/objects around.

On side of things we store the objects from all forks of linux.git in a single “object storage” repository (see for the gory details). This has many positive side-effects:

  • all of, with its hundreds of linux.git forks takes up just 30G of disk space
  • when Linus merges his usual set of pull requests and performs “git push”, he only has to send a very small subset of those objects, because we probably already have most of them
  • similarly, when maintainers pull, rebase, and push their own forks, they don't have to send any of the objects back to us, as we already have them

Object sharing allows to greatly improve not only the backend infrastructure on our end, but also the experience of git's end-users who directly benefit from not having to push around nearly as many bits.

The dark side of object sharing

With all the benefits of object sharing comes one important downside — namely, you can access any shared object through any of the forks. So, if you fork linux.git and push your own commit into it, any of the 41.1k forks will have access to the objects referenced by your commit. If you know the hash of that object, and if the web ui allows to access arbitrary repository objects by their hash, you can even view and link to it from any of the forks, making it look as if that object is actually part of that particular repository (which is how we get the links at the start of this article).

So, why can't GitHub (or prevent this from happening? Remember when I said that a git repository is like a movie full of adjacent still frames? When you look at a scene in a movie, it is very easy for you to identify all objects in any given still frame — there is a street, a car, and a person. However, if I show you a picture of a car and ask you “does this car show up in this movie,” the only way you can answer this question is by watching the entire thing from the beginning to the end, carefully scrutinizing every shot.

In just the same way, to check if a blob from the shared repository actually belongs in a fork, git has to look at all that repository's tips and work its way backwards, commit by commit, to see if any of the tree objects reference that particular blob. Needless to say, this is an extremely expensive operation, which, if enabled, would allow anyone to easily DoS a git server with only a handful of requests.

This may change in the future, though. For example, if you access a commit that is not part of a repository, GitHub will now show you a warning message:

Looking up “does this commit belong in this repository” used to be a very expensive operation, too, until git learned to generate commit graphs (see man git-commit-graph). It is possible that at some point in the future a similar feature will land that will make it easy to perform a similar check for the blob, which will allow GitHub to show a similar warning when someone accesses shared blobs by their hash from the wrong repo.

Why this isn't a security bug

Just because an object is part of the shared storage doesn't really have any impact on the forks. When you perform a git-aware operation like “git clone” or “git pull,” git-daemon will only send the objects actually belonging to that repository. Furthermore, your git client deliberately doesn't trust the remote to send the right stuff, so it will perform its own connectivity checks before accepting anything from the server.

If you're extra paranoid, you're encouraged to set receive.fsckObjects for some additional protection against in-flight object corruption, and if you're really serious about securing your repositories, then you should set up and use git object signing:

This is, incidentally, also how you would be able to verify whether commits were made by the actual Linus Torvalds or merely by someone pretending to be him.

Parting words

This neither proves nor disproves the identity of “Satoshi.” However, given Linus's widely known negative opinions of C++, it's probably not very likely that it's the language he'd pick to write some proof of concept code.


from metan's blog

Unfortunately FOSDEM is going to be virtual again this year, but that does not stop us from organizing the testing and automation devroom. Have a look at our CfP and if you have something interesting to present go ahead and fill in a submission!


from nmenon

One of the cool things with is the fact that we can rotate maintainership depending on workload. So, is now my personal tree and we have picked up as a co-maintained TI tree that Vignesh and I rotate responsibilities with Tony Lindgren and Tero in backup.

Thanks to Konstantin and Stephen in making this happen.!

NOTE: No change in Tony's tree @


from Konstantin Ryabitsev

This is the second installment in the series where we're looking at using the public-inbox lei tool for interacting with remote mailing list archives such as In the previous article we looked at delivering your search results locally, and today let's look at doing the same, but with remote IMAP folders. For feedback, send a follow-up to this message on the workflows list:

For our example query today, we'll do some stargazing. The following will show us all mail sent by Linus Torvalds:

f:torvalds AND rt:1.month.ago..

I'm mostly using it because it's short, but you may want to use something similar if you have co-maintainer duties and want to automatically receive a copy of all mail sent by your fellow subsystem maintainers.

Note on saving credentials

When accessing IMAP folders, lei will require a username and password. Unless you really like typing them in manually every time you run lei up, you will probably want to have them cached on your local system. Lei will defer to git-credential-helper for this purpose, so if you haven't already set this up, you will want to do that now.

The two commonly used credential storage backends on Linux are “libsecret” and “store”:

  • libsecret is the preferred mechanism, as it will work with your Desktop Environment's keyring manager to store the credentials in a relatively safe fashion (encrypted at rest).

  • store should only be used if you don't have any other option, as it will record the credentials without any kind of encryption in the ~/.git-credentials file. However, if nothing else is working for you and you are fairly confident in the security of your system, it's something you can use.

Simply run the following command to configure the credential helper globally for your environment:

git config --global credential.helper libsecret

For more in-depth information about this topic, see man git-credential.

Getting your IMAP server ready

Before you start, you should get some information about your IMAP server, such as your login information. For my examples, I'm going to use Gmail, Migadu, and a generic Dovecot IMAP server installation, which should hopefully cover enough ground to be useful for the vast majority of cases.

What you will need beforehand:

  • the IMAP server hostname and port (if it's not 993)
  • the IMAP username
  • the IMAP password

It will also help to know the folder hierarchy. Some IMAP servers create all subfolders below INBOX, while others don't really care.

Generic Dovecot

We happen to be running Dovecot on, so I'm going to use it as my “generic Dovecot” system and run the following command:

lei q -I -d mid \
  -o imaps:// \
  <<< 'f:torvalds AND rt:1.month.ago..'

The <<< bit above is a Bash-ism, so if you're using a different shell, you can use the POSIX-compliant heredoc format instead:

lei q -I -d mid \
  -o imaps:// <<EOF
f:torvalds AND rt:1.month.ago..

The first time you run it, you should get a username: and password: prompt, but after that the credentials should be cached and no longer required on each repeated access to the same imaps server.

NOTE: some IMAP servers use the dot . instead of the slash / for indicating folder hierarchy, so if INBOX/torvalds is not working for you, try INBOX.torvalds instead.

Refreshing and subscribing to IMAP folders

If the above command succeeded, then you should be able to view the IMAP folder in your mail client. If you cannot see torvalds in your list of available folders, then you may need to refresh and/or subscribe to the newly created folder. The process will be different for every mail client, but it shouldn't be too hard to find.

The same with Migadu

If you have a account (see, then you probably already know that we ask you not to use your account for subscribing to busy mailing lists. This is due to Migadu imposing soft limits on how much incoming email is allowed for each hosted domain — so using lei + IMAP is an excellent alternative.

To set this up with your account (or any other account hosted on Migadu), use the following command:

lei q -I -d mid \
  -o imaps:// \
  <<< 'f:torvalds AND rt:1.month.ago..'

Again, you will need to subscribe to the new lei/torvalds folder to see it in your mail client.

The same with Gmail

If you are a Gmail user and aren't already using IMAP, then you will need to jump through a few additional hoops before you are able to get going. Google is attempting to enhance the security of your account by restricting how much can be done with just your Google username and password, so services like IMAP are not available without setting up a special “app password” that can only be used for mail access.

Enabling app passwords requires that you first enable 2-factor authentication, and then generate a random app password to use with IMAP. Please follow the process described in the following Google document:

Once you have the app password for use with IMAP, you can use lei and imaps just like with any other IMAP server:

lei q -I -d mid \
  -o imaps:// \
  <<< 'f:torvalds AND rt:1.month.ago..'

It requires a browser page reload for the folder to show up in your Gmail web UI.

Automating lei up runs

If you're setting up IMAP access, then you probably want IMAP updates to happen behind the scenes without your direct involvement. All you need to do is periodically run lei up --all (plus -q if you don't want non-critical output).

If you're just getting started, then you can set up a simple screen session with a watch command at a 10-minute interval, like so:

watch -n 600 lei up --all

You can then detach from the screen terminal and let that command continue behind the scenes. The main problem with this approach is that it won't survive a system reboot, so if everything is working well and you want to make the command a bit more permanent, you can set up a systemd user timer.

Here's the service file to put in ~/.config/systemd/user/lei-up-all.service:

Description=lei up --all service

ExecStart=/usr/bin/lei up --all -q


And the timer file to put in ~/.config/systemd/user/lei-up-all.timer:

Description=lei up --all timer



Enable the timer:

systemctl --user enable --now lei-up-all.timer

You can use journalctl -xn to view the latest journal messages and make sure that the timer is running well.

CAUTION: user timers only run when the user is logged in. This is not actually that bad, as your keyring is not going to be unlocked unless you are logged into the desktop session. If you want to run lei up as a background process on some server, you should set up a system-level timer and use a different git-credential mechanism (e.g. store) — and you probably shouldn't do this on a shared system where you have to worry about your account credentials being stolen.

Coming up next

In the next installment we'll look at some other part of lei and public-inbox... I haven't yet decided which. :)


from Konstantin Ryabitsev

I am going to post a series of articles about public inbox's new lei tool (stands for “local email interface”, but is clearly a “lorelei” joke :)). In addition to being posted on the blog, it is also available on the workflows mailing list, so if you want to reply with a follow up, see this link:

What's the problem?

One of kernel developers' perennial complaints is that they just get Too Much Damn Email. Nobody in their right mind subscribes to “the LKML” ( because it acts as a dumping ground for all email and the resulting firehose of patches and rants is completely impossible for a sane human being to follow.

For this reason, actual Linux development tends to happen on separate mailing lists dedicated to each particular subsystem. In turn, this has several negative side-effects:

  1. Developers working across multiple subsystems end up needing to subscribe to many different mailing lists in order to stay aware of what is happening in each area of the kernel.

  2. Contributors submitting patches find it increasingly difficult to know where to send their work, especially if their patches touch many different subsystems.

The script is an attempt to solve the problem #2, and will look at the diff contents in order to suggest the list of recipients for each submitted patch. However, the submitter needs to be both aware of this script and know how to properly configure it in order to correctly use it with git-send-email.

Further complicating the matter is the fact that relies on the entries in the MAINTAINERS file. Any edits to that file must go through the regular patch submission and review process and it may take days or weeks before the updates find their way to individual contributors.

Wouldn't it be nice if contributors could just send their patches to one place, and developers could just filter out the stuff that is relevant to their subsystem and ignore the rest?

lore meets lei

Public-inbox started out as a distributed mailing list archival framework with powerful search capabilities. We were happy to adopt it for our needs when we needed a proper home for kernel mailing list archives — thus, came online.

Even though it started out as merely a list archival service, it quickly became obvious that lore could be used for a lot more. Many developers ended up using its search features to quickly locate emails of interest, which in turn raised a simple question — what if there was a way to “save a search” and have it deliver all new incoming mail matching certain parameters straight to the developers' inbox?

You can now do this with lei.

lore's search syntax

Public-inbox uses Xapian behind the scenes, which allows to narrowly tailor the keyword database to very specific needs.

For example, did you know that you can search for patches that touch specific files? Here's every patch that touched the MAINTAINERS file:

How about every patch that modifies a function that starts with floppy_:

Say you're the floppy driver maintainer and wanted to find all mail that touches drivers/block/floppy.c and modifies any function that starts with floppy_ or has “floppy” in the subject and maybe any other mail that mentions “floppy” and has the words “bug” or “regression”? And maybe limit the results to just the past month.

Here's the query:

    (dfn:drivers/block/floppy.c OR dfhh:floppy_* OR s:floppy
     OR ((nq:bug OR nq:regression) AND nq:floppy))
    AND rt:1.month.ago..

And here are the results:

Now, how about getting that straight into your mailbox, so you don't have to subscribe to the (very busy) linux-block list, if you are the floppy maintainer?

Installing lei

Lei is very new and probably isn't yet available as part of your distribution, but I hope that it will change quickly once everyone realizes how awesome it is.

I'm working on packaging lei for Fedora, so depending on when you're reading this, try dnf install lei — maybe it's already there. If it's not in Fedora proper yet, you can get it from my copr:

    dnf copr enable icon/b4
    dnf install lei

If you're not a Fedora user, just consult the INSTALL file:

Maildir or IMAP?

Lei can deliver search results either into a local maildir, or to a remote IMAP folder (or both). We'll do local maildir first and look at IMAP in a future follow-up, as it requires some preparatory work.

Getting going with lei-q

Let's take the exact query we used for the floppy drive above, and get lei to deliver entire matching threads into a local maildir folder that we can read with mutt:

    lei q -I -o ~/Mail/floppy \
      --threads --dedupe=mid \
      '(dfn:drivers/block/floppy.c OR dfhh:floppy_* OR s:floppy \
      OR ((nq:bug OR nq:regression) AND nq:floppy)) \
      AND rt:1.month.ago..'

Before you run it, let's understand what it's going to do:

  • -I will query the aggregated index that contains information about all mailing lists archived on It doesn't matter to which list the patch was sent — if it's on lore, the query will find it.

  • -o ~/Mail/floppy will create a new Maildir folder and put the search results there. Make sure that this folder doesn't already exist, or lei will clobber anything already present there (unless you use --augment, but I haven't tested this very extensively yet, so best to start with a clean slate).

  • --threads will deliver entire threads even if the match is somewhere in the middle of the discussion. This is handy if, for example, someone says “this sounds like a bug in the floppy subsystem” somewhere in the middle of a conversation and --threads will automatically get you the entire conversation context.

  • --dedupe=mid will deduplicate results based on the message-id header. The default behaviour is to dedupe based on the body contents, but with so many lists still adding junky “sent to the foo list” footers, this tends to result in too many duplicated results. Passing --dedupe=mid is less safe (someone could sneak in a bogus message with an identical message-id and have it delivered to you instead), but more convenient. YMMV, BYOB.

  • Make sure you don't omit the final “..” in the rt: query parameter, or you will only get mail that was sent on that date, not since that date.

As always, backslashes and newlines are there just for readability — you don't need to use them.

After the command completes, you should get something similar to what is below:

    # /usr/bin/curl -Sf -s -d ''
    # /home/user/.local/share/lei/store 0/0
    # 122/?
    # 227/227
    # 150 written to /home/user/Mail/floppy/ (227 matches)

A few things to notice here:

  1. The command actually executes a curl call and retrieves the results as an mbox file.
  2. Lei will automatically convert 1.month.ago into a precise timestamp
  3. The command wrote 150 messages into the maildir we specified

We can now view these results with mutt (or neomutt):

    neomutt -f ~/Mail/floppy

It is safe to delete mail from this folder — it will not get re-added during lei up runs, as lei keeps track of seen messages on its own.

Updating with lei-up

By default, lei -q will save your search and start keeping track of it. To see your saved searches, run:

    $ lei ls-search

To fetch the newest messages:

    lei up ~/Mail/floppy

You will notice that the first line of output will say that lei automatically limited the results to only those that arrived since the last time lei was invoked for this particular saved search, so you will most likely get no new messages.

As you add more queries in the future, you can update them all at once using:

    lei up --all

Editing and discarding saved searches

To edit your saved search, just run lei edit-search. This will bring up your $EDITOR with the configuration file lei uses internally:

    ; to refresh with new results, run: lei up /home/user/Mail/floppy
    ; `maxuid' and `lastresult' lines are maintained by "lei up" for optimization
        q = (dfn:drivers/block/floppy.c OR dfhh:floppy_* OR s:floppy OR \
            ((nq:bug OR nq:regression) AND nq:floppy)) AND rt:1.month.ago..
    [lei "q"]
        include =
        external = 1
        local = 1
        remote = 1
        threads = 1
        dedupe = mid
        output = maildir:/home/user/Mail/floppy
    [external "/home/user/.local/share/lei/store"]
        maxuid = 4821
    [external ""]
        lastresult = 1636129583

This lets you edit the query parameters if you want to add/remove specific keywords. I suggest you test them on first before putting them into the configuration file, just to make sure you don't end up retrieving tens of thousands of messages by mistake.

To delete a saved search, run:

    lei forget-search ~/Mail/floppy

This doesn't delete anything from ~/Mail/floppy, it just makes it impossible to run lei up to update it.

Subscribing to entire mailing lists

To subscribe to entire mailing lists, you can query based on the list-id header. For example, if you wanted to replace your individual subscriptions to linux-block and linux-scsi with a single lei command, do:

    lei q -I -o ~/Mail/lists --dedupe=mid \
      '( OR AND rt:1.week.ago..'

You can always edit this to add more lists at any time.

Coming next

In the next series installment, I'll talk about how to deliver these results straight to a remote IMAP folder and how to set up a systemd timer to get newest mail automatically (if that's your thing — I prefer to run lei up manually and only when I'm ready for it).


from metan's blog

As usual we had a LTP release at the end of the September. What was unusual though is the number of patches that went it, we got 483 patches, which is about +150 than the last three releases. And the number of patches did slowly grow even before that.

While it's great and I'm happy that the project is growing, there is a catch, grow like this puts additional strain on the maintainers, particularly on the patch reviewers. For me it was +120 patches reviewed during the four months period and that only counts the final versions of patches that were accepted to the repository, it's not unusual to have three or more revisions before the work is ready to be merged.

While I managed to cope with it reasonably fine the work that I had on TODO for the project was stalled. One of the things I finally want to move forward is making the runltp-ng official LTP test runner, but there is much more. So the obvious question is how to make things better and one of the things we came up was automation.

What we implemented for LTP is 'make check' that runs different tools on the test source code that is supposed to be used before patch is sent for a review. For C code we use the well known and custom sparse based checker to identify most common problems. The tooling is set up automatically when you call 'make check' for a first time and we tried to make it as effortless as possible, so that there is no reason not to use during the development. We also use for shell code and hopefully the number of checks will grow over the time. Hopefully this should eliminate on average at least one revision for a patchset which would be hundreds of patches during our development cycle.

Ideally this will fix the problem for a while and we will make more effective use of our resources, but eventually we will get to a point where more maintainers and reviewers are needed, which is problem that is hard to solve without your help.