Konstantin Ryabitsev

kernel.org administrator

Once every couple of years someone unfailingly takes advantage of the following two facts:

  1. most large git hosting providers set up object sharing between forks of the same repository, in order to both save storage space and improve user experience
  2. git's loose internal structure allows any shared object to be accessed from any other repository

Thus, hilarity ensues on a fairly regular basis:

Every time this happens, many wonder how come this isn't treated like a nasty security bug, and the answer, inevitably, is “it's complicated.”

Blobs, trees, commits, oh my

Under the hood, git repositories are a bunch of objects — blobs, trees, and commits. Blobs are file contents, trees are directory listings that establish the relationship between file names and the blobs, and commits are like still frames in a movie reel that show where all the trees and blobs were at a specific point in time. Each next commit refers to the hash of the previous commit, which is how we know in what order these still frames should be put together to make a movie.

Each of these objects has a hash value, which is how they are stored inside the git directory itself (look in .git/objects). When git was originally designed, over a decade ago, it didn't really have a concept of “branches” — there was just a symlink HEAD pointing to the latest commit. If you wanted to work on several things at once, you simply cloned the repository and did it in a separate directory with its own HEAD. Cloning was a very efficient operation, as through the magic of hardlinking, hundreds of clones would take up about as much room on your disk as a single one.
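If you want to poke at these objects yourself, git's plumbing commands make it easy. Here's a quick illustration you can try in any repository with a fresh commit (the hash shown is made up, and the loose-object path only exists until the object gets packed — more on packing below):

$ git rev-parse HEAD
aab4de1c6dc4f9ad01b5884e1e1dd1963aab0ed7
$ git cat-file -t aab4de1c6dc4f9ad01b5884e1e1dd1963aab0ed7
commit
$ ls .git/objects/aa/
b4de1c6dc4f9ad01b5884e1e1dd1963aab0ed7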

Fast-forward to today

Git is a lot more complicated these days, but the basic concepts are the same. You still have blobs, trees, and commits, and they are all still stored internally as hashes. Under the hood, git has developed quite a bit over the past decade to make it more efficient to store and retrieve millions and tens of millions of repository objects. Most of them are now stored inside special pack files, which are organized rather like compressed video clips — formats like webm don't really store each frame as a separate image, as there is usually very little difference between any two adjacent frames. It makes much more sense to store just the difference (“delta”) between two still images until you come to a designated “key frame”.

Similarly, when generating pack files, git will try to calculate the deltas between objects and only store their incremental differences — at least until it decides that it's time to start from a new “key frame”, just so checking out a tag from a year ago doesn't require replaying a year's worth of diffs. At the same time, there has been a lot of work to make the act of pushing/pulling objects more efficient. When someone sends you a pull request and you want to review their changes, you don't want to download their entire tree. Your git client and the remote git server compare what objects they already have on each end, with the goal of sending you just the objects that you are lacking.
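You can see this machinery for yourself — here's a sketch of how you might inspect the pack files in any local clone (the pack filename is a placeholder):

$ git count-objects -v                # loose vs packed object counts
$ ls .git/objects/pack/
pack-<hash>.idx  pack-<hash>.pack
$ git verify-pack -v .git/objects/pack/pack-<hash>.idx | head
# each line shows: object hash, type, size, size-in-pack, offset,
# and (for deltified objects) the delta depth and base object hash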

Optimizing public forks

If you look at the GitHub links above, check out how many forks torvalds/linux has on that hosting service. Right now, that number says “41.1k”. With the best kinds of optimizations in place, a bare linux.git repository takes up roughly 3 GB on disk. Doing quick math, if each one of these 41.1k forks were completely standalone, that would require about 125 TB of disk storage. Throw in a few hundred terabytes for all the forks of Chromium, Android, and Gecko, and soon you're talking Real Large Numbers. Which is why nobody actually does it this way.

Remember how I said that git clones were designed to be extremely efficient and to reuse the objects between copies? This is how forks are actually organized on GitHub (and git.kernel.org, for that matter), except it's a bit more complicated these days than simply hardlinking the contents of .git/objects around.

On the git.kernel.org side of things, we store the objects from all forks of linux.git in a single “object storage” repository (see https://pypi.org/project/grokmirror/ for the gory details). This has many positive side-effects:

  • all of git.kernel.org, with its hundreds of linux.git forks, takes up just 30G of disk space
  • when Linus merges his usual set of pull requests and performs “git push”, he only has to send a very small subset of those objects, because we probably already have most of them
  • similarly, when maintainers pull, rebase, and push their own forks, they don't have to send any of the objects back to us, as we already have them

Object sharing allows us to greatly improve not only the backend infrastructure on our end, but also the experience of git's end-users, who directly benefit from not having to push around nearly as many bits.

The dark side of object sharing

With all the benefits of object sharing comes one important downside — namely, you can access any shared object through any of the forks. So, if you fork linux.git and push your own commit into it, any of the 41.1k forks will have access to the objects referenced by your commit. If you know the hash of that object, and if the web UI allows access to arbitrary repository objects by their hash, you can even view and link to it from any of the forks, making it look as if that object is actually part of that particular repository (which is how we get the links at the start of this article).

So, why can't GitHub (or git.kernel.org) prevent this from happening? Remember when I said that a git repository is like a movie full of adjacent still frames? When you look at a scene in a movie, it is very easy for you to identify all objects in any given still frame — there is a street, a car, and a person. However, if I show you a picture of a car and ask you “does this car show up in this movie,” the only way you can answer this question is by watching the entire thing from the beginning to the end, carefully scrutinizing every shot.

In just the same way, to check if a blob from the shared repository actually belongs in a fork, git has to look at all that repository's tips and work its way backwards, commit by commit, to see if any of the tree objects reference that particular blob. Needless to say, this is an extremely expensive operation, which, if enabled, would allow anyone to easily DoS a git server with only a handful of requests.
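In git plumbing terms, that check looks something like the following (a sketch — the blob hash is made up):

$ git rev-list --objects --all | grep -m1 0123456789ab
# rev-list --objects walks every commit reachable from every ref and
# lists every tree and blob along the way — on linux.git that means
# traversing many millions of objects just to answer one question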

This may change in the future, though. For example, if you access a commit that is not part of a repository, GitHub will now show you a warning message:

Looking up “does this commit belong in this repository” used to be a very expensive operation, too, until git learned to generate commit graphs (see man git-commit-graph). It is possible that at some point in the future a similar feature will land that will make it easy to perform a similar check for the blob, which will allow GitHub to show a similar warning when someone accesses shared blobs by their hash from the wrong repo.
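For the curious, here's roughly what the commit-side check looks like today (the commit hash is made up; the commit-graph file is what makes the ancestry lookup fast):

$ git commit-graph write --reachable
$ git merge-base --is-ancestor 0123456789ab HEAD && echo "part of this history"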

Why this isn't a security bug

The mere fact that an object is part of the shared storage doesn't really have any impact on the forks. When you perform a git-aware operation like “git clone” or “git pull,” git-daemon will only send the objects actually belonging to that repository. Furthermore, your git client deliberately doesn't trust the remote to send the right stuff, so it will perform its own connectivity checks before accepting anything from the server.

If you're extra paranoid, you're encouraged to set receive.fsckObjects for some additional protection against in-flight object corruption, and if you're really serious about securing your repositories, then you should set up and use git object signing:

This is, incidentally, also how you would be able to verify whether commits were made by the actual Linus Torvalds or merely by someone pretending to be him.
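For reference, here's one way to flip on the object-checking knobs mentioned above (a sketch; transfer.fsckObjects enables the check in both directions at once):

git config --global receive.fsckObjects true   # check objects pushed into your repos
git config --global fetch.fsckObjects true     # check objects you fetch from others
git config --global transfer.fsckObjects true  # or simply both at once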

Parting words

This neither proves nor disproves the identity of “Satoshi.” However, given Linus's widely known negative opinions of C++, it's probably not very likely that it's the language he'd pick to write some proof of concept code.

This is the second installment in the series where we're looking at using the public-inbox lei tool for interacting with remote mailing list archives such as lore.kernel.org. In the previous article we looked at delivering your search results locally, and today let's look at doing the same, but with remote IMAP folders. For feedback, send a follow-up to this message on the workflows list:

For our example query today, we'll do some stargazing. The following will show us all mail sent by Linus Torvalds:

f:torvalds AND rt:1.month.ago..

I'm mostly using it because it's short, but you may want to use something similar if you have co-maintainer duties and want to automatically receive a copy of all mail sent by your fellow subsystem maintainers.

Note on saving credentials

When accessing IMAP folders, lei will require a username and password. Unless you really like typing them in manually every time you run lei up, you will probably want to have them cached on your local system. Lei will defer to git-credential-helper for this purpose, so if you haven't already set this up, you will want to do that now.

The two commonly used credential storage backends on Linux are “libsecret” and “store”:

  • libsecret is the preferred mechanism, as it will work with your Desktop Environment's keyring manager to store the credentials in a relatively safe fashion (encrypted at rest).

  • store should only be used if you don't have any other option, as it will record the credentials without any kind of encryption in the ~/.git-credentials file. However, if nothing else is working for you and you are fairly confident in the security of your system, it's something you can use.

Simply run the following command to configure the credential helper globally for your environment:

git config --global credential.helper libsecret

For more in-depth information about this topic, see man git-credential.

Getting your IMAP server ready

Before you start, you should get some information about your IMAP server, such as your login information. For my examples, I'm going to use Gmail, Migadu, and a generic Dovecot IMAP server installation, which should hopefully cover enough ground to be useful for the vast majority of cases.

What you will need beforehand:

  • the IMAP server hostname and port (if it's not 993)
  • the IMAP username
  • the IMAP password

It will also help to know the folder hierarchy. Some IMAP servers create all subfolders below INBOX, while others don't really care.

Generic Dovecot

We happen to be running Dovecot on mail.codeaurora.org, so I'm going to use it as my “generic Dovecot” system and run the following command:

lei q -I https://lore.kernel.org/all/ -d mid \
  -o imaps://mail.codeaurora.org/INBOX/torvalds \
  <<< 'f:torvalds AND rt:1.month.ago..'

The <<< bit above is a Bash-ism, so if you're using a different shell, you can use the POSIX-compliant heredoc format instead:

lei q -I https://lore.kernel.org/all/ -d mid \
  -o imaps://mail.codeaurora.org/INBOX/torvalds <<EOF
f:torvalds AND rt:1.month.ago..
EOF

The first time you run it, you should get a username: and password: prompt, but after that the credentials should be cached and no longer required on each repeated access to the same imaps server.

NOTE: some IMAP servers use the dot . instead of the slash / for indicating folder hierarchy, so if INBOX/torvalds is not working for you, try INBOX.torvalds instead.

Refreshing and subscribing to IMAP folders

If the above command succeeded, then you should be able to view the IMAP folder in your mail client. If you cannot see torvalds in your list of available folders, then you may need to refresh and/or subscribe to the newly created folder. The process will be different for every mail client, but it shouldn't be too hard to find.

The same with Migadu

If you have a linux.dev account (see https://korg.docs.kernel.org/linuxdev.html), then you probably already know that we ask you not to use your account for subscribing to busy mailing lists. This is due to Migadu imposing soft limits on how much incoming email is allowed for each hosted domain — so using lei + IMAP is an excellent alternative.

To set this up with your linux.dev account (or any other account hosted on Migadu), use the following command:

lei q -I https://lore.kernel.org/all/ -d mid \
  -o imaps://imap.migadu.com/lei/torvalds \
  <<< 'f:torvalds AND rt:1.month.ago..'

Again, you will need to subscribe to the new lei/torvalds folder to see it in your mail client.

The same with Gmail

If you are a Gmail user and aren't already using IMAP, then you will need to jump through a few additional hoops before you are able to get going. Google is attempting to enhance the security of your account by restricting how much can be done with just your Google username and password, so services like IMAP are not available without setting up a special “app password” that can only be used for mail access.

Enabling app passwords requires that you first enable 2-factor authentication, and then generate a random app password to use with IMAP. Please follow the process described in the following Google document: https://support.google.com/mail/answer/185833

Once you have the app password for use with IMAP, you can use lei and imaps just like with any other IMAP server:

lei q -I https://lore.kernel.org/all/ -d mid \
  -o imaps://imap.gmail.com/lei/torvalds \
  <<< 'f:torvalds AND rt:1.month.ago..'

You may need to reload the browser page for the folder to show up in your Gmail web UI.

Automating lei up runs

If you're setting up IMAP access, then you probably want IMAP updates to happen behind the scenes without your direct involvement. All you need to do is periodically run lei up --all (plus -q if you don't want non-critical output).

If you're just getting started, then you can set up a simple screen session with a watch command at a 10-minute interval, like so:

watch -n 600 lei up --all
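If you haven't used screen before, the sequence might look like this (the session name is arbitrary; reattach later with screen -r lei-up):

screen -S lei-up
watch -n 600 lei up --all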

You can then detach from the screen terminal and let that command continue behind the scenes. The main problem with this approach is that it won't survive a system reboot, so if everything is working well and you want to make the command a bit more permanent, you can set up a systemd user timer.

Here's the service file to put in ~/.config/systemd/user/lei-up-all.service:

[Unit]
Description=lei up --all service
ConditionPathExists=%h/.local/share/lei

[Service]
Type=oneshot
ExecStart=/usr/bin/lei up --all -q

[Install]
WantedBy=mail.target

And the timer file to put in ~/.config/systemd/user/lei-up-all.timer:

[Unit]
Description=lei up --all timer
ConditionPathExists=%h/.local/share/lei

[Timer]
OnUnitInactiveSec=10m

[Install]
WantedBy=default.target

Enable the timer:

systemctl --user enable --now lei-up-all.timer

You can use journalctl -xn to view the latest journal messages and make sure that the timer is running well.
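For example, the following should confirm that the timer is scheduled and show what the last service run logged:

systemctl --user list-timers lei-up-all.timer
journalctl --user -u lei-up-all.service -n 20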

CAUTION: user timers only run when the user is logged in. This is not actually that bad, as your keyring is not going to be unlocked unless you are logged into the desktop session. If you want to run lei up as a background process on some server, you should set up a system-level timer and use a different git-credential mechanism (e.g. store) — and you probably shouldn't do this on a shared system where you have to worry about your account credentials being stolen.

Coming up next

In the next installment we'll look at some other part of lei and public-inbox... I haven't yet decided which. :)

I am going to post a series of articles about public-inbox's new lei tool (it stands for “local email interface”, but is clearly a “lorelei” joke :)). In addition to being posted on the blog, it is also available on the workflows mailing list, so if you want to reply with a follow-up, see this link:

What's the problem?

One of kernel developers' perennial complaints is that they just get Too Much Damn Email. Nobody in their right mind subscribes to “the LKML” (linux-kernel@vger.kernel.org) because it acts as a dumping ground for all email and the resulting firehose of patches and rants is completely impossible for a sane human being to follow.

For this reason, actual Linux development tends to happen on separate mailing lists dedicated to each particular subsystem. In turn, this has several negative side-effects:

  1. Developers working across multiple subsystems end up needing to subscribe to many different mailing lists in order to stay aware of what is happening in each area of the kernel.

  2. Contributors submitting patches find it increasingly difficult to know where to send their work, especially if their patches touch many different subsystems.

The get_maintainer.pl script is an attempt to solve problem #2 — it will look at the diff contents in order to suggest the list of recipients for each submitted patch. However, the submitter needs to be both aware of this script and know how to properly configure it in order to correctly use it with git-send-email.
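For illustration, one common way to wire it up is via git-send-email's cccmd setting — a sketch, run from the top of the kernel tree (--norolestats keeps the output parseable as plain addresses):

    git config sendemail.cccmd 'scripts/get_maintainer.pl --norolestats'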

Further complicating the matter is the fact that get_maintainer.pl relies on the entries in the MAINTAINERS file. Any edits to that file must go through the regular patch submission and review process and it may take days or weeks before the updates find their way to individual contributors.

Wouldn't it be nice if contributors could just send their patches to one place, and developers could just filter out the stuff that is relevant to their subsystem and ignore the rest?

lore meets lei

Public-inbox started out as a distributed mailing list archival framework with powerful search capabilities. We were happy to adopt it when we needed a proper home for kernel mailing list archives — thus, lore.kernel.org came online.

Even though it started out as merely a list archival service, it quickly became obvious that lore could be used for a lot more. Many developers ended up using its search features to quickly locate emails of interest, which in turn raised a simple question — what if there was a way to “save a search” and have it deliver all new incoming mail matching certain parameters straight to the developers' inbox?

You can now do this with lei.

lore's search syntax

Public-inbox uses Xapian behind the scenes, which makes it possible to narrowly tailor the keyword database to very specific needs.

For example, did you know that you can search lore.kernel.org for patches that touch specific files? Here's every patch that touched the MAINTAINERS file:

How about every patch that modifies a function that starts with floppy_:
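In lore's query syntax, those two searches are simply (using the same prefixes that appear in the larger query below):

    dfn:MAINTAINERS
    dfhh:floppy_*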

Say you're the floppy driver maintainer and want to find all mail that touches drivers/block/floppy.c, modifies any function that starts with floppy_, or has “floppy” in the subject — plus any other mail that mentions “floppy” together with the words “bug” or “regression”. And maybe limit the results to just the past month.

Here's the query:

    (dfn:drivers/block/floppy.c OR dfhh:floppy_* OR s:floppy
     OR ((nq:bug OR nq:regression) AND nq:floppy))
    AND rt:1.month.ago..

And here are the results:

Now, how about getting that straight into your mailbox, so you don't have to subscribe to the (very busy) linux-block list, if you are the floppy maintainer?

Installing lei

Lei is very new and probably isn't yet available as part of your distribution, but I hope that it will change quickly once everyone realizes how awesome it is.

I'm working on packaging lei for Fedora, so depending on when you're reading this, try dnf install lei — maybe it's already there. If it's not in Fedora proper yet, you can get it from my copr:

    dnf copr enable icon/b4
    dnf install lei

If you're not a Fedora user, just consult the INSTALL file:

Maildir or IMAP?

Lei can deliver search results either into a local maildir, or to a remote IMAP folder (or both). We'll do local maildir first and look at IMAP in a future follow-up, as it requires some preparatory work.

Getting going with lei-q

Let's take the exact query we used for the floppy drive above, and get lei to deliver entire matching threads into a local maildir folder that we can read with mutt:

    lei q -I https://lore.kernel.org/all/ -o ~/Mail/floppy \
      --threads --dedupe=mid \
      '(dfn:drivers/block/floppy.c OR dfhh:floppy_* OR s:floppy \
      OR ((nq:bug OR nq:regression) AND nq:floppy)) \
      AND rt:1.month.ago..'

Before you run it, let's understand what it's going to do:

  • -I https://lore.kernel.org/all/ will query the aggregated index that contains information about all mailing lists archived on lore.kernel.org. It doesn't matter to which list the patch was sent — if it's on lore, the query will find it.

  • -o ~/Mail/floppy will create a new Maildir folder and put the search results there. Make sure that this folder doesn't already exist, or lei will clobber anything already present there (unless you use --augment, but I haven't tested this very extensively yet, so best to start with a clean slate).

  • --threads will deliver entire threads even if the match is somewhere in the middle of the discussion. This is handy if, for example, someone says “this sounds like a bug in the floppy subsystem” somewhere in the middle of a conversation — --threads will automatically get you the entire conversation context.

  • --dedupe=mid will deduplicate results based on the message-id header. The default behaviour is to dedupe based on the body contents, but with so many lists still adding junky “sent to the foo list” footers, this tends to result in too many duplicated results. Passing --dedupe=mid is less safe (someone could sneak in a bogus message with an identical message-id and have it delivered to you instead), but more convenient. YMMV, BYOB.

  • Make sure you don't omit the final “..” in the rt: query parameter, or you will only get mail that was sent on that date, not since that date.

As always, backslashes and newlines are there just for readability — you don't need to use them.

After the command completes, you should get something similar to what is below:

    # /usr/bin/curl -Sf -s -d '' https://lore.kernel.org/all/?x=m&t=1&q=(omitted)
    # /home/user/.local/share/lei/store 0/0
    # https://lore.kernel.org/all/ 122/?
    # https://lore.kernel.org/all/ 227/227
    # 150 written to /home/user/Mail/floppy/ (227 matches)

A few things to notice here:

  1. The command actually executes a curl call and retrieves the results as an mbox file.
  2. Lei will automatically convert 1.month.ago into a precise timestamp.
  3. The command wrote 150 messages into the maildir we specified.

We can now view these results with mutt (or neomutt):

    neomutt -f ~/Mail/floppy

It is safe to delete mail from this folder — it will not get re-added during lei up runs, as lei keeps track of seen messages on its own.

Updating with lei-up

By default, lei q will save your search and start keeping track of it. To see your saved searches, run:

    $ lei ls-search
    /home/user/Mail/floppy

To fetch the newest messages:

    lei up ~/Mail/floppy

You will notice that the first line of output will say that lei automatically limited the results to only those that arrived since the last time lei was invoked for this particular saved search, so you will most likely get no new messages.

As you add more queries in the future, you can update them all at once using:

    lei up --all

Editing and discarding saved searches

To edit your saved search, just run lei edit-search. This will bring up your $EDITOR with the configuration file lei uses internally:

    ; to refresh with new results, run: lei up /home/user/Mail/floppy
    ; `maxuid' and `lastresult' lines are maintained by "lei up" for optimization
    [lei]
        q = (dfn:drivers/block/floppy.c OR dfhh:floppy_* OR s:floppy OR \
            ((nq:bug OR nq:regression) AND nq:floppy)) AND rt:1.month.ago..
    [lei "q"]
        include = https://lore.kernel.org/all/
        external = 1
        local = 1
        remote = 1
        threads = 1
        dedupe = mid
        output = maildir:/home/user/Mail/floppy
    [external "/home/user/.local/share/lei/store"]
        maxuid = 4821
    [external "https://lore.kernel.org/all/"]
        lastresult = 1636129583

This lets you edit the query parameters if you want to add/remove specific keywords. I suggest you test them on lore.kernel.org first before putting them into the configuration file, just to make sure you don't end up retrieving tens of thousands of messages by mistake.

To delete a saved search, run:

    lei forget-search ~/Mail/floppy

This doesn't delete anything from ~/Mail/floppy, it just makes it impossible to run lei up to update it.

Subscribing to entire mailing lists

To subscribe to entire mailing lists, you can query based on the list-id header. For example, if you wanted to replace your individual subscriptions to linux-block and linux-scsi with a single lei command, do:

    lei q -I https://lore.kernel.org/all/ -o ~/Mail/lists --dedupe=mid \
      '(l:linux-block.vger.kernel.org OR l:linux-scsi.vger.kernel.org) AND rt:1.week.ago..'

You can always edit this to add more lists at any time.

Coming next

In the next series installment, I'll talk about how to deliver these results straight to a remote IMAP folder and how to set up a systemd timer to get newest mail automatically (if that's your thing — I prefer to run lei up manually and only when I'm ready for it).

Linux development depends on the ability to send and receive emails. Unfortunately, it is common for corporate gateways to post-process both outgoing and incoming messages with the purpose of adding lengthy legal disclaimers or performing anti-phishing link quarantines, both of which interfere with regular patch flow.

While it is possible to configure free options like Gmail to work well with sending and receiving patches, Google services may not be available in all geographic regions — or there may be other reasons why someone may prefer not to have a gmail.com address.

For this reason, we have partnered with Migadu to provide a mail hosting service under the linux.dev domain. If you're a Linux subsystem maintainer or reviewer and you need a mailbox to do your work, we should be able to help you out.

We hope to expand the service to include other kernel developers in the near future.

Please see our service documentation page for full details.

asciicast

One of the side-effects of the recent UMN Affair has been renewed scrutiny of the kernel development process that continues to rely on patches sent via email. This prompted me to revisit my end-to-end patch attestation work and get it to the point where I consider it to be both stable for day-to-day work and satisfactory from the point of view of underlying security and usability.

Goals of the project

These were the goals at the outset:

  • make it opt-in and don't interfere with existing tooling and workflows
  • be as behind-the-scenes and non-intrusive as possible
  • be simple and easy to understand, explain, and audit

I believe the proposed solution hits all of these points:

  • the implementation is very similar to DKIM and uses email headers for cryptographic attestation of all relevant content (“From:” and “Subject:” headers, plus the message body). Any existing tooling will simply ignore the unrecognized header.
  • cryptographic signing is done via a git hook invoked automatically by git-send-email (sendemail-validate), so it only needs to be set up once and doesn't require remembering to do any extra steps
  • the library doing the signing is only a few hundred lines of Python code and reuses the DKIM standard for most of its logic

Introducing patatt

The library is called “patatt” (for Patch Attestation, obviously), and can be installed from PyPI:

  • pip install --user patatt

It only requires PyNaCl (Python libsodium bindings), git, and gnupg (if signing with a PGP key). The detailed principles of operation are described on the PyPI project page, so I will not duplicate them here.
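In broad strokes, the setup boils down to wiring patatt into git-send-email's sendemail-validate hook — a sketch along the lines of the project documentation (the exact invocation may differ, so treat the PyPI page as authoritative):

cat > .git/hooks/sendemail-validate <<'EOF'
#!/bin/sh
# sign each outgoing patch before git-send-email sends it out
patatt sign --hook "${1}"
EOF
chmod a+x .git/hooks/sendemail-validate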

The screencast linked above shows patatt in action from the point of view of a regular patch contributor.

If you have an hour or so, you can also watch my presentation to the Digital Identity Attestation WG:

Youtube video

Support in b4

Patatt is fully supported starting with version 0.7.0 of b4 — here it is in action verifying a patch from Greg Kroah-Hartman:

$ b4 am 20210527101426.3283214-1-gregkh@linuxfoundation.org
[...]
---
  ✓ [PATCH] USB: gr_udc: remove dentry storage for debugfs file
  ---
  ✓ Signed: openpgp/gregkh@linuxfoundation.org
  ✓ Signed: DKIM/linuxfoundation.org
---
Total patches: 1
---
[...]

As you see above, b4 verified that the DKIM header was valid and that the PGP signature from Greg Kroah-Hartman passed as well, giving double assurance that the message was not modified between leaving Greg's computer and being checked on the end-system of the person retrieving the patch.

Keyring management

Patatt (and b4) also introduce the idea of tracking contributor public keys in the git repository itself. It may sound silly — how can the repository itself be a source of trusted keys? However, it actually makes a lot of sense and isn't any worse than any other currently used public key distribution mechanism:

  • git is already decentralized and can be mirrored to multiple locations, avoiding any single points of failure
  • all contents are already versioned and key additions/removals can be audited and “git blame’d”
  • git commits themselves can be cryptographically signed, which allows a small subset of developers to act as “trusted introducers” to many other contributors (mimicking the “keysigning” process)

Contributor public keys can be added either to the main branch itself, along with the project codebase (perhaps in a .keys toplevel subdirectory), or to a dedicated ref, such as refs/meta/keyring. The latter can be especially useful for large projects where patches are collected by subsystem maintainers and then submitted as pull requests for inclusion into the mainline repository. Keeping the keyring in its own ref ensures that it stays out of the way of regular development work but is still centrally managed and tracked.
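For example, someone wanting to verify patches could pull down the dedicated keyring ref without it ever touching their working tree — a sketch, assuming the remote actually publishes refs/meta/keyring:

git fetch origin refs/meta/keyring:refs/meta/keyring
git archive refs/meta/keyring | tar -x -C /tmp/keyring  # unpack the keys for inspection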

Further work

I am hoping that people will now start using cryptographic attestation for the patches they send; however, I certainly can't force anyone's hand. If you are a kernel subsystem maintainer or a core developer of some other project that relies on mailed-in patches for the submission and code review process, I hope that you will give this a try.

If you have any comments, concerns, or improvement suggestions, please reach out to the tools list.

We have recently announced the availability of our new mailing list platform that will eventually take on the duties currently performed by vger. Off the bat, there were a few questions about how it works under the hood — especially regarding DMARC-friendly configuration.

Under the hood

There is nothing fancy about the setup — mailing lists are managed by mlmmj, while all delivery operations are handled by Postfix. All outgoing mail is delivered via kernel.org mirror edge nodes (see the output of dig -t txt +short _spf.kernel.org if you are curious what they are), which is mostly done to speed up delivery by spreading out the queue across several systems.

When mlmmj writes to the archive directory of the mailing list, the message is immediately picked up by public-inbox-watch and appended to the public-inbox archive for that list. The archive is then replicated to lore.kernel.org (using grokmirror integration), which usually happens within 60 seconds. This replication happens in parallel with Postfix delivering mail to list subscribers, so it is not dependent on the size of the mail queue. In theory, it shouldn't take longer than a few minutes for a message sent to a lists.linux.dev address to show up on lore.kernel.org. Similarly, messages should never go missing from the public-inbox archive if they got accepted by mlmmj for delivery (I know, famous last words).

Appeasing DMARC

It's a common misconception that mailing lists are somehow incompatible with DMARC. There are two key principles to follow:

  1. The Envelope-From should be that of the mailing list domain. For example, if I send an email to linux-staging, the envelope-from of the outgoing message will be changed to linux-staging+bounces-x@lists.linux.dev, with some subscriber bounce tracking information in place of x. This way, when MTAs are looking at DMARC for “konstantin@linuxfoundation.org”, the SPF check will be performed against “lists.linux.dev” instead of the domain of the original sender (“linuxfoundation.org”).

  2. There should be no changes to any of the existing message headers and no modifications to the message body. This is actually the part that generally trips up mailing list operators, as it is a long-standing practice to do two things when it comes to mailing lists: modify the subject to insert a terse list identifier (e.g. Subject: [linux-staging] [PATCH] ...) and append a footer to the message body with mailing list administrative info. Doing either of these things will almost certainly invalidate DKIM signatures and therefore cause the message to fail the DMARC check. It is correct to add proper List-Id/List-Subscribe/etc headers, though — and hopefully the domain of the original sender isn't misconfigured to include these headers into the DKIM signature (true story, saw this happen).

Following the above advice will work for nearly all cases except where a domain sets a DMARC policy, but the message is sent without a DKIM signature. If this happens, DMARC validators are supposed to use a kludgy “alignment” check where the envelope-from must match the From: header. In that particular case the messages we send out will fail DMARC checks, unfortunately. As far as I'm concerned, this is the fault of domain owners and is properly fixed by setting up proper DKIM signing and giving users a way to send outgoing mail via proper SMTP gateways.

(There is a way to work around this by rewriting the “From:” header so that it matches the list domain as well, but let's just not go there, as rewriting the From: header is not an acceptable solution for lists working with code reviews.)
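Incidentally, if you are curious what policy a given sending domain publishes, you can query its DMARC record directly, just like the SPF lookup shown earlier (domain picked purely for illustration):

dig -t txt +short _dmarc.linuxfoundation.org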

Here's a write-up I randomly found while writing this post that goes into some more detail regarding DMARC and mailing lists.

Why no ARC headers?

We don't currently add ARC headers — as far as I can tell, they aren't required for operating a mailing list that properly sets the envelope-from. In theory, using ARC signing may help with the “DMARC with no DKIM” corner-case above, but I'm not convinced this is worth the crazy header bloat. Who knows, I may change my mind about this in the future.

Parting words

In short, the best way to assure that a message sent via subspace.kernel.org is delivered to all subscribers is to send it from a domain that properly DKIM-signs all mail. If you run your own server, you can either set up OpenDKIM on your own (it's not complicated, honest), or you can pay some money to a company like Mailgun to do it for you.

Many people know that you can PGP-sign git objects — such as tags or commits themselves — but very few know of another attestation feature that git provides, which is signed git pushes.

Why sign git pushes? And how are they different from signed tags/commits?

Signed commits are great, but one thing they do not indicate is intent. For example, you could write some dangerous proof-of-concept code and push it into refs/heads/dangerous-do-not-use. You could even push it into some other fork hosted on a totally different server, just to make it clear that this is not production-ready code.

However, if your commits are PGP-signed, someone could take them and replay over any other branch in any other fork of your repository. To anyone checking the commit signatures, everything will look totally legitimate, as the actual commits are signed by you — never mind that they contain dangerous vulnerable code and were never intended to be pushed into something like refs/heads/next. At the very least, you will look reckless for pushing bad code, even though you were just messing around in a totally separate environment set up specifically for experimentation.

To help hedge against this problem, git provides developers a way to sign their actual pushes, as a means to attest “yes, I actually did intend to push these commits into this ref in this repository on this server, and here's my PGP signature to prove it.” When a push is signed, git will both check the signature it received against a trusted keyring and generate a “push certificate” that can be logged in something like a transparency log:

https://git.kernel.org/pub/scm/infra/transparency-logs/gitolite/git/1.git/plain/m?id=c06eebe4875d6103d580efcf8cd78cc9cc4b5192

Now, before you rush to enable signed pushes, please keep in mind that this functionality needs to first be enabled on the server side, and the vast majority of public git hosting forges do NOT have this turned on. Thankfully, git provides an if-asked setting, which will first check if the remote server supports signed pushes, and only generate the push certificate if the remote server accepts them. To enable this feature for yourself, simply add the following to your ~/.gitconfig:

[push]
    gpgSign = if-asked
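You can also request a signed push explicitly for a single push on the command line:

git push --signed=if-asked origin master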

Enabling on the server side

If you are running your own git server, then it is easy to enable this on the server side. Add the following either to each repository config file, or to /etc/gitconfig to enable it globally:

[receive]
    advertisePushOptions = true
    certNonceSeed = "<uniquerandomstring>"

You should set the certNonceSeed setting to some randomly generated long string that should be kept secret. It is combined with the timestamp to generate a one-time value (“nonce”) that the git client is required to sign and provides both a mechanism to prevent replay attacks and to offer proof that the certificate was generated for that specific server (though for others to verify this, they would need to know the value of the nonce seed).

Once you have this feature enabled, it is up to you what you do with the generated certificates. You can simply opt to record them, just like we do with our transparency log, or you can actually reject pushes that do not come with valid push certificates. I refer you to the git documentation and to our post-receive-activity-feed hook, which we use to generate the transparency log:

This week we made public all of our git commit logs, going back to 2013, in hopes of increasing the transparency of high-importance kernel.org operations. All writes performed on public git repositories are now recorded in a public-inbox feed, which is immediately replicated to multiple worldwide servers. This is done with the goal of making it difficult for someone to make changes to any git repository hosted on kernel.org without it generating a verifiable, tamper-evident record.

The transparency logs are available at the following address:

https://git.kernel.org/pub/scm/infra/transparency-logs/gitolite/git/

You can read more detailed documentation here:

https://korg.docs.kernel.org/gitolite/transparency-log.html

You can have lore.kernel.org mailing lists delivered right into your inbox straight from the git archive (in fact, this will work for any public-inbox server, not just for lore.kernel.org). It's efficient and (optionally) preserves a full copy of entire list archives on your system — should you wish to keep them.

Note: this requires grokmirror-2.0.2+, as earlier versions do not come with the grok-pi-piper utility.

Installing grokmirror-2.0

The easiest way is to install it from pip:

pip install --user grokmirror~=2.0.2

You may have grokmirror available from your distro packages, too, but make sure it's version 2.0.2 or above.

Installing procmail

Procmail should be available with your distribution, so install it like any other package.

Configuring procmail

Procmail configuration can easily be a topic for a whole book in itself, but if you just want to have messages delivered into your inbox, all you have to do is create a ~/.procmailrc with the following contents:

DEFAULT=$HOME/Mail/

# Don't deliver duplicates sent to multiple lists
:0 Wh: .msgid.lock
| formail -D 8192 .msgid.cache

If your mailbox is not in ~/Mail, then you should adjust the above accordingly.

Configuring grokmirror

Create a ~/.config/lore.conf with the following contents. We'll use three lists as examples: git, linux-hardening, and linux-doc, but you'll obviously want to use the lists you care about. You can see which lists are available from https://lore.kernel.org/lists, or the exact git repositories on https://erol.kernel.org/.

[core]
toplevel = ~/.local/share/grokmirror/lore
log = ${toplevel}/grokmirror.log

[remote]
site = https://lore.kernel.org
manifest = https://lore.kernel.org/manifest.js.gz

[pull]
post_update_hook = ~/.local/bin/grok-pi-piper -c ~/.config/pi-piper.conf
refresh = 300
include = /git/*
          /linux-hardening/*
          /linux-doc/*

The above assumes that you installed grokmirror with pip install --user. Now make the toplevel directory for the git repos:

$ mkdir -p ~/.local/share/grokmirror/lore

Configuring pi-piper

The last step is to create ~/.config/pi-piper.conf:

[DEFAULT]
pipe = /usr/bin/procmail
shallow = yes

The important bit here is shallow = yes. Public-inbox stores every mail message as a separate commit, so once a message is piped to procmail and delivered, we usually don't care about keeping a copy of that commit any more. If you set shallow = yes, pi-piper will prune all but the last successfully processed commit out of your local git copy by turning those repos into shallow git repositories. This helps to greatly save disk space, especially for large archives.

If you do want to keep full archives, then don't set shallow. You can change your mind at any time by running git fetch _grokmirror master --unshallow in each underlying git repository (you can find them in ~/.local/share/grokmirror/lore/).

You can also specify the shallow option per list:

[DEFAULT]
pipe = /usr/bin/procmail

[linux-hardening]
shallow = yes

Running grok-pull

You can now run grok-pull to get the initial repo copies. Note that during the first run grokmirror will perform full clones even if you specified shallow = yes in the pi-piper config, so it may take some time for large archives like those for the git list. However, once the pi-piper hook runs, they will be repacked to almost nothing. Future versions of grokmirror may become smarter about this and perform shallow clones from the beginning.

During the initial pi-piper run, there will be no mail delivered, as it will just perform initial setup and make a note where the HEAD is pointing. If you run grok-pull again, two things may happen:

  1. There will be no changes and grok-pull will exit right away
  2. If there are changes, they will be fetched and the hook will deliver them to procmail (and to your inbox)

Running in the background

You can run grok-pull in the background, where it will check for updates as frequently as the refresh setting says (300 seconds in the example above).

You can either background it “the old way”:

grok-pull -o -c ~/.config/lore.conf &

Or the new way, using a systemd user service:

$ cat .config/systemd/user/grok-pull@.service

[Unit]
Description=Grok-pull service for %I
ConditionPathExists=%h/.config/%i.conf

[Service]
ExecStart=%h/.local/bin/grok-pull -o -c %h/.config/%i.conf
Type=simple
Restart=on-failure

[Install]
WantedBy=default.target

$ systemctl --user enable grok-pull@lore
$ systemctl --user start grok-pull@lore

If you make changes to ~/.config/lore.conf, for example to add new lists, you will need to restart the service:

$ systemctl --user restart grok-pull@lore

Combining with mbsync

You can totally combine this with mbsync and deliver into the same local inbox. As a perk, any messages injected from grokmirror will be uploaded to your remote IMAP mailbox. See this post from mcgrof about configuring mbsync:

Troubles

Email tools@linux.kernel.org if you have any trouble getting the above to work. The grok-pi-piper utility is fairly new, so it's entirely possible that it's full of bugs.

With git, a cryptographic signature on a commit provides strong integrity guarantees of the entire history of that branch going backwards, including all metadata and all contents of the repository, all the way back to the initial commit. This is possible because git records the hash of the previous commit in each next commit's metadata, creating an unbreakable cryptographic chain of records. If you can verify the cryptographic signature at the tip of the branch, you effectively verify that branch's entire history.

For example, let's take a look at linux.git, where the latest tag at the time of writing, v5.8-rc7, is signed by Linus Torvalds. (Tag signatures are slightly different from commit signatures — but in every practical sense they offer the same guarantees.)

If you have a git checkout of torvalds/linux.git, you are welcome to follow along.

$ git cat-file -p v5.8-rc7
object 92ed301919932f777713b9172e525674157e983d
type commit
tag v5.8-rc7
tagger Linus Torvalds <torvalds@linux-foundation.org> 1595798046 -0700

Linux 5.8-rc7
-----BEGIN PGP SIGNATURE-----

iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8d8h4eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGd0sH/2iktYhMwPxzzpnb
eI3OuTX/mRn4vUFOfpx9dmGVleMfKkpbvnn3IY7wA62Qfv7J7lkFRa1Bd1DlqXfW
yyGTGDSKG5chiRCOU3s9ni92M4xIzFlrojyt/dIK2lUGMzUPI9FGlZRGQLKqqwLh
2syOXRWbcQ7e52IHtDSy3YBNveKRsP4NyqV+GxGiex18SMB/M3Pw9EMH614eDPsE
QAGQi5uGv4hPJtFHgXgUyBPLFHIyFAiVxhFRIj7u2DSEKY79+wO1CGWFiFvdTY4B
CbqKXLffY3iQdFsLJkj9Dl8cnOQnoY44V0EBzhhORxeOp71StUVaRwQMFa5tp48G
171s5Hs=
=BQIl
-----END PGP SIGNATURE-----

If we have Linus's key in our GnuPG keyring, we can easily verify that this tag is valid:

$ git verify-tag v5.8-rc7
gpg: Signature made Sun 26 Jul 2020 05:14:06 PM EDT
gpg:                using RSA key ABAF11C65A2970B130ABE3C479BE3E4300411886
gpg:                issuer "torvalds@linux-foundation.org"
gpg: Good signature from "Linus Torvalds <torvalds@kernel.org>" [unknown]
gpg:                 aka "Linus Torvalds <torvalds@linux-foundation.org>" [full]

The entire contents of this tag are signed, so this tells us that when Linus signed the tag, the “object hash” on his system was 92ed301919932f777713b9172e525674157e983d. But what exactly is that “object hash”? What are the contents that are hashed here? We can find out by asking git to tell us more about that object:

$ git cat-file -p 92ed301919932f777713b9172e525674157e983d
tree f16e3e4bcea2d875a17d2278ff67364b3277b10a
parent 1c8594b8427290c178c5d39885eacd9e41f68743
author Linus Torvalds <torvalds@linux-foundation.org> 1595798046 -0700
committer Linus Torvalds <torvalds@linux-foundation.org> 1595798046 -0700

Linux 5.8-rc7

The above contents in their entirety (slightly differently formatted) are what give us the sha1 hash 92ed301919932f777713b9172e525674157e983d. So, thus far, we have unbroken cryptographic attestation from Linus's PGP signature to several other important bits about his git repository:

  • information about the state of his source code (tree)
  • information about the previous commit in the history (parent)
  • information about the author of the commit and the committer, which are one and the same in this particular case
  • information about the date and time when the commit was made

Let's take a look at the tree line — what contents were hashed to arrive at that checksum? Let's ask git:

$ git cat-file -p f16e3e4bcea2d875a17d2278ff67364b3277b10a
100644 blob a0a96088c74f49a961a80bc0851a84214b0a9f83    .clang-format
100644 blob 43967c6b20151ee126db08e24758e3c789bcb844    .cocciconfig
100644 blob a64d219137455f407a7b1f2c6b156c5575852e9e    .get_maintainer.ignore
100644 blob 4b32eaa9571e64e47b51c43537063f56b204d8b3    .gitattributes
100644 blob d5f4804ed07cd36336a5e80f2a24e45104f902cf    .gitignore
100644 blob db4f2295bd9d792b47eb77aab179a9db0d968454    .mailmap
100644 blob a635a38ef9405fdfcfe97f3a435393c1e9cae971    COPYING
100644 blob 0787b5872906c8a92a63cde3961ed630e2ec93b6    CREDITS
040000 tree 37e1b4166d912d69738beca645d3d539da4bbf30    Documentation
[...]
040000 tree ba6955ee6228666d9ef117fdd45df2e53ba0e221    virt

This is the entirety of the top-level Linux kernel directory contents. The blob entries are sha1 hashes of the actual file contents in that directory, so these are straightforward. Subdirectories are represented as other tree entries, which also consist of blob and tree records going all the way down to the last sublevel, which will only contain blobs.

So, tree f16e3e4bcea2d875a17d2278ff67364b3277b10a in the commit record is a checksum of other checksums, and it allows us to verify that each and every file in linux.git is exactly the same as it was on Linus Torvalds' system when he created the commit. If any file is changed, the tree checksum would be different and the whole repository would be considered invalid, because the object hash would be different from the one recorded in the commit.
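Incidentally, you can ask git to recompute and verify these hashes across the entire object store at any time:

$ git fsck
# re-hashes every object and checks that everything referenced by
# commits and trees (parents, subtrees, blobs) is present and intact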

Finally, if we look at the object mentioned in parent 1c8594b8427290c178c5d39885eacd9e41f68743, we will see that it is a hash of another commit, containing its own tree and parent records:

$ git cat-file -p 1c8594b8427290c178c5d39885eacd9e41f68743
tree d56de40028d9ecdbebfc2121fd1ce1213fa09fa2
parent 40c60ac32174f0c0c090cd31d0d1712f2478e689
parent ca9b31f6bb9c6aa9b4e5f0792f39a97bbffb8c51
author Linus Torvalds <torvalds@linux-foundation.org> 1595796417 -0700
committer Linus Torvalds <torvalds@linux-foundation.org> 1595796417 -0700
mergetag object ca9b31f6bb9c6aa9b4e5f0792f39a97bbffb8c51
[...]

If we cared to, we could walk each commit all the way back to the beginning of Linux git history, but we don't need to do that — verifying the checksum of the latest commit is sufficient to provide us all the necessary assurances about the entire history of that tree.

So, if we verify the signature on the tag and confirm that it matches the key belonging to Linus Torvalds, we will have strong cryptographic assurances that the repository on our disk is byte-for-byte the same as the repository on the computer belonging to Linus Torvalds — with all its contents and its entire history going back to the initial commit.

The difference between signed tags and signed commits is minimal — in the case of commit signing, the signature goes into the commit object itself. It is generally a good practice to PGP-sign commits, particularly in environments where multiple people can push to the same repository branch. Signed commits provide easy forensic proof of code origins (e.g. without commit signing Alice can fake a commit to pretend that it was actually authored by Bob). It also allows for easy verification in cases where someone wants to cherry-pick specific commits into their own tree without performing a git merge.
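For completeness, a minimal signed-commit round-trip looks like this (assuming your signing key is already configured in git):

$ git commit -S -m "Add important feature"  # create a signed commit
$ git verify-commit HEAD                    # check the signature on one commit
$ git log --show-signature -1               # or show signature status in the log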

If you are looking to get started with git and PGP signatures, I can recommend my own Protecting Code Integrity guide, or its kernel-specific adaptation that ships with the kernel docs: Kernel Maintainer PGP Guide.

Obligatory note: sha1 is not considered sufficiently strong for hashing purposes these days, and this is widely acknowledged by the git development community. Significant efforts are under way to migrate git to stronger cryptographic hashes, but they require careful planning and implementation in order to minimize disruption to various projects using git. To my knowledge, there are no effective attacks against sha1 as used by git, and git developers have added further precautions against sha1 collision attacks in git itself, which helps buy some time until stronger hashing implementations are considered ready for real-world use.