people.kernel.org

Reader

Read the latest posts from people.kernel.org.

from Konstantin Ryabitsev

For the past few weeks I've been working on a tool to fetch patches from lore.kernel.org and perform the kind of post-processing that is common for most maintainers:

  • rearrange the patches in proper order
  • tally up various follow-up trailers like Reviewed-by, Acked-by, etc
  • check if a newer series revision exists and automatically grab it

The tool started out as get-lore-mbox, but has now graduated into its own project called b4 — you can find it on git.kernel.org and pypi.

To use it, all you need to know is the message-id of one of the patches in the thread you want to grab. Once you have that, you can use the lore.kernel.org archive to grab the whole thread and prepare an mbox file that is ready to be fed to git-am:

$ b4 am 20200312131531.3615556-1-christian.brauner@ubuntu.com
Looking up https://lore.kernel.org/r/20200312131531.3615556-1-christian.brauner@ubuntu.com
Grabbing thread from lore.kernel.org
Analyzing 26 messages in the thread
Found new series v2
Will use the latest revision: v2
You can pick other revisions using the -vN flag
---
Writing ./v2_20200313_christian_brauner_ubuntu_com.mbx
  [PATCH v2 1/3] binderfs: port tests to test harness infrastructure
    Added: Reviewed-by: Kees Cook <keescook@chromium.org>
  [PATCH v2 2/3] binderfs_test: switch from /dev to a unique per-test mountpoint
    Added: Reviewed-by: Kees Cook <keescook@chromium.org>
  [PATCH v2 3/3] binderfs: add stress test for binderfs binder devices
    Added: Reviewed-by: Kees Cook <keescook@chromium.org>
---
Total patches: 3
---
 Link: https://lore.kernel.org/r/20200313152420.138777-1-christian.brauner@ubuntu.com
 Base: 2c523b344dfa65a3738e7039832044aa133c75fb
       git checkout -b v2_20200313_christian_brauner_ubuntu_com 2c523b344dfa65a3738e7039832044aa133c75fb
       git am ./v2_20200313_christian_brauner_ubuntu_com.mbx

As you can see, it was able to:

  • grab the whole thread
  • find the latest revision of the series (v2)
  • tally up the Reviewed-by trailers from Kees Cook and insert them into proper places
  • save all patches into an mbox file
  • show the commit-base (since it was specified)
  • show example git checkout and git am commands

Pretty neat, eh? You don't even need to know on which list the thread was posted — lore.kernel.org, through the magic of public-inbox, will try to find it automatically.

If you want to try it out, you can install b4 using:

pip install b4

(If you are wondering about the name, then you should click the following links: V'ger, Lore, B-4.)

The same, but now with patch attestation

On top of that, b4 also introduces support for cryptographic patch attestation, which makes it possible to verify that patches (and their metadata) weren't modified in transit between developers. This is still an experimental feature, but initial tests have been pretty encouraging.

I tried to design this mechanism so it fulfills the following requirements:

  • it must be unobtrusive and not pollute the mailing lists with attestation data
  • it must be possible to submit attestation after the patches were already sent off to the list (for example, from a different system, or after being asked to do so by the maintainer/reviewer)
  • it must not invent any new crypto or key distribution routines; this means sticking with PGP/GnuPG — at least for the time being

If you are curious about the technical details, I refer you to my original RFC where I describe the implementation.

If you simply want to start using it, then read on.

Submitting patch attestation

If you would like to submit attestation for a patch or a series of patches, the best time to do that is right after you use git send-email to submit your patches to the list. Simply run the following:

b4 attest *.patch

This will do the following:

  • create a set of 3 hashes per each patch (for the metadata, for the commit message, and for the patch itself)
  • add these hashes to a YAML-style document
  • PGP-sign the attestation document using the PGP key you set up with git
  • connect to mail.kernel.org:587 and send the attestation document to the signatures@kernel.org mailing list.

If you don't want to send that attestation right away, use the -n flag to simply generate the message and save it locally for review.

Verifying patch attestation

When running b4 am, the tool will automatically check if attestation is available by querying the signatures archive on lore.kernel.org. If it finds the attestation document, it will run gpg --verify on it. All of the following checks must pass before attestation is accepted:

  1. The signature must be “good” (signed contents weren't modified)
  2. The signature must be “valid” (not done with a revoked/expired key)
  3. The signature must be “trusted” (more on this below)

If all these checks pass, b4 am will show validation checkmarks next to the patches as it processes them:

$ b4 am 202003131609.228C4BBEDE@keescook
Looking up https://lore.kernel.org/r/202003131609.228C4BBEDE@keescook
Grabbing thread from lore.kernel.org
...
---
Writing ./v2_20200313_keescook_chromium_org.mbx
  [✓] [PATCH v2 1/2] selftests/harness: Move test child waiting logic
  [✓] [PATCH v2 2/2] selftests/harness: Handle timeouts cleanly
  ---
  [✓] Attestation-by: Kees Cook <keescook@chromium.org> (pgp: 8972F4DFDC6DC026)
---
Total patches: 2
---
...

These checkmarks give you assurance that all patches are exactly the same as when they were generated by the developer on their system.

Trusting on First Use (TOFU)

The most bothersome part of PGP is key management. In fact, it's the most bothersome part of any cryptographic attestation scheme — you either have to delegate your trust management to some shadowy Certification Authority, or you have to do a lot of decision making of your own when evaluating which keys to trust.

GnuPG tries to make it a bit easier by introducing the “Trust on First Use” (TOFU) model. The first time you come across a key, it is considered automatically trusted. If you suddenly come across a different key with the same identity on it, GnuPG will mark both keys as untrusted and let you decide on your own which one is “the right one.”

If you want to use the TOFU trust policy for patch attestation, you can add the following configuration parameter to your $HOME/.gitconfig:

[b4]
  attestation-trust-model = tofu

Alternatively, you can use the traditional GnuPG trust model, where you rely on cross-certification (“key signing”) to make a decision on which keys you trust.

Where to get help

If either b4 or patch attestation are breaking for you — or with any questions or comments — please reach out for help on the kernel.org tools mailing list:

  • tools@linux.kernel.org
 
Read more...

from David Ahern

Running docker service over management VRF requires the service to be started bound to the VRF. Since docker and systemd do not natively understand VRF, the vrf exec helper in iproute2 can be used.

This series of steps worked for me on Ubuntu 19.10 and should work on 18.04 as well:

  • Configure mgmt VRF and disable systemd-resolved as noted in a previous post about management vrf and DNS

  • Install docker-ce

  • Edit /lib/systemd/system/docker.service and add /usr/sbin/ip vrf exec mgmt to the Exec lines like this:

    ExecStart=/usr/sbin/ip vrf exec mgmt /usr/bin/dockerd -H fd://
    --containerd=/run/containerd/containerd.sock
    
  • Tell systemd about the change and restart docker

    systemctl daemon-reload
    systemctl restart docker
    

With that, docker pull should work fine – in mgmt vrf or default vrf.

 
Read more...

from David Ahern

Someone recently asked me why apt-get was not working when he enabled management VRF on Ubuntu 18.04. After a few back and forths and a little digging I was reminded of why. The TL;DR is systemd-resolved. This blog post documents how I came to that conclusion and what you need to do to use management VRF with Ubuntu (or any OS using a DNS caching service such as systemd-resolved).

The following example is based on a newly created Ubuntu 18.04 VM. The VM comes up with the 4.15.0-66-generic kernel which is missing the VRF module:

$ modprobe vrf
modprobe: FATAL: Module vrf not found in directory /lib/modules/4.15.0-66-generic

despite VRF being enabled and built:

$ grep VRF /boot/config-4.15.0-66-generic
CONFIG_NET_VRF=m

which is really weird.[4] So for this blog post I shifted to the v5.3 HWE kernel: $ sudo apt-get install --install-recommends linux-generic-hwe-18.04

although nothing about the DNS problem is kernel specific. A 4.14 or better kernel with VRF enabled and usable will work.

First, let's enable Management VRF. All of the following commands need to be run as root. For simplicity getting started, you will want to enable this sysctl to allow sshd to work across VRFs:

    echo "net.ipv4.tcp_l3mdev_accept=1" >> /etc/sysctl.d/99-sysctl.conf
    sysctl -p /etc/sysctl.d/99-sysctl.conf

Advanced users can leave that disabled and use something like the systemd instances to run sshd in Management VRF only.[1]

Ubuntu has moved to netplan for network configuration, and apparently netplan is still missing VRF support despite requests from multiple users since May 2018: https://bugs.launchpad.net/netplan/+bug/1773522

One option to workaround the problem is to put the following in /etc/networkd-dispatcher/routable.d/50-ifup-hooks:

#!/bin/bash

ip link show dev mgmt 2>/dev/null
if [ $? -ne 0 ]
then
        # capture default route
        DEF=$(ip ro ls default)

        # only need to do this once
        ip link add mgmt type vrf table 1000
        ip link set mgmt up
        ip link set eth0 vrf mgmt
        sleep 1

        # move the default route
        ip route add vrf mgmt ${DEF}
        ip route del default

        # fix up rules to look in VRF table first
        ip ru add pref 32765 from all lookup local
        ip ru del pref 0
        ip -6 ru add pref 32765 from all lookup local
        ip -6 ru del pref 0
fi
ip route del default

The above assumes eth0 is the nic to put into Management VRF, and it has a static IP address. If using DHCP instead of a static route, create or update the dhclient-exit-hook to put the default route in the Management VRF table.[3] Another option is to use ifupdown2 for network management; it has good support for VRF.[1]

Reboot node to make the changes take effect.

WARNING: If you run these commands from an active ssh session, you will lose connectivity since you are shifting the L3 domain of eth0 and that impacts existing logins. You can avoid the reboot by running the above commands from console.

After logging back in to the node with Management VRF enabled, the first thing to remember is that when VRF is enabled network addresses become relative to the VRF – and that includes loopback addresses (they are not that special).

Any command that needs to contact a service over the Management VRF needs to be run in that context. If the command does not have native VRF support, then you can use 'ip vrf exec' as a helper to do the VRF binding. 'ip vrf exec' uses a small eBPF program to bind all IPv4 and IPv6 sockets opened by the command to the given device ('mgmt' in this case) which causes all routing lookups to go to the table associated with the VRF (table 1000 per the setting above).

Let's see what happens:

$ ip vrf exec mgmt apt-get update
Err:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'
Err:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
  Temporary failure resolving 'security.ubuntu.com'
Err:3 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'
Err:4 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'

Theoretically, this should Just Work, but it clearly does not. Why?

Ubuntu uses systemd-resolved service by default with /etc/resolv.conf configured to send DNS lookups to it:

$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Oct 21 15:48 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

$ cat /etc/resolv.conf
...
nameserver 127.0.0.53
options edns0

So when a process (e.g., apt) does a name lookup, the message is sent to 127.0.0.53/53. In theory, systemd-resolved gets the request and attempts to contact the actual nameserver.

Where does the theory breakdown? In 3 places.

First, 127.0.0.53 is not configured for Management VRF, so attempts to reach it fail:

$ ip ro get vrf mgmt 127.0.0.53
127.0.0.53 via 157.245.160.1 dev eth0 table 1000 src 157.245.160.132 uid 0
    cache

That one is easy enough to fix. The VRF device is meant to be the loopback for a VRF, so let's add the loopback addresses to it:

    $ ip addr add dev mgmt 127.0.0.1/8
    $ ip addr add dev mgmt ::1/128

    $ ip ro get vrf mgmt 127.0.0.53
    127.0.0.53 dev mgmt table 1000 src 127.0.0.1 uid 0
        cache

The second problem is systemd-resolved binds its socket to the loopback device:

    $ ss -apn | grep systemd-resolve
    udp  UNCONN   0      0       127.0.0.53%lo:53     0.0.0.0:*     users:(("systemd-resolve",pid=803,fd=12))
    tcp  LISTEN   0      128     127.0.0.53%lo:53     0.0.0.0:*     users:(("systemd-resolve",pid=803,fd=13))

The loopback device is in the default VRF and can not be moved to Management VRF. A process bound to the Management VRF can not communicate with a socket bound to the loopback device.

The third issue is that systemd-resolved runs in the default VRF, so its attempts to reach the real DNS server happen over the default VRF. Those attempts fail since the servers are only reachable from the Management VRF and systemd-resolved has no knowledge of it.

Since systemd-resolved is hardcoded (from a quick look at the source) to bind to the loopback device, there is no option but to disable it. It is not compatible with Management VRF – or VRF at all.

$ rm /etc/resolv.conf
$ grep nameserver /run/systemd/resolve/resolv.conf > /etc/resolv.conf
$ systemctl stop systemd-resolved.service
$ systemctl disable systemd-resolved.service

With that it works as expected:
$ ip vrf exec mgmt apt-get update
Get:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease [242 kB]
Get:2 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:5 http://mirrors.digitalocean.com/ubuntu bionic-updates/universe amd64 Packages [1054 kB]

When using Management VRF, it is convenient (ie., less typing) to bind the shell to the VRF and let all commands run by it inherit the VRF binding: $ ip vrf exec mgmt su – dsahern

Now all commands run will automatically use Management VRF. This can be done at login using libpamscript[2].

Personally, I like a reminder about the network bindings in my bash prompt. I do that by adding the following to my .bashrc:

NS=$(ip netns identify)
[ -n "$NS" ] && NS=":${NS}"

VRF=$(ip vrf identify)
[ -n "$VRF" ] && VRF=":${VRF}"

And then adding '${NS}${VRF}' after the host in PS1:

PS1='${debian_chroot:+($debian_chroot)}\u@\h${NS}${VRF}:\w\$ '

For example, now the prompt becomes: dsahern@myhost:mgmt:~$

References:

[1] VRF tutorial, Open Source Summit, North America, Sept 2017 http://schd.ws/hosted_files/ossna2017/fe/vrf-tutorial-oss.pdf

[2] VRF helpers, e.g., systemd instances for VRF https://github.com/CumulusNetworks/vrf

[3] Example using VRF in dhclient-exit-hook https://github.com/CumulusNetworks/vrf/blob/master/etc/dhcp/dhclient-exit-hooks.d/vrf

[4] Vincent Bernat informed me that some modules were moved to non-standard package; installing linux-modules-extra-$(uname -r)-generic provides the vrf module. Thanks Vincent.

 
Read more...

from joelfernandes

The SRCU flavor of RCU uses per-cpu counters to detect that every CPU has passed through a quiescent state for a particular SRCU lock instance (srcu_struct).

There's are total of 4 counters per-cpu. One pair for locks, and another for unlocks. You can think of the SRCU instance to be split into 2 parts. The readers sample srcu_idx and decided which part to use. Each part corresponds to one pair of lock and unlock counters. A reader increments a part's lock counter during locking and likewise for unlock.

During an update, the updater flips srcu_idx (thus attempting to force new readers to use the other part) and waits for the lock/unlock counters on the previous value of srcu_idx to match. Once the sum of the lock counters of all CPUs match that of unlock, the system knows all pre-existing read-side critical sections have completed.

Things are not that simple, however. It is possible that a reader samples the srcu_idx, but before it can increment the lock counter corresponding to it, it undergoes a long delay. We thus we end up in a situation where there are readers in both srcu_idx = 0 and srcu_idx = 1.

To prevent such a situation, a writer has to wait for readers corresponding to both srcu_idx = 0 and srcu_idx = 1 to complete. This depicted with 'A MUST' in the below pseudo-code:

        reader 1        writer                        reader 2
        -------------------------------------------------------
        // read_lock
        // enter
        Read: idx = 0;
        <long delay>    // write_lock
                        // enter
                        wait_for lock[1]==unlock[1]
                        idx = 1; /* flip */
                        wait_for lock[0]==unlock[0]
                        done.
                                                      Read: idx = 1;
        lock[0]++;
                                                      lock[1]++;
                        // write_lock
                        // return
        // read_lock
        // return
        /**** NOW BOTH lock[0] and lock[1] are non-zero!! ****/
                        // write_lock
                        // enter
                        wait_for lock[0]==unlock[0] <- A MUST!
                        idx = 0; /* flip */
                        wait_for lock[1]==unlock[1] <- A MUST!

NOTE: QRCU has a similar issue. However it overcomes such a race in the reader by retrying the sampling of its 'srcu_idx' equivalent.

Q: If you have to wait for readers of both srcu_idx = 0, and 1, then why not just have a single counter and do away with the “flipping” logic? Ans: Because of updater forward progress. If we had a single counter, then it is possible that new readers would constantly increment the lock counter, thus updaters would be waiting all the time. By using the 'flip' logic, we are able to drain pre-existing readers using the inactive part of srcu_idx to be drained in a bounded time. The number of readers of a 'flipped' part would only monotonically decrease since new readers go to its counterpart.

 
Read more...

from metan's blog

When we were designing the new LTP test library we ended up with something that could be called a test driver model. There is plenty of of reasons why tests should have well defined structure and why declarative style for test requirements and dependencies is better than conditions buried deep in a chain of function calls. Not only that simplifies the test code and helps the author to focus on the actual test code, this arrangement also helps to export the test metadata into a machine parsable format, which I've been writing about in my previous posts.

So how does LTP driver mode looks like? First of all LTP tests consists of three functions. There is a setup(), which is called once at the start of the test, a test() function which may be called repeatedly and a cleanup() that may be called asynchronously. Asynchronously means that the test library will call the cleanup() before the test exits, which may be triggered prematurely by a fatal failure in the middle of the test run.

Apart from these functions there are plenty of bit flags, strings, and arrays that describe what the test requires and what should be prepared before these functions are called. There are plenty of basic bit flags such as needs_root, needs_tmpdir, and so on. Then there are a bit more advanced flags that can prepare a block device prior to the test and clean it up automatically or run the test() function for all filesystems supported by a kernel.

Basic LTP test

#include "tst_test.h"

static void test(void)
{
        tst_res(TPASS, "Test passed");
}

static struct tst_test test = {
        .test_all = test,
};

This is the most basic LTP test that succeeds. There are a few default test parameters such as -i which allows the test to be executed repeatedly, so the test function has to be able to cope with that. The actual test also runs in a forked process while the test library process watches for timeouts and cleans up once the test is done. However all of that is invisible to the test itself.

Tests that runs on all supported filesystems

#include "tst_test.h"

#define MNTPOINT "mntpoint"

static void test(void)
{
        /* Actual test is done here and the device is mounted at MNTPOINT */
}

static void setup(void)
{
        tst_res(TINFO, "Running test on %s device %s filesystem",  
                tst_device.dev, tst_device.fs_type);
}

static struct tst_test test = {
        .setup = setup,
        .test_all = test,
        .mount_device = 1,
        .all_filesystems = 1,
        .mntpoint = MNTPOINT,
};

This is a bit more complex example of the test library. We ask for the test to be executed for each supported filesystem. Supported means that kernel is able mount it and mkfs is installed so that the device could be formatted. The test library also supports FUSE which needs a special handling since we need a kernel support for a fuse and a user-space binary that implements the filesystem.

Once the test is started the test library compiles a list of supported filesystems and creates unique test temporary directory because .needs_tmpdir is implied by .mount_device flag. For each filesystem the device is formatted and mounted at .mntpoint which is relative to the test temporary directory. After that a new process witch will run the test is forked. The main library process then starts a timeout timer and waits. The child, if set, runs the setup() function, which is usually used to populate the newly formatted device with files and so on. Then finally the test() function is executed, possibly repeatedly. Once the test is done the child, if set, runs the cleanup(). But since the device is mounted in the test library and cleaned up automatically as well as the temporary directory there is nothing to do. Also if the child that runs the test crashes the device is unmounted regardless.

You may ask where did we get the device for the test in the first place, most of the time this is a loopback device created by the test library as well, which is created at the start of the test automatically and released once it finishes as well.

This sums up a very shallow look at the test library, there is much more implemented in there, for example we do have a flag for restoring the system wall clock after a test, support for parsing kernel .config and many more, but I guess that I should stop now because the blog post is already longer than I wanted it to be.

 
Read more...

from Konstantin Ryabitsev

If Greg KH ever writes a book about his work as the stable kernel maintainer, it should be titled “Everyone must upgrade” (or it could be a Dr. Who fanfic about Cybermen, I guess). Today, I'm going to borrow a leaf out of that non-existent book to loudly proclaim that all patch submissions must include base-commit info.

What is a base-commit?

When you submit a single patch or a series of patches to a kernel maintainer, there is one important piece of information that they need to know in order to properly apply it. Specifically, they need to know what was the state of your git tree at the time when you wrote that code. Kernel development moves very quickly and there is no guarantee that a patch written mid-January would still apply at the beginning of February, unless there were no significant changes to any of the files your patch touches.

To solve this problem, you can include a small one-liner in your patch:

base-commit: abcde12345

This tells the person reviewing your patch that, at the time when you wrote your code, the latest commit in the git repository was abcde12345. It is now easy for the maintainer to do the following:

git checkout -b incoming_patch abcde12345
git am incoming_patch.mbx

This will tell git to create a new branch using abcde12345 as the parent commit and apply your patches at that point in history, ensuring that there will be no failed or rejected hunks due to recent code changes.

After reviewing your submission the maintainer can then merge that branch back into master, resolving any conflicts during the merge stage (they are really good at that), instead of having to modify patches during the git am stage. This saves maintainers a lot of work, because if your patches require revisions before they can be accepted, they don't have to manually edit anything at all.

Automated CI systems

Adding base-commit info really makes a difference when automated CI systems are involved. With more and more CI tests written for the Linux kernel, maintainers are hoping to be able to receive test reports for submitted patches even before they look at them, as a way to save time and effort.

Unfortunately, if the CI system does not have the base-commit information to work with, it will most likely try to apply your patches to the latest master. If that fails, there will be no CI report, which means the maintainers will be that much less likely to look at your patches.

How to add base-commit to your submission

If you are using git-format-patch (and you really should be), then you can already automatically include the base commit information. The easiest way to do so is by using topical branches and git format-patch --base=auto, for example:

$ git checkout -t -b my-topical-branch master
Branch 'my-topical-branch' set up to track local branch 'master'.
Switched to a new branch 'my-topical-branch'

[perform your edits and commits]

$ git format-patch --base=auto --cover-letter -o outgoing/ master
outgoing/0000-cover-letter.patch
outgoing/0001-First-Commit.patch
outgoing/...

When you open outgoing/0000-cover-letter.patch for editing, you will notice that it will have the base-commit: trailer at the very bottom.

Once you have the set of patches to send, you should run them through checkpatch.pl to make sure that there are no glaring errors, and then submit them to the proper developers and mailing lists.

You can learn more by reading the submitting patches document, which now includes a section about base-commit info as well.

 
Read more...

from Konstantin Ryabitsev

WriteFreely recently added support for creating and editing posts via the command-line wf tool and this functionality is available to all users at people.kernel.org.

On the surface, this is easy to use — you just need to write out a markdown-formatted file and then use wf publish myfile.md to push it into your blog (as draft). However, there are some formatting-related caveats to be aware of.

Line-breaks

Firstly, WriteFreely's MD flavour differs from GitHub's in how it treats hard linebreaks: specifically, they will be preserved in the final output. On GitHub, if you write the following markdown:

Hello world! Dis next line. And dis next line.

And dis next para. Pretty neat, huh?

GitHub will collapse single linebreaks and only preserve the double linebreak to separate text into two paragraphs. On the contrary, WriteFreely will preserve all newlines as-is. I was first annoyed by difference from other markdown flavours, but then I realized that this is actually more like how email is rendered, and found zen and peace in this. :)

Therefore, publishing via wf post will apply stylistic markdown formatting and properly linkify all links, but will preserve all newlines as if you were reading an email message on lore.kernel.org.

There's some discussion about making markdown flavouring user-selectable, so if you want to add your voice to the discussion, please do it there.

Making it behave more like GitHub's markdown

If you do want to make it behave more like GitHub's markdown, you need to make sure that:

  1. You aren't using hard linebreaks to wrap your long lines
  2. You are publishing using --font serif

E.g.:

  $ gedit mypost.md
  $ cat mypost.md | wf post --font serif

This will render things more like how you get them by publishing from the WriteFreely's web interface.

Using “post” and “publish” actually puts things into drafts

I found this slightly confusing, but this is not a bad feature in itself, as it allows previewing your post before putting it out into the world. The way it works is:

  $ vim myfile.md
  $ cat myfile.md | wf post
  https://people.kernel.org/abcrandomstr

You can then access that URL to make sure everything got rendered correctly. If something isn't quite right, you can update it via using its abcrandomstr preview URL:

  $ vim myfile.md
  $ cat myfile.md | wf update abcrandomstr

After you're satisfied, you can publish the post using the “move to Yourblog” link in the Drafts view.

Read the friendly manual

Please read the user guide and the markdown reference to try things out.

 
Read more...

from metan's blog

First of all what's result propagation and what is wrong with it. Result propagation happens when test does a function call and the test result depends on the return value. Or if a test executes a sub-process and the result depends on the return value. Sometimes the propagation is quite simple but more often the chain is complicated and the code is prone to errors. I've seen quite a few testcases that were failing but the test results were being ignored because the failure was lost in propagation.

Naturally I wanted to avoid this problems when designing the LTP test library. The main requirements for the solution were:

  • Keep it as simple as possible
  • No need to propagate results even from processes started by exec()
  • Thread safe

In the end the solution was quite simple, the functions that report test results in LTP tests use atomic increments on counters stored in a piece of shared memory.

When LTP tests starts, the library allocates a page of shared memory, the memory is backed by a unique file on tmpfs and the path is stored in an environment variable. Which also means that you can use this interface from basically any programming language including a shell, since all you need is the environment variable and small C helper that increments the counters.

The shared page of memory could also be used for synchronization, once we have the page in place in all tests there is plenty of room to be used by futexes, which is what the checkpoint synchronization primitives in LPT are based on. And again, since the path to the shared memory is available even to processes started by exec(), we can synchronize shell parts of the tests against C code which I think is pretty cool feature.

 
Read more...

from joelfernandes

The Message Passing pattern (MP pattern) is shown in the snippet below (borrowed from LKMM docs). Here, P0 and P1 are 2 CPUs executing some code. P0 stores a message in buf and then signals to consumers like P1 that the message is available — by doing a store to flag. P1 reads flag and if it is set, knows that some data is available in buf and goes ahead and reads it. However, if flag is not set, then P1 does nothing else. Without memory barriers between P0's stores and P1's loads, the stores can appear out of order to P1 (on some systems), thus breaking the pattern. The condition r1 == 0 and r2 == 1 is a failure in the below code and would violate the condition. Only after the flag variable is updated, should P1 be allowed to read the buf (“message”).

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2 = 0;

                r1 = READ_ONCE(flag);
                if (r1)
                        r2 = READ_ONCE(buf);
        }

Below is a simple program in PlusCal to model the “Message passing” access pattern and check whether the failure scenario r1 == 0 and r2 == 1 could ever occur. In PlusCal, we can model the non deterministic out-of-order stores to buf and flag using an either or block. This makes PlusCal evaluate both scenarios of stores (store to buf first and then flag, or viceversa) during model checking. The technique used for modeling this non-determinism is similar to how it is done in Promela/Spin using an “if block” (Refer to Paul McKenney's perfbook for details on that).

EXTENDS Integers, TLC
(*--algorithm mp_pattern
variables
    buf = 0,
    flag = 0;

process Writer = 1
variables
    begin
e0:
       either
e1:        buf := 1;
e2:        flag := 1;
        or
e3:        flag := 1;
e4:        buf := 1;
        end either;
end process;

process Reader = 2
variables
    r1 = 0,
    r2 = 0;  
    begin
e5:     r1 := flag;
e6:     if r1 = 1 then
e7:         r2 := buf;
        end if;
e8:     assert r1 = 0 \/ r2 = 1;
end process;

end algorithm;*)

Sure enough, the assert r1 = 0 \/ r2 = 1; fires when the PlusCal program is run through the TLC model checker.

I do find the either or block clunky, and wish I could just do something like:

non_deterministic {
        buf := 1;
        flag := 1;
}

And then, PlusCal should evaluate both store orders. In fact, if I wanted more than 2 stores, then it can get crazy pretty quickly without such a construct. I should try to hack the PlusCal sources soon if I get time, to do exactly this. Thankfully it is open source software.

Other notes:

  • PlusCal is a powerful language that translates to TLA+. TLA+ is to PlusCal what assembler is to C. I do find PlusCal's syntax to be non-intuitive but that could just be because I am new to it. In particular, I hate having to mark statements with labels if I don't want them to atomically execute with neighboring statements. In PlusCal, a label is used to mark a statement as an “atomic” entity. A group of statements under a label are all atomic. However, if you don't specific labels on every statement like I did above (eX), then everything goes under a neighboring label. I wish PlusCal had an option, where a programmer could add implict labels to all statements, and then add explicit atomic { } blocks around statements that were indeed atomic. This is similar to how it is done in Promela/Spin.

  • I might try to hack up my own compiler to TLA+ if I can find the time to, or better yet modify PlusCal itself to do what I want. Thankfully the code for the PlusCal translator is open source software.

 
Read more...

from Benson Leung

tl;dr: There are now 8. Thunderbolt 3 cables officially count too. It's getting hard to manage, but help is on the way.

Edited lightly 09-16-2019: Tables 3-1 and 5-1 from USB Type-C Spec reproduced as tables instead of images. Made an edit to clarify that Thunderbolt 3 passive cables have always been complaint USB-C cables.

If you recall my first cable post, there were 6 kinds of cables with USB-C plugs on both ends. I was also careful to preface that it was true as of USB Type-C™ Specification 1.4 on June 2019.

Last week, the USB-IF officially published the USB Type-C™ Specification Version Revision 2.0, August 29, 2019.

This is a major update to USB-C and contains required amendments to support the new USB4™ Spec.

One of those amendments? Introducing a new data rate, 20Gbps per lane, or 40Gbps total. This is called “USB4 Gen 3” in the new spec. One more data rate means the matrix of cables increases by a row, so we now have 8 C-to-C cable kinds, see Table 3-1:

Table 3-1 USB Type-C Standard Cable Assemblies

Cable Ref Plug 1 Plug 2 USB Version Cable Length Current Rating USB Power Delivery USB Type-C Electronically Marked
CC2-3 C C USB 2.0 ≤ 4 m 3 A Supported Optional
CC2-5 5 A Required
CC3G1-3 C C USB 3.2 Gen1 and USB4 Gen2 ≤ 2 m 3 A Supported Required
CC3G1-5 5 A
CC3G2-3 C C USB 3.2 Gen2 and USB4 Gen2 ≤ 1 m 3 A Supported Required
CC3G2-5 5 A
CC3G3-3 C C USB4 Gen3 ≤ 0.8 m 3 A Supported Required
CC3G3-5 5 A

Listed, with new cables in bold: 1. USB 2.0 rated at 3A 2. USB 2.0 rated at 5A 3. USB 3.2 Gen 1 rated at 3A 4. USB 3.2 Gen 1 rated at 5A 5. USB 3.2 Gen 2 rated at 3A 6. USB 3.2 Gen 2 rated at 5A 7. USB4 Gen 3 rated at 3A 8. USB4 Gen 3 rated at 5A

New cables 7 and 8 have the same number of wires as cables 3 through 6, but are built to tolerances such that they can sustain 20Gbps per set of differential pairs, or 40Gbps for the whole cable. This is the maximum data rate in the USB4 Spec.

Also, please notice in the table above that (informative) maximum cable length shrinks as speed increases. Gen 1 cables can be 2M long, while Gen 3 cables can be 0.8m. This is just a practical consequence of physics and signal integrity when it comes to passive cables.

Data Rates

Data rates require some explanation too, as advancements since USB 3.1 means that the same physical cable is capable of way more when used in a USB4 system.

A USB 3.1 Gen 1 cable built and sold in 2015 would have been advertised to support 5Gbps operation in 2015. Fast forward to 2019 or 2020, that exact same physical cable (Gen 1), will actually allow you to hit 20gbps using USB4. This is due to advancements in the underlying phy on the host and client-side, but also because USB4 uses all 8 SuperSpeed wires simultaneously, while USB 3.1 only used 4 (single lane operation versus dual-lane operation).

The same goes for USB 3.1 Gen 2 cables, which would have been sold as 10gbps cables. They are able to support 20gbps operation in USB4, again, because of dual-lane.

Table 5-1 Certified Cables Where USB4-compatible Operation is Expected

Cable Signaling USB4 Operation Notes
USB Type-C Full-Featured Cables (Passive) USB 3.2 Gen1 20 Gbps This cable will indicate support for USB 3.2 Gen1 (001b) in the USB Signaling field of its Passive Cable VDO response. Note: even though this cable isn’t explicitly tested, certified or logo’ed for USB 3.2 Gen2 operation, USB4 Gen2 operation will generally work.
USB 3.2 Gen2 (USB4 Gen2) 20 Gbps This cable will indicate support for USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response.
USB4 Gen3 40 Gbps This cable will indicate support for USB4 Gen3 (011b) in the USB Signaling field of its Passive Cable VDO response.
Thunderbolt™ 3 Cables (Passive) TBT3 Gen2 20 Gbps This cable will indicate support for USB 3.2 Gen1 (001b) or USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response.
TBT3 Gen3 40 Gbps In addition to indicating support for USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response, this cable will indicate that it supports TBT3 Gen3 in the Discover Mode VDO response.
USB Type-C Full-Featured Cables (Active) USB4 Gen2 20 Gbps This cable will indicate support for USB4 Gen2 (010b) in the USB Signaling field of its Active Cable VDO response.
USB4 Gen3 40 Gbps This cable will indicate support for USB4 Gen3 (011b) in the USB Signaling field of its Active Cable VDO response.

What about Thunderbolt 3 cables? Thunderbolt 3 cables physically look the same as a USB-C to USB-C cable and the passive variants of the cables comply with the existing USB-C spec and are to be regarded as USB-C cables of kinds 3 through 6. In addition to being compliant USB-C cables, Intel needed a way to mark some of their cables as 40Gbps capable, years before USB-IF defined the Gen 3 40gbps data rate level. They did so using extra alternate mode data objects in the Thunderbolt 3 cables' electronic marker, amounting to extra registers that mark the cable as high speed capable.

The good news is that since Intel decided to open up the Thunderbolt 3 spec, the USB-IF was able to completely take in and make Passive 20Gbps and 40Gbps Thunderbolt 3 cables supported by USB4 devices. A passive 40Gbps TBT3 cable you bought in 2016 or 2017 will just work at 40Gbps on a USB4 device in 2020.

How Linux USB PD and USB4 systems can help identify cables for users

By now, you are likely ever so confused by this mess of cable and data rate possibilities. The fact that I need a matrix and a decoder ring to explain the landscape of USB-C cables is a bad sign.

In the real world, your average user will pick a cable and will simply not be able to determine the capabilities of the cable by looking at it. Even if the cable has the appropriate logo to distinguish them, not every user will understand what the hieroglyphs mean.

Software, however, and Power Delivery may very well help with this. I've been looking very closely at the kernel's USB Type-C Connector Class.

The connector class creates the following structure in sysfs, populating these nodes with important properties queried from the cable, the USB-C port, and the port's partner:

/sys/class/typec/
/sys/class/typec/port0 <---------------------------Me
/sys/class/typec/port0/port0-partner/ <------------My Partner
/sys/class/typec/port0/port0-cable/ <--------------Our Cable
/sys/class/typec/port0/port0-cable/port0-plug0 <---Cable SOP'
/sys/class/typec/port0/port0-cable/port0-plug1 <---Cable SOP"

You may see where I'm going from here. Once user space is able to see what the cable and its e-marker chip has advertised, an App or Settings panel in the OS could tell the user what the cable is, and hopefully in clear language what the cable can do, even if the cable is unlabeled, or the user doesn't understand the obscure logos.

Lots of work remains here. The present Type-C Connector class needs to be synced with the latest version of the USB-C and PD spec, but this gives me hope that users will have a tool (any USB-C phone with PD) in their pocket to quickly identify cables.

 
Read more...

from metan's blog

What is wrong with sleep() then?

First of all this is something I had to fight off a lot and still have to from time to time. In most of the cases sleep() has been misused to avoid a need for a proper synchronization, which is wrong for at least two reasons.

The first is that it may and will introduce very rare test failures, that means somebody has to spend time looking into these, which is a wasted effort. Also I'm pretty sure that nobody likes tests that will fail rarely for no good reason. Even more so you cannot run such tests with a background load to ensure that everything works correctly on a bussy system, because that will increase the likehood of a failure.

The second is that this wastes resources and slowns down a test run. If you think that adding a sleep to a test is not a big deal, let me put things into a perspective. There is about 1600 syscall tests in Linux Test Project (LTP), if 7.5% of them would sleep just for one second, we would end up with two minutes of wasted time per testrun. In practice most of the test I've seen waited for much longer just to be sure that things will works even on slower hardware. With sleeps between 2 and 5 seconds that puts us somewhere between 4 and 10 minutes which is between 13% and 33% of the syscall runtime on my dated thinkpad, where the run finishes in a bit less than half an hour. It's even worse on newer hardware, because this slowdown will not change no matter how fast your machine is, which is maybe the reason why this was acceptable twenty years ago but it's not now.

When sleep() is acceptable then?

So far in my ten years of test development I met only a few cases where sleep() in a test code was appropriate. From the top of my head I remeber:

  • Filesystem tests for file timestamps, atime, mtime, etc.
  • Timer related tests where we sample timer in a loop
  • alarm() and timer_create() test where we wait for the timer to fire
  • Leap second tests

How to fix the problem?

Unfortunately there is no silver bullet since there are plenty of reasons for a race condition to happen and each class has to be dealt with differently.

Fortunately there are quite a few very common classes that could be dealt with quite easily. So in LTP we wrote a few synchronization primitives and helper functions that could be used by a test, so there is no longer any excuse to use sleep() instead.

The most common case was a need to synchronize between parent and child processes. There are actually two different cases that needed to be solved. First is a case where child has to execute certain piece of code before parent can continue. For that LTP library implements checkpoints with simple wait and wake functions based on futexes on a piece of shared memory set up by the test library. The second case is where child has to sleep in a syscall before parent can continue, for which we have a helper that polls /proc/$PID/stat. Also sometimes tests can be fixed just be adding a waitpid() in the parent which ensures that child is finished before parent runs.

There are other and even more complex cases where particular action is done asynchronously, or a kernel resource deallocation is deffered to a later time. In such cases quite often the best we can do is to poll. In LTP we ended up with a macro that polls by calling a piece of code in a loop with exponentially increasing sleeps between retries. Which means that instead of sleeping for a maximal time event can possibly take the sleep is capped by twice of the optimal sleeping time while we avoid polling too aggressively.

 
Read more...

from tglx

E-Mail interaction with the community

You might have been referred to this page with a form letter reply. If so the form letter has been sent to you because you sent e-mail in a way which violates one or more of the common rules of email communication in the context of the Linux kernel or some other Open Source project.

Private mail

Help from the community is provided as a free of charge service on a best effort basis. Sending private mail to maintainers or developers is pretty much a guarantee for being ignored or redirected to this page via a form letter:

  • Private e-mail does not scale Maintainers and developers have limited time and cannot answer the same questions over and over.

  • Private e-mail is limiting the audience Mailing lists allow people other than the relevant maintainers or developers to answer your question. Mailing lists are archived so the answer to your question is available for public search and helps to avoid the same question being asked again and again. Private e-mail is also limiting the ability to include the right experts into a discussion as that would first need your consent to give a person who was not included in your Cc list access to the content of your mail and also to your e-mail address. When you post to a public mailing list then you already gave that consent by doing so. It's usually not required to subscribe to a mailing list. Most mailing lists are open. Those which are not are explicitly marked so. If you send e-mail to an open list the replies will have you in Cc as this is the general practice.

  • Private e-mail might be considered deliberate disregard of documentation The documentation of the Linux kernel and other Open Source projects gives clear advice how to contact the community. It's clearly spelled out that the relevant mailing lists should always be included. Adding the relevant maintainers or developers to CC is good practice and usually helps to get the attention of the right people especially on high volume mailing lists like LKML.

  • Corporate policies are not an excuse for private e-mail If your company does not allow you to post on public mailing lists with your work e-mail address, please go and talk to your manager.

Confidentiality disclaimers

When posting to public mailing lists the boilerplate confidentiality disclaimers are not only meaningless, they are absolutely wrong for obvious reasons.

If that disclaimer is automatically inserted by your corporate e-mail infrastructure, talk to your manager, IT department or consider to use a different e-mail address which is not affected by this. Quite some companies have dedicated e-mail infrastructure to avoid this problem.

Reply to all

Trimming Cc lists is usually considered a bad practice. Replying only to the sender of an e-mail immediately excludes all other people involved and defeats the purpose of mailing lists by turning a public discussion into a private conversation. See above.

HTML e-mail

HTML e-mail – even when it is a multipart mail with a corresponding plain/text section – is unconditionally rejected by mailing lists. The plain/text section of multipart HTML e-mail is generated by e-mail clients and often results in completely unreadable gunk.

Multipart e-mail

Again, use plain/text e-mail and not some magic format. Also refrain from attaching patches as that makes it impossible to reply to the patch directly. The kernel documentation contains elaborate explanations how to send patches.

Text mail formatting

Text-based e-mail should not exceed 80 columns per line of text. Consult the documentation of your e-mail client to enable proper line breaks around column 78.

Top-posting

If you reply to an e-mail on a mailing list do not top-post. Top-posting is the preferred style in corporate communications, but that does not make an excuse for it:

A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

A: Top-posting. Q: What is the most annoying thing in e-mail?

A: No. Q: Should I include quotations after my reply?

See also: http://daringfireball.net/2007/07/on_top

Trim replies

If you reply to an e-mail on a mailing list trim unneeded content of the e-mail you are replying to. It's an annoyance to have to scroll down through several pages of quoted text to find a single line of reply or to figure out that after that reply the rest of the e-mail is just useless ballast.

Quoting code

If you want to refer to code or a particular function then mentioning the file and function name is completely sufficient. Maintainers and developers surely do not need a link to a git-web interface or one of the source cross-reference sites. They are definitely able to find the code in question with their favorite editor.

If you really need to quote code to illustrate your point do not copy that from some random web interface as that turns again into unreadable gunk. Insert the code snippet from the source file and only insert the absolute minimum of lines to make your point. Again people are able to find the context on their own and while your hint might be correct in many cases the issue you are looking into is root caused at a completely different place.

Does not work for you?

In case you can't follow the rules above and the documentation of the Open Source project you want to communicate with, consider to seek professional help to solve your problem.

Open Source consultants and service providers charge for their services and therefore are willing to deal with HTML e-mail, disclaimers, top-posting and other nuisances of corporate style communications.

 
Read more...

from tglx

E-Mail interaction with the community

You might have been referred to this page with a form letter reply. If so the form letter has been sent to you because you sent e-mail in a way which violates one or more of the common rules of email communication in the context of the Linux kernel or some other Open Source project.

Private mail

Help from the community is provided as a free of charge service on a best effort basis. Sending private mail to maintainers or developers is pretty much a guarantee for being ignored or redirected to this page via a form letter:

  • Private e-mail does not scale Maintainers and developers have limited time and cannot answer the same questions over and over.

  • Private e-mail is limiting the audience Mailing lists allow people other than the relevant maintainers or developers to answer your question. Mailing lists are archived so the answer to your question is available for public search and helps to avoid the same question being asked again and again. Private e-mail is also limiting the ability to include the right experts into a discussion as that would first need your consent to give a person who was not included in your Cc list access to the content of your mail and also to your e-mail address. When you post to a public mailing list then you already gave that consent by doing so. It's usually not required to subscribe to a mailing list. Most mailing lists are open. Those which are not are explicitly marked so. If you send e-mail to an open list the replies will have you in Cc as this is the general practice.

  • Private e-mail might be considered deliberate disregard of documentation The documentation of the Linux kernel and other Open Source projects gives clear advice how to contact the community. It's clearly spelled out that the relevant mailing lists should always be included. Adding the relevant maintainers or developers to CC is good practice and usually helps to get the attention of the right people especially on high volume mailing lists like LKML.

  • Corporate policies are not an excuse for private e-mail If your company does not allow you to post on public mailing lists with your work e-mail address, please go and talk to your manager.

Confidentiality disclaimers

When posting to public mailing lists the boilerplate confidentiality disclaimers are not only meaningless, they are absolutely wrong for obvious reasons.

If that disclaimer is automatically inserted by your corporate e-mail infrastructure, talk to your manager, IT department or consider to use a different e-mail address which is not affected by this. Quite some companies have dedicated e-mail infrastructure to avoid this problem.

Reply to all

Trimming Cc lists is usually considered a bad practice. Replying only to the sender of an e-mail immediately excludes all other people involved and defeats the purpose of mailing lists by turning a public discussion into a private conversation. See above.

HTML e-mail

HTML e-mail – even when it is a multipart mail with a corresponding plain/text section – is unconditionally rejected by mailing lists. The plain/text section of multipart HTML e-mail is generated by e-mail clients and often results in completely unreadable gunk.

Multipart e-mail

Again, use plain/text e-mail and not some magic format. Also refrain from attaching patches as that makes it impossible to reply to the patch directly. The kernel documentation contains elaborate explanations how to send patches.

Text mail formatting

Text-based e-mail should not exceed 80 columns per line of text. Consult the documentation of your e-mail client to enable proper line breaks around column 78.

Top-posting

If you reply to an e-mail on a mailing list do not top-post. Top-posting is the preferred style in corporate communications, but that does not make an excuse for it:

A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

A: Top-posting. Q: What is the most annoying thing in e-mail?

A: No. Q: Should I include quotations after my reply?

See also: http://daringfireball.net/2007/07/on_top

Trim replies

If you reply to an e-mail on a mailing list trim unneeded content of the e-mail you are replying to. It's an annoyance to have to scroll down through several pages of quoted text to find a single line of reply or to figure out that after that reply the rest of the e-mail is just useless ballast.

Quoting code

If you want to refer to code or a particular function then mentioning the file and function name is completely sufficient. Maintainers and developers surely do not need a link to a git-web interface or one of the source cross-reference sites. They are definitely able to find the code in question with their favorite editor.

If you really need to quote code to illustrate your point do not copy that from some random web interface as that turns again into unreadable gunk. Insert the code snippet from the source file and only insert the absolute minimum of lines to make your point. Again people are able to find the context on their own and while your hint might be correct in many cases the issue you are looking into is root caused at a completely different place.

Does not work for you?

In case you can't follow the rules above and the documentation of the Open Source project you want to communicate with, consider to seek professional help to solve your problem.

Open Source consultants and service providers charge for their services and therefore are willing to deal with HTML e-mail, disclaimers, top-posting and other nuisances of corporate style communications.

 
Read more...

from metan's blog

We are trying to resurrect the automated testing track on FOSDEM along with Anders and Matt from Linaro. Now that will not happen and even does not make sense to happen without you. I can talk and drink beer in the evening with Matt and Anders just fine, we do not need a devroom for that.

We also have a few people prepared to give a talks on various topics but not nearly enough to fill in and justify a room for a day. So if you are interested in attending or even better giving a talk or driving a discussion please let us know.

 
Read more...

from joelfernandes

Note: At the time of this writing, it is kernel v5.3 release. RCU moves fast and can change in the future, so some details in this article may be obsolete.

The RCU subsystem and the task scheduler are inter-dependent. They both depend on each other to function correctly. The scheduler has many data structures that are protected by RCU. And, RCU may need to wake up threads to perform things like completing grace periods and callback execution. One such case where RCU does a wake up and enters the scheduler is rcu_read_unlock_special().

Recently Paul McKenney consolidated RCU flavors. What does this mean?

Consider the following code executing in CPU 0:

preempt_disable();
rcu_read_lock();
rcu_read_unlock();
preempt_enable();

And, consider the following code executing in CPU 1:

a = 1;
synchronize_rcu();  // Assume synchronize_rcu
                    // executes after CPU0's rcu_read_lock
b = 2;

CPU 0's execution path shows 2 flavors of RCU readers, one nested into another. The preempt_{disable,enable} pair is an RCU-sched flavor RCU reader section, while the rcu_read_{lock,unlock} pair is an RCU-preempt flavor RCU reader section.

In older kernels (before v4.20), CPU 1's synchronize_rcu() could return after CPU 0's rcu_read_unlock() but before CPU 0's preempt_enable(). This is because synchronize_rcu() only needs to wait for the “RCU-preempt” flavor of the RCU grace period to end.

In newer kernels (v4.20 and above), the RCU-preempt and RCU-sched flavors have been consolidated. This means CPU 1's synchronize_rcu() is guaranteed to wait for both of CPU 1's rcu_read_unlock() and preempt_enable() to complete.

Now, lets get a bit more detailed. That rcu_read_unlock() most likely does very little. However, there are cases where it needs to do more, by calling rcu_read_unlock_special(). One such case is if the reader section was preempted. A few more cases are:

  • The RCU reader is blocking an expedited grace period, so it needed to report a quiescent state quickly.
  • The RCU reader is blocking a grace period for too long (~100 jiffies on my system, that's the default but can be set with rcutree.jiffies_till_sched_qs parameter).

In all these cases, the rcu_read_unlock() needs to do more work. However, care must be taken when calling rcu_read_unlock() from the scheduler, that's why this article on scheduler deadlocks.

One of the reasons rcu_read_unlock_special() needs to call into the scheduler is priority de-boosting: A task getting preempted in the middle of an RCU read-side critical section results in blocking the completion of the critical section and hence could prevent current and future grace periods from ending. So the priority of the RCU reader may need to be boosted so that it gets enough CPU time to make progress, and have the grace period end soon. But it also needs to be de-boosted after the reader section completes. This de-boosting happens by calling of the rcu_read_unlock_special() function in the outer most rcu_read_unlock().

What could go wrong with the scheduler using RCU? Let us see this in action. Consider the following piece of code executed in the scheduler:

  reader()
	{
		rcu_read_lock();
		do_something();     // Preemption happened
                /* Preempted task got boosted */
		task_rq_lock();     // Disables interrupts
                rcu_read_unlock();  // Need to de-boost
		task_rq_unlock();   // Re-enables interrupts
	}

Assume that the rcu_read_unlock() needs to de-boost the task's priority. This may cause it to enter the scheduler and cause a deadlock due to recursive locking of RQ/PI locks.

Because of these kind of issues, there has traditionally been a rule that RCU usage in the scheduler must follow:

“Thou shall not hold RQ/PI locks across an rcu_read_unlock() if thou not holding it or disabling IRQ across both both the rcu_read_lock() + rcu_read_unlock().”

More on this rule can be read here as well.

Obviously, acquiring RQ/PI locks across the whole rcu_read_lock() and rcu_read_unlock() pair would resolve the above situation. Since preemption and interrupts are disabled across the whole rcu_read_lock() and rcu_read_unlock() pair; there is no question of task preemption.

Anyway, the point is rcu_read_unlock() needs to be careful about scheduler wake-ups; either by avoiding calls to rcu_read_unlock_special() altogether (as is the case if interrupts are disabled across the entire RCU reader), or by detecting situations where a wake up is unsafe. Peter Ziljstra says there's no way to know when the scheduler uses RCU, so “generic” detection of the unsafe condition is a bit tricky.

Now with RCU consolidation, the above situation actually improves. Even if the scheduler RQ/PI locks are not held across the whole read-side critical sectoin, but just across that of the rcu_read_unlock(), then that itself may be enough to prevent a scheduler deadlock. The reasoning is: during the rcu_read_unlock(), we cannot yet report a QS until the RQ/PI lock is itself released since the act of holding the lock itself means preemption is disabled and that would cause a QS deferral. As a result, the act of priority de-boosting would also be deferred and prevent a possible scheduler deadlock.

However, RCU consolidation introduces even newer scenarios where the rcu_read_unlock() has to enter the scheduler, if the “scheduler rules” above is not honored, as explained below:

Consider the previous code example. Now also assume that the RCU reader is blocking an expedited RCU grace period. That is just a fancy term for a grace period that needs to end fast. These grace periods have to complete much more quickly than normal grace period. An expedited grace period causes currently running RCU reader sections to receive IPIs that set a hint. Setting of this hint results in the outermost rcu_read_unlock() calling rcu_read_unlock_special(), which otherwise would not occur. When rcu_read_unlock_special() gets called in this scenario, it tries to get more aggressive once it notices that the reader has blocked an expedited RCU grace period. In particular, it notices that preemption is disabled and so the grace period cannot end due to RCU consolidation. Out of desperation, it raises a softirq (raise_softirq()) in the hope that the next time the softirq runs, the grace period could be ended quickly before the scheduler tick occurs. But that can cause a scheduler deadlock by way of entry into the scheduler due to a ksoftirqd-wakeup.

The cure for this problem is the same, holding the RQ/PI locks across the entire reader section results in no question of a scheduler related deadlock due to recursively acquiring of these locks; because there would be no question of expedited-grace-period IPIs, hence no question of setting of any hints, and hence no question of calling rcu_read_unlock_special() from scheduler code. For a twist of the IPI problem, see special note.

However, the RCU consolidation throws yet another curve ball. Paul McKenney explained on LKML that there is yet another situation now due to RCU consolidation that can cause scheduler deadlocks.

Consider the following code, where previous_reader() and current_reader() execute in quick succession in the context of the same task:

       previous_reader()
	{
		rcu_read_lock();
		do_something();      // Preemption or IPI happened
		local_irq_disable(); // Cannot be the scheduler
		do_something_else();
		rcu_read_unlock();  // As IRQs are off, defer QS report
                                    //but set deferred_qs bit in 
                                    //rcu_read_unlock_special
		do_some_other_thing();
		local_irq_enable();
	}

        // QS from previous_reader() is still deferred.
	current_reader() 
	{
		local_irq_disable();  // Might be the scheduler.
		do_whatever();
		rcu_read_lock();
		do_whatever_else();
		rcu_read_unlock();    // Must still defer reporting QS
		do_whatever_comes_to_mind();
		local_irq_enable();
	}

Here previous_reader() had a preemption; even though the current_reader() did not – but the current_reader() still needs to call rcu_read_unlock_special() from the scheduler! This situation would not happen in the pre-consolidated-RCU world because previous_reader()'s rcu_read_unlock() would have taken care of it.

As you can see, just following the scheduler rule of disabling interrupts across the entire reader section does not help. To detect the above scenario; a new bitfield deferred_qs has been added to the task_struct::rcu_read_unlock_special union. Now what happens is, at rcu_read_unlock()-time, the previous reader() sets this bit, and the current_reader() checks this bit. If set, the call to raise_softirq() is avoided thus eliminating the possibility of a scheduler deadlock.

Hopefully no other scheduler deadlock issue is lurking!

Coming back to the scheduler rule, I have been running overnight rcutorture tests to detect if this rule is ever violated. Here is the test patch checking for the unsafe condition. So far I have not seen this condition occur which is a good sign.

I may need to check with Paul McKenney about whether proposing this checking for mainline is worth it. Thankfully, LPC 2019 is right around the corner! ;–)


Special Note

[1] The expedited IPI interrupting an RCU reader has a variation. For an example see below where the IPI was not received, but we still have a problem because the ->need_qs bit in the rcu_read_unlock_special union got set even though the expedited grace period started after IRQs were disabled. The start of the expedited grace period would set the rnp->expmask bit for the CPU. In the unlock path, because the ->need_qs bit is set, it will call rcu_read_unlock_special() and risk a deadlock by way of a ksoftirqd wakeup because exp in that function is true.

CPU 0                         CPU 1
preempt_disable();
rcu_read_lock();

// do something real long

// Scheduler-tick sets
// ->need_qs as reader is
// held for too long.

local_irq_disable();
                              // Expedited GP started
// Exp IPI not received
// because IRQs are off.

local_irq_enable();

// Here rcu_read_unlock will
// still call ..._special()
// as ->need_qs got set.
rcu_read_unlock();

preempt_enable();

The fix for this issue is the same as described earlier, disabling interrupts across both rcu_read_lock() and rcu_read_unlock() in the scheduler path.

 
Read more...

from metan's blog

The problem

More than ten years ago even consumer grade hardware started to have two and more CPU cores. These days you can easily buy PC with eight cores even in the local computer shops. The hardware has envolved quite fast during the last decade, but the software that does kernel testing didn't. These days we still run LTP testcases sequentially even on beefy servers with 64+ cores and terabytes of RAM. That means that syscalls testrun takes 30+ minutes while it could be reduced to less than 10 minutes, which will significantly shorten the feedback loop for any continous integration.

The naive solution

The naive solution is obviously to run $NCPU+1 tests in parallel. But there is a catch, some of the tests utilize global system resources or state and running two such tests in parallel would lead to false negatives. If you think that this situation is rare you are mistaken, there are plenty of tests that needs a block device, change system wall clock, sample kernel timers in a loop, play with SysV IPC, networking, etc. In short we will end up with many, hard to reproduce, race conditions and mostly useless results.

The proper solution

The proper solution would require:

  1. Make tests declare which resources they utilize
  2. Export this data to the testrunner
  3. Make the testrunner use that information when running tests

Happily the first part is mostly done for LTP, we have switched to new test library I've written a few years ago. The new library defines a “driver model” for a testcase. At this time about half of the tests are converted to the new library and the progress is steady.

In the new library the resources a test needs are listed in a declarative way as a C structure, so for most cases we already have this information in place. If you want to know more you can have a look at our documentation.

At this point I'm trying to solve the second part, which is making the data available to the testrunner, which mostly works at this point. Once this part is finished the missing piece would be writing a scheduller that takes this information into an account.

Unfortunately the way LTP runs test is also dated, there is an old runltp script which I would call an ugly hack at best. Adding feature such as parallel test run to that code is close to impossible. Which is the reason I've started to work on a new LTP testrunner, which I guess could be a subject for a second blog post.

 
Read more...