people.kernel.org

Reader

Read the latest posts from people.kernel.org.

from David Ahern

When evaluating networking for a host the focus is typically on latency, throughput or packets per second (pps) to see the maximum load a system can handle for a given configuration. While those are important and often telling metrics, results for such benchmarks do not tell you the impact processing those packets has on the workloads running on that system.

This post looks at the cost of networking in terms of CPU cycles stolen from processes running in a host.

Packet Processing in Linux

Linux will process a fair amount of packets in the context of whatever is running on the CPU at the moment the irq is handled. System accounting will attribute those CPU cycles to any process running at that moment even though that process is not doing any work on its behalf. For example, 'top' can show a process appears to be using 99+% cpu but in reality 60% of that time is spent processing packets meaning the process is really only get 40% of the CPU to make progress on its workload.

net_rx_action, the handler for network Rx traffic, usually runs really fast – like under 25 usecs[1] – dealing with up to 64 packets per napi instance (NIC and RPS) at a time before deferring to another softirq cycle. softirq cycles can be back to back, up to 10 times or 2 msec (see __do_softirq), before taking a break. If the softirq vector still has more work to do after the maximum number of loops or time is reached, it defers further work to the ksoftirqd thread for that CPU. When that happens the system is a bit more transparent about the networking overhead in the sense that CPU usage can be monitored (though with the assumption that it is packet handling versus other softirqs).

One way to see the above description is using perf:

sudo perf record -a \
        -e irq:irq_handler_entry,irq:irq_handler_exit \
        -e irq:softirq_entry --filter="vec == 3" \
        -e irq:softirq_exit --filter="vec == 3"  \
        -e napi:napi_poll \
        -- sleep 1

sudo perf script

The output is something like:

swapper     0 [005] 176146.491879: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0
swapper     0 [005] 176146.491880:  irq:irq_handler_exit: irq=152 ret=handled
swapper     0 [005] 176146.491880:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491942:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 64 budget 64
swapper     0 [005] 176146.491943:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491943:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491971:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 27 budget 64
swapper     0 [005] 176146.491971:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.492200: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0

In this case the cpu is idle (hence swapper for the process), an irq fired for an Rx queue on CPU 5, softirq processing looped twice handling 64 packets and then 27 packets before exiting with the next irq firing 229 usec later and starting the loop again.

The above was recorded on an idle system. In general, any task can be running on the CPU in which case the above series of events plays out by interrupting that task, doing the irq/softirq dance and with system accounting attributing cycles to the interrupted process. Thus, processing packets is typically hidden from the usual CPU monitoring as it is done in the context of some random, victim process, so how do you view or quantify the time a process is interrupted handling packets? And how can you compare 2 different networking solutions to see which one is less disruptive to a workload?

With RSS, RPS, and flow steering, packet processing is usually distributed across cores, so the packet processing sequence describe above is all per-CPU. As packet rates increase (think 100,000 pps and up) the load means 1000's to 10,000's of packets are processed per second per cpu. Processing that many packets will inevitably have an impact on the workloads running on those systems.

Let's take a look at one way to see this impact.

Undo the Distributed Processing

First, let's undo the distributed processing by disabling RPS and installing flow rules to force the processing of all packets for a specific MAC address on a single, known CPU. My system has 2 nics enslaved to a bond in an 802.3ad configuration with the networking load targeted at a single virtual machine running in the host.

RPS is disabled on the 2 nics using

for d in eth0 eth1; do
    find /sys/class/net/${d}/queues -name rps_cpus |
    while read f; do
            echo 0 | sudo tee ${f}
    done
done

Next, add flow rules to push packets for the VM under test to a single CPU

DMAC=12:34:de:ad:ca:fe
sudo ethtool -N eth0 flow-type ether dst ${DMAC} action 2
sudo ethtool -N eth1 flow-type ether dst ${DMAC} action 2

Together, lack of RPS + flow rules ensure all packets destined to the VM are processed on the same CPU. You can use a command like ethq[3] to verify packets are directed to the expected queue and then map that queue to a CPU using /proc/interrupts. In my case queue 2 is handled on CPU 5.

openssl speed

I could use perf or a bpf program to track softirq entry and exit for network Rx, but that gets complicated quick, and the observation will definitely influence the results. A much simpler and more intuitive solution is to infer the networking overhead using a well known workload such as 'openssl speed' and look at how much CPU access it really gets versus is perceived to get (recognizing the squishiness of process accounting).

'openssl speed' is a nearly 100% userspace command and when pinned to a CPU will use all available cycles for that CPU for the duration of its tests. The command works by setting an alarm for a given interval (e.g., 10 seconds here for easy math), launches into its benchmark and then uses times() when the alarm fires as a way of checking how much CPU time it was actually given. From a syscall perspective it looks like this:

alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726601344
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 2782545353
times({tms_utime=1000, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726602344

so very few system calls between the alarm and checking the results of times(). With no/few interruptions the tms_utime will match the test time (10 seconds in this case).

Since it is is a pure userspace benchmark ANY system time that shows up in times() is overhead. openssl may be the process on the CPU, but the CPU is actually doing something else, like processing packets. For example:

alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726617896
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 4079301579
times({tms_utime=178, tms_stime=571, tms_cutime=0, tms_cstime=0}) = 1726618896

shows that openssl was on the cpu for 7.49 seconds (178 + 571 in .01 increments), but 5.71 seconds of that time was in system time. Since openssl is not doing anything in the kernel, that 5.71 seconds is all overhead – time stolen from this process for “system needs.”

Using openssl to Infer Networking Overhead

With an understanding of how 'openssl speed' works, let's look at a near idle server:

$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc >/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 66675623 aes-256 cbc's in 9.99s
Doing aes-256 cbc for 10s on 64 size blocks: 18096647 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 256 size blocks: 4607752 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 1024 size blocks: 1162429 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 8192 size blocks: 145251 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 16384 size blocks: 72831 aes-256 cbc's in 10.00s

so in this case openssl reports 9.99 to 10.00 seconds of run time for each of the block sizes confirming no contention for the CPU. Let's add network load, netperf TCP_STREAM from 2 sources, and re-do the test:

$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc >/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 12061658 aes-256 cbc's in 1.96s
Doing aes-256 cbc for 10s on 64 size blocks: 3457491 aes-256 cbc's in 2.10s
Doing aes-256 cbc for 10s on 256 size blocks: 893939 aes-256 cbc's in 2.01s
Doing aes-256 cbc for 10s on 1024 size blocks: 201756 aes-256 cbc's in 1.86s
Doing aes-256 cbc for 10s on 8192 size blocks: 25117 aes-256 cbc's in 1.78s
Doing aes-256 cbc for 10s on 16384 size blocks: 13859 aes-256 cbc's in 1.89s

Much different outcome. Each block size test wants to run for 10 seconds, but times() is reporting the actual user time to be between 1.78 and 2.10 seconds. Thus, the other 7.9 to 8.22 seconds was spent processing packets – either in the context of openssl or via ksoftirqd.

Looking at top for the previous openssl run:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P 
 8180 libvirt+  20   0 33.269g 1.649g 1.565g S 279.9  0.9  18:57.81 qemu-system-x86     75
 8374 root      20   0       0      0      0 R  99.4  0.0   2:57.97 vhost-8180          89
 1684 dahern    20   0   17112   4400   3892 R  73.6  0.0   0:09.91 openssl              5    
   38 root      20   0       0      0      0 R  26.2  0.0   0:31.86 ksoftirqd/5          5

one would think openssl is using ~73% of cpu 5 with ksoftirqd taking the rest but in reality so many packets are getting processed in the context of openssl that it is only effectively getting 18-21% time on the cpu to make progress on its workload.

If I drop the network load to just 1 stream, openssl appears to be running at 99% CPU:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P
 8180 libvirt+  20   0 33.269g 1.722g 1.637g S 325.1  0.9 166:38.12 qemu-system-x86     29
44218 dahern    20   0   17112   4488   3996 R  99.2  0.0   0:28.55 openssl              5
 8374 root      20   0       0      0      0 R  64.7  0.0  60:40.50 vhost-8180          55
   38 root      20   0       0      0      0 S   1.0  0.0   4:51.98 ksoftirqd/5          5

but openssl reports ~4 seconds of userspace time:

Doing aes-256 cbc for 10s on 16 size blocks: 26596388 aes-256 cbc's in 4.01s
Doing aes-256 cbc for 10s on 64 size blocks: 7137481 aes-256 cbc's in 4.14s
Doing aes-256 cbc for 10s on 256 size blocks: 1844565 aes-256 cbc's in 4.31s
Doing aes-256 cbc for 10s on 1024 size blocks: 472687 aes-256 cbc's in 4.28s
Doing aes-256 cbc for 10s on 8192 size blocks: 59001 aes-256 cbc's in 4.46s
Doing aes-256 cbc for 10s on 16384 size blocks: 28569 aes-256 cbc's in 4.16s

Again, monitoring tools show a lot of CPU access, but reality is much different with 55-80% of the CPU spent processing packets. The throughput numbers look great (22+Gbps for a 25G link), but the impact on processes is huge.

In this example, the process robbed of CPU cycles is a silly benchmark. On a fully populated host the interrupted process can be anything – virtual cpus for a VM, emulator threads for the VM, vhost threads for the VM, or host level system processes with varying degrees of impact on performance of those processes and the system.

Up Next

This post is the basis for a follow up post where I will discuss a comparison of the overhead of “full stack” on a VM host to “XDP”.

[1] Measured using ebpf program on entry and exit. See net_rx_action in https://github.com/dsahern/bpf-progs

[2] Assuming no bugs in the networking stack and driver. I have examined systems where net_rx_action takes well over 20,000 usec to process less than 64 packets due to a combination of bugs in the NIC driver (ARFS path) and OVS (thundering herd wakeup).

[3] https://github.com/isc-projects/ethq

 
Read more...

from joelfernandes

GUS is a memory reclaim algorithm used in FreeBSD, similar to RCU. It is borrows concepts from Epoch and Parsec. A video of a presentation describing the integration of GUS with UMA (FreeBSD's slab implementation) is here: https://www.youtube.com/watch?v=ZXUIFj4nRjk

The best description of GUS is in the FreeBSD code itself. It is based on the concept of global write clock, with readers catching up to writers.

Effectively, I see GUS as an implementation of light traveling from distant stars. When a photon leaves a star, it is no longer needed by the star and is ready to be reclaimed. However, on earth we can't see the photon yet, we can only see what we've been shown so far, and in a way, if we've not seen something because enough “time” has not passed, then we may not reclaim it yet. If we've not seen something, we will see it at some point in the future. Till then we need to sit tight.

Roughly, an implementation has 2+N counters (with N CPUs): 1. Global write sequence. 2. Global read sequence. 3. Per-cpu read sequence (read from #1 when a reader starts)

On freeing, the object is tagged with the write sequence. Only once global read sequence has caught up with global write sequence, the object is freed. Until then, the free'ing is deferred. The poll() operation updates #2 by referring to #3 of all CPUs. Whatever was tagged between the old read sequence and new read sequence can be freed. This is similar to synchronize_rcu() in the Linux kernel which waits for all readers to have finished observing the object being reclaimed.

Note the scalability drawbacks of this reclaim scheme:

  1. Expensive poll operation if you have 1000s of CPUs. (Note: Parsec uses a tree-based mechanism to improve the situation which GUS could consider)

  2. Heavy-weight memory barriers are needed (SRCU has a similar drawback) to ensure ordering properties of reader sections with respect to poll() operation.

  3. There can be a delay between reading the global write-sequence number and writing it into the per-cpu read-sequence number. This can cause the per-cpu read-sequence to advance past the global write-sequence. Special handling is needed.

One advantage of the scheme could be implementation simplicity.

RCU (not SRCU or Userspace RCU) doesn't suffer from these drawbacks. Reader-sections in Linux kernel RCU are extremely scalable and lightweight.

 
Read more...

from David Ahern

Running docker service over management VRF requires the service to be started bound to the VRF. Since docker and systemd do not natively understand VRF, the vrf exec helper in iproute2 can be used.

This series of steps worked for me on Ubuntu 19.10 and should work on 18.04 as well:

  • Configure mgmt VRF and disable systemd-resolved as noted in a previous post about management vrf and DNS

  • Install docker-ce

  • Edit /lib/systemd/system/docker.service and add /usr/sbin/ip vrf exec mgmt to the Exec lines like this:

    ExecStart=/usr/sbin/ip vrf exec mgmt /usr/bin/dockerd -H fd://
    --containerd=/run/containerd/containerd.sock
    
  • Tell systemd about the change and restart docker

    systemctl daemon-reload
    systemctl restart docker
    

With that, docker pull should work fine – in mgmt vrf or default vrf.

 
Read more...

from David Ahern

Someone recently asked me why apt-get was not working when he enabled management VRF on Ubuntu 18.04. After a few back and forths and a little digging I was reminded of why. The TL;DR is systemd-resolved. This blog post documents how I came to that conclusion and what you need to do to use management VRF with Ubuntu (or any OS using a DNS caching service such as systemd-resolved).

The following example is based on a newly created Ubuntu 18.04 VM. The VM comes up with the 4.15.0-66-generic kernel which is missing the VRF module:

$ modprobe vrf
modprobe: FATAL: Module vrf not found in directory /lib/modules/4.15.0-66-generic

despite VRF being enabled and built:

$ grep VRF /boot/config-4.15.0-66-generic
CONFIG_NET_VRF=m

which is really weird.[4] So for this blog post I shifted to the v5.3 HWE kernel: $ sudo apt-get install --install-recommends linux-generic-hwe-18.04

although nothing about the DNS problem is kernel specific. A 4.14 or better kernel with VRF enabled and usable will work.

First, let's enable Management VRF. All of the following commands need to be run as root. For simplicity getting started, you will want to enable this sysctl to allow sshd to work across VRFs:

    echo "net.ipv4.tcp_l3mdev_accept=1" >> /etc/sysctl.d/99-sysctl.conf
    sysctl -p /etc/sysctl.d/99-sysctl.conf

Advanced users can leave that disabled and use something like the systemd instances to run sshd in Management VRF only.[1]

Ubuntu has moved to netplan for network configuration, and apparently netplan is still missing VRF support despite requests from multiple users since May 2018: https://bugs.launchpad.net/netplan/+bug/1773522

One option to workaround the problem is to put the following in /etc/networkd-dispatcher/routable.d/50-ifup-hooks:

#!/bin/bash

ip link show dev mgmt 2>/dev/null
if [ $? -ne 0 ]
then
        # capture default route
        DEF=$(ip ro ls default)

        # only need to do this once
        ip link add mgmt type vrf table 1000
        ip link set mgmt up
        ip link set eth0 vrf mgmt
        sleep 1

        # move the default route
        ip route add vrf mgmt ${DEF}
        ip route del default

        # fix up rules to look in VRF table first
        ip ru add pref 32765 from all lookup local
        ip ru del pref 0
        ip -6 ru add pref 32765 from all lookup local
        ip -6 ru del pref 0
fi
ip route del default

The above assumes eth0 is the nic to put into Management VRF, and it has a static IP address. If using DHCP instead of a static route, create or update the dhclient-exit-hook to put the default route in the Management VRF table.[3] Another option is to use ifupdown2 for network management; it has good support for VRF.[1]

Reboot node to make the changes take effect.

WARNING: If you run these commands from an active ssh session, you will lose connectivity since you are shifting the L3 domain of eth0 and that impacts existing logins. You can avoid the reboot by running the above commands from console.

After logging back in to the node with Management VRF enabled, the first thing to remember is that when VRF is enabled network addresses become relative to the VRF – and that includes loopback addresses (they are not that special).

Any command that needs to contact a service over the Management VRF needs to be run in that context. If the command does not have native VRF support, then you can use 'ip vrf exec' as a helper to do the VRF binding. 'ip vrf exec' uses a small eBPF program to bind all IPv4 and IPv6 sockets opened by the command to the given device ('mgmt' in this case) which causes all routing lookups to go to the table associated with the VRF (table 1000 per the setting above).

Let's see what happens:

$ ip vrf exec mgmt apt-get update
Err:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'
Err:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
  Temporary failure resolving 'security.ubuntu.com'
Err:3 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'
Err:4 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease
  Temporary failure resolving 'mirrors.digitalocean.com'

Theoretically, this should Just Work, but it clearly does not. Why?

Ubuntu uses systemd-resolved service by default with /etc/resolv.conf configured to send DNS lookups to it:

$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Oct 21 15:48 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

$ cat /etc/resolv.conf
...
nameserver 127.0.0.53
options edns0

So when a process (e.g., apt) does a name lookup, the message is sent to 127.0.0.53/53. In theory, systemd-resolved gets the request and attempts to contact the actual nameserver.

Where does the theory breakdown? In 3 places.

First, 127.0.0.53 is not configured for Management VRF, so attempts to reach it fail:

$ ip ro get vrf mgmt 127.0.0.53
127.0.0.53 via 157.245.160.1 dev eth0 table 1000 src 157.245.160.132 uid 0
    cache

That one is easy enough to fix. The VRF device is meant to be the loopback for a VRF, so let's add the loopback addresses to it:

    $ ip addr add dev mgmt 127.0.0.1/8
    $ ip addr add dev mgmt ::1/128

    $ ip ro get vrf mgmt 127.0.0.53
    127.0.0.53 dev mgmt table 1000 src 127.0.0.1 uid 0
        cache

The second problem is systemd-resolved binds its socket to the loopback device:

    $ ss -apn | grep systemd-resolve
    udp  UNCONN   0      0       127.0.0.53%lo:53     0.0.0.0:*     users:(("systemd-resolve",pid=803,fd=12))
    tcp  LISTEN   0      128     127.0.0.53%lo:53     0.0.0.0:*     users:(("systemd-resolve",pid=803,fd=13))

The loopback device is in the default VRF and can not be moved to Management VRF. A process bound to the Management VRF can not communicate with a socket bound to the loopback device.

The third issue is that systemd-resolved runs in the default VRF, so its attempts to reach the real DNS server happen over the default VRF. Those attempts fail since the servers are only reachable from the Management VRF and systemd-resolved has no knowledge of it.

Since systemd-resolved is hardcoded (from a quick look at the source) to bind to the loopback device, there is no option but to disable it. It is not compatible with Management VRF – or VRF at all.

$ rm /etc/resolv.conf
$ grep nameserver /run/systemd/resolve/resolv.conf > /etc/resolv.conf
$ systemctl stop systemd-resolved.service
$ systemctl disable systemd-resolved.service

With that it works as expected:
$ ip vrf exec mgmt apt-get update
Get:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease [242 kB]
Get:2 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:5 http://mirrors.digitalocean.com/ubuntu bionic-updates/universe amd64 Packages [1054 kB]

When using Management VRF, it is convenient (ie., less typing) to bind the shell to the VRF and let all commands run by it inherit the VRF binding: $ ip vrf exec mgmt su – dsahern

Now all commands run will automatically use Management VRF. This can be done at login using libpamscript[2].

Personally, I like a reminder about the network bindings in my bash prompt. I do that by adding the following to my .bashrc:

NS=$(ip netns identify)
[ -n "$NS" ] && NS=":${NS}"

VRF=$(ip vrf identify)
[ -n "$VRF" ] && VRF=":${VRF}"

And then adding '${NS}${VRF}' after the host in PS1:

PS1='${debian_chroot:+($debian_chroot)}\u@\h${NS}${VRF}:\w\$ '

For example, now the prompt becomes: dsahern@myhost:mgmt:~$

References:

[1] VRF tutorial, Open Source Summit, North America, Sept 2017 http://schd.ws/hosted_files/ossna2017/fe/vrf-tutorial-oss.pdf

[2] VRF helpers, e.g., systemd instances for VRF https://github.com/CumulusNetworks/vrf

[3] Example using VRF in dhclient-exit-hook https://github.com/CumulusNetworks/vrf/blob/master/etc/dhcp/dhclient-exit-hooks.d/vrf

[4] Vincent Bernat informed me that some modules were moved to non-standard package; installing linux-modules-extra-$(uname -r)-generic provides the vrf module. Thanks Vincent.

 
Read more...

from joelfernandes

The SRCU flavor of RCU uses per-cpu counters to detect that every CPU has passed through a quiescent state for a particular SRCU lock instance (srcu_struct).

There's are total of 4 counters per-cpu. One pair for locks, and another for unlocks. You can think of the SRCU instance to be split into 2 parts. The readers sample srcu_idx and decided which part to use. Each part corresponds to one pair of lock and unlock counters. A reader increments a part's lock counter during locking and likewise for unlock.

During an update, the updater flips srcu_idx (thus attempting to force new readers to use the other part) and waits for the lock/unlock counters on the previous value of srcu_idx to match. Once the sum of the lock counters of all CPUs match that of unlock, the system knows all pre-existing read-side critical sections have completed.

Things are not that simple, however. It is possible that a reader samples the srcu_idx, but before it can increment the lock counter corresponding to it, it undergoes a long delay. We thus we end up in a situation where there are readers in both srcu_idx = 0 and srcu_idx = 1.

To prevent such a situation, a writer has to wait for readers corresponding to both srcu_idx = 0 and srcu_idx = 1 to complete. This depicted with 'A MUST' in the below pseudo-code:

        reader 1        writer                        reader 2
        -------------------------------------------------------
        // read_lock
        // enter
        Read: idx = 0;
        <long delay>    // write_lock
                        // enter
                        wait_for lock[1]==unlock[1]
                        idx = 1; /* flip */
                        wait_for lock[0]==unlock[0]
                        done.
                                                      Read: idx = 1;
        lock[0]++;
                                                      lock[1]++;
                        // write_lock
                        // return
        // read_lock
        // return
        /**** NOW BOTH lock[0] and lock[1] are non-zero!! ****/
                        // write_lock
                        // enter
                        wait_for lock[0]==unlock[0] <- A MUST!
                        idx = 0; /* flip */
                        wait_for lock[1]==unlock[1] <- A MUST!

NOTE: QRCU has a similar issue. However it overcomes such a race in the reader by retrying the sampling of its 'srcu_idx' equivalent.

Q: If you have to wait for readers of both srcu_idx = 0, and 1, then why not just have a single counter and do away with the “flipping” logic? Ans: Because of updater forward progress. If we had a single counter, then it is possible that new readers would constantly increment the lock counter, thus updaters would be waiting all the time. By using the 'flip' logic, we are able to drain pre-existing readers using the inactive part of srcu_idx to be drained in a bounded time. The number of readers of a 'flipped' part would only monotonically decrease since new readers go to its counterpart.

 
Read more...

from paulmck

It is quite easy to make your email agent (mutt in my case) send directly to mail.kernel.org, but this can result in inconvenient delays when you have a low-bandwidth Internet connection, and, worse yet, abject failure when you have no Internet connection at all. (Yes, I was born before the turn of the millennium. Why do you ask?)

One way to avoid these problems is to set up an email server on your laptop. This email server will queue your email when Internet is slow or completely inaccessible, and will automatically transmit your email as appropriate. There are several email servers to choose from, but I chose postfix.

First, you need to install postfix, for example, sudo apt install postfix. I have generally selected the satellite option when it asks, but there seems to be a number of different opinions expressed at different web sites.

You need to tell postfix to talk to mail.kernel.org in your /etc/postfix/main.cf file:

relayhost = [mail.kernel.org]:587

I followed linode's advice on setting up encryption, passwords, and so on, by tacking the following onto the end of my /etc/postfix/main.cf file:

# enable SASL authentication smtp_sasl_auth_enable = yes # disallow methods that allow anonymous authentication. smtp_sasl_security_options = noanonymous # where to find sasl_passwd smtp_sasl_password_maps = hash:/etc/postfix/sasl_password # Enable STARTTLS encryption smtp_use_tls = yes # where to find CA certificates smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt

But this requires setting up /etc/postfix/sasl_password with a single line:

[mail.kernel.org]:587 paulmck:fake-password

You will need to replace my kernel.org email and password with yours, obviously. Then you need to create the /etc/postfix/sasl_password.db file:

sudo postmap /etc/postfix/sasl_passwd

There was some difference of opinion across Internet as to what the ownership and permisssions of /etc/postfix/sasl_password and /etc/postfix/sasl_password.db should be, with some people arguing for maximum security via user and group both being set to root and permissions set to 0600. In my case, this was indeed secure, so much so that postfix failed to transmit any of my email, making some lame complaint about being unable to read /etc/postfix/sasl_password.db. Despite the compelling security benefits of this approach, I eventually elected to use user and group of postfix and permissions of 0640 for /etc/postfix/sasl_password.db, intentionally giving up a bit of security in favor of email actually being transmitted. :–)

I left /etc/postfix/sasl_passwd with user and group of root and mode of 0600. Somewhat pointlessly, given that its information can be easily extracted from etc/postfix/sasl_password.db by anyone with permission to read that file.

And anytime you change any postfix configuration, you need to tell postfix about it, for example:

sudo postfix reload

You might well need to make other adjustments to your postfix configuration. To that end, I strongly suggest testing your setup by sending a test email or three! ;–)

The mailq command will list queued email along with the reason why it has not yet been transmitted.

The sudo postfix flush command gives postfix a hint that now would be an excellent time for it to attempt to transmit queued email, for example, when you have an Internet connection only for a short time.

 
Read more...

from joelfernandes

The Message Passing pattern (MP pattern) is shown in the snippet below (borrowed from LKMM docs). Here, P0 and P1 are 2 CPUs executing some code. P0 stores a message in buf and then signals to consumers like P1 that the message is available — by doing a store to flag. P1 reads flag and if it is set, knows that some data is available in buf and goes ahead and reads it. However, if flag is not set, then P1 does nothing else. Without memory barriers between P0's stores and P1's loads, the stores can appear out of order to P1 (on some systems), thus breaking the pattern. The condition r1 == 0 and r2 == 1 is a failure in the below code and would violate the condition. Only after the flag variable is updated, should P1 be allowed to read the buf (“message”).

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2 = 0;

                r1 = READ_ONCE(flag);
                if (r1)
                        r2 = READ_ONCE(buf);
        }

Below is a simple program in PlusCal to model the “Message passing” access pattern and check whether the failure scenario r1 == 0 and r2 == 1 could ever occur. In PlusCal, we can model the non deterministic out-of-order stores to buf and flag using an either or block. This makes PlusCal evaluate both scenarios of stores (store to buf first and then flag, or viceversa) during model checking. The technique used for modeling this non-determinism is similar to how it is done in Promela/Spin using an “if block” (Refer to Paul McKenney's perfbook for details on that).

EXTENDS Integers, TLC
(*--algorithm mp_pattern
variables
    buf = 0,
    flag = 0;

process Writer = 1
variables
    begin
e0:
       either
e1:        buf := 1;
e2:        flag := 1;
        or
e3:        flag := 1;
e4:        buf := 1;
        end either;
end process;

process Reader = 2
variables
    r1 = 0,
    r2 = 0;  
    begin
e5:     r1 := flag;
e6:     if r1 = 1 then
e7:         r2 := buf;
        end if;
e8:     assert r1 = 0 \/ r2 = 1;
end process;

end algorithm;*)

Sure enough, the assert r1 = 0 \/ r2 = 1; fires when the PlusCal program is run through the TLC model checker.

I do find the either or block clunky, and wish I could just do something like:

non_deterministic {
        buf := 1;
        flag := 1;
}

And then, PlusCal should evaluate both store orders. In fact, if I wanted more than 2 stores, then it can get crazy pretty quickly without such a construct. I should try to hack the PlusCal sources soon if I get time, to do exactly this. Thankfully it is open source software.

Other notes:

  • PlusCal is a powerful language that translates to TLA+. TLA+ is to PlusCal what assembler is to C. I do find PlusCal's syntax to be non-intuitive but that could just be because I am new to it. In particular, I hate having to mark statements with labels if I don't want them to atomically execute with neighboring statements. In PlusCal, a label is used to mark a statement as an “atomic” entity. A group of statements under a label are all atomic. However, if you don't specific labels on every statement like I did above (eX), then everything goes under a neighboring label. I wish PlusCal had an option, where a programmer could add implict labels to all statements, and then add explicit atomic { } blocks around statements that were indeed atomic. This is similar to how it is done in Promela/Spin.

  • I might try to hack up my own compiler to TLA+ if I can find the time to, or better yet modify PlusCal itself to do what I want. Thankfully the code for the PlusCal translator is open source software.

 
Read more...

from Benson Leung

tl;dr: There are now 8. Thunderbolt 3 cables officially count too. It's getting hard to manage, but help is on the way.

Edited lightly 09-16-2019: Tables 3-1 and 5-1 from USB Type-C Spec reproduced as tables instead of images. Made an edit to clarify that Thunderbolt 3 passive cables have always been compliant USB-C cables.

If you recall my first cable post, there were 6 kinds of cables with USB-C plugs on both ends. I was also careful to preface that it was true as of USB Type-C™ Specification 1.4 on June 2019.

Last week, the USB-IF officially published the USB Type-C™ Specification Version Revision 2.0, August 29, 2019.

This is a major update to USB-C and contains required amendments to support the new USB4™ Spec.

One of those amendments? Introducing a new data rate, 20Gbps per lane, or 40Gbps total. This is called “USB4 Gen 3” in the new spec. One more data rate means the matrix of cables increases by a row, so we now have 8 C-to-C cable kinds, see Table 3-1:

Table 3-1 USB Type-C Standard Cable Assemblies

Cable Ref Plug 1 Plug 2 USB Version Cable Length Current Rating USB Power Delivery USB Type-C Electronically Marked
CC2-3 C C USB 2.0 ≤ 4 m 3 A Supported Optional
CC2-5 5 A Required
CC3G1-3 C C USB 3.2 Gen1 and USB4 Gen2 ≤ 2 m 3 A Supported Required
CC3G1-5 5 A
CC3G2-3 C C USB 3.2 Gen2 and USB4 Gen2 ≤ 1 m 3 A Supported Required
CC3G2-5 5 A
CC3G3-3 C C USB4 Gen3 ≤ 0.8 m 3 A Supported Required
CC3G3-5 5 A

Listed, with new cables in bold: 1. USB 2.0 rated at 3A 2. USB 2.0 rated at 5A 3. USB 3.2 Gen 1 rated at 3A 4. USB 3.2 Gen 1 rated at 5A 5. USB 3.2 Gen 2 rated at 3A 6. USB 3.2 Gen 2 rated at 5A 7. USB4 Gen 3 rated at 3A 8. USB4 Gen 3 rated at 5A

New cables 7 and 8 have the same number of wires as cables 3 through 6, but are built to tolerances such that they can sustain 20Gbps per set of differential pairs, or 40Gbps for the whole cable. This is the maximum data rate in the USB4 Spec.

Also, please notice in the table above that (informative) maximum cable length shrinks as speed increases. Gen 1 cables can be 2M long, while Gen 3 cables can be 0.8m. This is just a practical consequence of physics and signal integrity when it comes to passive cables.

Data Rates

Data rates require some explanation too, as advancements since USB 3.1 means that the same physical cable is capable of way more when used in a USB4 system.

A USB 3.1 Gen 1 cable built and sold in 2015 would have been advertised to support 5Gbps operation in 2015. Fast forward to 2019 or 2020, that exact same physical cable (Gen 1), will actually allow you to hit 20gbps using USB4. This is due to advancements in the underlying phy on the host and client-side, but also because USB4 uses all 8 SuperSpeed wires simultaneously, while USB 3.1 only used 4 (single lane operation versus dual-lane operation).

The same goes for USB 3.1 Gen 2 cables, which would have been sold as 10gbps cables. They are able to support 20gbps operation in USB4, again, because of dual-lane.

Table 5-1 Certified Cables Where USB4-compatible Operation is Expected

Cable Signaling USB4 Operation Notes
USB Type-C Full-Featured Cables (Passive) USB 3.2 Gen1 20 Gbps This cable will indicate support for USB 3.2 Gen1 (001b) in the USB Signaling field of its Passive Cable VDO response. Note: even though this cable isn’t explicitly tested, certified or logo’ed for USB 3.2 Gen2 operation, USB4 Gen2 operation will generally work.
USB 3.2 Gen2 (USB4 Gen2) 20 Gbps This cable will indicate support for USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response.
USB4 Gen3 40 Gbps This cable will indicate support for USB4 Gen3 (011b) in the USB Signaling field of its Passive Cable VDO response.
Thunderbolt™ 3 Cables (Passive) TBT3 Gen2 20 Gbps This cable will indicate support for USB 3.2 Gen1 (001b) or USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response.
TBT3 Gen3 40 Gbps In addition to indicating support for USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response, this cable will indicate that it supports TBT3 Gen3 in the Discover Mode VDO response.
USB Type-C Full-Featured Cables (Active) USB4 Gen2 20 Gbps This cable will indicate support for USB4 Gen2 (010b) in the USB Signaling field of its Active Cable VDO response.
USB4 Gen3 40 Gbps This cable will indicate support for USB4 Gen3 (011b) in the USB Signaling field of its Active Cable VDO response.

What about Thunderbolt 3 cables? Thunderbolt 3 cables physically look the same as a USB-C to USB-C cable and the passive variants of the cables comply with the existing USB-C spec and are to be regarded as USB-C cables of kinds 3 through 6. In addition to being compliant USB-C cables, Intel needed a way to mark some of their cables as 40Gbps capable, years before USB-IF defined the Gen 3 40gbps data rate level. They did so using extra alternate mode data objects in the Thunderbolt 3 cables' electronic marker, amounting to extra registers that mark the cable as high speed capable.

The good news is that since Intel decided to open up the Thunderbolt 3 spec, the USB-IF was able to completely take in and make Passive 20Gbps and 40Gbps Thunderbolt 3 cables supported by USB4 devices. A passive 40Gbps TBT3 cable you bought in 2016 or 2017 will just work at 40Gbps on a USB4 device in 2020.

How Linux USB PD and USB4 systems can help identify cables for users

By now, you are likely ever so confused by this mess of cable and data rate possibilities. The fact that I need a matrix and a decoder ring to explain the landscape of USB-C cables is a bad sign.

In the real world, your average user will pick a cable and will simply not be able to determine the capabilities of the cable by looking at it. Even if the cable has the appropriate logo to distinguish them, not every user will understand what the hieroglyphs mean.

Software, however, and Power Delivery may very well help with this. I've been looking very closely at the kernel's USB Type-C Connector Class.

The connector class creates the following structure in sysfs, populating these nodes with important properties queried from the cable, the USB-C port, and the port's partner:

/sys/class/typec/
/sys/class/typec/port0 <---------------------------Me
/sys/class/typec/port0/port0-partner/ <------------My Partner
/sys/class/typec/port0/port0-cable/ <--------------Our Cable
/sys/class/typec/port0/port0-cable/port0-plug0 <---Cable SOP'
/sys/class/typec/port0/port0-cable/port0-plug1 <---Cable SOP"

You may see where I'm going from here. Once user space is able to see what the cable and its e-marker chip has advertised, an App or Settings panel in the OS could tell the user what the cable is, and hopefully in clear language what the cable can do, even if the cable is unlabeled, or the user doesn't understand the obscure logos.

Lots of work remains here. The present Type-C Connector class needs to be synced with the latest version of the USB-C and PD spec, but this gives me hope that users will have a tool (any USB-C phone with PD) in their pocket to quickly identify cables.

 
Read more...

from tglx

E-Mail interaction with the community

You might have been referred to this page with a form letter reply. If so the form letter has been sent to you because you sent e-mail in a way which violates one or more of the common rules of email communication in the context of the Linux kernel or some other Open Source project.

Private mail

Help from the community is provided as a free of charge service on a best effort basis. Sending private mail to maintainers or developers is pretty much a guarantee for being ignored or redirected to this page via a form letter:

  • Private e-mail does not scale Maintainers and developers have limited time and cannot answer the same questions over and over.

  • Private e-mail is limiting the audience Mailing lists allow people other than the relevant maintainers or developers to answer your question. Mailing lists are archived so the answer to your question is available for public search and helps to avoid the same question being asked again and again. Private e-mail is also limiting the ability to include the right experts into a discussion as that would first need your consent to give a person who was not included in your Cc list access to the content of your mail and also to your e-mail address. When you post to a public mailing list then you already gave that consent by doing so. It's usually not required to subscribe to a mailing list. Most mailing lists are open. Those which are not are explicitly marked so. If you send e-mail to an open list the replies will have you in Cc as this is the general practice.

  • Private e-mail might be considered deliberate disregard of documentation The documentation of the Linux kernel and other Open Source projects gives clear advice how to contact the community. It's clearly spelled out that the relevant mailing lists should always be included. Adding the relevant maintainers or developers to CC is good practice and usually helps to get the attention of the right people especially on high volume mailing lists like LKML.

  • Corporate policies are not an excuse for private e-mail If your company does not allow you to post on public mailing lists with your work e-mail address, please go and talk to your manager.

Confidentiality disclaimers

When posting to public mailing lists the boilerplate confidentiality disclaimers are not only meaningless, they are absolutely wrong for obvious reasons.

If that disclaimer is automatically inserted by your corporate e-mail infrastructure, talk to your manager, IT department or consider to use a different e-mail address which is not affected by this. Quite some companies have dedicated e-mail infrastructure to avoid this problem.

Reply to all

Trimming Cc lists is usually considered a bad practice. Replying only to the sender of an e-mail immediately excludes all other people involved and defeats the purpose of mailing lists by turning a public discussion into a private conversation. See above.

HTML e-mail

HTML e-mail – even when it is a multipart mail with a corresponding plain/text section – is unconditionally rejected by mailing lists. The plain/text section of multipart HTML e-mail is generated by e-mail clients and often results in completely unreadable gunk.

Multipart e-mail

Again, use plain/text e-mail and not some magic format. Also refrain from attaching patches as that makes it impossible to reply to the patch directly. The kernel documentation contains elaborate explanations how to send patches.

Text mail formatting

Text-based e-mail should not exceed 80 columns per line of text. Consult the documentation of your e-mail client to enable proper line breaks around column 78.

Top-posting

If you reply to an e-mail on a mailing list do not top-post. Top-posting is the preferred style in corporate communications, but that does not make an excuse for it:

A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

A: Top-posting. Q: What is the most annoying thing in e-mail?

A: No. Q: Should I include quotations after my reply?

See also: http://daringfireball.net/2007/07/on_top

Trim replies

If you reply to an e-mail on a mailing list trim unneeded content of the e-mail you are replying to. It's an annoyance to have to scroll down through several pages of quoted text to find a single line of reply or to figure out that after that reply the rest of the e-mail is just useless ballast.

Quoting code

If you want to refer to code or a particular function then mentioning the file and function name is completely sufficient. Maintainers and developers surely do not need a link to a git-web interface or one of the source cross-reference sites. They are definitely able to find the code in question with their favorite editor.

If you really need to quote code to illustrate your point do not copy that from some random web interface as that turns again into unreadable gunk. Insert the code snippet from the source file and only insert the absolute minimum of lines to make your point. Again people are able to find the context on their own and while your hint might be correct in many cases the issue you are looking into is root caused at a completely different place.

Does not work for you?

In case you can't follow the rules above and the documentation of the Open Source project you want to communicate with, consider to seek professional help to solve your problem.

Open Source consultants and service providers charge for their services and therefore are willing to deal with HTML e-mail, disclaimers, top-posting and other nuisances of corporate style communications.

 
Read more...

from tglx

E-Mail interaction with the community

You might have been referred to this page with a form letter reply. If so the form letter has been sent to you because you sent e-mail in a way which violates one or more of the common rules of email communication in the context of the Linux kernel or some other Open Source project.

Private mail

Help from the community is provided as a free of charge service on a best effort basis. Sending private mail to maintainers or developers is pretty much a guarantee for being ignored or redirected to this page via a form letter:

  • Private e-mail does not scale Maintainers and developers have limited time and cannot answer the same questions over and over.

  • Private e-mail is limiting the audience Mailing lists allow people other than the relevant maintainers or developers to answer your question. Mailing lists are archived so the answer to your question is available for public search and helps to avoid the same question being asked again and again. Private e-mail is also limiting the ability to include the right experts into a discussion as that would first need your consent to give a person who was not included in your Cc list access to the content of your mail and also to your e-mail address. When you post to a public mailing list then you already gave that consent by doing so. It's usually not required to subscribe to a mailing list. Most mailing lists are open. Those which are not are explicitly marked so. If you send e-mail to an open list the replies will have you in Cc as this is the general practice.

  • Private e-mail might be considered deliberate disregard of documentation The documentation of the Linux kernel and other Open Source projects gives clear advice how to contact the community. It's clearly spelled out that the relevant mailing lists should always be included. Adding the relevant maintainers or developers to CC is good practice and usually helps to get the attention of the right people especially on high volume mailing lists like LKML.

  • Corporate policies are not an excuse for private e-mail If your company does not allow you to post on public mailing lists with your work e-mail address, please go and talk to your manager.

Confidentiality disclaimers

When posting to public mailing lists the boilerplate confidentiality disclaimers are not only meaningless, they are absolutely wrong for obvious reasons.

If that disclaimer is automatically inserted by your corporate e-mail infrastructure, talk to your manager, IT department or consider to use a different e-mail address which is not affected by this. Quite some companies have dedicated e-mail infrastructure to avoid this problem.

Reply to all

Trimming Cc lists is usually considered a bad practice. Replying only to the sender of an e-mail immediately excludes all other people involved and defeats the purpose of mailing lists by turning a public discussion into a private conversation. See above.

HTML e-mail

HTML e-mail – even when it is a multipart mail with a corresponding plain/text section – is unconditionally rejected by mailing lists. The plain/text section of multipart HTML e-mail is generated by e-mail clients and often results in completely unreadable gunk.

Multipart e-mail

Again, use plain/text e-mail and not some magic format. Also refrain from attaching patches as that makes it impossible to reply to the patch directly. The kernel documentation contains elaborate explanations how to send patches.

Text mail formatting

Text-based e-mail should not exceed 80 columns per line of text. Consult the documentation of your e-mail client to enable proper line breaks around column 78.

Top-posting

If you reply to an e-mail on a mailing list do not top-post. Top-posting is the preferred style in corporate communications, but that does not make an excuse for it:

A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

A: Top-posting. Q: What is the most annoying thing in e-mail?

A: No. Q: Should I include quotations after my reply?

See also: http://daringfireball.net/2007/07/on_top

Trim replies

If you reply to an e-mail on a mailing list trim unneeded content of the e-mail you are replying to. It's an annoyance to have to scroll down through several pages of quoted text to find a single line of reply or to figure out that after that reply the rest of the e-mail is just useless ballast.

Quoting code

If you want to refer to code or a particular function then mentioning the file and function name is completely sufficient. Maintainers and developers surely do not need a link to a git-web interface or one of the source cross-reference sites. They are definitely able to find the code in question with their favorite editor.

If you really need to quote code to illustrate your point do not copy that from some random web interface as that turns again into unreadable gunk. Insert the code snippet from the source file and only insert the absolute minimum of lines to make your point. Again people are able to find the context on their own and while your hint might be correct in many cases the issue you are looking into is root caused at a completely different place.

Does not work for you?

In case you can't follow the rules above and the documentation of the Open Source project you want to communicate with, consider to seek professional help to solve your problem.

Open Source consultants and service providers charge for their services and therefore are willing to deal with HTML e-mail, disclaimers, top-posting and other nuisances of corporate style communications.

 
Read more...

from mcgrof

I'm announcing the release of kdevops which aims at making setting up and testing the Linux kernel for any project as easy as possible. Note that setting up testing for a subsystem and testing a subsystem are two separate operations, however we strive for both. This is not a new test framework, it allows you to use existing frameworks, and set those frameworks up as easily can humanly be possible. It relies on a series of modern hip devops frameworks, it relies on ansible, vagrant and terraform, ansible roles through the Ansible Galaxy, and terraform modules.

Three example demo projects are released which demo it's use:

  • kdevops – skeleton generic example using linux-stable
  • fw-kdevops – used for testing firmware loading using linux-next. This example demo was written in about one hour tops by forking kdevops, trimming it, adding a new ansible galaxy for selftests. You are expected to be able to fork it and add your respective kernel selftest fork in a minute
  • oscheck – actively being used to test and advance the XFS filesystem for stable kernel releases. If you fork this to try to add support for testing a new filesystem under a new project, please let me know how long it took you to do that.

Fancy pictures in a nutshell

Of course you just want pictures and the ability to go home after seeing them. Should these be on instagram as well? Gosh.

A first run of kdevops

On a first run:

Running the bootlinux role on just one host

Example run of just running the ansible bootlinux role on just one host:

End of running the bootlinx ansible role on just one host

This shows what it looks like at the end of running the ansible bootlinux role after the host has booted into the new shiny kernel:

Logging into test test systems

Well, since we set up your ~/ssh/.config for you, all you gotta do now is just ssh in to the target host you want to test, it will already have the shiny new kernel installed and booted into it:

Motivations for kdevops

Below I'll document just a bit of the motivation behind this project. The documentation and demo projects should hopefully suffice for how to use all this.

Testing ain't easy, brah!

Getting contributors to your subsystem / driver in Linux is wonderful, however ensuring it doesn't break anything is a completely separate matter. It is my belief that testing a patch to ensure no testable regressions exist should be painless, and simple, however that has never been the case.

Testing frameworks ain't easy to setup, brah!

Linux kernel testing frameworks should also be really easy to set up. But that is typically never the case either. One example case of complexity in setting a test framework is fstests used to tests Linux kernel filesystems, and to ensure to the best of our ability that a new patch doesn't regress the kernel against a baseline. But wait, what is the baseline?

Setting up test systems ain't easy to ramp up, brah!

Another difficulty with testing the Linux kernel comes with the fact that you don't want to test the kernel on same kernel you're laptop is running on, otherwise you'd crash it, and if you're testing filesystems you may even end up corrupting your filesystem. So typically folks end up using virtualization technologies to setup virtual machines, boot into them, and then use the virtualized hosts as test vehicles. Another alternative is to use cloud service providers such as OpenStack, Azure, Amazon Web Services, Google Cloud Compute to create hosts on the cloud and use these instead. But I've heard complaints about how even setting up KVM can be complex, even from kernel developers! Even some kernel developers don't want to know how to set up a virtual environment to test things.

I hear ya, brah!

My litmus test for a full set up complexity is all the work required to setup fstests to test a Linux filesystem. If a solution for all the woes above were to ever be provided, I figured it'd have to allow to you easily setup fstests to test XFS without you doing much work.

I started looking into this effort first by trying to provide my own set of wrappers around KVM to let you easily setup KVM. Then I extended this effort to easily setup fstests. Both efforts were all shell hacks... It worked for me, but I was still not really happy with it all. It seemed hacky.

Ted Ts'o's xfstests-bld.git provided a cloud environment solution for using setting up fstests on Google Cloud Compute for ext filesystemes (ext2, ext3, ext4), however I was not satisfied with this given I wanted it easy to allow you to test any filesystem, and be Cloud provider agnostic.

ansible provides a proper replacement for shell hacks, in a distribution agnostic manner, and even OS agnostic manner. Vagrant lets me replace all those terrible original bash hacks to setup KVM with simple elegant descriptions of what I want a set of target set of hosts to look like. It also lets me support not only KVM but also Virtualbox, and even support Mac OS X. Terraform accomplishes the same but for cloud environments, and supports different providers.

Feedback and rants welcomed

So, give the repositories a shot, I welcome feedback and rants.

kdevops is intended to be used as the de-facto example for all of the ansible roles, and terraform modules.

fw-kdevops is intended to be forked by folks wanting a simple two host test setup where all you need is linux-next and to run selftests.

oscheck is already actively used to help advance XFS on the stable kernel releases, and is intended to be forked by folks who want to use fstests to test any filesystem on any kernel release.

 
Read more...

from Greg Kroah-Hartman

As I had this asked to me 3 times today (once in irc, and twice in email), no, the 5.3 kernel release is NOT the next planned Long Term Supported (LTS) release.

I've been saying for a few years now that I would pick the “last released” kernel of the year to be the next LTS release. And as per the wonderful pointy-hair-crystal-ball, that looks to be the 5.4 kernel release this year.

So, count on it being 5.4, unless something really bad happens in that release, such as people throwing in loads of crud because they “need” it for the LTS release. If that happens again, I'll just have to pick a different release...

 
Read more...

from paulmck

Back in the day, CPU hotplug was an optional and rarely used feature of Linux kernels built for SMP systems. However, increasing numbers of popular CPU families prohibit building Linux kernels with both CONFIG_CPU_HOTPLUG=n and CONFIG_SMP=y. So much so that I no longer have access to a system that supports mainline kernels built with this combination of Kconfig options. This means that I have been putting up with rcutorture output containing :CONFIG_HOTPLUG_CPU: improperly set errors, which is of course annoying, and will soon motivate me to rework the rcutorture test scenarios so as to eliminate this false positive. A side effect of this expected change is that rcutorture will no longer even try to test SMP kernels built without CPU hotplug.

Why prohibit this combination of Kconfig options? It turns out that some of the defenses against hardware side-channel attacks rely on CPU hotplug, so the CPU families vulnerable to such attacks are ensuring that CPU hotplug is available on all systems having at least two CPUs. Security is great, thus so far so good!

Unfortunately, this is a problem for you if you use a CPU family that permits CONFIG_CPU_HOTPLUG=n and CONFIG_SMP=y and you actually need to build kernels this way. Please keep in mind that there really have been RCU bugs that only manifest in such kernels. Given that I no longer have the means to test for these bugs, Murphy asserts that such bugs will quietly but fatally accumulate.

So what are your options? Here are a few:

  1. Rework your CPU family's Kconfig options to force CONFIG_CPU_HOTPLUG=y in any kernel build that enables CONFIG_SMP=y.
  2. Start running rcutorture on your hardware frequently enough to catch any bugs that might find their way into CONFIG_CPU_HOTPLUG=n and CONFIG_SMP=y configurations. I am happy to help you get started with this, which will require updating the rcutorture scripting to support your CPU family, defining additional rcutorture scenarios, and having someone (not me!) do the actual rcutorture runs on your hardware on a regular basis.
  3. Just ignore the problem in the fond hope that no such bugs will ever appear.

I believe that the first option is best, given that RCU is not the only Linux-kernel subsystem vulnerable to bugs specific to kernels built with CONFIG_CPU_HOTPLUG=n and CONFIG_SMP=y. But at the end of the day, it is your CPU family, so it is your decision!

 
Read more...

from paulmck

Recently, a formal-verification researcher gave the expected verdict on software: “It is surprising that any software works!” I could not resist replying “You should be surprised that you work.” The researcher muttered something about evolution.

And in fact it has been observed that “Linux is evolution, not intelligent design”. From this viewpoint, review, testing, and other validation processes provide the Darwinian fitness function, eliminating unfit mutations from the “gene pool”. Which is good reason to continue working to improve these validation processes.

But there is another implication of this viewpoint. The patches that we developers sweat so hard over are nothing more or less than the software counterpart of random mutations.

But don't take my word for it. Ask your compiler!

 
Read more...