<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>people.kernel.org Reader</title>
    <link>https://people.kernel.org</link>
    <description>Read the latest posts from people.kernel.org.</description>
    <pubDate>Fri, 10 Apr 2026 15:34:01 +0000</pubDate>
    <item>
      <title>Desktop GPIO with MCP2221A</title>
      <link>https://people.kernel.org/linusw/desktop-gpio-with-mcp2221a</link>
      <description>&lt;![CDATA[When I first began constructing the GPIO character device in 2015 I had the vision that it would make it simple to use GPIO expanders on desktop Linux, perhaps for industrial purposes.

So ~10 years later, did this work out as planned? Let's see.

I got myself a simple GPIO expander using the Microchip MCP2221A USB-to-I2C-and-some-GPIO chip. It is actually using the USB HID device class and just adds GPIO support under that as a sideshow to the I2C and UART functionality it provides. The initial driver for Linux by Rishi Gupta looked promising, and he quickly followed up with support for the 4 GPIO pins on the chip.

I obtained this a while back, but it didn't "just work", because like with some RS232-with-GPIO adapters, this thing contains an NVRAM that hard-configures these GPIO lines for other purposes, such as voltage readings. Actually, Matt Ranostay added support to use the ADC/DAC in the device through the IIO subsystem! This is quite a feat.

Toolset

The GPIO commands that I use here, such as gpiodetect and gpioset, are part of libgpiod, a GPIO library and utility set that is available on most distributions. In my case, on Fedora, it was a simple matter of dnf -y install libgpiod-utils to get it.

libgpiod uses the GPIO character devices exclusively, no funny hackery, so in my case it's these that are actually used to communicate with the GPIO controller:

# ls /dev/gpiochip*
/dev/gpiochip0
/dev/gpiochip1
/dev/gpiochip2

So using gpiodetect I could see that the GPIOs were there:

# gpiodetect
gpiochip0 [AMDI0030:00] (256 lines)
gpiochip1 [ftdi-cbus] (4 lines)
gpiochip2 [mcp2221_gpio] (4 lines)

Here we see the computer's built-in GPIOs, which I could probably access by soldering wires into my desktop computer (brrrr!). Then there are the 4 lines on an RS232 converter I am using, but it is molded into plastic so these are not easily accessible; they are also restricted by an NVRAM inside the RS232 adapter:

# gpioset -c gpiochip1 --toggle 1000 0=1
gpioset: unable to request lines on chip '/dev/gpiochip1': Device or resource busy

So rumor has it that to use the RS232 adapter GPIOs, some special tool is needed to modify the NVRAM.

Accessing the MCP2221

So let's use the MCP2221A device:

# gpioset -c gpiochip2 --toggle 1000 0=1
gpioset: unable to request lines on chip '/dev/gpiochip2': Device or resource busy

Bummer: I was still unable to "just use it for GPIO", and had been meaning to drill into the problem. Last summer Heiko Schocher beat me to it and implemented code to force the 4 GPIO pins into GPIO mode if and only if IIO was not in use, i.e. by definition the ADC/DAC functionality would take precedence.

Maybe this was a bad target. But I wanted something simple on USB that anyone can use, and the alternative would be to go and solder an RS232 adapter. Amazingly, there is no dedicated "give me many GPIOs on USB" adapter that I have found. There is gimme GPIO from Element 14, but it appears as a USB serial port (ttyUSB0) and you then have to talk a serial protocol to handle the GPIOs from userspace. This is not what I want: I want the GPIOs to appear as a gpiochip on my Linux host, using the native GPIO character device and all. (It would probably be possible to write a Linux driver that uses Element14's device over serial, I haven't tried.)

A small puzzle piece was all that was needed for me to use this for GPIO on a common Fedora desktop: a module parameter to enforce the GPIO mode if GPIO is what you want to do for these 4 pins and you're not really able to disable IIO (such as on a default desktop kernel...). And voilà, after this:

# rmmod hid_mcp2221
# insmod ./hid-mcp2221.ko gpio_mode_enforce=1
# gpioset -c gpiochip2 --toggle 1000 0=1

There it is! This LED now blinks with a period of 2 seconds (toggling between on and off every second). As you may realize, the parameter to the gpioset --toggle argument is in milliseconds.

In case you're wondering about the electronics, the LED is simply connected between the GP0 output and GND using a 220 Ohm resistor.

Future fixes

I don't know yet if my patch to the MCP2221A driver will be accepted or not. We shall see. If it is, I will probably have to add a configuration file, /etc/modprobe.d/mcp2221.conf, to pass the gpio_mode_enforce=1 parameter automatically when the module is loaded, like this:

options hid_mcp2221 gpio_mode_enforce=1

Can this type of GPIO be used for industrial purposes? Pretty much yes. The libgpiod library and its utilities can be run under the realtime deadline scheduling class, and this should (but please test!) ensure very strict timing behaviour for such applications.

If you don't want to script things with the libgpiod utilities, libgpiod also has C++, Python and Rust bindings, as well as GLib bindings, so writing GUI desktop applications that use GPIO should be a walk in the park. So go and have fun.]]&gt;</description>
      <content:encoded><![CDATA[<p>When I first <a href="https://lore.kernel.org/linux-gpio/1445502750-22672-1-git-send-email-linus.walleij@linaro.org/" rel="nofollow">began constructing the GPIO character device</a> in 2015 I had the vision that it would make it simple to use GPIO expanders on desktop Linux, perhaps for industrial purposes.</p>

<p>So ~10 years later, did this work out as planned? Let&#39;s see.</p>

<p>I got myself a <a href="https://www.mikroe.com/usb-i2c-click" rel="nofollow">simple GPIO expander</a> using the <a href="https://download.mikroe.com/documents/datasheets/MCP2221A_datasheet.pdf" rel="nofollow">Microchip MCP2221A</a> USB-to-I2C-and-some-GPIO chip. It is actually using the <a href="https://en.wikipedia.org/wiki/USB_human_interface_device_class" rel="nofollow">USB HID device class</a> and just adds GPIO support under that as a sideshow to the I2C and UART functionality it provides. The <a href="https://lore.kernel.org/linux-input/7b81210829dabdc96257084ff5b4cc97f2f2ebec.1579497275.git.gupt21@gmail.com/" rel="nofollow">initial driver for Linux</a> by Rishi Gupta looked promising, and he quickly followed up with <a href="https://lore.kernel.org/linux-input/1586882894-19905-1-git-send-email-gupt21@gmail.com/" rel="nofollow">support for the 4 GPIO pins</a> on the chip.</p>

<p>I obtained this a while back, but it didn&#39;t “just work”, because like with some RS232-with-GPIO adapters, this thing contains an NVRAM that hard-configures these GPIO lines for other purposes, such as voltage readings. Actually, Matt Ranostay <a href="https://lore.kernel.org/linux-input/20220927025050.13316-6-matt.ranostay@konsulko.com/" rel="nofollow">added support to use the ADC/DAC</a> in the device through the IIO subsystem! This is quite a feat.</p>

<h2 id="toolset">Toolset</h2>

<p>The GPIO commands that I use here, such as <code>gpiodetect</code> and <code>gpioset</code>, are part of <a href="https://git.kernel.org/pub/scm/libs/libgpiod/libgpiod.git/" rel="nofollow">libgpiod</a>, a GPIO library and utility set that is available on most distributions. In my case, on Fedora, it was a simple matter of <code>dnf -y install libgpiod-utils</code> to get it.</p>

<p>libgpiod uses the GPIO character devices exclusively, no funny hackery, so in my case it&#39;s these that are actually used to communicate with the GPIO controller:</p>

<pre><code># ls /dev/gpiochip*
/dev/gpiochip0
/dev/gpiochip1
/dev/gpiochip2
</code></pre>

<p>So using <code>gpiodetect</code> I could see that the GPIOs were there:</p>

<pre><code># gpiodetect 
gpiochip0 [AMDI0030:00] (256 lines)
gpiochip1 [ftdi-cbus] (4 lines)
gpiochip2 [mcp2221_gpio] (4 lines)
</code></pre>

<p>Here we see the computer&#39;s built-in GPIOs, which I could probably access by soldering wires into my desktop computer (brrrr!). Then there are the 4 lines on an RS232 converter I am using, but it is molded into plastic so these are not easily accessible; they are also restricted by an NVRAM inside the RS232 adapter:</p>

<pre><code># gpioset -c gpiochip1 --toggle 1000 0=1
gpioset: unable to request lines on chip &#39;/dev/gpiochip1&#39;: Device or resource busy
</code></pre>

<p>So rumor has it that to use the RS232 adapter GPIOs, some special tool is needed to modify the NVRAM.</p>

<h2 id="accessing-the-mcp2221">Accessing the MCP2221</h2>

<p>So let&#39;s use the MCP2221A device:</p>

<pre><code># gpioset -c gpiochip2 --toggle 1000 0=1
gpioset: unable to request lines on chip &#39;/dev/gpiochip2&#39;: Device or resource busy
</code></pre>

<p>Bummer: I was still unable to “just use it for GPIO”, and had been meaning to drill into the problem. Last summer Heiko Schocher beat me to it and <a href="https://lore.kernel.org/linux-input/20250608163315.24842-1-hs@denx.de/" rel="nofollow">implemented code to force the 4 GPIO pins into GPIO mode</a> if and only if IIO was not in use, i.e. by definition the ADC/DAC functionality would take precedence.</p>

<p>Maybe this was a bad target. But I wanted something simple on USB that anyone can use, and the alternative would be to go and solder an RS232 adapter. Amazingly, there is no dedicated “give me many GPIOs on USB” adapter that I have found. There is <a href="https://community.element14.com/challenges-projects/element14-presents/project-videos/w/documents/72021/gimmegpio-a-simple-way-to-get-gpio-on-laptops-and-desktops----episode-699" rel="nofollow">gimme GPIO</a> from Element 14, but it appears as a USB serial port (ttyUSB0) and you then have to talk a serial protocol to handle the GPIOs from userspace. This is not what I want: I want the GPIOs to appear as a gpiochip on my Linux host, using the native GPIO character device and all. (It would probably be possible to write a Linux driver that uses Element14&#39;s device over serial, I haven&#39;t tried.)</p>

<p>A small puzzle piece was all that was needed for me to use this for GPIO on a common Fedora desktop: <a href="https://lore.kernel.org/linux-input/20260218-hid-mcp2221-gpio-v1-1-a2ba53867354@kernel.org/" rel="nofollow">a module parameter to enforce the GPIO mode</a> if GPIO is what you want to do for these 4 pins and you&#39;re not really able to disable IIO (such as on a default desktop kernel...). And voilà, after this:</p>

<pre><code># rmmod hid_mcp2221 
# insmod ./hid-mcp2221.ko gpio_mode_enforce=1
# gpioset -c gpiochip2 --toggle 1000 0=1
</code></pre>

<p><img src="https://dflund.se/~triad/images/toggle-gpio-led.jpg" alt="Toggling a LED over GPIO"></p>

<p>There it is! This LED now blinks with a period of 2 seconds (toggling between on and off every second). As you may realize, the parameter to the <code>gpioset</code> <strong>--toggle</strong> argument is in milliseconds.</p>

<p>In case you&#39;re wondering about the electronics, the LED is simply connected between the <em>GP0</em> output and <em>GND</em> using a 220 Ohm resistor.</p>

<h2 id="future-fixes">Future fixes</h2>

<p>I don&#39;t know yet if my patch to the MCP2221A driver will be accepted or not. We shall see. If it is, I will probably have to add a configuration file, <code>/etc/modprobe.d/mcp2221.conf</code>, to pass the <strong>gpio_mode_enforce=1</strong> parameter automatically when the module is loaded, like this:</p>

<pre><code>options hid_mcp2221 gpio_mode_enforce=1
</code></pre>
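
<p>Assuming the parameter keeps the name from the patch (<code>gpio_mode_enforce</code>) and that file name is used, creating the configuration could be a one-liner (an untested sketch, not something from the patch itself):</p>

<pre><code># Make the module parameter persistent across module loads
echo "options hid_mcp2221 gpio_mode_enforce=1" | sudo tee /etc/modprobe.d/mcp2221.conf
</code></pre>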

<p>Can this type of GPIO be used for industrial purposes? Pretty much yes. The libgpiod library and its utilities can be run under the realtime deadline scheduling class, and this should (but please test!) ensure very strict timing behaviour for such applications.</p>
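
<p>As a hedged sketch of what that could look like with the util-linux <code>chrt</code> tool (the runtime/deadline/period budgets below are made-up illustration values, please tune and test for your own application):</p>

<pre><code># Run gpioset under SCHED_DEADLINE: 0.5 ms runtime per 1 ms period
chrt -d --sched-runtime 500000 --sched-deadline 1000000 \
     --sched-period 1000000 0 \
     gpioset -c gpiochip2 --toggle 1000 0=1
</code></pre>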

<p>If you don&#39;t want to script things with the libgpiod utilities, libgpiod also has C++, Python and Rust bindings, as well as GLib bindings, so writing GUI desktop applications that use GPIO should be a walk in the park. So go and have fun.</p>
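
<p>To give a flavour of the bindings, here is a minimal sketch of the same LED blink using the libgpiod v2 Python bindings. It assumes the MCP2221A still shows up as <code>/dev/gpiochip2</code> with GP0 on line 0; treat it as an untested illustration rather than a reference:</p>

<pre><code>import time

import gpiod
from gpiod.line import Direction, Value

# Request line 0 (GP0) as an output, initially driven active (LED on).
with gpiod.request_lines(
    "/dev/gpiochip2",
    consumer="blink-example",
    config={0: gpiod.LineSettings(direction=Direction.OUTPUT,
                                  output_value=Value.ACTIVE)},
) as request:
    value = Value.ACTIVE
    while True:
        time.sleep(1)
        # Same 2 second period as gpioset --toggle 1000 0=1
        value = Value.INACTIVE if value == Value.ACTIVE else Value.ACTIVE
        request.set_value(0, value)
</code></pre>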
]]></content:encoded>
      <author>linusw</author>
      <guid>https://people.kernel.org/read/a/s9ubke1bgj</guid>
      <pubDate>Fri, 20 Feb 2026 09:33:56 +0000</pubDate>
    </item>
    <item>
      <title>Guards, guard locks &amp; friends</title>
      <link>https://people.kernel.org/vschneid/guards-guard-locks-and-friends</link>
      <description>&lt;![CDATA[Links

https://lore.kernel.org/all/20230612093537.614161713@infradead.org/T/

Intro

As I was catching up with the scheduler's "change pattern" sched_change patches, I figured it was time I got up to speed with the guard zoology.

The first part of this post is a code exploration with some of my own musings, the second part is a TL;DR with what each helper does and when you should use them (*)

(*) according to my own understanding, provided as-is without warranties of any kind, batteries not included.

It's cleanup all the way down

The docstring for __cleanup kindly points us to the relevant gcc/clang documentation. Given I don't really speak clang, here's the relevant GCC bit:

https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-cleanup-variable-attribute

cleanup (cleanup_function)
    The cleanup attribute runs a function when the variable goes out of
    scope. This attribute can only be applied to auto function scope variables;
    it may not be applied to parameters or variables with static storage
    duration. The function must take one parameter, a pointer to a type
    compatible with the variable. The return value of the function (if any) is
    ignored.

    When multiple variables in the same scope have cleanup attributes, at exit
    from the scope their associated cleanup functions are run in reverse order
    of definition (last defined, first cleanup).

So we get to write a function that takes a pointer to the variable and does cleanup for it whenever the variable goes out of scope. Neat.

DEFINE_FREE

That's the first one we'll meet in include/linux/cleanup.h and the most straightforward.

 DEFINE_FREE(name, type, free):
	simple helper macro that defines the required wrapper for a __free()
	based cleanup function. @free is an expression using '_T' to access the
	variable. @free should typically include a NULL test before calling a
	function, see the example below.

Long story short, that's a cleanup variable definition with some extra sprinkles on top:

#define __cleanup(func)			__attribute__((__cleanup__(func)))
#define __free(_name)	__cleanup(__free_##_name)

#define DEFINE_FREE(_name, _type, _free) \
	static __always_inline void __free_##_name(void *p) { _type _T = *(_type *)p; _free; }

So we can e.g. define a kfree() cleanup type and stick that onto any kmalloc()'d variable to get automagic cleanup without any goto's. Some languages call that a smart pointer.

DEFINE_FREE(kfree, void *, if (_T) kfree(_T))

void *alloc_obj(...)
{
	struct obj *p __free(kfree) = kmalloc(...);
	if (!p)
		return NULL;

	if (!init_obj(p))
		return NULL;

	return_ptr(p); // This does a pointer shuffle to prevent the kfree() from happening
}

I won't get into the return_ptr() faff, but if you have a look at it and wonder what's going on, it's mostly going to be because of having to do the shuffle with no double evaluation. This is relevant: https://lore.kernel.org/lkml/CAHk-=wiOXePAqytCk6JuiP6MeePL6ksDYptE54hmztiGLYihjA@mail.gmail.com/

DEFINE_CLASS

This one is pretty much going to be DEFINE_FREE() but with an added quality of life feature in the form of a constructor:

 DEFINE_CLASS(name, type, exit, init, init_args...):
	helper to define the destructor and constructor for a type.
	@exit is an expression using '_T' -- similar to FREE above.
	@init is an expression in @init_args resulting in @type

#define DEFINE_CLASS(_name, _type, _exit, _init, _init_args...)		\
typedef _type class_##_name##_t;					\
static __always_inline void class_##_name##_destructor(_type *p)	\
{ _type _T = *p; _exit; }						\
static __always_inline _type class_##_name##_constructor(_init_args)	\
{ _type t = _init; return t; }

#define CLASS(_name, var)						\
	class_##_name##_t var __cleanup(class_##_name##_destructor) =	\
		class_##_name##_constructor

You'll note that yes, it can be expressed purely as a DEFINE_FREE(), but it saves us from a bit of repetition, and will enable us to craft stuff involving locks later on:

DEFINE_CLASS(fdget, struct fd, fdput(_T), fdget(fd), int fd)

void foo(void)
{
	fd = ...;
	CLASS(fdget, f)(fd);
	if (fd_empty(f))
		return -EBADF;

	// use 'f' without concern
}

DEFINE_FREE(fdput, struct fd, if (_T) fdput(_T))

void foo(void)
{
	fd = ...;
	struct fd f __free(fdput) = fdget(fd);
	if (fd_empty(f))
		return -EBADF;

	// use 'f' without concern
}

Futex hash bucket case

For a more complete example:

DEFINE_CLASS(hb, struct futex_hash_bucket *,
	     if (_T) futex_hash_put(_T),
	     futex_hash(key), union futex_key *key);

int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
	union futex_key key = FUTEX_KEY_INIT;
	DEFINE_WAKE_Q(wake_q);
	int ret;

	ret = get_futex_key(uaddr, flags, &amp;key, FUTEX_READ);
	if (unlikely(ret != 0))
		return ret;

	CLASS(hb, hb)(&amp;key);

	/* Make sure we really have tasks to wakeup */
	if (!futex_hb_waiters_pending(hb))
		return ret;

	spin_lock(&amp;hb-&gt;lock);

	/* ... */
}

Using gcc -E to stop compilation after the preprocessor has expanded all of our fancy macros (*), the resulting code is fairly readable modulo the typedef:

typedef struct futex_hash_bucket *class_hb_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void
class_hb_destructor(struct futex_hash_bucket **p)
{
	struct futex_hash_bucket *_T = *p;
	if (_T)
		futex_hash_put(_T);
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
struct futex_hash_bucket *
class_hb_constructor(union futex_key *key)
{
	struct futex_hash_bucket *t = futex_hash(key);
	return t;
}

int futex_wake(u32 *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
	struct futex_q *this, *next;
	union futex_key key = (union futex_key) { .both = { .ptr = 0ULL } };
	struct wake_q_head wake_q = { ((struct wake_q_node *) 0x01), &amp;wake_q.first };
	int ret;

	if (!bitset)
		return -22;

	ret = get_futex_key(uaddr, flags, &amp;key, FUTEX_READ);
	if (__builtin_expect(!!(ret != 0), 0))
		return ret;

	if ((flags &amp; 0x0100) &amp;&amp; !nr_wake)
		return 0;

	class_hb_t hb __attribute__((__cleanup__(class_hb_destructor))) = class_hb_constructor(&amp;key);

	if (!futex_hb_waiters_pending(hb))
		return ret;

	spin_lock(&amp;hb-&gt;lock);

	/* ... */
}

(*) I use make V=1 on the file I want to expand, copy the big command producing the .o, ditch the -Wp,-MMD,*.o.d part and add a -E to it.

DEFINE_GUARD

For now, ignore the CONDITIONAL and LOCK_PTR stuff, this is only relevant to the scoped &amp; conditional guards which we'll get to later.

#define __DEFINE_CLASS_IS_GUARD(_name) \
	__DEFINE_CLASS_IS_CONDITIONAL(_name, false); \
	__DEFINE_GUARD_LOCK_PTR(_name, _T)

#define DEFINE_GUARD(_name, _type, _lock, _unlock) \
	DEFINE_CLASS(_name, _type, if (!__GUARD_IS_ERR(_T)) { _unlock; }, ({ _lock; _T; }), _type _T); \
	__DEFINE_CLASS_IS_GUARD(_name)

#define guard(_name) \
	CLASS(_name, __UNIQUE_ID(guard))

So it's a CLASS with a constructor and destructor, but the added bonus is the automagic cleanup variable definition.

Why is that relevant? Well, consider locks. You don't declare a variable for a lock acquisition &amp; release, you manipulate an already-allocated object (e.g. a mutex). However, no variable declaration means no cleanup. So this just declares a variable to slap __cleanup onto it and have an automagic out-of-scope cleanup callback.

Let's have a look at an example in the thermal subsystem with a mutex critical section:

DEFINE_GUARD(cooling_dev, struct thermal_cooling_device *, mutex_lock(&amp;_T-&gt;lock),
	     mutex_unlock(&amp;_T-&gt;lock))

static ssize_t
cur_state_store(struct device *dev, struct device_attribute *attr,
		const char *buf, size_t count)
{
	struct thermal_cooling_device *cdev = to_cooling_device(dev);
	unsigned long state;
	int result;
	/* ... */
	if (state &gt; cdev-&gt;max_state)
		return -EINVAL;

	guard(cooling_dev)(cdev);

	result = cdev-&gt;ops-&gt;set_cur_state(cdev, state);
	if (result)
		return result;

The preprocessor output looks like so:

typedef struct thermal_cooling_device *class_cooling_dev_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__)) void
class_cooling_dev_destructor(struct thermal_cooling_device **p)
{
	struct thermal_cooling_device *_T = *p;
	if (!({
				unsigned long _rc = (unsigned long)(_T);
				__builtin_expect(!!((_rc - 1) &gt;= -4095 - 1), 0);
			})) {
		mutex_unlock(&amp;_T-&gt;lock);
	};
}
static inline __attribute__((__gnu_inline__))
__attribute__((__unused__)) __attribute__((no_instrument_function))
__attribute__((__always_inline__)) struct thermal_cooling_device *
class_cooling_dev_constructor(struct thermal_cooling_device *_T)
{
	struct thermal_cooling_device *t =
		({ mutex_lock(&amp;_T-&gt;lock); _T; });
	return t;
}

static __attribute__((__unused__)) const bool class_cooling_dev_is_conditional = false;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__)) void *
class_cooling_dev_lock_ptr(class_cooling_dev_t *_T)
{
	void *_ptr = (void *)(unsigned long)*(_T);
	if (IS_ERR(_ptr)) {
		_ptr = ((void *)0);
	}
	return _ptr;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__)) int
class_cooling_dev_lock_err(class_cooling_dev_t *_T)
{
	long _rc = (unsigned long)*(_T);
	if (!_rc) {
		_rc = -16;
	}
	if (!__builtin_expect(!!((unsigned long)(void *)(_rc) &gt;= (unsigned long)-4095), 0)) {
		_rc = 0;
	}
	return _rc;
}

static ssize_t
cur_state_store(struct device *dev, struct device_attribute *attr,
		const char *buf, size_t count)
{
	struct thermal_cooling_device *cdev = ({ void *__mptr = (void *)(dev); _Static_assert(__builtin_types_compatible_p(typeof(*(dev)), typeof(((struct thermal_cooling_device *)0)-&gt;device)) || __builtin_types_compatible_p(typeof(*(dev)), typeof(void)), "pointer type mismatch in container_of()"); ((struct thermal_cooling_device *)(__mptr - __builtin_offsetof(struct thermal_cooling_device, device))); });
	unsigned long state;
	int result;

	if (sscanf(buf, "%ld\n", &amp;state) != 1)
		return -22;

	if ((long)state &lt; 0)
		return -22;

	if (state &gt; cdev-&gt;max_state)
		return -22;

	class_cooling_dev_t __UNIQUE_ID_guard435 __attribute__((__cleanup__(class_cooling_dev_destructor))) = class_cooling_dev_constructor(cdev);

	result = cdev-&gt;ops-&gt;set_cur_state(cdev, state);
	if (result)
		return result;

	thermal_cooling_device_stats_update(cdev, state);

	return count;
}

DEFINE_LOCK_GUARD

Okay, we have sort-of-smart pointers, classes, guards for locks, what's next? Well, certain locks need more than just a pointer for the lock &amp; unlock operations. For instance, the scheduler's runqueue locks need both a struct rq pointer and a struct rq_flags pointer.

So LOCK_GUARD's are going to be enhanced GUARD's manipulating a composite type instead of a single pointer:

#define __DEFINE_UNLOCK_GUARD(_name, _type, _unlock, ...)		\
typedef struct {							\
	_type *lock;							\
	__VA_ARGS__;							\
} class_##_name##_t;							\

Note that there is also the "no pointer" special case, which is when there is no accessible type for the manipulated lock - think preempt_disable(), migrate_disable(), rcu_read_lock(); Just like for GUARD, we still declare a variable to slap __cleanup onto it.

Let's look at the RCU case:

DEFINE_LOCK_GUARD_0(rcu,
	do {
		rcu_read_lock();
		/*
		 * sparse doesn't call the cleanup function,
		 * so just release immediately and don't track
		 * the context. We don't need to anyway, since
		 * the whole point of the guard is to not need
		 * the explicit unlock.
		 */
		__release(RCU);
	} while (0),
	rcu_read_unlock())

void wake_up_if_idle(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	guard(rcu)();
	if (is_idle_task(rcu_dereference(rq-&gt;curr))) {
		// ....
	}
}

static __attribute__((__unused__)) const bool class_rcu_is_conditional = false;

typedef struct {
	void *lock;
} class_rcu_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void
class_rcu_destructor(class_rcu_t *_T)
{
	if (!({
				unsigned long _rc = ( unsigned long)(_T-&gt;lock);
				__builtin_expect(!!((_rc - 1) &gt;= -4095 - 1), 0);
			})) {
		rcu_read_unlock();
	}
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void *
class_rcu_lock_ptr(class_rcu_t *_T)
{
	void *_ptr = (void *)( unsigned long)(&amp;_T-&gt;lock);
	if (IS_ERR(_ptr)) {
		_ptr = ((void *)0);
	}
	return _ptr;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
int
class_rcu_lock_err(class_rcu_t *_T)
{
	long _rc = ( unsigned long)(&amp;_T-&gt;lock);
	if (!_rc) {
		_rc = -16;
	}
	if (!__builtin_expect(!!((unsigned long)(void *)(_rc) &gt;= (unsigned long)-4095), 0)) {
		_rc = 0;
	}
	return _rc;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
class_rcu_t
class_rcu_constructor(void)
{
	class_rcu_t _t = { .lock = (void *)1 }, *_T __attribute__((__unused__)) = &amp;_t;
	do {
		rcu_read_lock();
		(void)0; // __release(RCU); just for sparse, see comment in definition
	} while (0);
	return _t;
}

void wake_up_if_idle(int cpu)
{
	struct rq *rq = (&amp;(({ do { const void __seg_gs *__vpp_verify = (typeof((&amp;(runqueues)) + 0))((void *)0); (void)__vpp_verify; } while (0); ({ unsigned long __ptr; asm ("" : "=r"(__ptr) : "0"((__typeof_unqual__(*((&amp;(runqueues)))) *)(( unsigned long)((&amp;(runqueues)))))); (typeof((__typeof_unqual__(*((&amp;(runqueues)))) *)(( unsigned long)((&amp;(runqueues)))))) (__ptr + (((__per_cpu_offset[((cpu))])))); }); })));
	class_rcu_t __UNIQUE_ID_guard1486 __attribute__((__cleanup__(class_rcu_destructor))) =
		class_rcu_constructor();

	if (is_idle_task(...)) {
		// ...
	}
}

Let's look at the runqueue lock:

DEFINE_LOCK_GUARD_1(rq_lock_irqsave, struct rq,
		    rq_lock_irqsave(_T-&gt;lock, &amp;_T-&gt;rf),
		    rq_unlock_irqrestore(_T-&gt;lock, &amp;_T-&gt;rf),
		    struct rq_flags rf)

static void sched_balance_update_blocked_averages(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	guard(rq_lock_irqsave)(rq);
	update_rq_clock(rq);
	__sched_balance_update_blocked_averages(rq);
}

static __attribute__((__unused__)) const bool class_rq_lock_irqsave_is_conditional = false;

typedef struct {
	struct rq *lock;
	struct rq_flags rf;
} class_rq_lock_irqsave_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void
class_rq_lock_irqsave_destructor(class_rq_lock_irqsave_t *_T)
{
	if (!({
				unsigned long _rc = ( unsigned long)(_T-&gt;lock);
				__builtin_expect(!!((_rc - 1) &gt;= -4095 - 1), 0);
			})) {
		rq_unlock_irqrestore(_T-&gt;lock, &amp;_T-&gt;rf);
	}
}
static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void *
class_rq_lock_irqsave_lock_ptr(class_rq_lock_irqsave_t *_T)
{
	void *_ptr = (void *)( unsigned long)(&amp;_T-&gt;lock);
	if (IS_ERR(_ptr)) {
		_ptr = ((void *)0);
	}
	return _ptr;
}
static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
int
class_rq_lock_irqsave_lock_err(class_rq_lock_irqsave_t *_T)
{
	long _rc = ( unsigned long)(&amp;_T-&gt;lock);
	if (!_rc) {
		_rc = -16;
	}
	if (!__builtin_expect(!!((unsigned long)(void *)(_rc) &gt;= (unsigned long)-4095), 0)) {
		_rc = 0;
	}
	return _rc;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
class_rq_lock_irqsave_t
class_rq_lock_irqsave_constructor(struct rq *l)
{
	class_rq_lock_irqsave_t _t = { .lock = l }, *_T = &amp;_t;
	rq_lock_irqsave(_T-&gt;lock, &amp;_T-&gt;rf);
	return _t;
}

static void sched_balance_update_blocked_averages(int cpu)
{
 struct rq *rq = (&amp;(({ do { const void __seg_gs *__vpp_verify = (typeof((&amp;(runqueues)) + 0))((void *)0); (void)__vpp_verify; } while (0); ({ unsigned long __ptr; asm ("" : "=r"(__ptr) : "0"((__typeof_unqual__(*((&amp;(runqueues)))) *)(( unsigned long)((&amp;(runqueues)))))); (typeof((__typeof_unqual__(*((&amp;(runqueues)))) *)(( unsigned long)((&amp;(runqueues)))))) (__ptr + (((__per_cpu_offset[((cpu))])))); }); })));

 class_rq_lock_irqsave_t __UNIQUE_ID_guard1377
	 __attribute__((__cleanup__(class_rq_lock_irqsave_destructor))) =
	 class_rq_lock_irqsave_constructor(rq);

 update_rq_clock(rq);
 __sched_balance_update_blocked_averages(rq);
}

Scopes

Scope creation is slightly different for classes and guards, but follows the same principle.

Class

#define __scoped_class(_name, var, _label, args...)        \
	for (CLASS(_name, var)(args); ; ({ goto _label; })) \
		if (0) {                                   \
_label:                                                    \
			break;                             \
		} else

#define scoped_class(_name, var, args...) \
	__scoped_class(_name, var, __UNIQUE_ID(label), args)

That for+if+goto trinity looks a bit unholy at first, but let's look at what the requirements are for a macro that lets us create a new scope:

- create a new scope
- declare the __cleanup variable in that new scope
- make the macro usable either with a single statement, or with curly braces

A for loop gives us the declaration and the scope. However that for loop needs to run once, and it'd be a shame to have to declare a loop counter. The "run exactly once" mechanism is thus encoded in the form of the if+goto.

Consider:

	for (CLASS(name, var)(args); ; ({ goto label; }))
		if (0) {
label:
			break;
		} else {
		   stmt;
		}

The execution order will be:

CLASS(name, var)(args);
stmt;
goto label;
break;

We thus save ourselves the need for an extra variable at the cost of mild code reader confusion, a common trick used in the kernel.

Guards

For guard scopes, we find the same for+if+goto construct but with some added checks. For regular (unconditional) guards, this is pretty much the same as for CLASS'es:

/*
 * Helper macro for scoped_guard().
 *
 * Note that the "!__is_cond_ptr(_name)" part of the condition ensures that
 * compiler would be sure that for the unconditional locks the body of the
 * loop (caller-provided code glued to the else clause) could not be skipped.
 * It is needed because the other part - "__guard_ptr(_name)(&amp;scope)" - is too
 * hard to deduce (even if could be proven true for unconditional locks).
 */
#define __scoped_guard(_name, _label, args...)				\
	for (CLASS(_name, scope)(args);					\
	     __guard_ptr(_name)(&amp;scope) || !__is_cond_ptr(_name);	\
	     ({ goto _label; }))					\
		if (0) {						\
_label:									\
			break;						\
		} else

#define scoped_guard(_name, args...)	\
	__scoped_guard(_name, __UNIQUE_ID(label), args)

For conditional guards, we mainly factor in the fact that the constructor can "fail". This is relevant for e.g. trylocks where the lock acquisition isn't guaranteed to succeed.

#define __scoped_cond_guard(_name, _fail, _label, args...)	\
	for (CLASS(_name, scope)(args); true; ({ goto _label; }))	\
		if (!__guard_ptr(_name)(&amp;scope)) {			\
			BUILD_BUG_ON(!__is_cond_ptr(_name));		\
			_fail;						\
_label:								\
			break;						\
		} else

#define scoped_cond_guard(_name, _fail, args...)	\
	__scoped_cond_guard(_name, _fail, __UNIQUE_ID(label), args)

So in the end, that __DEFINE_CLASS_IS_CONDITIONAL() faff is there:

- To help optimize unconditional guard scopes
- To ensure conditional guard scopes are used correctly (i.e. the lock acquisition failure is expected)

Debuggability

You'll note that while guards delete an entire class of error associated with goto's, they shuffle the code around.

From my experimentation, if you put the constructor and the destructor on a separate line in the CLASS/GUARD definition, you'll at least be able to tell them apart during a splat:

DEFINE_LOCK_GUARD_1(raw_spinlock_irqsave_bug, raw_spinlock_t,
		    raw_spin_lock_irqsave(_T-&gt;lock, _T-&gt;flags),
spinlock.h:571:	    raw_spin_unlock_irqrestore_bug(_T-&gt;lock, _T-&gt;flags),
		    unsigned long flags)

int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
	        ...
core.c:4108	scoped_guard (raw_spinlock_irqsave_bug, &amp;p-&gt;pi_lock) {
	        }
	        ...
}

[    0.216287] kernel BUG at ./include/linux/spinlock.h:571!
[    0.217115] Oops: invalid opcode: 0000 [#1] SMP PTI
[    0.217285] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    0.217285] RIP: 0010:try_to_wake_up
(./include/linux/spinlock.h:569 (discriminator 6) kernel/sched/core.c:4108 (discriminator 6))&#xA;[    0.217285] Call Trace:&#xA;[    0.217285]  TASK&#xA;[    0.217285]  ? _pfxkthreadworkerfn (kernel/kthread.c:966)&#xA;[    0.217285]  _kthreadcreateonnode (kernel/kthread.c:535)&#xA;[    0.217285]  kthreadcreateworkeronnode (kernel/kthread.c:1043 (discriminator 1) kernel/kthread.c:1073 (discriminator 1))&#xA;[    0.217285]  ? vprintkemit (kernel/printk/printk.c:4625 kernel/printk/printk.c:2433)&#xA;[    0.217285]  workqueueinit (kernel/workqueue.c:7873 kernel/workqueue.c:7922)&#xA;[    0.217285]  kernelinitfreeable (init/main.c:1675)&#xA;[    0.217285]  ? _pfxkernelinit (init/main.c:1570)&#xA;[    0.217285]  kernelinit (init/main.c:1580)&#xA;[    0.217285]  retfromfork (arch/x86/kernel/process.c:164)&#xA;[    0.217285]  ? _pfxkernelinit (init/main.c:1570)&#xA;[    0.217285]  retfromforkasm (arch/x86/entry/entry64.S:259)&#xA;[    0.217285]  /TASK&#xA;&#xA;TL;DR&#xA;&#xA;DEFINEFREE()&#xA;    Sort-of-smart pointer&#xA;    Definition tied to the freeing function, e.g. DEFINEKFREE(kfree,...)&#xA;&#xA;DEFINECLASS()&#xA;    Like DEFINEFREE() but with factorized initialization.&#xA;&#xA;DEFINEGUARD()&#xA;    Like DEFINECLASS() but you don&#39;t need the underlying variable&#xA;    e.g. locks don&#39;t require declaring a variable, you just lock and unlock them.&#xA;&#xA;DEFINE\LOCK\GUARD()&#xA;    Like DEFINEGUARD() but when a single pointer isn&#39;t sufficient for lock/unlock operations.&#xA;    Also for &#34;special&#34; locks with no underlying type such as RCU, preempt or migrate_disable.&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<h2 id="links">Links</h2>

<p><a href="https://lore.kernel.org/all/20230612093537.614161713@infradead.org/T/" rel="nofollow">https://lore.kernel.org/all/20230612093537.614161713@infradead.org/T/</a></p>

<h2 id="intro">Intro</h2>

<p>As I was catching up with the scheduler&#39;s “change pattern” <code>sched_change</code> patches, I figured it was time I got up to speed with the guard zoology.</p>

<p>The first part of this post is a code exploration with some of my own musings; the second part is a TL;DR with what each helper does and when you should use them (*).</p>

<p>(*) according to my own understanding, provided as-is without warranties of any kind, batteries not included.</p>

<h2 id="it-s-cleanup-all-the-way-down">It&#39;s <code>__cleanup__</code> all the way down</h2>

<p>The docstring for <code>__cleanup</code> kindly points us to the relevant gcc/clang documentation. Given I don&#39;t really speak clang, here&#39;s the relevant GCC bit:</p>

<p><a href="https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-cleanup-variable-attribute" rel="nofollow">https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-cleanup-variable-attribute</a></p>

<pre><code>cleanup (cleanup_function)
    The cleanup attribute runs a function when the variable goes out of
    scope. This attribute can only be applied to auto function scope variables;
    it may not be applied to parameters or variables with static storage
    duration. The function must take one parameter, a pointer to a type
    compatible with the variable. The return value of the function (if any) is
    ignored.

    When multiple variables in the same scope have cleanup attributes, at exit
    from the scope their associated cleanup functions are run in reverse order
    of definition (last defined, first cleanup).
</code></pre>

<p>So we get to write a function that takes a pointer to the variable and does cleanup for it whenever the variable goes out of scope. Neat.</p>

<h2 id="define-free">DEFINE_FREE</h2>

<p>That&#39;s the first one we&#39;ll meet in <code>include/linux/cleanup.h</code> and the most straightforward.</p>

<pre><code> * DEFINE_FREE(name, type, free):
 *	simple helper macro that defines the required wrapper for a __free()
 *	based cleanup function. @free is an expression using &#39;_T&#39; to access the
 *	variable. @free should typically include a NULL test before calling a
 *	function, see the example below.
</code></pre>

<p>Long story short, that&#39;s a <code>__cleanup</code> variable definition with some extra sprinkles on top:</p>

<pre><code class="language-C">#define __cleanup(func)			__attribute__((__cleanup__(func)))
#define __free(_name)	__cleanup(__free_##_name)

#define DEFINE_FREE(_name, _type, _free) \
	static __always_inline void __free_##_name(void *p) { _type _T = *(_type *)p; _free; }
</code></pre>

<p>So we can e.g. define a <code>kfree()</code> cleanup type and stick that onto any <code>kmalloc()</code>&#39;d variable to get automagic cleanup without any <code>goto</code>&#39;s. Some languages call that a smart pointer.</p>

<pre><code class="language-C">DEFINE_FREE(kfree, void *, if (_T) kfree(_T))

void *alloc_obj(...)
{
     struct obj *p __free(kfree) = kmalloc(...);
     if (!p)
	return NULL;

     if (!init_obj(p))
	return NULL;

     return_ptr(p); // This does a pointer shuffle to prevent the kfree() from happening
}
</code></pre>

<p>I won&#39;t get into the <code>return_ptr()</code> faff, but if you have a look at it and wonder what&#39;s going on, it&#39;s mostly going to be because of having to do the shuffle with no double evaluation. This is relevant: <a href="https://lore.kernel.org/lkml/CAHk-=wiOXePAqytCk6JuiP6MeePL6ksDYptE54hmztiGLYihjA@mail.gmail.com/" rel="nofollow">https://lore.kernel.org/lkml/CAHk-=wiOXePAqytCk6JuiP6MeePL6ksDYptE54hmztiGLYihjA@mail.gmail.com/</a></p>

<h2 id="define-class">DEFINE_CLASS</h2>

<p>This one is pretty much going to be <code>DEFINE_FREE()</code> but with an added quality of life feature in the form of a constructor:</p>

<pre><code> * DEFINE_CLASS(name, type, exit, init, init_args...):
 *	helper to define the destructor and constructor for a type.
 *	@exit is an expression using &#39;_T&#39; -- similar to FREE above.
 *	@init is an expression in @init_args resulting in @type
</code></pre>

<pre><code class="language-C">#define DEFINE_CLASS(_name, _type, _exit, _init, _init_args...)		\
typedef _type class_##_name##_t;					\
static __always_inline void class_##_name##_destructor(_type *p)	\
{ _type _T = *p; _exit; }						\
static __always_inline _type class_##_name##_constructor(_init_args)	\
{ _type t = _init; return t; }

#define CLASS(_name, var)						\
	class_##_name##_t var __cleanup(class_##_name##_destructor) =	\
		class_##_name##_constructor
</code></pre>

<p>You&#39;ll note that yes, it can be expressed purely as a <code>DEFINE_FREE()</code>, but it saves us from a bit of repetition, and will enable us to craft stuff involving locks later on:</p>

<pre><code class="language-C">DEFINE_CLASS(fdget, struct fd, fdput(_T), fdget(fd), int fd)
int foo(void)
{
	int fd = ...;
	CLASS(fdget, f)(fd);
	if (fd_empty(f))
		return -EBADF;

	// use &#39;f&#39; without concern
}

DEFINE_FREE(fdput, struct fd *, if (_T) fdput(_T))
int foo(void)
{
	int fd = ...;
	struct fd *f __free(fdput) = fdget(fd);
	if (fd_empty(f))
		return -EBADF;

	// use &#39;f&#39; without concern
}
</code></pre>

<h3 id="futex-hash-bucket-case">Futex_hash_bucket case</h3>

<p>For a more complete example:</p>

<pre><code class="language-C">DEFINE_CLASS(hb, struct futex_hash_bucket *,
	     if (_T) futex_hash_put(_T),
	     futex_hash(key), union futex_key *key);

int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
	union futex_key key = FUTEX_KEY_INIT;
	DEFINE_WAKE_Q(wake_q);
	int ret;

	ret = get_futex_key(uaddr, flags, &amp;key, FUTEX_READ);
	if (unlikely(ret != 0))
		return ret;

	CLASS(hb, hb)(&amp;key);

	/* Make sure we really have tasks to wakeup */
	if (!futex_hb_waiters_pending(hb))
		return ret;

	spin_lock(&amp;hb-&gt;lock);

	/* ... */
}
</code></pre>

<p>Using <code>gcc -E</code> to stop compilation after the preprocessor has expanded all of our fancy macros (*), the resulting code is fairly readable modulo the <code>typedef</code>:</p>

<pre><code class="language-C">typedef struct futex_hash_bucket * class_hb_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void
class_hb_destructor(struct futex_hash_bucket * *p)
{
	struct futex_hash_bucket * _T = *p;
	if (_T)
		futex_hash_put(_T);
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
struct futex_hash_bucket *
class_hb_constructor(union futex_key *key)
{
	struct futex_hash_bucket * t = futex_hash(key);
	return t;
}
</code></pre>

<pre><code class="language-C">int futex_wake(u32 *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
	struct futex_q *this, *next;
	union futex_key key = (union futex_key) { .both = { .ptr = 0ULL } };
	struct wake_q_head wake_q = { ((struct wake_q_node *) 0x01), &amp;wake_q.first };
	int ret;

	if (!bitset)
		return -22;

	ret = get_futex_key(uaddr, flags, &amp;key, FUTEX_READ);
	if (__builtin_expect(!!(ret != 0), 0))
		return ret;

	if ((flags &amp; 0x0100) &amp;&amp; !nr_wake)
		return 0;

	class_hb_t hb __attribute__((__cleanup__(class_hb_destructor))) = class_hb_constructor(&amp;key);


	if (!futex_hb_waiters_pending(hb))
		return ret;

	spin_lock(&amp;hb-&gt;lock);

	/* ... */
}
</code></pre>

<p>(*) I use <code>make V=1</code> on the file I want to expand, copy the big command producing the .o, ditch the <code>-Wp,-MMD,**.o.d</code> part and add a -E to it.</p>

<h2 id="define-guard">DEFINE_GUARD</h2>

<p>For now, ignore the <code>CONDITIONAL</code> and <code>LOCK_PTR</code> stuff; it&#39;s only relevant to the scoped &amp; conditional guards, which we&#39;ll get to later.</p>

<pre><code class="language-C">#define DEFINE_CLASS_IS_GUARD(_name) \
	__DEFINE_CLASS_IS_CONDITIONAL(_name, false); \
	__DEFINE_GUARD_LOCK_PTR(_name, _T)

#define DEFINE_GUARD(_name, _type, _lock, _unlock) \
	DEFINE_CLASS(_name, _type, if (!__GUARD_IS_ERR(_T)) { _unlock; }, ({ _lock; _T; }), _type _T); \
	DEFINE_CLASS_IS_GUARD(_name)

#define guard(_name) \
	CLASS(_name, __UNIQUE_ID(guard))
</code></pre>

<p>So it&#39;s a <code>CLASS</code> with a constructor and destructor, but the added bonus is the automagic <code>__cleanup</code> variable definition.</p>

<p>Why is that relevant? Well, consider locks. You don&#39;t declare a variable for a lock acquisition &amp; release, you manipulate an already-allocated object (e.g. a mutex). However, no variable declaration means no <code>__cleanup</code>. So this just declares a variable to slap <code>__cleanup</code> onto it and have an automagic out-of-scope cleanup callback.</p>

<p>Let&#39;s have a look at an example in the thermal subsystem with a mutex critical section:</p>

<pre><code class="language-C">DEFINE_GUARD(cooling_dev, struct thermal_cooling_device *, mutex_lock(&amp;_T-&gt;lock),
	     mutex_unlock(&amp;_T-&gt;lock))

static ssize_t
cur_state_store(struct device *dev, struct device_attribute *attr,
		const char *buf, size_t count)
{
	struct thermal_cooling_device *cdev = to_cooling_device(dev);
	unsigned long state;
	int result;
	/* ... */
	if (state &gt; cdev-&gt;max_state)
		return -EINVAL;

	guard(cooling_dev)(cdev);

	result = cdev-&gt;ops-&gt;set_cur_state(cdev, state);
	if (result)
		return result;
	/* ... */
}
</code></pre>

<p>The preprocessor output looks like so:</p>

<pre><code class="language-C">typedef struct thermal_cooling_device * class_cooling_dev_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__)) void
class_cooling_dev_destructor(struct thermal_cooling_device * *p)
{
	struct thermal_cooling_device * _T = *p;
	if (!({
				unsigned long _rc = (unsigned long)(_T);
				__builtin_expect(!!((_rc - 1) &gt;= -4095 - 1), 0);
			})) {
		mutex_unlock(&amp;_T-&gt;lock);
	};
}
static inline __attribute__((__gnu_inline__))
__attribute__((__unused__)) __attribute__((no_instrument_function))
__attribute__((__always_inline__)) struct thermal_cooling_device *
class_cooling_dev_constructor(struct thermal_cooling_device * _T)
{
	struct thermal_cooling_device * t =
		({ mutex_lock(&amp;_T-&gt;lock); _T; });
	return t;
}

static __attribute__((__unused__)) const bool class_cooling_dev_is_conditional = false;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__)) void *
class_cooling_dev_lock_ptr(class_cooling_dev_t *_T)
{
	void *_ptr = (void *)(unsigned long)*(_T);
	if (IS_ERR(_ptr)) {
		_ptr = ((void *)0);
	}
	return _ptr;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__)) int
class_cooling_dev_lock_err(class_cooling_dev_t *_T)
{
	long _rc = (unsigned long)*(_T);
	if (!_rc) {
		_rc = -16;
	}
	if (!__builtin_expect(!!((unsigned long)(void *)(_rc) &gt;= (unsigned long)-4095), 0)) {
		_rc = 0;
	}
	return _rc;
}
</code></pre>

<pre><code class="language-C">static ssize_t
cur_state_store(struct device *dev, struct device_attribute *attr,
		const char *buf, size_t count)
{
	struct thermal_cooling_device *cdev = ({ void *__mptr = (void *)(dev); _Static_assert(__builtin_types_compatible_p(typeof(*(dev)), typeof(((struct thermal_cooling_device *)0)-&gt;device)) || __builtin_types_compatible_p(typeof(*(dev)), typeof(void)), &#34;pointer type mismatch in container_of()&#34;); ((struct thermal_cooling_device *)(__mptr - __builtin_offsetof(struct thermal_cooling_device, device))); });
	unsigned long state;
	int result;

	if (sscanf(buf, &#34;%ld\n&#34;, &amp;state) != 1)
		return -22;

	if ((long)state &lt; 0)
		return -22;

	if (state &gt; cdev-&gt;max_state)
		return -22;

	class_cooling_dev_t __UNIQUE_ID_guard_435 __attribute__((__cleanup__(class_cooling_dev_destructor))) = class_cooling_dev_constructor(cdev);

	result = cdev-&gt;ops-&gt;set_cur_state(cdev, state);
	if (result)
		return result;

	thermal_cooling_device_stats_update(cdev, state);

	return count;
}
</code></pre>

<h2 id="define-lock-guard">DEFINE_LOCK_GUARD</h2>

<p>Okay, we have sort-of-smart pointers, classes, guards for locks, what&#39;s next? Well, certain locks need more than just a pointer for the lock &amp; unlock operations. For instance, the scheduler&#39;s runqueue locks need both a <code>struct rq</code> pointer and a <code>struct rq_flags</code> pointer.</p>

<p>So <code>LOCK_GUARD</code>&#39;s are going to be enhanced <code>GUARD</code>&#39;s manipulating a composite type instead of a single pointer:</p>

<pre><code class="language-C">#define __DEFINE_UNLOCK_GUARD(_name, _type, _unlock, ...)		\
typedef struct {							\
	_type *lock;							\
	__VA_ARGS__;							\
} class_##_name##_t;							\
</code></pre>

<p>Note that there is also the “no pointer” special case, for when there is no accessible type for the manipulated lock – think <code>preempt_disable()</code>, <code>migrate_disable()</code>, <code>rcu_read_lock()</code>. Just like for <code>GUARD</code>, we still declare a variable to slap <code>__cleanup</code> onto it.</p>

<h3 id="let-s-look-at-the-rcu-case">Let&#39;s look at the RCU case:</h3>

<pre><code class="language-C">DEFINE_LOCK_GUARD_0(rcu,
	do {
		rcu_read_lock();
		/*
		 * sparse doesn&#39;t call the cleanup function,
		 * so just release immediately and don&#39;t track
		 * the context. We don&#39;t need to anyway, since
		 * the whole point of the guard is to not need
		 * the explicit unlock.
		 */
		__release(RCU);
	} while (0),
	rcu_read_unlock())
</code></pre>

<pre><code class="language-C">void wake_up_if_idle(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	guard(rcu)();
	if (is_idle_task(rcu_dereference(rq-&gt;curr))) {
		// ....
	}
}
</code></pre>

<pre><code class="language-C">static __attribute__((__unused__)) const bool class_rcu_is_conditional = false;
typedef struct {
	void *lock;
} class_rcu_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void
class_rcu_destructor(class_rcu_t *_T)
{
	if (!({
				unsigned long _rc = ( unsigned long)(_T-&gt;lock);
				__builtin_expect(!!((_rc - 1) &gt;= -4095 - 1), 0);
			})) {
		rcu_read_unlock();
	}
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void *
class_rcu_lock_ptr(class_rcu_t *_T)
{
	void *_ptr = (void *)( unsigned long)*(&amp;_T-&gt;lock);
	if (IS_ERR(_ptr)) {
		_ptr = ((void *)0);
	}
	return _ptr;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
int
class_rcu_lock_err(class_rcu_t *_T)
{
	long _rc = ( unsigned long)*(&amp;_T-&gt;lock);
	if (!_rc) {
		_rc = -16;
	}
	if (!__builtin_expect(!!((unsigned long)(void *)(_rc) &gt;= (unsigned long)-4095), 0)) {
		_rc = 0;
	}
	return _rc;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
class_rcu_t
class_rcu_constructor(void)
{
	class_rcu_t _t = { .lock = (void*)1 }, *_T __attribute__((__unused__)) = &amp;_t;
	do {
		rcu_read_lock();
		(void)0; // __release(RCU); just for sparse, see comment in definition
	} while (0);
	return _t;
}
</code></pre>

<pre><code class="language-C">void wake_up_if_idle(int cpu)
{
	struct rq *rq = (&amp;(*({ do { const void __seg_gs *__vpp_verify = (typeof((&amp;(runqueues)) + 0))((void *)0); (void)__vpp_verify; } while (0); ({ unsigned long __ptr; __asm__ (&#34;&#34; : &#34;=r&#34;(__ptr) : &#34;0&#34;((__typeof_unqual__(*((&amp;(runqueues)))) *)(( unsigned long)((&amp;(runqueues)))))); (typeof((__typeof_unqual__(*((&amp;(runqueues)))) *)(( unsigned long)((&amp;(runqueues)))))) (__ptr + (((__per_cpu_offset[((cpu))])))); }); })));
	class_rcu_t __UNIQUE_ID_guard_1486 __attribute__((__cleanup__(class_rcu_destructor))) =
		class_rcu_constructor();

	if (is_idle_task(...)) {
		// ...
	}
}
</code></pre>

<h3 id="let-s-look-at-the-runqueue-lock">Let&#39;s look at the runqueue lock:</h3>

<pre><code class="language-C">DEFINE_LOCK_GUARD_1(rq_lock_irqsave, struct rq,
		    rq_lock_irqsave(_T-&gt;lock, &amp;_T-&gt;rf),
		    rq_unlock_irqrestore(_T-&gt;lock, &amp;_T-&gt;rf),
		    struct rq_flags rf)
</code></pre>

<pre><code class="language-C">static void sched_balance_update_blocked_averages(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	guard(rq_lock_irqsave)(rq);
	update_rq_clock(rq);
	__sched_balance_update_blocked_averages(rq);
}
</code></pre>

<pre><code class="language-C">static __attribute__((__unused__)) const bool class_rq_lock_irqsave_is_conditional = false;

typedef struct {
	struct rq *lock;
	struct rq_flags rf;
} class_rq_lock_irqsave_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void
class_rq_lock_irqsave_destructor(class_rq_lock_irqsave_t *_T)
{
	if (!({
				unsigned long _rc = ( unsigned long)(_T-&gt;lock);
				__builtin_expect(!!((_rc - 1) &gt;= -4095 - 1), 0);
			})) {
		rq_unlock_irqrestore(_T-&gt;lock, &amp;_T-&gt;rf);
	}
}
static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void *
class_rq_lock_irqsave_lock_ptr(class_rq_lock_irqsave_t *_T)
{
	void *_ptr = (void *)( unsigned long)*(&amp;_T-&gt;lock);
	if (IS_ERR(_ptr)) {
		_ptr = ((void *)0);
	}
	return _ptr;
}
static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
int
class_rq_lock_irqsave_lock_err(class_rq_lock_irqsave_t *_T)
{
	long _rc = ( unsigned long)*(&amp;_T-&gt;lock);
	if (!_rc) {
		_rc = -16;
	}
	if (!__builtin_expect(!!((unsigned long)(void *)(_rc) &gt;= (unsigned long)-4095), 0)) {
		_rc = 0;
	}
	return _rc;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
class_rq_lock_irqsave_t
class_rq_lock_irqsave_constructor(struct rq *l)
{
	class_rq_lock_irqsave_t _t = { .lock = l }, *_T = &amp;_t;
	rq_lock_irqsave(_T-&gt;lock, &amp;_T-&gt;rf);
	return _t;
}
</code></pre>

<pre><code class="language-C">static void sched_balance_update_blocked_averages(int cpu)
{
 struct rq *rq = (&amp;(*({ do { const void __seg_gs *__vpp_verify = (typeof((&amp;(runqueues)) + 0))((void *)0); (void)__vpp_verify; } while (0); ({ unsigned long __ptr; __asm__ (&#34;&#34; : &#34;=r&#34;(__ptr) : &#34;0&#34;((__typeof_unqual__(*((&amp;(runqueues)))) *)(( unsigned long)((&amp;(runqueues)))))); (typeof((__typeof_unqual__(*((&amp;(runqueues)))) *)(( unsigned long)((&amp;(runqueues)))))) (__ptr + (((__per_cpu_offset[((cpu))])))); }); })));

 class_rq_lock_irqsave_t __UNIQUE_ID_guard_1377
	 __attribute__((__cleanup__(class_rq_lock_irqsave_destructor))) =
	 class_rq_lock_irqsave_constructor(rq);

 update_rq_clock(rq);
 __sched_balance_update_blocked_averages(rq);
}
</code></pre>

<h2 id="scopes">SCOPES</h2>

<p>Scope creation is slightly different for classes and guards, but both follow the same principle.</p>

<h3 id="class">Class</h3>

<pre><code class="language-C">#define __scoped_class(_name, var, _label, args...)        \
	for (CLASS(_name, var)(args); ; ({ goto _label; })) \
		if (0) {                                   \
_label:                                                    \
			break;                             \
		} else

#define scoped_class(_name, var, args...) \
	__scoped_class(_name, var, __UNIQUE_ID(label), args)
</code></pre>

<p>That for+if+goto trinity looks a bit unholy at first, but let&#39;s look at what the requirements are for a macro that lets us create a new scope:
– create a new scope
– declare the <code>__cleanup</code> variable in that new scope
– make the macro usable either with a single statement, or with curly braces</p>

<p>A for loop gives us the declaration and the scope. However, that for loop needs to run exactly once, and it&#39;d be a shame to have to declare a loop counter. The “run exactly once” mechanism is thus encoded in the form of the if+goto.</p>

<p>Consider:</p>

<pre><code>	for (CLASS(_name, var)(args); ; ({ goto _label; }))
		if (0) {
_label:
			break;
		} else {
		   stmt;
		}
</code></pre>

<p>The execution order will be:</p>

<pre><code>CLASS(_name, var)(args);
stmt;
goto _label;
break;
</code></pre>

<p>We thus save ourselves the need for an extra variable at the cost of mild code reader confusion, a common trick used in the kernel.</p>

<h3 id="guards">Guards</h3>

<p>For guard scopes, we find the same for+if+goto construct but with some added checks. For regular (unconditional) guards, this is pretty much the same as for <code>CLASS</code>&#39;es:</p>

<pre><code class="language-C">/*
 * Helper macro for scoped_guard().
 *
 * Note that the &#34;!__is_cond_ptr(_name)&#34; part of the condition ensures that
 * compiler would be sure that for the unconditional locks the body of the
 * loop (caller-provided code glued to the else clause) could not be skipped.
 * It is needed because the other part - &#34;__guard_ptr(_name)(&amp;scope)&#34; - is too
 * hard to deduce (even if could be proven true for unconditional locks).
 */
#define __scoped_guard(_name, _label, args...)				\
	for (CLASS(_name, scope)(args);					\
	     __guard_ptr(_name)(&amp;scope) || !__is_cond_ptr(_name);	\
	     ({ goto _label; }))					\
		if (0) {						\
_label:									\
			break;						\
		} else

#define scoped_guard(_name, args...)	\
	__scoped_guard(_name, __UNIQUE_ID(label), args)
</code></pre>

<p>For conditional guards, we mainly factor in the fact that the constructor can “fail”. This is relevant for e.g. trylocks where the lock acquisition isn&#39;t guaranteed to succeed.</p>

<pre><code class="language-C">#define __scoped_cond_guard(_name, _fail, _label, args...)		\
	for (CLASS(_name, scope)(args); true; ({ goto _label; }))	\
		if (!__guard_ptr(_name)(&amp;scope)) {			\
			BUILD_BUG_ON(!__is_cond_ptr(_name));		\
			_fail;						\
_label:									\
			break;						\
		} else

#define scoped_cond_guard(_name, _fail, args...)	\
	__scoped_cond_guard(_name, _fail, __UNIQUE_ID(label), args)
</code></pre>

<p>So in the end, that <code>__DEFINE_CLASS_IS_CONDITIONAL()</code> faff is there:
– To help optimize unconditional guard scopes
– To ensure conditional guard scopes are used correctly (i.e. the lock acquisition failure is expected)</p>

<h2 id="debuggability">Debuggability</h2>

<p>You&#39;ll note that while guards eliminate an entire class of errors associated with <code>goto</code>&#39;s, they do shuffle the code around.</p>

<p>From my experimentation, if you put the constructor and the destructor on separate lines in the <code>CLASS</code>/<code>GUARD</code> definition, you&#39;ll at least be able to tell them apart during a splat:</p>

<pre><code class="language-C">DEFINE_LOCK_GUARD_1(raw_spinlock_irqsave_bug, raw_spinlock_t,
		    raw_spin_lock_irqsave(_T-&gt;lock, _T-&gt;flags),
spinlock.h:571:	    raw_spin_unlock_irqrestore_bug(_T-&gt;lock, _T-&gt;flags),
		    unsigned long flags)

int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
	        ...
core.c:4108	scoped_guard (raw_spinlock_irqsave_bug, &amp;p-&gt;pi_lock) {
	        }
	        ...
}
</code></pre>

<pre><code>[    0.216287] kernel BUG at ./include/linux/spinlock.h:571!
[    0.217115] Oops: invalid opcode: 0000 [#1] SMP PTI
[    0.217285] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    0.217285] RIP: 0010:try_to_wake_up (./include/linux/spinlock.h:569 (discriminator 6) kernel/sched/core.c:4108 (discriminator 6))
[    0.217285] Call Trace:
[    0.217285]  &lt;TASK&gt;
[    0.217285]  ? __pfx_kthread_worker_fn (kernel/kthread.c:966)
[    0.217285]  __kthread_create_on_node (kernel/kthread.c:535)
[    0.217285]  kthread_create_worker_on_node (kernel/kthread.c:1043 (discriminator 1) kernel/kthread.c:1073 (discriminator 1))
[    0.217285]  ? vprintk_emit (kernel/printk/printk.c:4625 kernel/printk/printk.c:2433)
[    0.217285]  workqueue_init (kernel/workqueue.c:7873 kernel/workqueue.c:7922)
[    0.217285]  kernel_init_freeable (init/main.c:1675)
[    0.217285]  ? __pfx_kernel_init (init/main.c:1570)
[    0.217285]  kernel_init (init/main.c:1580)
[    0.217285]  ret_from_fork (arch/x86/kernel/process.c:164)
[    0.217285]  ? __pfx_kernel_init (init/main.c:1570)
[    0.217285]  ret_from_fork_asm (arch/x86/entry/entry_64.S:259)
[    0.217285]  &lt;/TASK&gt;
</code></pre>

<h2 id="tl-dr">TL;DR</h2>
<ul><li><p>DEFINE_FREE()</p>
<ul><li>Sort-of-smart pointer</li>
<li>Definition tied to the freeing function, e.g. <code>DEFINE_FREE(kfree, ...)</code></li></ul></li>

<li><p>DEFINE_CLASS()</p>
<ul><li>Like <code>DEFINE_FREE()</code> but with factorized initialization.</li></ul></li>

<li><p>DEFINE_GUARD()</p>
<ul><li>Like <code>DEFINE_CLASS()</code> but you don&#39;t need the underlying variable</li>
<li>e.g. locks don&#39;t require declaring a variable, you just lock and unlock them.</li></ul></li>

<li><p>DEFINE_LOCK_GUARD()</p>
<ul><li>Like <code>DEFINE_GUARD()</code> but when a single pointer isn&#39;t sufficient for lock/unlock operations.</li>
<li>Also for “special” locks with no underlying type such as RCU, preempt or <code>migrate_disable</code>.</li></ul></li></ul>
]]></content:encoded>
      <author>Valentin Schneider</author>
      <guid>https://people.kernel.org/read/a/afa73uhe3s</guid>
      <pubDate>Fri, 30 Jan 2026 10:39:06 +0000</pubDate>
    </item>
    <item>
      <title>Tracking kernel development with korgalore</title>
      <link>https://people.kernel.org/monsieuricon/tracking-kernel-development-with-korgalore</link>
      <description>&lt;![CDATA[TLDR: use korgalore to bypass mailing list delivery problems&#xA;&#xA;If you&#39;re a Gmail or Outlook user and you&#39;re subscribed to high-volume mailing lists, you&#39;re probably routinely missing mail. Korgalore is a tool that monitors mailing lists via lore.kernel.org and can import mail directly into your inbox so you don&#39;t miss any of it. You can also couple korgalore with lei for powerful filtering features that can reduce the firehose to what you&#39;d actually find useful.&#xA;&#xA;The problem with the &#34;big 3&#34;&#xA;&#xA;If you&#39;re a user of Gmail or Outlook trying to participate in Linux kernel development, you&#39;re probably aware that it&#39;s... not great. Truth is, it&#39;s nearly impossible these days to run a technical mailing list and expect that it will be successfully delivered to the &#34;big 3&#34; consumer-grade mailbox providers -- Gmail, Outlook, or Yahoo.&#xA;&#xA;There are many reasons for it, and the primary one is that technical mail looks nothing like 99.99% of the mail traffic that their filters are trained on, and therefore when a technical message arrives, especially if it includes a patch, the automation thinks it&#39;s likely spam or something potentially unsafe. If you&#39;re not checking your junk folder daily, you&#39;re probably missing a lot of legitimate email.&#xA;&#xA;Worst of all, if you&#39;re trying to subscribe to a high-volume mailing list using gmail or outlook, you can forget it -- you will hit delivery quotas almost instantly. Our outgoing mail nodes routinely hit queues of 100,000+ messages, all because of &#34;temporary delivery quotas&#34; trying to deliver mail to gmail subscribers.&#xA;&#xA;Korgalore is a tool that can help. 
It fetches messages directly from public-inbox archives (like lore.kernel.org) and delivers them directly to your mailbox, bypassing all the problematic mail routing that causes messages to go missing.&#xA;&#xA;How korgalore helps&#xA;&#xA;We cannot fix email delivery, but we can sidestep it entirely. Public-inbox archives like lore.kernel.org store all mailing list traffic in git repositories. In its simplest configuration, korgalore can shallow-clone these repositories directly and upload any new messages straight to your mailbox using the provider&#39;s API.&#xA;&#xA;This approach has several advantages:&#xA;&#xA;Nothing gets lost — you get every message that was posted to the list&#xA;You control the labels/folders — organize messages however you want&#xA;Works with your existing workflow — messages appear in your regular inbox&#xA;&#xA;Korgalore currently supports these delivery targets:&#xA;&#xA;Gmail (via API with OAuth2)&#xA;Microsoft 365 (via IMAP with OAuth2)&#xA;Generic IMAP servers&#xA;JMAP servers (Fastmail, etc.)&#xA;Local maildir&#xA;Pipe to external command (e.g. so you can feed it to fetchmail)&#xA;&#xA;Installing korgalore&#xA;&#xA;The easiest way to install korgalore is via pipx:&#xA;&#xA;$ pipx install korgalore&#xA;[...]&#xA;$ kgl --version&#xA;kgl, version 0.4&#xA;&#xA;For the GUI application, you&#39;ll also need GTK and AppIndicator libraries. On Fedora:&#xA;&#xA;$ sudo dnf install python3-gobject gtk3 libappindicator-gtk3&#xA;$ pipx install &#39;korgalore[gui]&#39;&#xA;&#xA;Getting started with Gmail&#xA;&#xA;This is the hardest part of the process, because Google makes it unreasonably hard to get API access to your own inbox. 
It&#39;s like they don&#39;t want you to even try it.&#xA;&#xA;Getting API access credentials&#xA;&#xA;You will need to start by getting OAUTH2 client credentials.&#xA;&#xA;If you are a kernel maintainer with an active kernel.org account, you can run the following command to get what you&#39;ll need directly from us:&#xA;&#xA;$ ssh git@gitolite.kernel.org get-kgl-creds&#xA;&#xA;If you&#39;re not a kernel maintainer, then I&#39;m afraid you&#39;re going to have to jump through a bajillion hoops. The process is described on this page:&#xA;&#xA;Korgalore quickstart&#xA;&#xA;Authenticating with Google&#xA;&#xA;Once you have the json file with your credentials, run kgl edit-config. An editor will open with the following content:&#xA;&#xA;Targets &#xA;&#xA;[targets.personal]&#xA;type = &#39;gmail&#39;&#xA;credentials = &#39;~/.config/korgalore/credentials.json&#39;&#xA;token = &#39;~/.config/korgalore/token.json&#39;&#xA;&#xA;Deliveries &#xA;&#xA;[deliveries.lkml]&#xA;feed = &#39;https://lore.kernel.org/lkml&#39;&#xA;target = &#39;personal&#39;&#xA;labels = [&#39;INBOX&#39;, &#39;UNREAD&#39;]&#xA;&#xA;Just save it for now without any edits, but make a note where the credentials path is. 
Save the json file you got from Google (or from us) to that location: ~/.config/korgalore/credentials.json.&#xA;&#xA;Next, authenticate with your Gmail account:&#xA;&#xA;$ kgl auth personal&#xA;&#xA;This opens a browser window for OAuth2 authentication, so it needs to run on your workstation, because it will need to talk to localhost to complete the authentication.&#xA;&#xA;Once you have obtained the token, it is stored locally and refreshed automatically unless revalidation is required (once a week for &#34;testing&#34; applications).&#xA;&#xA;Configure a delivery&#xA;&#xA;Let&#39;s say you want to subscribe to the netdev list.&#xA;&#xA;Edit the configuration file again:&#xA;&#xA;$ kgl edit-config&#xA;&#xA;Add a delivery that maps the netdev feed to your Gmail account:&#xA;&#xA;[deliveries.netdev]&#xA;feed = &#39;https://lore.kernel.org/netdev&#39;&#xA;target = &#39;personal&#39;&#xA;labels = [&#39;INBOX&#39;, &#39;UNREAD&#39;]&#xA;&#xA;Run kgl pull once after this to initialize the subscription. 
No mail will be delivered on the first run, so it&#39;s just like subscribing to a real list.&#xA;&#xA;$ kgl pull&#xA;Updating feeds  [####################################]  1/1&#xA;Initialized new feed: netdev&#xA;Pull complete with no updates.&#xA;&#xA;You can add any list hosted on lore.kernel.org (or on any other public-inbox server) as a separate delivery.&#xA;&#xA;Periodic pulls&#xA;&#xA;Next time you run kgl pull you will see something like this, assuming there is new mail to be delivered:&#xA;&#xA;$ kgl pull&#xA;Updating feeds  [####################################]  1/1&#xA;Delivering to personal  [####################################]  2/2&#xA;Pull complete with updates:&#xA;  netdev: 2&#xA;&#xA;If all went well, messages will appear in your Gmail inbox.&#xA;&#xA;Targets other than gmail&#xA;&#xA;Korgalore will happily deliver to the following targets:&#xA;&#xA;Gmail&#xA;Outlook 365&#xA;Generic IMAP&#xA;JMAP (Fastmail)&#xA;Maildir&#xA;Pipe&#xA;&#xA;Refer to the following documentation page on configuration details:&#xA;&#xA;Korgalore configuration&#xA;&#xA;Yanking a random thread&#xA;&#xA;See a thread on lore that you just really want to answer? You can yank it into your inbox by just pasting the URL to the message you want (or use --thread for the whole thread):&#xA;&#xA;$ kgl yank --thread https://lore.kernel.org/netdev/CAFfOh4cX0+L=ieAJF7QBvH-dDYsHnTUuN4gApguqxVpWyy2g@mail.gmail.com&#xA;Found 5 messages in thread&#xA;Uploading thread  [####################################]  5/5&#xA;Successfully uploaded 5 messages from thread&#xA;&#xA;Doing a lot more with lei&#xA;&#xA;The lei tool is the client-side utility for querying and interacting with public-inbox servers. It should be installable on most distributions these days. 
For Fedora:&#xA;&#xA;$ sudo dnf install lei&#xA;&#xA;This will pull in a large number of Perl dependencies, but they are all fairly tiny.&#xA;&#xA;Yank and track a thread&#xA;&#xA;Sometimes you don&#39;t want to follow an entire list, just a specific hot topic discussion. The track command lets you yank a thread and then receive any follow-ups to it. Korgalore lets you do that easily:&#xA;&#xA;$ kgl track add https://lore.kernel.org/lkml/20260116.feegh2ohQuae@digikod.net/&#xA;Creating lei search for thread: 20260116.feegh2ohQuae@digikod.net&#xA;Populating lei search repository...&#xA;Started tracking thread track-bcd0dc0604fd: Re: [GIT PULL] Landlock fix for v6.19-rc6&#xA;Now tracking thread track-bcd0dc0604fd: Re: [GIT PULL] Landlock fix for v6.19-rc6&#xA;Target: personal, Labels: INBOX, UNREAD&#xA;Delivering 5 messages to target...&#xA;Delivered 5 messages.&#xA;&#xA;This creates a persistent search that monitors lore.kernel.org for replies to that thread. New messages are automatically delivered during regular pull operations.&#xA;&#xA;Tired of tracking a thread? Find it with kgl track list and then stop following it:&#xA;&#xA;$ kgl track stop track-bcd0dc0604fd&#xA;&#xA;Threads automatically expire after 30 days of inactivity, but can be resumed if the discussion picks up again.&#xA;&#xA;Tracking messages for a specific subsystem&#xA;&#xA;If you&#39;re a maintainer, you can track your entire subsystem using the track-subsystem command. This parses the kernel&#39;s MAINTAINERS file and creates queries for all relevant mailing list traffic:&#xA;&#xA;$ kgl track-subsystem -m MAINTAINERS &#39;SELINUX SECURITY MODULE&#39;&#xA;Found subsystem: SELINUX SECURITY MODULE&#xA;Creating mailinglist query: l:selinux.vger.kernel.org AND d:7.days.ago..&#xA;Creating patches query: (dfn:... OR dfn:... 
[...]) AND d:7.days.ago..&#xA;Created 2 lei queries for subsystem &#34;SELINUX SECURITY MODULE&#34;&#xA;Configuration written to: /home/user/.config/korgalore/conf.d/selinuxsecuritymodule.toml&#xA;Target: personal, Labels: INBOX, UNREAD&#xA;&#xA;This effectively subscribes you to the selinux mailing list, plus creates a query that will match the patches touching that subsystem, using the patterns defined in MAINTAINERS.&#xA;&#xA;The next time you run kgl pull, it will upload the last 7 days of messages matching both queries:&#xA;&#xA;$ kgl pull&#xA;Updating feeds  [####################################]  3/3&#xA;Delivering to personal  [####################################]  37/37&#xA;Pull complete with updates:&#xA;  selinuxsecuritymodule-mailinglist: 33&#xA;  selinuxsecuritymodule-patches: 4&#xA;&#xA;Arbitrary lei queries&#xA;&#xA;Korgalore will happily follow arbitrary lei queries that you have defined. For example, if you want to receive a copy of all mail sent by a co-maintainer, you can run the following:&#xA;&#xA;$ lei q --only https://lore.kernel.org/all \&#xA;    -o v2:/home/user/.lei/comaintainer-spying \&#xA;    f:torvalds@linux-foundation.org AND d:7.days.ago..&#xA;&#xA;Then you can add the following section to korgalore.toml:&#xA;&#xA;[deliveries.comaintainer-spying]&#xA;feed = &#39;lei:/home/user/.lei/comaintainer-spying&#39;&#xA;target = &#39;personal&#39;&#xA;labels = [&#39;INBOX&#39;, &#39;UNREAD&#39;]&#xA;&#xA;Filtering unwanted senders&#xA;&#xA;Korgalore doesn&#39;t come with complicated filtering -- lei is much more suited for that purpose. 
However, if there is someone whose mail you absolutely never want to see, you can add them to the bozofilter.&#xA;&#xA;$ kgl bozofilter --add bozo@example.com --reason &#34;off-topic noise&#34;&#xA;&#xA;Blocked messages are silently skipped during delivery.&#xA;&#xA;Using the GUI taskbar app for background syncing&#xA;&#xA;For day-to-day use, the GUI application runs in your system tray and syncs automatically:&#xA;&#xA;$ kgl gui&#xA;&#xA;The GUI provides:&#xA;&#xA;Automatic background syncing at configurable intervals&#xA;Manual &#34;Sync Now&#34; when you want immediate updates&#xA;&#34;Yank&#34; dialog to fetch specific messages by URL or Message-ID&#xA;Network awareness — pauses sync when offline and resumes when connected&#xA;Re-authentication prompts when OAuth tokens expire&#xA;Quick editing of the config or the bozofilter&#xA;&#xA;Here are a couple of videos demonstrating the gui app in action:&#xA;&#xA;Korgalore with Gmail&#xA;Korgalore with Outlook&#xA;&#xA;Documentation and source&#xA;&#xA;Full documentation is available at:&#xA;&#xA;    https://korgalore.docs.kernel.org/&#xA;&#xA;Source repository:&#xA;&#xA;    https://git.kernel.org/pub/scm/utils/korgalore/korgalore.git&#xA;&#xA;If you run into issues or have feature requests, please send them to tools@kernel.org.]]&gt;</description>
      <content:encoded><![CDATA[<h2 id="tldr-use-korgalore-to-bypass-mailing-list-delivery-problems">TLDR: use korgalore to bypass mailing list delivery problems</h2>

<p>If you&#39;re a Gmail or Outlook user and you&#39;re subscribed to high-volume mailing lists, you&#39;re probably routinely missing mail. Korgalore is a tool that monitors mailing lists via lore.kernel.org and can import mail directly into your inbox so you don&#39;t miss any of it. You can also couple korgalore with lei for powerful filtering features that can reduce the firehose to what you&#39;d actually find useful.</p>

<h2 id="the-problem-with-the-big-3">The problem with the “big 3”</h2>

<p>If you&#39;re a user of Gmail or Outlook trying to participate in Linux kernel development, you&#39;re probably aware that it&#39;s... not great. Truth is, it&#39;s nearly impossible these days to run a technical mailing list and expect that its mail will be successfully delivered to the “big 3” consumer-grade mailbox providers — Gmail, Outlook, or Yahoo.</p>

<p>There are many reasons for it, and the primary one is that technical mail looks nothing like 99.99% of the mail traffic that their filters are trained on, and therefore when a technical message arrives, especially if it includes a patch, the automation thinks it&#39;s likely spam or something potentially unsafe. If you&#39;re not checking your junk folder daily, you&#39;re probably missing a lot of legitimate email.</p>

<p>Worst of all, if you&#39;re trying to subscribe to a high-volume mailing list using Gmail or Outlook, you can forget it — you will hit delivery quotas almost instantly. Our outgoing mail nodes routinely hit queues of 100,000+ messages, all because of “temporary delivery quotas” trying to deliver mail to Gmail subscribers.</p>

<p>Korgalore is a tool that can help. It fetches messages directly from public-inbox archives (like lore.kernel.org) and delivers them straight to your mailbox, bypassing all the problematic mail routing that causes messages to go missing.</p>

<h2 id="how-korgalore-helps">How korgalore helps</h2>

<p>We cannot fix email delivery, but we can sidestep it entirely. Public-inbox archives like lore.kernel.org store all mailing list traffic in git repositories. In its simplest configuration, korgalore can shallow-clone these repositories directly and upload any new messages straight to your mailbox using the provider&#39;s API.</p>

<p>This approach has several advantages:</p>
<ul><li><strong>Nothing gets lost</strong> — you get every message that was posted to the list</li>
<li><strong>You control the labels/folders</strong> — organize messages however you want</li>
<li><strong>Works with your existing workflow</strong> — messages appear in your regular inbox</li></ul>

<p>Korgalore currently supports these delivery targets:</p>
<ul><li>Gmail (via API with OAuth2)</li>
<li>Microsoft 365 (via IMAP with OAuth2)</li>
<li>Generic IMAP servers</li>
<li>JMAP servers (Fastmail, etc.)</li>
<li>Local maildir</li>
<li>Pipe to external command (e.g. so you can feed it to fetchmail)</li></ul>

<h2 id="installing-korgalore">Installing korgalore</h2>

<p>The easiest way to install korgalore is via pipx:</p>

<pre><code>$ pipx install korgalore
[...]
$ kgl --version
kgl, version 0.4
</code></pre>

<p>For the GUI application, you&#39;ll also need GTK and AppIndicator libraries. On Fedora:</p>

<pre><code>$ sudo dnf install python3-gobject gtk3 libappindicator-gtk3
$ pipx install &#39;korgalore[gui]&#39;
</code></pre>

<h2 id="getting-started-with-gmail">Getting started with Gmail</h2>

<p>This is the hardest part of the process, because <strong>Google makes it unreasonably hard to get API access to your own inbox</strong>. It&#39;s like they don&#39;t want you to even try it.</p>

<h3 id="getting-api-access-credentials">Getting API access credentials</h3>

<p>You will need to start by getting OAuth2 client credentials.</p>

<p>If you are a kernel maintainer with an active kernel.org account, you can run the following command to get what you&#39;ll need directly from us:</p>

<pre><code>$ ssh git@gitolite.kernel.org get-kgl-creds
</code></pre>

<p>If you&#39;re not a kernel maintainer, then I&#39;m afraid you&#39;re going to have to jump through a bajillion hoops. The process is described on this page:</p>
<ul><li><a href="https://korgalore.docs.kernel.org/en/latest/quickstart.html#step-2-set-up-gmail-api-credentials" rel="nofollow">Korgalore quickstart</a></li></ul>

<h3 id="authenticating-with-google">Authenticating with Google</h3>

<p>Once you have the json file with your credentials, run <code>kgl edit-config</code>. An editor will open with the following content:</p>

<pre><code class="language-toml">### Targets ###

[targets.personal]
type = &#39;gmail&#39;
credentials = &#39;~/.config/korgalore/credentials.json&#39;
# token = &#39;~/.config/korgalore/token.json&#39;

### Deliveries ###

# [deliveries.lkml]
# feed = &#39;https://lore.kernel.org/lkml&#39;
# target = &#39;personal&#39;
# labels = [&#39;INBOX&#39;, &#39;UNREAD&#39;]
</code></pre>

<p>Just save it for now without any edits, but make a note where the credentials path is. Save the json file you got from Google (or from us) to that location: <code>~/.config/korgalore/credentials.json</code>.</p>

<p>Next, authenticate with your Gmail account:</p>

<pre><code>$ kgl auth personal
</code></pre>

<p>This opens a browser window for OAuth2 authentication, so it needs to run on your workstation, because it will need to talk to localhost to complete the authentication.</p>

<p>Once you have obtained the token, it is stored locally and refreshed automatically unless revalidation is required (once a week for “testing” applications).</p>

<h3 id="configure-a-delivery">Configure a delivery</h3>

<p>Let&#39;s say you want to subscribe to the netdev list.</p>

<p>Edit the configuration file again:</p>

<pre><code>$ kgl edit-config
</code></pre>

<p>Add a delivery that maps the netdev feed to your Gmail account:</p>

<pre><code class="language-toml">[deliveries.netdev]
feed = &#39;https://lore.kernel.org/netdev&#39;
target = &#39;personal&#39;
labels = [&#39;INBOX&#39;, &#39;UNREAD&#39;]
</code></pre>

<p>Run <code>kgl pull</code> once after this to initialize the subscription. No mail will be delivered on the first run, so it&#39;s just like subscribing to a real list.</p>

<pre><code>$ kgl pull
Updating feeds  [####################################]  1/1
Initialized new feed: netdev
Pull complete with no updates.
</code></pre>

<p>You can add any list hosted on lore.kernel.org (or on any other public-inbox server) as a separate delivery.</p>
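As an illustration (the second list here is my own example, not from the post), several deliveries can coexist in the same config file, each mapping one feed to a target with its own labels:

```toml
# Two independent deliveries into the same 'personal' target.
# Labels are applied per delivery, so each list can land under its own label.
[deliveries.netdev]
feed = 'https://lore.kernel.org/netdev'
target = 'personal'
labels = ['INBOX', 'UNREAD']

# Hypothetical second list; any public-inbox feed URL works the same way.
[deliveries.linux-mm]
feed = 'https://lore.kernel.org/linux-mm'
target = 'personal'
labels = ['linux-mm', 'UNREAD']
```

A single <code>kgl pull</code> then updates every configured feed in turn.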

<h3 id="periodic-pulls">Periodic pulls</h3>

<p>Next time you run <code>kgl pull</code> you will see something like this, assuming there is new mail to be delivered:</p>

<pre><code>$ kgl pull
Updating feeds  [####################################]  1/1
Delivering to personal  [####################################]  2/2
Pull complete with updates:
  netdev: 2
</code></pre>

<p>If all went well, messages will appear in your Gmail inbox.</p>
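Because <code>kgl pull</code> is a one-shot operation, you need to run it on a schedule to receive list mail continuously. A minimal sketch using cron (an assumption on my part — the GUI app covered later is the tool&#39;s built-in way to do background syncing):

```shell
# Hypothetical crontab entry (add via `crontab -e`):
# run `kgl pull` every 10 minutes, discarding the progress output.
*/10 * * * * kgl pull >/dev/null 2>&1
```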

<h2 id="targets-other-than-gmail">Targets other than Gmail</h2>

<p>Korgalore will happily deliver to the following targets:</p>
<ul><li>Gmail</li>
<li>Outlook 365</li>
<li>Generic IMAP</li>
<li>JMAP (Fastmail)</li>
<li>Maildir</li>
<li>Pipe</li></ul>

<p>Refer to the following documentation page for configuration details:</p>
<ul><li><a href="https://korgalore.docs.kernel.org/en/latest/configuration.html" rel="nofollow">Korgalore configuration</a></li></ul>

<h2 id="yanking-a-random-thread">Yanking a random thread</h2>

<p>See a thread on lore that you just really want to answer? You can <code>yank</code> it into your inbox by just pasting the URL to the message you want (or use <code>--thread</code> for the whole thread):</p>

<pre><code>$ kgl yank --thread https://lore.kernel.org/netdev/CAFfO_h4cX0+L=ieA_JF7QBvH-dDYsHnTUuN4gApguqxVpWyy2g@mail.gmail.com
Found 5 messages in thread
Uploading thread  [####################################]  5/5
Successfully uploaded 5 messages from thread
</code></pre>

<h2 id="doing-a-lot-more-with-lei">Doing a lot more with lei</h2>

<p>The <code>lei</code> tool is the client-side utility for querying and interacting with public-inbox servers. It should be installable on most distributions these days. For Fedora:</p>

<pre><code>$ sudo dnf install lei
</code></pre>

<p>This will pull in a large number of Perl dependencies, but they are all fairly tiny.</p>

<h3 id="yank-and-track-a-thread">Yank and track a thread</h3>

<p>Sometimes you don&#39;t want to follow an entire list, just a specific hot-topic discussion. The <code>track</code> command lets you yank a thread and then automatically receive any follow-ups to it:</p>

<pre><code>$ kgl track add https://lore.kernel.org/lkml/20260116.feegh2ohQuae@digikod.net/
Creating lei search for thread: 20260116.feegh2ohQuae@digikod.net
Populating lei search repository...
Started tracking thread track-bcd0dc0604fd: Re: [GIT PULL] Landlock fix for v6.19-rc6
Now tracking thread track-bcd0dc0604fd: Re: [GIT PULL] Landlock fix for v6.19-rc6
Target: personal, Labels: INBOX, UNREAD
Delivering 5 messages to target...
Delivered 5 messages.
</code></pre>

<p>This creates a persistent search that monitors lore.kernel.org for replies to that thread. New messages are automatically delivered during regular <code>pull</code> operations.</p>

<p>Tired of tracking a thread? Find it with <code>kgl track list</code> and then stop following it:</p>

<pre><code>$ kgl track stop track-bcd0dc0604fd
</code></pre>

<p>Threads automatically expire after 30 days of inactivity, but can be resumed if the discussion picks up again.</p>

<h3 id="tracking-messages-for-a-specific-subsystem">Tracking messages for a specific subsystem</h3>

<p>If you&#39;re a maintainer, you can track your entire subsystem using the <code>track-subsystem</code> command. This parses the kernel&#39;s MAINTAINERS file and creates queries for all relevant mailing list traffic:</p>

<pre><code>$ kgl track-subsystem -m MAINTAINERS &#39;SELINUX SECURITY MODULE&#39;
Found subsystem: SELINUX SECURITY MODULE
Creating mailinglist query: l:selinux.vger.kernel.org AND d:7.days.ago..
Creating patches query: (dfn:... OR dfn:... [...]) AND d:7.days.ago..
Created 2 lei queries for subsystem &#34;SELINUX SECURITY MODULE&#34;
Configuration written to: /home/user/.config/korgalore/conf.d/selinux_security_module.toml
Target: personal, Labels: INBOX, UNREAD
</code></pre>

<p>This effectively subscribes you to the selinux mailing list, plus creates a query that will match the patches touching that subsystem, using the patterns defined in MAINTAINERS.</p>

<p>The next time you run <code>kgl pull</code>, it will upload the last 7 days of messages matching both queries:</p>

<pre><code>$ kgl pull
Updating feeds  [####################################]  3/3
Delivering to personal  [####################################]  37/37
Pull complete with updates:
  selinux_security_module-mailinglist: 33
  selinux_security_module-patches: 4
</code></pre>

<h3 id="arbitrary-lei-queries">Arbitrary lei queries</h3>

<p>Korgalore will happily follow arbitrary lei queries that you have defined. For example, if you want to receive a copy of all mail sent by a co-maintainer, you can run the following:</p>

<pre><code>$ lei q --only https://lore.kernel.org/all \
    -o v2:/home/user/.lei/comaintainer-spying \
    f:torvalds@linux-foundation.org AND d:7.days.ago..
</code></pre>

<p>Then you can add the following section to <code>korgalore.toml</code>:</p>

<pre><code class="language-toml">[deliveries.comaintainer-spying]
feed = &#39;lei:/home/user/.lei/comaintainer-spying&#39;
target = &#39;personal&#39;
labels = [&#39;INBOX&#39;, &#39;UNREAD&#39;]
</code></pre>

<h2 id="filtering-unwanted-senders">Filtering unwanted senders</h2>

<p>Korgalore doesn&#39;t come with complicated filtering — lei is much more suited for that purpose. However, if there is someone whose mail you absolutely never want to see, you can add them to the bozofilter.</p>

<pre><code>$ kgl bozofilter --add bozo@example.com --reason &#34;off-topic noise&#34;
</code></pre>

<p>Blocked messages are silently skipped during delivery.</p>

<h2 id="using-the-gui-taskbar-app-for-background-syncing">Using the GUI taskbar app for background syncing</h2>

<p>For day-to-day use, the GUI application runs in your system tray and syncs automatically:</p>

<pre><code>$ kgl gui
</code></pre>

<p>The GUI provides:</p>
<ul><li>Automatic background syncing at configurable intervals</li>
<li>Manual “Sync Now” when you want immediate updates</li>
<li>“Yank” dialog to fetch specific messages by URL or Message-ID</li>
<li>Network awareness — pauses sync when offline and resumes when connected</li>
<li>Re-authentication prompts when OAuth tokens expire</li>
<li>Quick editing of the config or the bozofilter</li></ul>

<p>Here are a couple of videos demonstrating the GUI app in action:</p>
<ul><li><a href="https://youtu.be/wOi_qHSO7KY" rel="nofollow">Korgalore with Gmail</a></li>
<li><a href="https://youtu.be/_MPBXcaKOMU" rel="nofollow">Korgalore with Outlook</a></li></ul>

<h2 id="documentation-and-source">Documentation and source</h2>

<p>Full documentation is available at:</p>

<p>    <a href="https://korgalore.docs.kernel.org/" rel="nofollow">https://korgalore.docs.kernel.org/</a></p>

<p>Source repository:</p>

<p>    <a href="https://git.kernel.org/pub/scm/utils/korgalore/korgalore.git" rel="nofollow">https://git.kernel.org/pub/scm/utils/korgalore/korgalore.git</a></p>

<p>If you run into issues or have feature requests, please send them to tools@kernel.org.</p>
]]></content:encoded>
      <author>Konstantin Ryabitsev</author>
      <guid>https://people.kernel.org/read/a/03qcg074uf</guid>
      <pubDate>Tue, 20 Jan 2026 21:14:39 +0000</pubDate>
    </item>
    <item>
      <title>What on Earth Does Pointer Provenance Have to do With RCU?</title>
      <link>https://people.kernel.org/paulmck/what-on-earth-does-lifetime-end-pointer-zap-have-to-do-with-rcu</link>
      <description>&lt;![CDATA[TL;DR: Unless you are doing very strange things with RCU (read-copy update), not much!!!&#xA;&#xA;So why has the guy most responsible for Linux-kernel RCU spent so much time over the past five years working on the provenance-related lifetime-end pointer zap within the C++ Standards Committee?&#xA;&#xA;But first...&#xA;&#xA;What is Pointer Provenance?&#xA;&#xA;Back in the old days, provenance was for objets d&#39;art and the like, and we did not need it for our pointers, no sirree!!!  Pointers had bits, those bits formed memory addresses, and as often as not we didn&#39;t even need to worry about these addresses being translated.  But life is more complicated now.  On the other hand, computing life is also much bigger, faster, more reliable, and (usually) more productive, so be extremely careful what you wish for from back in the Good Old Days!&#xA;&#xA;These days, pointers have provenance as well as addresses, and this has consequences.  The C++ Standard (recent draft) states that when an object&#39;s storage duration ends, any pointers to that object become invalid.  For its part, the C Standard states that when an object&#39;s storage duration ends, any pointers to that object become indeterminate.  In both standards, the wording is more precise, but this will serve for our purposes.&#xA;&#xA;For the remainder of this document, we will follow C++ and say &#34;invalid&#34;, which is shorter than &#34;indeterminate&#34;.  We will balance this out by using C-language example code.  
Those preferring C++ will be happy to hear that this is the language that I use in my upcoming CPPCON presentation.&#xA;&#xA;Neither standard places any constraints on what a compiler can do with an invalid pointer value, even if all you are doing is loading or storing that value.&#xA;&#xA;Those of us who cut our teeth on assembly language might quite reasonably ask why anyone would even think to make pointers so grossly invalid that you cannot even load or store them.  To see the historical reasons, let&#39;s start by looking at pointer comparisons using this code fragment:&#xA;&#xA;p = kmalloc(...);&#xA;might_kfree(p);        // Pointer might become invalid (AKA &#34;zapped&#34;)&#xA;q = kmalloc(...);       // Assume that the addresses of p and q are equal.&#xA;if (p == q)             // Compiler can optimize as &#34;if (false)&#34;!!!&#xA;    do_something();&#xA;&#xA;Both p and q contain addresses, but the compiler also keeps track of the fact that their values were obtained from different invocations of kmalloc().  This information forms part of each pointer&#39;s provenance.  This means that p and q have different provenance, which in turn means that the compiler does not need to generate any code for the p == q comparison.  The two pointers&#39; provenance differs, so no matter what the addresses might be, the result cannot be anything other than false.&#xA;&#xA;And this is one motivation for pointer provenance and invalidity:  The results of operations on invalid pointers are not guaranteed, which provides additional opportunities for optimization.  This example perhaps seems a bit silly, but modern compilers can use pointer provenance and invalidity to carry out serious points-to and aliasing analysis.&#xA;&#xA;Yes, you can have hardware provenance.  Examples include ARM MTE, the CHERI research prototype (which last I checked had issues with C++&#39;s requirement that pointers are trivially copyable), and the venerable IBM System i.  
Conventional systems provide pointer provenance of a sort via their page tables, which is used by a variety of memory-allocation-use debuggers, for but one example, the efence library.  The pointer-provenance features of ARM MTE and IBM System i are not problematic, but last I checked, the jury was still out on CHERI.&#xA;&#xA;Of course, using invalid (AKA &#34;dangling&#34;) pointers is known to be a bad idea.  So why are we even talking about it???&#xA;&#xA;Why Would Anyone Use Invalid/Dangling Pointers?&#xA;&#xA;Please allow me to introduce you to the famous and frequently re-invented LIFO Push algorithm.  You can find this in many places, but let&#39;s focus on the Linux kernel&#39;s llist_add_batch() and llist_del_all() functions.  The former atomically pushes a list of elements on a linked-list stack, and the latter just as atomically removes the entire contents of the stack:&#xA;&#xA;static inline bool llist_add_batch(struct llist_node *new_first,&#xA;                                   struct llist_node *new_last,&#xA;                                   struct llist_head *head)&#xA;{&#xA;    struct llist_node *first = READ_ONCE(head-&gt;first);&#xA;&#xA;    do {&#xA;        new_last-&gt;next = first;&#xA;    } while (!try_cmpxchg(&amp;head-&gt;first, &amp;first, new_first));&#xA;&#xA;    return !first;&#xA;}&#xA;&#xA;static inline struct llist_node *llist_del_all(struct llist_head *head)&#xA;{&#xA;    return xchg(&amp;head-&gt;first, NULL);&#xA;}&#xA;&#xA;As lockless concurrent algorithms go, this one is pretty straightforward.  The llist_add_batch() function reads the list header, fills in the -&gt;next pointer, then does a compare-and-exchange operation to point the list header at the new first element.  The llist_del_all() function is even simpler, doing a single atomic exchange operation to NULL out the list header and returning the elements that were previously on the list.  
This algorithm also has excellent forward-progress properties: the llist_add_batch() function is lock-free and the llist_del_all() function is wait-free.&#xA;&#xA;So what is not to like?&#xA;&#xA;In assembly language, or with a simple compiler, not much.  But more heavily optimized languages have serious pointer-provenance issues with this code.  To see them, consider the following sequence of events:&#xA;&#xA;CPU 0 allocates an llist_node B and passes it via both the new_first and new_last parameters of llist_add_batch().&#xA;CPU 0 picks up the head-&gt;first pointer and places it in the first local variable, then assigns it to new_last-&gt;next.  This new_last-&gt;next pointer now references llist_node A.&#xA;CPU 1 invokes llist_del_all(), which returns a list containing llist_node A.  The caller of llist_del_all() processes A and passes it to kfree().&#xA;CPU 0&#39;s new_last-&gt;next pointer is now invalid due to llist_node A having been freed.  But CPU 0 does not know this, though a sufficiently all-knowing compiler just might.&#xA;CPU 1 allocates an llist_node C that happens to have the same address as the old llist_node A.  It passes C via both the new_first and new_last parameters of llist_add_batch(), which runs to completion.  The head pointer now points to llist_node C, which happens to have the same address as the now storage-duration-ended llist_node A.  However, the two pointers reference objects created by different memory-allocation calls, and thus have different provenance, and thus are not necessarily equal.&#xA;CPU 0 finally gets around to executing its try_cmpxchg(), which will succeed, courtesy of the fact that try_cmpxchg() compares only the bits actually represented in the pointer, and not any implicit pointer provenance (and please note that the same is true of both the C and C++ compare-and-exchange operations).  The llist now contains an llist_node B that contains an invalid pointer to dead llist_node A, but whose address happens to reference the shiny new llist_node C.  
(We term this invalid pointer a &#34;zombie pointer&#34; because it has in some assembly-language sense come back from the dead.)&#xA;Some CPU invokes llist_del_all() and gets back an llist containing an invalid -&gt;next pointer.&#xA;&#xA;One could argue that the Linux-kernel implementation of LIFO Push is simply buggy and should be fixed.  Except that there is no reasonable way to fix it.  Which of course raises the question...&#xA;&#xA;What Are Unreasonable Fixes?&#xA;&#xA;We can protect pointers from invalidity by storing them as integers, but:&#xA;&#xA;Suppose someone has an element that they are passing to a library function.  They should not be required to convert all their -&gt;next pointers to integers just because the library&#39;s developers decide to switch to the LIFO Push algorithm for some obscure internal operation.&#xA;In addition, switching to integers defeats type-checking, because integers are integers no matter what type of pointer they came from.&#xA;We could restore some type-checking capability by wrapping the integer into a differently named struct for each pointer type.  Except that this requires a struct with some particular name to be treated as compatible with pointers of some type corresponding to that name, a notion that current compilers do not support.&#xA;In C++, we could use template metaprogramming to wrap an integer into a class that converts automatically to and from compatibly typed pointers.  But there would then be windows of time in which there was a real pointer, and at that time there would still be the possibility of pointer invalidity.&#xA;All of the above hack-arounds put additional obstacles in the way of developers of concurrent software.&#xA;&#xA;Alternatively, in environments such as the Linux kernel that provide their own memory allocators, we can hide these allocators from the compiler.  
But this is not free, in fact, the patch that exposed the Linux-kernel&#39;s memory allocators to the compiler resulted in a small but significant improvement.&#xA;&#xA;However, it is fair to ask...&#xA;&#xA;Why Do We Care About Strange New Algorithms???&#xA;&#xA;Let&#39;s take a look at the history, courtesy of Maged Michael&#39;s diligent software archaeology.&#xA;&#xA;In 1986, R. K. Treiber presented an assembly language implementation of the LIFO Push algorithm in technical report RJ 5118 entitled “Systems Programming: Coping with Parallelism” while at the IBM Almaden Research Center.&#xA;&#xA;In 1975, an assembly language implementation of this same algorithm (except with pop() instead of popall(), but still having the same ABA properties) was presented in the IBM System 370 Principles of Operation as a method for managing a concurrent freelist.&#xA;&#xA;US Patent 3,886,525 was filed in June 1973, just a few months before I wrote my first line of code, and contains a prior-art reference to the LIFO Push algorithm (again with pop() instead of popall()) as follows: “Conditional swapping of a single address is sufficient to program a last-in, first-out single-user-at-a-time sequencing mechanism.”  (If you were to ask a patent attorney, you would likely be told that this 50-year-old patent has long since expired.  Which should be no surprise, given that it is even older than Dennis Ritchie&#39;s setuid Patent 4,135,240.)&#xA;&#xA;All three of these references describe LIFO push as if it was straightforward and well known.&#xA;&#xA;So we don’t know who first invented LIFO Push or when they invented it, but it was well known in 1973.  
Which is well over a decade before C was first standardized, more than two decades before C++ was first standardized, and even longer before Rust was even thought of.&#xA;&#xA;And its combination of (relative) simplicity and excellent forward-progress properties just might be why this algorithm was anonymously invented so long ago and why it is so persistently and repeatedly reinvented.  This frequent reinvention puts paid to any notion that LIFO Push is strange.&#xA;&#xA;So sorry, but LIFO Push is neither new nor strange.&#xA;&#xA;Nor is it the only situation where lifetime-end pointer zap causes problems. Please see the &#34;Zap-Susceptible Algorithms&#34; section of P1726R5 (&#34;Pointer lifetime-end zap and provenance, too&#34;) for additional use cases.&#xA;&#xA;So What Do We Do?&#xA;&#xA;The lifetime-end pointer-zap story is not yet over, and we are in fact currently pushing for the changes in four working papers.&#xA;&#xA;Nondeterministic Pointer Provenance&#xA;&#xA;P2434R4 (&#34;Nondeterministic pointer provenance&#34;) is the basis for the other three papers.  It asks that when converting a pointer to an integer and back, the implementation must choose a qualifying pointed-to object (if there is one) whose storage duration began before or concurrently with the conversion back to a pointer.  In particular, the implementation is free to ignore a qualifying pointed-to object when the conversion to pointer happens before the beginning of that object’s storage duration.&#xA;&#xA;The &#34;qualifying&#34; qualifier includes compatible type, as well as sufficiently early and long storage duration.&#xA;&#xA;But why restrict the qualifying pointed-to object&#39;s storage duration to begin before or concurrently with the conversion back to a pointer?&#xA;&#xA;An instructive example by Hans Boehm may be found in P2434R4, which shows that reasonable (and more important, very heavily used) optimizations would be invalidated by this approach.  
Several examples that manage to be even more sobering may be found in David Goldblatt&#39;s P3292R0 (&#34;Provenance and Concurrency&#34;).&#xA;&#xA;Pointer Lifetime-End Zap Proposed Solutions: Atomics and Volatile&#xA;&#xA;P2414R10 (&#34;Pointer lifetime-end zap proposed solutions: Atomics and volatile&#34;) is motivated by the observation that atomic pointers are subject to update at any time by any thread, which means that the compiler cannot reasonably do much in the way of optimization.  This paper therefore asks (1) that atomic operations be redefined to yield and to store prospective pointer values and (2) that operations on volatile pointers be defined to yield and to store prospective pointer values.  The effect is as if atomic pointers were stored internally as integers.  This includes the “old” pointer passed by reference to compare_exchange().&#xA;&#xA;This helps, but is not a full solution because atomic pointers are converted to non-atomic pointers prior to use, at which point they are subject to lifetime-end pointer zap.  And the standard does not even guarantee that a zapped pointer can even be loaded, stored, passed to a function, or returned from a function.  Which brings us to the next paper.&#xA;&#xA;Pointer Lifetime-End Zap Proposed Solutions: Tighten IDB for Invalid Pointers&#xA;&#xA;P3347R4 (&#34;Pointer lifetime-end zap proposed solutions: Tighten IDB for invalid pointers&#34;) therefore asks that all non-comparison non-arithmetic non-dereference computations involving pointers, specifically including normal loads and stores, be fully defined even if the pointers are invalid.  This permits invalid pointers to be loaded, stored, passed as arguments, and returned.  Fully defining comparisons would rule out optimizations, and fully defining arithmetic would be complex and thus far unneeded.  
Fully defining dereferencing of invalid pointers would of course be problematic.&#xA;&#xA;If these first three papers are accepted into the standard, the C++ implementation of LIFO Push shown above becomes valid code.  This is important because this algorithm has been re-invented many times over the past half century, and is often open coded.  This frequent open coding makes it infeasible to construct tools that find LIFO Push implementations in existing code.&#xA;&#xA;P3790R1: Pointer Lifetime-End Zap Proposed Solutions: Bag-of-Bits Pointer Class&#xA;&#xA;P3790R1 (&#34;Pointer lifetime-end zap proposed solutions: Bag-of-bits pointer class&#34;) asks for (1) the addition to the C++ standard library of the function launder_ptr_bits() that takes a pointer argument and returns a prospective pointer value corresponding to its argument; and (2) the addition to the C++ standard library of the class template std::ptr_bits&lt;T&gt; that is a pointer-like type that is still usable after the pointed-to object’s lifetime has ended.  Of course, such a pointer still cannot be dereferenced unless there is a live object at that pointer&#39;s address.  Furthermore, some systems, such as ARMv9 with memory tagging extensions (MTE) enabled, have provenance as well as address bits in the pointer, and on such systems dereferencing will fail unless the pointer&#39;s provenance bits happen to match those of the pointed-to object.&#xA;&#xA;This function and template class are nevertheless quite useful; for example, they may be used to maintain hash maps keyed by pointers after the pointed-to object&#39;s lifetime has ended.  These can be extremely useful for debugging, especially in cases where the overhead of full-up address sanitizers cannot be tolerated.&#xA;&#xA;Unlike LIFO Push, source-code changes are required for these use cases.  
This is unfortunate, but we have thus far been unable to come up with a same-source-code approach.&#xA;&#xA;Those who have participated in standards work (or even open-source work) will understand that the names launder_ptr_bits() and std::ptr_bits&lt;T&gt; just might still be subject to bikeshedding.&#xA;&#xA;A Happy Lifetime-End Pointer Zap Ending?&#xA;&#xA;It is still too early to say for certain, but thus far these proposals are making much better progress than did their predecessors.  So who knows?  Perhaps C++29 will address lifetime-end pointer zap.]]&gt;</description>
      <content:encoded><![CDATA[<p>TL;DR: Unless you are doing very strange things with RCU (<a href="https://en.wikipedia.org/wiki/Read-copy-update" rel="nofollow">read-copy update</a>), not much!!!</p>

<p>So why has the guy most responsible for Linux-kernel RCU spent so much time over the past five years working on the provenance-related lifetime-end pointer-zap problem within the C++ Standards Committee?</p>

<p>But first...</p>

<h2 id="what-is-pointer-provenance">What is Pointer Provenance?</h2>

<p>Back in the old days, provenance was for objets d&#39;art and the like, and we did not need it for our pointers, no sirree!!!  Pointers had bits, those bits formed memory addresses, and as often as not we didn&#39;t even need to worry about these addresses being translated.  But life is more complicated now.  On the other hand, computing life is also much bigger, faster, more reliable, and (usually) more productive, so be extremely careful what you wish for from back in the Good Old Days!</p>

<p>These days, pointers have provenance as well as addresses, and this has consequences.  The C++ Standard  (<a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/n5008.pdf" rel="nofollow">recent draft</a>) states that when an object&#39;s storage duration ends, any pointers to that object become invalid.  For its part, the C Standard states that when an object&#39;s storage duration ends, any pointers to that object become indeterminate.  In both standards, the wording is more precise, but this will serve for our purposes.</p>

<p>For the remainder of this document, we will follow C++ and say “invalid”, which is shorter than “indeterminate”.  We will balance this out by using C-language example code.  Those preferring C++ will be happy to hear that this is the language that I use in my <a href="https://cppcon2025.sched.com/event/27bR6/interesting-upcoming-low-latency-concurrency-and-parallelism-features-from-wroclaw-2024-hagenberg-2025-and-sofia-2025" rel="nofollow">upcoming CPPCON presentation</a>.</p>

<p>Neither standard places any constraints on what a compiler can do with an invalid pointer value, even if all you are doing is loading or storing that value.</p>

<p>Those of us who cut our teeth on assembly language might quite reasonably ask why anyone would even think to make pointers so grossly invalid that you cannot even load or store them.  To see the historical reasons, let&#39;s start by looking at pointer comparisons using this code fragment:</p>

<pre><code class="language-text">p = kmalloc(...);
might_kfree(p);         // Pointer might become invalid (AKA &#34;zapped&#34;)
q = kmalloc(...);       // Assume that the addresses of p and q are equal.
if (p == q)             // Compiler can optimize as &#34;if (false)&#34;!!!
    do_something();
</code></pre>

<p>Both <code>p</code> and <code>q</code> contain addresses, but the compiler also keeps track of the fact that their values were obtained from different invocations of <code>kmalloc()</code>.  This information forms part of each pointer&#39;s provenance.  This means that <code>p</code> and <code>q</code> have different provenance, which in turn means that the compiler does not need to generate any code for the <code>p == q</code> comparison.  The two pointers&#39; provenance differs, so no matter what the addresses might be, the result cannot be anything other than <code>false</code>.</p>

<p>And this is one motivation for pointer provenance and invalidity:  The results of operations on invalid pointers are not guaranteed, which provides additional opportunities for optimization.  This example perhaps seems a bit silly, but modern compilers can use pointer provenance and invalidity to carry out serious points-to and aliasing analysis.</p>
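
<p>For readers who want to experiment, here is a minimal user-space sketch of a provenance-safe way to compare an old address against a new pointer.  This sketch is my own (using <code>malloc()</code>/<code>free()</code> rather than <code>kmalloc()</code>/<code>kfree()</code>): the address is snapshotted as an integer before the free, so no invalid pointer value is ever compared:</p>

<pre><code class="language-text">#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;

int main(void)
{
    char *p = malloc(1);
    uintptr_t p_bits = (uintptr_t)p; // Snapshot the address while p is valid.

    free(p);                         // p is now invalid (AKA zapped).
    char *q = malloc(1);

    // Writing if (p == q) here invites the compiler to fold the test to
    // false on provenance grounds.  Comparing the integer snapshot instead
    // is fully defined, whatever the allocator chose to do.
    int same = (p_bits == (uintptr_t)q);
    assert(same == 0 || same == 1);  // Either outcome is legitimate.

    free(q);
    return 0;
}
</code></pre>

<p>Whether the allocator actually reuses the address is implementation-dependent, which is exactly why the result of the raw pointer comparison cannot be relied upon.</p>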

<p>Yes, you can have hardware provenance.  Examples include ARM MTE, the CHERI research prototype (which last I checked had issues with C++&#39;s requirement that pointers be trivially copyable), and the venerable IBM System i.  Conventional systems provide pointer provenance of a sort via their page tables, which is used by a variety of memory-allocation-use debuggers, for but one example, the efence library.  The pointer-provenance features of ARM MTE and IBM System i are not problematic, but last I checked, the jury was still out on CHERI.</p>

<p>Of course, using invalid (AKA “dangling”) pointers is known to be a bad idea.  So why are we even talking about it???</p>

<h2 id="why-would-anyone-use-invalid-dangling-pointers">Why Would Anyone Use Invalid/Dangling Pointers?</h2>

<p>Please allow me to introduce you to the famous and frequently re-invented LIFO Push algorithm.  You can find this in many places, but let&#39;s focus on the Linux kernel&#39;s <code>llist_add_batch()</code> and <code>llist_del_all()</code> functions.  The former atomically pushes a list of elements on a linked-list stack, and the latter just as atomically removes the entire contents of the stack:</p>

<pre><code class="language-text">static inline bool llist_add_batch(struct llist_node *new_first,
                                   struct llist_node *new_last,
                                   struct llist_head *head)
{
    struct llist_node *first = READ_ONCE(head-&gt;first);

    do {
        new_last-&gt;next = first;
    } while (!try_cmpxchg(&amp;head-&gt;first, &amp;first, new_first));

    return !first;
}

static inline struct llist_node *llist_del_all(struct llist_head *head)
{
    return xchg(&amp;head-&gt;first, NULL);
}
</code></pre>

<p>As lockless concurrent algorithms go, this one is pretty straightforward.  The <code>llist_add_batch()</code> function reads the list header, fills in the <code>-&gt;next</code> pointer, then does a compare-and-exchange operation to point the list header at the new first element.  The <code>llist_del_all()</code> function is even simpler, doing a single atomic exchange operation to <code>NULL</code> out the list header and returning the elements that were previously on the list.  This algorithm also has excellent forward-progress properties: the <code>llist_add_batch()</code> function is lock-free and the <code>llist_del_all()</code> function is wait-free.</p>
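
<p>As a cross-check of the description above, the same shape can be written in portable user-space C11.  The following single-threaded sketch is mine, not the kernel&#39;s: the names <code>node</code>, <code>head</code>, <code>add_batch()</code>, and <code>del_all()</code> are invented here, and <code>READ_ONCE()</code>, <code>try_cmpxchg()</code>, and <code>xchg()</code> are replaced by their C11 atomic equivalents:</p>

<pre><code class="language-text">#include &lt;assert.h&gt;
#include &lt;stdatomic.h&gt;
#include &lt;stddef.h&gt;

struct node { struct node *next; int val; };
struct head { _Atomic(struct node *) first; };

// Lock-free batch push, mirroring llist_add_batch().
static int add_batch(struct node *new_first, struct node *new_last,
                     struct head *h)
{
    struct node *first = atomic_load(&amp;h-&gt;first);

    do {
        new_last-&gt;next = first;
    } while (!atomic_compare_exchange_weak(&amp;h-&gt;first, &amp;first, new_first));

    return first == NULL;  // True if the list was previously empty.
}

// Wait-free delete-all, mirroring llist_del_all().
static struct node *del_all(struct head *h)
{
    return atomic_exchange(&amp;h-&gt;first, NULL);
}

int main(void)
{
    struct head h = { NULL };
    struct node a = { .val = 1 }, b = { .val = 2 };

    assert(add_batch(&amp;a, &amp;a, &amp;h));   // List was empty.
    assert(!add_batch(&amp;b, &amp;b, &amp;h));  // List was not empty.

    struct node *list = del_all(&amp;h);
    assert(list == &amp;b);              // LIFO order: last push comes back first.
    assert(list-&gt;next == &amp;a);
    assert(del_all(&amp;h) == NULL);
    return 0;
}
</code></pre>

<p>Of course, this sketch inherits the very hazard this article is about: once nodes are freed and their memory reused by concurrent threads, the zombie-pointer scenario applies to it just as it does to the kernel original.</p>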

<p>So what is not to like?</p>

<p>In assembly language, or with a simple compiler, not much.  But more heavily optimized languages have serious pointer-provenance issues with this code.  To see them, consider the following sequence of events:</p>
<ol><li>CPU 0 allocates an <code>llist_node</code> B and passes it via both the <code>new_first</code> and <code>new_last</code> parameters of <code>llist_add_batch()</code>.</li>
<li>CPU 0 picks up the <code>head-&gt;first</code> pointer and places it in the <code>first</code> local variable, then assigns it to <code>new_last-&gt;next</code>.  This <code>new_last-&gt;next</code> pointer now references <code>llist_node</code> A.</li>
<li>CPU 1 invokes <code>llist_del_all()</code>, which returns a list containing <code>llist_node</code> A.  The caller of <code>llist_del_all()</code> processes A and passes it to <code>kfree()</code>.</li>
<li>CPU 0&#39;s <code>new_last-&gt;next</code> pointer is now invalid due to <code>llist_node</code> A having been freed.  But CPU 0 does not know this, though a sufficiently all-knowing compiler just might.</li>
<li>CPU 1 allocates an <code>llist_node</code> C that happens to have the same address as the old <code>llist_node</code> A.  It passes C  via both the <code>new_first</code> and <code>new_last</code> parameters of <code>llist_add_batch()</code>, which runs to completion.  The <code>head</code> pointer now points to <code>llist_node</code> C, which happens to have the same address as the now storage-duration-ended <code>llist_node</code> A.   However, the two pointers reference objects created by different memory-allocation calls, and thus have different provenance, and thus are not necessarily equal.</li>
<li>CPU 0 finally gets around to executing its <code>try_cmpxchg()</code>, which will succeed, courtesy of the fact that <code>try_cmpxchg()</code> compares only the bits actually represented in the pointer, and not any implicit pointer provenance (and please note that the same is true of both the C and C++ compare-and-exchange operations).  The <code>llist</code> now contains an <code>llist_node</code> B that contains an invalid pointer to dead <code>llist_node</code> A, but whose address happens to reference the shiny new <code>llist_node</code> C.  (We term this invalid pointer a “zombie pointer” because it has in some assembly-language sense come back from the dead.)</li>
<li>Some CPU invokes <code>llist_del_all()</code> and gets back an <code>llist</code> containing an invalid <code>-&gt;next</code> pointer.</li></ol>

<p>One could argue that the Linux-kernel implementation of LIFO Push is simply buggy and should be fixed.  Except that there is no reasonable way to fix it.  Which of course raises the question...</p>

<h2 id="what-are-unreasonable-fixes">What Are Unreasonable Fixes?</h2>

<p>We can protect pointers from invalidity by storing them as integers, but:</p>
<ol><li>Suppose someone has an element that they are passing to a library function.  They should not be required to convert all their <code>-&gt;next</code> pointers to integer just because the library&#39;s developers decide to switch to the LIFO Push algorithm for some obscure internal operation.</li>
<li>In addition, switching to integer defeats type-checking, because integers are integers no matter what type of pointer they came from.</li>
<li>We could restore some type-checking capability by wrapping the integer into a differently named struct for each pointer type.  Except that this requires a struct with some particular name to be treated as compatible with pointers of some type corresponding to that name, a notion that current compilers do not support.</li>
<li>In C++, we could use template metaprogramming to wrap an integer into a class that converts automatically to and from compatibly typed pointers.  But there would then be windows of time in which there was a real pointer, and at that time there would still be the possibility of pointer invalidity.</li>
<li>All of the above hack-arounds put additional obstacles in the way of developers of concurrent software.</li></ol>
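
<p>To make the first two hack-arounds concrete, here is a hypothetical sketch (the <code>int_link</code> type and its field name are mine, not from any real codebase) of a link field stored as an integer, which also shows the type-checking loss: any pointer type produces the same <code>uintptr_t</code>:</p>

<pre><code class="language-text">#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;

// Hypothetical hack-around: the link field holds an integer, so the
// compiler never sees a (possibly invalid) pointer value in the node.
struct int_link { uintptr_t next_bits; };

int main(void)
{
    int x = 42;
    double y = 2.5;
    struct int_link n;

    // Type-checking is gone: both assignments are equally acceptable.
    n.next_bits = (uintptr_t)&amp;x;
    n.next_bits = (uintptr_t)&amp;y;

    // Converting back yields a usable pointer while the object is live...
    double *p = (double *)n.next_bits;
    assert(p == &amp;y);
    assert(*p == 2.5);  // ...but provenance questions return at this point.
    return 0;
}
</code></pre>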

<p>Alternatively, in environments such as the Linux kernel that provide their own memory allocators, we can hide those allocators from the compiler.  But this is not free; in fact, the patch that exposed the Linux-kernel&#39;s memory allocators to the compiler resulted in a small but significant improvement.</p>

<p>However, it is fair to ask...</p>

<h2 id="why-do-we-care-about-strange-new-algorithms">Why Do We Care About Strange New Algorithms???</h2>

<p>Let&#39;s take a look at the history, courtesy of Maged Michael&#39;s diligent software archaeology.</p>

<p>In 1986, R. K. Treiber presented an assembly language implementation of the LIFO Push algorithm in technical report RJ 5118 entitled “Systems Programming: Coping with Parallelism” while at the IBM Almaden Research Center.</p>

<p>In 1975, an <a href="https://x.com/MagedMMichael/status/1946916675355398596/photo/1" rel="nofollow">assembly language implementation of this same algorithm</a> (except with pop() instead of popall(), but still having the same ABA properties) was presented in the IBM System 370 Principles of Operation as a method for managing a concurrent freelist.</p>

<p><a href="https://patents.google.com/patent/US3886525" rel="nofollow">US Patent 3,886,525</a> was filed in June 1973, just a few months before I wrote my first line of code, and contains a prior-art reference to the LIFO Push algorithm (again with pop() instead of popall()) as follows: “Conditional swapping of a single address is sufficient to program a last-in, first-out single-user-at-a-time sequencing mechanism.”  (If you were to ask a patent attorney, you would likely be told that this 50-year-old patent has long since expired.  Which should be no surprise, given that it is even older than Dennis Ritchie&#39;s setuid <a href="https://patents.google.com/patent/US4135240A/en" rel="nofollow">Patent 4,135,240</a>.)</p>

<p>All three of these references describe LIFO push as if it were straightforward and well known.</p>

<p>So we don’t know who first invented LIFO Push or when they invented it, but it was well known in 1973.  Which is well over a decade before C was first standardized, more than two decades before C++ was first standardized, and even longer before Rust was even thought of.</p>

<p>And its combination of (relative) simplicity and excellent forward-progress properties just might be why this algorithm was anonymously invented so long ago and why it is so persistently and repeatedly reinvented.  This frequent reinvention puts paid to any notion that LIFO Push is strange.</p>

<p>So sorry, but LIFO Push is neither new nor strange.</p>

<p>Nor is it the only situation where lifetime-end pointer zap causes problems. Please see the “Zap-Susceptible Algorithms” section of <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1726r5.pdf" rel="nofollow">P1726R5 (“Pointer lifetime-end zap and provenance, too”)</a> for additional use cases.</p>

<h2 id="so-what-do-we-do">So What Do We Do?</h2>

<p>The lifetime-end pointer-zap story is not yet over, and we are in fact currently pushing for the changes in four working papers.</p>

<h3 id="nondeterministic-pointer-provenance">Nondeterministic Pointer Provenance</h3>

<p><a href="https://isocpp.org/files/papers/P2434R4.html" rel="nofollow">P2434R4 (“Nondeterministic pointer provenance”)</a> is the basis for the other three papers.  It asks that when converting a pointer to an integer and back, the implementation must choose a qualifying pointed-to object (if there is one) whose storage duration began before or concurrently with the conversion back to a pointer.  In particular, the implementation is free to ignore a qualifying pointed-to object when the conversion to pointer happens before the beginning of that object’s storage duration.</p>

<p>The “qualifying” qualifier includes compatible type, as well as sufficiently early and long storage duration.</p>

<p>But why restrict the qualifying pointed-to object&#39;s storage duration to begin before or concurrently with the conversion back to a pointer?</p>

<p>An instructive example by Hans Boehm may be found in P2434R4, which shows that reasonable (and more important, very heavily used) optimizations would be invalidated by this approach.  Several examples that manage to be even more sobering may be found in David Goldblatt&#39;s <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3292r0.html" rel="nofollow">P3292R0 (“Provenance and Concurrency”)</a>.</p>

<h3 id="pointer-lifetime-end-zap-proposed-solutions-atomics-and-volatile">Pointer Lifetime-End Zap Proposed Solutions: Atomics and Volatile</h3>

<p><a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p2414r10.pdf" rel="nofollow">P2414R10 (“Pointer lifetime-end zap proposed solutions: Atomics and volatile”)</a> is motivated by the observation that atomic pointers are subject to update at any time by any thread, which means that the compiler cannot reasonably do much in the way of optimization.  This paper therefore asks (1) that atomic operations be redefined to yield and to store prospective pointer values and (2) that operations on volatile pointers be defined to yield and to store prospective pointer values.  The effect is as if atomic pointers were stored internally as integers.  This includes the “old” pointer passed by reference to <code>compare_exchange()</code>.</p>

<p>This helps, but is not a full solution because atomic pointers are converted to non-atomic pointers prior to use, at which point they are subject to lifetime-end pointer zap.  And the standard does not even guarantee that a zapped pointer can even be loaded, stored, passed to a function, or returned from a function.  Which brings us to the next paper.</p>

<h3 id="pointer-lifetime-end-zap-proposed-solutions-tighten-idb-for-invalid-pointers">Pointer Lifetime-End Zap Proposed Solutions: Tighten IDB for Invalid Pointers</h3>

<p><a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3347r4.pdf" rel="nofollow">P3347R4 (“Pointer lifetime-end zap proposed solutions: Tighten IDB for invalid pointers”)</a> therefore asks that all non-comparison non-arithmetic non-dereference computations involving pointers, specifically including normal loads and stores, be fully defined even if the pointers are invalid.  This permits invalid pointers to be loaded, stored, passed as arguments, and returned.  Fully defining comparisons would rule out optimizations, and fully defining arithmetic would be complex and thus far unneeded.  Fully defining dereferencing of invalid pointers would of course be problematic.</p>

<p>If these first three papers are accepted into the standard, the C++ implementation of LIFO Push shown above becomes valid code.  This is important because this algorithm has been re-invented many times over the past half century, and is often open coded.  This frequent open coding makes it infeasible to construct tools that find LIFO Push implementations in existing code.</p>

<h3 id="p3790r1-pointer-lifetime-end-zap-proposed-solutions-bag-of-bits-pointer-class">P3790R1: Pointer Lifetime-End Zap Proposed Solutions: Bag-of-Bits Pointer Class</h3>

<p><a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3790r1.pdf" rel="nofollow">P3790R1 (“Pointer lifetime-end zap proposed solutions: Bag-of-bits pointer class”)</a> asks for (1) the addition to the C++ standard library of the function <code>launder_ptr_bits()</code> that takes a pointer argument and returns a prospective pointer value corresponding to its argument; and (2) the addition to the C++ standard library of the class template <code>std::ptr_bits&lt;T&gt;</code> that is a pointer-like type that is still usable after the pointed-to object’s lifetime has ended.  Of course, such a pointer still cannot be dereferenced unless there is a live object at that pointer&#39;s address.  Furthermore, some systems, such as ARMv9 with memory tagging extensions (MTE) enabled, have provenance as well as address bits in the pointer, and on such systems dereferencing will fail unless the pointer&#39;s provenance bits happen to match those of the pointed-to object.</p>

<p>This function and template class are nevertheless quite useful; for example, they may be used to maintain hash maps keyed by pointers after the pointed-to object&#39;s lifetime has ended.  These can be extremely useful for debugging, especially in cases where the overhead of full-up address sanitizers cannot be tolerated.</p>
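
<p>A plain-C approximation of the bag-of-bits idea (this sketch and its <code>record_key()</code> helper are mine, not part of any proposal) is to key such a map by an integer captured while the object was live, so that the key remains meaningful after <code>free()</code>:</p>

<pre><code class="language-text">#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;

// Record an allocation as plain bits suitable for use as a map key.
static uintptr_t record_key(void *obj)
{
    return (uintptr_t)obj;  // A bag of bits, not a pointer.
}

int main(void)
{
    char *obj = malloc(32);

    assert(obj != NULL);
    uintptr_t key = record_key(obj);  // Capture while obj is live.

    free(obj);  // The object&#39;s lifetime ends here...

    // ...but the key is an integer, so continuing to load, store, hash,
    // and compare it remains fully defined, for example in a post-mortem
    // debugging aid keyed by former object addresses.
    assert(key != 0);
    return 0;
}
</code></pre>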

<p>Unlike LIFO Push, source-code changes are required for these use cases.  This is unfortunate, but we have thus far been unable to come up with a same-source-code approach.</p>

<p>Those who have participated in standards work (or even open-source work) will understand that the names <code>launder_ptr_bits()</code> and <code>std::ptr_bits&lt;T&gt;</code> just might still be subject to bikeshedding.</p>

<h2 id="a-happen-lifetime-end-pointer-zap-ending">A Happy Lifetime-End Pointer Zap Ending?</h2>

<p>It is still too early to say for certain, but thus far these proposals are making much better progress than did their predecessors.  So who knows?  Perhaps C++29 will address lifetime-end pointer zap.</p>
]]></content:encoded>
      <author>paulmck</author>
      <guid>https://people.kernel.org/read/a/cc7qk9g0eb</guid>
      <pubDate>Wed, 06 Aug 2025 23:34:12 +0000</pubDate>
    </item>
    <item>
      <title>Speaking at Kernel Recipes: Short Notice</title>
      <link>https://people.kernel.org/paulmck/speaking-at-kernel-recipes-short-notice</link>
      <description>&lt;![CDATA[This is part of the Kernel Recipes 2025 blog series.&#xA;&#xA;The other posts in this series help with small improvements over a long time.  But what do you do if you only have a few weeks until your presentation?  Yes, it is best to avoid procrastination, but sometimes you simply don&#39;t have all that much notice.&#xA;&#xA;First, have a very clear picture of what you want the audience to gain from your presentation.  A carefully chosen and tight focus will save you time that might otherwise have been wasted on irrelevant details.&#xA;&#xA;Second, do dry-run presentations, preferably to people who won&#39;t be shy about giving you honest feedback.  If your dry-run audience has shy people, you can ask them questions to see if they picked up on the key points of your presentation.  If you cannot scare up a human audience on short notice, record your presentation (on your smartphone if nothing else) and review it.  In the old pre-smartphone days, we would do our audience-free dry runs in front of a mirror, which can still be useful, for example, if your smartphone&#39;s battery is empty.&#xA;&#xA;Third, repeat the important portions of your presentation, which usually includes the opening, the conclusion, and any surprise &#34;reveals&#34; in the middle of the presentation.  If it is an important presentation (but aren&#39;t they all?), do about 20 repetitions of the important portions.  If it is an extremely important presentation, dry-run the entire presentation about 20 times.  Yes, this can take time, but on the other hand, most of my extremely important presentations were quite short, on the order of 3-5 minutes.&#xA;&#xA;Fourth and finally, get a good night&#39;s sleep before the day of the presentation.]]&gt;</description>
      <content:encoded><![CDATA[<p>This is part of the <a href="https://people.kernel.org/paulmck/kernel-recipes-2025" rel="nofollow">Kernel Recipes 2025</a> blog series.</p>

<p>The other posts in this series help with small improvements over a long time.  But what do you do if you only have a few weeks until your presentation?  Yes, it is best to avoid procrastination, but sometimes you simply don&#39;t have all that much notice.</p>

<p>First, have a <a href="https://people.kernel.org/paulmck/speaking-at-kernel-recipes-know-your-destination" rel="nofollow">very clear picture</a> of what you want the audience to gain from your presentation.  A carefully chosen and tight focus will save you time that might otherwise have been wasted on irrelevant details.</p>

<p>Second, do dry-run presentations, preferably to people who won&#39;t be shy about giving you honest feedback.  If your dry-run audience has shy people, you can ask them questions to see if they picked up on the key points of your presentation.  If you cannot scare up a human audience on short notice, record your presentation (on your smartphone if nothing else) and review it.  In the old pre-smartphone days, we would do our audience-free dry runs in front of a mirror, which can still be useful, for example, if your smartphone&#39;s battery is empty.</p>

<p>Third, repeat the important portions of your presentation, which usually includes the opening, the conclusion, and any surprise “reveals” in the middle of the presentation.  If it is an important presentation (but aren&#39;t they all?), do about 20 repetitions of the important portions.  If it is an extremely important presentation, dry-run the entire presentation about 20 times.  Yes, this can take time, but on the other hand, most of my extremely important presentations were quite short, on the order of 3-5 minutes.</p>

<p>Fourth and finally, get a good night&#39;s sleep before the day of the presentation.</p>
]]></content:encoded>
      <author>paulmck</author>
      <guid>https://people.kernel.org/read/a/k9eudrvv3u</guid>
      <pubDate>Thu, 06 Mar 2025 19:51:32 +0000</pubDate>
    </item>
    <item>
      <title>Speaking at Kernel Recipes: My Twisted Path</title>
      <link>https://people.kernel.org/paulmck/speaking-at-kernel-recipes-my-twisted-path</link>
      <description>&lt;![CDATA[This is part of the Kernel Recipes 2025 blog series.&#xA;&#xA;I have been consciously working on speaking skills for more than half a century.  This section lists a few of the experiences along the way.  My hope is that this motivates you to take the easier and faster approaches laid out in the rest of this blog series.&#xA;&#xA;Comic Relief&#xA;&#xA;A now-disgraced comedian who was immensely popular in the 1960s was said to have learned his craft at school.  They said that he discovered that if he could make the schoolyard bullies laugh, they would often forget about roughing him up.  I tried the same approach, though with just barely enough success to persist.  Part of my problem was that I spent most of my time focusing on academic skills, which certainly proved to be a wise choice longer term, but did limit the time available to improve my comedic capabilities.  I was also limited by my not-so-wise insistence on taking myself too seriously.  Choices, choices!&#xA;&#xA;My classmates often told very funny jokes, and I firmly believed that making up jokes was a cognitive skill, and I just as firmly believed (and with some reason) that I was a cognitive standout.  If they could do it, so could I!!!&#xA;&#xA;But for a very long time, my jokes were extremely weak compared to theirs.&#xA;&#xA;Until one day, I told a joke that everyone laughed at.  Hard.  For a long time.  (And no, I do not remember that joke, but then again, it was a joke targeted towards seventh graders and you most likely are not in seventh grade.)&#xA;&#xA;Once they recovered, one of them asked “What show did you see that on?”&#xA;&#xA;Suddenly the awful truth dawned on me.  My classmates were not making up these jokes.  They were seeing them on television, and rushing to be the first to repeat them the next day.  Why was this not obvious to me?  Because my family did not have a television.&#xA;&#xA;My surprise did not prevent me from replying “The Blank Wall”.  
Which was the honest truth: I had in fact been staring at a blank wall the previous evening while composing my first successful joke.&#xA;&#xA;The next day, my classmates asked me what channel “The Blank Wall” was on.  I of course gave evasive answers, but in a few minutes they figured out that I meant a literal blank wall.  They were not impressed with my attitude.  You saw jokes on television, after all, and no one in their right mind would even try to make one up!&#xA;&#xA;I also did some grade-school acting, though my big role was Jonathan Brewster in a seventh-grade production of “Arsenic and Old Lace” rather than anything comedic.  The need to work prevented my acting in any high-school plays, though to be fair it is not clear that my acting abilities would have kept up with those of my classmates in any case.&#xA;&#xA;Besides, those working in retail can attest that carefully deployed humor can be extremely useful.  So my high-school grocery-store job likely provided me with more and better experience than the high-school plays could possibly have done.  At least that is what I keep telling myself!&#xA;&#xA;Speech Team&#xA;&#xA;For reasons that were never quite clear to me, the high-school speech-team coach asked me to try out.  I probably would have ignored her, but I well recalled my father telling me that those who have nothing to say, but can say it well, will often do better than those who have something to say but cannot say it.  So, against my better 13-year-old judgment, I signed up.&#xA;&#xA;I did quite well in extemporaneous speech during my first year due to my relatively deep understanding of the science behind the hot topic of that time, namely the energy crisis.  During later years, the hot topics reverted to the usual political and evening-news fare, so the remaining three years were good practice, but did not result in wins.  
Until the end of my senior year, when the coach suggested that I try radio commentary, which had the great advantage of hiding my horribly geeky teenaged face from the judges.  I did quite well, qualifying for district-level competition on the strength of my first-ever radio-commentary speech.&#xA;&#xA;But I can only be thankful that my 17-year-old self decided to go to an engineering university as opposed to seeking employment at a local radio station.&#xA;&#xA;University Coursework&#xA;&#xA;I tested out of Freshman English Composition, but I did take a couple of courses on technical writing and technical presentation.  A ca. 1980 mechanical-engineering presentation on ground-loop heat pumps featured my first use of cartoons in a technical presentation, courtesy of a teammate who knew a professional cartoonist.  The four of us were quite proud of having kept the class’s attention during the full duration of our talk, which took place only a few days before the start of Christmas holidays.&#xA;&#xA;1980s and 1990s Presentations&#xA;&#xA;I did impromptu work-related presentations for my contract-programming work in the early 1980s.  In the late 1980s, I joined a research institute where I was expected to do formal presentations, including at academic venues.  I joined a startup in 1990, where I continued academic presentations, but focused mainly on internal training presentations.&#xA;&#xA;Toastmasters&#xA;&#xA;I became a founding member of a local Toastmasters club in 1993, and during the next seven years received CTM (“Competent Toastmaster”) and ATM (“Advanced Toastmaster”) certifications.  There is very likely a Toastmasters club near you, and you can search here: https://www.toastmasters.org/.&#xA;&#xA;The purpose of Toastmasters is to help people develop public-speaking skills in a friendly environment.  The members of the club help each other, evaluating each others’ short speeches and providing topics for even shorter impromptu speeches.  
The CTM and ATM certifications each have a manual that guides the member through a series of different types of speeches.  For example, the 1990s CTM manual starts with a 4-6-minute speech in which the member introduces themselves.  This has the benefit of ensuring that the speaker is expert on the topic, though I have come across an amnesiac who was an exception that proves this rule.&#xA;&#xA;For me, the best of Toastmasters was “table topics”, in which someone is designated to bring a topic to the next meeting.  The topic is called out, and people are expected to volunteer to give a short speech (a minute or two) on that topic.  This is excellent preparation for those times when someone calls you out during a meeting.&#xA;&#xA;Benchmarking&#xA;&#xA;By the year 2000, I felt very good about my speaking ability.  I was aware of some shortcomings, for example, I had difficulty with audiences larger than about 100 people, but was doing quite well, both in my own estimation and that of others.  In short, it was time to benchmark myself against a professional speaker.&#xA;&#xA;In that year, I attended an event whose keynote was given by none other than one of the least articulate of the US Presidents, George H. W. Bush.  Now, Bush’s speaking abilities might have been unfairly compared to the larger-than-life capabilities of his predecessor (Ronald Reagan, AKA “The Great Communicator”) and his successor (Bill Clinton, whose command of people skills is the stuff of legends).  In contrast, here is Ann Richards’s assessment of Bush’s skills: “born with a silver foot in his mouth”.&#xA;&#xA;As noted above, I had just completed seven years in Toastmasters, so I was more than ready to do a Toastmasters-style evaluation of Bush’s keynote.  
I would record all the defects in this speech and email it to my Toastmasters group for their amusement.&#xA;&#xA;Except that it didn’t turn out that way.&#xA;&#xA;Bush gave a one-hour speech during which he did everything that I knew how to do, and did it effortlessly.  Not only that, there were instances where he clearly expected a reaction from the audience, and got that reaction.  I was watching him like a hawk the whole time and had absolutely no idea how he had made it happen.&#xA;&#xA;Bush might well have been the most inarticulate of the US Presidents, but he was incomparably better than this software developer will ever be.&#xA;&#xA;But that does not mean that I cannot continue to improve.  In fact, I can now do a better job of presenting than Bush can.  Not just due to my having spent the intervening decades practicing (practice makes perfect!), but mostly due to the fact that Bush has since passed away.&#xA;&#xA;Linux Community&#xA;&#xA;I joined the Linux community in 2001, where I faced large and diverse audiences.  It quickly became obvious that I needed to apply my youthful Warner Brothers lessons, especially given that I was presenting things like RCU to audiences that were mostly innocent of any knowledge of or experience in concurrency.&#xA;&#xA;This experience also gave me much-needed practice dealing with larger audiences, in a few cases, on the order of 1,000.&#xA;&#xA;So I continue to improve, but there is much more for me to learn.]]&gt;</description>
      <content:encoded><![CDATA[<p>This is part of the <a href="https://people.kernel.org/paulmck/kernel-recipes-2025" rel="nofollow">Kernel Recipes 2025</a> blog series.</p>

<p>I have been consciously working on speaking skills for more than half a century.  This section lists a few of the experiences along the way.  My hope is that this motivates you to take the easier and faster approaches laid out in the rest of this blog series.</p>

<h2 id="comic-relief">Comic Relief</h2>

<p>A now-disgraced comedian who was immensely popular in the 1960s was said to have learned his craft at school.  They said that he discovered that if he could make the schoolyard bullies laugh, they would often forget about roughing him up.  I tried the same approach, though with just barely enough success to persist.  Part of my problem was that I spent most of my time focusing on academic skills, which certainly proved to be a wise choice longer term, but did limit the time available to improve my comedic capabilities.  I was also limited by my not-so-wise insistence on taking myself too seriously.  Choices, choices!</p>

<p>My classmates often told very funny jokes, and I firmly believed that making up jokes was a cognitive skill, and I just as firmly believed (and with some reason) that I was a cognitive standout.  If they could do it, so could I!!!</p>

<p>But for a very long time, my jokes were extremely weak compared to theirs.</p>

<p>Until one day, I told a joke that everyone laughed at.  Hard.  For a long time.  (And no, I do not remember that joke, but then again, it was a joke targeted towards seventh graders and you most likely are not in seventh grade.)</p>

<p>Once they recovered, one of them asked “What show did you see that on?”</p>

<p>Suddenly the awful truth dawned on me.  My classmates were not making up these jokes.  They were seeing them on television, and rushing to be the first to repeat them the next day.  Why was this not obvious to me?  Because my family did not have a television.</p>

<p>My surprise did not prevent me from replying “The Blank Wall”.  Which was the honest truth: I had in fact been staring at a blank wall the previous evening while composing my first successful joke.</p>

<p>The next day, my classmates asked me what channel “The Blank Wall” was on.  I of course gave evasive answers, but in a few minutes they figured out that I meant a literal blank wall.  They were not impressed with my attitude.  You saw jokes on television, after all, and no one in their right mind would even try to make one up!</p>

<p>I also did some grade-school acting, though my big role was Jonathan Brewster in a seventh-grade production of “Arsenic and Old Lace” rather than anything comedic.  The need to work prevented my acting in any high-school plays, though to be fair it is not clear that my acting abilities would have kept up with those of my classmates in any case.</p>

<p>Besides, those working in retail can attest that carefully deployed humor can be extremely useful.  So my high-school grocery-store job likely provided me with more and better experience than the high-school plays could possibly have done.  At least that is what I keep telling myself!</p>

<h2 id="speech-team">Speech Team</h2>

<p>For reasons that were never quite clear to me, the high-school speech-team coach asked me to try out.  I probably would have ignored her, but I well recalled my father telling me that those who have nothing to say, but can say it well, will often do better than those who have something to say but cannot say it.  So, against my better 13-year-old judgment, I signed up.</p>

<p>I did quite well in extemporaneous speech during my first year due to my relatively deep understanding of the science behind the hot topic of that time, namely the energy crisis.  During later years, the hot topics reverted to the usual political and evening-news fare, so the remaining three years were good practice, but did not result in wins.  Until the end of my senior year, when the coach suggested that I try radio commentary, which had the great advantage of hiding my horribly geeky teenaged face from the judges.  I did quite well, qualifying for district-level competition on the strength of my first-ever radio-commentary speech.</p>

<p>But I can only be thankful that my 17-year-old self decided to go to an engineering university as opposed to seeking employment at a local radio station.</p>

<h2 id="university-coursework">University Coursework</h2>

<p>I tested out of Freshman English Composition, but I did take a couple of courses on technical writing and technical presentation.  A ca. 1980 mechanical-engineering presentation on ground-loop heat pumps featured my first use of cartoons in a technical presentation, courtesy of a teammate who knew a professional cartoonist.  The four of us were quite proud of having kept the class’s attention during the full duration of our talk, which took place only a few days before the start of Christmas holidays.</p>

<h2 id="1980s-and-1990s-presentations">1980s and 1990s Presentations</h2>

<p>I did impromptu work-related presentations for my contract-programming work in the early 1980s.  In the late 1980s, I joined a research institute where I was expected to do formal presentations, including at academic venues.  I joined a startup in 1990, where I continued academic presentations, but focused mainly on internal training presentations.</p>

<h2 id="toastmasters">Toastmasters</h2>

<p>I became a founding member of a local Toastmasters club in 1993, and during the next seven years received CTM (“Competent Toastmaster”) and ATM (“Advanced Toastmaster”) certifications.  There is very likely a Toastmasters club near you, and you can search here: <a href="https://www.toastmasters.org/" rel="nofollow">https://www.toastmasters.org/</a>.</p>

<p>The purpose of Toastmasters is to help people develop public-speaking skills in a friendly environment.  The members of the club help each other, evaluating each others’ short speeches and providing topics for even shorter impromptu speeches.  The CTM and ATM certifications each have a manual that guides the member through a series of different types of speeches.  For example, the 1990s CTM manual starts with a 4-6-minute speech in which the member introduces themselves.  This has the benefit of ensuring that the speaker is expert on the topic, though I have come across an amnesiac who was an exception that proves this rule.</p>

<p>For me, the best of Toastmasters was “table topics”, in which someone is designated to bring a topic to the next meeting.  The topic is called out, and people are expected to volunteer to give a short speech (a minute or two) on that topic.  This is excellent preparation for those times when someone calls you out during a meeting.</p>

<h2 id="benchmarking">Benchmarking</h2>

<p>By the year 2000, I felt very good about my speaking ability.  I was aware of some shortcomings, for example, I had difficulty with audiences larger than about 100 people, but was doing quite well, both in my own estimation and that of others.  In short, it was time to benchmark myself against a professional speaker.</p>

<p>In that year, I attended an event whose keynote was given by none other than one of the least articulate of the US Presidents, George H. W. Bush.  Now, Bush’s speaking abilities might have been unfairly compared to the larger-than-life capabilities of his predecessor (Ronald Reagan, AKA “The Great Communicator”) and his successor (Bill Clinton, whose command of people skills is the stuff of legends).  In contrast, here is Ann Richards’s assessment of Bush’s skills: “born with a silver foot in his mouth”.</p>

<p>As noted above, I had just completed seven years in Toastmasters, so I was more than ready to do a Toastmasters-style evaluation of Bush’s keynote.  I would record all the defects in this speech and email it to my Toastmasters group for their amusement.</p>

<p>Except that it didn’t turn out that way.</p>

<p>Bush gave a one-hour speech during which he did everything that I knew how to do, and did it effortlessly.  Not only that, there were instances where he clearly expected a reaction from the audience, and got that reaction.  I was watching him like a hawk the whole time and had absolutely no idea how he had made it happen.</p>

<p>Bush might well have been the most inarticulate of the US Presidents, but he was incomparably better than this software developer will ever be.</p>

<p>But that does not mean that I cannot continue to improve.  In fact, I can now do a better job of presenting than Bush can.  Not just due to my having spent the intervening decades practicing (practice makes perfect!), but mostly due to the fact that Bush has since passed away.</p>

<h2 id="linux-community">Linux Community</h2>

<p>I joined the Linux community in 2001, where I faced large and diverse audiences.  It quickly became obvious that I needed to apply my youthful Warner Brothers lessons, especially given that I was presenting things like RCU to audiences that were mostly innocent of any knowledge of or experience in concurrency.</p>

<p>This experience also gave me much-needed practice dealing with larger audiences, in a few cases, on the order of 1,000.</p>

<p>So I continue to improve, but there is much more for me to learn.</p>
]]></content:encoded>
      <author>paulmck</author>
      <guid>https://people.kernel.org/read/a/j2xrz4hd83</guid>
      <pubDate>Thu, 27 Feb 2025 23:55:17 +0000</pubDate>
    </item>
    <item>
      <title>Speaking at Kernel Recipes: Summary</title>
      <link>https://people.kernel.org/paulmck/speaking-at-kernel-recipes-summary</link>
      <description>&lt;![CDATA[This is part of the Kernel Recipes 2025 blog series.&#xA;&#xA;This blog series has covered why public speaking is important, ways and means, building bridges from your audience to where they need to go, who owns your words,  telling stories, knowing your destination, use of humor, and speaking on short notice.&#xA;&#xA;But if you would rather learn about what I actually did rather than what I advise you to do, please see here.&#xA;&#xA;I close this series by reiterating the value and ubiquity of Toastmasters and the usefulness of both dry runs and reviewing videos of your past talks.&#xA;&#xA;Best of everything in your presentations!&#xA;&#xA;Acknowledgments&#xA;&#xA;And last, but definitely not least, a big &#34;thank you&#34; (in chronological order) to Anne Nicolas, Willy Tarreau, Steven Rostedt, Gregory Price, and Michael Opdenacker for their careful review of early versions of this series.]]&gt;</description>
      <content:encoded><![CDATA[<p>This is part of the <a href="https://people.kernel.org/paulmck/kernel-recipes-2025" rel="nofollow">Kernel Recipes 2025</a> blog series.</p>

<p>This blog series has covered <a href="https://people.kernel.org/paulmck/kernel-recipes-2025" rel="nofollow">why public speaking is important</a>, <a href="https://people.kernel.org/paulmck/speaking-at-kernel-recipes" rel="nofollow">ways and means</a>, <a href="https://people.kernel.org/paulmck/speaking-at-kernel-recipes-build-a-bridge" rel="nofollow">building bridges from your audience to where they need to go</a>, <a href="https://people.kernel.org/paulmck/speaking-at-kernel-recipes-who-owns-your-words" rel="nofollow">who owns your words</a>,  <a href="https://people.kernel.org/paulmck/speaking-at-kernel-recipes-tell-a-story" rel="nofollow">telling stories</a>, <a href="https://people.kernel.org/paulmck/speaking-at-kernel-recipes-know-your-destination" rel="nofollow">knowing your destination</a>, <a href="https://people.kernel.org/paulmck/speaking-at-kernel-recipes-use-humor-but-carefully" rel="nofollow">use of humor</a>, and <a href="https://people.kernel.org/paulmck/speaking-at-kernel-recipes-short-notice" rel="nofollow">speaking on short notice</a>.</p>

<p>But if you would rather learn about what I actually did rather than what I advise you to do, please see <a href="https://people.kernel.org/paulmck/speaking-at-kernel-recipes-my-twisted-path" rel="nofollow">here</a>.</p>

<p>I close this series by reiterating the value and ubiquity of <a href="https://www.toastmasters.org/" rel="nofollow">Toastmasters</a> and the usefulness of both dry runs and reviewing videos of your past talks.</p>

<p>Best of everything in your presentations!</p>

<h3 id="acknowledgments">Acknowledgments</h3>

<p>And last, but definitely not least, a big “thank you” (in chronological order) to Anne Nicolas, Willy Tarreau, Steven Rostedt, Gregory Price, and Michael Opdenacker for their careful review of early versions of this series.</p>
]]></content:encoded>
      <author>paulmck</author>
      <guid>https://people.kernel.org/read/a/c06p982oky</guid>
      <pubDate>Thu, 27 Feb 2025 20:14:19 +0000</pubDate>
    </item>
    <item>
      <title>Speaking at Kernel Recipes: Use Humor, But Carefully</title>
      <link>https://people.kernel.org/paulmck/speaking-at-kernel-recipes-use-humor-but-carefully</link>
      <description>&lt;![CDATA[This is part of the Kernel Recipes 2025 blog series.&#xA;&#xA;Humor is both difficult and dangerous, especially in a large and diverse group such as the audience for Kernel Recipes.  My advice is to do many formal presentations before attempting much in the way of humor.&#xA;&#xA;This section will nevertheless talk about use of humor in technical presentations.&#xA;&#xA;One issue is that audience members have a wide range of languages and dialects, and a given joke in (say) American English might not go over well with (say) Welsh English speakers.  And it might be completely mangled in translation to another language.  For example, during a 1980s visit to China, George Bush Senior is said to have quipped “We are oriented to the Orient.”  This translates to something like “我们面向东方”, which translates back to something like “We face East”, completely destroying Bush’s oriented/Orient pun.  So what did the poor translator say?  “是笑话,笑吧”, which translates to something like “It is a joke,  laugh.”&#xA;&#xA;So if you tell jokes, keep translations to other cultures and languages firmly in mind.  (To be fair, this is advice that I could do well to better heed myself!)&#xA;&#xA;In addition, jokes make fun of some person or group or are based on what is considered to be abnormal, excessive, or unacceptable, all of which differ greatly across cultures.  Besides which, given a large and diverse audience such as that of Kernel Recipes, there will almost certainly be someone in attendance who identifies with the person or group in question or who has strong feelings about the joke’s implications about abnormality, excessiveness, or unacceptability.  That someone just might have a strong negative reaction.  And this should be absolutely no surprise, given that humor is used with great effect as a weapon in social conflicts.&#xA;&#xA;In my youth, there were outgroups that were frequently the butt of jokes.  
These were often groups that were not represented in my small community, but were just as often a single-person outgroup made up of some hapless fellow student.  Then as now, the most cruel jokes all too often get the best laughs.&#xA;&#xA;Yet humor can also make a speech much more enjoyable.  So what is a speaker to do?&#xA;&#xA;Outgroups are often used, with technical talks making jokes at the expense of managers, salespeople, marketing departments, lawyers, users, and occasionally even an especially incompetent techie.  But these jokes always eventually find their way to the outgroup in question, sometimes with devastating consequences to the hapless speaker.&#xA;&#xA;It is better to tell jokes where you yourself are the butt of the joke.  This can be difficult at first: Let’s face it, most of us would prefer to be taken seriously.  However, becoming comfortable with this is well worth the effort.  For one thing, once you have demonstrated a willingness to make a joke at your own expense, the audience will usually be much more willing to accept their own shortcomings and need for improvement.  Such an audience will usually also be more willing to learn, and the best technical talks are after all those that audiences learn from.&#xA;&#xA;What jokes should you tell on yourself?  I paraphrase advice from the late humorist Patrick McManus:  The worst day of your life will make the audience laugh the hardest.&#xA;&#xA;That said, you need to make sure that the audience can relate to the challenges you faced on that day.  For example, my interactions with the legal profession would likely seem strange and irrelevant to a general audience.  However, almost all members of a Kernel Recipes audience will have chased down a difficult bug, so a story about some idiotic mistake I made while chasing down an RCU bug will likely resonate.  
And this might be one way of entertaining a general audience while providing needed information to those wanting an RCU deep dive.&#xA;&#xA;Or maybe you can figure out how to work some bathroom humor into your talk.  Who is the butt of this joke?  You decide!  ;-)&#xA;&#xA;Adding humor to your talk often does not come for free.  Time spent telling jokes is not available for presenting on technology.  This tradeoff can be tricky: Too much humor makes for a lightweight talk, and too little for a dry talk.  Especially if you are just starting out, I strongly advise you to err in the direction of dryness.  Instead, make your technical content be the source of your audience’s excitement.&#xA;&#xA;Use of humor in technical talks is both difficult and dangerous, but careful use of humor can be a very powerful public-speaking tool.&#xA;&#xA;Perhaps some day I, too, will master the use of humor.&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p>This is part of the <a href="https://people.kernel.org/paulmck/kernel-recipes-2025" rel="nofollow">Kernel Recipes 2025</a> blog series.</p>

<p>Humor is both difficult and dangerous, especially in a large and diverse group such as the audience for Kernel Recipes.  My advice is to do many formal presentations before attempting much in the way of humor.</p>

<p>This section will nevertheless talk about use of humor in technical presentations.</p>

<p>One issue is that audience members have a wide range of languages and dialects, and a given joke in (say) American English might not go over well with (say) Welsh English speakers.  And it might be completely mangled in translation to another language.  For example, during a 1980s visit to China, George Bush Senior is said to have quipped “We are oriented to the Orient.”  This translates to something like “我们面向东方”, which translates back to something like “We face East”, completely destroying Bush’s oriented/Orient pun.  So what did the poor translator say?  “是笑话,笑吧”, which translates to something like “It is a joke,  laugh.”</p>

<p>So if you tell jokes, keep translations to other cultures and languages firmly in mind.  (To be fair, this is advice that I could do well to better heed myself!)</p>

<p>In addition, jokes make fun of some person or group or are based on what is considered to be abnormal, excessive, or unacceptable, all of which differ greatly across cultures.  Besides which, given a large and diverse audience such as that of Kernel Recipes, there will almost certainly be someone in attendance who identifies with the person or group in question or who has strong feelings about the joke’s implications about abnormality, excessiveness, or unacceptability.  That someone just might have a strong negative reaction.  And this should be absolutely no surprise, given that humor is used with great effect as a weapon in social conflicts.</p>

<p>In my youth, there were outgroups that were frequently the butt of jokes.  These were often groups that were not represented in my small community, but were just as often a single-person outgroup made up of some hapless fellow student.  Then as now, the most cruel jokes all too often get the best laughs.</p>

<p>Yet humor can also make a speech much more enjoyable.  So what is a speaker to do?</p>

<p>Outgroups are often used, with technical talks making jokes at the expense of managers, salespeople, marketing departments, lawyers, users, and occasionally even an especially incompetent techie.  But these jokes always eventually find their way to the outgroup in question, sometimes with devastating consequences to the hapless speaker.</p>

<p>It is better to tell jokes where you yourself are the butt of the joke.  This can be difficult at first: Let’s face it, most of us would prefer to be taken seriously.  However, becoming comfortable with this is well worth the effort.  For one thing, once you have demonstrated a willingness to make a joke at your own expense, the audience will usually be much more willing to accept their own shortcomings and need for improvement.  Such an audience will usually also be more willing to learn, and the best technical talks are after all those that audiences learn from.</p>

<p>What jokes should you tell on yourself?  I paraphrase advice from the late humorist <a href="https://en.wikipedia.org/wiki/Patrick_F._McManus" rel="nofollow">Patrick McManus</a>:  The worst day of your life will make the audience laugh the hardest.</p>

<p>That said, you need to make sure that the audience can relate to the challenges you faced on that day.  For example, my interactions with the legal profession would likely seem strange and irrelevant to a general audience.  However, almost all members of a Kernel Recipes audience will have chased down a difficult bug, so a story about some idiotic mistake I made while chasing down an RCU bug will likely resonate.  And this might be one way of entertaining a general audience while providing needed information to those wanting an RCU deep dive.</p>

<p>Or maybe you can figure out how to work some <a href="https://www.youtube.com/watch?v=RjJG3LitNJQ&amp;list=PLQ8PmP_dnN7Ida3J3tzO-yqrxuF-yEQtI&amp;index=12" rel="nofollow">bathroom humor</a> into your talk.  Who is the butt of this joke?  You decide!  ;-)</p>

<p>Adding humor to your talk often does not come for free.  Time spent telling jokes is not available for presenting on technology.  This tradeoff can be tricky: Too much humor makes for a lightweight talk, and too little for a dry talk.  Especially if you are just starting out, I strongly advise you to err in the direction of dryness.  Instead, make your technical content be the source of your audience’s excitement.</p>

<p>Use of humor in technical talks is both difficult and dangerous, but careful use of humor can be a very powerful public-speaking tool.</p>

<p>Perhaps some day I, too, will master the use of humor.</p>
]]></content:encoded>
      <author>paulmck</author>
      <guid>https://people.kernel.org/read/a/n42j9i5bpo</guid>
      <pubDate>Thu, 27 Feb 2025 19:49:34 +0000</pubDate>
    </item>
    <item>
      <title>Finding a kernel bug triggered by systemd from userspace</title>
      <link>https://people.kernel.org/linusw/finding-a-kernel-bug-triggered-by-systemd-from-userspace</link>
      <description>&lt;![CDATA[As I was working my way toward a mergeable version of generic entry for ARM32, there was an especially nasty bug that I could not for the life of me iron out: when booting Debian for armhf I just kept running into a boot splat, while everything else worked fine. It would look something like this:&#xA;&#xA;8&lt;--- cut here ---&#xA;Unable to handle kernel paging request at virtual address eaffff76 when execute&#xA;[eaffff76] pgd=eae1141e(bad)&#xA;Internal error: Oops: 8000000d [#1] SMP ARM&#xA;CPU: 0 UID: 997 PID: 304 Comm: sd-resolve Not tainted 6.13.0-rc1+ #22&#xA;Hardware name: ARM-Versatile Express&#xA;PC is at 0xeaffff76&#xA;LR is at _invokesyscallret+0x0/0x18&#xA;pc : [eaffff76]    lr : [80100a68]    psr: a0030013&#xA;sp : fbc11f68  ip : fbc11e78  fp : 76539420&#xA;r10: 10c5387d  r9 : 841f4ec0  r8 : 80100284&#xA;r7 : ffffffff  r6 : 7653941c  r5 : 76cb6000  r4 : 00000000&#xA;r3 : 00000000  r2 : 00000000  r1 : 00080003  r0 : ffffff9f&#xA;Flags: NzCv  IRQs on  FIQs on  Mode SVC32  ISA ARM  Segment none&#xA;Control: 10c5387d  Table: 8222006a  DAC: 00000051&#xA;Register r0 information: non-paged memory&#xA;Register r1 information: non-paged memory&#xA;Register r2 information: NULL pointer&#xA;Register r3 information: NULL pointer&#xA;Register r4 information: NULL pointer&#xA;Register r5 information: non-paged memory&#xA;Register r6 information: non-paged memory&#xA;Register r7 information: non-paged memory&#xA;Register r8 information: non-slab/vmalloc memory&#xA;Register r9 information: slab task_struct start 841f4ec0 pointer offset 0 size 2240&#xA;Register r10 information: non-paged memory&#xA;Register r11 information: non-paged memory&#xA;Register r12 information: 2-page vmalloc region starting at 0xfbc10000 allocated at copy_process+0x150/0xd88&#xA;Process sd-resolve (pid: 304, stack limit = 0xbab1c12b)&#xA;Stack: (0xfbc11f68 to 0xfbc12000)&#xA;1f60:                   00000000 76cb6000 fbc11fb0 ffffffff 80100284 80cdd330&#xA;1f80: 80100284 
841f4ec0 10c5387d 80111eac 00000000 76cb6000 7653941c 00000119&#xA;1fa0: 76539420 80100280 00000000 76cb6000 ffffff9f 00080003 00000000 00000000&#xA;1fc0: 00000000 76cb6000 7653941c 00000119 76539c48 76539c44 76539b6c 76539420&#xA;1fe0: 76f3a450 765392c4 76c72a4d 76c60108 20030030 00000010 00000000 00000000&#xA;Call trace: &#xA;Code: 00000000 00000000 00000000 00000000 (00000000) &#xA;---[ end trace 0000000000000000 ]---&#xA;&#xA;The paging request means that we are in kernel mode, and we have tried to page in a page that does not exist, such as reading from random uninitialized memory somewhere. If this was userspace, we would get a &#34;segmentation fault&#34;. So this is a pretty common error in C programs.&#xA;&#xA;Notice the following: no call trace. This always happens when you least of all want it: how am I supposed to know how we got here?&#xA;&#xA;But the die() splat has this helpful information: PID: 304 Comm: sd-resolve which reads: this was caused by the process initiated by the command sd-resolve executing as PID 304. But sd-resolve is just something systemd fires temporarily when bringing up some other service, so it must be part of a service. Luckily we also have dmesg:&#xA;&#xA;       Starting systemd-timesyncd… - Network Time Synchronization...&#xA;8&lt;--- cut here ---&#xA;Unable to handle kernel paging request at virtual address eaffff76 when execute&#xA;&#xA;Aha it&#39;s the NTP service. We can verify that this process is causing the mess by issuing:&#xA;&#xA;And indeed, we get a second reproducable splat. OK great, let&#39;s use ftrace to tell us what happened.&#xA;&#xA;The excellent article Secrets of the Ftrace function tracer tells us that it&#39;s as simple as echoing the PID of the process into &#xA;There is a problem though: this process is by its very nature transient. I don&#39;t know the PID! After some googling it turns out you can ask systemd what PID a certain service is running as:&#xA;&#xA;So ... 
we can echo this into &#xA;After some fooling around with trying to restart the service in one window, then quickly switching to another window and start the trace while the restart is happening, I realized what everyone should already know: never put a person to do a machine&#39;s job.&#xA;&#xA;I had to write a script that would restart the service, start the trace, wait for the restart to finish and then stop the trace. It ended up looking like this:&#xA;&#xA;!/bin/bash&#xA;TRACEDIR=/sys/kernel/debug/tracing&#xA;SERVICE=systemd-timesyncd&#xA;TRACEFILE=/root/trace.dat&#xA;&#xA;echo 0   ${TRACEDIR}/tracingon&#xA;echo &#34;function&#34;   ${TRACEDIR}/currenttracer&#xA;(systemctl restart ${SERVICE})&amp; PID=$!&#xA;echo 1   ${TRACEDIR}/tracingon&#xA;echo &#34;Wait for restart to commence&#34;&#xA;wait &#34;${PID}&#34;&#xA;echo 0   ${TRACEDIR}/tracingon&#xA;echo &#34;Restarted &#34;&#xA;trace-cmd extract -o ${TRACEFILE}&#xA;scp ${TRACEFILE} linus@169.254.1.2:/tmp/trace.dat&#xA;&#xA;This does what we want: turn off the tracing, activate the function tracer, restart the systemd-timesyncd service and capture it&#39;s PID, start tracing, wait for the restart to commence, then extract the trace and copy it to the development system. No need to figure out the PID of the forked sd-resolve, just capture everything: this window will be small enough that we can capture all the relevant trace information.&#xA;&#xA;After this I brought up kernelshark to inspect the resulting tracefile &#xA;Kernelshark looking at logs&#xA;&#xA;We search for the die() invocation (here I had a prinfo() added with the word &#34;CRASH&#34; as well). Sure enough there it is and the task is indeed sd-resolve. But what happens before? 
We need to know why we crashed here.&#xA;&#xA;For that we need to re-run the trace but now with the function\graph tracer so we can see the program flow, the indentation helps us to follow the messiness:&#xA;&#xA;Kernelshark looking at logs for function graph&#xA;&#xA;So we switch to the function\graph view and start off from die(), then we just move upward: first we find a prefetch abort, and that is what we already knew: we are getting a page fault, in kernel mode, for a page that does not have a backing storage.&#xA;&#xA;So browse backwards, and:&#xA;&#xA;Kernelshark looking at logs for function graph&#xA;&#xA;Aha. We are invoking some syscall, and that doesn&#39;t really work. So what kind of odd syscall can we invoke that just crash on us like that? We instrument invokesyscall() to print that out for us, but not for all invocations but just for the task we are interested in, and we know that is sd-resolve:&#xA;&#xA;if (!strcmp(current-  comm, &#34;sd-resolve&#34;))&#xA;   pr_info(&#34;%s invoke syscall %d\n&#34;, current-  comm, scno);&#xA;&#xA;We take advantage of the fact that the global variable current in the kernel always points to the currently active task. And the field &#xA;Then we run this and oh:&#xA;&#xA;[  OK  ] Started systemd-timesyncd.…0m - Network Time Synchronization.&#xA;sd-resolve invoke syscall 291&#xA;sd-resolve invoke syscall -1&#xA;8&lt;--- cut here ---&#xA;Unable to handle kernel paging request at virtual address e7f001f2 when execute&#xA;[e7f001f2] pgd=e7e1141e(bad)&#xA;&#xA;So system call -1 was invoked. 
This may seem weird to you, but it is actually a legal value: when tracing system calls (such as with strace) the kernel will filter system calls, and the filter will return -1 to indicate that the system call should not even be taken, &#34;skipped&#34;, and my new generic entry code was not taking this properly into account, and the low-level assembly tried to vector -1 into a table and it failed miserably, vectoring us out in the unknown.&#xA;&#xA;At this point I could quickly patch up the code and call it a day.&#xA;&#xA;I have no idea why sd-resolve turns on system call tracing by default, because it obviously does. It could be related to some seccomp security features that are being called in BPF programs prior to every system call in the same code path? I think those need to intercept the system calls anyway. Not particularly efficient, but I suppose quite secure.]]&gt;</description>
      <content:encoded><![CDATA[<p>As I was working my way toward a mergeable version of <a href="https://lore.kernel.org/linux-arm-kernel/20250107-arm-generic-entry-v3-0-4e5f3c15db2d@linaro.org/" rel="nofollow">generic entry for ARM32</a>, there was an especially nasty bug that I could not for my life iron out: when booting Debian for armhf I just kept running into a boot splat, while everything else worked fine. It would look something like this:</p>

<pre><code>8&lt;--- cut here ---
Unable to handle kernel paging request at virtual address eaffff76 when execute
[eaffff76] *pgd=eae1141e(bad)
Internal error: Oops: 8000000d [#1] SMP ARM
CPU: 0 UID: 997 PID: 304 Comm: sd-resolve Not tainted 6.13.0-rc1+ #22
Hardware name: ARM-Versatile Express
PC is at 0xeaffff76
LR is at __invoke_syscall_ret+0x0/0x18
pc : [&lt;eaffff76&gt;]    lr : [&lt;80100a68&gt;]    psr: a0030013
sp : fbc11f68  ip : fbc11e78  fp : 76539420
r10: 10c5387d  r9 : 841f4ec0  r8 : 80100284
r7 : ffffffff  r6 : 7653941c  r5 : 76cb6000  r4 : 00000000
r3 : 00000000  r2 : 00000000  r1 : 00080003  r0 : ffffff9f
Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 8222006a  DAC: 00000051
Register r0 information: non-paged memory
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: NULL pointer
Register r5 information: non-paged memory
Register r6 information: non-paged memory
Register r7 information: non-paged memory
Register r8 information: non-slab/vmalloc memory
Register r9 information: slab task_struct start 841f4ec0 pointer offset 0 size 2240
Register r10 information: non-paged memory
Register r11 information: non-paged memory
Register r12 information: 2-page vmalloc region starting at 0xfbc10000 allocated at copy_process+0x150/0xd88
Process sd-resolve (pid: 304, stack limit = 0xbab1c12b)
Stack: (0xfbc11f68 to 0xfbc12000)
1f60:                   00000000 76cb6000 fbc11fb0 ffffffff 80100284 80cdd330
1f80: 80100284 841f4ec0 10c5387d 80111eac 00000000 76cb6000 7653941c 00000119
1fa0: 76539420 80100280 00000000 76cb6000 ffffff9f 00080003 00000000 00000000
1fc0: 00000000 76cb6000 7653941c 00000119 76539c48 76539c44 76539b6c 76539420
1fe0: 76f3a450 765392c4 76c72a4d 76c60108 20030030 00000010 00000000 00000000
Call trace: 
Code: 00000000 00000000 00000000 00000000 (00000000) 
---[ end trace 0000000000000000 ]---
</code></pre>

<p>The paging request means that we are in kernel mode and have tried to page in a page that does not exist, for example by reading from random uninitialized memory somewhere. If this were userspace, we would get a “segmentation fault”, a pretty common error in C programs.</p>

<p>Notice the following: no call trace. This always happens when you want it least: how am I supposed to know how we got here?</p>

<p>But the die() splat has this helpful information: <strong>PID: 304 Comm: sd-resolve</strong> which reads: this was caused by the process initiated by the command <strong>sd-resolve</strong> executing as PID 304. But sd-resolve is just something systemd fires temporarily when bringing up some other service, so it must be part of a service. Luckily we also have dmesg:</p>

<pre><code>       Starting systemd-timesyncd… - Network Time Synchronization...
8&lt;--- cut here ---
Unable to handle kernel paging request at virtual address eaffff76 when execute
</code></pre>

<p>Aha, it&#39;s the NTP service. We can verify that this process is causing the mess by issuing:</p>

<p><code>systemctl restart systemd-timesyncd</code></p>

<p>And indeed, we get a second reproducible splat. OK great, let&#39;s use <em>ftrace</em> to tell us what happened.</p>

<p>The excellent article <a href="https://lwn.net/Articles/370423/" rel="nofollow">Secrets of the Ftrace function tracer</a> tells us that it&#39;s as simple as echoing the PID of the process into <code>set_ftrace_pid</code> in the kernel debug/tracing sysfs filetree.</p>
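<p>For a process whose PID is already known, the basic flow from that article looks roughly like this (a minimal sketch, assuming tracefs is mounted in the usual place under debugfs; run as root):</p>

<pre><code>cd /sys/kernel/debug/tracing
echo 0 &gt; tracing_on          # stop any tracer currently running
echo $PID &gt; set_ftrace_pid   # restrict tracing to this PID
echo function &gt; current_tracer
echo 1 &gt; tracing_on          # go
# ... let the process do its thing ...
echo 0 &gt; tracing_on
less trace                   # human-readable trace buffer
</code></pre>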

<p>There is a problem though: this process is by its very nature transient. I don&#39;t know the PID! After some googling it turns out you can ask systemd what PID a certain service is running as:</p>

<p><code>systemctl show --property MainPID --value systemd-timesyncd</code></p>

<p>So ... we can echo this into <code>set_ftrace_pid</code> and then start the trace. But the service is transient: how can I restart the service, obtain the PID, start tracing, wait for the service to finish restarting and then end the trace? That&#39;s a tall order.</p>

<p>After some fooling around with trying to restart the service in one window, then quickly switching to another window and start the trace while the restart is happening, I realized what everyone should already know: never put a person to do a machine&#39;s job.</p>

<p>I had to write a script that would restart the service, start the trace, wait for the restart to finish and then stop the trace. It ended up looking like this:</p>

<pre><code>#!/bin/bash
TRACEDIR=/sys/kernel/debug/tracing
SERVICE=systemd-timesyncd
TRACEFILE=/root/trace.dat

echo 0 &gt; ${TRACEDIR}/tracing_on
echo &#34;function&#34; &gt; ${TRACEDIR}/current_tracer
(systemctl restart ${SERVICE})&amp; PID=$!
echo 1 &gt; ${TRACEDIR}/tracing_on
echo &#34;Wait for restart to commence&#34;
wait &#34;${PID}&#34;
echo 0 &gt; ${TRACEDIR}/tracing_on
echo &#34;Restarted &#34;
trace-cmd extract -o ${TRACEFILE}
scp ${TRACEFILE} linus@169.254.1.2:/tmp/trace.dat
</code></pre>

<p>This does what we want: turn off tracing, activate the function tracer, restart the systemd-timesyncd service and capture its PID, start tracing, wait for the restart to complete, then extract the trace and copy it to the development system. No need to figure out the PID of the forked sd-resolve, just capture everything: the window will be small enough that we capture all the relevant trace information.</p>

<p>After this I brought up <a href="https://kernelshark.org/" rel="nofollow">kernelshark</a> to inspect the resulting tracefile <code>trace.dat</code> (right-click open in new tab/window to see the details):</p>
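<p>When no GUI is at hand, the same <code>trace.dat</code> can also be dumped as plain text with trace-cmd itself; a sketch:</p>

<pre><code># text inspection of the captured trace without kernelshark
trace-cmd report trace.dat | grep -C 5 sd-resolve | less
</code></pre>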

<p><img src="https://dflund.se/~triad/images/crash1.jpg" alt="Kernelshark looking at logs"></p>

<p>We search for the die() invocation (here I had a pr_info() added with the word “CRASH” as well). Sure enough there it is and the task is indeed <em>sd-resolve</em>. But what happens before? We need to know <em>why</em> we crashed here.</p>

<p>For that we need to re-run the trace, but now with the <strong>function_graph</strong> tracer so we can see the program flow; the indentation helps us follow the messiness:</p>

<p><img src="https://dflund.se/~triad/images/crash2.jpg" alt="Kernelshark looking at logs for function graph"></p>

<p>So we switch to the function_graph view and start off from die(), then we just move upward: first we find a prefetch abort, and that is what we already knew: we are getting a page fault, in kernel mode, for a page that has no backing storage.</p>

<p>So browse backwards, and:</p>

<p><img src="https://dflund.se/~triad/images/crash3.jpg" alt="Kernelshark looking at logs for function graph"></p>

<p>Aha. We are invoking some syscall, and that doesn&#39;t really work. So what kind of odd syscall can we invoke that just crashes on us like that? We instrument invoke_syscall() to print it out for us, not for all invocations but just for the task we are interested in, and we know that is <strong>sd-resolve</strong>:</p>

<pre><code>if (!strcmp(current-&gt;comm, &#34;sd-resolve&#34;))
   pr_info(&#34;%s invoke syscall %d\n&#34;, current-&gt;comm, scno);
</code></pre>

<p>We take advantage of the fact that the global variable <strong>current</strong> in the kernel always points to the currently active task. And the field <code>.comm</code> should contain the command name that it was invoked with, such as “sd-resolve”.</p>

<p>Then we run this and oh:</p>

<pre><code>[  OK  ] Started systemd-timesyncd.…0m - Network Time Synchronization.
sd-resolve invoke syscall 291
sd-resolve invoke syscall -1
8&lt;--- cut here ---
Unable to handle kernel paging request at virtual address e7f001f2 when execute
[e7f001f2] *pgd=e7e1141e(bad)
</code></pre>

<p>So <strong>system call -1</strong> was invoked. This may seem weird, but it is actually a legal value: when system calls are being traced (such as with <strong>strace</strong>) the kernel filters them, and the filter can return <strong>-1</strong> to indicate that the system call should not be taken at all, i.e. “skipped”. My new generic entry code was not taking this properly into account, so the low-level assembly tried to vector -1 into a table and failed miserably, vectoring us out into the unknown.</p>
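<p>This skip can be observed from userspace with strace&#39;s syscall tampering, which works precisely by rewriting the traced syscall number to -1 and injecting a fake return value (a sketch; requires an strace build with tampering support):</p>

<pre><code># sync(2) is never actually executed: the syscall number is
# rewritten to -1 and the injected retval 0 is returned instead
strace -e trace=sync -e inject=sync:retval=0 sync
</code></pre>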

<p>At this point I could quickly patch up the code and call it a day.</p>

<p>I have no idea why sd-resolve turns on system call tracing by default, but it obviously does. It could be related to seccomp security features being applied by BPF programs prior to every system call in the same code path; I think those need to intercept the system calls anyway. Not particularly efficient, but I suppose quite secure.</p>
]]></content:encoded>
      <author>linusw</author>
      <guid>https://people.kernel.org/read/a/fapfgctnlw</guid>
      <pubDate>Sun, 02 Feb 2025 16:05:45 +0000</pubDate>
    </item>
    <item>
      <title>netdev in 2024</title>
      <link>https://people.kernel.org/kuba/netdev-in-2024</link>
      <description>&lt;![CDATA[Developments in Linux kernel networking accomplished by many excellent developers and as remembered by Andew L, Eric D, Jakub K and Paolo A.&#xA;&#xA;Another busy year has passed so let us punctuate the never ending stream of development with a retrospective of our accomplishments over the last 12 months. The previous, 2023 retrospective has covered changes from Linux v6.3 to v6.8, for 2024 we will cover Linux v6.9 to v6.13, one fewer, as Linux releases don’t align with calendar years. We will focus on the work happening directly on the netdev mailing list, having neither space nor expertise to do justice to developments within sub-subsystems like WiFi, Bluetooth, BPF etc.&#xA;Core&#xA;After months of work and many patch revisions we have finally merged support for Device Memory TCP, which allows TCP payloads to be placed directly in accelerator (GPU, TPU, etc.) or user space memory while still using the kernel stack for all protocol (header) processing (v6.12). The immediate motivation for this work is obviously the GenAI boom, but some of the components built to enable Device Memory TCP, for example queue control API (v6.10), should be more broadly applicable.&#xA;&#xA;The second notable area of development was busy polling. Additions to the epoll API allow enabling and configuring network busy polling on a per-epoll-instance basis, making the feature far easier to deploy in a single application (v6.9). Even more significant was the addition of a NAPI suspension mechanism which allows for efficient and automatic switching between busy polling and IRQ-driven operation, as most real life applications are not constantly running under highest load (v6.12). Once again the work was preceded by paying off technical debt, it is now possible to configure individual NAPI instances rather than an entire network interface (v6.13).&#xA;&#xA;Work on relieving the rtnl\lock pressure has continued throughout the year. 
The rtnl\lock is often mentioned as one of the biggest global locks in the kernel, as it protects all of the network configuration and state. The efforts can be divided into two broad categories - converting read operations to rely on RCU protection or other fine grained locking (v6.9, v6.10), and splitting the lock into per-network namespace locks (preparations for which started in v6.13).&#xA;&#xA;Following discussions during last year’s LPC, the Real Time developers have contributed changes which make network processing more RT-friendly by allowing all packet processing to be executed in dedicated threads, instead of the softirq thread (v6.10). They also replaced implicit Bottom Half protection (the fact that code in BH context can’t be preempted, or migrated between CPUs) with explicit local locks (v6.11).&#xA;&#xA;The routing stack has seen a number of small additions for ECMP forwarding, which underpins all modern datacenter network fabrics. ECMP routing can now maintain per-path statistics to allow detecting unbalanced use of paths (v6.9), and to reseed the hashing key to remediate the poor traffic distribution (v6.11). The weights used in ECMP’s consistent hashing have been widened from 8 bits to 16 bits (v6.12).&#xA;&#xA;The ability to schedule sending packets at a particular time in the future has been extended to survive  network namespace traversal (v6.9), and now supports using the TAI clock as a reference (v6.11). We also gained the ability to explicitly supply the timestamp ID via a cmsg during a sendmsg call (v6.13).&#xA;&#xA;The number of “drop reasons”, helping to easily identify and trace packet loss in the stack is steadily increasing. Reason codes are now also provided when TCP RST packets are generated (v6.10).&#xA;Protocols&#xA;The protocol development wasn’t particularly active in 2024. 
As we close off the year 3 large protocol patch sets are being actively reviewed, but let us not steal 2025’s thunder, and limit ourselves to changes present in Linus’s tree by the end of 2024.&#xA;&#xA;AF\UNIX socket family has a new garbage collection algorithm (v6.10). Since AF\UNIX supports file descriptor passing, sockets can hold references to each other, forming reference cycles etc. The old home grown algorithm which was a constant source of bugs has been replaced by one with more theoretical backing (Tarjan’s algorithm).&#xA;&#xA;TCP SYN cookie generation and validation can now be performed from the TC subsystem hooks, enabling scaling out SYN flood handling across multiple machines (v6.9). User space can peek into data queued to a TCP socket at a specified offset (v6.10). It is also now possible to set min\rto for all new sockets using a sysctl, a patch which was reportedly maintained downstream by multiple hyperscalers for years (v6.11).&#xA;&#xA;UDP segmentation now works even if the underlying device doesn’t support checksum offload, e.g. TUN/TAP (v6.11). 
A new hash table was added for connected UDP sockets (4-tuple based), significantly speeding-up connected socket lookup (v6.13).&#xA;&#xA;MPTCP gained TCP\NOTSENT\_LOWAT support (v6.9), and automatic tracking of destinations which blackhole MPTCP traffic (6.12).&#xA;&#xA;IPsec stack now adheres to RFC 4301 when it comes to forwarding ICMP Error messages (v6.9).&#xA;&#xA;Bonding driver supports independent control state machine in addition to the traditional coupled one, per IEEE 802.1AX-2008 5.4.15 (v6.9).&#xA;&#xA;The GTP protocol gained IPv6 support (v6.10).&#xA;&#xA;The High-availability Seamless Redundancy (HSR) protocol implementation gained the ability to work as a proxy node connecting non-HSR capable node to an HSR network (RedBOX mode) (v6.11).&#xA;&#xA;The netconsole driver can attach arbitrary metadata to the log messages (v6.9).&#xA;&#xA;The work on making Netlink easier to interface with in modern languages continued. The Netlink protocol descriptions in YAML can now express Netlink “polymorphism” (v6.9), i.e. a situation where parsing of one attribute depends on the value of another attribute (e.g. link type determines how link attributes are parsed). 7 new specs have been added, as well as a lot of small spec and code generation improvements. Sadly we still only have bindings/codegen for C, C++ and Python.&#xA;Device APIs&#xA;The biggest addition to the device-facing APIs in 2024 was the HW traffic shaping interface (v6.13). Over the years we have accumulated a plethora of single-vendor, single-use case rate control APIs. The new API promises to express most use cases, ultimately unifying the configuration from the user perspective. The immediate use for the new API is rate limiting traffic from a group of Tx queues. Somewhat related to this work was the revamp of the RSS context API which allows directing Rx traffic to a group of queues (v6.11, v6.12, v6.13). 
Together the HW rate limiting and RSS context APIs will hopefully allow container networking to leverage HW capabilities, without the need for complex full offloads.&#xA;&#xA;A new API for reporting device statistics has been created (qstat) within the netdev netlink family (v6.9). It allows reporting more detailed driver-level stats than old interfaces, and breaking down the stats by Rx/Tx queue.&#xA;&#xA;Packet processing in presence of TC classifier offloads has been sped up, the software processing is now fully skipped if all rules are installed in HW-only mode (v6.10).&#xA;&#xA;Ethtool gained support for flashing firmware to SFP modules, and configuring thresholds used by automatic IRQ moderation (v6.11). The most significant change to ethtool APIs in 2024 was, however, the ability to interact with multiple PHYs for a single network interface (v6.12).&#xA;&#xA;Work continues on adding configuration interfaces for supplying power over network wiring. Ethtool APIs have been extended with Power over Ethernet (PoE) support (v6.10). The APIs have been extended to allow reporting more information about the devices and failure reasons, as well as setting power limits (v6.11).&#xA;&#xA;Configuration of Energy Efficient Ethernet is being reworked because the old API did not have enough bits to cover new link modes (2.5GE, 5GE), but we also used this as an opportunity to share more code between drivers (especially those using phylib), and encourage more uniform behavior (v6.9).&#xA;Testing&#xA;2024 was the year of improving our testing. We spent the previous winter break building out an automated testing system, and have been running the full suite of networking selftests on all code merged since January. 
The pre-merge tests are catching roughly one bug a day.&#xA;&#xA;We added a handful of simple libraries and infrastructure for writing tests in Python, crucially allowing easy use of Netlink YAML bindings, and supporting tests for NIC drivers (v6.10).&#xA;&#xA;Later in the year we added native integration of packetdrill tests into kselftest, and started importing batches of tests from the packetdrill library (v6.12).&#xA;Community and process&#xA;The maintainers, developers and  community members have met at two conferences, the netdev track at Linux Plumbers and netconf in Vienna, and the netdev.conf 0x18 conference in Santa Clara.&#xA;&#xA;We have removed the historic requirement for special formatting of multi-line comments in netdev (although it is still the preferred style), documented our guidance on the use of automatic resource cleanup, as well as sending cleanup patches (such as “fixing” checkpatch warnings in existing code).&#xA;&#xA;In April, we announced the redefinition of the “Supported” status for NIC drivers, to try to nudge vendors towards more collaboration and better testing. Whether this change has the desired effect remains to be seen.&#xA;&#xA;Last but not least Andrew Lunn and Simon Horman have joined the netdev maintainer group.&#xA;&#xA;Full list of networking PRs this year (links)&#xA;6.9: https://lore.kernel.org/20240312042504.1835743-1-kuba@kernel.org&#xA;6.10: https://lore.kernel.org/20240514231155.1004295-1-kuba@kernel.org&#xA;6.11: https://lore.kernel.org/20240716152031.1288409-1-kuba@kernel.org&#xA;6.12: https://lore.kernel.org/20240915172730.2697972-1-kuba@kernel.org&#xA;6.13: https://lore.kernel.org/20241119161923.29062-1-pabeni@redhat.com&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p><em>Developments in Linux kernel networking accomplished by many excellent developers and as remembered by Andrew L, Eric D, Jakub K and Paolo A.</em></p>

<p>Another busy year has passed so let us punctuate the never ending stream of development with a retrospective of our accomplishments over the last 12 months. The previous, 2023 retrospective covered changes from Linux v6.3 to v6.8; for 2024 we will cover Linux v6.9 to v6.13, one fewer, as Linux releases don’t align with calendar years. We will focus on the work happening directly on the netdev mailing list, having neither the space nor the expertise to do justice to developments within sub-subsystems like WiFi, Bluetooth, BPF etc.</p>

<h2 id="core">Core</h2>

<p>After months of work and many patch revisions we have finally merged support for Device Memory TCP, which allows TCP payloads to be placed directly in accelerator (GPU, TPU, etc.) or user space memory while still using the kernel stack for all protocol (header) processing (v6.12). The immediate motivation for this work is obviously the GenAI boom, but some of the components built to enable Device Memory TCP, for example queue control API (v6.10), should be more broadly applicable.</p>

<p>The second notable area of development was busy polling. Additions to the epoll API allow enabling and configuring network busy polling on a per-epoll-instance basis, making the feature far easier to deploy in a single application (v6.9). Even more significant was the addition of a NAPI suspension mechanism which allows for efficient and automatic switching between busy polling and IRQ-driven operation, as most real life applications are not constantly running under highest load (v6.12). Once again the work was preceded by paying off technical debt, it is now possible to configure individual NAPI instances rather than an entire network interface (v6.13).</p>

<p>Work on relieving the rtnl_lock pressure has continued throughout the year. The rtnl_lock is often mentioned as one of the biggest global locks in the kernel, as it protects all of the network configuration and state. The efforts can be divided into two broad categories – converting read operations to rely on RCU protection or other fine grained locking (v6.9, v6.10), and splitting the lock into per-network namespace locks (preparations for which started in v6.13).</p>

<p>Following discussions during last year’s LPC, the Real Time developers have contributed changes which make network processing more RT-friendly by allowing all packet processing to be executed in dedicated threads, instead of the softirq thread (v6.10). They also replaced implicit Bottom Half protection (the fact that code in BH context can’t be preempted, or migrated between CPUs) with explicit local locks (v6.11).</p>

<p>The routing stack has seen a number of small additions for ECMP forwarding, which underpins all modern datacenter network fabrics. ECMP routing can now maintain per-path statistics to allow detecting unbalanced use of paths (v6.9), and to reseed the hashing key to remediate the poor traffic distribution (v6.11). The weights used in ECMP’s consistent hashing have been widened from 8 bits to 16 bits (v6.12).</p>

<p>The ability to schedule sending packets at a particular time in the future has been extended to survive network namespace traversal (v6.9), and now supports using the TAI clock as a reference (v6.11). We also gained the ability to explicitly supply the timestamp ID via a cmsg during a sendmsg call (v6.13).</p>

<p>The number of “drop reasons”, helping to easily identify and trace packet loss in the stack is steadily increasing. Reason codes are now also provided when TCP RST packets are generated (v6.10).</p>
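<p>The reasons are visible through the skb:kfree_skb tracepoint, e.g. via tracefs (a sketch; requires root and a tracefs mount):</p>

<pre><code># watch packet-drop reasons live via the kfree_skb tracepoint
cd /sys/kernel/debug/tracing
echo 1 &gt; events/skb/kfree_skb/enable
cat trace_pipe   # each drop event carries a reason field, e.g. NO_SOCKET
</code></pre>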

<h2 id="protocols">Protocols</h2>

<p>The protocol development wasn’t particularly active in 2024. As we close off the year 3 large protocol patch sets are being actively reviewed, but let us not steal 2025’s thunder, and limit ourselves to changes present in Linus’s tree by the end of 2024.</p>

<p>AF_UNIX socket family has a new garbage collection algorithm (v6.10). Since AF_UNIX supports file descriptor passing, sockets can hold references to each other, forming reference cycles etc. The old home grown algorithm which was a constant source of bugs has been replaced by one with more theoretical backing (Tarjan’s algorithm).</p>

<p>TCP SYN cookie generation and validation can now be performed from the TC subsystem hooks, enabling scaling out SYN flood handling across multiple machines (v6.9). User space can peek into data queued to a TCP socket at a specified offset (v6.10). It is also now possible to set min_rto for all new sockets using a sysctl, a patch which was reportedly maintained downstream by multiple hyperscalers for years (v6.11).</p>
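<p>A hedged sketch of the new knob: assuming it landed as the <code>net.ipv4.tcp_rto_min_us</code> sysctl, lowering the minimum RTO for all new sockets would look like:</p>

<pre><code># lower min RTO to 5 ms for all sockets created from now on
# (the sysctl name here is an assumption based on the v6.11 work)
sysctl -w net.ipv4.tcp_rto_min_us=5000
</code></pre>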

<p>UDP segmentation now works even if the underlying device doesn’t support checksum offload, e.g. TUN/TAP (v6.11). A new hash table was added for connected UDP sockets (4-tuple based), significantly speeding up connected socket lookup (v6.13).</p>

<p>MPTCP gained TCP_NOTSENT_LOWAT support (v6.9), and automatic tracking of destinations which blackhole MPTCP traffic (v6.12).</p>

<p>IPsec stack now adheres to RFC 4301 when it comes to forwarding ICMP Error messages (v6.9).</p>

<p>Bonding driver supports independent control state machine in addition to the traditional coupled one, per IEEE 802.1AX-2008 5.4.15 (v6.9).</p>

<p>The GTP protocol gained IPv6 support (v6.10).</p>

<p>The High-availability Seamless Redundancy (HSR) protocol implementation gained the ability to work as a proxy node connecting non-HSR-capable nodes to an HSR network (RedBOX mode) (v6.11).</p>

<p>The netconsole driver can attach arbitrary metadata to the log messages (v6.9).</p>

<p>The work on making Netlink easier to interface with in modern languages continued. The Netlink protocol descriptions in YAML can now express Netlink “polymorphism” (v6.9), i.e. a situation where parsing of one attribute depends on the value of another attribute (e.g. link type determines how link attributes are parsed). 7 new specs have been added, as well as a lot of small spec and code generation improvements. Sadly we still only have bindings/codegen for C, C++ and Python.</p>

<h2 id="device-apis">Device APIs</h2>

<p>The biggest addition to the device-facing APIs in 2024 was the HW traffic shaping interface (v6.13). Over the years we have accumulated a plethora of single-vendor, single-use case rate control APIs. The new API promises to express most use cases, ultimately unifying the configuration from the user perspective. The immediate use for the new API is rate limiting traffic from a group of Tx queues. Somewhat related to this work was the revamp of the RSS context API which allows directing Rx traffic to a group of queues (v6.11, v6.12, v6.13). Together the HW rate limiting and RSS context APIs will hopefully allow container networking to leverage HW capabilities, without the need for complex full offloads.</p>

<p>A new API for reporting device statistics has been created (qstat) within the netdev netlink family (v6.9). It allows reporting more detailed driver-level stats than old interfaces, and breaking down the stats by Rx/Tx queue.</p>

<p>Packet processing in the presence of TC classifier offloads has been sped up; the software processing is now fully skipped if all rules are installed in HW-only mode (v6.10).</p>

<p>Ethtool gained support for flashing firmware to SFP modules, and configuring thresholds used by automatic IRQ moderation (v6.11). The most significant change to ethtool APIs in 2024 was, however, the ability to interact with multiple PHYs for a single network interface (v6.12).</p>

<p>Work continues on adding configuration interfaces for supplying power over network wiring. Ethtool APIs have been extended with Power over Ethernet (PoE) support (v6.10). The APIs have been extended to allow reporting more information about the devices and failure reasons, as well as setting power limits (v6.11).</p>

<p>Configuration of Energy Efficient Ethernet is being reworked because the old API did not have enough bits to cover new link modes (2.5GE, 5GE), but we also used this as an opportunity to share more code between drivers (especially those using phylib), and encourage more uniform behavior (v6.9).</p>

<h2 id="testing">Testing</h2>

<p>2024 was the year of improving our testing. We spent the previous winter break building out an automated testing system, and have been running the full suite of networking selftests on all code merged since January. The pre-merge tests are catching roughly one bug a day.</p>

<p>We added a handful of simple libraries and infrastructure for writing tests in Python, crucially allowing easy use of Netlink YAML bindings, and supporting tests for NIC drivers (v6.10).</p>

<p>Later in the year we added native integration of packetdrill tests into kselftest, and started importing batches of tests from the packetdrill library (v6.12).</p>

<h2 id="community-and-process">Community and process</h2>

<p>The maintainers, developers and community members have met at two conferences, the netdev track at <a href="https://www.youtube.com/@LinuxPlumbersConference" rel="nofollow">Linux Plumbers</a> and <a href="https://netdev.bots.linux.dev/netconf/2024/index.html" rel="nofollow">netconf</a> in Vienna, and the <a href="https://netdevconf.info/0x18/" rel="nofollow">netdev.conf 0x18</a> conference in Santa Clara.</p>

<p>We have <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=82b8000c28b56b014ce52a1f1581bef4af148681" rel="nofollow">removed</a> the historic requirement for special formatting of multi-line comments in netdev (although it is still the preferred style), documented our guidance on the use of <a href="https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#using-device-managed-and-cleanup-h-constructs" rel="nofollow">automatic resource cleanup</a>, as well as sending <a href="https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#clean-up-patches" rel="nofollow">cleanup patches</a> (such as “fixing” checkpatch warnings in existing code).</p>

<p>In April, we <a href="https://lore.kernel.org/20240425114200.3effe773@kernel.org" rel="nofollow">announced</a> the redefinition of the “Supported” status for NIC drivers, to try to nudge vendors towards more collaboration and better testing. Whether this change has the desired effect remains to be seen.</p>

<p>Last but not least, Andrew Lunn and Simon Horman have joined the netdev maintainer group.</p>

<h3 id="full-list-of-networking-prs-this-year-links">Full list of networking PRs this year (links)</h3>

<p>6.9: <a href="https://lore.kernel.org/20240312042504.1835743-1-kuba@kernel.org" rel="nofollow">https://lore.kernel.org/20240312042504.1835743-1-kuba@kernel.org</a>
6.10: <a href="https://lore.kernel.org/20240514231155.1004295-1-kuba@kernel.org" rel="nofollow">https://lore.kernel.org/20240514231155.1004295-1-kuba@kernel.org</a>
6.11: <a href="https://lore.kernel.org/20240716152031.1288409-1-kuba@kernel.org" rel="nofollow">https://lore.kernel.org/20240716152031.1288409-1-kuba@kernel.org</a>
6.12: <a href="https://lore.kernel.org/20240915172730.2697972-1-kuba@kernel.org" rel="nofollow">https://lore.kernel.org/20240915172730.2697972-1-kuba@kernel.org</a>
6.13: <a href="https://lore.kernel.org/20241119161923.29062-1-pabeni@redhat.com" rel="nofollow">https://lore.kernel.org/20241119161923.29062-1-pabeni@redhat.com</a></p>
]]></content:encoded>
      <author>Jakub Kicinski</author>
      <guid>https://people.kernel.org/read/a/um69vfk877</guid>
      <pubDate>Tue, 07 Jan 2025 18:25:08 +0000</pubDate>
    </item>
    <item>
      <title>Colliding with the SHA prefix of Linux&#39;s initial Git commit</title>
      <link>https://people.kernel.org/kees/colliding-with-the-sha-prefix-of-linuxs-initial-git-commit</link>
      <description>&lt;![CDATA[Or, how to break all the tools that parse the &#34;Fixes:&#34; tag&#xA;Kees Cook&#xA;&#xA;There was a recent discussion about how Linux&#39;s &#34;Fixes&#34; tag, which traditionally uses the 12 character commit SHA prefix, has an ever increasing chance of collisions. There are already 11-character collisions, and Geert wanted to raise the minimum short id to 16 characters. This was met with push-back for various reasons. One aspect that bothered me was some people still treating this like a theoretical &#34;maybe in the future&#34; problem. To clear up that problem, I generated a 12-character prefix collision against the start of Git history, commit 1da177e4c3f4 (&#34;Linux-2.6.12-rc2&#34;), which shows up in &#34;Fixes&#34; tags very often:&#xA;&#xA;$ git log --no-merges --oneline --grep &#39;Fixes: 1da177e4c3f4&#39; | wc -l&#xA;590&#xA;&#xA;Tools like linux-next&#39;s &#34;Fixes tag checker&#34;, the Linux CNA&#39;s commit parser, and my own CVE lifetime analysis scripts do programmatic analysis of the &#34;Fixes&#34; tag and had no support for collisions (even shorter existing collisions).&#xA;&#xA;So, in an effort to fix these tools, I broke them with commit 1da177e4c3f4 (&#34;docs: git SHA prefixes are for humans&#34;):&#xA;&#xA;$ git show 1da177e4c3f4&#xA;error: short object ID 1da177e4c3f4 is ambiguous&#xA;hint: The candidates are:&#xA;hint:   1da177e4c3f41 commit 2005-04-16 - Linux-2.6.12-rc2&#xA;hint:   1da177e4c3f47 commit 2024-12-14 - docs: git SHA prefixes are for humans&#xA;&#xA;This is not yet in the upstream Linux tree, for fear of breaking countless other tools out in the wild. But it can serve as a test commit for those that want to get this fixed ahead of any future collisions (or this commit actually landing).&#xA;&#xA;Lots of thanks to the lucky-commit project, which will grind trailing commit message whitespace in an attempt to find collisions. 
Doing the 12-character prefix collision took about 6 hours on my OpenCL-enabled RTX 3080 GPU.&#xA;&#xA;For any questions, comments, etc, see this thread.&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<h3 id="or-how-to-break-all-the-tools-that-parse-the-fixes-tag">Or, how to break all the tools that parse the “Fixes:” tag</h3>

<h2 id="kees-cook-mailto-kees-kernel-org"><a href="mailto:kees@kernel.org" rel="nofollow">Kees Cook</a></h2>

<p>There was a recent discussion about how Linux&#39;s “Fixes” tag, which traditionally uses the 12 character commit SHA prefix, has an <a href="https://lore.kernel.org/lkml/cover.1733421037.git.geert+renesas@glider.be/" rel="nofollow">ever increasing chance of collisions</a>. There are already 11-character collisions, and Geert wanted to raise the minimum short id to 16 characters. This was met with push-back for various reasons. One aspect that bothered me was some people still treating this like a theoretical “maybe in the future” problem. To clear up that problem, I generated a 12-character prefix collision against the start of Git history, commit <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1da177e4c3f41524e886b7f1b8a0c1fc7321cac2" rel="nofollow">1da177e4c3f4 (“Linux-2.6.12-rc2”)</a>, which shows up in “Fixes” tags very often:</p>

<pre><code>$ git log --no-merges --oneline --grep &#39;Fixes: 1da177e4c3f4&#39; | wc -l
590
</code></pre>
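<p>How likely are such prefix collisions in the first place? A 12-hex-digit prefix has 16<sup>12</sup> ≈ 2.8×10<sup>14</sup> possible values, and the standard birthday approximation p ≈ 1 − e<sup>−n(n−1)/2N</sup> gives a feel for the odds as the object database grows. A quick sketch (the object counts below are illustrative orders of magnitude, not measurements of Linux's repository):</p>

```python
import math

def collision_probability(n_objects: int, hex_digits: int = 12) -> float:
    """Birthday-bound probability that any two of n_objects share a prefix."""
    space = 16 ** hex_digits
    return 1 - math.exp(-n_objects * (n_objects - 1) / (2 * space))

# Linux's history contains on the order of millions of objects
# (commits, trees, blobs); these counts are just for illustration.
for n in (1_000_000, 5_000_000, 10_000_000):
    print(f"{n:>10,} objects: p(12-char collision) ~ {collision_probability(n):.1%}")
```

<p>The probability of at least one collision grows roughly with the square of the number of objects, which is why raising the minimum short id keeps coming up.</p>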

<p>Tools like <code>linux-next</code>&#39;s “<a href="https://github.com/kees/kernel-tools/blob/trunk/helpers/check_fixes" rel="nofollow">Fixes tag checker</a>”, the Linux CNA&#39;s <a href="https://git.kernel.org/pub/scm/linux/security/vulns.git/tree/scripts/dyad" rel="nofollow">commit parser</a>, and my own <a href="https://github.com/kees/ubuntu-cve-tracker/blob/lifetime/scripts/report-kernel-bug-lifetime.py" rel="nofollow">CVE lifetime analysis scripts</a> do programmatic analysis of the “Fixes” tag and had no support for collisions (even shorter existing collisions).</p>

<p>So, in an effort to fix <a href="https://github.com/kees/kernel-tools/commit/5bf6a1e71df59a230ea0e138a82cdf3c5e8f349d" rel="nofollow">these</a> <a href="https://git.kernel.org/pub/scm/linux/security/vulns.git/commit/scripts/dyad?id=3e558e5c01f76cf3246cb82b1f32d9c6a7937c1e" rel="nofollow">tools</a>, I broke them with commit <a href="https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=dev/collide/v6.13-rc2/12-char&amp;id=1da177e4c3f47e316f6a4fdee9ae6a714293eed7" rel="nofollow">1da177e4c3f4 (“docs: git SHA prefixes are for humans”)</a>:</p>

<pre><code>$ git show 1da177e4c3f4
error: short object ID 1da177e4c3f4 is ambiguous
hint: The candidates are:
hint:   1da177e4c3f41 commit 2005-04-16 - Linux-2.6.12-rc2
hint:   1da177e4c3f47 commit 2024-12-14 - docs: git SHA prefixes are for humans
</code></pre>

<p>This is not yet in the upstream Linux tree, for fear of breaking countless other tools out in the wild. But it can serve as a test commit for those that want to get this fixed ahead of any future collisions (or this commit actually landing).</p>

<p>Lots of thanks to the <a href="https://github.com/not-an-aardvark/lucky-commit" rel="nofollow">lucky-commit</a> project, which will grind trailing commit message whitespace in an attempt to find collisions. Doing the 12-character prefix collision took about 6 hours on my <a href="https://developer.nvidia.com/opencl" rel="nofollow">OpenCL</a>-enabled RTX 3080 GPU.</p>

<p>For any questions, comments, etc, see <a href="https://fosstodon.org/@kees/113743400654323146" rel="nofollow">this thread</a>.</p>
]]></content:encoded>
      <author>kees</author>
      <guid>https://people.kernel.org/read/a/gah065bwp2</guid>
      <pubDate>Mon, 30 Dec 2024 19:18:53 +0000</pubDate>
    </item>
    <item>
      <title>New ARM32 Security Features in v6.10</title>
      <link>https://people.kernel.org/linusw/new-arm32-security-features-in-v6-10</link>
<description>&lt;![CDATA[In kernel v6.10 we managed to merge two security hardening patches to the ARM32 architecture:&#xA;&#xA;PAN for LPAE `CONFIG_CPU_TTBR0_PAN`&#xA;KCFI on ARM32 `CONFIG_CFI_CLANG`&#xA;&#xA;As of kernel v6.12 these seem sufficiently stable for users such as distributions and embedded systems to look closer at. Below are the technical details!&#xA;&#xA;A good rundown of these and other historically interesting security features can be found in Russell Currey&#39;s abridged history of kernel hardening which sums up what has been done up to now in a very approachable form.&#xA;&#xA;PAN for LPAE&#xA;&#xA;PAN is an abbreviation for the somewhat grammatically incorrect Privileged Access Never.&#xA;&#xA;The fundamental idea with PAN on different architectures is to disable any access from kernelspace to the userspace memory, unless explicitly requested using the dedicated uaccess helpers copy_from_user() and copy_to_user(). Attackers may want to compromise userspace from the kernel to access things such as keys, and we want to make this hard for them, and in general it protects userspace memory from corruption from kernelspace.&#xA;&#xA;In some architectures such as S390 the userspace memory is completely separate from the kernel memory, but most simpler CPUs will just map the userspace into low memory (address 0x00000000 and forth) and there it is always accessible from the kernel.&#xA;&#xA;The ARM32 kernel has for a few years had a config option named `CONFIG_CPU_SW_DOMAIN_PAN` which uses a hardware feature whereby userspace memory is made inaccessible from kernelspace. 
There is a special bit in the page descriptors saying that a certain page or segment belongs to userspace, so this is possible for the hardware to deduce.&#xA;&#xA;For modern ARM32 systems with large memories configured to use LPAE nothing like PAN was available: this version of the MMU simply did not implement a PAN option.&#xA;&#xA;As of the patch originally developed by Catalin Marinas, we deploy a scheme that will use the fact that LPAE has two separate translation table base registers (TTBR:s): one for userspace (TTBR0) and one for kernelspace (TTBR1).&#xA;&#xA;By simply disabling the use of any translations (page walks) on TTBR0 when executing in kernelspace - unless explicitly enabled in the uaccess helpers - we achieve the same effect as PAN. This is now turned on by default for LPAE configurations.&#xA;&#xA;KCFI on ARM32&#xA;&#xA;Kernel Control Flow Integrity is a &#34;forward edge control flow checker&#34;, which in practice means that the compiler will store a hash of the function prototype in memory right before every function that can be called indirectly, and verify it at each indirect call site, so that an attacker cannot easily redirect a call to an unintended target.&#xA;&#xA;KCFI is currently only implemented in the LLVM CLANG compiler, so the kernel needs to be compiled using CLANG. This is typically achieved by passing the build flag `LLVM=1` to the kernel build. As the CLANG compiler is universal for all targets, the build system will figure out the rest.&#xA;&#xA;Further, to support KCFI a fairly recent version of CLANG is needed. The kernel build will check if the compiler is new enough to support the option `-fsanitize=kcfi` else the option will be disabled.&#xA;&#xA;The patch set is pretty complex but gives you an overview of how the feature was implemented on ARM32. 
It involved patching the majority of functions written in assembly and called from C with the special `SYM_TYPED_FUNC_START() and SYM_FUNC_END()` macros, inserting KCFI hashes also before functions written in assembly.&#xA;&#xA;The overhead of this feature seems to be small so I recommend checking it out if you are able to use the CLANG compiler.]]&gt;</description>
      <content:encoded><![CDATA[<p>In kernel v6.10 we managed to merge two security hardening patches to the ARM32 architecture:</p>
<ul><li>PAN for LPAE <code>CONFIG_CPU_TTBR0_PAN</code></li>
<li>KCFI on ARM32 <code>CONFIG_CFI_CLANG</code></li></ul>

<p>As of kernel v6.12 these seem sufficiently stable for users such as distributions and embedded systems to take a closer look at. Below are the technical details!</p>

<p>A good rundown of these and other historically interesting security features can be found in Russell Currey&#39;s <a href="https://www.youtube.com/watch?v=n7oUA2b15P8" rel="nofollow">abridged history of kernel hardening</a> which sums up what has been done up to now in a very approachable form.</p>

<h2 id="pan-for-lpae">PAN for LPAE</h2>

<p>PAN is an abbreviation for the somewhat grammatically incorrect <em>Privileged Access Never</em>.</p>

<p>The fundamental idea with PAN on different architectures is to disable any access from kernelspace to userspace memory, unless explicitly requested using the dedicated uaccess helpers such as <strong>copy_from_user()</strong> and <strong>copy_to_user()</strong>. Attackers may want to compromise userspace from the kernel to access things such as keys, and we want to make this hard for them; in general it also protects userspace memory from corruption from kernelspace.</p>

<p>On some architectures, such as S390, userspace memory is completely separate from kernel memory, but most simpler CPUs will just map userspace into low memory (address 0x00000000 and up), where it is always accessible from the kernel.</p>

<p>The ARM32 kernel has for a few years had a config option named <code>CONFIG_CPU_SW_DOMAIN_PAN</code> which uses a hardware feature whereby userspace memory is made inaccessible from kernelspace. There is a special bit in the page descriptors saying that a certain page or segment belongs to userspace, so the hardware can deduce this.</p>

<p>For modern ARM32 systems with large memories configured to use LPAE, nothing like PAN was available: this version of the MMU simply did not implement a PAN option.</p>

<p>As of the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm/Kconfig?id=7af5b901e84743c608aae90cb0e429702812c324" rel="nofollow">patch</a> originally developed by Catalin Marinas, we deploy a scheme that will use the fact that LPAE has two separate <em>translation table base registers</em> (TTBR:s): one for userspace (TTBR0) and one for kernelspace (TTBR1).</p>

<p>By simply disabling the use of any translations (page walks) on TTBR0 when executing in kernelspace – unless explicitly enabled in the uaccess helpers such as <strong>copy_from_user()</strong> and <strong>copy_to_user()</strong> – we achieve the same effect as PAN. This is now turned on by default for LPAE configurations.</p>

<h2 id="kcfi-on-arm32">KCFI on ARM32</h2>

<p><a href="https://lwn.net/Articles/898040/" rel="nofollow">Kernel Control Flow Integrity</a> (KCFI) is a “forward edge control flow checker”: in practice, the compiler stores a hash of the function prototype in memory right before every function that can be called indirectly, and verifies that hash at each indirect call site, so that an attacker who corrupts a function pointer cannot easily redirect execution to a function with a different prototype.</p>
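<p>A toy model of the idea, purely for illustration — the real mechanism hashes the C type at compile time and emits the check as machine code before each indirect call, while everything below (the hash function, the registry, the call wrapper) is invented to mimic that behavior:</p>

```python
import hashlib

def type_hash(prototype: str) -> int:
    """Stand-in for the compiler's hash of a function's type signature."""
    return int.from_bytes(hashlib.sha256(prototype.encode()).digest()[:4], "little")

# "Compile time": record each potential call target's prototype hash,
# the way KCFI places the hash just before the function's code.
registry = {}
def register(fn, prototype):
    registry[fn] = type_hash(prototype)
    return fn

def indirect_call(fn, expected_prototype, *args):
    """Call-site check: refuse the call if the target's hash doesn't match."""
    if registry.get(fn) != type_hash(expected_prototype):
        raise RuntimeError("CFI failure: prototype hash mismatch")
    return fn(*args)

add = register(lambda a, b: a + b, "int (int, int)")
evil = register(lambda: 0xdead, "int (void)")

print(indirect_call(add, "int (int, int)", 2, 3))  # legitimate call succeeds
try:
    indirect_call(evil, "int (int, int)", 2, 3)    # wrong prototype: rejected
except RuntimeError as e:
    print(e)
```

<p>The point of hashing the prototype rather than the address is that a corrupted function pointer can only be redirected to functions of the exact same type, drastically shrinking the set of useful attack targets.</p>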

<p>KCFI is currently only implemented in the LLVM CLANG compiler, so the kernel needs to be compiled using CLANG. This is typically achieved by passing the build flag <code>LLVM=1</code> to the kernel build. As the CLANG compiler is universal for all targets, the build system will figure out the rest.</p>

<p>Further, to support KCFI a fairly recent version of CLANG is needed. The kernel build will check whether the compiler is new enough to support the option <code>-fsanitize=kcfi</code>; otherwise the option will be disabled.</p>

<p>The <a href="https://lore.kernel.org/linux-arm-kernel/20240423-arm32-cfi-v8-0-08f10f5d9297@linaro.org/" rel="nofollow">patch set</a> is pretty complex but gives you an overview of how the feature was implemented on ARM32. It involved patching the majority of functions written in assembly and called from C with the special <code>SYM_TYPED_FUNC_START()</code> and <code>SYM_FUNC_END()</code> macros, inserting KCFI hashes also before functions written in assembly.</p>

<p>The overhead of this feature seems to be small so I recommend checking it out if you are able to use the CLANG compiler.</p>
]]></content:encoded>
      <author>linusw</author>
      <guid>https://people.kernel.org/read/a/b5yhtdlfic</guid>
      <pubDate>Wed, 04 Dec 2024 07:29:57 +0000</pubDate>
    </item>
    <item>
      <title>How to use the new counted_by attribute in C (and Linux)</title>
      <link>https://people.kernel.org/gustavoars/how-to-use-the-new-counted_by-attribute-in-c-and-linux</link>
<description>&lt;![CDATA[&#xA;&#xA;The counted_by attribute&#xA;&#xA;The counted_by attribute was introduced in Clang-18 and will soon be available in GCC-15. Its purpose is to associate a flexible-array member with a struct member that will hold the number of elements in this array at some point at run-time. This association is critical for enabling runtime bounds checking via the array bounds sanitizer and the __builtin_dynamic_object_size() built-in function. In user-space, this extra level of security is enabled by -D_FORTIFY_SOURCE=3. Therefore, using this attribute correctly enhances C codebases with runtime bounds-checking coverage on flexible-array members.&#xA;&#xA;Here is an example of a flexible array annotated with this attribute:&#xA;&#xA;struct bounded_flex_struct {&#xA;    ...&#xA;    size_t count;&#xA;    struct foo flex_array[] __attribute__((__counted_by__(count)));&#xA;};&#xA;&#xA;In the above example, count is the struct member that will hold the number of elements of the flexible array at run-time. We will call this struct member the counter.&#xA;&#xA;In the Linux kernel, this attribute facilitates bounds-checking coverage through fortified APIs such as the memcpy() family of functions, which internally use __builtin_dynamic_object_size() (CONFIG_FORTIFY_SOURCE). 
As well as through the array-bounds sanitizer (CONFIG\UBSAN\BOUNDS).&#xA;&#xA;The \\counted\by() macro&#xA;&#xA;In the kernel we wrap the countedby attribute in the _countedby() macro, as shown below.&#xA;&#xA;if _hasattribute(_countedby)&#xA;define countedby(member)  attribute((countedby(member)))&#xA;else&#xA;define countedby(member)&#xA;endif&#xA;c8248faf3ca27 (&#34;Compiler Attributes: counted\by: Adjust name...&#34;)&#xA;&#xA;And with this we have been annotating flexible-array members across the whole kernel tree over the last year.&#xA;&#xA;diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.h b/drivers/net/ethernet/chelsio/cxgb4/sched.h&#xA;index 5f8b871d79afac..6b3c778815f09e 100644&#xA;--- a/drivers/net/ethernet/chelsio/cxgb4/sched.h&#xA;+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.h&#xA;@@ -82,7 +82,7 @@ struct schedclass {&#xA; &#xA; struct schedtable {      / per port scheduling table /&#xA; &#x9;u8 schedsize;&#xA;struct schedclass tab[];&#xA;struct schedclass tab[] countedby(schedsize);&#xA; };&#xA;ceba9725fb45 (&#34;cxgb4: Annotate struct sched\table with ...&#34;)&#xA;&#xA;However, as we are about to see, not all _countedby() annotations are always as straightforward as the one above.&#xA;&#xA;\\counted\by() annotations in the kernel&#xA;&#xA;There are a number of requirements to properly use the countedby attribute. One crucial requirement is that the counter must be initialized before the first reference to the flexible-array member. Another requirement is that the array must always contain at least as many elements as indicated by the counter. 
Below you can see an example of a kernel patch addressing these requirements.&#xA;&#xA;diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c&#xA;index dac7eb77799bd1..68960ae9898713 100644&#xA;--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c&#xA;+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c&#xA;@@ -33,7 +33,7 @@ struct brcmffwehqueueitem {&#xA; &#x9;u8 ifaddr[ETHALEN];&#xA; &#x9;struct brcmfeventmsgbe emsg;&#xA; &#x9;u32 datalen;&#xA;u8 data[];&#xA;u8 data[] countedby(datalen);&#xA; };&#xA; &#xA; /&#xA;@@ -418,17 +418,17 @@ void brcmffwehprocessevent(struct brcmfpub drvr,&#xA; &#x9;    datalen + sizeof(eventpacket)   packetlen)&#xA; &#x9;&#x9;return;&#xA; &#xA;event = kzalloc(sizeof(event) + datalen, gfp);&#xA;event = kzalloc(structsize(event, data, datalen), gfp);&#xA; &#x9;if (!event)&#xA; &#x9;&#x9;return;&#xA; &#xA;event-  datalen = datalen;&#xA; &#x9;event-  code = code;&#xA; &#x9;event-  ifidx = eventpacket-  msg.ifidx;&#xA; &#xA; &#x9;/ use memcpy to get aligned event message /&#xA; &#x9;memcpy(&amp;event-  emsg, &amp;eventpacket-  msg, sizeof(event-  emsg));&#xA; &#x9;memcpy(event-  data, data, datalen);&#xA;event-  datalen = datalen;&#xA; &#x9;memcpy(event-  ifaddr, eventpacket-  eth.hdest, ETHALEN);&#xA; &#xA; &#x9;brcmffwehqueueevent(fweh, event);&#xA;&#xA;62d19b358088 (&#34;wifi: brcmfmac: fweh: Add \\counted\by...&#34;)&#xA;&#xA;In the patch above, datalen is the counter for the flexible-array member data. Notice how the assignment to the counter event-  datalen = datalen had to be moved to before calling memcpy(event-  data, data, datalen), this ensures the counter is initialized before the first reference to the flexible array. Otherwise, the compiler would complain about trying to write into a flexible array of size zero, due to datalen being zeroed out by a previous call to kzalloc(). 
This assignment-after-memcpy has been quite common in the Linux kernel. However, when dealing with countedby annotations, this pattern should be changed. Therefore, we have to be careful when doing these annotations. We should audit all instances of code that reference both the counter and the flexible array and ensure they meet the proper requirements.&#xA;&#xA;In the kernel, we&#39;ve been learning from our mistakes and have fixed some buggy annotations we made in the beginning. Here are a couple of bugfixes to make you aware of these issues:&#xA;&#xA;6dc445c19050 (&#34;clk: bcm: rpi: Assign -  num before accessing...&#34;)&#xA;&#xA;9368cdf90f52 (&#34;clk: bcm: dvp: Assign -  num before accessing...&#34;)&#xA;&#xA;Another common issue is when the counter is updated inside a loop. See the patch below.&#xA;&#xA;diff --git a/drivers/net/wireless/ath/wil6210/cfg80211.c b/drivers/net/wireless/ath/wil6210/cfg80211.c&#xA;index 8993028709ecfb..e8f1d30a8d73c5 100644&#xA;--- a/drivers/net/wireless/ath/wil6210/cfg80211.c&#xA;+++ b/drivers/net/wireless/ath/wil6210/cfg80211.c&#xA;@@ -892,10 +892,8 @@ static int wilcfg80211scan(struct wiphy wiphy,&#xA; &#x9;struct wil6210priv wil = wiphytowil(wiphy);&#xA; &#x9;struct wirelessdev wdev = request-  wdev;&#xA; &#x9;struct wil6210vif vif = wdevtovif(wil, wdev);&#xA;struct {&#xA;struct wmistartscancmd cmd;&#xA;u16 chnl[4];&#xA;} packed cmd;&#xA;DEFINEFLEX(struct wmistartscancmd, cmd,&#xA;channellist, numchannels, 4);&#xA; &#x9;uint i, n;&#xA; &#x9;int rc;&#xA; &#xA;@@ -977,9 +975,8 @@ static int wilcfg80211scan(struct wiphy wiphy,&#xA; &#x9;vif-  scanrequest = request;&#xA; &#x9;modtimer(&amp;vif-  scantimer, jiffies + WIL6210SCANTO);&#xA; &#xA;memset(&amp;cmd, 0, sizeof(cmd));&#xA;cmd.cmd.scantype = WMIACTIVESCAN;&#xA;cmd.cmd.numchannels = 0;&#xA;cmd-  scantype = WMIACTIVESCAN;&#xA;cmd-  numchannels = 0;&#xA; &#x9;n = min(request-  nchannels, 4U);&#xA; &#x9;for (i = 0; i &lt; n; i++) {&#xA; &#x9;&#x9;int ch = request-  channels[i]- 
 hwvalue;&#xA;@@ -991,7 +988,8 @@ static int wilcfg80211scan(struct wiphy wiphy,&#xA; &#x9;&#x9;&#x9;continue;&#xA; &#x9;&#x9;}&#xA; &#x9;&#x9;/ 0-based channel indexes /&#xA;cmd.cmd.channellist[cmd.cmd.numchannels++].channel = ch - 1;&#xA;cmd-  numchannels++;&#xA;cmd-  channellist[cmd-  numchannels - 1].channel = ch - 1;&#xA; &#x9;&#x9;wildbgmisc(wil, &#34;Scan for ch %d  : %d MHz\n&#34;, ch,&#xA; &#x9;&#x9;&#x9;     request-  channels[i]-  centerfreq);&#xA; &#x9;}&#xA;@@ -1007,16 +1005,15 @@ static int wilcfg80211scan(struct wiphy wiphy,&#xA; &#x9;if (rc)&#xA; &#x9;&#x9;goto outrestore;&#xA; &#xA;if (wil-  discoverymode &amp;&amp; cmd.cmd.scantype == WMIACTIVESCAN) {&#xA;cmd.cmd.discoverymode = 1;&#xA;if (wil-  discoverymode &amp;&amp; cmd-  scantype == WMIACTIVESCAN) {&#xA;cmd-  discoverymode = 1;&#xA; &#x9;&#x9;wildbgmisc(wil, &#34;active scan with discoverymode=1\n&#34;);&#xA; &#x9;}&#xA; &#xA; &#x9;if (vif-  mid == 0)&#xA; &#x9;&#x9;wil-  radiowdev = wdev;&#xA; &#x9;rc = wmisend(wil, WMISTARTSCANCMDID, vif-  mid,&#xA;&amp;cmd, sizeof(cmd.cmd) +&#xA;cmd.cmd.numchannels  sizeof(cmd.cmd.channellist[0]));&#xA;cmd, structsize(cmd, channellist, cmd-  numchannels));&#xA; &#xA; outrestore:&#xA; &#x9;if (rc) {&#xA;diff --git a/drivers/net/wireless/ath/wil6210/wmi.h b/drivers/net/wireless/ath/wil6210/wmi.h&#xA;index 71bf2ae27a984f..b47606d9068c8b 100644&#xA;--- a/drivers/net/wireless/ath/wil6210/wmi.h&#xA;+++ b/drivers/net/wireless/ath/wil6210/wmi.h&#xA;@@ -474,7 +474,7 @@ struct wmistartscancmd {&#xA; &#x9;struct {&#xA; &#x9;&#x9;u8 channel;&#xA; &#x9;&#x9;u8 reserved;&#xA;} channellist[];&#xA;} channellist[] _countedby(numchannels);&#xA; } packed;&#xA; &#xA; #define WMIMAXPNOSSIDNUM&#x9;(16)&#xA;&#xA;34c34c242a1b (&#34;wifi: wil6210: cfg80211: Use \\counted\by...&#34;)&#xA;&#xA;The patch above does a bit more than merely annotating the flexible array with the _countedby() macro, but that&#39;s material for a future post. 
For now, let&#39;s focus on the following excerpt.&#xA;&#xA;-&#x9;cmd.cmd.scan_type = WMI_ACTIVE_SCAN;&#xA;-&#x9;cmd.cmd.num_channels = 0;&#xA;+&#x9;cmd-&gt;scan_type = WMI_ACTIVE_SCAN;&#xA;+&#x9;cmd-&gt;num_channels = 0;&#xA; &#x9;n = min(request-&gt;n_channels, 4U);&#xA; &#x9;for (i = 0; i &lt; n; i++) {&#xA; &#x9;&#x9;int ch = request-&gt;channels[i]-&gt;hw_value;&#xA;@@ -991,7 +988,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,&#xA; &#x9;&#x9;&#x9;continue;&#xA; &#x9;&#x9;}&#xA; &#x9;&#x9;/* 0-based channel indexes */&#xA;-&#x9;&#x9;cmd.cmd.channel_list[cmd.cmd.num_channels++].channel = ch - 1;&#xA;+&#x9;&#x9;cmd-&gt;num_channels++;&#xA;+&#x9;&#x9;cmd-&gt;channel_list[cmd-&gt;num_channels - 1].channel = ch - 1;&#xA; &#x9;&#x9;wil_dbg_misc(wil, &#34;Scan for ch %d  : %d MHz\n&#34;, ch,&#xA; &#x9;&#x9;&#x9;     request-&gt;channels[i]-&gt;center_freq);&#xA; &#x9;}&#xA; ...&#xA;--- a/drivers/net/wireless/ath/wil6210/wmi.h&#xA;+++ b/drivers/net/wireless/ath/wil6210/wmi.h&#xA;@@ -474,7 +474,7 @@ struct wmi_start_scan_cmd {&#xA; &#x9;struct {&#xA; &#x9;&#x9;u8 channel;&#xA; &#x9;&#x9;u8 reserved;&#xA;-&#x9;} channel_list[];&#xA;+&#x9;} channel_list[] __counted_by(num_channels);&#xA; } __packed;&#xA;&#xA;Notice that in this case, num_channels is our counter, and it&#39;s set to zero before the for loop. Inside the for loop, the original code used this variable as an index to access the flexible array, then updated it via a post-increment, all in one line: cmd.cmd.channel_list[cmd.cmd.num_channels++]. The issue is that once channel_list was annotated with the __counted_by() macro, the compiler enforces dynamic array indexing of channel_list to stay below num_channels. Since num_channels holds a value of zero at the moment of the array access, this leads to undefined behavior and may trigger a compiler warning.&#xA;&#xA;As shown in the patch, the solution is to increment num_channels before accessing the array, and then access the array through an index adjustment below num_channels.&#xA;&#xA;Another option is to avoid using the counter as an index for the flexible array altogether. 
This can be done by using an auxiliary variable instead. See an excerpt of a patch below.&#xA;&#xA;diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h&#xA;index 38eb7ec86a1a65..21ebd70f3dcc97 100644&#xA;--- a/include/net/bluetooth/hci.h&#xA;+++ b/include/net/bluetooth/hci.h&#xA;@@ -2143,7 +2143,7 @@ struct hci_cp_le_set_cig_params {&#xA; &#x9;__le16  c_latency;&#xA; &#x9;__le16  p_latency;&#xA; &#x9;__u8    num_cis;&#xA;-&#x9;struct hci_cis_params cis[];&#xA;+&#x9;struct hci_cis_params cis[] __counted_by(num_cis);&#xA; } __packed;&#xA;&#xA;@@ -1722,34 +1717,33 @@ static int hci_le_create_big(struct hci_conn *conn, struct bt_iso_qos *qos)&#xA;&#xA; static int set_cig_params_sync(struct hci_dev *hdev, void *data)&#xA; {&#xA; ...&#xA;&#xA;+&#x9;u8 aux_num_cis = 0;&#xA; &#x9;u8 cis_id;&#xA; ...&#xA;&#xA; &#x9;for (cis_id = 0x00; cis_id &lt; 0xf0 &amp;&amp;&#xA;-&#x9;     pdu.cp.num_cis &lt; ARRAY_SIZE(pdu.cis); cis_id++) {&#xA;+&#x9;     aux_num_cis &lt; pdu-&gt;num_cis; cis_id++) {&#xA; &#x9;&#x9;struct hci_cis_params *cis;&#xA;&#xA; &#x9;&#x9;conn = hci_conn_hash_lookup_cis(hdev, NULL, 0, cig_id, cis_id);&#xA;@@ -1758,7 +1752,7 @@ static int set_cig_params_sync(struct hci_dev *hdev, void *data)&#xA;&#xA; &#x9;&#x9;qos = &amp;conn-&gt;iso_qos;&#xA;&#xA;-&#x9;&#x9;cis = &amp;pdu.cis[pdu.cp.num_cis++];&#xA;+&#x9;&#x9;cis = &amp;pdu-&gt;cis[aux_num_cis++];&#xA; &#x9;&#x9;cis-&gt;cis_id = cis_id;&#xA; &#x9;&#x9;cis-&gt;c_sdu  = cpu_to_le16(conn-&gt;iso_qos.ucast.out.sdu);&#xA; &#x9;&#x9;cis-&gt;p_sdu  = cpu_to_le16(conn-&gt;iso_qos.ucast.in.sdu);&#xA;@@ -1769,14 +1763,14 @@ static int set_cig_params_sync(struct hci_dev *hdev, void *data)&#xA; &#x9;&#x9;cis-&gt;c_rtn  = qos-&gt;ucast.out.rtn;&#xA; &#x9;&#x9;cis-&gt;p_rtn  = qos-&gt;ucast.in.rtn;&#xA; &#x9;}&#xA;+&#x9;pdu-&gt;num_cis = aux_num_cis;&#xA;&#xA; ...&#xA;&#xA;ea9e148c803b (&#34;Bluetooth: hci_conn: Use __counted_by() and...&#34;)&#xA;&#xA;Again, the entire patch does more than merely annotate the flexible-array member, but let&#39;s just focus on how aux_num_cis is used to access the flexible array pdu-&gt;cis[].&#xA;&#xA;In this case, the counter is num_cis. 
As in our previous example, the counter was originally used to directly access the flexible array: &amp;pdu.cis[pdu.cp.num_cis++]. The patch above instead introduces a new variable, aux_num_cis, to be used in place of the counter: &amp;pdu-&gt;cis[aux_num_cis++]. The counter is then updated after the loop: pdu-&gt;num_cis = aux_num_cis.&#xA;&#xA;Both solutions are acceptable, so use whichever is convenient for you. :)&#xA;&#xA;Here, you can see a recent bugfix for some buggy annotations that missed the details discussed above:&#xA;&#xA;[PATCH] wifi: iwlwifi: mvm: Fix __counted_by usage in cfg80211_wowlan_nd_*&#xA;&#xA;In a future post, I&#39;ll address the issue of annotating flexible arrays of flexible structures. Spoiler alert: don&#39;t do it!&#xA;&#xA;Latest version: How to use the new counted_by attribute in C (and Linux)]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://embeddedor.com/blog/wp-content/uploads/2024/06/Screenshot-from-2024-06-19-09-18-59-700x234.png" alt=""></p>

<h2 id="the-counted-by-attribute">The counted_by attribute</h2>

<p>The <code>counted_by</code> attribute was introduced in Clang-18 and will soon be available in GCC-15. Its purpose is to associate a <code>flexible-array member</code> with a struct member that will hold the number of elements in this array at some point at run-time. This association is critical for enabling runtime bounds checking via the <code>array bounds sanitizer</code> and the <code>__builtin_dynamic_object_size()</code> built-in function. In user-space, this extra level of security is enabled by <a href="https://developers.redhat.com/articles/2022/09/17/gccs-new-fortification-level" rel="nofollow"><code>-D_FORTIFY_SOURCE=3</code></a>. Therefore, using this attribute correctly enhances C codebases with runtime bounds-checking coverage on flexible-array members.</p>

<p>Here is an example of a flexible array annotated with this attribute:</p>

<pre><code class="language-C">struct bounded_flex_struct {
    ...
    size_t count;
    struct foo flex_array[] __attribute__((__counted_by__(count)));
};
</code></pre>

<p>In the above example, <code>count</code> is the struct member that will hold the number of elements of the flexible array at run-time. We will call this struct member the <em>counter</em>.</p>

<p>In the Linux kernel, this attribute facilitates bounds-checking coverage through fortified APIs such as the <code>memcpy()</code> family of functions, which internally use <code>__builtin_dynamic_object_size()</code> (CONFIG_FORTIFY_SOURCE), as well as through the array-bounds sanitizer (CONFIG_UBSAN_BOUNDS).</p>

<h2 id="the-counted-by-macro">The __counted_by() macro</h2>

<p>In the kernel we wrap the <code>counted_by</code> attribute in the <code>__counted_by()</code> macro, as shown below.</p>

<pre><code class="language-C">#if __has_attribute(__counted_by__)
# define __counted_by(member)  __attribute__((__counted_by__(member)))
#else
# define __counted_by(member)
#endif
</code></pre>
<ul><li><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c8248faf3ca276ebdf60f003b3e04bf764daba91" rel="nofollow">c8248faf3ca27</a> (“Compiler Attributes: counted_by: Adjust name...”)</li></ul>

<p>And with this we have been <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&amp;q=__counted_by" rel="nofollow">annotating flexible-array members</a> across the whole kernel tree over the last year.</p>

<pre><code class="language-diff">diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.h b/drivers/net/ethernet/chelsio/cxgb4/sched.h
index 5f8b871d79afac..6b3c778815f09e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sched.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.h
@@ -82,7 +82,7 @@ struct sched_class {
 
 struct sched_table {      /* per port scheduling table */
 	u8 sched_size;
-	struct sched_class tab[];
+	struct sched_class tab[] __counted_by(sched_size);
 };
</code></pre>
<ul><li><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ceba9725fb4554c3cd07d055332272208b8a052f" rel="nofollow">ceba9725fb45</a> (“cxgb4: Annotate struct sched_table with ...”)</li></ul>

<p>However, as we are about to see, not all <code>__counted_by()</code> annotations are as straightforward as the one above.</p>

<h2 id="counted-by-annotations-in-the-kernel">__counted_by() annotations in the kernel</h2>

<p>There are <a href="https://gcc.gnu.org/pipermail/gcc-patches/2024-May/653123.html" rel="nofollow">a number of requirements</a> to properly use the <code>counted_by</code> attribute. One crucial requirement is that the <em>counter</em> must be initialized before the first reference to the flexible-array member. Another requirement is that the array must always contain at least as many elements as indicated by the <em>counter</em>. Below you can see an example of a kernel patch addressing these requirements.</p>

<pre><code class="language-diff">diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
index dac7eb77799bd1..68960ae9898713 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
@@ -33,7 +33,7 @@ struct brcmf_fweh_queue_item {
 	u8 ifaddr[ETH_ALEN];
 	struct brcmf_event_msg_be emsg;
 	u32 datalen;
-	u8 data[];
+	u8 data[] __counted_by(datalen);
 };
 
 /*
@@ -418,17 +418,17 @@ void brcmf_fweh_process_event(struct brcmf_pub *drvr,
 	    datalen + sizeof(*event_packet) &gt; packet_len)
 		return;
 
-	event = kzalloc(sizeof(*event) + datalen, gfp);
+	event = kzalloc(struct_size(event, data, datalen), gfp);
 	if (!event)
 		return;
 
+	event-&gt;datalen = datalen;
 	event-&gt;code = code;
 	event-&gt;ifidx = event_packet-&gt;msg.ifidx;
 
 	/* use memcpy to get aligned event message */
 	memcpy(&amp;event-&gt;emsg, &amp;event_packet-&gt;msg, sizeof(event-&gt;emsg));
 	memcpy(event-&gt;data, data, datalen);
-	event-&gt;datalen = datalen;
 	memcpy(event-&gt;ifaddr, event_packet-&gt;eth.h_dest, ETH_ALEN);
 
 	brcmf_fweh_queue_event(fweh, event);
</code></pre>
<ul><li><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=62d19b35808816dc2bdf5031e5401230f6a915ba" rel="nofollow">62d19b358088</a> (“wifi: brcmfmac: fweh: Add __counted_by...”)</li></ul>

<p>In the patch above, <code>datalen</code> is the <em>counter</em> for the flexible-array member <code>data</code>. Notice how the assignment to the <em>counter</em>, <code>event-&gt;datalen = datalen</code>, had to be moved to before the call to <code>memcpy(event-&gt;data, data, datalen)</code>; this ensures the <em>counter</em> is initialized before the first reference to the flexible array. Otherwise, the compiler would complain about trying to write into a flexible array of size zero, due to <code>datalen</code> being zeroed out by a previous call to <code>kzalloc()</code>. This assignment-after-<code>memcpy()</code> pattern has been quite common in the Linux kernel; however, when dealing with <code>counted_by</code> annotations, it must be changed. We therefore have to be careful when making these annotations: we should audit all instances of code that reference both the <em>counter</em> and the flexible array, and ensure they meet the proper requirements.</p>

<p>In the kernel, we&#39;ve been learning from our mistakes and have fixed some buggy annotations we made in the beginning. Here are a couple of bugfixes to make you aware of these issues:</p>
<ul><li><p><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6dc445c1905096b2ed4db1a84570375b4e00cc0f" rel="nofollow">6dc445c19050</a> (“clk: bcm: rpi: Assign -&gt;num before accessing...”)</p></li>

<li><p><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9368cdf90f52a68120d039887ccff74ff33b4444" rel="nofollow">9368cdf90f52</a> (“clk: bcm: dvp: Assign -&gt;num before accessing...”)</p></li></ul>

<p>Another common issue is when the <em>counter</em> is updated inside a loop. See the patch below.</p>

<pre><code class="language-diff">diff --git a/drivers/net/wireless/ath/wil6210/cfg80211.c b/drivers/net/wireless/ath/wil6210/cfg80211.c
index 8993028709ecfb..e8f1d30a8d73c5 100644
--- a/drivers/net/wireless/ath/wil6210/cfg80211.c
+++ b/drivers/net/wireless/ath/wil6210/cfg80211.c
@@ -892,10 +892,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	struct wil6210_priv *wil = wiphy_to_wil(wiphy);
 	struct wireless_dev *wdev = request-&gt;wdev;
 	struct wil6210_vif *vif = wdev_to_vif(wil, wdev);
-	struct {
-		struct wmi_start_scan_cmd cmd;
-		u16 chnl[4];
-	} __packed cmd;
+	DEFINE_FLEX(struct wmi_start_scan_cmd, cmd,
+		    channel_list, num_channels, 4);
 	uint i, n;
 	int rc;
 
@@ -977,9 +975,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	vif-&gt;scan_request = request;
 	mod_timer(&amp;vif-&gt;scan_timer, jiffies + WIL6210_SCAN_TO);
 
-	memset(&amp;cmd, 0, sizeof(cmd));
-	cmd.cmd.scan_type = WMI_ACTIVE_SCAN;
-	cmd.cmd.num_channels = 0;
+	cmd-&gt;scan_type = WMI_ACTIVE_SCAN;
+	cmd-&gt;num_channels = 0;
 	n = min(request-&gt;n_channels, 4U);
 	for (i = 0; i &lt; n; i++) {
 		int ch = request-&gt;channels[i]-&gt;hw_value;
@@ -991,7 +988,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 			continue;
 		}
 		/* 0-based channel indexes */
-		cmd.cmd.channel_list[cmd.cmd.num_channels++].channel = ch - 1;
+		cmd-&gt;num_channels++;
+		cmd-&gt;channel_list[cmd-&gt;num_channels - 1].channel = ch - 1;
 		wil_dbg_misc(wil, &#34;Scan for ch %d  : %d MHz\n&#34;, ch,
 			     request-&gt;channels[i]-&gt;center_freq);
 	}
@@ -1007,16 +1005,15 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	if (rc)
 		goto out_restore;
 
-	if (wil-&gt;discovery_mode &amp;&amp; cmd.cmd.scan_type == WMI_ACTIVE_SCAN) {
-		cmd.cmd.discovery_mode = 1;
+	if (wil-&gt;discovery_mode &amp;&amp; cmd-&gt;scan_type == WMI_ACTIVE_SCAN) {
+		cmd-&gt;discovery_mode = 1;
 		wil_dbg_misc(wil, &#34;active scan with discovery_mode=1\n&#34;);
 	}
 
 	if (vif-&gt;mid == 0)
 		wil-&gt;radio_wdev = wdev;
 	rc = wmi_send(wil, WMI_START_SCAN_CMDID, vif-&gt;mid,
-		      &amp;cmd, sizeof(cmd.cmd) +
-		      cmd.cmd.num_channels * sizeof(cmd.cmd.channel_list[0]));
+		      cmd, struct_size(cmd, channel_list, cmd-&gt;num_channels));
 
 out_restore:
 	if (rc) {
diff --git a/drivers/net/wireless/ath/wil6210/wmi.h b/drivers/net/wireless/ath/wil6210/wmi.h
index 71bf2ae27a984f..b47606d9068c8b 100644
--- a/drivers/net/wireless/ath/wil6210/wmi.h
+++ b/drivers/net/wireless/ath/wil6210/wmi.h
@@ -474,7 +474,7 @@ struct wmi_start_scan_cmd {
 	struct {
 		u8 channel;
 		u8 reserved;
-	} channel_list[];
+	} channel_list[] __counted_by(num_channels);
 } __packed;
 
 #define WMI_MAX_PNO_SSID_NUM	(16)
</code></pre>
<ul><li><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34c34c242a1be24cb5da17fe2954c8c71caf815a" rel="nofollow">34c34c242a1b</a> (“wifi: wil6210: cfg80211: Use __counted_by...”)</li></ul>

<p>The patch above does a bit more than merely annotating the flexible array with the <code>__counted_by()</code> macro, but that&#39;s material for a future post. For now, let&#39;s focus on the following excerpt.</p>

<pre><code class="language-diff">-	cmd.cmd.scan_type = WMI_ACTIVE_SCAN;
-	cmd.cmd.num_channels = 0;
+	cmd-&gt;scan_type = WMI_ACTIVE_SCAN;
+	cmd-&gt;num_channels = 0;
 	n = min(request-&gt;n_channels, 4U);
 	for (i = 0; i &lt; n; i++) {
 		int ch = request-&gt;channels[i]-&gt;hw_value;
@@ -991,7 +988,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 			continue;
 		}
 		/* 0-based channel indexes */
-		cmd.cmd.channel_list[cmd.cmd.num_channels++].channel = ch - 1;
+		cmd-&gt;num_channels++;
+		cmd-&gt;channel_list[cmd-&gt;num_channels - 1].channel = ch - 1;
 		wil_dbg_misc(wil, &#34;Scan for ch %d  : %d MHz\n&#34;, ch,
 			     request-&gt;channels[i]-&gt;center_freq);
 	}
 ...
--- a/drivers/net/wireless/ath/wil6210/wmi.h
+++ b/drivers/net/wireless/ath/wil6210/wmi.h
@@ -474,7 +474,7 @@ struct wmi_start_scan_cmd {
 	struct {
 		u8 channel;
 		u8 reserved;
-	} channel_list[];
+	} channel_list[] __counted_by(num_channels);
 } __packed;
</code></pre>

<p>Notice that in this case, <code>num_channels</code> is our <em>counter</em>, and it&#39;s set to zero before the for loop. Inside the for loop, the original code used this variable as an index to access the flexible array, then updated it via a post-increment, all in one line: <code>cmd.cmd.channel_list[cmd.cmd.num_channels++]</code>. The issue is that once <code>channel_list</code> was annotated with the <code>__counted_by()</code> macro, the compiler enforces dynamic array indexing of <code>channel_list</code> to stay below <code>num_channels</code>. Since <code>num_channels</code> holds a value of zero at the moment of the array access, this leads to undefined behavior and may trigger a compiler warning.</p>

<p>As shown in the patch, the solution is to increment <code>num_channels</code> before accessing the array, and then access the array through an index adjustment below <code>num_channels</code>.</p>

<p>Another option is to avoid using the <em>counter</em> as an index for the flexible array altogether. This can be done by using an auxiliary variable instead. See an excerpt of a patch below.</p>

<pre><code class="language-diff">diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h
index 38eb7ec86a1a65..21ebd70f3dcc97 100644
--- a/include/net/bluetooth/hci.h
+++ b/include/net/bluetooth/hci.h
@@ -2143,7 +2143,7 @@ struct hci_cp_le_set_cig_params {
 	__le16  c_latency;
 	__le16  p_latency;
 	__u8    num_cis;
-	struct hci_cis_params cis[];
+	struct hci_cis_params cis[] __counted_by(num_cis);
 } __packed;

@@ -1722,34 +1717,33 @@ static int hci_le_create_big(struct hci_conn *conn, struct bt_iso_qos *qos)
 
 static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 {
 ...

+	u8 aux_num_cis = 0;
 	u8 cis_id;
 ...

 	for (cis_id = 0x00; cis_id &lt; 0xf0 &amp;&amp;
-	     pdu.cp.num_cis &lt; ARRAY_SIZE(pdu.cis); cis_id++) {
+	     aux_num_cis &lt; pdu-&gt;num_cis; cis_id++) {
 		struct hci_cis_params *cis;
 
 		conn = hci_conn_hash_lookup_cis(hdev, NULL, 0, cig_id, cis_id);
@@ -1758,7 +1752,7 @@ static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 
 		qos = &amp;conn-&gt;iso_qos;
 
-		cis = &amp;pdu.cis[pdu.cp.num_cis++];
+		cis = &amp;pdu-&gt;cis[aux_num_cis++];
 		cis-&gt;cis_id = cis_id;
 		cis-&gt;c_sdu  = cpu_to_le16(conn-&gt;iso_qos.ucast.out.sdu);
 		cis-&gt;p_sdu  = cpu_to_le16(conn-&gt;iso_qos.ucast.in.sdu);
@@ -1769,14 +1763,14 @@ static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 		cis-&gt;c_rtn  = qos-&gt;ucast.out.rtn;
 		cis-&gt;p_rtn  = qos-&gt;ucast.in.rtn;
 	}
+	pdu-&gt;num_cis = aux_num_cis;
 
 ...
</code></pre>
<ul><li><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ea9e148c803b24ebbc7a74171f22f42c8fd8d644" rel="nofollow">ea9e148c803b</a> (“Bluetooth: hci_conn: Use __counted_by() and...”)</li></ul>

<p>Again, the entire patch does more than merely annotate the flexible-array member, but let&#39;s just focus on how <code>aux_num_cis</code> is used to access the flexible array <code>pdu-&gt;cis[]</code>.</p>

<p>In this case, the <em>counter</em> is <code>num_cis</code>. As in our previous example, the <em>counter</em> was originally used to directly access the flexible array: <code>&amp;pdu.cis[pdu.cp.num_cis++]</code>. The patch above instead introduces a new variable, <code>aux_num_cis</code>, to be used in place of the <em>counter</em>: <code>&amp;pdu-&gt;cis[aux_num_cis++]</code>. The <em>counter</em> is then updated after the loop: <code>pdu-&gt;num_cis = aux_num_cis</code>.</p>

<p>Both solutions are acceptable, so use whichever is convenient for you. :)</p>

<p>Here, you can see a recent bugfix for some <a href="https://lore.kernel.org/linux-hardening/20240517153332.18271-1-dmantipov@yandex.ru/" rel="nofollow">buggy</a> annotations that missed the details discussed above:</p>
<ul><li>[<a href="https://lore.kernel.org/linux-hardening/20240619211233.work.355-kees@kernel.org/" rel="nofollow">PATCH</a>] wifi: iwlwifi: mvm: Fix <code>__counted_by</code> usage in <code>cfg80211_wowlan_nd_*</code></li></ul>

<p>In a future post, I&#39;ll address the issue of annotating flexible arrays of flexible structures. Spoiler alert: <strong>don&#39;t do it!</strong></p>

<p>Latest version: <a href="https://embeddedor.com/blog/2024/06/18/how-to-use-the-new-counted_by-attribute-in-c-and-linux/" rel="nofollow">How to use the new counted_by attribute in C (and Linux)</a></p>
]]></content:encoded>
      <author>Gustavo A. R. Silva</author>
      <guid>https://people.kernel.org/read/a/3z9id0kyzl</guid>
      <pubDate>Wed, 17 Jul 2024 00:43:46 +0000</pubDate>
    </item>
    <item>
      <title>Custom message-ids with mutt and coolname</title>
      <link>https://people.kernel.org/monsieuricon/custom-message-ids-with-mutt-and-coolname</link>
      <description>&lt;![CDATA[Message-ID&#39;s are used to identify and retrieve messages from the public-inbox archive on lore.kernel.org, so it&#39;s only natural to want to use memorable ones. Or maybe it&#39;s just me.&#xA;&#xA;Regardless, here&#39;s what I do with neomutt and coolname:&#xA;&#xA;If coolname isn&#39;t yet packaged for your distro, you can install it with pip:&#xA;&#xA;        pip install --user coolname&#xA;    &#xA;Create this file as ~/bin/my-msgid.py:&#xA;&#xA;        #!/usr/bin/python3&#xA;    import sys&#xA;    import random&#xA;    import string&#xA;    import datetime&#xA;    import platform&#xA;&#xA;    from coolname import generate_slug&#xA;&#xA;    parts = []&#xA;    parts.append(datetime.datetime.now().strftime(&#39;%Y%m%d&#39;))&#xA;    parts.append(generate_slug(3))&#xA;    parts.append(&#39;&#39;.join(random.choices(string.hexdigits, k=6)).lower())&#xA;&#xA;    sys.stdout.write(&#39;-&#39;.join(parts) + &#39;@&#39; + platform.node().split(&#39;.&#39;)[0])&#xA;    &#xA;Create this file as ~/.mutt-fix-msgid:&#xA;&#xA;        my_hdr Message-ID: &lt;`/path/to/my/bin/my-msgid.py`&gt;&#xA;    &#xA;Add this to your .muttrc (works with mutt and neomutt):&#xA;&#xA;        send-hook . &#34;source ~/.mutt-fix-msgid&#34;&#xA;    &#xA;Enjoy funky message-id&#39;s like 20240227-flawless-capybara-of-drama-e09653@lemur. :)]]&gt;</description>
      <content:encoded><![CDATA[<p>Message-ID&#39;s are used to identify and retrieve messages from the public-inbox archive on lore.kernel.org, so it&#39;s only natural to want to use memorable ones. Or maybe it&#39;s just me.</p>

<p>Regardless, here&#39;s what I do with neomutt and <a href="https://pypi.org/project/coolname/" rel="nofollow">coolname</a>:</p>
<ol><li><p>If coolname isn&#39;t yet packaged for your distro, you can install it with pip:</p>

<pre><code>pip install --user coolname
</code></pre></li>

<li><p>Create this file as <code>~/bin/my-msgid.py</code>:</p>

<pre><code class="language-python">#!/usr/bin/python3
import sys
import random
import string
import datetime
import platform

from coolname import generate_slug

parts = []
parts.append(datetime.datetime.now().strftime(&#39;%Y%m%d&#39;))
parts.append(generate_slug(3))
parts.append(&#39;&#39;.join(random.choices(string.hexdigits, k=6)).lower())

sys.stdout.write(&#39;-&#39;.join(parts) + &#39;@&#39; + platform.node().split(&#39;.&#39;)[0])
</code></pre></li>

<li><p>Create this file as <code>~/.mutt-fix-msgid</code>:</p>

<pre><code>my_hdr Message-ID: &lt;`/path/to/my/bin/my-msgid.py`&gt;
</code></pre></li>

<li><p>Add this to your <code>.muttrc</code> (works with mutt and neomutt):</p>

<pre><code>send-hook . &#34;source ~/.mutt-fix-msgid&#34;
</code></pre></li>

<li><p>Enjoy funky message-id&#39;s like <code>20240227-flawless-capybara-of-drama-e09653@lemur</code>. :)</p></li></ol>
]]></content:encoded>
      <author>Konstantin Ryabitsev</author>
      <guid>https://people.kernel.org/read/a/vswy7le3vw</guid>
      <pubDate>Tue, 27 Feb 2024 23:30:25 +0000</pubDate>
    </item>
    <item>
      <title>netdev in 2023</title>
      <link>https://people.kernel.org/kuba/netdev-in-2023</link>
      <description>&lt;![CDATA[Developments in Linux kernel networking accomplished by many excellent developers and as remembered by Andrew L, Eric D, Jakub K and Paolo A.&#xA;Intro&#xA;The end of the Linux v6.2 merge coincided with the end of 2022, and the v6.8 window had just begun, meaning that during 2023 we developed for 6 kernel releases (v6.3 - v6.8). Throughout those releases netdev patch handlers (DaveM, Jakub, Paolo) applied 7243 patches, and the resulting pull requests to Linus described the changes in 6398 words. Given the volume of work we cannot go over every improvement, or even cover networking sub-trees in much detail (BPF enhancements… wireless work on WiFi 7…). We instead try to focus on major themes, and developments we subjectively find interesting.&#xA;Core and protocol stack&#xA;Some kernel-wide winds of development have blown our way in 2023. In v6.5 we saw an addition of SCM_PIDFD and SO_PEERPIDFD APIs for credential passing over UNIX sockets. The APIs duplicate existing ones but are using pidfds rather than integer PIDs. We have also seen a number of real-time related patches throughout the year.&#xA;&#xA;v6.5 has brought a major overhaul of the socket splice implementation. Instead of feeding data into sockets page by page via a .sendpage callback, the socket .sendmsg handlers were extended to allow taking a reference on the data in struct msghdr. Continuing with the category of “scary refactoring work” we have also merged overhaul of locking in two subsystems - the wireless stack and devlink.&#xA;&#xA;Early in the year we saw a tail end of the BIG TCP development (the ability to send chunks of more than 64kB of data through the stack at a time). v6.3 added support for BIG TCP over IPv4, the initial implementation in 2021 supported only IPv6, as the IPv4 packet header has no way of expressing lengths which don’t fit on 16 bits. v6.4 release also made the size of the “page fragment” array in the skb configurable at compilation time. 
Larger array increases the packet metadata size, but also increases the chances of being able to use BIG TCP when data is scattered across many pages.&#xA;&#xA;Networking needs to allocate (and free) packet buffers at a staggering rate, and we see a continuous stream of improvements in this area. Most of the work these days centers on the page_pool infrastructure. v6.5 enabled recycling freed pages back to the pool without using any locks or atomic operations (when recycling happens in the same softirq context in which we expect the allocator to run). v6.7 reworked the API making allocation of arbitrary-size buffers (rather than pages) easier, also allowing removal of PAGE_SIZE-dependent logic from some drivers (16kB pages on ARM64 are increasingly important). v6.8 added uAPI for querying page_pool statistics over Netlink. Looking forward - there’s ongoing work to allow page_pools to allocate either special (user-mapped, or huge page backed) pages or buffers without struct page (DMABUF memory). In the non-page_pool world - a new slab cache was also added to avoid having to read struct page associated with the skb heads at freeing time, avoiding potential cache misses.&#xA;&#xA;Number of key networking data structures (skb, netdevice, page_pool, sock, netns, mibs, nftables, fq scheduler) had been reorganized to optimize cacheline consumption and avoid cache misses. 
This reportedly improved TCP RPC performance with many connections on some AMD systems by as much as 40%.&#xA;&#xA;In v6.7 the commonly used Fair Queuing (FQ) packet scheduler has gained built-in support for 3 levels of priority and ability to bypass queuing completely if the packet can be sent immediately (resulting in a 5% speedup for TCP RPCs).&#xA;&#xA;Notable TCP developments this year include TCP Auth Option (RFC 5925) support, support for microsecond resolution of timestamps in the TimeStamp Option, and ACK batching optimizations.&#xA;&#xA;Multi-Path TCP (MPTCP) is slowly coming to maturity, with most development effort focusing on reducing the features gap with plain TCP in terms of supported socket options, and increasing observability and introspection via native diag interface. Additionally, MPTCP has gained eBPF support to implement custom packet schedulers and simplify the migration of existing TCP applications to the multi-path variant.&#xA;&#xA;Transport encryption continues to be very active as well. Increasing number of NICs support some form of crypto offload (TLS, IPsec, MACsec). This year notably we gained in-kernel users (NFS, NVMe, i.e. storage) of TLS encryption. Because kernel doesn’t have support for performing the TLS handshake by itself, a new mechanism was developed to hand over kernel-initiated TCP sockets to user space temporarily, where a well-tested user space library like OpenSSL or GnuTLS can perform a TLS handshake and negotiation, and then hand the connection back over to the kernel, with the keys installed.&#xA;&#xA;The venerable bridge implementation has gained a few features. Majority of bridge development these days is driven by offloads (controlling hardware switches), and in case of data center switches EVPN support. Users can now limit the number of FDB and MDB auto-learned entries and selectively flush them in both bridge and VxLAN tunnels. 
v6.5 added the ability to selectively forward packets to VxLAN tunnels depending on whether they had missed the FDB in the lower bridge.&#xA;&#xA;Among changes which may be more immediately visible to users - starting from v6.5 the IPv6 stack no longer prints the “link becomes ready” message when interface is brought up.&#xA;&#xA;The AF_XDP zero-copy sockets have gained two major features in 2023. In v6.6 we gained multi-buffer support which allows transferring packets which do not fit in a single buffer (scatter-gather). v6.8 added Tx metadata support, enabling NIC Tx offloads on packets sent on AF_XDP sockets (checksumming, segmentation) as well as timestamping.&#xA;&#xA;Early in the year we merged specifications and tooling for describing Netlink messages in YAML format. This work has grown to cover most major Netlink families (both legacy and generic). The specs are used to generate kernel ops/parsers, the uAPI headers, and documentation. User space can leverage the specs to serialize/deserialize Netlink messages without having to manually write parsers (C and Python have the support so far). &#xA;Device APIs&#xA;Apart from describing existing Netlink families, the YAML specs were put to use in defining new APIs. The “netdev” family was created to expose network device internals (BPF/XDP capabilities, information about device queues, NAPI instances, interrupt mapping etc.)&#xA;&#xA;In the “ethtool” family - v6.3 brought APIs for configuring Ethernet Physical Layer Collision Avoidance (PLCA) (802.3cg-2019, a modern version of shared medium Ethernet) and MAC Merge layer (IEEE 802.3-2018 clause 99, allowing preemption of low priority frames by high priority frames). &#xA;&#xA;After many attempts we have finally gained solid integration between the networking and the LED subsystems, allowing hardware-driven blinking of LEDs on Ethernet ports and SFPs to be configured using Linux LED APIs. 
Driver developers are working through the backlog of all devices which need this integration.&#xA;&#xA;In general, automotive Ethernet-related contributions grew significantly in 2023, and with it, more interest in “slow” networking like 10Mbps over a single pair. Although the Data Center tends to dominate Linux networking events, the community as a whole is very diverse.&#xA;&#xA;Significant development work went into refactoring and extending time-related networking APIs. Time stamping and time-based scheduling of packets has wide use across network applications (telcos, industrial networks, data centers). The most user visible addition is likely the DPLL subsystem in v6.7, used to configure and monitor atomic clocks and machines which need to forward clock phase between network ports.&#xA;&#xA;Last but not least, late in the year the networking subsystem gained the first Rust API, for writing PHY drivers, as well as a driver implementation (duplicating an existing C driver, for now).&#xA;Removed&#xA;Inspired by the returning discussion about code removal at the Maintainer Summit let us mention places in the networking subsystem where code was retired this year. First and foremost in v6.8 wireless maintainers removed a lot of very old WiFi drivers, earlier in v6.3 they have also retired parts of WEP security. In v6.7 some parts of AppleTalk have been removed. In v6.3 (and v6.8) we have retired a number of packet schedulers and packet classifiers from the TC subsystem (act_ipt, act_rsvp, act_tcindex, sch_atm, sch_cbq, sch_dsmark). 
This was partially driven by an influx of syzbot and bug-bounty-driven security reports (there are many ways to earn money with Linux, turns out 🙂) Finally, the kernel parts of the bpfilter experiment were removed in v6.8, as the development effort had moved to user space.&#xA;Community &amp; process&#xA;The maintainers, developers and  community members had a chance to meet at the BPF/netdev track at Linux Plumbers in Richmond, and the netdev.conf 0x17 conference in Vancouver. 2023 was also the first time since the COVID pandemic when we organized the small netconf gathering - thanks to Meta for sponsoring and Kernel Recipes for hosting us in Paris! &#xA;&#xA;We have made minor improvements to the mailing list development process by allowing a wider set of folks to update patch status using simple “mailbot commands”. Patch authors and anyone listed in MAINTAINERS for file paths touched by a patch series can now update the submission state in patchwork themselves.&#xA;&#xA;The per-release development statistics, started late in the previous year, are now an established part of the netdev process, marking the end of each development cycle. They proved to be appreciated by the community and, more importantly, to somewhat steer some of the less participatory citizens towards better and more frequent contributions, especially on the review side.&#xA;&#xA;A small but growing number of silicon vendors have started to try to mainline drivers without having the necessary experience, or mentoring needed to effectively participate in the upstream process. Some without consulting any of our documentation, others without consulting teams within their organization with more upstream experience. This has resulted in poor quality patch sets, taken up valuable time from the reviewers and led to reviewer frustration.&#xA;&#xA;Much like the kernel community at large, we have been steadily shifting the focus on kernel testing, or integrating testing into our development process. 
In the olden days the kernel tree did not carry many tests, and testing had been seen as something largely external to the kernel project. The tools/testing/selftests directory was only created in 2012, and lib/kunit in 2019! We have accumulated a number of selftest for networking over the years, in 2023 there were multiple large selftest refactoring and speed up efforts. Our netdev CI started running all kunit tests and networking selftests on posted patches (although, to be honest, selftest runner only started working in January 2024 🙂).&#xA;&#xA;syzbot stands out among “external” test projects which are particularly valuable for networking. We had fixed roughly 200 syzbot-reported bugs. This took a significant amount of maintainer work but in general we find syzbot bug reports to be useful, high quality and a pleasure to work on.&#xA;Full list of networking PRs this year (links)&#xA;6.3: https://lore.kernel.org/all/20230221233808.1565509-1-kuba@kernel.org/&#xA;6.4: https://lore.kernel.org/all/20230426143118.53556-1-pabeni@redhat.com/&#xA;6.5: https://lore.kernel.org/all/20230627184830.1205815-1-kuba@kernel.org/&#xA;6.6: https://lore.kernel.org/all/20230829125950.39432-1-pabeni@redhat.com/&#xA;6.7: https://lore.kernel.org/all/20231028011741.2400327-1-kuba@kernel.org/&#xA;6.8: https://lore.kernel.org/all/20240109162323.427562-1-pabeni@redhat.com/&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p><em>Developments in Linux kernel networking accomplished by many excellent developers and as remembered by Andrew L, Eric D, Jakub K and Paolo A.</em></p>

<h2 id="intro">Intro</h2>

<p>The end of the Linux v6.2 merge window coincided with the end of 2022, and the v6.8 window had just begun, meaning that during 2023 we developed for 6 kernel releases (v6.3 – v6.8). Throughout those releases netdev patch handlers (DaveM, Jakub, Paolo) applied 7243 patches, and the resulting pull requests to Linus described the changes in 6398 words. Given the volume of work we cannot go over every improvement, or even cover networking sub-trees in much detail (BPF enhancements… wireless work on WiFi 7…). We instead try to focus on major themes, and developments we subjectively find interesting.</p>

<h2 id="core-and-protocol-stack">Core and protocol stack</h2>

<p>Some kernel-wide winds of development have blown our way in 2023. In v6.5 we saw an addition of SCM_PIDFD and SO_PEERPIDFD APIs for credential passing over UNIX sockets. The APIs duplicate existing ones but are using pidfds rather than integer PIDs. We have also seen a number of real-time related patches throughout the year.</p>

<p>v6.5 has brought a major overhaul of the socket splice implementation. Instead of feeding data into sockets page by page via a .sendpage callback, the socket .sendmsg handlers were extended to allow taking a reference on the data in struct msghdr. Continuing with the category of “scary refactoring work”, we have also merged an overhaul of locking in two subsystems – the wireless stack and devlink.</p>

<p>Early in the year we saw the tail end of the BIG TCP development (the ability to send chunks of more than 64kB of data through the stack at a time). v6.3 added support for BIG TCP over IPv4; the initial implementation in 2021 supported only IPv6, as the IPv4 packet header has no way of expressing lengths which don’t fit in 16 bits. The v6.4 release also made the size of the “page fragment” array in the skb configurable at compilation time. A larger array increases the packet metadata size, but also increases the chances of being able to use BIG TCP when data is scattered across many pages.</p>

<p>Networking needs to allocate (and free) packet buffers at a staggering rate, and we see a continuous stream of improvements in this area. Most of the work these days centers on the page_pool infrastructure. v6.5 enabled recycling freed pages back to the pool without using any locks or atomic operations (when recycling happens in the same softirq context in which we expect the allocator to run). v6.7 reworked the API making allocation of arbitrary-size buffers (rather than pages) easier, also allowing removal of PAGE_SIZE-dependent logic from some drivers (16kB pages on ARM64 are increasingly important). v6.8 added uAPI for querying page_pool statistics over Netlink. Looking forward – there’s ongoing work to allow page_pools to allocate either special (user-mapped, or huge page backed) pages or buffers without struct page (DMABUF memory). In the non-page_pool world – a new slab cache was also added to avoid having to read struct page associated with the skb heads at freeing time, avoiding potential cache misses.</p>

<p>A number of key networking data structures (skb, netdevice, page_pool, sock, netns, mibs, nftables, fq scheduler) have been reorganized to optimize cacheline consumption and avoid cache misses. This reportedly improved TCP RPC performance with many connections on some AMD systems by as much as 40%.</p>

<p>In v6.7 the commonly used Fair Queuing (FQ) packet scheduler has gained built-in support for 3 levels of priority and the ability to bypass queuing completely if the packet can be sent immediately (resulting in a 5% speedup for TCP RPCs).</p>

<p>Notable TCP developments this year include TCP Auth Option (RFC 5925) support, support for microsecond resolution of timestamps in the TimeStamp Option, and ACK batching optimizations.</p>

<p>Multi-Path TCP (MPTCP) is slowly coming to maturity, with most development effort focusing on reducing the feature gap with plain TCP in terms of supported socket options, and increasing observability and introspection via a native diag interface. Additionally, MPTCP has gained eBPF support to implement custom packet schedulers and simplify the migration of existing TCP applications to the multi-path variant.</p>

<p>Transport encryption continues to be very active as well. An increasing number of NICs support some form of crypto offload (TLS, IPsec, MACsec). Notably, this year we gained in-kernel users (NFS, NVMe, i.e. storage) of TLS encryption. Because the kernel doesn’t have support for performing the TLS handshake by itself, a new mechanism was developed to hand over kernel-initiated TCP sockets to user space temporarily, where a well-tested user space library like OpenSSL or GnuTLS can perform a TLS handshake and negotiation, and then hand the connection back over to the kernel, with the keys installed.</p>

<p>The venerable bridge implementation has gained a few features. The majority of bridge development these days is driven by offloads (controlling hardware switches) and, in the case of data center switches, EVPN support. Users can now limit the number of FDB and MDB auto-learned entries and selectively flush them in both bridge and VxLAN tunnels. v6.5 added the ability to selectively forward packets to VxLAN tunnels depending on whether they had missed the FDB in the lower bridge.</p>

<p>Among changes which may be more immediately visible to users – starting from v6.5 the IPv6 stack no longer prints the “link becomes ready” message when an interface is brought up.</p>

<p>The AF_XDP zero-copy sockets have gained two major features in 2023. In v6.6 we gained multi-buffer support which allows transferring packets which do not fit in a single buffer (scatter-gather). v6.8 added Tx metadata support, enabling NIC Tx offloads on packets sent on AF_XDP sockets (checksumming, segmentation) as well as timestamping.</p>

<p>Early in the year we merged specifications and tooling for describing Netlink messages in YAML format. This work has grown to cover most major Netlink families (both legacy and generic). The specs are used to generate kernel ops/parsers, the uAPI headers, and documentation. User space can leverage the specs to serialize/deserialize Netlink messages without having to manually write parsers (C and Python have the support so far).</p>

<h2 id="device-apis">Device APIs</h2>

<p>Apart from describing existing Netlink families, the YAML specs were put to use in defining new APIs. The “netdev” family was created to expose network device internals (BPF/XDP capabilities, information about device queues, NAPI instances, interrupt mapping etc.)</p>

<p>In the “ethtool” family – v6.3 brought APIs for configuring Ethernet Physical Layer Collision Avoidance (PLCA) (802.3cg-2019, a modern version of shared medium Ethernet) and the MAC Merge layer (IEEE 802.3-2018 clause 99, allowing preemption of low priority frames by high priority frames).</p>

<p>After many attempts we have finally <a href="https://www.youtube.com/watch?v=FTlcWWLeJXg" rel="nofollow">gained</a> solid integration between the networking and the LED subsystems, allowing hardware-driven blinking of LEDs on Ethernet ports and SFPs to be configured using Linux LED APIs. Driver developers are working through the backlog of all devices which need this integration.</p>

<p>In general, automotive Ethernet-related contributions grew significantly in 2023, and with it, more interest in “slow” networking like 10Mbps over a single pair. Although the Data Center tends to dominate Linux networking events, the community as a whole is very diverse.</p>

<p>Significant development work went into refactoring and extending time-related networking APIs. Time stamping and time-based scheduling of packets has wide use across network applications (telcos, industrial networks, data centers). The most user visible addition is likely the DPLL subsystem in v6.7, used to configure and monitor atomic clocks and machines which need to forward clock phase between network ports.</p>

<p>Last but not least, late in the year the networking subsystem gained the first Rust API, for writing PHY drivers, as well as a driver implementation (duplicating an existing C driver, for now).</p>

<h2 id="removed">Removed</h2>

<p>Inspired by the returning discussion about code removal at the <a href="https://lwn.net/Articles/951846/" rel="nofollow">Maintainer Summit</a>, let us mention places in the networking subsystem where code was retired this year. First and foremost, in v6.8 wireless maintainers removed a lot of very old WiFi drivers; earlier, in v6.3, they also retired parts of WEP security. In v6.7 some parts of AppleTalk have been removed. In v6.3 (and v6.8) we retired a number of packet schedulers and packet classifiers from the TC subsystem (act_ipt, act_rsvp, act_tcindex, sch_atm, sch_cbq, sch_dsmark). This was partially driven by an influx of syzbot and bug-bounty-driven security reports (there are many ways to earn money with Linux, turns out 🙂). Finally, the kernel parts of the bpfilter experiment were removed in v6.8, as the development effort had moved to user space.</p>

<h2 id="community-process">Community &amp; process</h2>

<p>The maintainers, developers and  community members had a chance to meet at the BPF/netdev track at <a href="https://www.youtube.com/@LinuxPlumbersConference" rel="nofollow">Linux Plumbers</a> in Richmond, and the <a href="https://www.youtube.com/@netdevconf" rel="nofollow">netdev.conf 0x17</a> conference in Vancouver. 2023 was also the first time since the COVID pandemic when we organized the small <a href="https://netdev.bots.linux.dev/netconf/2023/index.html" rel="nofollow">netconf</a> gathering – thanks to Meta for sponsoring and Kernel Recipes for hosting us in Paris!</p>

<p>We have made minor improvements to the mailing list development process by allowing a wider set of folks to <a href="https://www.kernel.org/doc/html/latest/process/maintainer-netdev.html#updating-patch-status" rel="nofollow">update patch status</a> using simple “mailbot commands”. Patch authors and anyone listed in MAINTAINERS for file paths touched by a patch series can now update the submission state in patchwork themselves.</p>

<p>The per-release development statistics, started late in the previous year, are now an established part of the netdev process, marking the end of each development cycle. They proved to be appreciated by the community and, more importantly, to somewhat steer some of the less participatory citizens towards better and more frequent contributions, especially on the review side.</p>

<p>A small but growing number of silicon vendors have started trying to mainline drivers without the experience or mentoring needed to participate effectively in the upstream process. Some did so without consulting any of our documentation, others without consulting teams within their organization with more upstream experience. This has resulted in poor-quality patch sets, taken up valuable reviewer time, and led to reviewer frustration.</p>

<p>Much like the kernel community at large, we have been steadily shifting our focus toward kernel testing, and integrating testing into our development process. In the olden days the kernel tree did not carry many tests, and testing was seen as something largely external to the kernel project. The tools/testing/selftests directory was only created in 2012, and lib/kunit in 2019! We have accumulated a number of selftests for networking over the years, and in 2023 there were multiple large selftest refactoring and speed-up efforts. Our netdev CI started running all kunit tests and networking selftests on posted patches (although, to be honest, the selftest runner only started working in January 2024 🙂).</p>

<p>syzbot stands out among “external” test projects which are particularly valuable for networking. We fixed roughly 200 syzbot-reported bugs. This took a significant amount of maintainer work, but in general we find syzbot bug reports to be useful, high quality and a pleasure to work on.</p>

<h2 id="full-list-of-networking-prs-this-year-links">Full list of networking PRs this year (links)</h2>

<p>6.3: <a href="https://lore.kernel.org/all/20230221233808.1565509-1-kuba@kernel.org/" rel="nofollow">https://lore.kernel.org/all/20230221233808.1565509-1-kuba@kernel.org/</a>
6.4: <a href="https://lore.kernel.org/all/20230426143118.53556-1-pabeni@redhat.com/" rel="nofollow">https://lore.kernel.org/all/20230426143118.53556-1-pabeni@redhat.com/</a>
6.5: <a href="https://lore.kernel.org/all/20230627184830.1205815-1-kuba@kernel.org/" rel="nofollow">https://lore.kernel.org/all/20230627184830.1205815-1-kuba@kernel.org/</a>
6.6: <a href="https://lore.kernel.org/all/20230829125950.39432-1-pabeni@redhat.com/" rel="nofollow">https://lore.kernel.org/all/20230829125950.39432-1-pabeni@redhat.com/</a>
6.7: <a href="https://lore.kernel.org/all/20231028011741.2400327-1-kuba@kernel.org/" rel="nofollow">https://lore.kernel.org/all/20231028011741.2400327-1-kuba@kernel.org/</a>
6.8: <a href="https://lore.kernel.org/all/20240109162323.427562-1-pabeni@redhat.com/" rel="nofollow">https://lore.kernel.org/all/20240109162323.427562-1-pabeni@redhat.com/</a></p>
]]></content:encoded>
      <author>Jakub Kicinski</author>
      <guid>https://people.kernel.org/read/a/2yvhbeu6v5</guid>
      <pubDate>Wed, 17 Jan 2024 14:20:41 +0000</pubDate>
    </item>
    <item>
      <title>Missing prototype warnings in the kernel</title>
      <link>https://people.kernel.org/arnd/missing-prototype-warnings-in-the-kernel</link>
      <description>&lt;![CDATA[Most compilers have an option to warn about  a function that has a global definition but no declaration, gcc has had -Wmissing-prototypes as far back as the 1990s, and the sparse checker introduced -Wdecl back in 2005. Ensuring that&#xA;each function has a declaration helps validate that the caller and the callee expect the same argument types, it can help find unused functions and it helps mark functions as static where possible to improve inter-function optimizations.&#xA;&#xA;The warnings are not enabled in a default build, but are part of both&#xA;make W=1 and make C=1 build, and in fact this used to cause most of the output of the former. As a number of subsystems have moved to eliminating all the W=1 warnings in their code, and the 0-day bot warns about newly introduced warnings, the amount of warning output from this has gone down over time.&#xA;&#xA;After I saw a few patches addressing individual warnings in this area, I had a look at what actually remains. For my soc tree maintenance, I already run my own build bot that checks the output of &#34;make randconfig&#34; builds for 32-bit and 64-bit arm as well as x86, and apply local bugfixes to address any warning or error I get. I then enabled -Wmissing-prototypes unconditionally and added patches to address every single new bug I found, around 140 in total. &#xA;&#xA;I uploaded the patches to https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=missing-prototypes and am sending them to the respective maintainers separately. Once all of these, or some other way to address each warning, can be merged into the mainline kernel, the warning option can be moved from W=1 to the default set.&#xA;&#xA;The patches are all independent of one another, so I hope that most of them can get applied to subsytems directly as soon as I post them.&#xA;&#xA;Some of the remaining architectures are already clean, while others will need follow-up patches for this. 
Another possible follow-up is to also address -Wmissing-variable-declarations warnings. This option is understood by clang but not enabled by the kernel build system, and not implemented by gcc, with the feature request being open since 2017.]]&gt;</description>
      <content:encoded><![CDATA[<p>Most compilers have an option to warn about a function that has a global definition but no declaration; gcc has had -Wmissing-prototypes as far back as the 1990s, and the <a href="https://sparse.docs.kernel.org/en/latest/" rel="nofollow">sparse</a> checker introduced -Wdecl back in 2005. Ensuring that
each function has a declaration helps validate that the caller and the callee expect the same argument types, helps find unused functions, and helps mark functions as <em>static</em> where possible to improve inter-function optimizations.</p>

<p>The warnings are not enabled in a default build, but are part of both
<em>make W=1</em> and <em>make C=1</em> builds, and in fact this used to cause most of the output of the former. As a number of subsystems have moved to eliminating all the <em>W=1</em> warnings in their code, and the 0-day bot warns about newly introduced warnings, the amount of warning output from this has gone down over time.</p>

<p>After I saw a few patches addressing individual warnings in this area, I had a look at what actually remains. For my soc tree maintenance, I already run my own build bot that checks the output of “make randconfig” builds for 32-bit and 64-bit arm as well as x86, and apply local bugfixes to address any warning or error I get. I then enabled -Wmissing-prototypes unconditionally and added patches to address every single new bug I found, around 140 in total.</p>

<p>I uploaded the patches to <a href="https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=missing-prototypes" rel="nofollow">https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=missing-prototypes</a> and am sending them to the respective maintainers separately. Once all of these, or some other way to address each warning, can be merged into the mainline kernel, the warning option can be moved from <em>W=1</em> to the default set.</p>

<p>The patches are all independent of one another, so I hope that most of them can get applied to subsystems directly as soon as I post them.</p>

<p>Some of the remaining architectures are already clean, while others will need follow-up patches for this. Another possible follow-up is to also address -Wmissing-variable-declarations warnings. This option is understood by clang but not enabled by the kernel build system, and not implemented by gcc, with the <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65213" rel="nofollow">feature request</a> being open since 2017.</p>
]]></content:encoded>
      <author>arnd</author>
      <guid>https://people.kernel.org/read/a/z38rhdljx2</guid>
      <pubDate>Mon, 15 May 2023 14:00:16 +0000</pubDate>
    </item>
    <item>
      <title>More development statistics</title>
      <link>https://people.kernel.org/kuba/more-development-statistics</link>
      <description>&lt;![CDATA[The LWN&#39;s development statistics are being published at end of each release cycle for as long as I can remember (Linux 6.3 stats). Thinking back, I can divide the stages of my career based on my relationship with those stats. Fandom; aspiring; success; cynicism; professionalism (showing the stats to my manager). The last one gave me the most pause. &#xA;&#xA;Developers will agree (I think) that patch count is not a great metric for the value of the work. Yet, most of my managers had a distinct spark in their eye when I shared the fact that some random API refactoring landed me in the top 10.&#xA;&#xA;Understanding the value of independently published statistics and putting in the necessary work to calculate them release after release is one of many things we should be thankful for to LWN.&#xA;&#xA;Local stats&#xA;&#xA;With that in mind it&#39;s only logical to explore calculating local subsystem statistics. Global kernel statistics can only go so far. The top 20 can only, by definition, highlight the work of 20 people, and we have thousands of developers working on each release. The networking list alone sees around 700 people participate in discussions for each release. &#xA;&#xA;Another relatively recent development which opens up opportunities is the creation of the lore archive. Specifically how easy it is now to download and process any mailing list&#39;s history. LWN stats are generated primarily based on git logs. Without going into too much of a sidebar - if we care about the kernel community not how much code various corporations can ship into the kernel - mailing list data mining is a better approach than git data mining. Global mailing list stats would be a challenge but subsystems are usually tied to a single list.&#xA;&#xA;netdev stats&#xA;&#xA;During the 6.1 merge window I could no longer resist the temptation and I threw some Python and the lore archive of netdev into a blender. 
My initial goal was to highlight the work of people who review patches, rather than only ship code, or bombard the mailing list with trivial patches of varying quality. I compiled stats for the last 4 release cycles (6.1, 6.2, 6.3, and 6.4), each with more data and metrics. Kernel developers are, outside of matters relating to their code, generally quiet beasts so I haven&#39;t received a ton of feedback. If we trust the statistics themselves, however -- the review tags on patches applied directly by networking maintainers have increased from around 30% to an unbelievable 65%.&#xA;&#xA;We&#39;ve also seen a significant decrease in the number of trivial patches sent by semi-automated bots (possibly to game the git-based stats). It may be a result of other push back against such efforts, so I can&#39;t take all the full credit :)&#xA;&#xA;Random example&#xA;&#xA;I should probably give some more example stats. The individual and company stats generated for netdev are likely not that interesting to a reader outside of netdev, but perhaps the &#34;developer tenure&#34; stats will be. 
I calculated those to see whether we have a healthy number of new members.&#xA;&#xA;Time since first commit in the git history for reviewers&#xA; 0- 3mo   |   2 | &#xA; 3- 6mo   |   3 | &#xA;6mo-1yr   |   9 | *&#xA; 1- 2yr   |  23 | ************&#xA; 2- 4yr   |  33 | ##########################&#xA; 4- 6yr   |  43 | ##################################&#xA; 6- 8yr   |  36 | #############################&#xA; 8-10yr   |  40 | ################################&#xA;10-12yr   |  31 | #########################&#xA;12-14yr   |  33 | ##########################&#xA;14-16yr   |  31 | #########################&#xA;16-18yr   |  46 | #####################################&#xA;18-20yr   |  49 | #######################################&#xA;&#xA;Time since first commit in the git history for authors&#xA; 0- 3mo   |  40 | ********************&#xA; 3- 6mo   |  15 | ****&#xA;6mo-1yr   |  23 | *********&#xA; 1- 2yr   |  49 | *****************************&#xA; 2- 4yr   |  47 | ###############################&#xA; 4- 6yr   |  50 | #################################&#xA; 6- 8yr   |  31 | ####################&#xA; 8-10yr   |  33 | #####################&#xA;10-12yr   |  19 | ############&#xA;12-14yr   |  25 | ################&#xA;14-16yr   |  22 | ##############&#xA;16-18yr   |  32 | #####################&#xA;18-20yr   |  31 | ####################&#xA;&#xA;As I shared on the list - the &#34;recent&#34; buckets are sparse for reviewers and more filled for authors, as expected. What I haven&#39;t said is that if one steps away from the screen to look at the general shape of the histograms, however, things are not perfect. The author and the reviewer histograms seem to skew in the opposite directions. I&#39;ll leave to the reader pondering what the perfect shape of such a graph should be for a project, I have my hunch. 
Regardless, I&#39;m hoping we can learn something by tracking its changes over time.&#xA;&#xA;Fin&#xA;&#xA;To summarize - I think that spending a day in each release cycle to hack on/generate development stats for the community is a good investment of maintainer&#39;s time. They let us show appreciation, check our own biases and by carefully selecting the metrics - encourage good behavior. My hacky code is available on GitHub, FWIW, but using mine may go against the benefits of locality? LWN&#39;s code is also available publicly (search for gitdm, IIRC).]]&gt;</description>
      <content:encoded><![CDATA[<p>LWN&#39;s development statistics have been published at the end of each release cycle for as long as I can remember (<a href="https://lwn.net/Articles/929582/" rel="nofollow">Linux 6.3 stats</a>). Thinking back, I can divide the stages of my career based on my relationship with those stats. Fandom; aspiring; success; cynicism; professionalism (showing the stats to my manager). The last one gave me the most pause.</p>

<p>Developers will agree (I think) that patch count is not a great metric for the value of the work. Yet, most of my managers had a distinct spark in their eye when I shared the fact that some random API refactoring landed me in the top 10.</p>

<p>Understanding the value of independently published statistics and putting in the necessary work to calculate them release after release is one of many things we should be thankful for to LWN.</p>

<h2 id="local-stats">Local stats</h2>

<p>With that in mind it&#39;s only logical to explore calculating local subsystem statistics. Global kernel statistics can only go so far. The top 20 can only, by definition, highlight the work of 20 people, and we have thousands of developers working on each release. The networking list alone sees around 700 people participate in discussions for each release.</p>

<p>Another relatively recent development which opens up opportunities is the creation of the lore archive – specifically, how easy it is now to download and process any mailing list&#39;s history. LWN stats are generated primarily based on git logs. Without going into too much of a sidebar – if we care about the kernel <em>community</em>, not how much code various corporations can ship into the kernel – mailing list data mining is a better approach than git data mining. Global mailing list stats would be a challenge, but subsystems are usually tied to a single list.</p>

<h2 id="netdev-stats">netdev stats</h2>

<p>During the 6.1 merge window I could no longer resist the temptation and I threw some Python and the lore archive of netdev into a blender. My initial goal was to highlight the work of people who <strong>review patches</strong>, rather than only ship code, or bombard the mailing list with trivial patches of varying quality. I compiled stats for the last 4 release cycles (<a href="https://lore.kernel.org/all/20221004212721.069dd189@kernel.org/" rel="nofollow">6.1</a>, <a href="https://lore.kernel.org/all/20221215180608.04441356@kernel.org/" rel="nofollow">6.2</a>, <a href="https://lore.kernel.org/all/20230221180756.0964fb2f@kernel.org/" rel="nofollow">6.3</a>, and <a href="https://lore.kernel.org/all/20230428135717.0ba5dc81@kernel.org/" rel="nofollow">6.4</a>), each with more data and metrics. Kernel developers are, outside of matters relating to their code, generally quiet beasts so I haven&#39;t received a ton of feedback. If we trust the statistics themselves, however — the review tags on patches applied directly by networking maintainers have increased from around 30% to an unbelievable 65%.</p>

<p>We&#39;ve also seen a significant decrease in the number of trivial patches sent by semi-automated bots (possibly to game the git-based stats). It may be a result of other pushback against such efforts, so I can&#39;t take all the credit :)</p>

<h2 id="random-example">Random example</h2>

<p>I should probably give some more example stats. The individual and company stats generated for netdev are likely not that interesting to a reader outside of netdev, but perhaps the “developer tenure” stats will be. I calculated those to see whether we have a healthy number of new members.</p>

<pre><code>Time since first commit in the git history for reviewers
 0- 3mo   |   2 | *
 3- 6mo   |   3 | **
6mo-1yr   |   9 | *******
 1- 2yr   |  23 | ******************
 2- 4yr   |  33 | ##########################
 4- 6yr   |  43 | ##################################
 6- 8yr   |  36 | #############################
 8-10yr   |  40 | ################################
10-12yr   |  31 | #########################
12-14yr   |  33 | ##########################
14-16yr   |  31 | #########################
16-18yr   |  46 | #####################################
18-20yr   |  49 | #######################################

Time since first commit in the git history for authors
 0- 3mo   |  40 | **************************
 3- 6mo   |  15 | **********
6mo-1yr   |  23 | ***************
 1- 2yr   |  49 | ********************************
 2- 4yr   |  47 | ###############################
 4- 6yr   |  50 | #################################
 6- 8yr   |  31 | ####################
 8-10yr   |  33 | #####################
10-12yr   |  19 | ############
12-14yr   |  25 | ################
14-16yr   |  22 | ##############
16-18yr   |  32 | #####################
18-20yr   |  31 | ####################
</code></pre>

<p>As I shared on the list – the “recent” buckets are sparse for reviewers and more filled for authors, as expected. What I haven&#39;t said is that if one steps away from the screen to look at the general shape of the histograms, things are not perfect. The author and the reviewer histograms seem to skew in opposite directions. I&#39;ll leave it to the reader to ponder what the perfect shape of such a graph should be for a project; I have my hunch. Regardless, I&#39;m hoping we can learn something by tracking its changes over time.</p>

<h2 id="fin">Fin</h2>

<p>To summarize – I think that spending a day in each release cycle to hack on/generate development stats for the community is a good investment of a maintainer&#39;s time. They let us show appreciation, check our own biases and, by carefully selecting the metrics, encourage good behavior. My hacky code is available on <a href="https://github.com/kuba-moo/ml-stat/" rel="nofollow">GitHub</a>, FWIW, but using mine may go against the benefits of locality? LWN&#39;s code is also available publicly (search for gitdm, IIRC).</p>
]]></content:encoded>
      <author>Jakub Kicinski</author>
      <guid>https://people.kernel.org/read/a/ynibn73h8c</guid>
      <pubDate>Sat, 29 Apr 2023 00:21:42 +0000</pubDate>
    </item>
    <item>
      <title>The ARM32 Scheduling and Kernelspace/Userspace Boundary</title>
      <link>https://people.kernel.org/linusw/the-arm32-scheduling-and-kernelspace-userspace-boundary</link>
      <description>&lt;![CDATA[As of recent I needed to understand how the ARM32 architecture switches control of execution between normal, userspace processes and the kernel processes, such as the init task and the kernel threads. Understanding this invariably involves understanding two aspects of the ARM32 kernel:&#xA;&#xA;How tasks are actually scheduled on ARM32&#xA;How the kernelspace and userspace are actually separated, and thus how we move from one to the other&#xA;&#xA;This is going to require knowledge from some other (linked) articles and a good understanding of ARM32 assembly.&#xA;&#xA;Terminology&#xA;&#xA;With tasks we mean processes, threads and kernel threads. The kernel scheduler see no major difference between these, they are schedulable entities that live on a certain CPU.&#xA;&#xA;Kernel threads are the easiest to understand: in the big computer program that is the kernel, different threads execute on behalf of managing the kernel. They are all instantiated by a special thread called kthreadd -- the kernel thread daemon. They exist for various purposes, one is to provide process context to interrupt threads, another to run workqueues such as delayed work and so on. It is handy for e.g. kernel drivers to be able to hand over execution to a process context that can churn on in the background.&#xA;&#xA;Processes in userspace are in essence executing computer programs, or objects with an older terminology, giving the origin of expressions such as object file format. The kernel will start very few such processes, but modprobe and init (which always has process ID 1) are notable exceptions. Any other userspace processes are started by init. Processes can fork new processes, and it can also create separate threads of execution within itself, and these will become schedulable entities as well, so a certain process (executing computer program) can have concurrency within itself. 
POSIX threads are usually the way this happens, and further abstractions such as the GLib GThread etc exist.&#xA;&#xA;Task pie chart&#xA;A pie chart of tasks according to priority on a certain system produced using CGFreak shows that from a scheduler point of view there are just tasks; any kernel threads or threads spawned from processes just become schedulable task entities.&#xA;&#xA;The userspace is the commonplace name given to a specific context of execution where we execute processes. What defines this context is that it has its own memory context, a unique MMU table, which in the ARM32 case gives each process a huge virtual memory to live in. Its execution is isolated from the kernel and also from other processes, but not from its own threads (typically POSIX threads). To communicate with either the kernel or other userspace processes, it needs to use system calls &#34;syscalls&#34; or emit or receive signals. Both mechanisms are realized as software interrupts. (To communicate with its own spawned threads, shortcuts are available.)&#xA;&#xA;The kernelspace conversely is the context of execution of the operating system, in our case Linux. It has its own memory context (MMU table) but some of the kernel memory is usually also accessible by the userspace processes, and the virtual memory space is shared, so that exceptions can jump directly into kernel code in virtual memory, and the kernel can directly read and write into userspace memory. This is done to facilitate quick communication between the kernel and userspace. Depending on the architecture we are executing Linux on, executing in kernelspace is associated with elevated machine privileges, meaning the operating system can issue certain privileged instructions or otherwise access certain restricted resources. 
The MMU table permissions protect kernel code from being inspected or overwritten by userspace processes.&#xA;&#xA;Background&#xA;&#xA;This separation, along with everything else we take for granted in modern computers and operating systems, was created in the first time-sharing systems such as the CTSS running on the IBM 700/7000 series computers in the late 1950s. The Ferranti Atlas Computer in 1962-1967 and its supervisor program followed shortly after these. The Atlas invented nifty features such as virtual memory and memory-mapped I/O, and was of course also using time-sharing. As can be easily guessed, the designs of these computers and their operating systems (supervisors) inspired the hardware design and operating system designs of later computers such as the PDP-11, where Unix began. This is why Unix-like operating systems such as Linux more or less take all of these features and concepts for granted.&#xA;&#xA;The idea of a supervisor or operating system goes deep into the design of CPUs, so for example the Motorola 68000 CPU had three function code pins routed out on the package, FC2, FC1 and FC0, comprising three bits of system mode, four of these bit combinations representing user data, user program, supervisor data and supervisor program. (These modes even reflect the sectioning of program and supervisor objects into program code or TEXT segments and DATA segments.) In the supervisor mode, FC2 was always asserted. This way physical access to memory-mapped peripherals could be electronically constrained to access only from supervisor mode. Machines such as the Atari ST exploited this possibility, while others such as the Commodore Amiga did not.&#xA;&#xA;All this is said to give you a clear idea why the acronym SVC as in Supervisor Call is used rather than e.g. operating system call or kernel call, which would have been more natural. 
This naming is historical.&#xA;&#xA;Execution Modes or Levels&#xA;&#xA;We will restrict the following discussion to the ARMv4 and later ARM32 architectures, which are what Linux supports.&#xA;&#xA;When it comes to the older CPUs in the ARMv4, ARMv5 and ARMv6 range, these have a special supervisor mode (SVC mode) and a user mode, and as you could guess these two modes are mapped directly to kernelspace and userspace in Linux. In addition to this there are actually 5 additional exception modes for FIQ, IRQ, system mode, abort and undefined, so 7 modes in total! To cut a long story short, all of the modes except the user mode belong to kernelspace.&#xA;&#xA;Apart from restricting certain instructions, the only thing actually separating the kernelspace from userspace is the MMU, which is protecting kernelspace from userspace in the same way that different userspace processes are protected from each other: by using virtual memory to hide physical memory, and in the cases where it is not hidden, using protection bits in the page table to restrict access to certain memory areas. The MMU table can naturally only be altered from supervisor mode and this way it is clear who is in control.&#xA;&#xA;Later versions of the ARM32 CPU, such as the ARMv7, add some further modes and an even deeper secure monitor or just monitor mode.&#xA;&#xA;For reference, these modes in the ARMv8 architecture correspond to &#34;privilege levels&#34;. Here the kernelspace executes at exception level EL1, and userspace at exception level EL0, then there are further EL2 and EL3 &#34;higher&#34; privilege levels. 
EL2 is used for the hypervisor (virtualization) and EL3 is used for a secure monitor that oversees the switch back and forth to the trusted execution environment (TEE), which is a parallel and different operating environment, essentially like a different computer: Linux can interact with it (as can be seen in drivers/tee in the kernel) but it is entirely different from Linux.&#xA;&#xA;These higher privilege levels and the secure mode with its hypervisor and TEE are not always used and may be dormant. Strictly speaking, the security and virtualization functionality is optional, so it is perfectly fine to fabricate ARMv7 silicon without them. To accompany the supervisor call (SVC) on ARMv7, a hypervisor call (HVC) and a secure monitor call (SMC) instruction were added.&#xA;&#xA;Exceptional Events&#xA;&#xA;We discussed that different execution modes pertain to certain exceptions. So let&#39;s recap ARM32 exceptions.&#xA;&#xA;As exceptions go, these happen both in kernelspace and userspace, but they are always handled in kernelspace. If a userspace process for example divides by zero, an exception occurs that takes us into the kernel, pushing state onto the stack along the way, and resuming execution inside the kernel, which will simply terminate the process over this. If the kernel itself divides by zero we get a kernel crash since there is no way out.&#xA;&#xA;The most natural exception is of course a hardware interrupt, such as when a user presses a key or a hard disk signals that a sector of data has been placed in a buffer, or a network card indicates that an ethernet packet is available from the interface.&#xA;&#xA;Additionally, as mentioned previously, most architectures support a special type of software exception that is initiated for carrying out system calls, and on ARM and Aarch64 that is what is these days called the SVC (supervisor call) instruction. This very same instruction -- i.e. 
with the same binary operation code -- was previously called SWI (software interrupt) which makes things a bit confusing at times, especially when reading old documentation and old code, but the assembly mnemonics SVC and SWI have the same semantics. For comparison, on m68k this instruction is named TRAP, on x86 there is the INT instruction, and RISC-V has the ECALL instruction, used among other things for SBI (supervisor binary interface) calls.&#xA;&#xA;In my article about how the ARM32 architecture is set up I talk about the exception vector table, which consists of eight 32-bit entries stored in virtual memory from 0xFFFF0000 to 0xFFFF0020, and it corresponds roughly to everything that can take us from kernelspace to userspace and back.&#xA;&#xA;The transitions occur at these distinct points:&#xA;&#xA;A hardware RESET occurs. This is pretty obvious: we need to abort all user program execution, return to the kernel and take everything offline.&#xA;An undefined instruction is encountered. The program flow cannot continue if this happens and the kernel has to do something about it. The most typical use for this is to implement software fallback for floating-point arithmetic instructions that some hardware may be lacking. These fallbacks will in that case be implemented by the kernel. (Since doing this with a context switch and software fallback in the kernel is expensive, you would normally just compile the program with a compiler that replaces the floating point instructions with software fallbacks to begin with, but not everyone has the luxury of source code and a build environment available, and some have to run pre-compiled binaries with floating point instructions.)&#xA;A software interrupt occurs. This is the most common way that a userspace application issues a system call (supervisor call) into the operating system. As mentioned, on ARM32 this is implemented by the special SVC (aka SWI) instruction that passes a 1-byte parameter to the software interrupt handler.&#xA;A prefetch abort occurs. 
This happens when the instruction pointer runs into unpaged memory, and the virtual memory manager (mm) needs to page in new virtual memory to continue execution. Naturally this is a kernel task.&#xA;A data abort occurs. This is essentially the same as the prefetch abort but the program is trying to access unpaged data rather than unpaged instructions.&#xA;An address exception occurs. This doesn&#39;t happen on modern ARM32 CPUs, because the exception is for when the CPU moves outside the former 26-bit address space on the ARM26 architectures that Linux no longer supports.&#xA;A hardware interrupt occurs - since the operating system handles all hardware, naturally whenever one of these occurs, we have to switch to kernel context. The ARM CPUs have two hardware interrupt lines: IRQ and FIQ. Each can be routed to an external interrupt controller, the most common being the GIC (Generic Interrupt Controller), especially for multicore systems, but many ARM systems use their own, custom interrupt controllers.&#xA;A fault occurs such as through division by zero or other arithmetic fault - the CPU runs into an undefined state and has no idea how to recover and continue. This is also called a processor abort.&#xA;&#xA;That&#39;s all. But these are indeed exceptions. What is the rule? The computer programs that correspond to the kernel and each userspace process have to start somewhere, and then they are executed in time slices, which means that somehow they get interrupted by one of these exceptions and preempted, a procedure that in turn invariably involves transitions back and forth from userspace to kernelspace and back into userspace again.&#xA;&#xA;So how does that actually happen? Let&#39;s look at that next.&#xA;&#xA;Entering Kernelspace&#xA;&#xA;Everything has a beginning. 
I have explained in a previous article how the kernel bootstraps from the function start_kernel() in init/main.c and sets up the architecture including virtual memory to a point where the architecture-neutral parts of the kernel start executing.&#xA;&#xA;Further down start_kernel() we initialize the timers, start the clocksource (the Linux system timeline) and initialize the scheduler so that process scheduling can happen. But nothing really happens, because there are no processes. Then the kernel reaches the end of the start_kernel() function where arch_call_rest_init() is called. This is in most cases a call to rest_init() in the same file (only S390 does anything different) and that in turn actually initializes some processes:&#xA;&#xA;pid = user_mode_thread(kernel_init, NULL, CLONE_FS);&#xA;(...)&#xA;pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);&#xA;&#xA;We create separate threads running the in-kernel functions kernel_init and kthreadd, the kernel thread daemon, which in turn spawns all kernel threads.&#xA;&#xA;The user_mode_thread() or kernel_thread() calls create a new processing context: they both call kernel_clone() which calls copy_process() with NULL as first argument, meaning it will not actually copy any process but instead create a new one. It will create a new task using dup_task_struct() passing current as argument, which is the init task, and thus any new task is eventually derived from the compiled-in init task. Then there is a lot of cloning going on, and we reach copy_thread() which calls back into the architecture to initialize struct thread_info for the new task. 
This is a struct we will look at later, but notice one thing, and that is that when a new kernel or user mode thread is created like this (with a function such as kernel_init passed instead of just forking), the following happens:&#xA;&#xA;memset(childregs, 0, sizeof(struct pt_regs));&#xA;thread-&gt;cpu_context.r4 = (unsigned long)args-&gt;fn_arg;&#xA;thread-&gt;cpu_context.r5 = (unsigned long)args-&gt;fn;&#xA;childregs-&gt;ARM_cpsr = SVC_MODE;&#xA;(...)&#xA;thread-&gt;cpu_context.pc = (unsigned long)ret_from_fork;&#xA;&#xA;fn_arg will be NULL in this case but fn is kernel_init or kthreadd. And we execute in SVC_MODE which is the supervisor mode: as the kernel. Also user mode threads are initialized as supervisor mode tasks to begin with, but they will eventually modify themselves into userspace tasks. Setting the CPU context to ret_from_fork will be significant, so notice this!&#xA;&#xA;Neither of the functions kernel_init nor kthreadd will execute at this point! We will just return. The threads are initialized but nothing is scheduled yet: we have not yet called schedule() a single time, which means nothing happens, because nothing is yet scheduled.&#xA;&#xA;kernel_init is a function in the same file that, as indicated, will initialize the first userspace process. If you inspect this function you will see that it keeps executing some kernel code for quite a while: it waits for kthreadd to finish initialization so the kernel is ready for action, then it will actually do some housekeeping such as freeing up the kernel initmem (functions tagged __init) and only then proceed to run_init_process(). As indicated, this will start the init process using kernel_execve(), usually /sbin/init which will then proceed to spawn all usermode processes/tasks. kernel_execve() will check for supported binary formats and most likely call the ELF loader to process the binary and page in the file into memory from the file system etc. 
If this goes well, it will end with a call to the macro START_THREAD() which in turn wraps the ARM32-specific start_thread() which will, amongst other things, do this:&#xA;&#xA;regs-&gt;ARM_cpsr = USR_MODE;&#xA;(...)&#xA;regs-&gt;ARM_pc = pc &amp; ~1;&#xA;&#xA;So the new userspace process will get pushed into userspace mode by the ELF loader, and that will also set the program counter to wherever the ELF file is set to execute. regs-&gt;ARM_cpsr will be pushed into the CPSR register when the task is scheduled, and we start the first task executing in userspace.&#xA;&#xA;kthreadd on the other hand will execute a perpetual loop starting other kernel daemons as they are placed on a creation list.&#xA;&#xA;But as said: neither is executing.&#xA;&#xA;In order to actually start the scheduling we call schedule_preempt_disabled() which will issue schedule() with preemption disabled: we can schedule tasks, and they will not interrupt each other (preempt) in a fine-granular manner, so the scheduling is more &#34;blocky&#34; at this point. However: we already have the clockevent timer running so that the operating system is now ticking, and new calls to the main scheduler callbacks scheduler_tick() and schedule() will happen from different points in future time, at least at the system tick granularity (HZ) if nothing else happens. We will explain more about this further on in the article.&#xA;&#xA;Until this point we have been running in the context of the Linux init task, which is an elusive hard-coded kernel thread with PID 0 that is defined in init/init_task.c and which I have briefly discussed in a previous article. This task does not even appear in procfs in /proc.&#xA;&#xA;As we call schedule(), the kernel init task will preempt and give way to kthreadd and then to the userspace init process. 
However when the scheduler again schedules the init task with PID 0, we return to rest_init(), and we will call cpu_startup_entry(CPUHP_ONLINE), and that function is in kernel/sched/idle.c and looks like this:&#xA;&#xA;void cpu_startup_entry(enum cpuhp_state state)&#xA;{&#xA;        arch_cpu_idle_prepare();&#xA;        cpuhp_online_idle(state);&#xA;        while (1)&#xA;                do_idle();&#xA;}&#xA;&#xA;That&#39;s right: this function never returns. Nothing ever breaks out of the while(1) loop. All that do_idle() does is to wait until no tasks are scheduling, and then call down into the cpuidle subsystem. This will make the CPU &#34;idle&#34; i.e. sleep, since nothing is going on. Then the loop repeats. The kernel init task, PID 0 or &#34;main() function&#34; that begins at start_kernel() and ends here, will just try to push down the system to idle, forever. So this is the eventual fate of the init task. The kernel has some documentation of the inner loop that assumes that you know this context.&#xA;&#xA;Let&#39;s look closer at do_idle() in the same file, which has roughly this look (the actual code is more complex, but this is the spirit of it):&#xA;&#xA;while (!need_resched()) {&#xA;    local_irq_disable();&#xA;    enter_arch_idle_code();&#xA;    /* here a considerable amount of wall-clock time can pass */&#xA;    exit_arch_idle_code();&#xA;    local_irq_enable();&#xA;}&#xA;(...)&#xA;schedule_idle();&#xA;&#xA;This will spin here until something else needs to be scheduled, meaning the init task has the TIF_NEED_RESCHED bit set, and should be preempted. 
The call to schedule_idle() soon after exiting this loop makes sure that this rescheduling actually happens: this calls right into the scheduler to select a new task and is a variant of the more generic schedule() call which we will see later.&#xA;&#xA;We will look into the details soon, but we see the basic pattern of this perpetual task: see if someone else needs to run, else idle, and when someone else wants to run, stop idling and explicitly yield to whatever task was waiting.&#xA;&#xA;Scheduling the first task&#xA;&#xA;So we know that schedule() has been called once on the primary CPU, and we know that this will set the memory management context to the first task, set the program counter to it and execute it. This is the most brutal approach to having a process scheduled, and we will detail what happens further down.&#xA;&#xA;We must however look at the bigger picture of kernel preemption to get a full understanding of what happens here.&#xA;&#xA;Scheduler model&#xA;A mental model of the scheduler: scheduler_tick() sets the flag TIF_NEED_RESCHED and a later call to schedule() will actually call out to check_and_switch_context() that does the job of switching task.&#xA;&#xA;Scheduler tick and TIF_NEED_RESCHED&#xA;&#xA;As part of booting the kernel in start_kernel() we first initialized the scheduler with a call to sched_init() and the system tick with a call to tick_init() and then the timer drivers using time_init(). 
The time_init() call will go through some loops and hoops and end up initializing and registering the clocksource driver(s) for the system, such as those that can be found in drivers/clocksource.&#xA;&#xA;There will sometimes be only a broadcast timer to be used by all CPUs on the system (the interrupts will need to be broadcast to all the CPUs using IPIs, inter-processor interrupts) and sometimes more elaborate architectures have timers dedicated to each CPU so these can be used individually by each core to plan events and drive the system tick on that specific CPU.&#xA;&#xA;The most suitable timer will also be started as part of the clockevent device(s) being registered. However, its interrupt will not be able to fire until local_irq_enable() is called further down in start_kernel(). After this point the system has a running scheduling tick.&#xA;&#xA;As scheduling happens separately on each CPU, scheduler timer interrupts and rescheduling calls need to be done separately on each CPU as well.&#xA;&#xA;The clockevent drivers can provide a periodic tick, in which case the process will be interrupted after an appropriate number of ticks, or the driver can provide oneshot interrupts, and then it can plan an event further on, avoiding firing interrupts while the task is running just for ticking and switching itself (a shortcut known as NO_HZ).&#xA;&#xA;What we know for sure is that this subsystem always has a new tick event planned for the system. It can happen in 1/HZ seconds if periodic ticks are used, or it can happen several minutes into the future if nothing happens for a while in the system.&#xA;&#xA;When the clockevent eventually fires, in the form of an interrupt from the timer, it calls its own -&gt;event_handler() which is set up by the clockevent subsystem code. When the interrupt happens it will fast-forward the system tick by repetitive calls to do_timer() followed by a call to scheduler_tick(). 
(We reach this point through different paths depending on whether HRTimers and other kernel features are enabled or not.)&#xA;&#xA;As a result of calling scheduler_tick(), some scheduler policy code such as deadline, CFS, etc (this is explained by many others elsewhere) will decide that the current task needs to be preempted, &#34;rescheduled&#34;, and calls resched_curr(rq) on the runqueue for the CPU, which in turn will call set_tsk_need_resched(curr) on the current task, which flags it as ready to be rescheduled.&#xA;&#xA;set_tsk_need_resched() will set the flag TIF_NEED_RESCHED for the task. The flag is implemented as an arch-specific bitfield, in the ARM32 case in arch/arm/include/asm/thread_info.h, and ARM32 has a bitmask version of this flag helpfully named _TIF_NEED_RESCHED that can be used by assembly snippets to check it quickly with a logical AND operation.&#xA;&#xA;This bit having been set does not in any way mean that a new process will start executing immediately. The flag semantically means &#34;at your earliest convenience, yield to another task&#34;. So the kernel waits until it finds an appropriate time to preempt the task, and that time is when schedule() is called.&#xA;&#xA;The Task State and Stack&#xA;&#xA;We mentioned the architecture-specific struct thread_info so let&#39;s hash out where that is actually stored. It is a simpler story than it used to be, because these days the ARM32 thread_info is simply part of the task_struct. The struct task_struct is the central per-task information repository that the generic parts of the Linux kernel holds for a certain task, and paramount to keeping the task state. 
Here is a simplified view that gives you an idea about how much information and pointers it actually contains:&#xA;&#xA;struct task_struct {&#xA;    struct thread_info thread_info;&#xA;    (...)&#xA;    unsigned int state;&#xA;    (...)&#xA;    void *stack;&#xA;    (...)&#xA;    struct mm_struct *mm;&#xA;    (...)&#xA;    pid_t pid;&#xA;    (...)&#xA;};&#xA;&#xA;The struct thread_info, which in our case is a member of task_struct, contains all the architecture-specific aspects of the state.&#xA;&#xA;The task_struct refers to thread_info, but also to a separate piece of memory, void *stack, called the task stack, which is where the task will store its activation records when executing code. The task stack is of size THREAD_SIZE, usually 8KB (2 * PAGE_SIZE). These days, in most systems, the task stack is mapped into the VMALLOC area.&#xA;&#xA;The last paragraph deserves some special mentioning with regards to ARM32 because things changed. Ard Biesheuvel recently first enabled THREAD_INFO_IN_TASK which enabled thread info to be contained in the task_struct and then enabled CONFIG_VMAP_STACK for all systems in the ARM32 kernel. This means that the VMALLOC memory area is used to map and access the task stack. This is good for security reasons: the task stack is a common target for kernel security exploits, and by moving this to the VMALLOC area, which is simply a huge area of virtual memory addresses, and surrounding it below and above with unmapped pages, we will get a page violation if the kernel tries to access memory outside the current task stack!&#xA;&#xA;Task struct&#xA;The task_struct in the Linux kernel is where the kernel keeps a nexus of all information about a certain task, i.e. a certain processing context. It contains .mm, the memory context where all the virtual memory mappings live for the task. The thread_info is inside it, and inside the thread_info is a cpu_context_save.  
It has a task stack of size THREAD_SIZE, which for ARM32 is typically twice the PAGE_SIZE, i.e. 8KB, surrounded by unmapped memory for protection. Again this memory is mapped in the memory context of the process. The split between task_struct and thread_info is such that task_struct is Linux-generic and thread_info is architecture-specific, and they correspond 1-to-1.&#xA;&#xA;Actual Preemption&#xA;&#xA;In my mind, preemption happens when the program counter is actually set to a code segment in a different process, and this will happen at different points depending on how the kernel is configured. This happens as a result of schedule() getting called, and will in essence be a call down to the architecture to switch memory management context and active task. But where and when does schedule() get called?&#xA;&#xA;schedule() can be called for two reasons:&#xA;&#xA;Voluntary preemption: such as when a kernel thread wants to give up its time slice because it knows it cannot proceed for a while. This is the case for most instances of this call that you find in the kernel. In the special case when we start the kernel and call schedule_preempt_disabled() the very first time, we voluntarily preempt the kernel execution of the init task with PID 0 to instead execute whatever is queued and prioritized in the scheduler, and that will be the kthreadd process. Other places can be found by git grepping for calls to cond_resched() or just an explicit call to schedule().&#xA;Forced preemption: this happens when a task is simply scheduled out. This happens to kernel threads and userspace processes alike, when a process has used up its timeslice, and scheduler_tick() has set the TIF_NEED_RESCHED flag. 
And we described in the previous section how this flag gets set from the scheduler tick.&#xA;&#xA;Places where forced preemption happens:&#xA;&#xA;The short answer to the question &#34;where does forced preemption happen?&#34; is &#34;at the end of exception handlers&#34;. Here are the details.&#xA;&#xA;The most classical place for preemption of userspace processes is on the return path of a system call. This happens from arch/arm/kernel/entry-common.S in the assembly snippets for ret_slow_syscall() and ret_fast_syscall(), where the ARM32 kernel makes an explicit call to do_work_pending() in arch/arm/kernel/signal.c. This will issue a call to schedule() if the flag _TIF_NEED_RESCHED is set for the thread, and the kernel will hand over execution to whichever task is prioritized next, no matter whether it is a userspace or kernelspace task. A special case is ret_from_fork which means a new userspace process has been forked and in many cases the parent gets preempted immediately in favor of the new child through this path.&#xA;&#xA;The most common place for preemption is however when returning from a hardware interrupt. Interrupts on ARM32 are handled in assembly in arch/arm/kernel/entry-armv.S with a piece of assembly that saves the processor state for the current CPU into a struct pt_regs and from there just calls the generic interrupt handling code in kernel/irq/handle.c named generic_handle_arch_irq(). This code is used by other archs than ARM32 and will nominally just store the system state and registers in a struct pt_regs record on entry and restore it on exit. 
However when the simplistic code in generic_handle_arch_irq() is done, it exits through the same routines in arch/arm/kernel/entry-common.S as fast and slow syscalls, and we can see that in ret_to_user_from_irq the code will explicitly check for the resched and other flags with ldr r1, [tsk, #TI_FLAGS] and branch to the handler doing do_work_pending(), and consequently preempt to another task instead of returning from an interrupt.&#xA;&#xA;Now study do_work_pending():&#xA;&#xA;do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)&#xA;{&#xA;        /*&#xA;         * The assembly code enters us with IRQs off, (...)&#xA;         */&#xA;&#xA;        do {&#xA;                if (likely(thread_flags &amp; _TIF_NEED_RESCHED)) {&#xA;                        schedule();&#xA;                } else {&#xA;                        (...)&#xA;                }&#xA;                local_irq_disable();&#xA;                thread_flags = read_thread_flags();&#xA;        } while (...);&#xA;        return 0;&#xA;}&#xA;&#xA;Notice the comment: we enter do_work_pending() with local IRQs disabled so we can&#39;t get interrupted in an interrupt (other exceptions can still happen though). Then we likely call schedule() and another thread needs to start to run. When we return after having scheduled another thread we are supposed to proceed to exit the exception handler with interrupts disabled, so that is why the first instruction after the if/else-clause is local_irq_disable() - we might have come back from a kernel thread which was happily executing with interrupts enabled. So disable them. In fact, if you grep for do_work_pending you will see that this looks the same on other architectures with similar setup.&#xA;&#xA;In reality do_work_pending() does a few more things than preemption: it also handles signals between processes and process termination etc. 
But for this exercise we only need to know that it calls schedule() followed by local_irq_disable().&#xA;&#xA;The struct pt_regs should be understood as &#34;processor trace registers&#34; which is another historical naming, much due to its use in tracing. On ARM32 it is in reality 18 32-bit words representing all the registers and status bits of the CPU for a certain task, i.e. the CPU state, including the program counter pc, which is the place where the task was supposed to resume execution, unless it got preempted by schedule(). This way, if we preempt and leave a task behind, the CPU state contains all we need to know to continue where we left off. These pt_regs are stored in the task stack during the call to generic_handle_arch_irq().&#xA;&#xA;The assembly in entry-common.S can be a bit hard to follow, so here are the core essentials for a return path from an interrupt that occurs while we are executing in userspace:&#xA;&#xA;&#x9;(...)&#xA;slow_work_pending:&#xA;&#x9;mov&#x9;r0, sp&#x9;&#x9;&#x9;&#x9;@ &#39;regs&#39;&#xA;&#x9;mov&#x9;r2, why&#x9;&#x9;&#x9;&#x9;@ &#39;syscall&#39;&#xA;&#x9;bl&#x9;do_work_pending&#xA;&#x9;cmp&#x9;r0, #0&#xA;&#x9;beq&#x9;no_work_pending&#xA;&#x9;(...)&#xA;&#xA;ENTRY(ret_to_user_from_irq)&#xA;&#x9;ldr&#x9;r1, [tsk, #TI_FLAGS]&#xA;&#x9;movs&#x9;r1, r1, lsl #16&#xA;&#x9;bne&#x9;slow_work_pending&#xA;no_work_pending:&#xA;&#x9;asm_trace_hardirqs_on save = 0&#xA;&#x9;ct_user_enter save = 0&#xA;&#x9;restore_user_regs fast = 0, offset = 0&#xA;&#xA;We see that when we return from an IRQ, we check the flags in the thread and if any bit is set we branch to execute slow work, which is done by do_work_pending() which will potentially call schedule(), then return, possibly much later, and if all went fine branch back to no_work_pending and restore the usermode registers and continue execution.&#xA;&#xA;Notice that the exception we are returning from here can be the timer interrupt that was handled by the Linux clockevent and driving the scheduling by calling 
scheduler_tick()! This means we can preempt directly on the return path of the interrupt that was triggered by the timer tick. This way the slicing of task time is as precise as it can get: scheduler_tick() gets called by the timer interrupt, and if it sets TIF_NEED_RESCHED a different thread will start to execute on our way out of the exception handler!&#xA;&#xA;The same path will be taken by SVC/SWI software exceptions, so these will also lead to rescheduling if necessary. The routine named restore_user_regs can be found in entry-header.S and it will pretty much do what it says, ending with the following instructions (if we remove quirks and assume slowpath):&#xA;&#xA;&#x9;mov&#x9;r2, sp&#xA;&#x9;(...)&#xA;&#x9;ldmdb&#x9;r2, {r0 - lr}^&#x9;&#x9;&#x9;@ get calling r0 - lr&#xA;&#x9;add&#x9;sp, sp, #\offset + PT_REGS_SIZE&#xA;&#x9;movs&#x9;pc, lr&#x9;&#x9;&#x9;&#x9;@ return &amp; move spsr_svc into cpsr&#xA;&#xA;r2 is set to the stack pointer, where the pt_regs are stored; these are 17 registers plus the CPSR (current program status register). We pull the registers from the stack (including r2 which gets overwritten) -- NOTE: the little caret (^) after the ldmdb instruction means &#34;also load CPSR from the stack&#34; -- then move the stack pointer past the saved registers and return.&#xA;&#xA;Using the exceptions as a point for preemption is natural: exceptions by their very nature are designed to store the processor state before jumping to the exception handler, and it is strictly defined how to store this state into memory, such as onto the per-task task stack, and how to reliably restore it at the end of an exception. So this is a good point to do something else, such as switch to something completely different.&#xA;&#xA;Also notice that this must happen at the end of the interrupt (exception) handler. 
You can probably imagine what would happen on a system with level-triggered interrupts if we would, say, preempt at the beginning of the interrupt instead of the end: we would not reach the hardware interrupt handler, and the interrupt would not be cleared. Instead, we handle the exception, and then when we are done we optionally check if preemption should happen right before returning to the interrupted task.&#xA;&#xA;But let&#39;s not skip the last part of what schedule() does.&#xA;&#xA;Setting the Program Counter&#xA;&#xA;So we now know a few places where the system can preempt and on ARM32 we see that this mostly happens in the function named do_work_pending() which in turn will call schedule() for us.&#xA;&#xA;The scheduler&#39;s schedule() call is supposed to very quickly select a process to run next. Eventually it will call context_switch() in kernel/sched/core.c, which in turn will do essentially two things:&#xA;&#xA;Check if the next task has a unique memory management context (next-&gt;mm is not NULL) and in that case switch the memory management context to the next task. This means updating the MMU to use a different MMU table. Kernel threads do not have any unique memory management context so for those we can just keep the previous context (the kernel virtual memory is mapped into all processes on ARM32 so we can just go on). 
If the memory management context does switch, we call switch_mm_irqs_off() which in the ARM32 case is just defined to the ARM32-specific switch_mm() which will call the ARM32-specific check_and_switch_context() -- NOTE that this function for any system with MMU is hidden in the arch/arm/include/asm/mmu_context.h header file -- which in turn does one of two things:&#xA;  If interrupts are disabled, we will just set mm-&gt;context.switch_pending = 1 so that the memory management context switch will happen at a later time when we are running with interrupts enabled, because it would be very costly to switch task memory context on ARM32 with interrupts disabled on certain VIVT (virtually indexed, virtually tagged) cache types, and this in turn would cause unpredictable IRQ latencies on these systems. This concerns some ARMv6 cores. The reason why interrupts would be disabled in a schedule() call is that it will be holding a runqueue lock, which in turn disables interrupts. Just like the comment in the code says, this will be done later in the arch-specific finish_arch_post_lock_switch() which is implemented right below and gets called right after dropping the runqueue lock.&#xA;  If interrupts are not disabled, we will immediately call cpu_switch_mm(). This is a per-cpu callback which is written in assembly for each CPU as cpu_NNNN_switch_mm() inside arch/arm/mm/proc-NNNN.S. For example, all v7 CPUs have the cpu_v7_switch_mm() in arch/arm/mm/proc-v7.S.&#xA;Switch context (such as the register states and stack) to the new task by calling switch_to() with the new task and the previous one as parameters. In most cases this latches to an architecture-specific __switch_to(). In the ARM32 case, this routine is written in assembly and can be found in arch/arm/kernel/entry-armv.S.&#xA;&#xA;Now the final details happen in __switch_to() which is supplied the struct thread_info (i.e. 
the architecture-specific state) for both the current and the previous task:&#xA;&#xA;We store the registers of the current task in the task stack, at the TI_CPU_SAVE index of struct thread_info, which corresponds to the .cpu_context entry in the struct, which is in turn a struct cpu_context_save, which is 12 32-bit values to store r4-r9, sl, fp, sp and pc. This is everything needed to continue as if nothing has happened when we &#34;return&#34; after the schedule() call. I put &#34;return&#34; in quotation marks, because a plethora of other tasks may have run before we actually get back there. You may ask why r0, r1, r2 and r3 are not stored. This will be addressed shortly.&#xA;Then the TLS (Thread Local Storage) settings for the new task are obtained and we issue switch_tls(). On v6 CPUs this has special implications, but in most cases we end up using switch_tls_software() which sets TLS to 0xffff0ff0 for the task. This is a hard-coded value in virtual memory used by the kernel-provided user helpers, which in turn are a few kernel routines &#34;similar to but different from VDSO&#34; that are utilized by the userspace C library. On ARMv7 CPUs that support the thread ID register (TPIDRURO) this will be used to store the struct thread_info pointer, so it cannot be used for TLS on ARMv7. (More on this later.)&#xA;We then broadcast THREAD_NOTIFY_SWITCH using kernel notifiers. These are usually written in C but called from the assembly snippet __switch_to() here. 
A notable use case is that if the task is making use of VFP (the Vector Floating Point unit) then the state of the VFP gets saved here, so that it will be cleanly restored when the task resumes as well.&#xA;&#xA;Then we reach the final step in __switch_to(), which is a bit different depending on whether we use CONFIG_VMAP_STACK or not.&#xA;&#xA;The simple path when we are not using VMAP:ed stacks looks like this:&#xA;&#xA;&#x9;set_current r7, r8&#xA;&#x9;ldmia&#x9;r4, {r4 - sl, fp, sp, pc}&#x9;@ Load all regs saved previously&#xA;&#xA;Here r7 contains a pointer to the next task&#39;s thread_info (which will be somewhere in the kernel virtual memory map), and set_current() will store the pointer to that task in such a way that the CPU can look it up with a few instructions at any point in time. On older non-SMP ARMv4 and ARMv5 CPUs this will simply be the memory location pointed out by the label current but ARMv7 and SMP systems have a dedicated special C5 TPIDRURO thread ID register to store this in the CPU so that the thread_info can be located very quickly. (The only user of this information is, no surprise, the get_current() assembly snippet, but that is in turn called from a lot of places and contexts.)&#xA;&#xA;The next ldmia instruction does the real trick: it loads registers r4 thru sl (r10), fp (r11), sp (r13) and pc (r15) from the location pointed out by r4, which again is the .cpu_context entry in the struct thread_info, the struct cpu_context_save, which is all the context there is, including pc, so the next instruction after this will be whatever pc was inside the struct cpu_context_save. We have switched to the new task and preemption is complete.&#xA;&#xA;But wait a minute. r4 and up you say. Except some registers, so what about r0, r1, r2, r3, r12 (ip) and r14 (lr)? 
Isn&#39;t the task we&#39;re switching to going to miss those registers?&#xA;&#xA;For r0-r3 the short answer is that when we call schedule() explicitly (which only happens inside the kernel) then r0 thru r3 are scratch registers that are free to be &#34;clobbered&#34; during any function call. So since we call schedule() the caller should be prepared that those registers are clobbered anyway. The same goes for the status register CPSR. It&#39;s a function call to inline assembly and not an exception.&#xA;&#xA;And even if we look around the context after a call to schedule(), since we were either (A) starting a brand new task or (B) on our way out of an exception handler for a software or hardware interrupt or (C) explicitly called schedule() when this happened, this just doesn&#39;t matter.&#xA;&#xA;Then r12 is a scratch register and we are not calling down the stack using lr at this point either (we just jump to pc!) so these two do not need to be saved or restored. (On the ARM or VMAP exit path you will find ip and lr being used.)&#xA;&#xA;When starting a completely new task all the contents of struct cpu_context_save will be zero, and the return address will be set to ret_from_fork and then the new task will bootstrap itself in userspace or as a kernel thread anyway.&#xA;&#xA;If we&#39;re on the exit path of an exception handler, we call various C functions and r0 thru r3 are used as scratch registers, meaning that their content doesn&#39;t matter. 
At the end of the exception (which we are close to when we call schedule()) all registers and the CPSR will be restored from the kernel exception stack&#39;s record of pt_regs before the exception returns anyway, which is another good reason to use exception handlers as preemption points.&#xA;&#xA;This is why r0 thru r3 are missing from struct cpu_context_save and need not be preserved.&#xA;&#xA;When the scheduler later on decides to schedule in the task that was interrupted again, we will return to execution right after the schedule(); call. If we were on our way out of an exception in do_work_pending() we will proceed to return from the exception handler, and to the process it will &#34;feel&#34; like it just returned from a hardware or software interrupt, and execution will go on from that point like nothing happened.&#xA;&#xA;Running init&#xA;&#xA;So how does /sbin/init actually come to execute?&#xA;&#xA;We saw that after start_kernel we get to rest_init which creates the thread with pid = user_mode_thread(kernel_init, NULL, CLONE_FS).&#xA;&#xA;Then kernel_init calls on kernel_execve() to execute /sbin/init. It locates an ELF parser to read and page in the file. Then it will eventually issue start_thread() which will set regs-&gt;ARM_cpsr = USR_MODE and regs-&gt;ARM_pc to the start of the executable.&#xA;&#xA;Then this task&#39;s task_struct including memory context etc will be selected after a call to schedule().&#xA;&#xA;But every call to schedule() will return to the point right after a schedule() call, and the only place a userspace task is ever preempted to get schedule() called on it is in the exception handlers, such as when a timer interrupt occurs. Well, this is where we &#34;cheat&#34;:&#xA;&#xA;When we initialized the process in arch/arm/kernel/process.c, we set the program counter to ret_from_fork so we are not going back after any schedule() call: we are going back to ret_from_fork! 
And this is just an exception return path, so this will restore regs-&gt;ARM_cpsr to USR_MODE, and &#34;return from an exception&#34; into whatever is in regs-&gt;ARM_pc, which is the start of the binary program from the ELF file!&#xA;&#xA;So /sbin/init is executed as a consequence of returning from a fake exception through ret_from_fork. From that point on, only real exceptions, such as getting interrupted by the IRQ, will happen to the process.&#xA;&#xA;This is how ARM32 schedules and executes processes.]]&gt;</description>
      <content:encoded><![CDATA[<p>As of recent I needed to understand how the ARM32 architecture switches control of execution between normal, userspace processes and the kernel processes, such as the <em>init task</em> and the kernel threads. Understanding this invariably involves understanding two aspects of the ARM32 kernel:</p>
<ul><li>How tasks are actually scheduled on ARM32</li>
<li>How the kernelspace and userspace are actually separated, and thus how we move from one to the other</li></ul>

<p>This is going to require knowledge from some other (linked) articles and a good understanding of ARM32 assembly.</p>

<h2 id="terminology">Terminology</h2>

<p>With <strong>tasks</strong> we mean <em>processes</em>, <em>threads</em> and <em>kernel threads</em>. The kernel scheduler sees no major difference between these: they are schedulable entities that live on a certain CPU.</p>

<p>Kernel threads are the easiest to understand: in the big computer program that is the kernel, different threads execute on behalf of managing the kernel. They are all instantiated by a special thread called <em>kthreadd</em> — the kernel thread daemon. They exist for various purposes, one is to provide process context to interrupt threads, another to run workqueues such as delayed work and so on. It is handy for e.g. kernel drivers to be able to hand over execution to a process context that can churn on in the background.</p>

<p><strong>Processes</strong> in <em>userspace</em> are in essence executing computer programs, or <em>objects</em> with an older terminology, giving the origin of expressions such as <em>object file format</em>. The kernel will start very few such processes, but <em>modprobe</em> and <em>init</em> (which always has process ID 1) are notable exceptions. Any other userspace processes are started by <em>init</em>. Processes can <em>fork</em> new processes, and a process can also create separate threads of execution within itself, and these will become schedulable entities as well, so a certain process (executing computer program) can have concurrency within itself. <a href="https://en.wikipedia.org/wiki/Pthreads" rel="nofollow">POSIX threads</a> is usually the way this happens and further abstractions such as the GLib <a href="https://docs.gtk.org/glib/struct.Thread.html" rel="nofollow">GThread</a> etc exist.</p>
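<p>As a minimal userspace sketch (mine, not from the kernel sources) of how a process gains concurrency within itself, the POSIX threads API can create a second schedulable entity that shares the same address space:</p>

<pre><code>#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;

static int shared = 41;

/* Executes as a second schedulable task inside the same process */
static void *worker(void *arg)
{
        shared += 1;    /* same address space as the main thread */
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&amp;t, NULL, worker, NULL);
        pthread_join(t, NULL);  /* wait for the other task to finish */
        printf("%d\n", shared); /* prints 42 */
        return 0;
}
</code></pre>

<p>From the scheduler&#39;s point of view, <code>main()</code> and <code>worker()</code> are simply two tasks that happen to share one MMU context.</p>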

<p><img src="https://dflund.se/~triad/images/cgfreak-kernelorg.jpg" alt="Task pie chart">
<em>A pie chart of tasks according to priority on a certain system produced using <a href="https://cgfreak.sourceforge.net/" rel="nofollow">CGFreak</a> shows that from a scheduler point of view there are just tasks; any kernel threads or threads spawned from processes just become schedulable task entities.</em></p>

<p>The <strong>userspace</strong> is the commonplace name given to a specific <em>context of execution</em> where we execute processes. What defines this context is that it has its own memory context, a unique MMU table, which in the ARM32 case gives each process a huge virtual memory to live in. Its execution is isolated from the kernel and also from other processes, but <em>not</em> from its own threads (typically POSIX threads). To communicate with either the kernel or other userspace processes, it needs to use system calls (“syscalls”) or emit or receive signals. Both mechanisms are realized as software interrupts. (To communicate with its own spawned threads, shortcuts are available.)</p>
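<p>To make the syscall mechanism concrete, here is a minimal sketch (not from the article) using the generic <code>syscall(2)</code> wrapper; the C library realizes both calls below with the same software interrupt into the kernel:</p>

<pre><code>#include &lt;stdio.h&gt;
#include &lt;sys/syscall.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
        /* Both of these trap into kernelspace: the libc wrapper and
         * the raw syscall number reach the same kernel entry path.
         */
        pid_t a = getpid();
        pid_t b = syscall(SYS_getpid);

        printf("%s\n", a == b ? "same pid" : "different pid");
        return 0;
}
</code></pre>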

<p>The <strong>kernelspace</strong> conversely is the context of execution of the operating system, in our case Linux. It has its own memory context (MMU table) but some of the kernel memory is usually also accessible by the userspace processes, and the virtual memory space is shared, so that exceptions can jump directly into kernel code in virtual memory, and the kernel can directly read and write into userspace memory. This is done to facilitate quick communication between the kernel and userspace. Depending on the architecture we are executing Linux on, executing in kernelspace is associated with elevated machine privileges, and means the operating system can issue certain privileged instructions or otherwise access certain restricted resources. The MMU table permissions protect kernel code from being inspected or overwritten by userspace processes.</p>
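<p>The pointer validation that follows from this arrangement can be observed from userspace (a hedged sketch with an arbitrarily chosen bad address): when the kernel is asked to read from memory the process has no valid mapping for, it fails cleanly with EFAULT instead of crashing or leaking anything:</p>

<pre><code>#include &lt;errno.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
        /* Ask the kernel to read 4 bytes from an address that is not
         * mapped in this process: the MMU-backed access checks make
         * write(2) fail with EFAULT.
         */
        ssize_t ret = write(STDOUT_FILENO, (const void *)8UL, 4);

        if (ret == -1) {
                if (errno == EFAULT)
                        printf("EFAULT as expected\n");
        }
        return 0;
}
</code></pre>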

<h2 id="background">Background</h2>

<p>This separation, along with everything else we take for granted in modern computers and operating systems, was created in the first <a href="https://en.wikipedia.org/wiki/Time-sharing" rel="nofollow">time-sharing</a> systems such as the <a href="https://en.wikipedia.org/wiki/Compatible_Time-Sharing_System" rel="nofollow">CTSS</a> running on the <a href="https://en.wikipedia.org/wiki/IBM_700/7000_series" rel="nofollow">IBM 700/7000 series computers</a> in the late 1950s. <a href="https://en.wikipedia.org/wiki/Atlas_(computer)" rel="nofollow">The Ferranti Atlas Computer</a> in 1962-1967 and its <a href="https://en.wikipedia.org/wiki/Atlas_Supervisor" rel="nofollow">supervisor program</a> followed shortly after these. The Atlas invented nifty features such as <a href="https://en.wikipedia.org/wiki/Virtual_memory" rel="nofollow">virtual memory</a> and <a href="https://en.wikipedia.org/wiki/Memory-mapped_I/O" rel="nofollow">memory-mapped I/O</a>, and was of course also using time-sharing. As can be easily guessed, the designs of these computers and operating systems (supervisors) inspired the hardware and operating system designs of later computers such as the PDP-11, where Unix began. This is why Unix-like operating systems such as Linux more or less take all of these features and concepts for granted.</p>

<p>The idea of a supervisor or operating system goes deep into the design of CPUs, so for example the <a href="https://en.wikipedia.org/wiki/Motorola_68000" rel="nofollow">Motorola 68000</a> CPU had three function code pins routed out on the package, FC2, FC1 and FC0 comprising three bits of system mode, four of these bit combinations representing user data, user program, supervisor data and supervisor program. (These modes even reflect the sectioning of program and supervisor objects into program <a href="https://en.wikipedia.org/wiki/Code_segment" rel="nofollow">code or TEXT segments</a> and <a href="https://en.wikipedia.org/wiki/Data_segment" rel="nofollow">DATA segments</a>.) In the supervisor mode, FC2 was always asserted. This way physical access to memory-mapped peripherals could be electronically constrained to access only from supervisor mode. Machines such as the Atari ST exploited this possibility, while others such as the Commodore Amiga did not.</p>

<p>All this said to give you a clear idea why the acronym SVC as in <em>Supervisor Call</em> is used rather than e.g. <em>operating system call</em>  or <em>kernel call</em> which would have been more natural. This naming is historical.</p>

<h3 id="execution-modes-or-levels">Execution Modes or Levels</h3>

<p>We will restrict the following discussion to the ARMv4 and later ARM32 architectures which is what Linux supports.</p>

<p>When it comes to the <a href="https://developer.arm.com/documentation/ddi0100/latest/" rel="nofollow">older CPUs in the ARMv4, ARMv5 and ARMv6 range</a> these have a special <strong>supervisor mode</strong> (SVC mode) and a <strong>user mode</strong>, and as you could guess these two modes are mapped directly to <strong>kernelspace</strong> and <strong>userspace</strong> in Linux. In addition to this there are actually 5 additional <strong>exception modes</strong> for FIQ, IRQ, system mode, abort and undefined, so 7 modes in total! To cut a long story short, all of the modes except the <em>user mode</em> belong to <strong>kernelspace</strong>.</p>

<p>Apart from restricting certain instructions, the only thing actually separating the kernelspace from userspace is the MMU, which is protecting kernelspace from userspace in the same way that different userspace processes are protected from each other: by using virtual memory to hide physical memory, and in the cases where it is not hidden: using protection bits in the page table to restrict access to certain memory areas. The MMU table can naturally only be altered from <em>supervisor mode</em> and this way it is clear who is in control.</p>

<p>The later versions of the ARM32 CPU, the ARMv7, add some further modes: a hypervisor mode (HYP) used for virtualization, and an even deeper <strong>secure monitor</strong> or just <strong>monitor</strong> mode.</p>

<p>For reference, these modes in the ARMv8 architecture correspond to “privilege levels”. Here the kernelspace executes at exception level EL1, and userspace at exception level EL0, then there are further EL2 and EL3 “higher” privilege levels. EL2 is used for <em>hypervisor</em> (virtualization) and EL3 is used for a <em>secure monitor</em> that oversees the switch back and forth to the <em>trusted execution environment</em> (TEE), which is a parallel and different operating environment, essentially like a different computer: Linux can interact with it (as can be seen in <code>drivers/tee</code> in the kernel) but it is a different thing than Linux entirely.</p>

<p>These higher privilege levels and the secure mode with its hypervisor and TEE are not always used and may be dormant. Strictly speaking, the security and virtualization functionality is optional, so it is perfectly fine to fabricate ARMv7 silicon without them. To accompany the supervisor call (SVC) on ARMv7, a hypervisor call (HVC) and a secure monitor call (SMC) instruction were added.</p>

<h2 id="exceptional-events">Exceptional Events</h2>

<p>We discussed that different execution modes pertain to certain exceptions. So let&#39;s recap ARM32 exceptions.</p>

<p>As exceptions go, these happen both in kernelspace and userspace, but they are always handled in kernelspace. If a userspace process for example divides by zero, an exception occurs that takes us into the kernel, all the time pushing state onto the stack, and resuming execution inside the kernel, which will simply terminate the process. If the kernel itself divides by zero we get a kernel crash since there is no way out.</p>

<p>The most natural exception is of course a hardware interrupt, such as when a user presses a key or a hard disk signals that a sector of data has been placed in a buffer, or a network card indicates that an ethernet packet is available from the interface.</p>

<p>Additionally, as mentioned previously, most architectures support a special type of software exception that is initiated for carrying out system calls, and on ARM and Aarch64 that is what is these days called the <strong>SVC</strong> (supervisor call) instruction. This very same instruction — i.e. with the same binary operation code — was previously called <strong>SWI</strong> (software interrupt) which makes things a bit confusing at times, especially when reading old documentation and old code, but the assembly mnemonics SVC and SWI have the same semantics. For comparison on m68k this instruction is named <strong>TRAP</strong>, on x86 there is the <strong>INT</strong> instruction, and RISC-V has the <strong>ECALL</strong> instruction, used among other things for SBI (supervisor binary interface) calls.</p>

<p>In <a href="https://people.kernel.org/linusw/setting-up-the-arm32-architecture-part-2" rel="nofollow">my article about how the ARM32 architecture is set up</a> I talk about the exception vector table which is eight 32-bit pointers stored in virtual memory from 0xFFFF0000 to 0xFFFF0020, and it corresponds roughly to everything that can take us from userspace to kernelspace and back.</p>

<p>The transitions occurs at these distinct points:</p>
<ul><li>A hardware <strong>RESET</strong> occurs. This is pretty obvious: we need to abort all user program execution, return to the kernel and take everything offline.</li>
<li>An <strong>undefined instruction</strong> is encountered. The program flow cannot continue if this happens and the kernel has to do something about it. The most typical use for this is to implement software fallback for floating-point arithmetic instructions that some hardware may be lacking. These fallbacks will in that case be implemented by the kernel. (Since doing this with a context switch and software fallback in the kernel is expensive, you would normally just compile the program with a compiler that replaces the floating point instructions with software fallbacks to begin with, but not everyone has the luxury of source code and build environment available and some have to run pre-compiled binaries with floating point instructions.)</li>
<li>A <strong>software interrupt</strong> occurs. This is the most common way that a userspace application issues a system call (supervisor call) into the operating system. As mentioned, on ARM32 this is implemented by the special <strong>SVC</strong> (aka <strong>SWI</strong>) instruction that passes a 1-byte parameter to the software interrupt handler.</li>
<li>A <strong>prefetch abort</strong> occurs. This happens when the instruction pointer runs into unpaged memory, and the virtual memory manager (mm) needs to page in new virtual memory to continue execution. Naturally this is a kernel task.</li>
<li>A <strong>data abort</strong> occurs. This is essentially the same as the prefetch abort but the program is trying to access unpaged data rather than unpaged instructions.</li>
<li>An <strong>address exception</strong> occurs. This doesn&#39;t happen on modern ARM32 CPUs, because the exception is for when the CPU moves outside the former 26-bit address space on ARM26 architectures that Linux no longer supports.</li>
<li>A <strong>hardware interrupt</strong> occurs – since the operating system handles all hardware, naturally whenever one of these occur, we have to switch to kernel context. The ARM CPUs have two hardware interrupt lines: <strong>IRQ</strong> and <strong>FIQ</strong>. Each can be routed to an external <em>interrupt controller</em>, the most common being the GIC (Global Interrupt Controller) especially for multicore systems, but many ARM systems use their own, custom interrupt controllers.</li>
<li>A <strong>fault</strong> occurs such as through division by zero or other arithmetic fault – the CPU runs into an undefined state and has no idea how to recover and continue. This is also called a <em>processor abort</em>.</li></ul>
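<p>The abort-driven paging in the list above is easy to provoke deliberately from userspace (a sketch of mine, not from the article): a fresh anonymous <code>mmap(2)</code> region has no physical pages behind it yet, and the very first access raises a fault that the kernel services transparently before resuming the program:</p>

<pre><code>#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;

int main(void)
{
        /* Virtual memory only: no physical page is backing this yet */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        /* First access: a fault takes us into the kernel, a zeroed
         * page is mapped in, and execution resumes here as if
         * nothing had happened.
         */
        p[0] = &#39;x&#39;;
        printf("%c\n", p[0]);

        munmap(p, 4096);
        return 0;
}
</code></pre>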

<p>That&#39;s all. But these are indeed exceptions. What is the rule? The computer programs that correspond to the kernel and each userspace process have to start somewhere, and then they are executed in time slices, which means that somehow they get interrupted by one of these exceptions and <strong>preempted</strong>, a procedure that in turn invariably involves transitions back and forth from userspace to kernelspace and back into userspace again.</p>

<p>So how does that actually happen? Let&#39;s look at that next.</p>

<h2 id="entering-kernelspace">Entering Kernelspace</h2>

<p>Everything has a beginning. I have explained <a href="https://people.kernel.org/linusw/setting-up-the-arm32-architecture-part-2" rel="nofollow">in a previous article</a> how the kernel bootstraps from the function <strong>start_kernel()</strong> in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/init/main.c" rel="nofollow"><code>init/main.c</code></a> and sets up the architecture including virtual memory to a point where the architecture-neutral parts of the kernel starts executing.</p>

<p>Further down <strong>start_kernel()</strong> we initialize the timers, start <a href="https://docs.kernel.org/timers/timekeeping.html" rel="nofollow">the clocksource</a> (the Linux system timeline) and initialize the scheduler so that process scheduling can happen. But nothing really happens, because there are no processes. Then the kernel reaches the end of the <strong>start_kernel()</strong> function where <strong>arch_call_rest_init()</strong> is called. This is in most cases a call to <strong>rest_init()</strong> in the same file (only S390 does anything different) and that in turn actually initializes some processes:</p>

<pre><code>pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
(...)
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
</code></pre>

<p>We create separate threads running the in-kernel functions <em>kernel_init</em> and <em>kthreadd</em>, the latter being the kernel thread daemon, which in turn spawns all kernel threads.</p>

<p>The <em>user_mode_thread()</em> or <em>kernel_thread()</em> calls create a new processing context: they both call <em>kernel_clone()</em> which calls <em>copy_process()</em> with NULL as first argument, meaning it will not actually copy any process but instead create a new one. It will create a new task using <em>dup_task_struct()</em> passing <em>current</em> as argument, which is the init task and thus any new task is eventually derived from the compiled-in init task. Then there is a lot of cloning going on, and we reach <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/kernel/process.c#n236" rel="nofollow"><strong>copy_thread()</strong></a> which calls back into the architecture to initialize <code>struct thread_info</code> for the new task. This is a struct we will look at later, but notice one thing, and that is that when a new kernel or user mode thread is created like this (with a function such as kernel_init passed instead of just forking), the following happens:</p>

<pre><code>memset(childregs, 0, sizeof(struct pt_regs));
thread-&gt;cpu_context.r4 = (unsigned long)args-&gt;fn_arg;
thread-&gt;cpu_context.r5 = (unsigned long)args-&gt;fn;
childregs-&gt;ARM_cpsr = SVC_MODE;
(...)
thread-&gt;cpu_context.pc = (unsigned long)ret_from_fork;
</code></pre>

<p><em>fn_arg</em> will be NULL in this case but <em>fn</em> is kernel_init or kthreadd. And we execute in SVC_MODE which is the supervisor mode: as the kernel. Also user mode threads are initialized as supervisor mode tasks to begin with, but they will eventually modify themselves into userspace tasks. Setting the CPU context to <strong>ret_from_fork</strong> will be significant, so notice this!</p>
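<p>The same cloning machinery is what a userspace process reaches through <code>fork(2)</code>, and the child likewise begins life by passing through <strong>ret_from_fork</strong>. A minimal sketch (not from the article) of the resulting pair of schedulable tasks:</p>

<pre><code>#include &lt;stdio.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
        pid_t pid = fork();     /* clone the current task */

        if (pid == 0) {
                /* Child: a brand new task scheduled independently */
                printf("child\n");
                return 0;
        }
        /* Parent: wait until the child task has exited */
        waitpid(pid, NULL, 0);
        printf("parent\n");
        return 0;
}
</code></pre>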

<p>Neither of the functions <em>kernel_init</em> or <em>kthreadd</em> will execute at this point! We will just return. The threads are initialized but nothing is scheduled yet: we have not yet called <strong>schedule()</strong> a single time, so neither of them gets to run.</p>

<p><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/init/main.c#n1517" rel="nofollow"><strong>kernel_init</strong></a> is a function in the same file that, as indicated, will initialize the first userspace process. If you inspect this function you will see that it keeps executing some kernel code for quite a while: it waits for <em>kthreadd</em> to finish initialization so the kernel is ready for action, then it will actually do some housekeeping such as freeing up the kernel initmem (functions tagged __init) and only then proceed to <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/init/main.c#n1430" rel="nofollow"><strong>run_init_process()</strong></a>. As indicated, this will start the <strong>init</strong> process using <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/exec.c#n1969" rel="nofollow"><strong>kernel_execve()</strong></a>, usually <code>/sbin/init</code> which will then proceed to spawn all usermode processes/tasks. <em>kernel_execve()</em> will check for supported binary formats and most likely call <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/binfmt_elf.c#n823" rel="nofollow">the ELF loader</a> to process the binary and page in the file into memory from the file system etc. If this goes well, it will end with a call to the macro START_THREAD() which in turn wraps the ARM32-specific <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/include/asm/processor.h#n52" rel="nofollow"><strong>start_thread()</strong></a> which will, amongst other things, do this:</p>

<pre><code>regs-&gt;ARM_cpsr = USR_MODE;
(...)
regs-&gt;ARM_pc = pc &amp; ~1;
</code></pre>

<p>So the new userspace process will get pushed into userspace mode by the ELF loader, and that will also set the program counter to wherever the ELF file is set to execute. <code>regs-&gt;ARM_cpsr</code> will be pushed into the CPSR register when the task is scheduled, and we start the first task executing in userspace.</p>
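<p>From userspace, this whole path (<em>kernel_execve()</em>, the ELF loader and <em>start_thread()</em>) is what <code>execve(2)</code> triggers. A hedged sketch, assuming a <code>/bin/echo</code> binary exists on the system, of replacing the current program with a new ELF image:</p>

<pre><code>#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
        char *argv[] = { "/bin/echo", "hello", NULL };
        char *envp[] = { NULL };

        /* On success this never returns: the kernel parses the new
         * ELF file and sets the program counter to its entry point.
         */
        execve("/bin/echo", argv, envp);

        perror("execve");       /* only reached on failure */
        return 1;
}
</code></pre>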

<p><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/kthread.c#n709" rel="nofollow"><strong>kthreadd</strong></a> on the other hand will execute a perpetual loop starting other kernel daemons as they are placed on a creation list.</p>

<p>But as said: neither is executing.</p>

<p>In order to actually start the scheduling we call <strong>schedule_preempt_disabled()</strong> which will issue <strong>schedule()</strong> with preemption disabled: we can schedule tasks, and they will not interrupt each other (preempt) in a fine-grained manner, so the scheduling is more “blocky” at this point. However: we already have <a href="https://docs.kernel.org/timers/timekeeping.html" rel="nofollow">the clockevent timer</a> running so that the operating system is now ticking, and new calls to the main scheduler callbacks scheduler_tick() and schedule() will happen from different points in future time, at least at the system tick granularity (HZ) if nothing else happens. We will explain more about this further on in the article.</p>

<p>Until this point we have been running in the context of the Linux <em>init task</em>, an elusive hard-coded kernel thread with PID 0 that is defined in <code>init/init_task.c</code> and which I have <a href="https://people.kernel.org/linusw/setting-up-the-arm32-architecture-part-1" rel="nofollow">briefly discussed in a previous article</a>. This task does not even appear in procfs in <code>/proc</code>.</p>

<p>As we call schedule(), the kernel <em>init task</em> will preempt and give way to <strong>kthreadd</strong> and then to the userspace <strong>init</strong> process. However when the scheduler again schedules the <em>init task</em> with PID 0, we return to rest_init(), and we will call <strong>cpu_startup_entry(CPUHP_ONLINE)</strong> and that function is in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/idle.c" rel="nofollow"><code>kernel/sched/idle.c</code></a> and looks like this:</p>

<pre><code>void cpu_startup_entry(enum cpuhp_state state)
{
        arch_cpu_idle_prepare();
        cpuhp_online_idle(state);
        while (1)
                do_idle();
}
</code></pre>

<p>That&#39;s right: this function never returns. Nothing ever breaks out of the while(1) loop. All that <strong>do_idle()</strong> does is wait until no tasks need to run, and then call down into the cpuidle subsystem. This will make the CPU “idle”, i.e. sleep, since nothing is going on. Then the loop repeats. The kernel <em>init task</em>, PID 0 or “main() function” that begins at start_kernel() and ends here, will just try to push the system down to idle, forever. So this is the eventual fate of the <em>init task</em>. The kernel has <a href="https://docs.kernel.org/scheduler/sched-arch.html" rel="nofollow">some documentation</a> of the inner loop that assumes that you know this context.</p>

<p>Let&#39;s look closer at <strong>do_idle()</strong> in the same file, which has roughly this look (the actual code is more complex, but this is the spirit of it):</p>

<pre><code>while (!need_resched()) {
    local_irq_disable();
    enter_arch_idle_code();
    /* here a considerable amount of wall-clock time can pass */
    exit_arch_idle_code();
    local_irq_enable();
}
(...)
schedule_idle();
</code></pre>

<p>This will spin here until something else needs to be scheduled, meaning the <em>init task</em> has the TIF_NEED_RESCHED bit set, and should be preempted. The call to <strong>schedule_idle()</strong> soon after exiting this loop makes sure that this rescheduling actually happens: this calls right into the scheduler to select a new task and is a variant of the more generic <strong>schedule()</strong> call which we will see later.</p>

<p>We will look into the details soon, but we can already see the basic pattern of this perpetual task: check whether someone else needs to run, otherwise idle; when someone else wants to run, stop idling and explicitly yield to whatever task was waiting.</p>
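<p>This check-if-someone-needs-to-run-else-idle pattern can be modeled in a few lines of plain C. Everything here is invented for illustration and only mimics the shape of do_idle(); a fake "tick" stands in for the real timer interrupt that sets TIF_NEED_RESCHED:</p>

```c
#include <assert.h>
#include <stdbool.h>

/* All of these names are invented for illustration. */
static bool idle_need_resched;  /* plays the role of TIF_NEED_RESCHED */
static int idle_cycles;         /* how many times we "slept" */
static int idle_yields;         /* plays the role of schedule_idle() calls */

/* A tick source would set the flag from an interrupt; here we fake
 * one that fires after three idle cycles. */
static void fake_tick(void)
{
    if (idle_cycles >= 3)
        idle_need_resched = true;
}

/* The shape of do_idle(): sleep until someone needs to run, then yield. */
static void fake_do_idle(void)
{
    while (!idle_need_resched) {
        idle_cycles++;  /* stands in for the arch idle code / CPU sleep */
        fake_tick();
    }
    idle_yields++;      /* stands in for schedule_idle() */
}
```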

<h2 id="scheduling-the-first-task">Scheduling the first task</h2>

<p>So we know that <strong>schedule()</strong> has been called once on the primary CPU, and we know that this will set the memory management context to the first task, set the program counter to it and execute it. This is the most brutal approach to having a process scheduled, and we will detail what happens further down.</p>

<p>We must however look at the bigger picture of kernel preemption to fully understand what happens here.</p>

<p><img src="https://dflund.se/~triad/images/scheduler.jpg" alt="Scheduler model">
<em>A mental model of the scheduler: scheduler_tick() sets the flag TIF_NEED_RESCHED and a later call to schedule() will actually call out to check_and_switch_context() that does the job of switching task.</em></p>

<h2 id="scheduler-tick-and-tif-need-resched">Scheduler tick and TIF_NEED_RESCHED</h2>

<p>As part of booting the kernel in <strong>start_kernel()</strong> we first initialized the scheduler with a call to <strong>sched_init()</strong> and the system tick with a call to <strong>tick_init()</strong> and then the timer drivers using <strong>time_init()</strong>. The time_init() call will go through some loops and hoops and end up initializing and registering the clocksource driver(s) for the system, such as those that can be found in <code>drivers/clocksource</code>.</p>

<p>There will sometimes be only a broadcast timer used by all CPUs on the system (its interrupts will need to be broadcast to all the CPUs using IPIs, inter-processor interrupts), and sometimes more elaborate architectures have timers dedicated to each CPU, so these can be used individually by each core to plan events and drive the system tick on that specific CPU.</p>

<p>The most suitable timer will also be started as part of the clockevent device(s) being registered. However, its interrupt will not be able to fire until <strong>local_irq_enable()</strong> is called further down in start_kernel(). After this point the system has a running scheduling tick.</p>

<p>As scheduling happens separately on each CPU, scheduler timer interrupts and rescheduling calls need to be made separately on each CPU as well.</p>

<p>The clockevent drivers can provide a periodic tick, in which case the running process will be interrupted after an appropriate number of ticks, or the driver can provide oneshot interrupts, in which case it can plan an event further into the future and avoid firing interrupts just for ticking while a task is running (a shortcut known as NO_HZ).</p>

<p>What we know for sure is that this subsystem always has a new tick event planned for the system. It can happen in 1/HZ seconds if periodic ticks are used, or it can happen several minutes into the future if nothing happens for a while in the system.</p>
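<p>The catch-up arithmetic after such a long sleep is simple: count how many whole tick periods fit into the elapsed time. A small sketch, with an assumed HZ of 100 (the real value is a kernel configuration choice, and the names are made up):</p>

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_HZ 100ULL                                /* assumed tick rate */
#define FAKE_NSEC_PER_SEC 1000000000ULL
#define FAKE_TICK_NSEC (FAKE_NSEC_PER_SEC / FAKE_HZ)  /* 10 ms per tick */

/* How many whole tick periods passed since the last tick, i.e. how
 * far do_timer() would have to fast-forward jiffies after a long
 * NO_HZ sleep. Illustrative only. */
static uint64_t ticks_to_catch_up(uint64_t elapsed_nsec)
{
    return elapsed_nsec / FAKE_TICK_NSEC;
}
```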

<p>When the clockevent eventually fires, in the form of an interrupt from the timer, it calls its own <code>-&gt;event_handler()</code> which is set up by the clockevent subsystem code. When the interrupt happens it will fast-forward the system tick by repetitive calls to <strong>do_timer()</strong> followed by a call to <strong>scheduler_tick()</strong>. (We reach this point through different paths depending on whether HRTimers and other kernel features are enabled or not.)</p>

<p>As a result of calling scheduler_tick(), some scheduler policy code such as deadline, CFS, etc (this is explained by many others elsewhere) will decide that the current task needs to be preempted, “rescheduled” and calls <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c#n1021" rel="nofollow"><strong>resched_curr(rq)</strong></a> on the runqueue for the CPU, which in turn will call set_tsk_need_resched(curr) on the current task, which flags it as ready to be rescheduled.</p>

<p>set_tsk_need_resched() will set the flag TIF_NEED_RESCHED for the task. The flag is implemented as an arch-specific bitfield, in the ARM32 case in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/include/asm/thread_info.h#n120" rel="nofollow"><code>arch/arm/include/asm/thread_info.h</code></a> and ARM32 has a bitmask version of this flag helpfully named _TIF_NEED_RESCHED that can be used by assembly snippets to check it quickly with a logical AND operation.</p>
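<p>The bit-number/bitmask pair can be mimicked in plain C. The bit position below is made up for illustration; the real values live in the arch-specific thread_info header:</p>

```c
#include <assert.h>

/* The bit position is invented; the real number is arch-specific. */
#define FAKE_TIF_NEED_RESCHED   1
#define FAKE__TIF_NEED_RESCHED  (1UL << FAKE_TIF_NEED_RESCHED)

struct fake_thread_info {
    unsigned long flags;
};

/* Plays the role of set_tsk_need_resched(): just set the bit. */
static void fake_set_need_resched(struct fake_thread_info *ti)
{
    ti->flags |= FAKE__TIF_NEED_RESCHED;
}

/* The quick test assembly can make: one logical AND against the mask. */
static int fake_need_resched(const struct fake_thread_info *ti)
{
    return (ti->flags & FAKE__TIF_NEED_RESCHED) != 0;
}
```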

<p>This bit having been set does not in any way mean that a new process will start executing immediately. The flag semantically means “at your earliest convenience, yield to another task”. So the kernel waits until it finds an appropriate time to preempt the task, and that time is when <strong>schedule()</strong> is called.</p>

<h2 id="the-task-state-and-stack">The Task State and Stack</h2>

<p>We mentioned the architecture-specific <code>struct thread_info</code> so let&#39;s hash out where that is actually stored. It is a simpler story than it used to be, because these days the ARM32 thread_info is simply part of the task_struct. <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/sched.h#n737" rel="nofollow">The <code>struct task_struct</code></a> is the central per-task information repository that the generic parts of the Linux kernel hold for a certain task, and it is paramount to keeping the task state. Here is a simplified view that gives you an idea about how much information and how many pointers it actually contains:</p>

<pre><code>struct task_struct {
    struct thread_info thread_info;
    (...)
    unsigned int state;
    (...)
    void *stack;
    (...)
    struct mm_struct *mm;
    (...)
    pid_t pid;
    (...)
};
</code></pre>

<p>The <code>struct thread_info</code> which in our case is a member of task_struct contains all the architecture-specific aspects of the state.</p>

<p>The task_struct refers to thread_info, but also to a separate piece of memory <code>void *stack</code> called the <strong>task stack</strong>, which is where the task will store its <a href="https://en.wikipedia.org/wiki/Call_stack#ACTIVATION-RECORD" rel="nofollow">activation records</a> when executing code. The task stack is of size THREAD_SIZE, usually 8KB (2 * PAGE_SIZE). These days, in most systems, the task stack is <a href="https://docs.kernel.org/mm/vmalloced-kernel-stacks.html" rel="nofollow">mapped into the VMALLOC area</a>.</p>

<p>The last paragraph deserves some special mention with regards to ARM32 because things changed. Ard Biesheuvel recently first <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm/Kconfig?id=9c46929e7989efacc1dd0a1dd662a839897ea2b6" rel="nofollow">enabled THREAD_INFO_IN_TASK</a>, which made the thread_info part of the task_struct, and then <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm/Kconfig?id=a1c510d0adc604bb143c86052bc5be48cbcfa17c" rel="nofollow">enabled CONFIG_VMAP_STACK</a> for all systems in the ARM32 kernel. This means that the VMALLOC memory area is used to map and access the task stack. This is good for security reasons: the task stack is a common target for kernel security exploits, and by moving it to the VMALLOC area, which is simply a huge area of virtual memory addresses, and surrounding it below and above with unmapped pages, we will get a page violation if the kernel tries to access memory outside the current task stack!</p>

<p><img src="https://dflund.se/~triad/images/task_stack.jpg" alt="Task struct">
<em>The task_struct in the Linux kernel is where the kernel keeps a nexus of all information about a certain task, i.e. a certain processing context. It contains .mm the memory context where all the virtual memory mappings live for the task. The thread_info is inside it, and inside the thread_info is a cpu_context_save.  It has a task stack of size THREAD_SIZE for ARM32 which is typically twice the PAGE_SIZE, i.e. 8KB, surrounded by unmapped memory for protection. Again this memory is mapped in the memory context of the process. The split between task_struct and thread_info is such that task_struct is Linux-generic and thread_info is architecture-specific and they correspond 1-to-1.</em></p>

<h2 id="actual-preemption">Actual Preemption</h2>

<p>In my mind, preemption happens when the program counter is actually set to a code segment in a different process, and this will happen at different points depending on how the kernel is configured. This happens as a result of <strong>schedule()</strong> getting called, and will in essence be a call down to the architecture to switch memory management context and active task. But where and when does schedule() get called?</p>

<p>schedule() can be called for two reasons:</p>
<ul><li><strong>Voluntary preemption</strong>: such as when a kernel thread wants to give up its time slice because it knows it cannot proceed for a while. This is the case for most instances of this call that you find in the kernel. In the special case when we start the kernel and call <strong>schedule_preempt_disabled()</strong> for the very first time, we voluntarily preempt the kernel execution of the <em>init task</em> with PID 0 to instead execute whatever is queued and prioritized in the scheduler, and that will be the <em>kthreadd</em> process. Other places can be found by grepping for calls to <strong>cond_resched()</strong> or just an explicit call to <strong>schedule()</strong>.</li>
<li><strong>Forced preemption</strong>: this happens when a task is simply scheduled out. This happens to kernel threads and userspace processes alike, when a process has used up its timeslice and scheduler_tick() has set the TIF_NEED_RESCHED flag. We described in the previous section how this flag gets set from the scheduler tick.</li></ul>
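<p>The voluntary case can be sketched as a worker loop with an explicit preemption point per item. All names here are invented and the flag handling only imitates the spirit of cond_resched(), not its real implementation:</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Invented names throughout; this only imitates the pattern. */
static bool toy_need_resched;
static int toy_schedule_calls;

/* Plays the role of cond_resched(): yield only if someone is waiting. */
static void toy_cond_resched(void)
{
    if (toy_need_resched) {
        toy_schedule_calls++;     /* stands in for calling schedule() */
        toy_need_resched = false; /* cleared once we are running again */
    }
}

/* A long-running worker with a voluntary preemption point per item. */
static int toy_worker(int items)
{
    int done = 0;
    for (int i = 0; i < items; i++) {
        done++;                       /* one unit of work */
        if (i == 2)
            toy_need_resched = true;  /* pretend the tick flagged us */
        toy_cond_resched();
    }
    return done;
}
```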

<p>Places where forced preemption happens:</p>

<p>The short answer to the question “where does forced preemption happen?” is “at the end of exception handlers”. Here are the details.</p>

<p>The most classical place for preemption of userspace processes is on <em>the return path of a system call</em>. This happens from <code>arch/arm/kernel/entry-common.S</code> in the assembly snippets for <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/kernel/entry-common.S#n86" rel="nofollow">ret_slow_syscall() and ret_fast_syscall()</a>, where the ARM32 kernel makes an explicit call to <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/kernel/signal.c#n601" rel="nofollow"><strong>do_work_pending()</strong></a> in <code>arch/arm/kernel/signal.c</code>. This will issue a call to <strong>schedule()</strong> if the flag _TIF_NEED_RESCHED is set for the thread, and the kernel will hand over execution to whichever task is prioritized next, no matter whether it is a userspace or kernelspace task. A special case is <strong>ret_from_fork</strong>, which means a new userspace process has been forked; in many cases the parent gets preempted immediately in favor of the new child through this path.</p>

<p>The most common place for preemption is however when <em>returning from a hardware interrupt</em>. Interrupts on ARM32 are handled in assembly in <code>arch/arm/kernel/entry-armv.S</code> with a piece of assembly that saves the processor state for the current CPU into a struct pt_regs and from there just calls the generic interrupt handling code in <code>kernel/irq/handle.c</code> named <strong>generic_handle_arch_irq()</strong>. This code is used by other archs than ARM32 and will nominally just store the system state and registers in a <code>struct pt_regs</code> record on entry and restore it on exit. However when the simplistic code in <em>generic_handle_arch_irq()</em> is done, it exits through the same routines in  <code>arch/arm/kernel/entry-common.S</code> as fast and slow syscalls, and we can see that in <strong>ret_to_user_from_irq</strong> the code will explicitly check for the resched and other flags with <code>ldr r1, [tsk, #TI_FLAGS]</code> and branch to the handler doing <strong>do_work_pending()</strong>, and consequently preempt to another task <em>instead of</em> returning from an interrupt.</p>

<p>Now study <strong>do_work_pending()</strong>:</p>

<pre><code>do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
{
        /*
         * The assembly code enters us with IRQs off, (...)
         */

        do {
                if (likely(thread_flags &amp; _TIF_NEED_RESCHED)) {
                        schedule();
                } else {
                        (...)
                }
                local_irq_disable();
                thread_flags = read_thread_flags();
        } while (...);
        return 0;
}
</code></pre>

<p>Notice the comment: we enter do_work_pending() with local IRQs disabled so we can&#39;t get interrupted in an interrupt (other exceptions can still happen though). Then we likely call schedule() and another thread needs to start to run. When we return after having scheduled another thread we are supposed to proceed to exit the exception handler with interrupts disabled, and that is why the first instruction after the if/else-clause is <em>local_irq_disable()</em> – we might have come back from a kernel thread which was happily executing with interrupts enabled. So disable them. In fact, if you grep for do_work_pending you will see that this looks the same on other architectures with a similar setup.</p>

<p>In reality <em>do_work_pending()</em> does a few more things than preemption: it also handles signals between processes and process termination etc. But for this exercise we only need to know that it calls schedule() followed by local_irq_disable().</p>

<p>The <code>struct pt_regs</code> should be understood as “processor trace registers” which is another historical naming, much due to its use in tracing. On ARM32 it is in reality 18 32-bit words representing all the registers and status bits of the CPU for a certain task, i.e. the <em>CPU state</em>, including the program counter <em>pc</em>, which is the place where the task was supposed to resume execution, <em>unless</em> it got preempted by schedule(). This way, if we preempt and leave a task behind, the CPU state contains all we need to know to continue where we left off. These pt_regs are stored in the task stack during the call to <em>generic_handle_arch_irq()</em>.</p>
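<p>As a sketch, the layout can be mirrored like this; the indices follow the spirit of the ARM32 ptrace header (an array of 18 words, with the CPSR and the original r0 at the end), though the FAKE_ prefixed names are of course not the kernel's:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mirror of the ARM32 pt_regs: 18 32-bit words holding
 * r0-r15, the CPSR, and the original r0 (kept for syscall restarting). */
struct fake_arm_pt_regs {
    uint32_t uregs[18];
};

enum {
    FAKE_ARM_r0 = 0,      /* r1..r12 follow at indices 1..12 */
    FAKE_ARM_sp = 13,
    FAKE_ARM_lr = 14,
    FAKE_ARM_pc = 15,     /* where the task resumes, unless preempted */
    FAKE_ARM_cpsr = 16,
    FAKE_ARM_ORIG_r0 = 17,
};
```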

<p>The assembly in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/kernel/entry-common.S" rel="nofollow"><code>entry-common.S</code></a> can be a bit hard to follow, so here are the core essentials for a return path from an interrupt that occurs while we are executing in userspace:</p>

<pre><code>	(...)
slow_work_pending:
	mov	r0, sp				@ &#39;regs&#39;
	mov	r2, why				@ &#39;syscall&#39;
	bl	do_work_pending
	cmp	r0, #0
	beq	no_work_pending
	(...)

ENTRY(ret_to_user_from_irq)
	ldr	r1, [tsk, #TI_FLAGS]
	movs	r1, r1, lsl #16
	bne	slow_work_pending
no_work_pending:
	asm_trace_hardirqs_on save = 0
	ct_user_enter save = 0
	restore_user_regs fast = 0, offset = 0
</code></pre>

<p>We see that when we return from an IRQ, we check the flags in the thread and if any bit is set we branch to execute slow work, which is done by <em>do_work_pending()</em> which will potentially call schedule(), then return, possibly much later, and if all went fine branch back to <em>no_work_pending</em> and restore the usermode registers and continue execution.</p>

<p>Notice that the exception we are returning from here can be the timer interrupt that was handled by the Linux clockevent and driving the scheduling by calling scheduler_tick()! This means we can preempt directly on the return path of the interrupt that was triggered by the timer tick. This way the slicing of task time is as precise as it can get: scheduler_tick() gets called by the timer interrupt, and if it sets TIF_NEED_RESCHED a different thread will start to execute on our way out of the exception handler!</p>

<p>The same path will be taken by SVC/SWI software exceptions, so these will also lead to rescheduling if necessary. The routine named <code>restore_user_regs</code> can be found in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/kernel/entry-header.S#n293" rel="nofollow"><code>entry-header.S</code></a> and it will pretty much do what it says, ending with the following instructions (if we remove quirks and assume slowpath):</p>

<pre><code>	mov	r2, sp
	(...)
	ldmdb	r2, {r0 - lr}^			@ get calling r0 - lr
	add	sp, sp, #\offset + PT_REGS_SIZE
	movs	pc, lr				@ return &amp; move spsr_svc into cpsr
</code></pre>

<p>r2 is set to the stack pointer, where pt_regs are stored, these are 17 registers and CPSR (current program status register). We pull the registers from the stack (including r2 which gets overwritten) — <strong>NOTE</strong>: the little caret (<strong>^</strong>) after the <code>ldmdb</code> instruction means “also load CPSR from the stack” — then moves the stackpointer past the saved registers and returns.</p>

<p>Using the exceptions as a point for preemption is natural: exceptions by their very nature are designed to store the processor state before jumping to the exception handler, and it is strictly defined how to store this state into memory such as onto the per-task <em>task stack</em>, and how to reliably restore it at the end of an exception. So this is a good point to do something else, such as switch to something completely different.</p>

<p>Also notice that this <em>must</em> happen at the <em>end</em> of the interrupt (exception) handler. You can probably imagine what would happen on a system with level-triggered interrupts if we were to preempt at the beginning of the interrupt handler instead of at the end: we would not reach the hardware interrupt handler, and the interrupt would not be cleared. Instead, we handle the exception, and then when we are done we optionally check if preemption should happen right before returning to the interrupted task.</p>

<p>But let&#39;s not skip the last part of what schedule() does.</p>

<h2 id="setting-the-program-counter">Setting the Program Counter</h2>

<p>So we now know a few places where the system can preempt and on ARM32 we see that this mostly happens in the function named do_work_pending() which in turn will call schedule() for us.</p>

<p>The scheduler&#39;s <strong>schedule()</strong> call is supposed to very quickly select a process to run next. Eventually it will call <strong>context_switch()</strong> in <code>kernel/sched/core.c</code>, which in turn will do essentially two things:</p>
<ul><li>Check if the next task has a unique memory management context (<code>next-&gt;mm</code> is not NULL) and in that case switch the memory management context to the next task. This means updating the MMU to use a different MMU table. Kernel threads do not have any unique memory management context so for those we can just keep the previous context (the kernel virtual memory is mapped into all processes on ARM32 so we can just go on). If the memory management context does switch, we call <strong>switch_mm_irqs_off()</strong> which in the ARM32 case is just defined to <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/include/asm/mmu_context.h#n117" rel="nofollow">the ARM32-specific <strong>switch_mm()</strong></a> which will call the ARM32-specific <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/include/asm/mmu_context.h#n62" rel="nofollow"><strong>check_and_switch_context()</strong></a> — <em>NOTE</em> that this function for any system with MMU is hidden in the <code>arch/arm/include/asm/mmu_context.h</code> header file — which in turn does one of two things:
<ul><li>If interrupts are disabled, we will just set <code>mm-&gt;context.switch_pending = 1</code> so that the memory management context switch will happen at a later time when we are running with interrupts enabled, because it will be very costly to switch task memory context on ARM32 if interrupts are disabled on certain VIVT (virtually indexed, virtually tagged) cache types, and this in turn would cause unpredictable IRQ latencies on these systems. This concerns some ARMv6 cores. The reason why interrupts would be disabled in a schedule() call is that it will be holding a runqueue lock, which in turn disables interrupts. Just like the comment in the code says, this will be done later in the arch-specific <strong>finish_arch_post_lock_switch()</strong> which is implemented right below and gets called right after dropping the runqueue lock.</li>
<li>If interrupts are <em>not</em> disabled, we will immediately call <strong>cpu_switch_mm()</strong>. This is a per-cpu callback which is written in assembly for each CPU as <strong>cpu_NNNN_switch_mm()</strong> inside <code>arch/arm/mm/proc-NNNN.S</code>. For example, all v7 CPUs have the cpu_v7_switch_mm() in <code>arch/arm/mm/proc-v7.S</code>.</li></ul></li>
<li>Switch context (such as the register states and stack) to the new task by calling <strong>switch_to()</strong> with the new task and the previous one as parameter. In most cases this latches to an architecture-specific <strong>__switch_to()</strong>. In the ARM32 case, this routine is written in assembly and can be found in <code>arch/arm/kernel/entry-armv.S</code>.</li></ul>

<p>Now the final details happens in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/kernel/entry-armv.S#n732" rel="nofollow"><code>__switch_to()</code></a> which is supplied the <code>struct thread_info</code> (i.e. the architecture-specific state) for both the current and the previous task:</p>
<ul><li>We store the registers of the current task in the task stack, at the TI_CPU_SAVE index of <code>struct thread_info</code>, which corresponds to the <code>.cpu_context</code> entry in the struct, which is in turn a <code>struct cpu_context_save</code>, which is 12 32-bit values to store r4-r9, sl, fp, sp and pc. This is everything needed to <em>continue as if nothing has happened</em> when we “return” after the schedule() call. I put “return” in quotation marks, because a plethora of other tasks may have run before we actually get back there. You may ask why r0, r1, r2 and r3 are not stored. This will be addressed shortly.</li>
<li>Then the TLS (Thread Local Storage) settings for the <em>new task</em> are obtained and we issue <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/include/asm/tls.h" rel="nofollow">switch_tls()</a>. On v6 CPUs this has special implications, but in most cases we end up using <em>switch_tls_software()</em> which sets TLS to 0xffff0ff0 for the task. This is a hard-coded value in virtual memory used by the <a href="https://www.kernel.org/doc/html/latest/arm/kernel_user_helpers.html" rel="nofollow">kernel-provided user helpers</a>, which in turn are a few kernel routines “similar to but different from VDSO” that are utilized by the userspace C library. On ARMv7 CPUs that support the thread ID register (TPIDRURO) this will be used to store the <code>struct thread_info</code> pointer, so it cannot be used for TLS on ARMv7. (More on this later.)</li>
<li>We then broadcast THREAD_NOTIFY_SWITCH using kernel notifiers. These are usually written in C but called from the assembly snippet __switch_to() here. A notable use case is that if the task is making use of VFP (the Vectored Floating Point unit) then the state of the VFP gets saved here, so that it will be cleanly restored when the task resumes as well.</li></ul>
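<p>As a sketch, the saved context described in the first bullet looks roughly like this. The field order and the two spare slots follow the description above (callee-saved registers plus sp and pc, 12 words in total); this is an illustrative mirror, not the kernel's header:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mirror of struct cpu_context_save: the callee-saved
 * registers plus sp and pc. Note that r0-r3 are deliberately absent,
 * as discussed below. */
struct fake_cpu_context_save {
    uint32_t r4, r5, r6, r7, r8, r9;
    uint32_t sl;        /* r10 */
    uint32_t fp;        /* r11 */
    uint32_t sp;        /* r13, the kernel task stack pointer */
    uint32_t pc;        /* r15, where execution "resumes" */
    uint32_t extra[2];  /* spare slots, bringing the total to 12 words */
};
```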

<p>Then we reach the final step in __switch_to(), which is a bit different depending on whether we use CONFIG_VMAP_STACK or not.</p>

<p>The simple path when we are <em>not</em> using VMAP:ed stacks looks like this:</p>

<pre><code>	set_current r7, r8
	ldmia	r4, {r4 - sl, fp, sp, pc}	@ Load all regs saved previously
</code></pre>

<p>Here r7 contains a pointer to the <em>next</em> task&#39;s thread_info (which will be somewhere in the kernel virtual memory map), and set_current() will store the pointer to that task in such a way that the CPU can look it up with a few instructions at any point in time. On older non-SMP ARMv4 and ARMv5 CPUs this will simply be the memory location pointed out by the label <code>__current</code>, but ARMv7 and SMP systems have a dedicated special CP15 <code>TPIDRURO</code> <em>thread ID register</em> to store this in the CPU so that the thread_info can be located <em>very</em> quickly. (The only user of this information is, no surprise, the get_current() assembly snippet, but that is in turn called from a <em>lot</em> of places and contexts.)</p>

<p>The next <code>ldmia</code> instruction does the real trick: it loads registers r4 thru sl (r10), fp (r11), sp (r13) <em>and</em> pc (r15) from the location pointed out by r4, which again is the <code>.cpu_context</code> entry in the struct thread_info, the <code>struct cpu_context_save</code>, which is all the context there is <em>including</em> <strong>pc</strong> so the next instruction after this will be whatever <em>pc</em> was inside the <code>struct cpu_context_save</code>. We have switched to the new task and preemption is complete.</p>

<p>But wait a minute: r4 and up, you say? So what about r0, r1, r2, r3, r12 (ip) and r14 (lr)? Isn&#39;t the task we&#39;re switching to going to miss those registers?</p>

<p>For r0-r3 the short answer is that when we call schedule() explicitly (which only happens inside the kernel) then r0 thru r3 are scratch registers that are free to be “clobbered” during any function call. So since we call schedule() the caller should be prepared that those registers are clobbered anyway. The same goes for the status register CPSR. It&#39;s a function call to inline assembly and not an exception.</p>

<p>And even if we look around the context after a call to schedule(), since we were either (A) starting a brand new task or (B) on our way out of an exception handler for a software or hardware interrupt or (C) explicitly called schedule() when this happened, this just <em>doesn&#39;t matter</em>.</p>

<p>Then r12 is a scratch register and we are not calling down the stack using lr at this point either (we just jump to pc!) so these two do not need to be saved or restored. (On the ARM or VMAP exit path you will find ip and lr being used.)</p>

<p>When starting a completely <em>new</em> task all the contents of <code>struct cpu_context_save</code> will be zero, and the return address will be set to <strong>ret_from_fork</strong>, and then the new task will bootstrap itself in userspace or as a kernel thread anyway.</p>

<p>If we&#39;re on the exit path of an exception handler, we call various C functions and r0 thru r3 are used as scratch registers, meaning that their content doesn&#39;t matter. At the end of the exception (which we are close to when we call schedule()) all registers and the CPSR will be restored from the pt_regs record on the kernel exception stack before the exception returns anyway, which is another good reason to use exception handlers as preemption points.</p>

<p>This is why r0 thru r3 are missing from <code>struct cpu_context_save</code> and need not be preserved.</p>

<p>When the scheduler later on decides to schedule in the task that was interrupted again, we will return to execution <em>right after the schedule(); call</em>. If we were on our way out of an exception in <strong>do_work_pending()</strong> we will proceed to return from the exception handler, and to the process it will “feel” like it just returned from a hardware or software interrupt, and execution will go on from that point like nothing happened.</p>

<h2 id="running-init">Running <em>init</em></h2>

<p>So how does <code>/sbin/init</code> actually come to execute?</p>

<p>We saw that after <em>start_kernel</em> we get to <em>rest_init</em> which creates the thread with <code>pid = user_mode_thread(kernel_init, NULL, CLONE_FS)</code>.</p>

<p>Then <em>kernel_init</em> calls <em>kernel_execve()</em> to execute <code>/sbin/init</code>. It locates an ELF parser to read and page in the file. Then it will eventually issue <em>start_thread()</em> which will set <code>regs-&gt;ARM_cpsr = USR_MODE</code> and <code>regs-&gt;ARM_pc</code> to the start of the executable.</p>

<p>Then this task&#39;s <code>task_struct</code>, including memory context etc., will be selected after a call to <em>schedule()</em>.</p>

<p>But every call to schedule() will return to the point right after a schedule() call, and the only place a userspace task is ever preempted to get schedule() called on it is in the exception handlers, such as when a timer interrupt occurs. Well, this is where we “cheat”:</p>

<p>When we initialized the process in <code>arch/arm/kernel/process.c</code>, we set the program counter to <strong>ret_from_fork</strong> so we are not going back after any schedule() call: we are going back to <strong>ret_from_fork</strong>! And this is just an exception return path, so this will restore <code>regs-&gt;ARM_cpsr</code> to USR_MODE, and “return from an exception” into whatever is in <code>regs-&gt;ARM_pc</code>, which is the start of the binary program from the ELF file!</p>
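<p>The “cheat” can be modeled as a tiny context-initialization helper. The names and the address below are made up; only the idea is from the text above: point the saved pc of a fresh task at the fork-return path, rather than at a spot right after a schedule() call:</p>

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_RET_FROM_FORK 0xc0001000u  /* made-up kernel text address */

/* The two saved-context fields that matter for the first switch-in. */
struct toy_cpu_context {
    uint32_t sp;
    uint32_t pc;
};

/* Sketch of the trick: instead of pointing the saved pc at "right
 * after schedule()", a freshly created task gets pointed at the
 * fork-return path, so its very first "return" walks the exception
 * exit code out into userspace. */
static void toy_init_new_task(struct toy_cpu_context *ctx, uint32_t kstack_top)
{
    ctx->sp = kstack_top;           /* child's saved kernel stack pointer */
    ctx->pc = FAKE_RET_FROM_FORK;   /* first "return" goes here */
}
```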

<p>So <code>/sbin/init</code> is executed as a consequence of returning from a fake exception through ret_from_fork. From that point on, only real exceptions, such as getting interrupted by the IRQ, will happen to the process.</p>
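
<p>A condensed pseudocode sketch of the chain described above (simplified and illustrative; the function names are taken from the ARM32 sources discussed in this article, the rest is not literal kernel code):</p>

<pre><code>user_mode_thread(kernel_init, ...)
  copy_thread()                     /* arch/arm/kernel/process.c */
    thread-&gt;cpu_context.pc = ret_from_fork

kernel_init()
  kernel_execve(&#34;/sbin/init&#34;, ...)
    start_thread()
      regs-&gt;ARM_cpsr = USR_MODE
      regs-&gt;ARM_pc   = /* entry point of the ELF binary */

schedule()
  __switch_to()                     /* restores cpu_context, jumps to its pc */
    ret_from_fork                   /* exception return path */
      /* restores regs, &#34;returns from exception&#34; into userspace */
</code></pre>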

<p>This is how ARM32 schedules and executes processes.</p>
]]></content:encoded>
      <author>linusw</author>
      <guid>https://people.kernel.org/read/a/xu6dip25co</guid>
      <pubDate>Tue, 25 Apr 2023 08:40:53 +0000</pubDate>
    </item>
    <item>
      <title>Mounting into mount namespaces</title>
      <link>https://people.kernel.org/brauner/mounting-into-mount-namespaces</link>
      <description>&lt;![CDATA[The original blogpost is at https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html&#xA;&#xA;Early on when the LXD project was started we were clear that we wanted to make it possible to change settings while the container is running.&#xA;On of the very first things that came to our mind was making it possible to insert new mounts into a running container.&#xA;When I was still at Canonical working on LXD we quickly realized that inserting mounts into a running container would require a lot of creativity given the limitations of the api.&#xA;&#xA;Back then the only way to create mounts or change mount option was by using the mount(2) system call.&#xA;The mount system call multiplexes a lot of different operations.&#xA;For example, it doesn&#39;t just allow the creation of new filesystem mounts but also handles bind mounts and mount option changes.&#xA;Mounting is overall a pretty complex operation as it doesn&#39;t just involve path lookup but also needs to handle mount propagation and filesystem specific and generic mount options.&#xA;&#xA;I want to take a look at our legacy solution to this problem and a new approach that I&#39;ve used and that has existed for a while but never talked about widely.&#xA;&#xA;Creative uses of mount(2)&#xA;&#xA;Before openat2(2) came along adding mounts to a container during startup was difficult because there was always the danger of symlink attacks.&#xA;A mount source or target path could be specified containing symlinks that would allow processes in the container to escape to the host filesystem.&#xA;These attacks used to be quite common and there was no straightforward solution available; at least not before the RESOLVE flag namespace of openat2(2) improved things so considerably that symlink attacks on new kernels can be effectively blocked.&#xA;&#xA;But before openat2() symlink attacks when mounting could only be prevented with very careful coding and a rather elaborate algorithm.&#xA;I 
won&#39;t go into too much detail but it is roughly done by verifying each path component in userspace using OPATH file descriptors making sure that the paths point into the container&#39;s rootfs.&#xA;&#xA;But even if you verified that the path is sane and you hold a file descriptor to the last component you still need to solve the problem that mount(2) only operates on paths.&#xA;So you are still susceptible to symlink attacks as soon as you call mount(source, target, ...).&#xA;&#xA;The way we solved this problem was by realizing that mount(2) was perfectly happy to operate on /proc/self/fd/nr paths&#xA;(This is similar to how fexecve() used to work before the addition of the execveat() system call.).&#xA;So we could verify the whole path and then open the last component of the source and target paths at which point we could call mount(&#34;/proc/self/fd/1234&#34;, &#34;/proc/self/fd/5678&#34;, ...).&#xA;&#xA;We immediately thought that if mount(2) allows you to do that then we could easily use this to mount into namespaces.&#xA;So if the container is running it its mount namespace we could just create a bind mount on the host, open the newly created bind mount and then change to the container&#39;s mount namespace (and it&#39;s owning user namespace) and then simply call mount(&#34;/proc/self/fd/1234&#34;, &#34;/mnt&#34;, ...).&#xA;In pseudo C code it would look roughly:&#xA;&#xA;fdmnt = openat(-EBADF, &#34;/opt&#34;, OPATH, ...);&#xA;setns(fduserns, CLONENEWUSER);&#xA;setns(fdmntns, CLONENEWNS);&#xA;mount(&#34;/proc/self/fd/fdmnt&#34;, &#34;/mnt&#34;, ...);&#xA;&#xA;However, this isn&#39;t possible as the kernel will enforce that the mounts that the source and target paths refer to are located in the caller&#39;s mount namespace.&#xA;Since the caller will be located in the container&#39;s mount namespace after the setns() call but the source file descriptors refers to a mount located in the host&#39;s mount namespace this check fails.&#xA;The semantics behind 
this are somewhat sane and straightforward to understand so there was no need to change them even though we were tempted.&#xA;Back then it would&#39;ve also meant that adding mounts to containers would&#39;ve only worked on newer kernels and we were quite eager to enable this feature for kernels that were already released.&#xA;&#xA;Mount namespace tunnels&#xA;&#xA;So we came up with the idea of mount namespace tunnels.&#xA;Since we spearheaded this idea it has been picked up by various projects such as systemd for system services and it&#39;s own systemd-nspawn container runtime.&#xA;&#xA;The general idea as based on the observation that mount propagation can be used to function like a tunnel between mount namespaces:&#xA;&#xA;mount --bind /opt /opt&#xA;mount --make-private /opt&#xA;mount --make-shared /opt&#xA;Create new mount namespace with all mounts turned into dependent mounts.&#xA;unshare --mount --propagation=slave&#xA;&#xA;and then create a mount on or beneath the shared /opt mount on the host:&#xA;&#xA;mkdir /opt/a&#xA;mount --bind /tmp /opt/a&#xA;&#xA;then the new mount of /tmp on the dentry /opt/a will propagate into the mount namespace we created earlier.&#xA;Since the /opt mount at the /opt dentry in the new mount namespace is a dependent mount we can now move the mount to its final location:&#xA;&#xA;mount --move /opt/a /mnt&#xA;&#xA;As a last step we can unmount /opt/a in the host mount namespace.&#xA;And as long as the /mnt dentry doesn&#39;t reside on a mount that is a dependent mount of /opt&#39;s peer group the unmount of /opt/a we just performed on the host will only unmount the mount in the host mount namespace.&#xA;&#xA;There are various problems with this solution:&#xA;&#xA;It&#39;s complex.&#xA;The container manager needs to set up the mount tunnel when the container starts.&#xA;  In other words, it needs to part of the architecture of the container which is always unfortunate.&#xA;The mount at the endpoint of the tunnel in the container 
needs to be protected from being unmounted.&#xA;  Otherwise the container payload can just unmount the mount at its end of the mount tunnel and prevent the insertion of new mounts into the container.&#xA;&#xA;Mounting into mount namespaces&#xA;&#xA;A few years ago a new mount api made it into the kernel.&#xA;Shortly after I&#39;ve also added the mountsetattr(2) system call.&#xA;Since then I&#39;ve been expanding the abilities of this api and to put it to its full use.&#xA;&#xA;Unfortunately the adoption of the new mount api has been slow.&#xA;Mostly, because people don&#39;t know about it or because they don&#39;t yet see the many advantages it offers over the old one.&#xA;But with the next release of the mount(8) binary a lot of us use the new mount api will be used whenever possible.&#xA;&#xA;I won&#39;t be covering all the features that the mount api offers.&#xA;This post just illustrates how the new mount api makes it possible to mount into mount namespaces and let&#39;s us get rid of the complex mount propagation scheme.&#xA;&#xA;Luckily, the new mount api is designed around file descriptors.&#xA;&#xA;Filesystem Mounts&#xA;&#xA;To create a new filesystem mount using the old mount api is simple:&#xA;&#xA;mount(&#34;/dev/sda&#34;, &#34;/mnt&#34;, &#34;xfs&#34;, ...);&#xA;&#xA;We pass the source, target, and filesystem type and potentially additional mount options.&#xA;This single system call does a lot behind the scenes.&#xA;A new superblock will be allocated for the filesystem, mount options will be set, a new mount will be created and attached to a mountpoint in the caller&#39;s mount namespace.&#xA;&#xA;In the new mount api the various steps are split into separate system calls.&#xA;While this makes mounting more complex it allows allows for greater flexibility.&#xA;Mounting doesn&#39;t have to be a fast operation and never has been.&#xA;&#xA;So in the new mount api we would create a new filesystem mount with the following steps:&#xA;&#xA;/ Create a new 
filesystem context. /&#xA;fdfs = fsopen(&#34;xfs&#34;);&#xA;&#xA;/&#xA; Set the source of the filsystem mount. Whether or not this is required&#xA; depends on the type of filesystem of course. For example, mounting a tmpfs&#xA; filesystem would not require us to set the &#34;source&#34; property as it&#39;s not&#xA; backed by a block device. &#xA; /&#xA;fsconfig(fdfs, FSCONFIGSETSTRING, &#34;source&#34;, &#34;/dev/sda&#34;, 0);&#xA;&#xA;/ Actually create the superblock and prepare to allocate a mount. /&#xA;fsconfig(fdfs, FSCONFIGCMDCREATE, NULL, NULL, 0);&#xA;&#xA;The fdfs file descriptor refers to VFS context object that doesn&#39;t concern us here.&#xA;Let it suffice that it is an opaque object that can only be used to configure&#xA;the superblock and the filesystem until fsmount() is called:&#xA;&#xA;/ Create a new detached mount and return an OPATH file descriptor refering to the mount. */&#xA;fdmnt = fsmount(fdfs, 0, 0);&#xA;&#xA;The fsmount() call will turn the context file descriptor into an OPATH file descriptor that refers to a detached mount.&#xA;A detached mount is a mount that isn&#39;t attached to any mount namespace.&#xA;&#xA;Bind Mounts&#xA;&#xA;The old mount api created bind mounts via:&#xA;&#xA;mount(&#34;/opt&#34;, &#34;/mnt&#34;, MNTBIND, ...)&#xA;&#xA;and recursive bind mounts via:&#xA;&#xA;mount(&#34;/opt&#34;, &#34;/mnt&#34;, MNTBIND | MSREC, ...)&#xA;&#xA;Most people however will be more familiar with mount(8):&#xA;&#xA;mount --bind /opt /mnt&#xA;mount --rbind / /mnt&#xA;&#xA;Bind mounts play a major role in container runtimes and system services as run by systemd.&#xA;&#xA;The new mount api supports bind mounts through the opentree() system call.&#xA;Calling opentree() on an existing mount will just return an OPATH file descriptor referring to that mount.&#xA;But if OPENTREECLONE is specified opentree() will create a detached mount and return an OPATH file descriptor.&#xA;That file descriptor is indistinguishable from an OPATH file 
descriptor returned from the earlier fsmount() example:&#xA;&#xA;fdmnt = opentree(-EBADF, &#34;/opt&#34;, OPENTREECLONE, ...)&#xA;&#xA;creates a new detached mount of /opt and:&#xA;&#xA;fdmnt = opentree(-EBADF, &#34;/&#34;, OPENTREECLONE | ATRECURSIVE, ...)&#xA;&#xA;would create a new detached copy of the whole rootfs mount tree.&#xA;&#xA;Attaching detached mounts&#xA;&#xA;As mentioned before the file descriptor returned from fsmount() and opentree(OPENTREECLONE) refers to a detached mount in both cases.&#xA;The mount it refers to doesn&#39;t appear anywhere in the filesystem hierarchy.&#xA;Consequently, the mount can&#39;t be found by lookup operations going through the filesystem hierarchy.&#xA;The new mount api thus provides an elegant mechanism for:&#xA;&#xA;mount(&#34;/opt&#34;, &#34;/mnt&#34;, MSBIND, ...);&#xA;fdmnt = openat(-EABDF, &#34;/mnt&#34;, OPATH | ODIRECTORY | OCLOEXEC, ...);&#xA;umount2(&#34;/mnt&#34;, MNTDETACH);&#xA;&#xA;and with the added benefit that the mount never actually had to appear anywhere in the filesystem hierarchy and thus never had to belong to any mount namespace.&#xA;This alone is already a very powerful tool but we won&#39;t go into depth today.&#xA;&#xA;Most of the time a detached mount isn&#39;t wanted however.&#xA;Usually we want to make the mount visible in the filesystem hierarchy so other user or programs can access it.&#xA;So we need to attach them to the filesystem hierarchy.&#xA;&#xA;In order to attach a mount we can use the movemount() system call.&#xA;For example, to attach the detached mount fdmnt we create before we can use:&#xA;&#xA;movemount(fdmnt, &#34;&#34;, -EBADF, &#34;/mnt&#34;, MOVEMOUNTFEMPTYPATH);&#xA;&#xA;This will attach the detached mount of /opt at the /mnt dentry on the / mount.&#xA;What this means is that the /opt mount will be inserted into the mount namespace that the caller is located in at the time of calling movemount().&#xA;(The kernel has very tight semantics here. 
For example, it will enforce that the caller has CAPSYSADMIN in the owning user namespace of its mount namespace.&#xA;It will also enforce that the mount the /mnt dentry is located on belongs to the same mount namespace as the caller.)&#xA;&#xA;After movemount() returns the mount is permanently attached.&#xA;Even if it is unmounted while still pinned by a file descriptor will it still belong to the mount namespace it was attached to.&#xA;In other words, movemount() is an irreversible operation.&#xA;&#xA;The main point is that before movemount() is called a detached mount doesn&#39;t belong to any mount namespace and can thus be freely moved around.&#xA;&#xA;Mounting a new filesystem into a mount namespace&#xA;&#xA;To mount a filesystem into a new mount namespace we can make use of the split between configuring a filesystem context and creating a new superblock and actually attaching the mount to the filesystem hiearchy:&#xA;&#xA;fdfs = fsopen(&#34;xfs&#34;);&#xA;fsconfig(fdfs, FSCONFIGSETSTRING, &#34;source&#34;, &#34;/dev/sda&#34;, 0);&#xA;fsconfig(fdfs, FSCONFIGCMDCREATE, NULL, NULL, 0);&#xA;fdmnt = fsmount(fdfs, 0, 0);&#xA;&#xA;For filesystems that require host privileges such as xfs, ext4, or btrfs (and many others) these steps can be performed by a privileged container or pod manager with sufficient privileges.&#xA;However, once we have created a detached mounts we are free to attach to whatever mount and mountpoint we have privilege over in the target mount namespace.&#xA;So we can simply attach to the user namespace and mount namespace of the container:&#xA;&#xA;setns(fduserns);&#xA;setns(fdmntns);&#xA;&#xA;and then use&#xA;&#xA;movemount(fdmnt, &#34;&#34;, -EBADF, &#34;/mnt&#34;, MOVEMOUNTFEMPTYPATH);&#xA;&#xA;to attach the detached mount anywhere we like in the container.&#xA;&#xA;Mounting a new bind mount into a mount namespace&#xA;&#xA;A bind mount is even simpler.&#xA;If we want to share a specific host directory with the container we can just have the 
container manager call:&#xA;&#xA;fdmnt = opentree(-EBADF, &#34;/opt&#34;, OPENTREECLOEXEC | OPENTREECLONE);&#xA;&#xA;to allocate a new detached copy of the mount and then attach to the user and mount namespace of the container:&#xA;&#xA;setns(fduserns);&#xA;setns(fdmntns);&#xA;&#xA;and as above we are free to attach the detached mount anywhere we like in the container.&#xA;&#xA;Conclusion&#xA;&#xA;This is really it and as simple as it sounds.&#xA;It is a powerful delegation mechanism making it possible to inject mounts into lesser privileged mount namespace or unprivileged containers.&#xA;We&#39;ve making heavy use of this LXD and it is general the proper way to insert mounts into mount namespaces on newer kernels.]]&gt;</description>
      <content:encoded><![CDATA[<p>The original blogpost is at <a href="https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html" rel="nofollow">https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html</a></p>

<p>Early on when the <code>LXD</code> project was started we were clear that we wanted to make it possible to change settings while the container is running.
One of the very first things that came to our mind was making it possible to insert new mounts into a running container.
When I was still at Canonical working on <code>LXD</code> we quickly realized that inserting mounts into a running container would require a lot of creativity given the limitations of the api.</p>

<p>Back then the only way to create mounts or change mount options was by using the <code>mount(2)</code> system call.
The mount system call multiplexes a lot of different operations.
For example, it doesn&#39;t just allow the creation of new filesystem mounts but also handles bind mounts and mount option changes.
Mounting is overall a pretty complex operation as it doesn&#39;t just involve path lookup but also needs to handle mount propagation and filesystem specific and generic mount options.</p>

<p>I want to take a look at our legacy solution to this problem and at a newer approach that I&#39;ve used, one that has existed for a while but has never been talked about widely.</p>

<h1 id="creative-uses-of-mount-2">Creative uses of <code>mount(2)</code></h1>

<p>Before <code>openat2(2)</code> came along adding mounts to a container during startup was difficult because there was always the danger of symlink attacks.
A mount source or target path could be specified containing symlinks that would allow processes in the container to escape to the host filesystem.
These attacks used to be quite common and there was no straightforward solution available; at least not before the <code>RESOLVE_*</code> flag namespace of <code>openat2(2)</code> improved things so considerably that symlink attacks on new kernels can be effectively blocked.</p>

<p>But before <code>openat2()</code> symlink attacks when mounting could only be prevented with very careful coding and a rather elaborate algorithm.
I won&#39;t go into too much detail but it is roughly done by verifying each path component in userspace using <code>O_PATH</code> file descriptors making sure that the paths point into the container&#39;s rootfs.</p>

<p>But even if you verified that the path is sane and you hold a file descriptor to the last component you still need to solve the problem that <code>mount(2)</code> only operates on paths.
So you are still susceptible to symlink attacks as soon as you call <code>mount(source, target, ...)</code>.</p>

<p>The way we solved this problem was by realizing that <code>mount(2)</code> was perfectly happy to operate on <code>/proc/self/fd/&lt;nr&gt;</code> paths
(This is similar to how <code>fexecve()</code> used to work before the addition of the <code>execveat()</code> system call.).
So we could verify the whole path and then open the last component of the source and target paths at which point we could call <code>mount(&#34;/proc/self/fd/1234&#34;, &#34;/proc/self/fd/5678&#34;, ...)</code>.</p>
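
<p>A minimal sketch of the trick in plain C (the <code>/tmp</code> path here is purely illustrative; a real container manager would use the component it just verified):</p>

<pre><code class="language-c">#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
	/* Pin the verified last component with an O_PATH descriptor. */
	int fd = open(&#34;/tmp&#34;, O_PATH | O_CLOEXEC);
	if (fd &lt; 0)
		return 1;

	/*
	 * mount(2) only takes paths, so derive a stable path from the
	 * descriptor; it can no longer be redirected by symlink games.
	 */
	char path[64];
	snprintf(path, sizeof(path), &#34;/proc/self/fd/%d&#34;, fd);
	printf(&#34;%s\n&#34;, path);

	/* A real caller would now do mount(path, target, ...). */
	close(fd);
	return 0;
}
</code></pre>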

<p>We immediately thought that if <code>mount(2)</code> allows you to do that then we could easily use this to mount into namespaces.
So if the container is running in its mount namespace we could just create a bind mount on the host, open the newly created bind mount and then change to the container&#39;s mount namespace (and its owning user namespace) and then simply call <code>mount(&#34;/proc/self/fd/1234&#34;, &#34;/mnt&#34;, ...)</code>.
In pseudo C code it would look roughly:</p>

<pre><code class="language-c">fd_mnt = openat(-EBADF, &#34;/opt&#34;, O_PATH, ...);
setns(fd_userns, CLONE_NEWUSER);
setns(fd_mntns, CLONE_NEWNS);
mount(&#34;/proc/self/fd/fd_mnt&#34;, &#34;/mnt&#34;, ...);
</code></pre>

<p>However, this isn&#39;t possible as the kernel will enforce that the mounts that the source and target paths refer to are located in the caller&#39;s mount namespace.
Since the caller will be located in the container&#39;s mount namespace after the <code>setns()</code> call but the source file descriptor refers to a mount located in the host&#39;s mount namespace this check fails.
The semantics behind this are somewhat sane and straightforward to understand so there was no need to change them even though we were tempted.
Back then it would&#39;ve also meant that adding mounts to containers would&#39;ve only worked on newer kernels and we were quite eager to enable this feature for kernels that were already released.</p>

<h1 id="mount-namespace-tunnels">Mount namespace tunnels</h1>

<p>So we came up with the idea of mount namespace tunnels.
Since we spearheaded this idea it has been picked up by various projects such as <code>systemd</code> for system services and its own <code>systemd-nspawn</code> container runtime.</p>

<p>The general idea is based on the observation that mount propagation can be used to function like a tunnel between mount namespaces:</p>

<pre><code>mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt
# Create new mount namespace with all mounts turned into dependent mounts.
unshare --mount --propagation=slave
</code></pre>

<p>and then create a mount on or beneath the shared <code>/opt</code> mount on the host:</p>

<pre><code>mkdir /opt/a
mount --bind /tmp /opt/a
</code></pre>

<p>then the new mount of <code>/tmp</code> on the dentry <code>/opt/a</code> will propagate into the mount namespace we created earlier.
Since the <code>/opt</code> mount at the <code>/opt</code> dentry in the new mount namespace is a dependent mount we can now move the mount to its final location:</p>

<pre><code>mount --move /opt/a /mnt
</code></pre>

<p>As a last step we can unmount <code>/opt/a</code> in the host mount namespace.
And as long as the <code>/mnt</code> dentry doesn&#39;t reside on a mount that is a dependent mount of <code>/opt</code>&#39;s peer group the unmount of <code>/opt/a</code> we just performed on the host will only unmount the mount in the host mount namespace.</p>

<p>There are various problems with this solution:</p>
<ul><li>It&#39;s complex.</li>
<li>The container manager needs to set up the mount tunnel when the container starts.
In other words, it needs to be part of the architecture of the container which is always unfortunate.</li>
<li>The mount at the endpoint of the tunnel in the container needs to be protected from being unmounted.
Otherwise the container payload can just unmount the mount at its end of the mount tunnel and prevent the insertion of new mounts into the container.</li></ul>

<h1 id="mounting-into-mount-namespaces">Mounting into mount namespaces</h1>

<p>A few years ago a new mount api made it into the kernel.
Shortly after, I also added the <code>mount_setattr(2)</code> system call.
Since then I&#39;ve been expanding the abilities of this api to put it to its full use.</p>

<p>Unfortunately the adoption of the new mount api has been slow.
Mostly, because people don&#39;t know about it or because they don&#39;t yet see the many advantages it offers over the old one.
But with the next release of the <code>mount(8)</code> binary that a lot of us use, the new mount api will be used whenever possible.</p>

<p>I won&#39;t be covering all the features that the mount api offers.
This post just illustrates how the new mount api makes it possible to mount into mount namespaces and lets us get rid of the complex mount propagation scheme.</p>

<p>Luckily, the new mount api is designed around file descriptors.</p>

<h2 id="filesystem-mounts">Filesystem Mounts</h2>

<p>Creating a new filesystem mount using the old mount api is simple:</p>

<pre><code>mount(&#34;/dev/sda&#34;, &#34;/mnt&#34;, &#34;xfs&#34;, ...);
</code></pre>

<p>We pass the source, target, and filesystem type and potentially additional mount options.
This single system call does a lot behind the scenes.
A new superblock will be allocated for the filesystem, mount options will be set, a new mount will be created and attached to a mountpoint in the caller&#39;s mount namespace.</p>

<p>In the new mount api the various steps are split into separate system calls.
While this makes mounting more complex it allows for greater flexibility.
Mounting doesn&#39;t have to be a fast operation and never has been.</p>

<p>So in the new mount api we would create a new filesystem mount with the following steps:</p>

<pre><code class="language-c">/* Create a new filesystem context. */
fd_fs = fsopen(&#34;xfs&#34;);

/*
 * Set the source of the filesystem mount. Whether or not this is required
 * depends on the type of filesystem of course. For example, mounting a tmpfs
 * filesystem would not require us to set the &#34;source&#34; property as it&#39;s not
 * backed by a block device. 
 */
fsconfig(fd_fs, FSCONFIG_SET_STRING, &#34;source&#34;, &#34;/dev/sda&#34;, 0);

/* Actually create the superblock and prepare to allocate a mount. */
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
</code></pre>

<p>The <code>fd_fs</code> file descriptor refers to a VFS context object that doesn&#39;t concern us here.
Let it suffice that it is an opaque object that can only be used to configure
the superblock and the filesystem until <code>fsmount()</code> is called:</p>

<pre><code class="language-c">/* Create a new detached mount and return an O_PATH file descriptor referring to the mount. */
fd_mnt = fsmount(fd_fs, 0, 0);
</code></pre>

<p>The <code>fsmount()</code> call will turn the context file descriptor into an <code>O_PATH</code> file descriptor that refers to a detached mount.
A detached mount is a mount that isn&#39;t attached to any mount namespace.</p>
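
<p>A compilable sketch of the whole sequence under a few stated assumptions: it uses <code>tmpfs</code> so no <code>&#34;source&#34;</code> is needed, goes through raw <code>syscall(2)</code> in case the libc lacks wrappers, and since <code>fsopen()</code> requires <code>CAP_SYS_ADMIN</code> it simply reports the error when run unprivileged:</p>

<pre><code class="language-c">#define _GNU_SOURCE
#include &lt;errno.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/syscall.h&gt;
#include &lt;unistd.h&gt;

#ifndef FSCONFIG_CMD_CREATE
#define FSCONFIG_CMD_CREATE 6	/* from &lt;linux/mount.h&gt; */
#endif

int main(void)
{
	/* tmpfs needs no &#34;source&#34; property, which keeps the sketch short. */
	int fd_fs = syscall(SYS_fsopen, &#34;tmpfs&#34;, 0);
	if (fd_fs &lt; 0) {
		printf(&#34;fsopen: %s (CAP_SYS_ADMIN is required)\n&#34;, strerror(errno));
		return 0;
	}

	/* Create the superblock... */
	if (syscall(SYS_fsconfig, fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0) &lt; 0) {
		printf(&#34;fsconfig: %s\n&#34;, strerror(errno));
		return 0;
	}

	/* ...and turn the context into a detached mount. */
	int fd_mnt = syscall(SYS_fsmount, fd_fs, 0, 0);
	if (fd_mnt &lt; 0) {
		printf(&#34;fsmount: %s\n&#34;, strerror(errno));
		return 0;
	}
	printf(&#34;detached tmpfs mount on fd %d\n&#34;, fd_mnt);
	return 0;
}
</code></pre>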

<h2 id="bind-mounts">Bind Mounts</h2>

<p>The old mount api created bind mounts via:</p>

<pre><code class="language-c">mount(&#34;/opt&#34;, &#34;/mnt&#34;, MNT_BIND, ...)
</code></pre>

<p>and recursive bind mounts via:</p>

<pre><code class="language-c">mount(&#34;/opt&#34;, &#34;/mnt&#34;, MNT_BIND | MS_REC, ...)
</code></pre>

<p>Most people however will be more familiar with <code>mount(8)</code>:</p>

<pre><code>mount --bind /opt /mnt
mount --rbind / /mnt
</code></pre>

<p>Bind mounts play a major role in container runtimes and system services as run by <code>systemd</code>.</p>

<p>The new mount api supports bind mounts through the <code>open_tree()</code> system call.
Calling <code>open_tree()</code> on an existing mount will just return an <code>O_PATH</code> file descriptor referring to that mount.
But if <code>OPEN_TREE_CLONE</code> is specified <code>open_tree()</code> will create a detached mount and return an <code>O_PATH</code> file descriptor.
That file descriptor is indistinguishable from an <code>O_PATH</code> file descriptor returned from the earlier <code>fsmount()</code> example:</p>

<pre><code class="language-c">fd_mnt = open_tree(-EBADF, &#34;/opt&#34;, OPEN_TREE_CLONE, ...)
</code></pre>

<p>creates a new detached mount of <code>/opt</code> and:</p>

<pre><code class="language-c">fd_mnt = open_tree(-EBADF, &#34;/&#34;, OPEN_TREE_CLONE | AT_RECURSIVE, ...)
</code></pre>

<p>would create a new detached copy of the whole rootfs mount tree.</p>

<h3 id="attaching-detached-mounts">Attaching detached mounts</h3>

<p>As mentioned before the file descriptor returned from <code>fsmount()</code> and <code>open_tree(OPEN_TREE_CLONE)</code> refers to a detached mount in both cases.
The mount it refers to doesn&#39;t appear anywhere in the filesystem hierarchy.
Consequently, the mount can&#39;t be found by lookup operations going through the filesystem hierarchy.
The new mount api thus provides an elegant equivalent of the old sequence:</p>

<pre><code class="language-c">mount(&#34;/opt&#34;, &#34;/mnt&#34;, MS_BIND, ...);
fd_mnt = openat(-EBADF, &#34;/mnt&#34;, O_PATH | O_DIRECTORY | O_CLOEXEC, ...);
umount2(&#34;/mnt&#34;, MNT_DETACH);
</code></pre>

<p>and with the added benefit that the mount never actually had to appear anywhere in the filesystem hierarchy and thus never had to belong to any mount namespace.
This alone is already a very powerful tool but we won&#39;t go into depth today.</p>

<p>Most of the time a detached mount isn&#39;t wanted however.
Usually we want to make the mount visible in the filesystem hierarchy so other users or programs can access it.
So we need to attach it to the filesystem hierarchy.</p>

<p>In order to attach a mount we can use the <code>move_mount()</code> system call.
For example, to attach the detached mount <code>fd_mnt</code> we created before we can use:</p>

<pre><code class="language-c">move_mount(fd_mnt, &#34;&#34;, -EBADF, &#34;/mnt&#34;, MOVE_MOUNT_F_EMPTY_PATH);
</code></pre>

<p>This will attach the detached mount of <code>/opt</code> at the <code>/mnt</code> dentry on the <code>/</code> mount.
What this means is that the <code>/opt</code> mount will be inserted into the mount namespace that the caller is located in at the time of calling <code>move_mount()</code>.
(The kernel has very tight semantics here. For example, it will enforce that the caller has <code>CAP_SYS_ADMIN</code> in the owning user namespace of its mount namespace.
It will also enforce that the mount the <code>/mnt</code> dentry is located on belongs to the same mount namespace as the caller.)</p>

<p>After <code>move_mount()</code> returns the mount is permanently attached.
Even if it is unmounted while still pinned by a file descriptor it will still belong to the mount namespace it was attached to.
In other words, <code>move_mount()</code> is an irreversible operation.</p>

<p>The main point is that before <code>move_mount()</code> is called a detached mount doesn&#39;t belong to any mount namespace and can thus be freely moved around.</p>

<h2 id="mounting-a-new-filesystem-into-a-mount-namespace">Mounting a new filesystem into a mount namespace</h2>

<p>To mount a new filesystem into a mount namespace we can make use of the split between configuring a filesystem context and creating a new superblock on the one hand, and actually attaching the mount to the filesystem hierarchy on the other:</p>

<pre><code class="language-c">fd_fs = fsopen(&#34;xfs&#34;);
fsconfig(fd_fs, FSCONFIG_SET_STRING, &#34;source&#34;, &#34;/dev/sda&#34;, 0);
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
fd_mnt = fsmount(fd_fs, 0, 0);
</code></pre>

<p>For filesystems that require host privileges such as <code>xfs</code>, <code>ext4</code>, or <code>btrfs</code> (and many others) these steps can be performed by a privileged container or pod manager.
However, once we have created a detached mount we are free to attach it to whatever mount and mountpoint we have privilege over in the target mount namespace.
So we can simply attach to the user namespace and mount namespace of the container:</p>

<pre><code class="language-c">setns(fd_userns);
setns(fd_mntns);
</code></pre>

<p>and then use</p>

<pre><code class="language-c">move_mount(fd_mnt, &#34;&#34;, -EBADF, &#34;/mnt&#34;, MOVE_MOUNT_F_EMPTY_PATH);
</code></pre>

<p>to attach the detached mount anywhere we like in the container.</p>

<h2 id="mounting-a-new-bind-mount-into-a-mount-namespace">Mounting a new bind mount into a mount namespace</h2>

<p>A bind mount is even simpler.
If we want to share a specific host directory with the container we can just have the container manager call:</p>

<pre><code class="language-c">fd_mnt = open_tree(-EBADF, &#34;/opt&#34;, OPEN_TREE_CLOEXEC | OPEN_TREE_CLONE);
</code></pre>

<p>to allocate a new detached copy of the mount and then attach to the user and mount namespace of the container:</p>

<pre><code class="language-c">setns(fd_userns);
setns(fd_mntns);
</code></pre>

<p>and as above we are free to attach the detached mount anywhere we like in the container.</p>

<h1 id="conclusion">Conclusion</h1>

<p>This is really it and as simple as it sounds.
It is a powerful delegation mechanism making it possible to inject mounts into less privileged mount namespaces or unprivileged containers.
We&#39;ve been making heavy use of this in <code>LXD</code> and it is in general the proper way to insert mounts into mount namespaces on newer kernels.</p>
]]></content:encoded>
      <author>Christian Brauner</author>
      <guid>https://people.kernel.org/read/a/usy2yo2ogp</guid>
      <pubDate>Wed, 01 Mar 2023 21:36:16 +0000</pubDate>
    </item>
    <item>
      <title>Fix your mutt</title>
      <link>https://people.kernel.org/monsieuricon/fix-your-mutt</link>
      <description>&lt;![CDATA[At some point in the recent past, mutt changed the way it generates Message-ID header values. Instead of the perfectly good old way of doing it, the developers switched to using base64-encoded random bytes. The base64 dictionary contains the / character, which causes unnecessary difficulties when linking to these messages on lore.kernel.org, since the / character needs to be escaped as %2F for everything to work properly.&#xA;&#xA;Mutt developers seem completely uninterested in changing this, so please save everyone a lot of trouble and do the following if you&#39;re using mutt for your kernel development needs (should work for all mutt versions):&#xA;&#xA;Create a ~/.mutt-hook-fix-msgid file with the following contents (change &#34;mylaptop.local&#34; to whatever you like):&#xA;&#xA;        myhdr Message-ID: uuidgen -r@mylaptop.local&#xA;    &#xA;Add the following to your ~/.muttrc:&#xA;&#xA;        send-hook . &#34;source ~/.mutt-hook-fix-msgid&#34;&#xA;    &#xA;UPDATE: if you have mutt 2.1 or later you can alternatively set the $messageidformat variable to restore the pre-mutt-2.0 behaviour:&#xA;&#xA;mutt-2.1+ only&#xA;set messageid_format = &#34;%Y%02m%02d%02H%02M%02S.G%c%p@%f&#34;&#xA;&#xA;Thanks to Thomas Weißschuh for the suggestion!&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p>At some point in the recent past, mutt changed the way it generates <code>Message-ID</code> header values. Instead of the perfectly good old way of doing it, the developers switched to using base64-encoded random bytes. The base64 dictionary contains the <code>/</code> character, which causes unnecessary difficulties when linking to these messages on lore.kernel.org, since the <code>/</code> character needs to be escaped as <code>%2F</code> for everything to work properly.</p>

<p>Mutt developers seem completely <a href="https://gitlab.com/muttmua/mutt/-/issues/400" rel="nofollow">uninterested</a> in changing this, so please save everyone a lot of trouble and do the following if you&#39;re using mutt for your kernel development needs (should work for all mutt versions):</p>
<ol><li><p>Create a <code>~/.mutt-hook-fix-msgid</code> file with the following contents (change “mylaptop.local” to whatever you like):</p>

<pre><code>my_hdr Message-ID: &lt;`uuidgen -r`@mylaptop.local&gt;
</code></pre></li>

<li><p>Add the following to your <code>~/.muttrc</code>:</p>

<pre><code>send-hook . &#34;source ~/.mutt-hook-fix-msgid&#34;
</code></pre></li></ol>
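For a sense of what the hook above produces, here is a rough Python equivalent (a sketch only: `make_message_id` is our own name, and we assume `uuidgen -r` emits a random version-4 UUID, with "mylaptop.local" as the placeholder host from the example):

```python
# Sketch: emulate the my_hdr hook above. Assumption: `uuidgen -r`
# prints a random (version 4) UUID, which the hook wraps as <uuid@host>.
import uuid

def make_message_id(host="mylaptop.local"):
    # uuid4 strings are hex digits and dashes only, so no "/" appears.
    return "<{}@{}>".format(uuid.uuid4(), host)

msgid = make_message_id()
print(msgid)
```

Unlike base64-encoded random bytes, a UUID-based Message-ID contains no `/`, so lore.kernel.org links to the message need no `%2F` escaping.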

<p>UPDATE: if you have mutt 2.1 or later you can alternatively set the <a href="http://www.mutt.org/doc/manual/#message-id-format" rel="nofollow"><code>$message_id_format</code></a> variable to restore the pre-mutt-2.0 behaviour:</p>

<pre><code># mutt-2.1+ only
set message_id_format = &#34;&lt;%Y%02m%02d%02H%02M%02S.G%c%p@%f&gt;&#34;
</code></pre>

<p>Thanks to <a href="https://public-inbox.org/meta/20230218175857.b6g7aqx7inhbduxz@t-8ch.de/" rel="nofollow">Thomas Weißschuh</a> for the suggestion!</p>
]]></content:encoded>
      <author>Konstantin Ryabitsev</author>
      <guid>https://people.kernel.org/read/a/2wwu4ql4y5</guid>
      <pubDate>Wed, 01 Mar 2023 17:40:26 +0000</pubDate>
    </item>
    <item>
      <title>NIC memory reserve</title>
      <link>https://people.kernel.org/kuba/nic-memory-reserve</link>
      <description>&lt;![CDATA[NIC drivers pre-allocate memory for received packets. Once the packets arrive NIC can DMA them into the buffers, potentially hundreds of them, before host processing kicks in.&#xA;&#xA;For efficiency reasons each packet-processing CPU (in extreme cases every CPU on the system) will have its own set of packet queues, including its own set of pre-allocated buffers.&#xA;&#xA;The amount of memory pre-allocated for Rx is a product of:&#xA;&#xA;buffer size&#xA;number of queues&#xA;queue depth&#xA;&#xA;A reasonable example in data centers would be:&#xA;8k * 32 queues * 4k entries = 1GB&#xA;&#xA;Buffer size is roughly dictated by the MTU of the network, for a modern datacenter network 8k (2 pages) is likely a right ballpark figure. Number of queues depends on the number of cores on the system and the request-per-second rate of the workload. 32 queues is a reasonable choice for the example (either 100+ threads or a network-heavy workload).&#xA;&#xA;Last but not least - the queue depth. Because networking is bursty, and NAPI processing is at the whim of the scheduler (the latter is more of an issue in practice) the queue depths of 4k or 8k entries are not uncommon.&#xA;&#xA;Can we do better?&#xA;&#xA;Memory is not cheap, having 1GB of memory sitting around unused 99% of the time has a real cost. If we were to advise a NIC design (or had access to highly flexible devices like the Netronome/Corigine NICs) we could use the following scheme to save memory:&#xA;&#xA;Normal processing rarely requires queue depth of more than 512 entries. We could therefore have smaller dedicated queues, and a larger &#34;reserve&#34; - a queue from which every Rx queue can draw, but which requires additional synchronization on the host side. 
To achieve the equivalent of 4k entries we&#39;d only need:&#xA;8k * 32 queues * 512 entries + 8k * 1 reserve * 4k entries = 160MB&#xA;The NIC would try to use the 512 entries dedicated to each queue first, but if they run out (due to a packet burst or a scheduling delay) it could use the entries from the reserve. Bursts and latency spikes are rarely synchronized across the queues.&#xA;&#xA;Can we do worse?&#xA;&#xA;In practice memory savings are rarely top-of-mind for NIC vendors. Multiple drivers in Linux allocate a set of rings for each thread of the CPU. I can only guess that this is to make sure iperf tests run without a hitch...&#xA;&#xA;As we wait for vendors to improve their devices - double check the queue count and queue size you use are justified (ethtool -g / ethtool -l).]]&gt;</description>
      <content:encoded><![CDATA[<p>NIC drivers pre-allocate memory for received packets. Once the packets arrive, the NIC can DMA them into the buffers, potentially hundreds of them, before host processing kicks in.</p>

<p>For efficiency reasons each packet-processing CPU (in extreme cases every CPU on the system) will have its own set of packet queues, including its own set of pre-allocated buffers.</p>

<p>The amount of memory pre-allocated for Rx is a product of:</p>
<ul><li>buffer size</li>
<li>number of queues</li>
<li>queue depth</li></ul>

<p>A reasonable example in data centers would be:</p>

<pre><code>8k * 32 queues * 4k entries = 1GB
</code></pre>

<p>Buffer size is roughly dictated by the MTU of the network; for a modern datacenter network 8k (2 pages) is likely the right ballpark figure. Number of queues depends on the number of cores on the system and the request-per-second rate of the workload. 32 queues is a reasonable choice for the example (either 100+ threads or a network-heavy workload).</p>

<p>Last but not least – the queue depth. Because networking is bursty, and NAPI processing is at the whim of the scheduler (the latter is more of an issue in practice) the queue depths of 4k or 8k entries are not uncommon.</p>

<h3 id="can-we-do-better">Can we do better?</h3>

<p>Memory is not cheap, having 1GB of memory sitting around unused 99% of the time has a real cost. If we were to advise a NIC design (or had access to highly flexible devices like the Netronome/Corigine NICs) we could use the following scheme to save memory:</p>

<p>Normal processing rarely requires queue depth of more than 512 entries. We could therefore have smaller dedicated queues, and a larger “reserve” – a queue from which every Rx queue can draw, but which requires additional synchronization on the host side. To achieve the equivalent of 4k entries we&#39;d only need:</p>

<pre><code>8k * 32 queues * 512 entries + 8k * 1 reserve * 4k entries = 160MB
</code></pre>

<p>The NIC would try to use the 512 entries dedicated to each queue first, but if they run out (due to a packet burst or a scheduling delay) it could use the entries from the reserve. Bursts and latency spikes are rarely synchronized across the queues.</p>
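The two memory footprints above are easy to double-check (a quick Python sketch; the function name is ours):

```python
# Check the Rx buffer memory math from the post (sizes in bytes).
KB = 1024

def rx_memory(buf_size, queues, depth, reserve_depth=0):
    # Dedicated per-queue Rx buffers plus an optional shared reserve.
    return buf_size * queues * depth + buf_size * reserve_depth

classic = rx_memory(8 * KB, 32, 4 * KB)        # every queue at full depth
shared = rx_memory(8 * KB, 32, 512, 4 * KB)    # small queues + one reserve
print(classic // 2**20, "MB vs", shared // 2**20, "MB")
```

The dedicated reserve only pays off because bursts rarely hit all 32 queues at once, so one shared 4k-entry pool can back them all.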

<h3 id="can-we-do-worse">Can we do worse?</h3>

<p>In practice memory savings are rarely top-of-mind for NIC vendors. Multiple drivers in Linux allocate a set of rings for each <em>thread</em> of the CPU. I can only guess that this is to make sure iperf tests run without a hitch...</p>

<p>As we wait for vendors to improve their devices – double check the queue count and queue size you use are justified (<code>ethtool -g</code> / <code>ethtool -l</code>).</p>
]]></content:encoded>
      <author>Jakub Kicinski</author>
      <guid>https://people.kernel.org/read/a/kyc6cv5cfu</guid>
      <pubDate>Wed, 01 Mar 2023 00:00:30 +0000</pubDate>
    </item>
    <item>
      <title>Bounded Flexible Arrays in C</title>
      <link>https://people.kernel.org/kees/bounded-flexible-arrays-in-c</link>
      <description>&lt;![CDATA[How to modernize C arrays for greater memory safety: a case-study in refactoring the Linux kernel and a look to the future&#xA;Kees Cook&#xA;&#xA;C is not just a fancy assembler any more&#xA;&#xA;Large projects written in C, especially those written close to the hardware layer like Linux, have long treated the language as a high-level assembler. Using C allowed for abstracting away much of the difficulty of writing directly in machine code while still providing easy low-level access to memory, registers, and CPU features. However, C has matured over the last half century, and many language features that improve robustness go unused in older codebases. This is especially true for arrays, where the historical lack of bounds checking has been a consistent source of security flaws.&#xA;&#xA;Converting such codebases to use &#34;modern&#34; language features, like those in C99 (still from the prior millennium), can be a major challenge, but it is an entirely tractable problem. This post is a deep dive into an effort underway in the Linux kernel to make array index overflows (and more generally, buffer overflows) a thing of the past, where they belong. Our success hinges on replacing anachronistic array definitions with well-defined C99 flexible arrays. This approach can be used by developers to refactor C code, making it possible to leverage 21st century mitigations (like -fsanitize=bounds and FORTIFY_SOURCE), since such things can finally be cleanly applied to the modernized codebase.&#xA;&#xA;The fraught history of arrays in C&#xA;&#xA;For the compiler to successfully apply array index bounds checking, array sizes must be defined unambiguously, which is not always easy in C. Depending on the array definition, bounds checking falls roughly into three categories: fixed-sized arrays, dynamically-sized arrays, and pointer offsets. 
Each category of array definitions must be made unambiguous before the next, as they mostly build on top of each other.  For example, if the compiler cannot protect a fixed-sized array, it certainly cannot protect a dynamically-sized array, and array indexing is just a specialized case of calculating a memory pointer offset.&#xA;&#xA;Properly defined dynamically-sized arrays were introduced in C99 (int foo[]), and called &#34;flexible arrays&#34;. Before that, many C projects used the GNU extension of zero-length arrays (int foo[0]), which is not recognized by the C standard. This was done because, before the GNU extension, C projects would use single-element arrays (int foo[1]) which had several frustrating characteristics. (Using sizeof() on such a structure would include a single element as well, which would require additional handling to get allocation sizes to be accurate. This is not a problem for zero-element or true flexible arrays.)&#xA;&#xA;However, due to yet more historical situations (e.g. struct sockaddr, which has a fixed-size trailing array that is not supposed to actually be treated as fixed-size), GCC and Clang actually treat all trailing arrays as flexible arrays. This behavior makes things even more problematic, since it becomes impossible to limit a flexible array heuristic to only 1-element or 0-element (i.e. zero-length) arrays. For example, a compiler can&#39;t tell the intent of variable&#39;s use here:&#xA;&#xA;struct obj {&#xA;        ...&#xA;        unsigned char bytes;&#xA;        int variable[4];&#xA;};&#xA;Is it actually a 4 element array, or is it sized by the bytes member? 
As such, compilers have had to assume that trailing arrays must be intended to be dynamically sized (even though most are intended to be fixed-size).&#xA;&#xA;To clear the way for sensible protection of fixed-size arrays, and to have a common framework for handling dynamically-sized arrays, Linux must have all the &#34;fake&#34; flexible array members replaced with actual C99 flexible array members so that the programmer&#39;s intent can actually be represented in an unambiguous way. With this done, -Warray-bounds (and similar things like __builtin_object_size()) will catch compile-time problems, and -fsanitize=bounds (and similar things like __builtin_dynamic_object_size()) can catch run-time problems.&#xA;&#xA;Once fixed-sized arrays are protected, dynamically sized arrays can be protected as well, though this requires introducing a way to annotate structures that contain flexible arrays. Nearly all such structs also contain the count of allocated elements present in the flexible array:&#xA;&#xA;struct obj {&#xA;        ...&#xA;        unsigned short count;&#xA;        struct foo items[]; /* Has &#34;count&#34; many &#34;struct foo&#34;s */&#xA;} *ptr;&#xA;Such structs therefore fully describe their contents at runtime (and are called &#34;flexible array structures&#34; from here on). 
In other words, their size can be determined at run-time as:&#xA;&#xA;sizeof(*ptr) + sizeof(*ptr-&gt;items) * ptr-&gt;count&#xA;Teaching the compiler which struct member is associated with the count of a given flexible array member will allow -fsanitize=bounds and __builtin_dynamic_object_size() to reason about flexible array structure usage as well, covering all arrays in Linux with &#34;known bounds&#34;.&#xA;&#xA;(Not covered here is the closely related work to tighten the FORTIFY_SOURCE implementation for the memcpy()-family of functions which also depends on making flexible array sizes unambiguous.)&#xA;&#xA;Replacing &#34;fake&#34; flexible arrays&#xA;&#xA;Compile-time diagnostics about the size of arrays use either internal value range checking or things similar to the FORTIFY_SOURCE macros (which use __builtin_object_size() for their implementations). This works well for arrays not at the end of the structure, but gets disabled for trailing arrays since the compiler must treat trailing arrays as flexible arrays (see struct sockaddr above). And for everything treated as a flexible array (i.e. dynamically sized), the compiler cannot know the array length at compile time, since it will only be known at runtime. To make such array declarations unambiguous (and therefore able to gain sane runtime bounds checking), compilers must gain an option to disable all &#34;fake&#34; flexible array heuristics, and treat only true flexible arrays as flexible arrays.&#xA;&#xA;The creation of -fstrict-flex-arrays is now available in recent GCC and Clang builds, but any project using it will need to replace all fake flexible arrays with true flexible arrays first (to separate them from any fixed-size trailing arrays). This comes with several challenges.&#xA;&#xA;Replace 0-length arrays&#xA;&#xA;Most replacement of 0-length arrays with flexible arrays requires no special handling. Simply removing the &#34;0&#34; in the array declaration is sufficient. 
For example,&#xA;&#xA;struct obj {&#xA;        ...&#xA;        int flex[0];&#xA;};&#xA;becomes:&#xA;&#xA;struct obj {&#xA;        ...&#xA;        int flex[];&#xA;};&#xA;However, there are a few things of note that can go wrong with these conversions:&#xA;&#xA;Changes to sizeof()&#xA;&#xA;While sizeof(instance-&gt;flex) for a 0-length array returns 0, it becomes a compile-time failure once it becomes a true flexible array. This usually manifests within other complex macros that are examining the details of a given struct, and are usually hidden bugs that switching to a flexible array helps expose.&#xA;&#xA;Pass by value&#xA;&#xA;Converting to a true flexible array will expose any strange cases of trying to pass a flexible array struct by value. These are almost always a bug, so it&#39;s another case where a problem is exposed by cleaning up fake flexible arrays. For example:&#xA;&#xA;net/core/flow_dissector.c: In function &#39;is_pppoe_ses_hdr_valid&#39;:&#xA;net/core/flow_dissector.c:898:13: note: the ABI of passing struct with a flexible array member has changed in GCC 4.4&#xA;&#xA;898 | static bool is_pppoe_ses_hdr_valid(struct pppoe_hdr hdr)&#xA;    |                                   ^~&#xA;&#xA;Flexible arrays in unions&#xA;&#xA;C99 6.7.2.1 &#34;Structure and union specifiers&#34; #16 declares true flexible arrays may not be in unions nor otherwise empty structures: &#34;As a special case, the last element of a structure with more than one named member may have an incomplete array type; this is called a flexible array member.&#34;&#xA;&#xA;However, this situation is allowed by the GNU &#34;trailing array&#34; extension, where such arrays are treated as flexible arrays. More importantly, flexible arrays (via the GNU extension) are used in unions in many places throughout Linux code. 
The C99 treatment of true flexible arrays appears to be only a definitional limitation (and likely just an oversight) since the restriction can be worked around with creative use of anonymous structs. For example, this will build:&#xA;&#xA;struct obj {&#xA;        ...&#xA;        union {&#xA;                struct foo name1[0];&#xA;                struct bar name2[0];&#xA;        };&#xA;};&#xA;but this will not:&#xA;&#xA;struct obj {&#xA;        ...&#xA;        union {&#xA;                struct foo name1[];&#xA;                struct bar name2[];&#xA;        };&#xA;};&#xA;source:5:22: error: flexible array member in union&#xA;  5 | struct foo name1[];&#xA;    |            ^&#xA;But in both cases, the compiler treats name1 and name2 as flexible arrays. What will happily compile, though, is wrapping true flexible arrays in a struct that has at least 1 other non-true-flexible array, including an empty anonymous struct (i.e. taking up no size):&#xA;&#xA;struct obj {&#xA;        ...&#xA;        union {&#xA;                struct {&#xA;                        struct { } __unused_member_1;&#xA;                        struct foo name1[];&#xA;                };&#xA;                struct {&#xA;                        struct { } __unused_member_2;&#xA;                        struct bar name2[];&#xA;                };&#xA;        };&#xA;};&#xA;Thankfully, this was wrapped in Linux with the DECLARE_FLEX_ARRAY() macro:&#xA;&#xA;struct obj {&#xA;        ...&#xA;        union {&#xA;                DECLARE_FLEX_ARRAY(struct foo, name1);&#xA;                DECLARE_FLEX_ARRAY(struct bar, name2);&#xA;        };&#xA;};&#xA;which makes this much more readable. I hope to see future C standards eliminate this restriction.&#xA;&#xA;Overlapping composite structure members&#xA;&#xA;This is another case of a real bug being exposed by true flexible array usage, as it is possible to create an implicit union of a flexible array and something else by including a flexible array structure in another struct. 
For example:&#xA;&#xA;struct inner {&#xA;        ...&#xA;        int flex[0];&#xA;};&#xA;&#xA;struct outer {&#xA;        ...&#xA;        struct inner header;&#xA;        int overlap;&#xA;        ...&#xA;} *instance;&#xA;Here, instance-&gt;overlap and instance-&gt;header.flex[0] share the same memory location. Whether or not this is intentional cannot be understood by the compiler. If it is a bug, then using a true flexible array will trigger a warning. If it&#39;s not a bug, rearranging the structures to use an actual union is needed (see above).&#xA;&#xA;struct definition parsed by something other than a C compiler&#xA;&#xA;If the converted struct is part of a source file that is parsed by something that is not a C compiler, it may not be prepared to handle empty square braces on arrays. For example, SWIG broke when the Linux Userspace API headers got converted. This is a known issue in SWIG, and can be worked around in various ways.&#xA;&#xA;Replace 1-element arrays&#xA;&#xA;Most 1-element array conversions are similar to 0-length array conversions, but with the effect that the surrounding structure&#39;s sizeof() changes. This leads to a few additional significant issues:&#xA;&#xA;Size calculations&#xA;&#xA;If a struct is used entirely internally to Linux, it is generally sufficient to make changes to both the struct and all size calculations, which will result in identical binary output. For example:&#xA;&#xA;struct object {&#xA;        ...&#xA;        int flex[1];&#xA;} *p;&#xA;&#xA;p = kmalloc(sizeof(*p) + sizeof(p-&gt;flex[0]) * (count - 1),&#xA;            GFP_KERNEL);&#xA;the above count - 1 becomes just count now:&#xA;&#xA;struct object {&#xA;        ...&#xA;        int flex[];&#xA;} *p;&#xA;&#xA;p = kmalloc(sizeof(*p) + sizeof(p-&gt;flex[0]) * count,&#xA;            GFP_KERNEL);&#xA;If all size calculations are correctly adjusted, there should be no differences in the resulting allocation size, etc. 
If a discrepancy is found, it is going to be either a bug introduced by the conversion, or the discovery of an existing bug in the original size calculations.&#xA;&#xA;Note that depending on the sizes of the structure, its flexible array element, and count, there is also the risk associated with arithmetic overflow. Linux uses the struct_size() macro to perform these calculations so that the result saturates to at most SIZE_MAX, which will cause an allocation failure rather than wrapping around. So the best way to perform this allocation would be:&#xA;&#xA;p = kmalloc(struct_size(p, flex, count), GFP_KERNEL);&#xA;Padding and interface sizes&#xA;&#xA;When a structure definition is also used by a codebase we don&#39;t control (e.g. firmware, userspace, virtualization), changing its layout or sizeof() may break such code. Specifically, it may break its ability to communicate correctly with the kernel across the shared interface. Such structures cannot suddenly lose the single element of its trailing array. In these cases, a new member needs to be used for kernel code, explicitly keeping the original member for backward compatibility. For example:&#xA;&#xA;struct object {&#xA;        ...&#xA;        int flex[1];&#xA;};&#xA;becomes:&#xA;&#xA;struct object {&#xA;        ...&#xA;        union {&#xA;                int flex[1];&#xA;                DECLARE_FLEX_ARRAY(int, data);&#xA;        };&#xA;};&#xA;Now the kernel will only use the newly named data member (and gain any potential bounds checking protections from the compiler), and external code that shares this structure definition can continue to use the flex member, all without changing the size of the structure.&#xA;&#xA;This has the downside of needing to change the member name throughout Linux. However, if the other side of the interface doesn&#39;t actually use the original member, we can avoid this. We can convert the member to a flexible array and add explicit padding instead. 
This would mean no collateral changes with the member name in Linux are needed:&#xA;&#xA;struct object {&#xA;        ...&#xA;        union {&#xA;                int _padding;&#xA;                DECLARE_FLEX_ARRAY(int, flex);&#xA;        };&#xA;};&#xA;&#xA;Replace multi-element arrays&#xA;&#xA;In the cases of trailing arrays with larger element counts, the usage needs to be even more carefully studied. Most problems end up looking very similar to 1-element interface conversions above. For example, if there is some hardware interface that returns at least 4 bytes for an otherwise dynamically sized array, the conversion would start from here:&#xA;&#xA;struct object {&#xA;        ...&#xA;        unsigned char data[4];&#xA;};&#xA;which becomes:&#xA;&#xA;struct object {&#xA;        ...&#xA;        union {&#xA;                unsigned char padding[4];&#xA;                DECLARE_FLEX_ARRAY(unsigned char, data);&#xA;        };&#xA;};&#xA;&#xA;Enable -Warray-bounds&#xA;&#xA;With all fixed-size array bounds able to be determined at build time, -Warray-bounds can actually perform the checking, keeping provably bad code out of Linux. (This option is already part of -Wall, which Linux isn&#39;t quite able to use itself yet, but is strongly recommended for other C projects.) As a reminder, optimization level will impact this option. The kernel is built with -O2, which is likely the right choice for most C projects. &#xA;&#xA;Enable -Wzero-length-array&#xA;&#xA;If all zero length arrays have been removed from the code, future uses can be kept out of the code by using -Wzero-length-array.  This option is currently only available in Clang, and will warn when finding the definition of such structure members, rather than warning when they are accessed in code. Because of this, it is unlikely to ever be enabled in Linux since some array sizes are constructed from build configurations, and may drop to 0 when they are unused (i.e. they were never used as flexible arrays). 
As such, it is sufficient to use -fstrict-flex-arrays (see below) and -Warray-bounds.&#xA;&#xA;Enable -fstrict-flex-arrays&#xA;&#xA;Once all the fake flexible arrays have been converted to true flexible arrays, the remaining fixed-sized trailing arrays can start being treated as actually fixed-size by enabling -fstrict-flex-arrays. Future attempts to add fake flexible arrays to the code will then elicit warnings as part of the existing diagnostics from -Warray-bounds, since all fake flexible arrays are now treated as fixed-size arrays. (Note that this option sees the subset of 0-length arrays caught by -Wzero-length-array when they are actually used in the code, so -Wzero-length-array may be redundant.) &#xA;&#xA;Coming soon: annotate bounds of flexible arrays&#xA;&#xA;With flexible arrays now a first-class citizen in Linux and the compilers, it becomes possible to extend their available diagnostics.  What the compiler is missing is knowledge of how the length of a given flexible array is tracked. For well-described flexible array structs, this means associating the member holding the element count with the flexible array member. This idea is not new, though prior implementation proposals have wanted to make changes to the C language syntax. A simpler approach is the addition of struct member attributes, and is under discussion and early development by both the GCC and Clang developer communities.&#xA;&#xA;Add __attribute__((__counted_by__(member)))&#xA;&#xA;In order to annotate flexible arrays, a new attribute could be used to describe the relationship between struct members. For example:&#xA;&#xA;struct object {&#xA;        ...&#xA;        signed char items;&#xA;        ...&#xA;        int flex[];&#xA;} *p;&#xA;becomes:&#xA;&#xA;struct object {&#xA;        ...&#xA;        signed char items;&#xA;        ...&#xA;        int flex[] __attribute__((__counted_by__(items)));&#xA;} *p;&#xA;&#xA;This would allow -fsanitize=bounds to check for out-of-bounds accesses.  
For example, given the above annotation, each of the marked accesses into p-&gt;flex should trap:&#xA;&#xA;sum += p-&gt;flex[-1];  // trap all negative indexes&#xA;sum += p-&gt;flex[128]; // trap when index larger than bounds type&#xA;sum += p-&gt;flex[0];   // trap when p-&gt;items &lt;= 0&#xA;sum += p-&gt;flex[5];   // trap when p-&gt;items &lt;= 5&#xA;sum += p-&gt;flex[idx]; // trap when p-&gt;items &lt;= idx || idx &lt; 0&#xA;The type associated with the bounds check (signed char in the example above) should perhaps be required to be an unsigned type, but Linux has so many counters implemented as int that it becomes an additional refactoring burden to change these to unsigned, especially since sometimes they are sneakily being used with negative values in some other part of the code. Better to leave them as-is (though perhaps emit a warning), and just add a negativity check at access time. Switching the counter to unsigned then potentially becomes a small performance improvement.&#xA;&#xA;Similar to -fsanitize=bounds above, __builtin_dynamic_object_size() will perform the expected calculations with the items member as the basis for the resulting size (and where values less than 0 are considered to be 0 to avoid pathological calculations):&#xA;&#xA;p-&gt;items = 5;&#xA;assert(__builtin_dynamic_object_size(p, 1) ==&#xA;        sizeof(*p) + 5 * sizeof(*p-&gt;flex));&#xA;assert(__builtin_dynamic_object_size(p-&gt;flex, 1) ==&#xA;        5 * sizeof(*p-&gt;flex));&#xA;assert(__builtin_dynamic_object_size(&amp;p-&gt;flex[0], 1) ==&#xA;        sizeof(*p-&gt;flex));&#xA;assert(__builtin_dynamic_object_size(&amp;p-&gt;flex[2], 0) ==&#xA;        3 * sizeof(*p-&gt;flex));&#xA;&#xA;p-&gt;items = -10;&#xA;assert(__builtin_dynamic_object_size(p, 0) == sizeof(*p));&#xA;assert(__builtin_dynamic_object_size(p, 1) == sizeof(*p));&#xA;assert(__builtin_dynamic_object_size(p-&gt;flex, 1) == 0);&#xA;assert(__builtin_dynamic_object_size(&amp;p-&gt;flex[2], 1) == 0);&#xA;Additional attributes may be needed if structures explicitly use byte counts rather than element 
counts.&#xA;&#xA;Scope considerations&#xA;&#xA;Composite structures need to be able to define __counted_by__ across struct boundaries:&#xA;&#xA;struct object {&#xA;        ...&#xA;        char items;&#xA;        ...&#xA;        struct inner {&#xA;                ...&#xA;                int flex[] __attribute__((__counted_by__(.items)));&#xA;        };&#xA;} *ptr;&#xA;This may mean passing &amp;ptr-&gt;inner to a function will lose the bounds knowledge, but it may be possible to automatically include a bounds argument as an invisible function argument, as any function able to understand the layout of struct inner must by definition have visibility into the definition of struct object. For example, with this:&#xA;&#xA;struct object instance;&#xA;...&#xA;func(&amp;instance.inner);&#xA;...&#xA;void func(struct inner *ptr) {&#xA;        ...&#xA;        ptr-&gt;flex[foo]; /* &#34;items&#34; is not in scope */&#xA;        ...&#xA;}&#xA;The prototype could either be rejected due to lack of available scope, or could be automatically converted into passing the outer object pointer with an injected scope:&#xA;&#xA;void func(struct object *ptr) {&#xA;        struct inner *inner_ptr = &amp;ptr-&gt;inner;&#xA;        ...&#xA;        inner_ptr-&gt;flex[foo]; /* ptr-&gt;items is in scope */&#xA;        ...&#xA;}&#xA;&#xA;Annotate kernel flexible array structs&#xA;&#xA;With the compiler attribute available, all of Linux&#39;s flexible arrays can be updated to include the annotation, and CONFIG_FORTIFY_SOURCE can be expanded to use __builtin_dynamic_object_size().&#xA;&#xA;Replace DECLARE_FLEX_ARRAY with DECLARE_BOUNDED_ARRAY&#xA;&#xA;Most uses of DECLARE_FLEX_ARRAY() can be replaced with DECLARE_BOUNDED_ARRAY(), explicitly naming the expected flex array bounds member. 
For example, if we had:&#xA;&#xA;struct obj {&#xA;        ...&#xA;        int items;&#xA;        ...&#xA;        union {&#xA;                DECLARE_FLEX_ARRAY(struct foo, name1);&#xA;                DECLARE_FLEX_ARRAY(struct bar, name2);&#xA;        };&#xA;};&#xA;it would become:&#xA;&#xA;struct obj {&#xA;        ...&#xA;        int items;&#xA;        ...&#xA;        union {&#xA;                DECLARE_BOUNDED_ARRAY(struct foo, name1, items);&#xA;                DECLARE_BOUNDED_ARRAY(struct bar, name2, items);&#xA;        };&#xA;};&#xA;&#xA;Add manual annotations&#xA;&#xA;Any flexible array structures not already using DECLARE_BOUNDED_ARRAY() can be annotated manually with the new attribute. For example, assuming the proposed __attribute__((__counted_by__(member))) is wrapped in a macro named __counted_by():&#xA;&#xA;struct obj {&#xA;        ...&#xA;        int items;&#xA;        ...&#xA;        int flex[];&#xA;};&#xA;becomes:&#xA;&#xA;struct obj {&#xA;        ...&#xA;        int items;&#xA;        ...&#xA;        int flex[] __counted_by(items);&#xA;};&#xA;&#xA;Future work: expand attribute beyond arrays&#xA;&#xA;It will also be possible to use the new attribute on pointers and function arguments as well as flexible arrays. All the same details are available, though there would be the obvious differences for enclosing structure sizes, as the pointers are aimed (usually) outside the struct itself. Regardless, having it be possible to check offsets and inform __builtin_dynamic_object_size() would allow for several more places where runtime checking could be possible. 
For example, given this:&#xA;&#xA;struct object {&#xA;        ...&#xA;        unsigned char items;&#xA;        ...&#xA;        int *data __attribute__((__counted_by__(items)));&#xA;        ...&#xA;} *p;&#xA;It should be possible to detect sizing information:&#xA;&#xA;p-&gt;items = 5;&#xA;assert(__builtin_dynamic_object_size(p-&gt;data, 1) ==&#xA;        5 * sizeof(*p-&gt;data));&#xA;assert(__builtin_dynamic_object_size(*p-&gt;data, 1) ==&#xA;        sizeof(*p-&gt;data));&#xA;assert(__builtin_dynamic_object_size(p-&gt;data, 0) ==&#xA;        5 * sizeof(*p-&gt;data));&#xA;And it should be possible to trap on the following bad accesses:&#xA;&#xA;int *ptr = p-&gt;data;&#xA;sum += ptr[-1];  // trap all negative indexes&#xA;sum += ptr[500]; // trap when index larger than bounds type&#xA;sum += ptr[0];   // trap when p-&gt;items &lt;= 0&#xA;sum += ptr[5];   // trap when p-&gt;items &lt;= 5&#xA;ptr += 5;        // don&#39;t trap yet: allow ptr++ in a for loop&#xA;sum += *ptr;     // trap when p-&gt;items &lt;= 5&#xA;&#xA;A safer code base&#xA;&#xA;A C codebase that has refactored all of its arrays into proper flexible arrays can now finally build by using:&#xA;&#xA;        -Warray-bounds&#xA;        -fstrict-flex-arrays&#xA;        -fsanitize=bounds&#xA;        -fsanitize-undefined-trap-on-error&#xA;        -D_FORTIFY_SOURCE=3&#xA;With this, the burdens of C array index bounds checking will have been shifted to the toolchain, and array index overflow flaw exploitation can be a thing of the past, reducing severity to a simple denial of service (assuming the traps aren&#39;t handled gracefully). For the next trick, new code can be written in a language that is memory safe to start with (e.g. Rust).&#xA;&#xA;Acknowledgements&#xA;&#xA;Thanks to many people who gave me feedback on this post: Nick Desaulniers, Gustavo A. R. 
Silva, Bill Wendling, Qing Zhao, Kara Olive, Chris Palmer, Steven Rostedt, Allen Webb, Julien Voisin, Guenter Roeck, Evan Benn, Seth Jenkins, Alexander Potapenko, Ricardo Ribalda, and Kevin Chowski.&#xA;&#xA;Discussion&#xA;Please join this thread with your thoughts, comments, and corrections. :)&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<h3 id="how-to-modernize-c-arrays-for-greater-memory-safety-a-case-study-in-refactoring-the-linux-kernel-and-a-look-to-the-future">How to modernize C arrays for greater memory safety: a case-study in refactoring the Linux kernel and a look to the future</h3>

<h2 id="kees-cook-mailto-kees-kernel-org"><a href="mailto:kees@kernel.org" rel="nofollow">Kees Cook</a></h2>

<h2 id="c-is-not-just-a-fancy-assembler-any-more">C is not just a fancy assembler any more</h2>

<p>Large projects written in C, especially those written close to the hardware layer like Linux, have long treated the language as a high-level assembler. Using C allowed for abstracting away much of the difficulty of writing directly in machine code while still providing easy low-level access to memory, registers, and CPU features. However, C has matured over the last <a href="https://en.wikipedia.org/wiki/C_%28programming_language%29#cite_note-dottcl_2-2" rel="nofollow">half century</a>, and many language features that improve robustness go unused in older codebases. This is especially true for arrays, where the historical <a href="https://cwe.mitre.org/data/definitions/129.html" rel="nofollow">lack of bounds checking</a> has been a consistent source of <a href="https://www.cvedetails.com/vulnerability-search.php?f=1&amp;product=Linux+Kernel&amp;cweid=129" rel="nofollow">security flaws</a>.</p>

<p>Converting such codebases to use “modern” language features, like those in <a href="https://en.wikipedia.org/wiki/Flexible_array_member" rel="nofollow">C99</a> (still from the prior millennium), can be a major challenge, but it is an entirely tractable problem. This post is a deep dive into an effort underway in the Linux kernel to make array index overflows (and more generally, buffer overflows) a thing of the <a href="https://git.kernel.org/linus/7f14c7227f342d9932f9b918893c8814f86d2a0d" rel="nofollow">past</a>, where they belong. Our success hinges on replacing anachronistic array definitions with well-defined C99 flexible arrays. This approach can be used by developers to refactor C code, making it possible to leverage <a href="https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Instrumentation-Options.html" rel="nofollow">21st century mitigations</a> (like <code>-fsanitize=bounds</code> and <code>FORTIFY_SOURCE</code>), since such things can finally be cleanly applied to the modernized codebase.</p>

<h2 id="the-fraught-history-of-arrays-in-c">The fraught history of arrays in C</h2>

<p>For the compiler to successfully apply array index bounds checking, array sizes must be defined unambiguously, which is not always easy in C. Depending on the array definition, bounds checking falls roughly into three categories: fixed-sized arrays, dynamically-sized arrays, and pointer offsets. Each category of array definitions must be made unambiguous before the next, as they mostly build on top of each other.  For example, if the compiler cannot protect a fixed-sized array, it certainly cannot protect a dynamically-sized array, and array indexing is just a specialized case of calculating a memory pointer offset.</p>

<p>Properly defined dynamically-sized arrays were introduced in C99 (<code>int foo[]</code>), and called “flexible arrays”. Before that, many C projects used the GNU extension of zero-length arrays (<code>int foo[0]</code>), which is not recognized by the C standard. This was done because, before the GNU extension, C projects would use single-element arrays (<code>int foo[1]</code>) which had several frustrating characteristics. (Using <code>sizeof()</code> on such a structure would include a single element as well, which would require additional handling to get allocation sizes to be accurate. This is not a problem for zero-element or true flexible arrays.)</p>

<p>However, due to yet more historical situations (e.g. <a href="https://www.gnu.org/software/libc/manual/html_node/Address-Formats.html#index-struct-sockaddr" rel="nofollow">struct sockaddr</a>, which has a fixed-size trailing array that is <em>not</em> supposed to actually be treated as fixed-size), GCC and Clang actually treat <em>all</em> trailing arrays as flexible arrays. This behavior makes things even more problematic, since it becomes impossible to limit a flexible array heuristic to only 1-element or 0-element (i.e. zero-length) arrays. For example, a compiler can&#39;t tell the intent of the <code>variable</code> member&#39;s use here:</p>

<pre><code class="language-c">struct obj {
        ...
        unsigned char bytes;
        int variable[4];
};
</code></pre>

<p>Is it actually a 4-element array, or is it sized by the <code>bytes</code> member? As such, compilers have had to assume that trailing arrays must be intended to be dynamically sized (even though most are intended to be fixed-size).</p>

<p>To clear the way for sensible protection of fixed-size arrays, and to have a common framework for handling dynamically-sized arrays, Linux must have all the “fake” flexible array members replaced with <em>actual</em> C99 flexible array members so that the programmer&#39;s intent can actually be represented in an unambiguous way. With this done, <a href="https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Warning-Options.html#index-Warray-bounds" rel="nofollow"><code>-Warray-bounds</code></a> (and similar things like <a href="https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Object-Size-Checking.html#index-_005f_005fbuiltin_005fobject_005fsize-1" rel="nofollow"><code>__builtin_object_size()</code></a>) will catch compile-time problems, and <a href="https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Instrumentation-Options.html#index-fsanitize_003dbounds" rel="nofollow"><code>-fsanitize=bounds</code></a> (and similar things like <a href="https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Object-Size-Checking.html#index-_005f_005fbuiltin_005fdynamic_005fobject_005fsize-1" rel="nofollow"><code>__builtin_dynamic_object_size()</code></a>) can catch run-time problems.</p>

<p>Once fixed-sized arrays are protected, dynamically sized arrays can be protected as well, though this requires introducing a way to annotate structures that contain flexible arrays. Nearly all such structs also contain the count of allocated elements present in the flexible array:</p>

<pre><code class="language-c">struct obj {
        ...
        unsigned short count;
        struct foo items[]; /* Has &#34;count&#34; many &#34;struct foo&#34;s */
} *ptr;
</code></pre>

<p>Such structs therefore fully describe their contents at runtime (and are called “flexible array structures” from here on). In other words, their size can be determined at run-time as:</p>

<pre><code class="language-c">sizeof(*ptr) + sizeof(*ptr-&gt;items) * ptr-&gt;count
</code></pre>

<p>Teaching the compiler which struct member is associated with the count of a given flexible array member will allow <code>-fsanitize=bounds</code> and <code>__builtin_dynamic_object_size()</code> to reason about flexible array structure usage as well, covering all arrays in Linux with “known bounds”.</p>

<p>(Not covered here is the closely related work to tighten the <code>FORTIFY_SOURCE</code> <a href="https://outflux.net/slides/2022/lss-na/" rel="nofollow">implementation</a> for the <code>memcpy()</code>-family of functions which also depends on making flexible array sizes unambiguous.)</p>

<h2 id="replacing-fake-flexible-arrays-https-github-com-kspp-linux-issues-21"><a href="https://github.com/KSPP/linux/issues/21" rel="nofollow">Replacing “fake” flexible arrays</a></h2>

<p>Compile-time diagnostics about the size of arrays use either internal value range checking or things similar to the <code>FORTIFY_SOURCE</code> macros (which use <code>__builtin_object_size()</code> for their implementations). This works well for arrays not at the end of the structure, but gets disabled for trailing arrays since the compiler must treat trailing arrays as flexible arrays (see struct sockaddr above). And for everything treated as a flexible array (i.e. dynamically sized), the compiler cannot know the array length at compile time, since it will be only known at runtime. To make such array declarations unambiguous (and therefore able to gain sane runtime bounds checking), compilers must gain an option to disable all “fake” flexible array heuristics, and treat only <em>true</em> flexible arrays as flexible arrays.</p>

<p>The newly created <code>-fstrict-flex-arrays</code> option is now available in recent GCC and Clang builds, but any project using it will need to replace all fake flexible arrays with true flexible arrays first (to separate them from any fixed-size trailing arrays). This comes with several challenges.</p>

<h3 id="replace-0-length-arrays">Replace 0-length arrays</h3>

<p>Most replacement of 0-length arrays with flexible arrays requires no special handling. Simply removing the “0” in the array declaration is sufficient. For example,</p>

<pre><code class="language-c">struct obj {
        ...
        int flex[0];
};
</code></pre>

<p>becomes:</p>

<pre><code class="language-c">struct obj {
        ...
        int flex[];
};
</code></pre>

<p>However, there are a few things of note that can go wrong with these conversions:</p>

<h4 id="changes-to-sizeof">Changes to <code>sizeof()</code></h4>

<p>While <code>sizeof(instance-&gt;flex)</code> for a 0-length array returns 0, it becomes a compile-time failure once the member is converted to a true flexible array. This usually manifests within other complex macros that examine the details of a given struct, and the failures are usually hidden bugs that switching to a flexible array helps expose.</p>

<h4 id="pass-by-value">Pass by value</h4>

<p>Converting to a true flexible array will expose any strange cases of trying to pass a flexible array struct by value. These are almost always a bug, so it&#39;s another case where a problem is exposed by cleaning up fake flexible arrays. For <a href="https://lore.kernel.org/lkml/CAHk-=wiwRtpyMVn1F9KT14H64tajiWsPnd0FfL5-BFnPOuFa_w@mail.gmail.com/" rel="nofollow">example</a>:</p>

<pre><code>net/core/flow_dissector.c: In function &#39;is_pppoe_ses_hdr_valid&#39;:
net/core/flow_dissector.c:898:13: note: the ABI of passing struct with a flexible array member has changed in GCC 4.4

898 | static bool is_pppoe_ses_hdr_valid(struct pppoe_hdr hdr)
    |                                   ^~~~~~~~~~~~~~~~~~~~~~
</code></pre>

<h4 id="flexible-arrays-in-unions">Flexible arrays in unions</h4>

<p>C99 6.7.2.1 “Structure and union specifiers” #16 declares true flexible arrays may not be in unions nor otherwise empty structures: “As a special case, the last element of a structure with more than one named member may have an incomplete array type; this is called a flexible array member.”</p>

<p>However, this situation <em>is</em> allowed by the GNU “trailing array” extension, where such arrays are treated as flexible arrays. More importantly, flexible arrays (via the GNU extension) are used in unions in many places throughout Linux code. The C99 treatment of true flexible arrays appears to be only a definitional limitation (and likely just an oversight) since the restriction can be worked around with creative use of anonymous structs. For example, <a href="https://godbolt.org/z/T394GPndf" rel="nofollow">this</a> will build:</p>

<pre><code class="language-c">struct obj {
        ...
        union {
                struct foo name1[0];
                struct bar name2[0];
        };
};
</code></pre>

<p>but <a href="https://godbolt.org/z/Grx4Msbq1" rel="nofollow">this</a> will not:</p>

<pre><code class="language-c">struct obj {
        ...
        union {
                struct foo name1[];
                struct bar name2[];
        };
};
</code></pre>

<pre><code>&lt;source&gt;:5:22: error: flexible array member in union
  5 | struct foo name1[];
    |            ^~~~~
</code></pre>

<p>But in both cases, the compiler treats <code>name1</code> and <code>name2</code> as flexible arrays. What will <a href="https://godbolt.org/z/GeYjGsKjj" rel="nofollow">happily compile</a>, though, is wrapping each true flexible array in a struct that has at least one other member that is not a flexible array, including an empty anonymous struct (i.e. taking up no size):</p>

<pre><code class="language-c">struct obj {
        ...
        union {
                struct {
                        struct { } __unused_member1;
                        struct foo name1[];
                };
                struct {
                        struct { } __unused_member2;
                        struct bar name2[];
                };
        };
};
</code></pre>

<p>Thankfully, this was wrapped in Linux with the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/stddef.h?h=v6.1#n32" rel="nofollow"><code>DECLARE_FLEX_ARRAY()</code></a> macro:</p>

<pre><code class="language-c">struct obj {
        ...
        union {
                DECLARE_FLEX_ARRAY(struct foo, name1);
                DECLARE_FLEX_ARRAY(struct bar, name2);
        };
};
</code></pre>

<p>which makes this much more readable. I hope to see future C standards eliminate this restriction.</p>

<h4 id="overlapping-composite-structure-members">Overlapping composite structure members</h4>

<p>This is another case of a real bug being exposed by true flexible array usage, as it is possible to create an implicit union of a flexible array and something else by including a flexible array structure in another struct. For <a href="https://lore.kernel.org/lkml/202206281009.4332AA33@keescook/" rel="nofollow">example</a>:</p>

<pre><code class="language-c">struct inner {
        ...
        int flex[0];
};

struct outer {
        ...
        struct inner header;
        int overlap;
        ...
} *instance;
</code></pre>

<p>Here, <code>instance-&gt;overlap</code> and <code>instance-&gt;header.flex[0]</code> share the same memory location. Whether or not this is <em>intentional</em> cannot be understood by the compiler. If it is a bug, then using a true flexible array will trigger a warning. If it&#39;s not a bug, rearranging the structures to use an actual union is needed (see above).</p>

<h4 id="struct-definition-parsed-by-something-other-than-a-c-compiler">struct definition parsed by something other than a C compiler</h4>

<p>If the converted struct is part of a source file that is parsed by something that is not a C compiler, it may not be prepared to handle empty square braces on arrays. For example, <a href="https://www.spinics.net/lists/fedora-devel/msg297996.html" rel="nofollow">SWIG broke</a> when the Linux Userspace API headers got converted. This is a <a href="https://github.com/swig/swig/issues/1699" rel="nofollow">known issue</a> in SWIG, and can be <a href="https://github.com/dgibson/dtc/commit/abbd523bae6e75545ccff126a4a47218ec0defab" rel="nofollow">worked around</a> in various ways.</p>

<h3 id="replace-1-element-arrays">Replace 1-element arrays</h3>

<p>Most 1-element array conversions are similar to 0-length array conversions, but with the effect that the surrounding structure&#39;s <code>sizeof()</code> changes. This leads to a few additional significant issues:</p>

<h4 id="size-calculations">Size calculations</h4>

<p>If a struct is used entirely internally to Linux, it is generally sufficient to make changes to both the struct and all size calculations, which will result in identical binary output. For example:</p>

<pre><code class="language-c">struct object {
        ...
        int flex[1];
} *p;

p = kmalloc(sizeof(*p) + sizeof(p-&gt;flex[0]) * (count - 1),
            GFP_KERNEL);
</code></pre>

<p>the above <code>count - 1</code> becomes just <code>count</code> now:</p>

<pre><code class="language-c">struct object {
        ...
        int flex[];
} *p;

p = kmalloc(sizeof(*p) + sizeof(p-&gt;flex[0]) * count,
            GFP_KERNEL);
</code></pre>

<p>If all size calculations are correctly adjusted, there should be no differences in the resulting allocation size, etc. If a discrepancy is found, it is going to be either a bug introduced by the conversion, or the discovery of an existing bug in the original size calculations.</p>

<p>Note that depending on the sizes of the structure, its flexible array element, and count, there is also the risk associated with arithmetic overflow. Linux uses the <code>struct_size()</code> macro to perform these calculations so that the result saturates to at most <code>SIZE_MAX</code>, which will cause an allocation failure rather than wrapping around. So the best way to perform this allocation would be:</p>

<pre><code class="language-c">p = kmalloc(struct_size(p, flex, count), GFP_KERNEL);
</code></pre>

<h4 id="padding-and-interface-sizes">Padding and interface sizes</h4>

<p>When a structure definition is also used by a codebase we don&#39;t control (e.g. firmware, userspace, virtualization), changing its layout or <code>sizeof()</code> may break such code. Specifically, it may break its ability to communicate correctly with the kernel across the shared interface. Such a structure cannot suddenly lose the single element of its trailing array. In these cases, a new member needs to be used for kernel code, explicitly keeping the original member for backward compatibility. For example:</p>

<pre><code class="language-c">struct object {
        ...
        int flex[1];
};
</code></pre>

<p>becomes:</p>

<pre><code class="language-c">struct object {
        ...
        union {
                int flex[1];
                DECLARE_FLEX_ARRAY(int, data);
        };
};
</code></pre>

<p>Now the kernel will only use the newly named data member (and gain any potential bounds checking protections from the compiler), and external code that shares this structure definition can continue to use the flex member, all without changing the size of the structure.</p>

<p>This has the downside of needing to change the member name throughout Linux. However, if the other side of the interface doesn&#39;t actually use the original member, we can avoid this. We can convert the member to a flexible array and add explicit padding instead. This would mean no collateral changes with the member name in Linux are needed:</p>

<pre><code class="language-c">struct object {
        ...
        union {
                int __padding;
                DECLARE_FLEX_ARRAY(int, flex);
        };
};
</code></pre>

<h3 id="replace-multi-element-arrays">Replace multi-element arrays</h3>

<p>In the cases of trailing arrays with larger element counts, the usage needs to be even more carefully studied. Most problems end up looking very similar to 1-element interface conversions above. For example, if there is some hardware interface that returns at least 4 bytes for an otherwise dynamically sized array, the conversion would start from here:</p>

<pre><code class="language-c">struct object {
        ...
        unsigned char data[4];
};
</code></pre>

<p>which becomes:</p>

<pre><code class="language-c">struct object {
        ...
        union {
                unsigned char __padding[4];
                DECLARE_FLEX_ARRAY(unsigned char, data);
        };
};
</code></pre>

<h2 id="enable-warray-bounds">Enable <code>-Warray-bounds</code></h2>

<p>With all fixed-size array bounds able to be determined at build time, <code>-Warray-bounds</code> can actually perform the checking, keeping provably bad code out of Linux. (This option is already part of <code>-Wall</code>, which Linux isn&#39;t quite able to use itself yet, but is strongly recommended for other C projects.) As a reminder, optimization level will impact this option. The kernel is built with <code>-O2</code>, which is likely the right choice for most C projects.</p>

<h2 id="enable-wzero-length-array">Enable <code>-Wzero-length-array</code></h2>

<p>If all zero length arrays have been removed from the code, future uses can be kept out of the code by using <a href="https://clang.llvm.org/docs/DiagnosticsReference.html#wzero-length-array" rel="nofollow"><code>-Wzero-length-array</code></a>.  This option is currently <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94428" rel="nofollow">only</a> available in Clang, and will warn when finding the <em>definition</em> of such structure members, rather than warning when they are <em>accessed</em> in code. Because of this, it is unlikely to ever be enabled in Linux since some array sizes are constructed from build configurations, and may drop to 0 when they are unused (i.e. they were never used as flexible arrays). As such, it is sufficient to use <code>-fstrict-flex-arrays</code> (see below) and <code>-Warray-bounds</code>.</p>

<h2 id="enable-fstrict-flex-arrays">Enable <code>-fstrict-flex-arrays</code></h2>

<p>Once all the fake flexible arrays have been converted to true flexible arrays, the remaining fixed-sized trailing arrays can start being treated as actually fixed-size by enabling <code>-fstrict-flex-arrays</code>. Future attempts to add fake flexible arrays to the code will then elicit warnings as part of the existing diagnostics from <code>-Warray-bounds</code>, since all fake flexible arrays are now treated as fixed-size arrays. (Note that this option sees the subset of 0-length arrays caught by <code>-Wzero-length-array</code> when they are actually <em>used</em> in the code, so <code>-Wzero-length-array</code> may be redundant.)</p>

<h2 id="coming-soon-annotate-bounds-of-flexible-arrays">Coming soon: annotate bounds of flexible arrays</h2>

<p>With flexible arrays now a first-class citizen in Linux and the compilers, it becomes possible to extend their available diagnostics.  What the compiler is missing is knowledge of how the length of a given flexible array is tracked. For well-described flexible array structs, this means associating the member holding the element count with the flexible array member. This idea is not new, though prior <a href="https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2660.pdf" rel="nofollow">implementation proposals</a> have wanted to make changes to the C language syntax. A simpler approach is the addition of struct member attributes, and is under discussion and early development by both the GCC and Clang developer communities.</p>

<h3 id="add-attribute-counted-by-member">Add <code>__attribute__((__counted_by__(member)))</code></h3>

<p>In order to annotate flexible arrays, a new attribute could be used to describe the relationship between struct members. For example:</p>

<pre><code class="language-c">struct object {
        ...
        signed char items;
        ...
        int flex[];
} *p;
</code></pre>

<p>becomes:</p>

<pre><code class="language-c">struct object {
        ...
        signed char items;
        ...
        int flex[] __attribute__((__counted_by__(items)));
} *p;
</code></pre>

<p>This would allow <code>-fsanitize=bounds</code> to check for out-of-bounds accesses.  For example, given the above annotation, each of the marked accesses into <code>p-&gt;flex</code> should trap:</p>

<pre><code class="language-c">sum += p-&gt;flex[-1];  // trap all negative indexes
sum += p-&gt;flex[128]; // trap when index larger than bounds type
sum += p-&gt;flex[0];   // trap when p-&gt;items &lt;= 0
sum += p-&gt;flex[5];   // trap when p-&gt;items &lt;= 5
sum += p-&gt;flex[idx]; // trap when p-&gt;items &lt;= idx || idx &lt; 0
</code></pre>

<p>The type associated with the bounds check (<code>signed char</code> in the example above) should perhaps be required to be an unsigned type, but Linux has so many counters implemented as <code>int</code> that it becomes an additional refactoring burden to change these to unsigned, especially since sometimes they are sneakily being used with negative values in some other part of the code. Better to leave them as-is (though perhaps emit a warning), and just add a negativity check at access time. Switching the counter to unsigned then potentially becomes a small performance improvement.</p>

<p>Similar to <code>-fsanitize=bounds</code> above, <code>__builtin_dynamic_object_size()</code> will perform the expected calculations with the items member as the basis for the resulting size (and where values less than 0 are considered to be 0 to avoid pathological calculations):</p>

<pre><code class="language-c">p-&gt;items = 5;
assert(__builtin_dynamic_object_size(p, 1) ==
        sizeof(*p) + 5 * sizeof(*p-&gt;flex));
assert(__builtin_dynamic_object_size(p-&gt;flex, 1) ==
        5 * sizeof(*p-&gt;flex));
assert(__builtin_dynamic_object_size(&amp;p-&gt;flex[0], 1) ==
        sizeof(*p-&gt;flex));
assert(__builtin_dynamic_object_size(&amp;p-&gt;flex[2], 0) ==
        3 * sizeof(*p-&gt;flex));

p-&gt;items = -10;
assert(__builtin_dynamic_object_size(p, 0) == sizeof(*p));
assert(__builtin_dynamic_object_size(p, 1) == sizeof(*p));
assert(__builtin_dynamic_object_size(p-&gt;flex, 1) == 0);
assert(__builtin_dynamic_object_size(&amp;p-&gt;flex[2], 1) == 0);
</code></pre>

<p>Additional attributes may be needed if structures explicitly use byte counts rather than element counts.</p>

<h4 id="scope-considerations">Scope considerations</h4>

<p>Composite structures need to be able to define <code>__counted_by__</code> across struct boundaries:</p>

<pre><code class="language-c">struct object {
        ...
        char items;
        ...
        struct inner {
                ...
                int flex[] __attribute__((__counted_by__(.items)));
        };
} *ptr;
</code></pre>

<p>This may mean passing <code>&amp;ptr-&gt;inner</code> to a function will lose the bounds knowledge, but it may be possible to automatically include a bounds argument as an invisible function argument, as any function able to understand the layout of <code>struct inner</code> must by definition have visibility into the definition of <code>struct object</code>. For example, with this:</p>

<pre><code class="language-c">struct object instance;
...
func(&amp;instance.inner);
...
void func(struct inner *ptr) {
        ...
        ptr-&gt;flex[foo]; /* &#34;items&#34; is not in scope */
        ...
}
</code></pre>

<p>The prototype could either be rejected due to lack of available scope, or could be automatically converted into passing the outer object pointer with an injected scope:</p>

<pre><code class="language-c">void func(struct object *__ptr) {
        struct inner *ptr = &amp;__ptr-&gt;inner;
        ...
        ptr-&gt;flex[foo]; /* __ptr-&gt;items is in scope */
        ...
}
</code></pre>

<h3 id="annotate-kernel-flexible-array-structs">Annotate kernel flexible array structs</h3>

<p>With the compiler attribute available, all of Linux&#39;s flexible arrays can be updated to include the annotation, and <code>CONFIG_FORTIFY_SOURCE</code> can be expanded to use <code>__builtin_dynamic_object_size()</code>.</p>

<h3 id="replace-declare-flex-array-with-declare-bounded-array">Replace <code>DECLARE_FLEX_ARRAY</code> with <code>DECLARE_BOUNDED_ARRAY</code></h3>

<p>Most uses of <code>DECLARE_FLEX_ARRAY()</code> can be replaced with <code>DECLARE_BOUNDED_ARRAY()</code>, explicitly naming the expected flex array bounds member. For example, if we had:</p>

<pre><code class="language-c">struct obj {
        ...
        int items;
        ...
        union {
                DECLARE_FLEX_ARRAY(struct foo, name1);
                DECLARE_FLEX_ARRAY(struct bar, name2);
        };
};
</code></pre>

<p>it would become:</p>

<pre><code class="language-c">struct obj {
        ...
        int items;
        ...
        union {
                DECLARE_BOUNDED_ARRAY(struct foo, name1, items);
                DECLARE_BOUNDED_ARRAY(struct bar, name2, items);
        };
};
</code></pre>

<h3 id="add-manual-annotations">Add manual annotations</h3>

<p>Any flexible array structures not already using <code>DECLARE_BOUNDED_ARRAY()</code> can be annotated manually with the new attribute. For example, assuming the proposed <code>__attribute__((__counted_by__(member)))</code> is wrapped in a macro named <code>__counted_by()</code>:</p>

<pre><code class="language-c">struct obj {
        ...
        int items;
        ...
        int flex[];
};
</code></pre>

<p>becomes:</p>

<pre><code class="language-c">struct obj {
        ...
        int items;
        ...
        int flex[] __counted_by(items);
};
</code></pre>

<h2 id="future-work-expand-attribute-beyond-arrays">Future work: expand attribute beyond arrays</h2>

<p>It will also be possible to use the new attribute on pointers and function arguments as well as flexible arrays. All the same details are available, though there would be the obvious differences for enclosing structure sizes, as the pointers are aimed (usually) outside the struct itself. Regardless, having it be possible to check offsets and inform <code>__builtin_dynamic_object_size()</code> would allow for several more places where runtime checking could be possible. For example, given this:</p>

<pre><code class="language-c">struct object {
        ...
        unsigned char items;
        ...
        int *data __attribute__((__counted_by__(items)));
        ...
} *p;
</code></pre>

<p>It should be possible to detect sizing information:</p>

<pre><code class="language-c">p-&gt;items = 5;
assert(__builtin_dynamic_object_size(p-&gt;data, 1) ==
        5 * sizeof(*p-&gt;data));
assert(__builtin_dynamic_object_size(*p-&gt;data, 1) ==
        sizeof(*p-&gt;data));
assert(__builtin_dynamic_object_size(*p-&gt;data, 0) ==
        5 * sizeof(*p-&gt;data));
</code></pre>

<p>And it should be possible to trap on the following bad accesses:</p>

<pre><code class="language-c">int *ptr = p-&gt;data;
sum += ptr[-1];  // trap all negative indexes
sum += ptr[500]; // trap when index larger than bounds type
sum += ptr[0];   // trap when p-&gt;items &lt;= 0
sum += ptr[5];   // trap when p-&gt;items &lt;= 5
ptr += 5;        // don&#39;t trap yet: allow ptr++ in a for loop
sum += *ptr;     // trap when p-&gt;items &lt;= 5
</code></pre>

<h2 id="a-safer-code-base">A safer code base</h2>

<p>A C codebase that has refactored all of its arrays into proper flexible arrays can now finally build by using:</p>

<pre><code>        -Warray-bounds
        -fstrict-flex-arrays
        -fsanitize=bounds
        -fsanitize-undefined-trap-on-error
        -D_FORTIFY_SOURCE=3
</code></pre>

<p>With this, the burdens of C array index bounds checking will have been shifted to the toolchain, and array index overflow flaw exploitation can be a thing of the past, reducing severity to a simple denial of service (assuming the traps aren&#39;t handled gracefully). For the next trick, new code can be written in a language that is <a href="https://security.googleblog.com/2022/12/memory-safe-languages-in-android-13.html" rel="nofollow">memory safe</a> to start with (e.g. Rust).</p>

<h2 id="acknowledgements">Acknowledgements</h2>

<p>Thanks to many people who gave me feedback on this post: Nick Desaulniers, Gustavo A. R. Silva, Bill Wendling, Qing Zhao, Kara Olive, Chris Palmer, Steven Rostedt, Allen Webb, Julien Voisin, Guenter Roeck, Evan Benn, Seth Jenkins, Alexander Potapenko, Ricardo Ribalda, and Kevin Chowski.</p>

<h2 id="discussion">Discussion</h2>

<p>Please join <a href="https://fosstodon.org/@kees/109740441782167956" rel="nofollow">this thread</a> with your thoughts, comments, and corrections. :)</p>
]]></content:encoded>
      <author>kees</author>
      <guid>https://people.kernel.org/read/a/5bw9rfqder</guid>
      <pubDate>Mon, 23 Jan 2023 19:15:53 +0000</pubDate>
    </item>
    <item>
      <title>kdevops v6.2-rc1 released</title>
      <link>https://people.kernel.org/mcgrof/kdevops-v6-2-rc1-released</link>
<description>&lt;![CDATA[kdevops logo&#xA;&#xA;After 3 years since the announcement of the first release of kdevops I&#39;d like to announce the release of v6.2-rc1 of kdevops!&#xA;&#xA;kdevops is designed to help with automation of Linux kernel development workflows. At first, it was not clear how and if kdevops could easily be used outside of filesystems testing. In fact my last post about it 3 years ago explained how one could only use kdevops in an odd way for other things: one had to fork it to use it for different workflows. That&#39;s old nonsense now. kdevops has grown to adopt kconfig, and so different workflows are now possible in one single tree. Embracing other things such as using jinja2 for file templating with ansible, and having to figure out a way to add PCI-E passthrough support through kconfig, has made me realize that the growth component of the project is no longer a concern, it is actually a feature now. It is clear now that new technologies and very complex workflows can easily be added to kdevops.&#xA;&#xA;But that is easy to say unless you have proof, and fortunately I have it. There are two new technologies that are well supported in kdevops which curious folks can start mucking around with, and which otherwise may take a bit of time to ramp up on. The technologies are: Zoned storage and CXL. 
Supporting new technologies also means ensuring you get whatever tooling you might need to want to test or work with such technologies.&#xA;&#xA;So for instance, getting a full Linux kernel development workflow going for CXL with the meson unit tests, even by enabling PCI-E passthrough, with the latest linux-next kernel is now reduced to just a few basic commands, in a Linux distribution / cloud provider agnostic manner:&#xA;&#xA;make dynconfig&#xA;make&#xA;make bringup&#xA;make linux&#xA;make cxl&#xA;make cxl-test-meson&#xA;&#xA;Just ask around a typical CXL Linux kernel developer how long it took them to get a CXL Linux kernel development &amp; test environment up and running that they were happy with. And ask if it was reproducible. This is all now just reduced to 6 commands.&#xA;&#xA;As for the details, it has been 8 months since the last release, and over that time the project has received 680 commits. I&#39;d like to thank the developers who contributed:&#xA;&#xA;Adam Manzanares&#xA;Amir Goldstein&#xA;Chandan Babu R&#xA;Jeff Layton&#xA;Joel Granados&#xA;Josef Bacik&#xA;Luis Chamberlain&#xA;Pankaj Raghav&#xA;&#xA;I&#39;d also like to thank my employer for trusting in this work, and allowing me to share a big iron server to help the community with Linux kernel stable work and general kernel technology enablement.&#xA;&#xA;As for the exact details of changes merged, there so many! So I&#39;ve tried to provide a nice terse summary on highlights on the git tag for v6.2-rc1. 8 months was certainly a long time to wait for a new release, so my hope is we&#39;ll try to bake a release now in tandem with the Linux kernel, in cadence with the same Linux kernel versioning and release timeline.&#xA;&#xA;Based on feedback at LSFMM this year the project is now under the github linux-kdevops organization. This enables other developers to push into the tree. 
This let&#39;s us scale, specially as different workflows are supported.&#xA;&#xA;If you see value in enabling rapid ramp up with Linux kernel development through kdevops for your subsystem / technology / feel free to join the party and either send a pull request to the group or just send patches.]]&gt;</description>
      <content:encoded><![CDATA[<p><img src="https://github.com/linux-kdevops/kdevops/raw/master/images/kdevops-trans-bg-edited-individual-with-logo-gausian-blur-1600x1600.png" alt="kdevops logo"></p>

<p>Three years after the <a href="https://people.kernel.org/mcgrof/kdevops-a-devops-framework-for-linux-kernel-development" rel="nofollow">announcement of the first release of kdevops</a>, I&#39;d like to announce the release of v6.2-rc1 of kdevops!</p>

<p>kdevops is designed to help with automation of Linux kernel development workflows. At first, it was not clear how, and if, kdevops could easily be used outside of filesystems testing. In fact, my last post about it 3 years ago explained that one could only use kdevops for other things in an odd way: one had to fork it for each different workflow. That&#39;s old nonsense now. kdevops has grown to adopt kconfig, so different workflows are now possible in one single tree. Embracing other things, such as using jinja2 for file templating with ansible and figuring out a way to add PCI-E passthrough support through kconfig, has made me realize that the <em>growth</em> component of the project is no longer a concern; it is actually a feature now. It is clear that new technologies and very complex workflows can easily be added to kdevops.</p>

<p>But that is easy to say unless you have proof, and fortunately I have it. There are two new technologies that are well supported in kdevops which curious folks can start mucking around with, and which otherwise may take a while to ramp up on: Zoned storage and CXL. Supporting new technologies also means ensuring you get whatever tooling you might need to test or work with them.</p>

<p>So for instance, getting a <em>full</em> Linux kernel development workflow going for CXL with the meson unit tests, even with PCI-E passthrough enabled, against the latest linux-next kernel is now reduced to just a few basic commands, in a Linux distribution / cloud provider agnostic manner:</p>

<pre><code>make dynconfig
make
make bringup
make linux
make cxl
make cxl-test-meson
</code></pre>

<p>Just ask a typical CXL Linux kernel developer how long it took them to get a CXL Linux kernel development &amp; test environment up and running that they were happy with. And ask if it was reproducible. This is all now reduced to 6 commands.</p>

<p>As for the details, it has been 8 months since the last release, and over that time the project has received 680 commits. I&#39;d like to thank the developers who contributed:</p>

<pre><code>Adam Manzanares
Amir Goldstein
Chandan Babu R
Jeff Layton
Joel Granados
Josef Bacik
Luis Chamberlain
Pankaj Raghav
</code></pre>

<p>I&#39;d also like to thank my employer for trusting in this work, and for allowing me to share a big iron server to help the community with Linux kernel stable work and general kernel technology enablement.</p>

<p>As for the exact details of changes merged, there are so many! So I&#39;ve tried to provide a nice terse summary of the highlights on the <a href="https://github.com/linux-kdevops/kdevops/releases/tag/v6.2-rc1" rel="nofollow">git tag for v6.2-rc1</a>. 8 months was certainly a long time to wait for a new release, so my hope is that we&#39;ll now bake releases in tandem with the Linux kernel, in cadence with the same Linux kernel versioning and release timeline.</p>

<p>Based on feedback at LSFMM this year, the project is now under the GitHub <a href="https://github.com/linux-kdevops/" rel="nofollow">linux-kdevops</a> organization. This enables other developers to push into the tree. This lets us scale, especially as different workflows are supported.</p>

<p>If you see value in enabling rapid ramp-up with Linux kernel development through kdevops for your subsystem / technology, feel free to join the party and either send a pull request to the group or just send patches.</p>
]]></content:encoded>
      <author>mcgrof</author>
      <guid>https://people.kernel.org/read/a/fknqzeiqq2</guid>
      <pubDate>Thu, 22 Dec 2022 05:59:39 +0000</pubDate>
    </item>
    <item>
      <title>Sending a kernel patch with b4 (part 1)</title>
      <link>https://people.kernel.org/monsieuricon/sending-a-kernel-patch-with-b4-part-1</link>
      <description>&lt;![CDATA[While b4 started out as a way for maintainers to retrieve patches from mailing lists, it also has contributor-oriented features. Starting with version 0.10 b4 can:&#xA;&#xA;create and manage patch series and cover letters&#xA;track and auto-reroll series revisions&#xA;display range-diffs between revisions&#xA;apply trailers received from reviewers and maintainers&#xA;submit patches without needing a valid SMTP gateway&#xA;&#xA;These features are still considered experimental, but they should be stable for most work and I&#39;d be happy to receive further feedback from occasional contributors.&#xA;&#xA;In this article, we&#39;ll go through the process of submitting an actual typo fix patch to the upstream kernel. This bug was identified a few years ago and submitted via bugzilla, but never fixed:&#xA;&#xA;https://bugzilla.kernel.org/showbug.cgi?id=205891&#xA;&#xA;Accompanying video&#xA;&#xA;This article has an accompanying video where I go through all the steps and submit the actual patch at the end:&#xA;&#xA;https://www.youtube.com/watch?v=QBR06ml2YLQ&#xA;&#xA;Installing the latest b4 version&#xA;&#xA;Start by installing b4. The easiest is to do it via pip, as this would grab the latest stable version:&#xA;$ pip install --user b4&#xA;[...]&#xA;$ b4 --version&#xA;0.11.1&#xA;If you get an error or an older version of b4, please check that your `$PATH contains $HOME/.local/bin` where pip installs the binaries.&#xA;&#xA;Preparing the tree&#xA;&#xA;`b4 prep -n [name-of-branch] -f [nearest-tag]`&#xA;&#xA;Next, prepare a topical branch where you will be doing your work. 
We&#39;ll be fixing a typo in `arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts, and we&#39;ll base this work on tag v6.1`:&#xA;$ b4 prep -n lanyang-dts-typo -f v6.1&#xA;Created new branch b4/lanyang-dts-typo&#xA;Created the default cover letter, you can edit with --edit-cover.&#xA;This is just a regular branch prepended with &#34;b4/&#34;:&#xA;$ git branch&#xA;b4/lanyang-dts-typo&#xA;  master&#xA;You can do all the normal operations with it, and the only special thing about it is that it has an &#34;empty commit&#34; at the start of the series containing the template of our cover letter.&#xA;&#xA;Editing the cover letter&#xA;&#xA;`b4 prep --edit-cover`&#xA;&#xA;If you plan to submit a single patch, then the cover letter is not that necessary and will only be used to track the destination addresses and changelog entries. You can delete most of the template content and leave just the title and sign-off. The tracking information json will always be appended to the end automatically -- you don&#39;t need to worry about it.&#xA;&#xA;Here&#39;s what the commit looks like after I edited it:&#xA;$ git cat-file -p HEAD&#xA;tree c7c1b7db9ced3eba518cfc1f711e9d89f73f8667&#xA;parent 830b3c68c1fb1e9176028d02ef86f3cf76aa2476&#xA;author Konstantin Ryabitsev icon@mricon.com 1671656701 -0500&#xA;committer Konstantin Ryabitsev icon@mricon.com 1671656701 -0500&#xA;&#xA;Simple typo fix for the lanyang dts&#xA;&#xA;Signed-off-by: Konstantin Ryabitsev icon@mricon.com&#xA;&#xA;--- b4-submit-tracking ---&#xA;This section is used internally by b4 prep for tracking purposes.&#xA;{&#xA;  &#34;series&#34;: {&#xA;    &#34;revision&#34;: 1,&#xA;    &#34;change-id&#34;: &#34;20221221-lanyang-dts-typo-8509e8ffccd4&#34;,&#xA;    &#34;base-branch&#34;: &#34;master&#34;,&#xA;    &#34;prefixes&#34;: []&#xA;  }&#xA;}&#xA;Committing your work&#xA;&#xA;You can add commits to this branch as you normally would with any other git work. 
I am going to fix two obvious typos in a single file and make a single commit:&#xA;$ git show HEAD&#xA;commit 820ce2d9bc7c88e1515642cf3fc4005a52e4c490 (HEAD -  b4/lanyang-dts-typo)&#xA;Author: Konstantin Ryabitsev icon@mricon.com&#xA;Date:   Wed Dec 21 16:17:21 2022 -0500&#xA;&#xA;    arm: lanyang: fix lable-  label typo for lanyang dts&#xA;&#xA;    Fix an obvious spelling error in the dts file for Lanyang BMC.&#xA;    This was reported via bugzilla a few years ago but never fixed.&#xA;&#xA;    Reported-by: Jens Schleusener Jens.Schleusener@fossies.org&#xA;    Link: https://bugzilla.kernel.org/showbug.cgi?id=205891&#xA;    Signed-off-by: Konstantin Ryabitsev icon@mricon.com&#xA;&#xA;diff --git a/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts b/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts&#xA;index c0847636f20b..e72e8ef5bff2 100644&#xA;--- a/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts&#xA;+++ b/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts&#xA;@@ -52,12 +52,12 @@ hddfault {&#xA;                        gpios = &amp;gpio ASPEEDGPIO(B, 3) GPIOACTIVEHIGH;&#xA;                };&#xA;                bmcerr {&#xA;lable = &#34;BMCfault&#34;;&#xA;label = &#34;BMCfault&#34;;&#xA;                        gpios = &amp;gpio ASPEEDGPIO(H, 6) GPIOACTIVEHIGH;&#xA;                };&#xA;&#xA;                syserr {&#xA;lable = &#34;Sysfault&#34;;&#xA;label = &#34;Sysfault&#34;;&#xA;                        gpios = &amp;gpio ASPEEDGPIO(H, 7) GPIOACTIVEHIGH;&#xA;                };&#xA;        };&#xA;Collecting To: and Cc: addresses&#xA;&#xA;`b4 prep --auto-to-cc`&#xA;&#xA;After you&#39;ve committed your work, you will want to collect the addresses of people who should be the ones reviewing it. 
Running `b4 prep --auto-to-cc will invoke scripts/getmaintainer.pl with the default recommended flags to find out who should go into the To: and Cc:` headers:&#xA;$ b4 prep --auto-to-cc&#xA;Will collect To: addresses using getmaintainer.pl&#xA;Will collect Cc: addresses using getmaintainer.pl&#xA;Collecting To/Cc addresses&#xA;    To: Rob Herring ...&#xA;    To: Krzysztof Kozlowski ...&#xA;    To: Joel Stanley ...&#xA;    To: Andrew Jeffery ...&#xA;    Cc: devicetree@vger.kernel.org&#xA;    Cc: linux-arm-kernel@lists.infradead.org&#xA;    Cc: linux-aspeed@lists.ozlabs.org&#xA;    Cc: linux-kernel@vger.kernel.org&#xA;    Cc: Jens Schleusener ...&#xA;---&#xA;You can trim/expand this list with: b4 prep --edit-cover&#xA;Invoking git-filter-repo to update the cover letter.&#xA;New history written in 0.06 seconds...&#xA;Completely finished after 0.33 seconds.&#xA;These addresses will be added to the cover letter and you can edit them to add/remove destinations using the usual `b4 prep --edit-cover` command.&#xA;&#xA;Creating your patatt keypair for web endpoint submission&#xA;&#xA;(This needs to be done only once.)&#xA;&#xA;`patatt genkey`&#xA;&#xA;Note: if you already have a PGP key and it&#39;s set as `user.signingKey`, then you can skip this section entirely.&#xA;&#xA;Before we submit the patch, let&#39;s set up the keypair to sign our contributions. This is not strictly necessary if you are going to be using your own SMTP server to submit the patches, but it&#39;s a required step if you will use the kernel.org patch submission endpoint (which is what b4 will use in the absence of any `[sendemail]` sections in your git config).&#xA;&#xA;The process is very simple. 
Run `patatt genkey and add the resulting [patatt] section to your ~/.gitconfig` as instructed by the output.&#xA;&#xA;NOTE: You will want to back up the contents of your `~/.local/share/patatt` so you don&#39;t lose access to your private key.&#xA;&#xA;Dry-run and checkpatch&#xA;&#xA;`b4 send -o /tmp/tosend`&#xA;`./scripts/checkpatch.pl /tmp/tosend/`&#xA;&#xA;Next, generate the patches and look at their contents to make sure that everything is looking sane. Good things to check are:&#xA;&#xA;the From: address&#xA;the To: and Cc: addresses&#xA;general patch formatting&#xA;cover letter formatting (if more than 1 patch in the series)&#xA;&#xA;If everything looks sane, one more recommended step is to run `checkpatch.pl` from the top of the kernel tree:&#xA;$ ./scripts/checkpatch.pl /tmp/tosend/&#xA;total: 0 errors, 0 warnings, 14 lines checked&#xA;&#xA;/tmp/tosend/0001-arm-lanyang-fix-lable-label-typo-for-lanyang-dts.eml has no obvious style problems and is ready for submission.&#xA;Register your key with the web submission endpoint&#xA;&#xA;(This needs to be done only once, unless you change your keys.)&#xA;&#xA;`b4 send --web-auth-new`&#xA;`b4 send --web-auth-verify [challenge]`&#xA;&#xA;If you&#39;re not going to use your own SMTP server to send the patch, you should register your new keypair with the endpoint:&#xA;$ b4 send --web-auth-new&#xA;Will submit a new email authorization request to:&#xA;  Endpoint: https://lkml.kernel.org/b4submit&#xA;      Name: Konstantin Ryabitsev&#xA;  Identity: icon@mricon.com&#xA;  Selector: 20221221&#xA;    Pubkey: ed25519:24L8+ejW6PwbTbrJ/uT8HmSM8XkvGGtjTZ6NftSSI6I=&#xA;---&#xA;Press Enter to confirm or Ctrl-C to abort&#xA;Submitting new auth request to https://lkml.kernel.org/b4submit&#xA;---&#xA;Challenge generated and sent to icon@mricon.com&#xA;Once you receive it, run b4 send --web-auth-verify [challenge-string]&#xA;The challenge is a UUID4 string and this step is a simple verification that you are able to receive email at the 
address you want associated with this key. Once you receive the challenge, complete the process as described:&#xA;$ b4 send --web-auth-verify 897851db-9b84-4117-9d82-1d970f9df5f8&#xA;Signing challenge&#xA;Submitting verification to https://lkml.kernel.org/b4_submit&#xA;---&#xA;Challenge successfully verified for icon@mricon.com&#xA;You may now use this endpoint for submitting patches.&#xA;OR, set up your [sendemail] section&#xA;&#xA;You don&#39;t have to use the web endpoint -- it exists primarily for people who are not able or not willing to set up their SMTP information with git. Setting up a SMTP gateway is not a straightforward process for many:&#xA;&#xA;platforms using OAuth require setting up &#34;application-specific passwords&#34;&#xA;some companies only provide Exchange or browser-based access to email and don&#39;t offer any other way to send mail&#xA;some company SMTP gateways rewrite messages to add lengthy disclaimers or rewrite links to quarantine them&#xA;&#xA;However, if you have access to a functional SMTP gateway, then you are encouraged to use it instead of submitting via the web endpoint, as this ensures that the development process remains distributed and not dependent on any central services. Just follow instructions in `man git-send-email and add a valid [sendemail]` section to your git config. If b4 finds it, it will use it instead of relying on the web endpoint.&#xA;[sendemail]&#xA;    smtpEncryption = tls&#xA;    smtpServer = smtp.gmail.com&#xA;    smtpServerPort = 465&#xA;    smtpEncryption = ssl&#xA;    smtpUser = yourname@gmail.com&#xA;    smtpPass = your-gmail-app-password&#xA;Reflect the email to yourself&#xA;&#xA;`b4 send --reflect`&#xA;&#xA;This is the last step to use before sending off your contribution. Note, that it will fill out the `To: and Cc: headers of all messages with actual recipients, but it will NOT actually send mail to them, just to yourself. 
Mail servers don&#39;t actually pay any attention to those headers -- the only thing that matters to them is what was specified in the RCPT TO` outer envelope of the negotiation.&#xA;&#xA;This step is particularly useful if you&#39;re going to send your patches via the web endpoint. Unless your email address is from one of the following domains, the `From:` header will be rewritten in order to not violate DMARC policies:&#xA;&#xA;@kernel.org&#xA;@linuxfoundation.org&#xA;@linux.dev&#xA;&#xA;If your email domain doesn&#39;t match the above, the `From: header will be rewritten to be a kernel.org dummy address. Your actual From: will be added to the body of the message where git expects to find it, and the Reply-To:` header will be set so anyone replying to your message will be sending it to the right place.&#xA;&#xA;Send it off!&#xA;&#xA;`b4 send`&#xA;&#xA;If all your tests are looking good, then you are ready to send your work. Fire off &#34;`b4 send&#34;, review the &#34;Ready to:&#34; section for one final check and either Ctrl-C to get out of it, or hit Enter` to submit your work upstream.&#xA;&#xA;Coming up next&#xA;&#xA;In the next post, I will go over:&#xA;&#xA;making changes to your patches using: `git rebase -i`&#xA;retrieving and applying follow-up trailers using: `b4 trailers -u`&#xA;comparing v2 and v1 to see what changes you made using: `b4 prep --compare-to v1`&#xA;adding changelog entries using: `b4 prep --edit-cover`&#xA;&#xA;Documentation&#xA;&#xA;All contributor-oriented features of b4 are documented on the following site:&#xA;&#xA;https://b4.docs.kernel.org/en/stable-0.11.y/contributor/overview.html]]&gt;</description>
<content:encoded><![CDATA[<p>While b4 started out as a way for maintainers to retrieve patches from mailing lists, it also has contributor-oriented features. Starting with version 0.10, b4 can:</p>
<ul><li>create and manage patch series and cover letters</li>
<li>track and auto-reroll series revisions</li>
<li>display range-diffs between revisions</li>
<li>apply trailers received from reviewers and maintainers</li>
<li>submit patches without needing a valid SMTP gateway</li></ul>

<p>These features are still considered experimental, but they should be stable for most work and I&#39;d be happy to receive further feedback from occasional contributors.</p>

<p>In this article, we&#39;ll go through the process of submitting an actual typo fix patch to the upstream kernel. This bug was identified a few years ago and submitted via bugzilla, but never fixed:</p>
<ul><li><a href="https://bugzilla.kernel.org/show_bug.cgi?id=205891" rel="nofollow">https://bugzilla.kernel.org/show_bug.cgi?id=205891</a></li></ul>

<h2 id="accompanying-video">Accompanying video</h2>

<p>This article has an accompanying video where I go through all the steps and submit the actual patch at the end:</p>
<ul><li><a href="https://www.youtube.com/watch?v=QBR06ml2YLQ" rel="nofollow">https://www.youtube.com/watch?v=QBR06ml2YLQ</a></li></ul>

<h2 id="installing-the-latest-b4-version">Installing the latest b4 version</h2>

<p>Start by installing b4. The easiest way is via pip, which will grab the latest stable version:</p>

<pre><code>$ pip install --user b4
[...]
$ b4 --version
0.11.1
</code></pre>

<p>If you get an error or an older version of b4, please check that your <code>$PATH</code> contains <code>$HOME/.local/bin</code> where pip installs the binaries.</p>
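<p>A generic shell sketch (not specific to b4) for making sure that directory is picked up; the paths here assume pip&#39;s default user-install layout:</p>

```shell
# Ensure pip's user-level bin directory is on $PATH; prepend it if missing.
# Add the export line to ~/.bashrc or ~/.profile to make it permanent.
case ":$PATH:" in
    *":$HOME/.local/bin:"*) ;;                    # already present, nothing to do
    *) export PATH="$HOME/.local/bin:$PATH" ;;    # prepend for this session
esac
```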

<h2 id="preparing-the-tree">Preparing the tree</h2>
<ul><li><code>b4 prep -n [name-of-branch] -f [nearest-tag]</code></li></ul>

<p>Next, prepare a topical branch where you will be doing your work. We&#39;ll be fixing a typo in <code>arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts</code>, and we&#39;ll base this work on tag <code>v6.1</code>:</p>

<pre><code>$ b4 prep -n lanyang-dts-typo -f v6.1
Created new branch b4/lanyang-dts-typo
Created the default cover letter, you can edit with --edit-cover.
</code></pre>

<p>This is just a regular branch prepended with “b4/”:</p>

<pre><code>$ git branch
* b4/lanyang-dts-typo
  master
</code></pre>

<p>You can do all the normal operations with it, and the only special thing about it is that it has an “empty commit” at the start of the series containing the template of our cover letter.</p>

<h2 id="editing-the-cover-letter">Editing the cover letter</h2>
<ul><li><code>b4 prep --edit-cover</code></li></ul>

<p>If you plan to submit a single patch, then the cover letter is not that necessary and will only be used to track the destination addresses and changelog entries. You can delete most of the template content and leave just the title and sign-off. The tracking information json will always be appended to the end automatically — you don&#39;t need to worry about it.</p>

<p>Here&#39;s what the commit looks like after I edited it:</p>

<pre><code>$ git cat-file -p HEAD
tree c7c1b7db9ced3eba518cfc1f711e9d89f73f8667
parent 830b3c68c1fb1e9176028d02ef86f3cf76aa2476
author Konstantin Ryabitsev &lt;icon@mricon.com&gt; 1671656701 -0500
committer Konstantin Ryabitsev &lt;icon@mricon.com&gt; 1671656701 -0500

Simple typo fix for the lanyang dts

Signed-off-by: Konstantin Ryabitsev &lt;icon@mricon.com&gt;

--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{
  &#34;series&#34;: {
    &#34;revision&#34;: 1,
    &#34;change-id&#34;: &#34;20221221-lanyang-dts-typo-8509e8ffccd4&#34;,
    &#34;base-branch&#34;: &#34;master&#34;,
    &#34;prefixes&#34;: []
  }
}
</code></pre>

<h2 id="committing-your-work">Committing your work</h2>

<p>You can add commits to this branch as you normally would with any other git work. I am going to fix two obvious typos in a single file and make a single commit:</p>

<pre><code>$ git show HEAD
commit 820ce2d9bc7c88e1515642cf3fc4005a52e4c490 (HEAD -&gt; b4/lanyang-dts-typo)
Author: Konstantin Ryabitsev &lt;icon@mricon.com&gt;
Date:   Wed Dec 21 16:17:21 2022 -0500

    arm: lanyang: fix lable-&gt;label typo for lanyang dts

    Fix an obvious spelling error in the dts file for Lanyang BMC.
    This was reported via bugzilla a few years ago but never fixed.

    Reported-by: Jens Schleusener &lt;Jens.Schleusener@fossies.org&gt;
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=205891
    Signed-off-by: Konstantin Ryabitsev &lt;icon@mricon.com&gt;

diff --git a/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts b/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts
index c0847636f20b..e72e8ef5bff2 100644
--- a/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts
+++ b/arch/arm/boot/dts/aspeed-bmc-opp-lanyang.dts
@@ -52,12 +52,12 @@ hdd_fault {
                        gpios = &lt;&amp;gpio ASPEED_GPIO(B, 3) GPIO_ACTIVE_HIGH&gt;;
                };
                bmc_err {
-                       lable = &#34;BMC_fault&#34;;
+                       label = &#34;BMC_fault&#34;;
                        gpios = &lt;&amp;gpio ASPEED_GPIO(H, 6) GPIO_ACTIVE_HIGH&gt;;
                };

                sys_err {
-                       lable = &#34;Sys_fault&#34;;
+                       label = &#34;Sys_fault&#34;;
                        gpios = &lt;&amp;gpio ASPEED_GPIO(H, 7) GPIO_ACTIVE_HIGH&gt;;
                };
        };
</code></pre>

<h2 id="collecting-to-and-cc-addresses">Collecting To: and Cc: addresses</h2>
<ul><li><code>b4 prep --auto-to-cc</code></li></ul>

<p>After you&#39;ve committed your work, you will want to collect the addresses of people who should be the ones reviewing it. Running <code>b4 prep --auto-to-cc</code> will invoke <code>scripts/get_maintainer.pl</code> with the default recommended flags to find out who should go into the <code>To:</code> and <code>Cc:</code> headers:</p>

<pre><code>$ b4 prep --auto-to-cc
Will collect To: addresses using get_maintainer.pl
Will collect Cc: addresses using get_maintainer.pl
Collecting To/Cc addresses
    + To: Rob Herring &lt;...&gt;
    + To: Krzysztof Kozlowski &lt;...&gt;
    + To: Joel Stanley &lt;...&gt;
    + To: Andrew Jeffery &lt;...&gt;
    + Cc: devicetree@vger.kernel.org
    + Cc: linux-arm-kernel@lists.infradead.org
    + Cc: linux-aspeed@lists.ozlabs.org
    + Cc: linux-kernel@vger.kernel.org
    + Cc: Jens Schleusener &lt;...&gt;
---
You can trim/expand this list with: b4 prep --edit-cover
Invoking git-filter-repo to update the cover letter.
New history written in 0.06 seconds...
Completely finished after 0.33 seconds.
</code></pre>

<p>These addresses will be added to the cover letter and you can edit them to add/remove destinations using the usual <code>b4 prep --edit-cover</code> command.</p>

<h2 id="creating-your-patatt-keypair-for-web-endpoint-submission">Creating your patatt keypair for web endpoint submission</h2>

<p>(This needs to be done only once.)</p>
<ul><li><code>patatt genkey</code></li></ul>

<p>Note: if you already have a PGP key and it&#39;s set as <code>user.signingKey</code>, then you can skip this section entirely.</p>

<p>Before we submit the patch, let&#39;s set up the keypair to sign our contributions. This is not strictly necessary if you are going to be using your own SMTP server to submit the patches, but it&#39;s a required step if you will use the kernel.org patch submission endpoint (which is what b4 will use in the absence of any <code>[sendemail]</code> sections in your git config).</p>

<p>The process is very simple. Run <code>patatt genkey</code> and add the resulting <code>[patatt]</code> section to your <code>~/.gitconfig</code> as instructed by the output.</p>

<p>NOTE: You will want to back up the contents of your <code>~/.local/share/patatt</code> so you don&#39;t lose access to your private key.</p>
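<p>A minimal sketch of such a backup, assuming the default patatt key location mentioned above; adjust the paths if your keys live elsewhere:</p>

```shell
# Archive the patatt key directory so the private key can be restored later.
keydir="$HOME/.local/share/patatt"
mkdir -p "$keydir"   # no-op if patatt genkey has already created it
tar -czf "$HOME/patatt-keys-backup.tar.gz" -C "$HOME/.local/share" patatt
```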

<h2 id="dry-run-and-checkpatch">Dry-run and checkpatch</h2>
<ul><li><code>b4 send -o /tmp/tosend</code></li>
<li><code>./scripts/checkpatch.pl /tmp/tosend/*</code></li></ul>

<p>Next, generate the patches and look at their contents to make sure that everything is looking sane. Good things to check are:</p>
<ul><li>the From: address</li>
<li>the To: and Cc: addresses</li>
<li>general patch formatting</li>
<li>cover letter formatting (if more than 1 patch in the series)</li></ul>

<p>If everything looks sane, one more recommended step is to run <code>checkpatch.pl</code> from the top of the kernel tree:</p>

<pre><code>$ ./scripts/checkpatch.pl /tmp/tosend/*
total: 0 errors, 0 warnings, 14 lines checked

/tmp/tosend/0001-arm-lanyang-fix-lable-label-typo-for-lanyang-dts.eml has no obvious style problems and is ready for submission.
</code></pre>

<h2 id="register-your-key-with-the-web-submission-endpoint">Register your key with the web submission endpoint</h2>

<p>(This needs to be done only once, unless you change your keys.)</p>
<ul><li><code>b4 send --web-auth-new</code></li>
<li><code>b4 send --web-auth-verify [challenge]</code></li></ul>

<p>If you&#39;re not going to use your own SMTP server to send the patch, you should register your new keypair with the endpoint:</p>

<pre><code>$ b4 send --web-auth-new
Will submit a new email authorization request to:
  Endpoint: https://lkml.kernel.org/_b4_submit
      Name: Konstantin Ryabitsev
  Identity: icon@mricon.com
  Selector: 20221221
    Pubkey: ed25519:24L8+ejW6PwbTbrJ/uT8HmSM8XkvGGtjTZ6NftSSI6I=
---
Press Enter to confirm or Ctrl-C to abort
Submitting new auth request to https://lkml.kernel.org/_b4_submit
---
Challenge generated and sent to icon@mricon.com
Once you receive it, run b4 send --web-auth-verify [challenge-string]
</code></pre>

<p>The challenge is a UUID4 string and this step is a simple verification that you are able to receive email at the address you want associated with this key. Once you receive the challenge, complete the process as described:</p>

<pre><code>$ b4 send --web-auth-verify 897851db-9b84-4117-9d82-1d970f9df5f8
Signing challenge
Submitting verification to https://lkml.kernel.org/_b4_submit
---
Challenge successfully verified for icon@mricon.com
You may now use this endpoint for submitting patches.
</code></pre>

<h2 id="or-set-up-your-sendemail-section">OR, set up your [sendemail] section</h2>

<p>You don&#39;t <em>have</em> to use the web endpoint — it exists primarily for people who are not able or not willing to set up their SMTP information with git. Setting up an SMTP gateway is not a straightforward process for many:</p>
<ul><li>platforms using OAuth require setting up “application-specific passwords”</li>
<li>some companies only provide Exchange or browser-based access to email and don&#39;t offer any other way to send mail</li>
<li>some company SMTP gateways rewrite messages to add lengthy disclaimers or rewrite links to quarantine them</li></ul>

<p>However, if you have access to a functional SMTP gateway, then you are encouraged to use it instead of submitting via the web endpoint, as this ensures that the development process remains distributed and not dependent on any central services. Just follow the instructions in <code>man git-send-email</code> and add a valid <code>[sendemail]</code> section to your git config. If b4 finds it, it will use it instead of relying on the web endpoint.</p>

<pre><code>[sendemail]
    smtpServer = smtp.gmail.com
    smtpServerPort = 465
    smtpEncryption = ssl
    smtpUser = yourname@gmail.com
    smtpPass = your-gmail-app-password
</code></pre>

<h2 id="reflect-the-email-to-yourself">Reflect the email to yourself</h2>
<ul><li><code>b4 send --reflect</code></li></ul>

<p>This is the last step to use before sending off your contribution. Note that it will fill out the <code>To:</code> and <code>Cc:</code> headers of all messages with actual recipients, but it will <strong>NOT actually send mail to them, just to yourself</strong>. Mail servers don&#39;t actually pay any attention to those headers — the only thing that matters to them is what was specified in the <code>RCPT TO</code> outer envelope of the negotiation.</p>
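<p>As a generic illustration (invented addresses, not b4-specific), the client side of an SMTP exchange for a reflected message might look like this; the server delivers to the <code>RCPT TO</code> address and ignores the <code>To:</code> header for routing:</p>

```
MAIL FROM:<me@example.org>
RCPT TO:<me@example.org>            <- actual delivery target (yourself)
DATA
To: maintainer@lists.example.org    <- header only; not used for routing
Subject: [PATCH] ...
.
```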

<p>This step is particularly useful if you&#39;re going to send your patches via the web endpoint. Unless your email address is from one of the following domains, the <code>From:</code> header will be rewritten in order to not violate DMARC policies:</p>
<ul><li>@kernel.org</li>
<li>@linuxfoundation.org</li>
<li>@linux.dev</li></ul>

<p>If your email domain doesn&#39;t match the above, the <code>From:</code> header will be rewritten to be a kernel.org dummy address. Your actual <code>From:</code> will be added to the body of the message where git expects to find it, and the <code>Reply-To:</code> header will be set so anyone replying to your message will be sending it to the right place.</p>

<h2 id="send-it-off">Send it off!</h2>
<ul><li><code>b4 send</code></li></ul>

<p>If all your tests are looking good, then you are ready to send your work. Fire off “<code>b4 send</code>”, review the “<code>Ready to:</code>” section for one final check, and either press <code>Ctrl-C</code> to back out or hit <code>Enter</code> to submit your work upstream.</p>

<h2 id="coming-up-next">Coming up next</h2>

<p>In the next post, I will go over:</p>
<ul><li>making changes to your patches using: <code>git rebase -i</code></li>
<li>retrieving and applying follow-up trailers using: <code>b4 trailers -u</code></li>
<li>comparing v2 and v1 to see what changes you made using: <code>b4 prep --compare-to v1</code></li>
<li>adding changelog entries using: <code>b4 prep --edit-cover</code></li></ul>

<h2 id="documentation">Documentation</h2>

<p>All contributor-oriented features of b4 are documented on the following site:</p>
<ul><li><a href="https://b4.docs.kernel.org/en/stable-0.11.y/contributor/overview.html" rel="nofollow">https://b4.docs.kernel.org/en/stable-0.11.y/contributor/overview.html</a></li></ul>
]]></content:encoded>
      <author>Konstantin Ryabitsev</author>
      <guid>https://people.kernel.org/read/a/ao5abk14my</guid>
      <pubDate>Wed, 21 Dec 2022 22:13:04 +0000</pubDate>
    </item>
    <item>
      <title>On workings of hrtimer&#39;s slack time functionality</title>
      <link>https://people.kernel.org/joelfernandes/on-workings-of-hrtimers-slack-time-functionality</link>
      <description>&lt;![CDATA[Below are some notes I wrote while studying hrtimer slack behavior (range timers), which was added to reduce wakeups and save power, in the commit below. The idea is that:&#xA;Normal hrtimers will have both a soft and hard expiry which are equal to each other.&#xA;But hrtimers with timer slack will have a soft expiry and a hard expiry which is the soft expiry + delta.&#xA;&#xA;The slack/delay effect is achieved by splitting the execution of the timer function, and the programming of the next timer event into 2 separate steps. That is, we execute the timer function as soon as we notice that its soft expiry has passed (hrtimerrunqueues()). However, for programming the next timer interrupt, we only look at the hard expiry (hrtimerupdatenextevent() -  hrtimergetnextevent() -  _hrtimernexteventbase()-  hrtimergetexpires()). As a result, the only way a slack-based timer will execute before its slack time elapses, is, if another timer without any slack time gets queued such that it hard-expires before the slack time of the slack-based timer passes.&#xA;&#xA;The commit containing the original code added for range timers is:&#xA;commit 654c8e0b1c623b156c5b92f28d914ab38c9c2c90&#xA;Author: Arjan van de Ven arjan@linux.intel.com&#xA;Date:   Mon Sep 1 15:47:08 2008 -0700&#xA;&#xA;    hrtimer: turn hrtimers into range timers&#xA;   &#xA;    this patch turns hrtimers into range timers;&#xA;    they have 2 expire points&#xA;    1) the soft expire point&#xA;    2) the hard expire point&#xA;   &#xA;    the kernel will do it&#39;s regular best effort attempt to get the timer run at the hard expire point. However, if some other time fires after the soft expire point, the kernel now has the freedom to fire this timer at this point, and thus grouping the events and preventing a power-expensive wakeup in the future.&#xA;The original code seems a bit buggy. 
I got a bit confused about how/where we handle the case in hrtimerinterrupt() where other normal timers that expire before the slack time elapses, have their next timer interrupt programmed correctly such that the interrupt goes off before the slack time passes.&#xA;&#xA;To see the issue, consider the case where we have 2 timers queued:&#xA;&#xA;The first one soft expires at t = 10, and say it has a slack of 50, so it hard expires at t = 60.&#xA;&#xA;The second one is a normal timer, so the soft/hard expiry of it is both at t = 30.&#xA;&#xA;Now say, an hrtimer interrupt happens at t=5 courtesy of an unrelated expiring timer. In the below code, we notice that the next expiring timer is (the one with slack one), which has not soft-expired yet. So we have no reason to run it. However, we reprogram the next timer interrupt to be t=60 which is its hard expiry time (this is stored in expiresnext to use as the value to program the next timer interrupt with).  Now we have a big problem, because the timer expiring at t=30 will not run in time and run much later.&#xA;&#xA;As shown below, the loop in hrtimerinterrupt() goes through all the active timers in the timerqueue, softexpires is made to be the real expiry, and the old expires now becomes softexpires + slack.&#xA;       while((node = timerqueuegetnext(&amp;base-  active))) {&#xA;              struct hrtimer timer;&#xA;&#xA;              timer = containerof(node, struct hrtimer, node);&#xA;&#xA;              /&#xA;                The immediate goal for using the softexpires is&#xA;                minimizing wakeups, not running timers at the&#xA;                earliest interrupt after their soft expiration.&#xA;                This allows us to avoid using a Priority Search&#xA;                Tree, which can answer a stabbing querry for&#xA;                overlapping intervals and instead use the simple&#xA;                BST we already have.&#xA;                We don&#39;t add extra wakeups by delaying timers 
that&#xA;                are right-of a not yet expired timer, because that&#xA;                timer will have to trigger a wakeup anyway.&#xA;               /&#xA;&#xA;              if (basenow.tv64 &lt; hrtimergetsoftexpirestv64(timer)) {&#xA;                      ktimet expires;&#xA;&#xA;                      expires = ktimesub(hrtimergetexpires(timer),&#xA;                                          base-  offset);&#xA;                      if (expires.tv64 &lt; expiresnext.tv64)&#xA;                              expiresnext = expires;&#xA;                      break;&#xA;              }&#xA;&#xA;              runhrtimer(timer, &amp;basenow);&#xA;      }&#xA;However, this seems to be an old kernel issue, as, in upstream v6.0, I believe the next hrtimer interrupt will be programmed correctly because _hrtimernexteventbase() calls hrtimergetexpires() which correctly use the &#34;hard expiry&#34; times to do the programming.&#xA;&#xA;As of v6.2, the _hrtimerrunqueues() function looks like this:&#xA;static void hrtimerrunqueues(struct hrtimercpubase cpubase, ktimet now,&#xA;&#x9;&#x9;&#x9;&#x9; unsigned long flags, unsigned int activemask)&#xA;{&#xA;&#x9;struct hrtimerclockbase base;&#xA;&#x9;unsigned int active = cpubase-  activebases &amp; activemask;&#xA;&#xA;&#x9;foreachactivebase(base, cpubase, active) {&#xA;&#x9;&#x9;struct timerqueuenode node;&#xA;&#x9;&#x9;ktimet basenow;&#xA;&#xA;&#x9;&#x9;basenow = ktimeadd(now, base-  offset);&#xA;&#xA;&#x9;&#x9;while ((node = timerqueuegetnext(&amp;base-  active))) {&#xA;&#x9;&#x9;&#x9;struct hrtimer timer;&#xA;&#xA;&#x9;&#x9;&#x9;timer = containerof(node, struct hrtimer, node);&#xA;&#xA;&#x9;&#x9;&#x9;/&#xA;&#x9;&#x9;&#x9; The immediate goal for using the softexpires is&#xA;&#x9;&#x9;&#x9; minimizing wakeups, not running timers at the&#xA;&#x9;&#x9;&#x9; earliest interrupt after their soft expiration.&#xA;&#x9;&#x9;&#x9; This allows us to avoid using a Priority Search&#xA;&#x9;&#x9;&#x9; Tree, which can answer a stabbing 
query for&#xA;&#x9;&#x9;&#x9; overlapping intervals and instead use the simple&#xA;&#x9;&#x9;&#x9; BST we already have.&#xA;&#x9;&#x9;&#x9; We don&#39;t add extra wakeups by delaying timers that&#xA;&#x9;&#x9;&#x9; are right-of a not yet expired timer, because that&#xA;&#x9;&#x9;&#x9; timer will have to trigger a wakeup anyway.&#xA;&#x9;&#x9;&#x9; */&#xA;&#x9;&#x9;&#x9;if (basenow &lt; hrtimergetsoftexpirestv64(timer))&#xA;&#x9;&#x9;&#x9;&#x9;break;&#xA;&#xA;&#x9;&#x9;&#x9;runhrtimer(cpubase, base, timer, &amp;basenow, flags);&#xA;&#x9;&#x9;&#x9;if (activemask == HRTIMERACTIVESOFT)&#xA;&#x9;&#x9;&#x9;&#x9;hrtimersyncwaitrunning(cpubase, flags);&#xA;&#x9;&#x9;}&#xA;&#x9;}&#xA;}&#xA;&#xA;The utilization of hrtimergetsoftexpirestv64() might be perplexing, as it may raise the question of how this loop expires non-slack timers that possess only a hard expiry time. To clarify, it&#39;s important to note that what was once referred to as expiry is now considered soft expiry for non-slack timers. Consequently, the condition basenow &lt; hrtimergetsoftexpirestv64(timer) is capable of expiring both slack and non-slack timers effectively.&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p>Below are some notes I wrote while studying hrtimer slack behavior (range timers), which was added to reduce wakeups and save power, in the commit below. The idea is that:
1. Normal hrtimers will have both a soft and hard expiry which are equal to each other.
2. But hrtimers with timer slack will have a soft expiry and a hard expiry which is the soft expiry + delta.</p>

<p>The slack/delay effect is achieved by splitting the execution of the timer function and the programming of the next timer event into 2 separate steps. That is, we execute the timer function as soon as we notice that its soft expiry has passed (<code>hrtimer_run_queues()</code>). However, for programming the next timer interrupt, we only look at the hard expiry (<code>hrtimer_update_next_event()</code> -&gt; <code>__hrtimer_get_next_event()</code> -&gt; <code>__hrtimer_next_event_base()</code> -&gt; <code>hrtimer_get_expires()</code>). As a result, the only way a slack-based timer will execute before its slack time elapses is if another timer without any slack gets queued such that it hard-expires before the slack time of the slack-based timer passes.</p>
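<p>The two expiry points can be modeled in plain userspace C (the <code>toy_</code> names are invented for this sketch; it only mirrors the idea behind <code>hrtimer_set_expires_range_ns()</code>, it is not the kernel code):</p>

<pre><code>#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

typedef int64_t ktime_t;

/* Toy model of a range timer. */
struct toy_timer {
	ktime_t soft;	/* earliest time the callback may run */
	ktime_t hard;	/* time the next interrupt is programmed for */
};

/* The requested expiry becomes the soft point and hard = soft + slack.
 * A normal timer is just the slack == 0 case, so soft == hard. */
static void toy_set_expires_range(struct toy_timer *t, ktime_t expires,
				  ktime_t slack)
{
	t-&gt;soft = expires;
	t-&gt;hard = expires + slack;
}

/* Eligibility to run is judged against the soft point... */
static int toy_may_run(const struct toy_timer *t, ktime_t now)
{
	return now &gt;= t-&gt;soft;
}

int main(void)
{
	struct toy_timer slacked, normal;

	toy_set_expires_range(&amp;slacked, 10, 50);	/* soft 10, hard 60 */
	toy_set_expires_range(&amp;normal, 30, 0);		/* soft == hard == 30 */

	/* ...while the next interrupt is programmed from the hard point. */
	assert(slacked.hard == 60 &amp;&amp; normal.hard == 30);

	/* At t = 35 (say, in the interrupt the normal timer caused at
	 * t = 30) the slacked timer is already eligible to run, even
	 * though its own interrupt would only fire at t = 60. */
	assert(toy_may_run(&amp;slacked, 35));
	printf("ok\n");
	return 0;
}
</code></pre>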

<p>The commit containing the original code added for range timers is:</p>

<pre><code>commit 654c8e0b1c623b156c5b92f28d914ab38c9c2c90
Author: Arjan van de Ven &lt;arjan@linux.intel.com&gt;
Date:   Mon Sep 1 15:47:08 2008 -0700

    hrtimer: turn hrtimers into range timers
   
    this patch turns hrtimers into range timers;
    they have 2 expire points
    1) the soft expire point
    2) the hard expire point
   
    the kernel will do it&#39;s regular best effort attempt to get the timer run at the hard expire point. However, if some other time fires after the soft expire point, the kernel now has the freedom to fire this timer at this point, and thus grouping the events and preventing a power-expensive wakeup in the future.
</code></pre>

<p>The original code seems a bit buggy. I got a bit confused about how/where <code>hrtimer_interrupt()</code> handles the case where other normal timers that expire before the slack time elapses have their timer interrupt programmed correctly, such that the interrupt goes off before the slack time passes.</p>

<p>To see the issue, consider the case where we have 2 timers queued:</p>
<ol><li><p>The first one soft expires at t = 10, and say it has a slack of 50, so it hard expires at t = 60.</p></li>

<li><p>The second one is a normal timer, so the soft/hard expiry of it is both at t = 30.</p></li></ol>

<p>Now say an hrtimer interrupt happens at t=5, courtesy of an unrelated expiring timer. In the below code, we notice that the next expiring timer (the one with slack) has not soft-expired yet, so we have no reason to run it. However, we reprogram the next timer interrupt to be t=60, which is its hard expiry time (this is stored in <code>expires_next</code> to use as the value to program the next timer interrupt with). Now we have a big problem: the timer expiring at t=30 will not run in time and will run much later.</p>
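<p>The failure mode can be made concrete with a small standalone C model (with invented names, the queue sorted by soft expiry as the timerqueue is; this is only a sketch of the behavior described above, not the kernel code):</p>

<pre><code>#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

typedef int64_t ktime_t;
#define TOY_KTIME_MAX INT64_MAX

struct toy_timer {
	ktime_t soft;	/* soft expiry */
	ktime_t hard;	/* soft expiry + slack */
};

/* The old behavior described above: walk the queue (sorted by soft
 * expiry) and program the next interrupt from the hard expiry of the
 * first timer that has not yet soft-expired. */
static ktime_t next_event_old(const struct toy_timer *q, int n, ktime_t now)
{
	for (int i = 0; i &lt; n; i++)
		if (now &lt; q[i].soft)
			return q[i].hard;
	return TOY_KTIME_MAX;
}

/* What the fixed code effectively computes: the next interrupt is the
 * earliest hard expiry of any queued timer. */
static ktime_t next_event_fixed(const struct toy_timer *q, int n)
{
	ktime_t next = TOY_KTIME_MAX;

	for (int i = 0; i &lt; n; i++)
		if (q[i].hard &lt; next)
			next = q[i].hard;
	return next;
}

int main(void)
{
	/* The scenario from the text: a slacked timer (soft 10, hard 60)
	 * sorts before a normal timer (soft == hard == 30). */
	const struct toy_timer q[] = { { 10, 60 }, { 30, 30 } };

	/* Interrupt at t = 5: the old policy programs t = 60, so the
	 * normal timer fires 30 time units late... */
	assert(next_event_old(q, 2, 5) == 60);

	/* ...while taking the minimum hard expiry programs t = 30. */
	assert(next_event_fixed(q, 2) == 30);
	printf("ok\n");
	return 0;
}
</code></pre>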

<p>As shown below, the loop in <code>hrtimer_interrupt()</code> goes through all the active timers in the timerqueue, <code>_softexpires</code> is made to be the real expiry, and the old <code>_expires</code> now becomes <code>_softexpires + slack</code>.</p>

<pre><code>       while((node = timerqueue_getnext(&amp;base-&gt;active))) {
              struct hrtimer *timer;

              timer = container_of(node, struct hrtimer, node);

              /*
               * The immediate goal for using the softexpires is
               * minimizing wakeups, not running timers at the
               * earliest interrupt after their soft expiration.
               * This allows us to avoid using a Priority Search
               * Tree, which can answer a stabbing querry for
               * overlapping intervals and instead use the simple
               * BST we already have.
               * We don&#39;t add extra wakeups by delaying timers that
               * are right-of a not yet expired timer, because that
               * timer will have to trigger a wakeup anyway.
               */

              if (basenow.tv64 &lt; hrtimer_get_softexpires_tv64(timer)) {
                      ktime_t expires;

                      expires = ktime_sub(hrtimer_get_expires(timer),
                                          base-&gt;offset);
                      if (expires.tv64 &lt; expires_next.tv64)
                              expires_next = expires;
                      break;
              }

              __run_hrtimer(timer, &amp;basenow);
      }
</code></pre>

<p>However, this seems to be an old kernel issue; in upstream v6.0, I believe the next hrtimer interrupt will be programmed correctly, because <code>__hrtimer_next_event_base()</code> calls <code>hrtimer_get_expires()</code>, which correctly uses the “hard expiry” times to do the programming.</p>

<p>As of v6.2, the <code>__hrtimer_run_queues()</code> function looks like this:</p>

<pre><code>static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
				 unsigned long flags, unsigned int active_mask)
{
	struct hrtimer_clock_base *base;
	unsigned int active = cpu_base-&gt;active_bases &amp; active_mask;

	for_each_active_base(base, cpu_base, active) {
		struct timerqueue_node *node;
		ktime_t basenow;

		basenow = ktime_add(now, base-&gt;offset);

		while ((node = timerqueue_getnext(&amp;base-&gt;active))) {
			struct hrtimer *timer;

			timer = container_of(node, struct hrtimer, node);

			/*
			 * The immediate goal for using the softexpires is
			 * minimizing wakeups, not running timers at the
			 * earliest interrupt after their soft expiration.
			 * This allows us to avoid using a Priority Search
			 * Tree, which can answer a stabbing query for
			 * overlapping intervals and instead use the simple
			 * BST we already have.
			 * We don&#39;t add extra wakeups by delaying timers that
			 * are right-of a not yet expired timer, because that
			 * timer will have to trigger a wakeup anyway.
			 */
			if (basenow &lt; hrtimer_get_softexpires_tv64(timer))
				break;

			__run_hrtimer(cpu_base, base, timer, &amp;basenow, flags);
			if (active_mask == HRTIMER_ACTIVE_SOFT)
				hrtimer_sync_wait_running(cpu_base, flags);
		}
	}
}
</code></pre>

<p>The use of <code>hrtimer_get_softexpires_tv64()</code> might be perplexing: how does this loop expire non-slack timers, which possess only a hard expiry time? To clarify, what was once referred to as the expiry is now the soft expiry for non-slack timers, and their hard expiry equals it. Consequently, the condition <code>basenow &lt; hrtimer_get_softexpires_tv64(timer)</code> is capable of expiring both slack and non-slack timers effectively.</p>
]]></content:encoded>
      <author>joelfernandes</author>
      <guid>https://people.kernel.org/read/a/r3bewuy85o</guid>
      <pubDate>Sun, 13 Nov 2022 17:35:01 +0000</pubDate>
    </item>
    <item>
      <title>TLS 1.3 Rx improvements in Linux 5.20</title>
      <link>https://people.kernel.org/kuba/tls-1-3-rx-improvements-in-linux-5-20</link>
      <description>&lt;![CDATA[Kernel TLS implements the record encapsulation and cryptography of the TLS protocol.  There are four areas where implementing (a portion of) TLS in the kernel helps:&#xA;&#xA;enabling seamless acceleration (NIC or crypto accelerator offload)&#xA;enabling sendfile on encrypted connections&#xA;saving extra data copies (data can be encrypted as it is copied into the kernel)&#xA;enabling the use of TLS on kernel sockets (nbd, NFS etc.)&#xA;&#xA;Kernel TLS handles only data records turning them into a cleartext data stream, all the control records (TLS handshake etc.) get sent to the application via a side channel for user space (OpenSSL or such) to process.&#xA;The first implementation of kTLS was designed in the good old days of TLS 1.2. When TLS 1.3 came into the picture the interest in kTLS had slightly diminished and the implementation, although functional, was rather simple and did not retain all the benefits. This post covers developments in the Linux 5.20 implementation of TLS which claws back the performance lost moving to TLS 1.3.&#xA;One of the features we lost in TLS 1.3 was the ability to decrypt data as it was copied into the user buffer during read. TLS 1.3 hides the true type of the record. Recall that kTLS wants to punt control records to a different path than data records. TLS 1.3 always populates the TLS header with applicationdata as the record type and the real record type is appended at the end, before record padding. This means that the data has to be decrypted for the true record type to be known.&#xA;Problem 1 - CoW on big GRO segments is inefficient&#xA;kTLS was made to dutifully decrypt the TLS 1.3 records first before copying the data to user space. Modern CPUs are relatively good at copying data, so the copy is not a huge problem in itself. What’s more problematic is how the kTLS code went about performing the copy.&#xA;The data queued on TCP sockets is considered read-only by the kernel. 
The pages data sits in may have been zero-copy-sent and for example belong to a file. kTLS tried to decrypt “in place” because it didn’t know how to deal with separate input/output skbs. To decrypt “in place” it calls skbcowdata(). As the name suggests this function makes a copy of the memory underlying an skb, to make it safe for writing. This function, however, is intended to be run on MTU-sized skbs (individual IP packets), not skbs from the TCP receive queue. The skbs from the receive queue can be much larger than a single TLS record (16kB). As a result TLS would CoW a 64kB skb 4 times to extract the 4 records inside it. Even worse if we consider that the last record will likely straddle skbs so we need to CoW two 64kB skbs to decrypt it “in place”. The diagram below visualizes the problem and the solution.&#xA;SKB CoW&#xA;The possible solutions are quite obvious - either create a custom version of skbcow_data() or teach TLS to deal with different input and output skbs. I opted for the latter (due to further optimizations it enables). Now we use a fresh buffer for the decrypted data and there is no need to CoW the big skbs TCP produces. This fix alone results in ~25-45% performance improvement (depending on the exact CPU SKU and available memory bandwidth). A jump in performance from abysmal to comparable with the user space OpenSSL.&#xA;Problem 2 - direct decrypt&#xA;Removing pointless copies is all well and good, but as mentioned we also lost the ability to decrypt directly to the user space buffer. We still need to copy the data to user space after it has been decrypted (A in the diagram below, here showing just a single record not full skb).&#xA;SKB direct decrypt&#xA;We can’t regain the full efficiency of TLS 1.2 because we don’t know the record type upfront. 
In practice, however, most of the records are data/application records (records carrying the application data rather than TLS control traffic like handshake messages or keys), so we can optimize for that case. We can optimistically decrypt to the user buffer, hoping the record contains data, and then check if we were right. Since decrypt to a user space buffer does not destroy the original encrypted record if we turn out to be wrong we can decrypting again, this time to a kernel skb (which we can then direct to the control message queue). Obviously this sort of optimization would not be acceptable in the Internet wilderness, as attackers could force us to waste time decrypting all records twice.&#xA;The real record type in TLS 1.3 is at the tail of the data. We must either trust that the application will not overwrite the record type after we place it in its buffer (B in the diagram below), or assume there will be no padding and use a kernel address as the destination of that chunk of data (C). Since record padding is also rare - I chose option (C). It improves the single stream performance by around 10%.&#xA;Problem 3 - latency&#xA;Applications tests have also showed that kTLS performs much worse than user space TLS in terms of the p99 RPC response latency. This is due to the fact that kTLS holds the socket lock for very long periods of time, preventing TCP from processing incoming packets. Inserting periodic TCP processing points into the kTLS code fixes the problem. The following graph shows the relationship between the TCP processing frequency (on the x axis in kB of consumed data, 0 = inf), throughput of a single TLS flow (“data”) and TCP socket state.&#xA;TCP CWND SWND&#xA;The TCP-perceived RTT of the connection grows the longer TLS hogs the socket lock without letting TCP process the ingress backlog. 
TCP responds by growing the congestion window.&#xA;Delaying the TCP processing will prevent TCP from responding to network congestion effectively, therefore I decided to be conservative and use 128kB as the TCP processing threshold.&#xA;Processing the incoming packets has the additional benefit of TLS being able to consume the data as it comes in from the NIC. Previously TLS had access to the data already processed by TCP when the read operation began. Any packets coming in from the NIC while TLS was decrypting would be backlogged at TCP input. On the way to user space TLS would release the socket lock, allowing the TCP backlog processing to kick in. TCP processing would schedule a TLS worker. TLS worker would tell the application there is more data.&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p>Kernel TLS implements the record encapsulation and cryptography of the TLS protocol.  There are four areas where implementing (a portion of) TLS in the kernel helps:</p>
<ul><li>enabling seamless acceleration (NIC or crypto accelerator offload)</li>
<li>enabling sendfile on encrypted connections</li>
<li>saving extra data copies (data can be encrypted as it is copied into the kernel)</li>
<li>enabling the use of TLS on kernel sockets (nbd, NFS etc.)</li></ul>

<p>Kernel TLS handles only data records, turning them into a cleartext data stream; all the control records (TLS handshake etc.) get sent to the application via a side channel for user space (OpenSSL or such) to process.
The first implementation of kTLS was designed in the good old days of TLS 1.2. When TLS 1.3 came into the picture the interest in kTLS had slightly diminished and the implementation, although functional, was rather simple and did not retain all the benefits. This post covers developments in the Linux 5.20 implementation of TLS which claws back the performance lost moving to TLS 1.3.
One of the features we lost in TLS 1.3 was the ability to decrypt data as it was copied into the user buffer during read. TLS 1.3 hides the true type of the record. Recall that kTLS wants to punt control records to a different path than data records. TLS 1.3 always populates the TLS header with <code>application_data</code> as the record type and the real record type is appended at the end, before record padding. This means that the data has to be decrypted for the true record type to be known.</p>
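<p>As an illustration of that record layout (a standalone sketch following the TLS 1.3 inner-plaintext format from RFC 8446, not the kernel's actual parsing code): the decrypted payload is the content, then the real content-type byte, then optional zero padding, so finding the true type means scanning backwards over the padding:</p>

<pre><code>#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

/* TLS 1.3 inner plaintext: content || real-type || zero padding.
 * The outer header always claims application_data (23); the real
 * record type is the last non-zero byte of the decrypted payload. */
static int tls13_inner_type(const unsigned char *pt, size_t len,
			    size_t *content_len)
{
	while (len &gt; 0 &amp;&amp; pt[len - 1] == 0)
		len--;			/* strip the zero padding */
	if (len == 0)
		return -1;		/* no content type: malformed record */
	*content_len = len - 1;		/* bytes before the type octet */
	return pt[len - 1];
}

int main(void)
{
	/* "hi" as application_data (type 23), plus three padding bytes. */
	const unsigned char rec[] = { 'h', 'i', 23, 0, 0, 0 };
	size_t clen;

	assert(tls13_inner_type(rec, sizeof(rec), &amp;clen) == 23);
	assert(clen == 2);
	printf("ok\n");
	return 0;
}
</code></pre>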

<h2 id="problem-1-cow-on-big-gro-segments-is-inefficient">Problem 1 – CoW on big GRO segments is inefficient</h2>

<p>kTLS was made to dutifully decrypt the TLS 1.3 records first before copying the data to user space. Modern CPUs are relatively good at copying data, so the copy is not a huge problem in itself. What’s more problematic is how the kTLS code went about performing the copy.
The data queued on TCP sockets is considered read-only by the kernel. The pages the data sits in may have been zero-copy-sent and, for example, belong to a file. kTLS tried to decrypt “in place” because it didn’t know how to deal with separate input/output skbs. To decrypt “in place” it calls <code>skb_cow_data()</code>. As the name suggests, this function makes a copy of the memory underlying an skb, to make it safe for writing. This function, however, is intended to be run on MTU-sized skbs (individual IP packets), not skbs from the TCP receive queue. The skbs from the receive queue can be much larger than a single TLS record (16kB). As a result TLS would CoW a 64kB skb 4 times to extract the 4 records inside it. It gets even worse when we consider that the last record will likely straddle skbs, so we need to CoW two 64kB skbs to decrypt it “in place”. The diagram below visualizes the problem and the solution.
<img src="https://patchwork.hopto.org/static/nipa/tls1.png" alt="SKB CoW">
The possible solutions are quite obvious – either create a custom version of <code>skb_cow_data()</code> or teach TLS to deal with different input and output skbs. I opted for the latter (due to further optimizations it enables). Now we use a fresh buffer for the decrypted data and there is no need to CoW the big skbs TCP produces. This fix alone results in ~25-45% performance improvement (depending on the exact CPU SKU and available memory bandwidth). A jump in performance from abysmal to comparable with the user space OpenSSL.</p>
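<p>A back-of-the-envelope model of why the per-record CoW hurt, using the illustrative 64kB skb with four 16kB records from above (the numbers are for illustration, not measurements):</p>

<pre><code>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

#define SKB_SIZE	(64 * 1024)
#define RECORD_SIZE	(16 * 1024)

int main(void)
{
	long records = SKB_SIZE / RECORD_SIZE;	/* 4 records per skb */

	/* Old path: skb_cow_data() duplicates the whole skb once per
	 * record, so extracting the 4 records copies 4 * 64kB... */
	long cow_bytes = records * SKB_SIZE;

	/* ...while decrypting into a fresh buffer touches the data once. */
	long fresh_bytes = SKB_SIZE;

	assert(cow_bytes == 4 * fresh_bytes);
	printf("CoW path copies %ldx the data of the fresh-buffer path\n",
	       cow_bytes / fresh_bytes);
	return 0;
}
</code></pre>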

<h2 id="problem-2-direct-decrypt">Problem 2 – direct decrypt</h2>

<p>Removing pointless copies is all well and good, but as mentioned we also lost the ability to decrypt directly to the user space buffer. We still need to copy the data to user space after it has been decrypted (<strong>A</strong> in the diagram below, here showing just a single record not full skb).
<img src="https://patchwork.hopto.org/static/nipa/tls2.png" alt="SKB direct decrypt">
We can’t regain the full efficiency of TLS 1.2 because we don’t know the record type upfront. In practice, however, most of the records are data/application records (records carrying the application data rather than TLS control traffic like handshake messages or keys), so we can optimize for that case. We can optimistically decrypt to the user buffer, hoping the record contains data, and then check if we were right. Since decrypting to a user space buffer does not destroy the original encrypted record, if we turn out to be wrong we can decrypt again, this time to a kernel skb (which we can then direct to the control message queue). Obviously this sort of optimization would not be acceptable in the Internet wilderness, as attackers could force us to waste time decrypting all records twice.
The real record type in TLS 1.3 is at the tail of the data. We must either trust that the application will not overwrite the record type after we place it in its buffer (<strong>B</strong> in the diagram below), or assume there will be no padding and use a kernel address as the destination of that chunk of data (<strong>C</strong>). Since record padding is also rare, I chose option (<strong>C</strong>). It improves the single stream performance by around 10%.</p>

<h2 id="problem-3-latency">Problem 3 – latency</h2>

<p>Application tests have also shown that kTLS performs much worse than user space TLS in terms of p99 RPC response latency. This is because kTLS holds the socket lock for very long periods of time, preventing TCP from processing incoming packets. Inserting periodic TCP processing points into the kTLS code fixes the problem. The following graph shows the relationship between the TCP processing frequency (on the x axis, in kB of consumed data, 0 = inf), the throughput of a single TLS flow (“data”), and the TCP socket state.
<img src="https://patchwork.hopto.org/static/nipa/tls3.png" alt="TCP CWND SWND">
The TCP-perceived RTT of the connection grows the longer TLS hogs the socket lock without letting TCP process the ingress backlog. TCP responds by growing the congestion window.
Delaying the TCP processing will prevent TCP from responding to network congestion effectively, therefore I decided to be conservative and use 128kB as the TCP processing threshold.
Processing the incoming packets has the additional benefit of letting TLS consume the data as it comes in from the NIC. Previously, TLS only had access to the data already processed by TCP when the read operation began. Any packets coming in from the NIC while TLS was decrypting would be backlogged at TCP input. On the way to user space, TLS would release the socket lock, allowing the TCP backlog processing to kick in; TCP processing would then schedule a TLS worker, and the TLS worker would tell the application there is more data.</p>
]]></content:encoded>
      <author>Jakub Kicinski</author>
      <guid>https://people.kernel.org/read/a/p0kp47ip0v</guid>
      <pubDate>Sat, 30 Jul 2022 23:36:46 +0000</pubDate>
    </item>
    <item>
      <title>Rust in Perspective</title>
      <link>https://people.kernel.org/linusw/rust-in-perspective</link>
<description>&lt;![CDATA[We are discussing and working toward adding the language Rust as a second implementation language in the Linux kernel. A year ago Jake Edge made an excellent summary of the discussions so far on Rust for the Linux kernel and we (or rather Miguel and Wedson) have made further progress since then. For the record I think this is overall a good idea and worth a try. I wanted to add some background that was sketched in a mail thread for the kernel summit.&#xA;&#xA;TL;DR: my claim is that Rust is attempting to raise the abstraction in the programming language and ultimately to join computer science and software engineering into one single discipline, an ambition that has been around since these disciplines were created.&#xA;&#xA;Beginning with ALGOL&#xA;&#xA;The first general high-level language was FORTRAN, which is still in use for some numerical analysis tasks around the world. Then came ALGOL, which attracted a wider audience.&#xA;&#xA;The first &#34;real&#34; operating system (using virtual memory etc) for the Atlas Machine, the supervisor from 1962, was as far as I can tell implemented in Atlas autocode, a dialect of ALGOL, which was the lingua franca at the time. Pure ALGOL could not be used because ALGOL 60 had no input/output primitives, so every real-world application of ALGOL, i.e. any application not solely relying on compiled-in constants, required custom I/O additions.&#xA;&#xA;Algol specifications&#xA;Copies of the first specifications of ALGOL 60, belonging at one time to Carl-Erik Fröberg at Lund University.&#xA;&#xA;ALGOL inspired CPL, which inspired BCPL, which inspired the B programming language, which inspired the C programming language, which we use for the Linux kernel.&#xA;&#xA;Between 1958 and 1968 ALGOL was the nexus in a wide attempt to join computer languages with formal logic. In this timespan we saw the ALGOL 58, ALGOL 60 and ALGOL 68 revisions come out. 
The outcome was that it established computer science as a discipline and people could start building their academic careers on that topic. One notable outcome was the BNF form for describing syntax in languages. This time was in many ways formative for computer science: the first three volumes of Donald Knuth&#39;s The Art of Computer Programming were published in close proximity to these events.&#xA;&#xA;To realize that ALGOL was popular and widespread at the time that Unix was born, and that C was in no way universally accepted, it would suffice to read a piece of the original Bourne Shell source code tree, for example:&#xA;&#xA;setlist(arg,xp)&#xA;&#x9;REG ARGPTR&#x9;arg;&#xA;&#x9;INT&#x9;&#x9;xp;&#xA;{&#xA;&#x9;WHILE arg&#xA;&#x9;DO REG STRING&#x9;s=mactrim(arg-&gt;argval);&#xA;&#x9;   setname(s, xp);&#xA;&#x9;   arg=arg-&gt;argnxt;&#xA;&#x9;   IF flags&amp;execpr&#xA;&#x9;   THEN prs(s);&#xA;&#x9;&#x9;IF arg THEN blank(); ELSE newline(); FI&#xA;&#x9;   FI&#xA;&#x9;OD&#xA;}&#xA;&#xA;This doesn&#39;t look much like C as we know it; it looks much more like ALGOL 68. The ALGOL 68 definition added constructions such as IF/FI, DO/OD etc., which were not present in ALGOL 60. The reason is that Stephen Bourne was an influential contributor to ALGOL 68 and created a set of macros so that the C preprocessor would turn his custom dialect of ALGOL into C, which is why, I think, someone on Reddit suggested nominating bash for the obfuscated C contest.&#xA;&#xA;This is just one of the instances where we can see that the C programming language was not universally loved. The Bourne Shell scripting language that we all love and use is also quite close to ALGOL 68, so the descendants of this language are used more than we may think.&#xA;&#xA;Around 1970 Niklaus Wirth was working to improve ALGOL 68 with what he called ALGOL W. Tired of the slowness of the language committee process, he forked ALGOL and created the programming language Pascal, which was a success in its own right. 
In his very interesting IEEE article named A Brief History of Software Engineering Professor Wirth gives his perspective on some of the events around that time: first he writes about the very influential NATO conference on software engineering 1968 in Garmisch, Germany which served to define software engineering as a distinct discipline. To counter the so-called software crisis - the problems presented by emerging large complex systems - the suggestion was to raise the abstraction in new languages.&#xA;&#xA;To raise the abstraction means to use more mathematical, machine independent constructs in the language. First consider the difference between low-level and high-level languages: a simple operation such as x = x + 1 is not high level, and just a fancy assembly instruction; if we compile it we can readily observe the resulting code in some kind of ADD instruction in the resulting object code.  However a[i] = x + 1 raises abstraction past the point of high-level languages. This is because indexing into an array requires knowledge of the target machine specifics: base addresses, memory layout, etc. This makes the instruction more high-level and thus raises the abstraction of the language. The assumption is that several further higher levels of abstraction exist. We will look into some of these languages in the following sections.&#xA;&#xA;The Garmisch conference is famous in Unix circles because Douglas McIlroy was present and presented his idea of componentized software as a remedy against rising complexity, an idea that was later realized in the form of Unix&#39;s pipes and filters mechanism. 
D-Bus and similar component interoperation mechanisms are contemporary examples of such software componentry -- another way to counter complexity and make software less fragile, but not the focus in this article.&#xA;&#xA;Wirth makes one very specific and very important observation about the Garmisch conference:&#xA;&#xA;  Ultimately, analytic verification and correctness proofs were supposed to replace testing.&#xA;&#xA;This means exactly what it says: with formally verified programming languages, all the features and constructs that are formally proven need not be tested for. Software engineering is known for advocating test-driven development (TDD) to this day, and the ambition was to make large chunks of TDD completely unnecessary. Software testing has its own chapter in the mentioned report from the Garmisch NATO conference where the authors A.I. Llewelyn and R.F. Wickens conclude:&#xA;&#xA;  There are, fundamentally, two different methods of determining whether a product meets its specification. One can analyse the product in great detail and from this determine if it is in accordance with its specification, or one can measure its performance experimentally and see if the results are in accord with the specification; the number and sophistication of the experiments can be varied to provide the degree of confidence required of the results.&#xA;&#xA;The first part of this paragraph i.e. &#34;analyze in great detail&#34; is what Wirth calls analytic verification and is today called formal verification. The latter part of this paragraph is what we call test-driven development, TDD.  Also: the former is a matter of computer science, while the latter is a matter of software engineering. So here is a fork in the road.&#xA;&#xA;Wirth also claims the discussions in Garmisch had a distinct influence on Pascal. 
This can be easily spotted in Pascal strings, one of his principal improvements over ALGOL: Pascal strings are arrays of char, but unlike C char, a Pascal char is not the same as a byte; instead it is defined as belonging to an &#34;ordered character set&#34;, which can very well be ISO8859-1 or Unicode, less, more or equal to 255 characters in size. Strings stored in memory begin with a positive integer array length which defines how long the string is, but this is none of the programmer&#39;s business: it is maintained by the language runtime and not by any custom code. Indexing out of bounds is therefore not possible and can be trivially prohibited during compilation and at runtime. This raises the abstraction of strings: they are set-entities, they have clear boundaries, they need special support code to handle the length field in memory. Further Pascal also has set types, such as:&#xA;&#xA;var&#xA;    JanuaryDays : set of 1..31;&#xA;&#xA;Perhaps Pascal&#39;s application to real-world problems didn&#39;t work out as expected, as it has since also defined PChar as a NULL-terminated pointer to a sequence of characters, akin to C strings. However it should be noted that Pascal pointers are persistently typed and cannot be converted: casting is not possible in Pascal. A Pascal pointer to an integer is always a pointer to an integer.&#xA;&#xA;From Wirth&#39;s perspective, C &#34;presented a great leap backward&#34; and he claims &#34;it revealed that the community at large had hardly grasped the true meaning of the term &#39;high-level language&#39; which became an ill-understood buzzword&#34;. He attributes the problem to Unix which he says &#34;acted like a Trojan horse for C&#34;. 
He further details the actual technical problems with C:&#xA;&#xA;  C offers abstractions which it does not in fact support: Arrays remain without index checking, data types without consistency check, pointers are merely addresses where addition and subtraction are applicable. One might have classified C as being somewhere between misleading and even dangerous.&#xA;&#xA;His point about C lacking index checking is especially important: it can be brought into question if C is really a high-level language. It is not fully abstracting away the machine specifics of handling an array. Language theorists can occasionally refer to C as a &#34;big macro assembler&#34;, the only thing abstracted away is really the raw instruction set.&#xA;&#xA;Wirth however also goes on to state the appealing aspects of the C programming language:&#xA;&#xA;  people at large, particularly in academia, found it intriguing and “better than assembly code” (...) its rules could easily be broken, exactly what many programmers cherished. It was possible to manage access to all of a computer’s idiosyncracies, to items that a high-level language would properly hide. C provided freedom, where high-level languages were considered as straight-jackets enforcing unwanted discipline. It was an invitation to use tricks which had been necessary to achieve efficiency in the early days of computers.&#xA;&#xA;We can see why an efficiency-oriented operating system kernel such as Linux will tend toward C.&#xA;&#xA;It&#39;s not like these tricks stopped after the early days of computing. Just the other day I wrote a patch for Linux with two similar code paths, which could be eliminated by cast:ing a &#xA;The language family including C and also Pascal is referred to as imperative programming languages. The defining character is that the programmer &#34;thinks like a computer&#34; or imagine themselves as the program counter to be exact. 
&#34;First I do this, next I do this, then I do this&#34; - a sequence of statements executed in order, keeping the computer state (such as registers, memory locations and stacks) in the back of your head.&#xA;&#xA;The immediate appeal to operating system programmers should be evident: this closely models what an OS developer needs to keep in mind, such as registers, stacks, cache frames, MMU tables, state transitions in hardware and so on. It is possible to see the whole family of imperative languages as domain specific languages for the domain of writing operating systems, so it would be for operating system developers what OpenGL is for computer graphics software developers.&#xA;&#xA;Lambda Calculus for Defining Languages&#xA;&#xA;In 1966 one of the early adopters and contributors to ALGOL (alongside Peter Naur, Tony Hoare and Niklaus Wirth), Peter Landin, published two articles in the Journal of the ACM titled Correspondence between ALGOL 60 and Church&#39;s Lambda-notation part I and part II. In the first article he begins with a good portion of dry humour:&#xA;&#xA;  Anyone familiar with both Church&#39;s λ-calculi and ALGOL 60 will have noticed a superficial resemblance between the way variables tie up with the λ&#39;s in a nest of λ-expressions, and the way that identifiers tie up with the headings in a nest of procedures and blocks.&#xA;&#xA;He is of course aware that no-one beside himself had been in the position to realize this: the overlap between people familiar with Alonzo Church&#39;s  λ-calculus and with ALGOL 60 was surprisingly down to one person on the planet. What is surprising is that it was even one person.&#xA;&#xA;Alonzo Church was a scholar of mathematical logic and computability, the supervisor of Alan Turing&#39;s doctoral thesis and active in the same field as Kurt Gödel (those men quoted each other in their respective articles). 
The lambda calculus ties into the type theory created by Bertrand Russell and the logical-mathematical programme, another universe of history we will not discuss here.&#xA;&#xA;What λ-calculus (Lambda-calculus) does for a programming language definition is analogous to what regular expressions do for a language&#39;s syntax, but for its semantics. While regular expressions can express how to parse a body of text in a language with regular grammar, expressions in λ-calculus can go on from the abstract syntax tree and express what an addition is, what a subtraction is, or what a bitwise OR is. This exercise is seldom done in e.g. compiler construction courses, but defining semantics is an inherent part of a programming language definition.&#xA;&#xA;Perhaps the most remembered part of Landin&#39;s papers is his humorous term syntactic sugar which denotes things added to a language to make the life of the programmer easier, but which have no semantic content that cannot be expressed by the basic features of the language. The basic mathematical features of the language, on the other hand, are best expressed with λ-calculus.&#xA;&#xA;A notable invention in Landin&#39;s first article about defining ALGOL in terms of λ-calculus is the pair of keywords let and where, chosen to correspond to λ-calculus&#39; applicative expressions. These keywords do not exist in ALGOL: they are part of a language to talk about a language, or in more complicated terms: a meta-language. So here we see the first steps toward a new language derived from λ-calculus. Landin does not give this language a name in this article, but just refers to it as &#34;AE&#34;. 
The AE executes in a theoretical machine called SECD, which is another trick of the trade, like Alan Turing&#39;s &#34;Turing machine&#34;: rather close to a mathematician&#39;s statement &#34;let&#39;s assume we have...&#34; The complete framework for defining ALGOL in λ-calculus is called AE/SECD.&#xA;&#xA;Functional Programming&#xA;&#xA;Functional programming languages, then, implement lambda calculus. The central idea, after some years of experience with defining languages such as ALGOL in terms of lambda calculus, is to just make the language resemble lambda calculus expressions to begin with, so that the verification of the semantics will be simple and obvious.&#xA;&#xA;In 1966 Peter Landin followed up his articles using λ-calculus to describe ALGOL with his article The Next 700 Programming Languages. Here he invents functional programming in the form of an invented language called ISWIM (If You See What I Mean), again with a good dry humour. The language is λ-calculus with &#34;syntactic sugar&#34; on top, so a broad family of languages are possible to create using the framework as a basis. Landin&#39;s article was popular, and people did invent languages. Maybe not 700 of them. Yet.&#xA;&#xA;In section 10 of his article, named Eliminating explicit sequencing, Landin starts speculating and talks about a game that can be played with ALGOL: by removing any goto statements and labels, the program gets a less sequential nature, i.e. the program counter is just advancing to the next line or iterating a loop. He quips:&#xA;&#xA;  What other such features are there? 
This question is considered because, not surprisingly, it turns out that an emphasis on describing things in terms of other things leads to the same kind of requirements as an emphasis against explicit sequencing.&#xA;&#xA;He then goes on to show how to transform an ALGOL program into a purely functional ISWIM program and concludes:&#xA;&#xA;  The special claim of ISWIM is that it grafts procedural notions onto a purely functional base without disturbing many of the desirable properties. (...) This paper can do no more than begin the task of explaining their practical significance.&#xA;&#xA;This reads as a call to action: we need to create functional programming languages akin to ISWIM, and we need to get rid of the J operator (the program control flow operator). Landin never did that himself.&#xA;&#xA;The Meta Language ML&#xA;&#xA;A few years later, in 1974, computer scientist Robin Milner, inspired by ISWIM and as a response to Landin&#39;s challenge, created the language ML, short for Meta Language. This is one of the 700 next languages and clearly realized Landin&#39;s ideas about a language for defining languages, a grammar for defining grammar: a meta language with a meta grammar.&#xA;&#xA;He implemented the language on the DEC10 computer with the help of Malcolm Newey, Lockwood Morris, Mike Gordon and Chris Wadsworth. The language was later ported to the VAX architecture.&#xA;&#xA;The language was based on ISWIM and dropped the so-called J operator (program point operator). It is domain-specific, intended for authoring a theorem-proving tool called LCF. Standard ML has been fully semantically specified and formally verified. This language became widely popular, both in academia and industry.&#xA;&#xA;Removing the J operator made ML a declarative language, i.e. 
it does not specify the order of execution of statements, putting it in the same class of languages as Prolog or for that matter Makefiles: there is no control flow in a Makefile, just a number of conditions that need to be evaluated to arrive at a complete target.&#xA;&#xA;ML still has one imperative language feature: assignment. Around this time, some scholars thought both the J operator and assignment were unnecessary and went on to define purely functional languages such as Haskell. We will not consider them here, they are outside the scope of this article. ML and everything else we discuss can be labelled as impure: a pejorative term invented by people who like purely functional languages. These people dislike not only the sequencing nature of imperative languages but also the assignment (such as happens with the keyword let) and prefer to think about evaluating relationships between abstract entities.&#xA;&#xA;ML can be grasped intuitively. For example this expression in ML evaluates to the integer 64:&#xA;&#xA;let&#xA;    val m : int = 4&#xA;    val n : int = m * m&#xA;in&#xA;    m * n&#xA;end&#xA;&#xA;Here we see some still prominent AE/SECD, ISWIM features such as the keyword let for binding variables, or rather, associating names with elements such as integers and functions (similar to := assignment in some languages). Then we see an implementation section between in and end. We can define functions in ML, like this to compute the square root of five times x:&#xA;&#xA;val rootfivex : real -&gt; real =&#xA;    fn x : real =&gt; Math.sqrt (5.0 * x)&#xA;&#xA;Notice the absence of constructs such as BEGIN/END or semicolons. ML, like Python and other languages, uses whitespace to find the beginning and end of basic blocks. The notation real -&gt; real clearly states that the function takes a real number as input and produces a real number as output. The name real reflects some kind of mathematical ambition. 
The language cannot handle the mathematical set of real numbers -- the ML real is what other languages call a float.&#xA;&#xA;ML has more syntactic sugar, so the following is equivalent using the keyword fun (fun-notation):&#xA;&#xA;fun rootfivex (x:real):real = Math.sqrt (5.0 * x)&#xA;&#xA;The syntax should be possible to grasp intuitively. Another feature of ML and other functional languages is that they easily operate on tuples i.e. an ordered sequence of variables, and tuples can also be returned from functions. For example you can calculate the distance between the origin and a coordinate in an x/y-oriented plane like this:&#xA;&#xA;fun dist (x:real, y:real):real = Math.sqrt (x * x + y * y)&#xA;&#xA;This function can then be called elsewhere like this:&#xA;&#xA;val coor : real * real = (3.0, 4.0)&#xA;val d = dist coor&#xA;&#xA;The type real of d will be inferred from the fact that the dist() function returns a real.&#xA;&#xA;ML gets much more complex than this. One of the upsides of the language that is universally admired is that ML programs, like most programs written in functional languages, can be proven correct in the computational sense. This can be done within certain limitations: for example input/output operations need to specify exactly which values are input or undefined behaviour will occur.&#xA;&#xA;CAML and OCaml&#xA;&#xA;In 1987 Ascánder Suárez at the French Institute for Research in Computer Science and Automation (INRIA) reimplemented a compiler and runtime system for ML in LISP and called the result CAML for Categorical Abstract Machine Language, a pun on the fact that it ran on a virtual machine (the Categorical Abstract Machine) and the heritage from ML proper. The abstract machine used was the LLM3 abstract LISP machine, which in turn ran on another computer. It was not fast.&#xA;&#xA;CAML was reimplemented in C in 1990-91 by Xavier Leroy, creating Caml Light, which was faster, because it was not written in a virtual machine running a virtual machine. 
Caml Light was more like Java and used a bytecode interpreter for its virtual machine.&#xA;&#xA;In 1995, Caml Special Light introduced a native compiler, so the bytecode produced from the Caml compiler could be compiled to object code and executed with no virtual machine overhead, using a native runtime environment. Didier Rémy, Jérôme Vouillon and Jacques Garrigue continued the development of Caml.&#xA;&#xA;Objective Caml arrived in 1996 and added some object oriented features to Caml. In 2011 this extended Caml Special Light compiler and language, a derivative (dialect) of ML, was renamed OCaml. In essence the compiler and language have a symbiotic relationship. There is no second implementation of OCaml.&#xA;&#xA;From the 1990s and forward, what is now the OCaml language and implementation has gained traction. It is a very popular functional programming language, or rather, popular as far as functional programming goes. It has optimized implementations for most architectures. The compiler itself is now written mostly in OCaml, but the runtime in C is still around, to hook into each operating system where the program will eventually run. The language and compiler have been used for a variety of applications. Every major Linux distribution carries packages with the OCaml compiler and libraries. 
There is even a GTK+ 3 OCaml library binding, so OCaml GUI programs can be created.&#xA;&#xA;OCaml simplifies binding labels to numbers etc, here is bubblesort implemented in OCaml:&#xA;&#xA;(* Bubblesort in OCaml, Linus Walleij 2022 *)&#xA;let sort v =&#xA;  let newv = Array.make (Array.length v) 0 in&#xA;  for i = 1 to (Array.length v) - 1 do&#xA;    if v.(i - 1) &gt; v.(i) then begin&#xA;      newv.(i - 1) &lt;- v.(i);&#xA;      newv.(i) &lt;- v.(i - 1);&#xA;      (* Copy back so we are working on the same thing *)&#xA;      v.(i - 1) &lt;- newv.(i - 1);&#xA;      v.(i) &lt;- newv.(i);&#xA;    end else begin&#xA;      newv.(i - 1) &lt;- v.(i - 1);&#xA;      newv.(i) &lt;- v.(i);&#xA;    end&#xA;  done;&#xA;  newv&#xA;&#xA;let rec ordered v =&#xA;  if Array.length v = 0 then true&#xA;  else if Array.length v = 1 then true&#xA;  (* ... or if the rest of the array is ordered *)&#xA;  else if v.(0) &lt; v.(1) &amp;&amp; ordered (Array.sub v 1 (Array.length v - 1)) then true&#xA;  else false;;&#xA;&#xA;let plist v =&#xA;  print_string &#34;V = &#34;;&#xA;  for i = 0 to (Array.length v) - 1 do begin&#xA;    print_int v.(i);&#xA;    if i &lt; (Array.length v - 1) then print_string &#34;,&#34;;&#xA;    end&#xA;  done;&#xA;  print_endline &#34;&#34;;;&#xA;&#xA;let rec sortme v =&#xA;  if ordered v then v&#xA;  else sortme (sort v);;&#xA;&#xA;let v = [| 14 ; 4 ; 55 ; 100 ; 11 ; 29 ; 76 ; 19 ; 6 ; 82 ; 99 ; 0 ; 57 ; 36 ; 61 ; 30 |];;&#xA;plist v;;&#xA;plist (sortme v);;&#xA;&#xA;My experience with working with this example is that OCaml makes a &#34;bit of resistance&#34; to changing contents of things like arrays by indexing. It &#34;dislikes&#34; any imperative constructs and kind of nudges you in the direction of purely logical constructs such as the &#xA;OCaml is still a dialect of ML. The file ending used on all files is &#xA;Rust then&#xA;&#xA;Rust was initially developed in 2006 as a hobby project by Graydon Hoare who was at the time working at Mozilla. 
OCaml and ML are mentioned as the biggest influences on the language, apart from C/C++. A typical sign of this influence is that the first compiler for Rust was written in OCaml. A notable contributor to this codebase, apart from Hoare, is Brendan Eich, one of the founders of the Mozilla project and the inventor of JavaScript. While Brendan did not contribute much code he was at the time CTO of Mozilla, and this shows that when Mozilla started supporting the project in 2009 Rust was certainly well anchored in the organization, and Eich&#39;s early contributions to the language should be noted. (It may be commonplace that people in the CTO position at middle-sized companies make commits to complex code bases, but I am not aware of other examples.)&#xA;&#xA;Despite the OCaml codebase, the first documentation of the language talks more about other functional or declarative languages such as NIL, Hermes, Erlang, Sather, Newsqueak, Limbo and Napier. These origins with extensive quotes from e.g. Joe Armstrong (the inventor of Erlang) have been toned down in contemporary Rust documentation. It is however very clear that Graydon has a deep interest in historical computer languages and is convinced that they have something to teach us, and the expressed ambition is to draw on these languages to pick the best parts. 
In his own words:&#xA;&#xA;  I&#39;ve always been a language pluralist -- picture my relationship towards languages like a kid enjoying a wide variety of building blocks, musical instruments or plastic dinosaurs -- and I don&#39;t think evangelism or single-language puritanism is especially helpful.&#xA;&#xA;What is unique about Rust is that it fuses &#34;impure&#34; functional programming with imperative programming, bringing several concepts from ML and OCaml over into the language.&#xA;&#xA;Another characteristic is that Rust compiled to target machine code from day one, rather than using any kind of virtual machine as did Peter Landin&#39;s ISWIM, or the ML and OCaml languages (and as does say Java, or Python). Graydon probably did this intuitively, but a post he made in 2019 underscores the point: that virtual machines, even as an intermediate step, are bad language engineering and just generally a bad idea.&#xA;&#xA;In 2013 Graydon stepped down as main lead for Rust for personal reasons which he has detailed in a posting on Reddit.&#xA;&#xA;Rust has had the same symbiotic relationship between language and a single compiler implementation as OCaml, but this is changing, as there is now a second, GCC-based implementation in the works.&#xA;&#xA;Here is bubblesort implemented in Rust:&#xA;&#xA;/* Bubblesort in Rust, Linus Walleij 2022 */&#xA;fn sort(array : &amp;mut [i32]) {&#xA;   let mut x : i32;&#xA;   if array.len() == 1 {&#xA;      return;&#xA;   }&#xA;   for i in 1..array.len() {&#xA;      if array[i - 1] &gt; array[i] {&#xA;         x = array[i - 1];&#xA;         array[i - 1] = array[i];&#xA;         array[i] = x;&#xA;      }&#xA;   }&#xA;}&#xA;&#xA;fn isordered(array : &amp;[i32]) -&gt; bool {&#xA;   if array.len() &lt;= 1 {&#xA;     return true;&#xA;   }&#xA;   for i in 1..array.len() {&#xA;     if array[i - 1] &gt; array[i] {&#xA;       return false;&#xA;     }&#xA;   }&#xA;   return true;&#xA;}&#xA;&#xA;fn parray(array : &amp;[i32]) {&#xA;   let mut x : i32;&#xA;   print!(&#34;V = &#34;);&#xA;   for i in 0..array.len() {&#xA;       x = array[i];&#xA;       print!(&#34;{x}&#34;);&#xA;       if i != (array.len() - 1) {&#xA;          print!(&#34;,&#34;);&#xA;       }&#xA;   }&#xA;   println!(&#34;&#34;);&#xA;}&#xA;&#xA;fn main() {&#xA;   let mut array: [i32; 16] = [14, 4, 55, 100, 11, 29, 76, 19, 6, 82, 99, 0, 57, 36, 61, 30];&#xA;   parray(&amp;array);&#xA;   while !isordered(&amp;array) {&#xA;     sort(&amp;mut array);&#xA;   }&#xA;   parray(&amp;array);&#xA;}&#xA;&#xA;Rust lends itself more easily to imperative programming than OCaml: the keyword mut plays a role quite similar to C&#39;s const correctness tagging in this example. Since &#xA;The stated ambition is improved memory safety, data-race safety (concurrency) and type safety. The article Safe Systems Programming in Rust certainly presents the ambition in a straightforward manner. Graydon also underscores the focus on memory and concurrency safety in a 2016 blog post.&#xA;&#xA;But make no mistake. The current underlying ambition is definitely nothing different from the ambition of the ALGOL committee between 1958 and 1968: to raise the abstraction of the language through the ambition to join computer programming with formal logic. This comes from the arrival of strong academic support for the language.&#xA;&#xA;A typical indication of this ambition is the well-funded RustBelt project involving a large amount of academic researchers, all familiar with formal logic, and resulting in such artefacts as Ralf Jung&#39;s PhD thesis Understanding and Evolving the Rust Programming Language. 
Here, formal logic in Rust Belt and the Coq proof assistant is used and concludes (from the abstract):&#xA;&#xA;  Together, these proofs establish that, as long as the only unsafe code in a well-typed λRust program is confined to libraries that satisfy their verification conditions, the program is safe to execute.&#xA;&#xA;What is meant by &#34;safe to execute&#34; is that no use-after-free, dangling pointers, stale references, NULL pointer exceptions etc can ever occur in safe Rust code, because it is proven by formal logic: QED. It does not stop you from e.g. dividing by zero however, that problem is out-of-scope for the exercise.&#xA;&#xA;To me personally the most astonishing fact about Jung&#39;s thesis is that it manages to repeatedly cite and reference the computer scientist Tony Hoare without quoting the inventor of the Rust language, Graydon Hoare, a single time. In a way it confirms Graydon&#39;s own statement that Rust &#34;contains nothing new&#34; from a language point of view.&#xA;&#xA;The C programming language cannot be subject to the same scrutiny as Rust, simply because of all the (ab)use it allows, and which was mentioned by Wirth in his historical perspective: if a type can be changed by a cast and array indexing is not even part of the language, there is nothing much to prove. 
What has been interesting for scholars to investigate is a well-defined subset of C, such as the eBPF subset, which also partly explains the strong interest in eBPF: like with Rust, the build environment and language runtime have been defined with much stricter constraints and thus can be subject to formal verification.&#xA;&#xA;The ambition of Rust is, as I perceive it, and whether the people driving it even know it or not, to finish what the ALGOL committee as primus motor started in 1958, and what the Garmisch NATO conference concluded was necessary in 1968: to develop a language for systems programming that relies on formal logic proof, and to fulfil what ALGOL never could, what Pascal never could, and what the whole maybe-not-700 functional programming languages never could: a language that joins the disciplines of computer science and software engineering into ONE discipline, where the scholars of each can solve problems together.&#xA;&#xA;That is the ambition of Rust as an implementation language for operating systems, such as Linux: provide a language backed by current top-of-the-line computer science research, for immediate application to software engineering developing the top-of-the-line operating system.&#xA;&#xA;What it offers Linux is raised abstraction to counter the problems of complexity identified in the 1968 Garmisch NATO conference and now bleeding obvious given the recurring security incidents, and thereby it would bring the engineering project Linux closer to computer science.&#xA;&#xA;Other approaches to increased Linux (memory-, concurrency-) safety are possible: notably increased testing, which is the engineering go-to panacea. And automated testing of Linux has indeed increased a lot in recent years. 
Raising the abstraction of the implementation language and proving it formally comes with the ambition to make testing less important.&#xA;&#xA;[Mathieu Poirier and Jesper Jansson have helped out in reviewing this blog post, for which I am forever grateful: remaining errors, bugs and biased opinions are my own.]]]&gt;</description>
      <content:encoded><![CDATA[<p>We are discussing and working toward adding the language Rust as a second implementation language in the Linux kernel. A year ago Jake Edge made <a href="https://lwn.net/Articles/862018/" rel="nofollow">an excellent summary</a> of the discussions so far on Rust for the Linux kernel and we (or rather Miguel and Wedson) have made further progress since then. For the record I think this is overall a good idea and worth a try. I wanted to add some background that was sketched <a href="https://lore.kernel.org/ksummit/CANiq72nNKvFqQs9Euy=_McfcHf0-dC_oPB3r8ZJii2L3sfVjaw@mail.gmail.com/" rel="nofollow">in a mail thread for the kernel summit</a>.</p>

<p>TL;DR: my claim is that Rust is attempting to <strong>raise the abstraction</strong> in the programming language and ultimately to join <strong>computer science</strong> and <strong>software engineering</strong> into one single discipline, an ambition that has been around since these disciplines were created.</p>

<h2 id="beginning-with-algol">Beginning with ALGOL</h2>

<p>The first general high-level language was FORTRAN, which is still in use for some numerical analysis tasks around the world. Then came ALGOL, which attracted a wider audience.</p>

<p>The first “real” operating system (using <a href="https://en.wikipedia.org/wiki/Virtual_memory" rel="nofollow">virtual memory</a> etc), the supervisor for <a href="https://en.wikipedia.org/wiki/Atlas_(computer)" rel="nofollow">the Atlas Machine</a> from 1962, was as far as I can tell implemented in <em>Atlas autocode</em>, a dialect of <strong>ALGOL</strong>, which was the lingua franca at the time. Pure ALGOL could not be used because ALGOL 60 had no input/output primitives, so every real-world application of ALGOL, i.e. any application not solely relying on compiled-in constants, required custom I/O additions.</p>

<p><img src="https://dflund.se/~triad/images/Algol-first-copies.jpg" alt="Algol specifications">
<em>Copies of the first specifications of ALGOL 60, belonging at one time to Carl-Erik Fröberg at Lund University.</em></p>

<p><a href="https://en.wikipedia.org/wiki/ALGOL" rel="nofollow">ALGOL</a> inspired <a href="https://en.wikipedia.org/wiki/CPL_(programming_language)" rel="nofollow">CPL</a> that inspired <a href="https://en.wikipedia.org/wiki/BCPL" rel="nofollow">BCPL</a> that inspired <a href="https://en.wikipedia.org/wiki/B_(programming_language)" rel="nofollow">the B programming language</a> that inspired the <a href="https://en.wikipedia.org/wiki/C_(programming_language)" rel="nofollow">C programming language</a>, which we use for the Linux kernel.</p>

<p>Between 1958 and 1968 ALGOL was the nexus in a wide attempt to join computer languages with formal logic. In this timespan we saw the ALGOL 58, ALGOL 60 and ALGOL 68 revisions come out. The outcome was that it established <strong>computer science</strong> as a discipline and people could start building their academic careers on that topic. One notable outcome was <a href="https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form" rel="nofollow">the BNF form</a> for describing syntax in languages. This time was in many ways formative for computer science: the first three volumes of Donald Knuth&#39;s <em>The Art of Computer Programming</em> were published in close proximity to these events.</p>

<p>To see that ALGOL was popular and widespread at the time that Unix was born, and that C was in no way universally accepted, it suffices to read a piece of the original <a href="https://en.wikipedia.org/wiki/Bourne_shell" rel="nofollow">Bourne Shell</a> <a href="https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/sh" rel="nofollow">source code tree</a>, for example:</p>

<pre><code class="language-c">setlist(arg,xp)
	REG ARGPTR	arg;
	INT		xp;
{
	WHILE arg
	DO REG STRING	s=mactrim(arg-&gt;argval);
	   setname(s, xp);
	   arg=arg-&gt;argnxt;
	   IF flags&amp;execpr
	   THEN prs(s);
		IF arg THEN blank(); ELSE newline(); FI
	   FI
	OD
}
</code></pre>

<p>This doesn&#39;t look much like C as we know it; it looks much more like ALGOL 68. The ALGOL 68 definition added constructions such as IF/FI, DO/OD etc, which were not present in ALGOL 60. The reason is that Stephen Bourne was an influential contributor to ALGOL 68 and created <a href="https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/sh/mac.h" rel="nofollow">a set of macros</a> so that the C preprocessor would turn his custom dialect of ALGOL into C, which I think led someone on Reddit to suggest nominating <em>bash</em> for the obfuscated C contest.</p>

<p>This is just one of the instances where we can see that the C programming language was not universally loved. The Bourne Shell scripting language that we all love and use is also quite close to ALGOL 68, so the descendants of this language are used more than we may think.</p>

<p>Around 1970 Niklaus Wirth had been working to improve ALGOL 60 with his proposal ALGOL W. Tired of the slowness of the language committee process he forked ALGOL and created <a href="https://en.wikipedia.org/wiki/Pascal_(programming_language)" rel="nofollow">the programming language Pascal</a>, which was a success in its own right. In his very interesting IEEE article named <a href="https://people.inf.ethz.ch/wirth/Miscellaneous/IEEE-Annals.pdf" rel="nofollow"><em>A Brief History of Software Engineering</em></a> Professor Wirth gives his perspective on some of the events around that time: first he writes about the very influential <a href="http://homepages.cs.ncl.ac.uk/brian.randell/NATO/nato1968.PDF" rel="nofollow">NATO conference on software engineering 1968</a> in Garmisch, Germany, which served to define <strong>software engineering</strong> as a distinct discipline. To counter the so-called <a href="https://en.wikipedia.org/wiki/Software_crisis" rel="nofollow"><em>software crisis</em></a> – the problems presented by emerging large complex systems – the suggestion was to <strong>raise the abstraction</strong> in new languages.</p>

<p>To <em>raise the abstraction</em> means to use more mathematical, machine-independent constructs in the language. First consider the difference between low-level and high-level languages: a simple operation such as <strong>x = x + 1</strong> is not high-level, just a fancy assembly instruction; if we compile it we can readily observe the resulting <em>ADD</em> instruction in the object code. However <strong>a[i] = x + 1</strong> raises the abstraction to the level of a <em>high-level language</em>: indexing into an array requires knowledge of target machine specifics (base addresses, memory layout, etc.), knowledge that the compiler now supplies on the programmer&#39;s behalf. This makes the statement <em>more high-level</em> and thus raises the abstraction of the language. The assumption is that several further levels of abstraction exist above this one. We will look into some of these languages in the following sections.</p>
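
<p>As a sketch of my own (not from Wirth), Rust makes this concrete: the array value carries its own length, so the indexing in <strong>a[i] = x + 1</strong> is a checked operation rather than a raw address computation:</p>

<pre><code>fn main() {
    let x = 41;
    let mut a = [0i32; 4]; // the array carries its own length

    // a[i] = x + 1: more than a fancy ADD; the compiler emits the
    // address computation and a bounds check on the index i.
    let i = 2;
    a[i] = x + 1;
    assert_eq!(a, [0, 0, 42, 0]);

    // An out-of-range index is caught instead of corrupting memory;
    // .get() makes the bounds check visible as an Option.
    assert_eq!(a.get(10), None);
}
</code></pre>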

<p>The Garmisch conference is famous in Unix circles because <a href="https://en.wikipedia.org/wiki/Douglas_McIlroy" rel="nofollow">Douglas McIlroy</a> was present and presented his idea of componentized software as a remedy against rising complexity, an idea that was later realized in the form of Unix&#39;s pipes and filters mechanism. D-Bus and similar component interoperation mechanisms are contemporary examples of such software componentry — another way to counter complexity and make software less fragile, but not the focus in this article.</p>

<p>Wirth makes one very specific and very important observation about the Garmisch conference:</p>

<blockquote><p>Ultimately, analytic verification and correctness proofs were supposed to replace testing.</p></blockquote>

<p>This means exactly what it says: with formally verified programming languages, all the features and constructs that are formally proven need not be tested for. Software engineering is known for advocating <a href="https://en.wikipedia.org/wiki/Test-driven_development" rel="nofollow">test-driven development (TDD)</a> to this day, and the ambition was to make large chunks of TDD completely unnecessary. Software testing has its own chapter in the mentioned report from the Garmisch NATO conference where the authors A.I. Llewelyn and R.F. Wickens conclude:</p>

<blockquote><p>There are, fundamentally, two different methods of determining whether a product meets its specification. One can analyse the product in great detail and from this determine if it is in accordance with its specification, or one can measure its performance experimentally and see if the results are in accord with the specification; the number and sophistication of the experiments can be varied to provide the degree of confidence required of the results.</p></blockquote>

<p>The first part of this paragraph i.e. “analyze in great detail” is what Wirth calls <em>analytic verification</em> and is today called <em>formal verification</em>. The latter part of this paragraph is what we call test-driven development, TDD.  Also: the former is a matter of computer science, while the latter is a matter of software engineering. So here is a fork in the road.</p>

<p>Wirth also claims the discussions in Garmisch had a distinct influence on Pascal. This can be easily spotted in Pascal strings, which were one of his principal improvements over ALGOL: Pascal strings are arrays of <em>char</em>, but unlike <em>C char</em>, a <em>Pascal char</em> is not the same as a byte; instead it is defined as belonging to an “ordered character set”, which can very well be ISO8859-1 or Unicode, less, more or equal to 255 characters in size. Strings stored in memory begin with a positive integer array length which defines how long the string is, but this is none of the programmer&#39;s business: it is maintained by the language runtime, not by custom code. Indexing out of bounds is therefore not possible and can be trivially prohibited during compilation and at runtime. This raises the abstraction of strings: they are set-entities, they have clear boundaries, they need special support code to handle the length field in memory. Further, Pascal also has <strong>set</strong> types, such as:</p>

<pre><code>var
    JanuaryDays : set of 1..31;
</code></pre>

<p>Perhaps Pascal&#39;s application to real-world problems didn&#39;t work out as expected, as it has since also defined <em>PChar</em> as a NULL-terminated pointer to a sequence of characters, akin to C strings. However it should be noted that Pascal pointers are persistently typed and cannot be converted: casting is not possible in Pascal. A Pascal pointer to an integer is <em>always</em> a pointer to an integer.</p>

<p>From Wirth&#39;s perspective, C “presented <em>a great leap backward</em>” and he claims “it revealed that the community at large had hardly grasped the true meaning of the term &#39;high-level language&#39; which became an ill-understood buzzword”. He attributes the problem to Unix which he says “acted like a Trojan horse for C”. He further details the actual technical problems with C:</p>

<blockquote><p>C offers abstractions which it does not in fact support: Arrays remain without index checking, data types without consistency check, pointers are merely addresses where addition and subtraction are applicable. One might have classified C as being somewhere between misleading and even dangerous.</p></blockquote>

<p>His point about C lacking index checking is especially important: it calls into question whether C is really a high-level language, since it does not fully abstract away the machine specifics of handling an array. Language theorists occasionally refer to C as a “big macro assembler”: the only thing really abstracted away is the raw instruction set.</p>

<p>Wirth however also goes on to state the appealing aspects of the C programming language:</p>

<blockquote><p>people at large, particularly in academia, found it intriguing and “better than assembly code” (...) its rules could easily be broken, exactly what many programmers cherished. It was possible to manage access to all of a computer’s idiosyncracies, to items that a high-level language would properly hide. C provided freedom, where high-level languages were considered as straight-jackets enforcing unwanted discipline. It was an invitation to use tricks which had been necessary to achieve efficiency in the early days of computers.</p></blockquote>

<p>We can see why an efficiency-oriented operating system kernel such as Linux will tend toward C.</p>

<p>It&#39;s not like these tricks stopped after the early days of computing. Just the other day I wrote <a href="https://lore.kernel.org/lkml/20220725085822.2360234-1-linus.walleij@linaro.org/" rel="nofollow">a patch</a> for Linux with two similar code paths, which could be eliminated by casting a <code>(const void *)</code> into a <code>(void *)</code>, which I then quipped about in the commit message of <a href="https://lore.kernel.org/lkml/20220725141036.2399822-1-linus.walleij@linaro.org/" rel="nofollow">the revised patch</a>. The reason for violating formal rules in this case was a choice between two evils, and choosing the lesser evil: in a choice between formal correctness and code reuse I chose code reuse. And C enables that kind of choice. The languages presented later in this article <em>absolutely do not</em> allow that kind of choice, and there C casts are seen as nothing less than an abomination.</p>

<p>The language family including C and also Pascal is referred to as <a href="https://en.wikipedia.org/wiki/Imperative_programming" rel="nofollow"><em>imperative programming languages</em></a>. The defining characteristic is that the programmer “thinks like a computer”, or, to be exact, imagines themselves as the program counter: “First I do this, next I do this, then I do this” – a sequence of statements executed in order, keeping the computer <em>state</em> (such as registers, memory locations and stacks) in the back of their head.</p>

<p>The immediate appeal to operating system programmers should be evident: this closely models what an OS developer needs to keep in mind, such as registers, stacks, cache frames, MMU tables, state transitions in hardware and so on. It is possible to see the whole family of imperative languages as <a href="https://en.wikipedia.org/wiki/Domain-specific_language" rel="nofollow"><em>domain specific languages</em></a> for the domain of writing operating systems, so it would be for operating system developers what OpenGL is for computer graphics software developers.</p>

<h2 id="lambda-calculus-for-defining-languages">Lambda Calculus for Defining Languages</h2>

<p>In 1965 one of the early adopters and contributors to ALGOL (alongside Peter Naur, Tony Hoare and Niklaus Wirth), <a href="https://en.wikipedia.org/wiki/Peter_Landin" rel="nofollow">Peter Landin</a>, published two articles in the Communications of the ACM titled <em>Correspondence between ALGOL 60 and Church&#39;s Lambda-notation</em> <a href="https://fi.ort.edu.uy/innovaportal/file/20124/1/22-landin_correspondence-between-algol-60-and-churchs-lambda-notation.pdf" rel="nofollow">part I</a> and <a href="https://dl.acm.org/doi/10.1145/363791.363804" rel="nofollow">part II</a>. In the first article he begins with a good portion of dry humour:</p>

<blockquote><p>Anyone familiar with both Church&#39;s λ-calculi and ALGOL 60 will have noticed a superficial resemblance between the way variables tie up with the λ&#39;s in a nest of λ-expressions, and the way that identifiers tie up with the headings in a nest of procedures and blocks.</p></blockquote>

<p>He is of course aware that no one besides himself had been in the position to realize this: the overlap between people familiar with <a href="https://en.wikipedia.org/wiki/Alonzo_Church" rel="nofollow">Alonzo Church</a>&#39;s λ-calculus <em>and</em> with ALGOL 60 was <em>surprisingly</em> down to one person on the planet. What is surprising is that it was even one person.</p>

<p>Alonzo Church was a scholar of mathematical logic and computability, the supervisor of Alan Turing&#39;s doctoral thesis, and active in the same field as Kurt Gödel (the two men cited each other in their respective articles). The lambda calculus ties into the theory of types created by Bertrand Russell and the logical-mathematical programme, another universe of history that we will not discuss here.</p>

<p>What λ-calculus (lambda calculus) does for a programming language definition is analogous to what regular expressions do for a language&#39;s <em>syntax</em>, but for its <em>semantics</em>. While regular expressions can express how to parse a body of text in a language with a <a href="https://en.wikipedia.org/wiki/Regular_grammar" rel="nofollow">regular</a> grammar, expressions in λ-calculus can go on from the abstract syntax tree and express what an addition is, what a subtraction is, or what a bitwise OR is. This exercise is seldom done in e.g. compiler construction courses, but defining semantics is an inherent part of a programming language definition.</p>
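
<p>A toy sketch of my own may illustrate what “defining semantics” means in practice: each syntactic form in a tiny expression language is given a meaning by an evaluation function, here written in Rust:</p>

<pre><code>// A tiny abstract syntax tree for an expression language.
enum Expr {
    Lit(i64),
    Add(Box&lt;Expr&gt;, Box&lt;Expr&gt;),
    Or(Box&lt;Expr&gt;, Box&lt;Expr&gt;), // bitwise OR
}

// The semantics: a function mapping each form to its mathematical meaning.
fn eval(e: &amp;Expr) -&gt; i64 {
    match e {
        Expr::Lit(n) =&gt; *n,
        Expr::Add(a, b) =&gt; eval(a) + eval(b),
        Expr::Or(a, b) =&gt; eval(a) | eval(b),
    }
}

fn main() {
    // (1 + 2) | 4 evaluates to 7 under the semantics above.
    let sum = Expr::Add(Box::new(Expr::Lit(1)), Box::new(Expr::Lit(2)));
    let e = Expr::Or(Box::new(sum), Box::new(Expr::Lit(4)));
    assert_eq!(eval(&amp;e), 7);
}
</code></pre>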

<p>Perhaps the most remembered part of Landin&#39;s papers is his humorous term <a href="https://en.wikipedia.org/wiki/Syntactic_sugar" rel="nofollow"><em>syntactic sugar</em></a>, which denotes things added to a language to make the life of the programmer easier, but which have no semantic content that cannot be expressed by the basic features of the language. The basic mathematical features of the language, on the other hand, are best expressed with λ-calculus.</p>

<p>A notable invention in Landin&#39;s first article about defining ALGOL in terms of λ-calculus is the pair of keywords <strong>let</strong> and <strong>where</strong>, chosen to correspond to λ-calculus&#39; <em>Applicative Expressions</em>. These keywords do not exist in ALGOL: they are part of a language to talk about a language, or in more complicated terms: a <em>meta-language</em>. So here we see the first steps toward a new language derived from λ-calculus. Landin does not give this language a name in this article, but just refers to it as “AE”. AEs execute on a theoretical machine called SECD, which is another trick of the trade, like Alan Turing&#39;s “Turing machine”: rather close to a mathematician&#39;s statement “let&#39;s assume we have...” The complete framework for defining ALGOL in λ-calculus is called AE/SECD.</p>

<h2 id="functional-programming">Functional Programming</h2>

<p><a href="https://en.wikipedia.org/wiki/Functional_programming" rel="nofollow">Functional programming languages</a>, then, <em>implement</em> lambda calculus. The central idea, after some years of experience with defining languages such as ALGOL in terms of lambda calculus, is to make the language resemble lambda calculus expressions to begin with, so that the verification of the semantics will be simple and obvious.</p>

<p>In 1966 Peter Landin followed up his articles using λ-calculus to describe ALGOL with his article <a href="https://www.cs.cmu.edu/~crary/819-f09/Landin66.pdf" rel="nofollow"><em>The Next 700 Programming Languages</em></a>. Here he invents functional programming in the form of a language called <strong>ISWIM</strong> (If You See What I Mean), again, as you can see, with a good dry humour. The language is λ-calculus with “syntactic sugar” on top, so a broad family of languages can be created using the framework as a basis. Landin&#39;s article was popular, and people did invent languages. Maybe not 700 of them. Yet.</p>

<p>In section 10 of his article, named <em>Eliminating explicit sequencing</em>, Landin starts speculating and talks about a game that can be played with ALGOL: by removing any <strong>goto</strong> statements and labels, the program gets a less sequential nature, i.e. the program counter is just advancing to the next line or iterating a loop. He quips:</p>

<blockquote><p>What other such features are there? This question is considered because, not surprisingly, it turns out that an emphasis on describing things in terms of other things leads to the same kind of requirements as an emphasis against explicit sequencing.</p></blockquote>

<p>He then goes on to show how to transform an ALGOL program into a purely functional ISWIM program and concludes:</p>

<blockquote><p>The special claim of ISWlM is that it grafts procedural notions onto a purely functional base without disturbing many of the desirable properties. (...) This paper can do no more than begin the task of explaining their practical significance.</p></blockquote>

<p>This reads as a call to action: we need to create functional programming languages akin to ISWIM, and we need to get rid of the J operator (the program control flow operator). Landin never did that himself.</p>

<h2 id="the-meta-language-ml">The Meta Language ML</h2>

<p>A few years later, in 1974, computer scientist <a href="https://en.wikipedia.org/wiki/Robin_Milner" rel="nofollow">Robin Milner</a>, inspired by ISWIM and as a response to Landin&#39;s challenge, created the language <a href="https://en.wikipedia.org/wiki/ML_(programming_language)" rel="nofollow"><strong>ML</strong></a>, short for <strong>Meta Language</strong>. This is one of the 700 next languages and clearly recognized Landin&#39;s ideas about a language for defining languages, a grammar for defining grammar: a <em>meta language</em> with a <em>meta grammar</em>.</p>

<p>He implemented the language on the DEC10 computer with the help of Malcolm Newey, Lockwood Morris, Mike Gordon and Chris Wadsworth. The language was later ported to the VAX architecture.</p>

<p>The language was based on ISWIM and dropped the so-called <a href="https://en.wikipedia.org/wiki/J_operator" rel="nofollow"><em>J operator</em></a> (program point operator). It is domain-specific, and intended for authoring <a href="https://en.wikipedia.org/wiki/Logic_for_Computable_Functions" rel="nofollow">a tool for theorem proving called LCF</a>. Standard ML has been <a href="https://smlfamily.github.io/sml97-defn.pdf" rel="nofollow">fully semantically specified</a> and formally verified. This language became widely popular, both in academia and industry.</p>

<p>Removing the J operator made ML a <a href="https://en.wikipedia.org/wiki/Declarative_programming" rel="nofollow">declarative language</a>, i.e. it does not specify the order of execution of statements, putting it in the same class of languages as Prolog or, for that matter, Makefiles: there is no control flow in a Makefile, just a number of conditions that need to be evaluated to arrive at a complete target.</p>

<p>ML still has one imperative language feature: assignment. Around this time, some scholars thought both the <em>J operator</em> and <em>assignment</em> were unnecessary and went on to define <a href="https://en.wikipedia.org/wiki/Purely_functional_programming" rel="nofollow">purely functional languages</a> such as Haskell. We will not consider them here; they are outside the scope of this article. ML and everything else we discuss can be labelled as <em>impure</em>: a pejorative term invented by people who like purely functional languages. These people dislike not only the sequencing nature of imperative languages but also assignment (such as happens with the keyword <strong>let</strong>) and prefer to think about evaluating relationships between abstract entities.</p>

<p>ML can be grasped intuitively. For example this expression in ML evaluates to the integer 64:</p>

<pre><code>let
    val m : int = 4
    val n : int = m*m
in
    m*n
end
</code></pre>

<p>Here we see some still prominent AE/SECD and ISWIM features, such as the keyword <strong>let</strong> for binding variables, or rather, associating names with elements such as integers and functions (similar to <strong>:=</strong> assignment in some languages). Then we see an implementation section <strong>in</strong>. We can define functions in ML, like this one to compute the square root of five times <em>x</em>:</p>

<pre><code>val rootfivex : real -&gt; real =
    fn x : real =&gt; Math.sqrt (5.0 * x)
</code></pre>

<p>Notice the absence of constructs such as BEGIN/END or semicolons: the keywords of the expression itself delimit the blocks. The notation <em>real -&gt; real</em> clearly states that the function takes a real number as input and produces a real number as output. The name <em>real</em> reflects some kind of mathematical ambition. The language cannot handle the mathematical set of real numbers: the ML <em>real</em> is what other languages call a <em>float</em>.</p>

<p>ML has more syntactic sugar, so the following is equivalent using the keyword <strong>fun</strong> (fun-notation):</p>

<pre><code>fun rootfivex (x:real):real = Math.sqrt (5.0 * x)
</code></pre>

<p>The syntax should be possible to grasp intuitively. Another feature of ML and other functional languages is that they easily operate on <em>tuples</em>, i.e. an ordered sequence of variables, and tuples can also be returned from functions. For example you can calculate the distance from the origin to a point with coordinates (x, y) in the plane like this:</p>

<pre><code>fun dist (x:real, y:real):real = Math.sqrt (x*x + y*y)
</code></pre>

<p>This function can then be called elsewhere like this:</p>

<pre><code>val coor : real * real = (3.0, 4.0)
val d = dist coor
</code></pre>

<p>The type real of <em>d</em> will be inferred from the fact that the <em>dist()</em> function returns a real.</p>

<p>ML gets much more complex than this. One of the upsides of the language that is universally admired is that ML programs, like most programs written in functional languages, can be <em>proven correct</em> in the computational sense. This can be done within certain limits: for example, input/output operations need to specify exactly which values are input, or undefined behaviour will occur.</p>

<h2 id="caml-and-ocaml">CAML and OCaml</h2>

<p>In 1987 Ascánder Suárez at the French Institute for Research in Computer Science and Automation (INRIA) <a href="https://caml.inria.fr/about/history.en.html" rel="nofollow">reimplemented a compiler and runtime system for ML in LISP</a> and called the result <strong>CAML</strong>, for <em>Categorical Abstract Machine Language</em>: a pun on the fact that it ran on a virtual machine (the Categorical Abstract Machine) and on the heritage from ML proper. The abstract machine used was the LLM3 abstract LISP machine, which in turn ran on another computer. It was not fast.</p>

<p>CAML was reimplemented in C in 1990-91 by Xavier Leroy, creating <em>Caml Light</em>, which was faster because it no longer ran a virtual machine on top of another virtual machine. Caml Light was more like Java and used a <a href="https://en.wikipedia.org/wiki/Bytecode" rel="nofollow">bytecode interpreter</a> for its virtual machine.</p>

<p>In 1995, Caml Special Light introduced a native compiler, so the bytecode produced from the Caml compiler could be compiled to object code and executed with no virtual machine overhead,  using a <a href="https://github.com/ocaml/ocaml/tree/trunk/runtime" rel="nofollow">native <em>runtime environment</em></a>. Didier Rémy,  Jérôme Vouillon and Jacques Garrigue continued the development of Caml.</p>

<p>Objective Caml arrived in 1996 and added some object-oriented features to Caml. In 2011 this extended compiler and ML dialect was renamed <strong>OCaml</strong>. In essence the compiler and the language have a symbiotic relationship: there is no second implementation of OCaml.</p>

<p>From the 1990s and forward, what is now <a href="https://github.com/ocaml/ocaml" rel="nofollow">the OCaml language and implementation</a> has gained traction. It is a very popular functional programming language, or rather, popular as far as functional programming goes. It has <a href="https://github.com/ocaml/ocaml/tree/trunk/asmcomp" rel="nofollow">optimized implementations</a> for most architectures. The compiler itself is now written mostly in OCaml, but the runtime in C is still around, to hook into each operating system where the program will eventually run. The language and compiler have been used for a variety of applications. Every major Linux distribution carries packages with the OCaml compiler and libraries. There is even <a href="https://garrigue.github.io/lablgtk/" rel="nofollow">a GTK+ 3 OCaml library binding</a>, so OCaml GUI programs can be created.</p>

<p>OCaml simplifies binding labels to numbers etc, here is bubblesort implemented in OCaml:</p>

<pre><code>(* Bubblesort in OCaml, Linus Walleij 2022 *)
let sort v =
  let newv = Array.make (Array.length v) 0 in
  for i = 1 to (Array.length v) - 1 do
    if v.(i - 1) &gt; v.(i) then begin
      newv.(i - 1) &lt;- v.(i);
      newv.(i) &lt;- v.(i - 1);
      (* Copy back so we are working on the same thing *)
      v.(i - 1) &lt;- newv.(i - 1);
      v.(i) &lt;- newv.(i);
    end else begin
      newv.(i - 1) &lt;- v.(i - 1);
      newv.(i) &lt;- v.(i);
    end
  done;
  newv

let rec ordered v =
  if Array.length v = 0 then true
  else if Array.length v = 1 then true
  (* ... or if the rest of the array is ordered *)
  else if v.(0) &lt;= v.(1) &amp;&amp; ordered (Array.sub v 1 (Array.length v - 1)) then true
  else false;;

let plist v =
  print_string &#34;V = &#34;;
  for i = 0 to (Array.length v) - 1 do begin
    print_int v.(i);
    if i &lt; (Array.length v - 1) then print_string &#34;,&#34;;
    end
  done;
  print_endline &#34;&#34;;;

let rec sortme v =
  if ordered v then v
  else sortme (sort v);;

let v = [| 14 ; 4 ; 55 ; 100 ; 11 ; 29 ; 76 ; 19 ; 6 ; 82 ; 99 ; 0 ; 57 ; 36 ; 61 ; 30 |];;
plist v;;
plist (sortme v);;
</code></pre>

<p>My experience with working with this example is that OCaml puts up a “bit of resistance” to changing the contents of things like arrays by indexing. It “dislikes” any imperative constructs and kind of nudges you in the direction of purely logical constructs such as the <code>ordered</code> function above. This is just my personal take.</p>

<p>OCaml is still a dialect of ML; the file ending used on all files is <code>.ml</code> as well. OCaml, like Python&#39;s <em>pip</em> or Perl&#39;s <em>CPAN</em>, has its own package system and library called <a href="https://opam.ocaml.org/" rel="nofollow"><strong>opam</strong></a>. The prime application is still the <a href="https://opam.ocaml.org/packages/alt-ergo-lib-free/" rel="nofollow">OCaml Ergo Library</a>, a library for automatic theorem proving. If your first and foremost use of computers is theorem proving, ML and OCaml have continued to deliver since 1974. The more recent and widely popular <a href="https://en.wikipedia.org/wiki/Coq" rel="nofollow">Coq theorem prover</a> is also written in OCaml.</p>

<h2 id="rust-then">Rust then</h2>

<p>Rust was initially developed in 2006 as a hobby project by Graydon Hoare, who was working at Mozilla at the time. OCaml and ML <a href="https://doc.rust-lang.org/reference/influences.html" rel="nofollow">are mentioned</a> as the biggest influence on the language, apart from C/C++. A typical sign of this influence would be that <a href="https://github.com/graydon/rust-prehistory/tree/master/src/boot/fe" rel="nofollow">the first compiler for Rust</a> was written in OCaml. A notable contributor to this codebase, apart from Hoare, is <a href="https://en.wikipedia.org/wiki/Brendan_Eich" rel="nofollow">Brendan Eich</a>, one of the founders of the Mozilla project and the inventor of JavaScript. While Brendan did not contribute much code he was at the time CTO of Mozilla, and this shows that when Mozilla started supporting the project in 2009 Rust was certainly well anchored in the organization, and Eich&#39;s early contributions to the language should be noted. (It may be commonplace for people in the CTO position at middle-sized companies to make commits to complex code bases, but I am not aware of other such cases.)</p>

<p>Despite the OCaml codebase, <a href="https://github.com/graydon/rust-prehistory/blob/master/doc/rust.texi" rel="nofollow">the first documentation</a> of the language talks more about other functional or declarative languages such as NIL, Hermes, Erlang, Sather, Newsqueak, Limbo and Napier. These origins, with extensive quotes from e.g. Joe Armstrong (the inventor of Erlang), have been toned down in contemporary Rust documentation. It is however very clear that Graydon has a deep interest in historical computer languages and is convinced that they have something to teach us, and the expressed ambition is to draw on these languages to pick the best parts. In his own words:</p>

<blockquote><p>I&#39;ve always been a language pluralist — picture my relationship towards languages like a kid enjoying a wide variety of building blocks, musical instruments or plastic dinosaurs — and I don&#39;t think evangelism or single-language puritanism is especially helpful.</p></blockquote>

<p>What is unique about Rust is that it fuses “impure” functional programming with imperative programming, bringing several concepts from ML and OCaml over into the language.</p>
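
<p>As a small sketch of my own: the ML <strong>let ... in ... end</strong> example from earlier, which evaluates to 64, carries over almost verbatim, because in Rust, as in ML, a block is an expression with an inferred type:</p>

<pre><code>fn main() {
    let v = {
        let m = 4;     // type inferred, like ML&#39;s val m : int = 4
        let n = m * m;
        m * n          // no semicolon: this is the value of the block
    };
    assert_eq!(v, 64);

    // Tuples and pattern matching are also part of the ML inheritance:
    let (x, y) = (3.0_f64, 4.0_f64);
    let dist = (x * x + y * y).sqrt();
    assert_eq!(dist, 5.0);
}
</code></pre>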

<p>Another characteristic is that Rust compiled to target machine code from day one, rather than using any kind of virtual machine as did Peter Landin&#39;s ISWIM, or the ML and OCaml languages (and as do, say, Java or Python). Graydon probably did this intuitively, but <a href="https://graydon2.dreamwidth.org/264181.html" rel="nofollow">a post he made in 2019</a> underscores the point: that virtual machines, even as an intermediate step, are bad language engineering and just generally a <em>bad idea</em>.</p>

<p>In 2013 Graydon stepped down as main lead for Rust for personal reasons which he has detailed <a href="https://www.reddit.com/r/rust/comments/7qels2/i_wonder_why_graydon_hoare_the_author_of_rust/" rel="nofollow">in a posting on Reddit</a>.</p>

<p>Rust has had the same symbiotic relationship between language and a single compiler implementation as OCaml, but this is changing, as there is now a second, GCC-based implementation in the works.</p>

<p>Here is bubblesort implemented in Rust:</p>

<pre><code>/* Bubblesort in Rust, Linus Walleij 2022 */
fn sort(array : &amp;mut [i32]) {
   let mut x : i32;
   if array.len() == 1 {
      return;
   }
   for i in 1..array.len() {
      if array[i - 1] &gt; array[i] {
         x = array[i - 1];
         array[i - 1] = array[i];
         array[i] = x;
      }
   }
}

fn is_ordered(array : &amp;[i32]) -&gt; bool {
   if array.len() &lt;= 1 {
     return true;
   }
   for i in 1..array.len() {
     if array[i - 1] &gt; array[i] {
       return false;
     }
   }
   return true;
}

fn parray(array : &amp;[i32]) {
   let mut x : i32;
   print!(&#34;V = &#34;);
   for i in 0..array.len() {
       x = array[i];
       print!(&#34;{x}&#34;);
       if i != (array.len() - 1) {
       	  print!(&#34;,&#34;);
       }
   }
   println!(&#34;&#34;);
}

fn main() {
   let mut array: [i32; 16] = [14, 4, 55, 100, 11, 29, 76, 19, 6, 82, 99, 0, 57, 36, 61, 30];
   parray(&amp;array);
   while !is_ordered(&amp;array) {
     sort(&amp;mut array);
   }
   parray(&amp;array);
}
</code></pre>

<p>Rust lends itself to easier imperative programming than OCaml: the keyword <em>mut</em> becomes quite similar to C&#39;s <a href="https://en.wikipedia.org/wiki/Const_(computer_programming)" rel="nofollow"><em>const correctness tagging</em></a> in this example. Since <code>is_ordered</code> and <code>parray</code> don&#39;t alter the contents of the array, these functions do not need to take it as <em>mut</em>. You also see some familiar virtues from Pascal: arrays “know” their length, and we use a method to obtain it: <code>array.len()</code>.</p>

<p>The stated ambition is improved memory safety, data-race safety (concurrency) and type safety. The article <a href="https://iris-project.org/pdfs/2021-rustbelt-cacm-final.pdf" rel="nofollow"><em>Safe Systems Programming in Rust</em></a> certainly presents the ambition in a straightforward manner. Graydon also underscores the focus on memory and concurrency safety <a href="https://graydon2.dreamwidth.org/247406.html" rel="nofollow">in a 2016 blog post</a>.</p>

<p>But <em>make no mistake</em>. The current underlying ambition is <em>definitely</em> no different from the ambition of the ALGOL committee between 1958 and 1968: to <em>raise the abstraction</em> of the language through the ambition to <em>join computer programming with formal logic</em>. This comes from the arrival of strong academic support for the language.</p>

<p>A typical indication of this ambition is the <a href="https://plv.mpi-sws.org/rustbelt/" rel="nofollow">well-funded RustBelt project</a> involving a large number of academic researchers, all familiar with formal logic, and resulting in such artefacts as Ralf Jung&#39;s PhD thesis <a href="https://research.ralfj.de/phd/thesis-screen.pdf" rel="nofollow"><em>Understanding and Evolving the Rust Programming Language</em></a>. Here formal logic and the Coq proof assistant are used, and the thesis concludes (from the abstract):</p>

<blockquote><p>Together, these proofs establish that, as long as the only unsafe code in a well-typed λRust program is confined to libraries that satisfy their verification conditions, the program is safe to execute.</p></blockquote>

<p>What is meant by “safe to execute” is that no use-after-free, dangling pointers, stale references, NULL pointer exceptions etc. can ever occur in safe Rust code, because it is proven by formal logic: QED. It does not stop you from e.g. dividing by zero, however; that problem is out of scope for the exercise.</p>

<p>To me personally the most astonishing fact about Jung&#39;s thesis is that it manages to repeatedly cite and reference the computer scientist Tony Hoare without once citing the inventor of the Rust language, Graydon Hoare. In a way it confirms Graydon&#39;s own statement that Rust “contains nothing new” from a language point of view.</p>

<p>The C programming language cannot be subject to the same scrutiny as Rust, simply because of all the (ab)use it allows, and which was mentioned by Wirth in his historical perspective: if a type can be changed by a cast and array indexing is not even part of the language, there is not much to prove. What has been interesting for scholars to investigate is a well-defined subset of C, such as <a href="https://www.kernel.org/doc/html/latest/bpf/index.html" rel="nofollow">the eBPF subset</a>, which also partly explains the strong interest in eBPF: as with Rust, the build environment and language runtime have been defined with much stricter constraints and thus can be subject to formal verification.</p>

<p>The ambition of Rust is, as I perceive it, and whether the people driving it know it or not, to finish what the ALGOL committee as <em>primus motor</em> started in 1958, and what the Garmisch NATO conference concluded was necessary in 1968: to develop a language for systems programming that relies on formal logic proofs, and to fulfil what ALGOL never could, what Pascal never could, and what the whole maybe-not-700 functional programming languages never could: a language that joins the disciplines of <strong>computer science</strong> and <strong>software engineering</strong> into <strong>ONE</strong> discipline, where the scholars of each can solve problems together.</p>

<p>That is the ambition of Rust as an implementation language for operating systems, such as Linux: provide a language backed by current top-of-the-line computer science research, for immediate application to software engineering developing the top-of-the-line operating system.</p>

<p>What it offers Linux is <strong>raised abstraction</strong> to counter the problems of complexity identified in the 1968 Garmisch NATO conference and now bleeding obvious given the recurring security incidents, and thereby would bring the engineering project Linux closer to computer science.</p>

<p>Other approaches to increased Linux (memory and concurrency) safety are possible: notably increased testing, which is the engineering go-to panacea. And automated testing of Linux has indeed increased a lot in recent years. Raising the abstraction of the implementation language and proving it formally comes with the ambition to make testing <em>less</em> important.</p>

<p>[Mathieu Poirier and Jesper Jansson have helped out in reviewing this blog post, for which I am forever grateful: remaining errors, bugs and biased opinions are my own.]</p>
]]></content:encoded>
      <author>linusw</author>
      <guid>https://people.kernel.org/read/a/gcmk4vjm5y</guid>
      <pubDate>Thu, 14 Jul 2022 12:10:20 +0000</pubDate>
    </item>
    <item>
      <title>Anonymous VMA merging improvements WIP</title>
      <link>https://people.kernel.org/vbabka/anonymous-vma-merging-improvements-wip</link>
      <description>&lt;![CDATA[In this post I would like to raise the awareness a bit about an effort to reduce the limitations of anonymous VMA merging, in the form of an ongoing master thesis by Jakub Matena, which I&#39;m supervising. I suspect there might be userspace projects that would benefit and maybe their authors are aware of the current limitations and would welcome if they were relaxed, but they don&#39;t read the linux-mm mailing list - the last version of the RFC posted there is here&#xA;&#xA;In a high-level summary, merging of anonymous VMAs in Linux generally happens as soon as they become adjacent in the address space and have compatible access protection bits (and also mempolicies etc.). However due to internal implementation details (involving VMA and page offsets) some operations such as mremap() that moves VMAs around the address space can cause anonymous VMAs not to merge even if everything else is compatible. This is then visible as extra entries in /proc/pid/maps that could be in theory be one larger entry, the associated larger memory and CPU overhead of VMA operations, or even hitting the limit of VMAs per process, set by the vm.maxmapcount sysctl. A related issue is that mremap() syscall itself cannot currently process multiple VMAs, so a process that needs to further mremap() the non-merged areas would need to somehow learn the extra boundaries first and perform a sequence of multiple mremap()&#39;s to achieve its goal.&#xA;&#xA;Does any of the above sound familiar because you found that out already while working on a Linux application? Then we would love your feedback on the RFC linked above (or privately). The issue is that while in many scenarios the merging limitations can be lifted by the RFC, it doesn&#39;t come for free in both of some overhead of e.g. mremap(), and especially the extra complexity of an already complex code. Thus identifying workloads that would benefit a lot would be helpful. Thanks!]]&gt;</description>
<content:encoded><![CDATA[<p>In this post I would like to raise awareness of an effort to reduce the limitations of anonymous VMA merging, in the form of an ongoing master&#39;s thesis by Jakub Matena, which I&#39;m supervising. I suspect there might be userspace projects that would benefit, and maybe their authors are aware of the current limitations and would welcome it if they were relaxed, but they don&#39;t read the linux-mm mailing list – the last version of the RFC posted there is <a href="https://lore.kernel.org/all/20220516125405.1675-1-matenajakub@gmail.com/" rel="nofollow">here</a>.</p>

<p>In a high-level summary, merging of anonymous VMAs in Linux generally happens as soon as they become adjacent in the address space and have compatible access protection bits (and also mempolicies etc.). However, due to internal implementation details (involving VMA and page offsets), some operations such as mremap() that move VMAs around the address space can cause anonymous VMAs not to merge even if everything else is compatible. This is then visible as extra entries in <code>/proc/pid/maps</code> that could in theory be one larger entry, as larger memory and CPU overhead of VMA operations, or even as hitting the limit of VMAs per process, set by the <code>vm.max_map_count</code> sysctl. A related issue is that the mremap() syscall itself cannot currently process multiple VMAs, so a process that needs to further mremap() the non-merged areas would need to somehow learn the extra boundaries first and perform a sequence of multiple mremap()&#39;s to achieve its goal.</p>

<p>Does any of the above sound familiar because you found it out already while working on a Linux application? Then we would love your feedback on the RFC linked above (or privately). The issue is that while in many scenarios the merging limitations can be lifted by the RFC, it doesn&#39;t come for free: there is both some overhead in e.g. mremap(), and especially extra complexity in already complex code. Thus identifying workloads that would benefit a lot would be helpful. Thanks!</p>
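<p>For readers who want to see the merging itself from userspace, here is a minimal sketch in Python via ctypes; it is not part of the RFC or the thesis, and it assumes Linux with glibc. Two adjacent anonymous mappings with identical protection bits are merged by the kernel into a single VMA, visible as one entry in <code>/proc/self/maps</code>; the mremap() cases the RFC addresses are the ones where this merging currently fails:</p>

```python
# Sketch only (assumes Linux + glibc): observe anonymous VMA merging
# by placing two adjacent anonymous mappings and reading /proc/self/maps.
import ctypes
import mmap

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

PAGE = mmap.PAGESIZE
PROT_RW = mmap.PROT_READ | mmap.PROT_WRITE
ANON_PRIV = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
MAP_FIXED = 0x10  # Linux value; not exposed by Python's mmap module

# Reserve two pages with PROT_NONE (prot = 0) so the fixed mappings
# below cannot clobber anything else in the address space.
base = libc.mmap(None, 2 * PAGE, 0, ANON_PRIV, -1, 0)
assert base not in (None, ctypes.c_void_p(-1).value)

# Map each page read-write separately; the kernel merges the two
# resulting VMAs because they are adjacent and fully compatible.
libc.mmap(base, PAGE, PROT_RW, ANON_PRIV | MAP_FIXED, -1, 0)
libc.mmap(base + PAGE, PAGE, PROT_RW, ANON_PRIV | MAP_FIXED, -1, 0)

def entries_covering(lo, hi):
    """Return (start, end) of /proc/self/maps entries overlapping [lo, hi)."""
    found = []
    with open("/proc/self/maps") as maps:
        for line in maps:
            start, end = (int(x, 16) for x in line.split()[0].split("-"))
            if start < hi and end > lo:
                found.append((start, end))
    return found

# A single entry covers both pages when the VMAs merged.
print(entries_covering(base, base + 2 * PAGE))
```

<p>After an mremap() that defeats merging, the same check would instead show multiple entries over the affected range.</p>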
]]></content:encoded>
      <author>Vlastimil Babka</author>
      <guid>https://people.kernel.org/read/a/pdy9qm9lta</guid>
      <pubDate>Fri, 24 Jun 2022 10:24:16 +0000</pubDate>
    </item>
    <item>
      <title>Test timeout and runtime</title>
      <link>https://people.kernel.org/metan/test-timeout-and-runtime</link>
      <description>&lt;![CDATA[The new LTP release will include changes that have introduced concept of a test maximal runtime so let me briefly explain what exactly that is. To begin with let&#39;s make an observation about a LTP test duration. Most of the LTP tests do fall into two categories when duration of the test is considered. First type of tests is fast, generally under a second or two and most of the time even fraction of that. These tests mostly prepare simple environment, call a syscall or two, clean up and are done. The second type of tests runs for longer and their duration is usually counted in minutes. These tests include I/O stress test, various regression tests that are looping in order to hit a race, timer precision tests that have to sample time intervals and so on.&#xA;&#xA;Historically in LTP the test duration was limited by a single value called timeout, that defaulted to a compromise of 5 minutes, which is the worst value for both classes of the tests. That is because it&#39;s clearly too long for short running tests and at the same time too short for significant fraction of the long running tests. This was clear just by checking the tests that actually adjusted the default timeout. Quite a few short running tests that were prone to deadlocks decreased the default timeout to a much shorter interval and at the same time quite a few long running tests did increase it as well.&#xA;&#xA;But back at how the test duration was handled in the long running tests. The test duration for long running tests is usually bounded by a time limit as well as a limit on a number of iterations and the test exits on whichever is hit first. In order to exit the test before the timeout these tests watched the elapsed runtime and did exit the main loop if the runtime got close enough to the test timeout. The problem was that close enough was loosely defined and implemented in each test differently. That obviously leads to a different problems. 
For instance if test looped until there was 10 seconds left to the timeout and the test cleanup did take more than 10 seconds on a slower hardware, there was no way how to avoid triggering the timeout which resulted in test failure. If test timeout was increased the test simply run for longer duration and hit the timeout at the end either way. At the same time if the test did use proportion of the timeout left out for the test cleanup things didn&#39;t work out when the timeout was scaled down in order to shorten the test duration.&#xA;&#xA;After careful analysis it became clear that the test duration has to be bound by a two distinct values. The new values are now called timeout and max\runtime and the test duration is bound by a sum of these two. The idea behind this should be clear to the reader at this point. The max\runtime limits the test active part, that is the part where the actual test loop is executed and the timeout covers the test setup and cleanup and all inaccuracies in the accounting. Each of them can be scaled separately which gives us enough flexibility to be able to scale from small embedded boards all the way up to the supercomputers. This change also allowed us to change the default test timeout to 30 seconds. And if you are asking yourself a question how max\runtime is set for short running tests the answer is simple it&#39;s set to zero since the default timeout is more than enough to cope with these.&#xA;&#xA;All of this also helps to kill the misbehaving tests much faster since we have much better estimation for the expected test duration. And yes this is a big deal when you are running thousands of testcases, it may speed up the testrun quite significantly even with a few deadlocked tests.&#xA;&#xA;But things does not end here, there is a bit of added complexity on the top of this. Some of the testcases will call the main test loop more than once. 
That is because we have a few &#34;multipliers&#34; flags that can increase test coverage quite a bit. For instance we have so called .all\filesystems flag, that when set, will execute the test on the top of the most commonly used filesystems. There is also flag that can run the test for a different variants, which is sometimes used to run the test for a more than one syscall variant, e.g. for clock\_gettime() we run the same test for both syscall and VDSO. All these multipliers have to be taken into an account when overall test duration is computed. However we do have all these flags in the metadata file now hence we are getting really close to a state where we will have a tool that can compute an accurate upper bound for duration for a given test. However that is completely different story for a different short article.]]&gt;</description>
<content:encoded><![CDATA[<p>The new LTP release will include changes that introduce the concept of a maximal test runtime, so let me briefly explain what exactly that is. To begin with, let&#39;s make an observation about LTP test duration. Most LTP tests fall into two categories when test duration is considered. The first type of test is fast, generally finishing in under a second or two, and most of the time even a fraction of that. These tests mostly prepare a simple environment, call a syscall or two, clean up and are done. The second type of test runs for longer, and its duration is usually counted in minutes. These tests include I/O stress tests, various regression tests that loop in order to hit a race, timer precision tests that have to sample time intervals, and so on.</p>

<p>Historically in LTP the test duration was limited by a single value called timeout, which defaulted to a compromise of 5 minutes, the worst value for both classes of tests. That is because it&#39;s clearly too long for short running tests and at the same time too short for a significant fraction of the long running tests. This was clear just by checking the tests that actually adjusted the default timeout. Quite a few short running tests that were prone to deadlocks decreased the default timeout to a much shorter interval, and at the same time quite a few long running tests increased it.</p>

<p>But back to how the test duration was handled in the long running tests. The duration of a long running test is usually bounded by a time limit as well as a limit on the number of iterations, and the test exits on whichever is hit first. In order to exit before the timeout, these tests watched the elapsed runtime and exited the main loop if the runtime got close enough to the test timeout. The problem was that “close enough” was loosely defined and implemented differently in each test. That obviously leads to different problems. For instance, if a test looped until there were 10 seconds left to the timeout and the test cleanup took more than 10 seconds on slower hardware, there was no way to avoid triggering the timeout, which resulted in a test failure. If the test timeout was increased, the test simply ran for a longer duration and hit the timeout at the end either way. At the same time, if the test used a proportion of the remaining timeout for the cleanup, things didn&#39;t work out when the timeout was scaled down in order to shorten the test duration.</p>

<p>After careful analysis it became clear that the test duration has to be bound by two distinct values. The new values are now called timeout and max_runtime, and the test duration is bound by the sum of the two. The idea behind this should be clear to the reader at this point. The max_runtime limits the active part of the test, that is the part where the actual test loop is executed, while the timeout covers the test setup and cleanup and all inaccuracies in the accounting. Each of them can be scaled separately, which gives us enough flexibility to scale from small embedded boards all the way up to supercomputers. This change also allowed us to change the default test timeout to 30 seconds. And if you are asking yourself how max_runtime is set for short running tests, the answer is simple: it&#39;s set to zero, since the default timeout is more than enough to cope with these.</p>

<p>All of this also helps to kill misbehaving tests much faster, since we have a much better estimate of the expected test duration. And yes, this is a big deal when you are running thousands of testcases; it may speed up the testrun quite significantly even with just a few deadlocked tests.</p>

<p>But things do not end here; there is a bit of added complexity on top of this. Some of the testcases will call the main test loop more than once. That is because we have a few “multiplier” flags that can increase test coverage quite a bit. For instance we have the so-called .all_filesystems flag which, when set, will execute the test on top of the most commonly used filesystems. There is also a flag that can run the test for different variants, which is sometimes used to run the test for more than one syscall variant, e.g. for clock_gettime() we run the same test for both the syscall and the VDSO. All these multipliers have to be taken into account when the overall test duration is computed. However, we do have all these flags in the metadata file now, hence we are getting really close to a state where we will have a tool that can compute an accurate upper bound on the duration of a given test. But that is a completely different story for a different short article.</p>
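<p>To make the accounting concrete, here is a small sketch of how such an upper bound could be computed from the two values and the multiplier flags; the names are illustrative only, this is not LTP&#39;s actual API:</p>

```python
# Illustrative sketch only -- the names below are not LTP's actual API.
# The timeout covers setup/cleanup and accounting slack, while
# max_runtime bounds each execution of the test loop, which multiplier
# flags such as .all_filesystems repeat.
def max_test_duration(timeout, max_runtime, n_filesystems=1, n_variants=1):
    return timeout + max_runtime * n_filesystems * n_variants

# A short test: max_runtime is zero, the default 30 s timeout suffices.
print(max_test_duration(timeout=30, max_runtime=0))  # -> 30
# A long test run on, say, five filesystems with a 60 s runtime budget.
print(max_test_duration(timeout=30, max_runtime=60, n_filesystems=5))  # -> 330
```

<p>Whether the setup/cleanup budget itself should also be repeated per filesystem is exactly the kind of detail such a tool would have to get right.</p>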
]]></content:encoded>
      <author>metan&#39;s blog</author>
      <guid>https://people.kernel.org/read/a/g1r66qpdjl</guid>
      <pubDate>Wed, 25 May 2022 15:30:40 +0000</pubDate>
    </item>
    <item>
      <title>Cross-fork object sharing in git (is not a bug)</title>
      <link>https://people.kernel.org/monsieuricon/cross-fork-object-sharing-in-git-is-not-a-bug</link>
      <description>&lt;![CDATA[Once every couple of years someone unfailingly takes advantage of the following two facts:&#xA;&#xA;most large git hosting providers set up object sharing between forks of the same repository in order to save both storage space and improve user experience &#xA;git&#39;s loose internal structure allows any shared object to be accessed from any other repository &#xA;&#xA;Thus, hilarity ensues on a fairly regular basis: &#xA;&#xA;https://github.com/torvalds/linux/blob/ac632c504d0b881d7cfb44e3fdde3ec30eb548d9/Makefile#L6 &#xA;https://github.com/torvalds/linux/blob/8bcab0346d4fcf21b97046eb44db8cf37ddd6da0/README&#xA;&#xA;Every time this happens, many wonder how come this isn&#39;t treated like a nasty security bug, and the answer, inevitably, is &#34;it&#39;s complicated.&#34; &#xA;&#xA;Blobs, trees, commits, oh my&#xA;&#xA;Under the hood, git repositories are a bunch of objects -- blobs, trees, and commits. Blobs are file contents, trees are directory listings that establish the relationship between file names and the blobs, and commits are like still frames in a movie reel that show where all the trees and blobs were at a specific point in time. Each next commit refers to the hash of the previous commit, which is how we know in what order these still frames should be put together to make a movie.&#xA;&#xA;Each of these objects has a hash value, which is how they are stored inside the git directory itself (look in .git/objects). When git was originally designed, over a decade ago, it didn&#39;t really have a concept of &#34;branches&#34; -- there was just a symlink HEAD pointing to the latest commit. If you wanted to work on several things at once, you simply cloned the repository and did it in a separate directory with its own HEAD. 
Cloning was a very efficient operation, as through the magic of hardlinking, hundreds of clones would take up about as much room on your disk as a single one.&#xA;&#xA;Fast-forward to today&#xA;&#xA;Git is a lot more complicated these days, but the basic concepts are the same. You still have blobs, trees, commits, and they are all still stored internally as hashes. Under the hood, git has developed quite a bit over the past decade to make it more efficient to store and retrieve millions and tens of millions of repository objects. Most of them are now stored inside special pack files, which are organized rather similar to compressed video clips -- formats like webm don&#39;t really store each frame in a separate image, as there is usually very little difference between any two adjacent frames. It makes much more sense to store just the difference (&#34;delta&#34;) between two still images until you come to a designated &#34;key frame&#34;. &#xA;&#xA;Similarly, when generating pack files, git will try to calculate the deltas between objects and only store their incremental differences -- at least until it decides that it&#39;s time to start from a new &#34;key frame&#34; just so checking out a tag from a year ago doesn&#39;t require replaying a year worth of diffs. At the same time, there has been a lot of work to make the act of pushing/pulling objects more efficient. When someone sends you a pull request and you want to review their changes, you don&#39;t want to download their entire tree. Your git client and the remote git server compare what objects they already have on each end, with the goal to send you just the objects that you are lacking.&#xA;&#xA;Optimizing public forks&#xA;&#xA;If you look at the GitHub links above, check out how many forks torvalds/linux has on that hosting service. Right now, that number says &#34;41.1k&#34;. With the best kinds of optimizations in place, a bare linux.git repository takes up roughtly 3 GB on disk. 
Doing quick math, if each one of these 41.1k forks were completely standalone, that would require about 125 TB of disk storage. Throw in a few hundred terabytes for all the forks of Chromium, Android, and Gecko, and soon you&#39;re talking Real Large Numbers. Which is why nobody actually does it this way.&#xA;&#xA;Remember how I said that git forks were designed to be extremely efficient and reuse the objects between clones? This is how forks are actually organized on GitHub (and git.kernel.org, for that matter), except it&#39;s a bit more complicated these days than simply hardlinking the contents of .git/objects around.&#xA;&#xA;On git.kernel.org side of things we store the objects from all forks of linux.git in a single &#34;object storage&#34; repository (see https://pypi.org/project/grokmirror/ for the gory details). This has many positive side-effects: &#xA;&#xA;all of git.kernel.org, with its hundreds of linux.git forks takes up just 30G of disk space&#xA;when Linus merges his usual set of pull requests and performs &#34;git push&#34;, he only has to send a very small subset of those objects, because we probably already have most of them &#xA;similarly, when maintainers pull, rebase, and push their own forks, they don&#39;t have to send any of the objects back to us, as we already have them&#xA;&#xA;Object sharing allows to greatly improve not only the backend infrastructure on our end, but also the experience of git&#39;s end-users who directly benefit from not having to push around nearly as many bits.&#xA;&#xA;The dark side of object sharing &#xA;&#xA;With all the benefits of object sharing comes one important downside -- namely, you can access any shared object through any of the forks. So, if you fork linux.git and push your own commit into it, any of the 41.1k forks will have access to the objects referenced by your commit. 
If you know the hash of that object, and if the web ui allows to access arbitrary repository objects by their hash, you can even view and link to it from any of the forks, making it look as if that object is actually part of that particular repository (which is how we get the links at the start of this article).&#xA;&#xA;So, why can&#39;t GitHub (or git.kernel.org) prevent this from happening? Remember when I said that a git repository is like a movie full of adjacent still frames? When you look at a scene in a movie, it is very easy for you to identify all objects in any given still frame -- there is a street, a car, and a person. However, if I show you a picture of a car and ask you &#34;does this car show up in this movie,&#34; the only way you can answer this question is by watching the entire thing from the beginning to the end, carefully scrutinizing every shot. &#xA;&#xA;In just the same way, to check if a blob from the shared repository actually belongs in a fork, git has to look at all that repository&#39;s tips and work its way backwards, commit by commit, to see if any of the tree objects reference that particular blob. Needless to say, this is an extremely expensive operation, which, if enabled, would allow anyone to easily DoS a git server with only a handful of requests.&#xA;&#xA;This may change in the future, though. For example, if you access a commit that is not part of a repository, GitHub will now show you a warning message:&#xA;&#xA;https://github.com/torvalds/linux/commit/ac632c504d0b881d7cfb44e3fdde3ec30eb548d9&#xA;&#xA;Looking up &#34;does this commit belong in this repository&#34; used to be a very expensive operation, too, until git learned to generate commit graphs (see man git-commit-graph). 
It is possible that at some point in the future a similar feature will land that will make it easy to perform a similar check for the blob, which will allow GitHub to show a similar warning when someone accesses shared blobs by their hash from the wrong repo.&#xA;&#xA;Why this isn&#39;t a security bug &#xA;&#xA;Just because an object is part of the shared storage doesn&#39;t really have any impact on the forks. When you perform a git-aware operation like &#34;git clone&#34; or &#34;git pull,&#34; git-daemon will only send the objects actually belonging to that repository. Furthermore, your git client deliberately doesn&#39;t trust the remote to send the right stuff, so it will perform its own connectivity checks before accepting anything from the server.&#xA;&#xA;If you&#39;re extra paranoid, you&#39;re encouraged to set receive.fsckObjects for some additional protection against in-flight object corruption, and if you&#39;re really serious about securing your repositories, then you should set up and use git object signing: &#xA;&#xA;https://git-scm.com/docs/git-config#Documentation/git-config.txt-receivefsckObjects &#xA;https://people.kernel.org/monsieuricon/what-does-a-pgp-signature-on-a-git-commit-prove&#xA;&#xA;This is, incidentally, also how you would be able to verify whether commits were made by the actual Linus Torvalds or merely by someone pretending to be him.&#xA;&#xA;Parting words&#xA;&#xA;This neither proves nor disproves the identity of &#34;Satoshi.&#34; However, given Linus&#39;s widely known negative opinions of C++, it&#39;s probably not very likely that it&#39;s the language he&#39;d pick to write some proof of concept code.]]&gt;</description>
      <content:encoded><![CDATA[<p>Once every couple of years someone unfailingly takes advantage of the following two facts:</p>
<ol><li>most large git hosting providers set up object sharing between forks of the same repository in order to save both storage space and improve user experience</li>
<li>git&#39;s loose internal structure allows any shared object to be accessed from any other repository</li></ol>

<p>Thus, hilarity ensues on a fairly regular basis:</p>
<ul><li><a href="https://github.com/torvalds/linux/blob/ac632c504d0b881d7cfb44e3fdde3ec30eb548d9/Makefile#L6" rel="nofollow">https://github.com/torvalds/linux/blob/ac632c504d0b881d7cfb44e3fdde3ec30eb548d9/Makefile#L6</a></li>
<li><a href="https://github.com/torvalds/linux/blob/8bcab0346d4fcf21b97046eb44db8cf37ddd6da0/README" rel="nofollow">https://github.com/torvalds/linux/blob/8bcab0346d4fcf21b97046eb44db8cf37ddd6da0/README</a></li></ul>

<p>Every time this happens, many wonder how come this isn&#39;t treated like a nasty security bug, and the answer, inevitably, is “it&#39;s complicated.”</p>

<h2 id="blobs-trees-commits-oh-my">Blobs, trees, commits, oh my</h2>

<p>Under the hood, git repositories are a bunch of objects — blobs, trees, and commits. Blobs are file contents, trees are directory listings that establish the relationship between file names and the blobs, and commits are like still frames in a movie reel that show where all the trees and blobs were at a specific point in time. Each next commit refers to the hash of the previous commit, which is how we know in what order these still frames should be put together to make a movie.</p>

<p>Each of these objects has a hash value, which is how they are stored inside the git directory itself (look in <code>.git/objects</code>). When git was originally designed, over a decade ago, it didn&#39;t really have a concept of “branches” — there was just a symlink <code>HEAD</code> pointing to the latest commit. If you wanted to work on several things at once, you simply cloned the repository and did it in a separate directory with its own <code>HEAD</code>. Cloning was a very efficient operation, as through the magic of hardlinking, hundreds of clones would take up about as much room on your disk as a single one.</p>
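<p>As an illustration of how those hashes come about: in a SHA-1 repository, a blob&#39;s object ID is simply the SHA-1 of a short header plus the file contents, which is why identical content always ends up as the same object no matter which fork pushed it:</p>

```python
# How git names a blob (in a SHA-1 repository): hash a small header
# ("blob <size>\0") followed by the raw file contents.
import hashlib

def git_blob_id(content: bytes) -> str:
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"hello\n"))
# Matches `echo hello | git hash-object --stdin`:
# ce013625030ba8dba906f756967f9e9ca394464a
```
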

<h2 id="fast-forward-to-today">Fast-forward to today</h2>

<p>Git is a lot more complicated these days, but the basic concepts are the same. You still have blobs, trees, commits, and they are all still stored internally as hashes. Under the hood, git has developed quite a bit over the past decade to make it more efficient to store and retrieve millions and tens of millions of repository objects. Most of them are now stored inside special pack files, which are organized rather like compressed video clips — formats like webm don&#39;t really store each frame as a separate image, as there is usually very little difference between any two adjacent frames. It makes much more sense to store just the difference (“delta”) between two still images until you come to a designated “key frame”.</p>

<p>Similarly, when generating pack files, git will try to calculate the deltas between objects and only store their incremental differences — at least until it decides that it&#39;s time to start from a new “key frame”, just so checking out a tag from a year ago doesn&#39;t require replaying a year&#39;s worth of diffs. At the same time, there has been a lot of work to make the act of pushing/pulling objects more efficient. When someone sends you a pull request and you want to review their changes, you don&#39;t want to download their entire tree. Your git client and the remote git server compare what objects they already have on each end, with the goal of sending you <em>just</em> the objects that you are lacking.</p>
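<p>You can watch the delta machinery at work yourself. A sketch (repository contents invented for illustration): commit two near-identical files, repack, and ask <code>git verify-pack</code> to list the packed objects; deltified entries carry an extra delta-depth and base-object column:</p>

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email you@example.com   # placeholder identity
git config user.name "Example"

seq 1 1000 > data.txt && git add data.txt && git commit -qm 'frame 1'
seq 1 1001 > data.txt && git add data.txt && git commit -qm 'frame 2'

# Pack everything; git searches for delta candidates while packing
git repack -adq

# Columns: sha1 type size size-in-pack offset [depth base-sha1]
git verify-pack -v .git/objects/pack/*.idx | head -n 6
```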

<h2 id="optimizing-public-forks">Optimizing public forks</h2>

<p>If you look at the GitHub links above, check out how many forks torvalds/linux has on that hosting service. Right now, that number says “41.1k”. With the best kinds of optimizations in place, a bare linux.git repository takes up roughly 3 GB on disk. Doing quick math, if each one of these 41.1k forks were completely standalone, that would require about 125 TB of disk storage. Throw in a few hundred terabytes for all the forks of Chromium, Android, and Gecko, and soon you&#39;re talking Real Large Numbers. Which is why nobody actually does it this way.</p>

<p>Remember how I said that git clones were designed to be extremely efficient and to reuse the objects between clones? This is how forks are actually organized on GitHub (and git.kernel.org, for that matter), except it&#39;s a bit more complicated these days than simply hardlinking the contents of <code>.git/objects</code> around.</p>

<p>On git.kernel.org side of things we store the objects from all forks of linux.git in a single “object storage” repository (see <a href="https://pypi.org/project/grokmirror/" rel="nofollow">https://pypi.org/project/grokmirror/</a> for the gory details). This has many positive side-effects:</p>
<ul><li>all of git.kernel.org, with its hundreds of linux.git forks takes up just 30G of disk space</li>
<li>when Linus merges his usual set of pull requests and performs “git push”, he only has to send a very small subset of those objects, because we probably already have most of them</li>
<li>similarly, when maintainers pull, rebase, and push their own forks, they don&#39;t have to send any of the objects back to us, as we already have them</li></ul>

<p>Object sharing allows us to greatly improve not only the backend infrastructure on our end, but also the experience of git&#39;s end-users, who directly benefit from not having to push around nearly as many bits.</p>
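<p>The low-level git primitive underneath this kind of sharing is the alternates mechanism: a repository&#39;s <code>.git/objects/info/alternates</code> file can point at another repository&#39;s object store, and <code>git clone --shared</code> sets it up for you. A small sketch (grokmirror and GitHub layer their own tooling on top of this primitive):</p>

```shell
set -e
work=$(mktemp -d)
cd "$work"

git init -q parent && cd parent
git config user.email you@example.com   # placeholder identity
git config user.name "Example"
echo hi > f && git add f && git commit -qm init
cd ..

# A "fork" that borrows parent's objects instead of copying them
git clone -q --shared parent fork

# The alternates file points at the parent's object store
cat fork/.git/objects/info/alternates
```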

<h2 id="the-dark-side-of-object-sharing">The dark side of object sharing</h2>

<p>With all the benefits of object sharing comes one important downside — namely, you can access any shared object through any of the forks. So, if you fork linux.git and push your own commit into it, any of the 41.1k forks will have access to the objects referenced by your commit. If you know the hash of that object, and if the web UI allows access to arbitrary repository objects by their hash, you can even view and link to it from any of the forks, making it look as if that object is actually part of that particular repository (which is how we get the links at the start of this article).</p>

<p>So, why can&#39;t GitHub (or git.kernel.org) prevent this from happening? Remember when I said that a git repository is like a movie full of adjacent still frames? When you look at a scene in a movie, it is very easy for you to identify all objects in any given still frame — there is a street, a car, and a person. However, if I show you a picture of a car and ask you “does this car show up in this movie,” the only way you can answer this question is by watching the entire thing from the beginning to the end, carefully scrutinizing every shot.</p>

<p>In just the same way, to check if a blob from the shared repository actually belongs in a fork, git has to look at all that repository&#39;s tips and work its way backwards, commit by commit, to see if any of the tree objects reference that particular blob. Needless to say, this is an extremely expensive operation, which, if enabled, would allow anyone to easily DoS a git server with only a handful of requests.</p>
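<p>Git actually ships a tool for exactly this exhaustive scan: <code>git log --find-object</code> walks history looking for commits that touch a given blob, and on a large repository you can feel just how expensive that walk is. A sketch in a toy repository (contents invented for illustration):</p>

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email you@example.com   # placeholder identity
git config user.name "Example"

echo 'secret' > s.txt && git add s.txt && git commit -qm 'add secret'
blob=$(git rev-parse HEAD:s.txt)

# Walk every commit on every ref to find ones referencing this blob --
# the "watch the whole movie" operation described above
git log --all --oneline --find-object="$blob"
```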

<p>This may change in the future, though. For example, if you access a commit that is not part of a repository, GitHub will now show you a warning message:</p>
<ul><li><a href="https://github.com/torvalds/linux/commit/ac632c504d0b881d7cfb44e3fdde3ec30eb548d9" rel="nofollow">https://github.com/torvalds/linux/commit/ac632c504d0b881d7cfb44e3fdde3ec30eb548d9</a></li></ul>

<p>Looking up “does this commit belong in this repository” used to be a very expensive operation, too, until git learned to generate commit graphs (see <code>man git-commit-graph</code>). It is possible that at some point in the future a similar feature will land that will make it easy to perform a similar check for the blob, which will allow GitHub to show a similar warning when someone accesses shared blobs by their hash from the wrong repo.</p>
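<p>The commit-graph file mentioned above can also be generated explicitly; once it exists, reachability questions such as “is commit X an ancestor of ref Y” become cheap lookups instead of full history walks. A sketch:</p>

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email you@example.com   # placeholder identity
git config user.name "Example"

echo a > f && git add f && git commit -qm one
echo b > f && git add f && git commit -qm two

# Precompute the commit graph for all reachable commits...
git commit-graph write --reachable
# ...and check the resulting file for correctness
git commit-graph verify
ls .git/objects/info/commit-graph

# An ancestry query that benefits from the precomputed graph
git merge-base --is-ancestor HEAD~1 HEAD && echo ancestor
```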

<h2 id="why-this-isn-t-a-security-bug">Why this isn&#39;t a security bug</h2>

<p>Just because an object is part of the shared storage doesn&#39;t really have any impact on the forks. When you perform a git-aware operation like “git clone” or “git pull,” <code>git-daemon</code> will only send the objects actually belonging to that repository. Furthermore, your git client deliberately doesn&#39;t trust the remote to send the right stuff, so it will perform its own connectivity checks before accepting anything from the server.</p>

<p>If you&#39;re extra paranoid, you&#39;re encouraged to set <code>receive.fsckObjects</code> for some additional protection against in-flight object corruption, and if you&#39;re really serious about securing your repositories, then you should set up and use git object signing:</p>
<ul><li><a href="https://git-scm.com/docs/git-config#Documentation/git-config.txt-receivefsckObjects" rel="nofollow">https://git-scm.com/docs/git-config#Documentation/git-config.txt-receivefsckObjects</a></li>
<li><a href="https://people.kernel.org/monsieuricon/what-does-a-pgp-signature-on-a-git-commit-prove" rel="nofollow">https://people.kernel.org/monsieuricon/what-does-a-pgp-signature-on-a-git-commit-prove</a></li></ul>
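<p>As a sketch, the fsck knobs look like this (shown with local scope in a scratch repository; use <code>--global</code> to apply them to all your clones):</p>

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Verify object integrity on every transfer (covers both directions)
git config transfer.fsckObjects true
# Or use the direction-specific variants:
git config fetch.fsckObjects true
git config receive.fsckObjects true

git config transfer.fsckObjects   # prints: true
```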

<p>This is, incidentally, also how you would be able to verify whether commits were made by the actual Linus Torvalds or merely by someone pretending to be him.</p>

<h2 id="parting-words">Parting words</h2>

<p>This neither proves nor disproves the identity of “Satoshi.” However, given Linus&#39;s widely known negative opinions of C++, it&#39;s probably not very likely that it&#39;s the language he&#39;d pick to write some proof of concept code.</p>
]]></content:encoded>
      <author>Konstantin Ryabitsev</author>
      <guid>https://people.kernel.org/read/a/38qr4y50b2</guid>
      <pubDate>Fri, 28 Jan 2022 18:42:00 +0000</pubDate>
    </item>
    <item>
      <title>FOSDEM 2022</title>
      <link>https://people.kernel.org/metan/fosdem-2022</link>
      <description>&lt;![CDATA[Unfortunately FOSDEM is going to be virtual again this year, but that does not stop us from organizing the testing and automation devroom. Have a look at our CfP and if you have something interesting to present go ahead and fill in a submission!]]&gt;</description>
      <content:encoded><![CDATA[<p>Unfortunately FOSDEM is going to be virtual again this year, but that does not stop us from organizing the testing and automation devroom. Have a look at our <a href="https://lists.fosdem.org/pipermail/fosdem/2021q4/003318.html" rel="nofollow">CfP</a> and if you have something interesting to present go ahead and fill in a submission!</p>
]]></content:encoded>
      <author>metan&#39;s blog</author>
      <guid>https://people.kernel.org/read/a/skllquanqq</guid>
      <pubDate>Thu, 09 Dec 2021 11:40:44 +0000</pubDate>
    </item>
    <item>
      <title>Switching trees and maintainer rotation</title>
      <link>https://people.kernel.org/nmenon/switching-trees-and-maintainer-rotation</link>
      <description>&lt;![CDATA[One of the cool things with kernel.org is the fact that we can rotate maintainership depending on workload. So,&#xA;https://git.kernel.org/pub/scm/linux/kernel/git/nmenon/linux.git/ is now my personal tree and we have picked up https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux.git/ as a co-maintained TI tree that Vignesh and I rotate responsibilities with Tony Lindgren and Tero in backup.&#xA;&#xA;Thanks to Konstantin and Stephen in making this happen.!&#xA;&#xA;NOTE: No change in Tony&#39;s tree @ https://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap.git/]]&gt;</description>
      <content:encoded><![CDATA[<p>One of the cool things with kernel.org is the fact that we can rotate maintainership depending on workload. So,
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/nmenon/linux.git/" rel="nofollow">https://git.kernel.org/pub/scm/linux/kernel/git/nmenon/linux.git/</a> is now my personal tree and we have picked up <a href="https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux.git/" rel="nofollow">https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux.git/</a> as a co-maintained TI tree that Vignesh and I rotate responsibilities with Tony Lindgren and Tero in backup.</p>

<p>Thanks to Konstantin and Stephen for making this happen!</p>

<p>NOTE: No change in Tony&#39;s tree @ <a href="https://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap.git/" rel="nofollow">https://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap.git/</a></p>
]]></content:encoded>
      <author>nmenon</author>
      <guid>https://people.kernel.org/read/a/12gi1gkbmi</guid>
      <pubDate>Tue, 23 Nov 2021 01:35:25 +0000</pubDate>
    </item>
    <item>
      <title>Record breaking LTP release</title>
      <link>https://people.kernel.org/metan/record-breaking-ltp-release</link>
      <description>&lt;![CDATA[As usual we had a LTP release at the end of the September. What was unusual though is the number of patches that went it, we got 483 patches, which is about +150 than the last three releases. And the number of patches did slowly grow even before that.&#xA;&#xA;While it&#39;s great and I&#39;m happy that the project is growing, there is a catch, grow like this puts additional strain on the maintainers, particularly on the patch reviewers. For me it was +120 patches reviewed during the four months period and that only counts the final versions of patches that were accepted to the repository, it&#39;s not unusual to have three or more revisions before the work is ready to be merged.&#xA;&#xA;While I managed to cope with it reasonably fine the work that I had on TODO for the project was stalled. One of the things I finally want to move forward is making the runltp-ng official LTP test runner, but there is much more. So the obvious question is how to make things better and one of the things we came up was automation.&#xA;&#xA;What we implemented for LTP is &#39;make check&#39; that runs different tools on the test source code that is supposed to be used before patch is sent for a review. For C code we use the well known checkpatch.pl and custom sparse based checker to identify most common problems. The tooling is set up automatically when you call &#39;make check&#39; for a first time and we tried to make it as effortless as possible, so that there is no reason not to use during the development. We also use checkbashism.pl for shell code and hopefully the number of checks will grow over the time. 
Hopefully this should eliminate on average at least one revision for a patchset which would be hundreds of patches during our development cycle.&#xA;&#xA;Ideally this will fix the problem for a while and we will make more effective use of our resources, but eventually we will get to a point where more maintainers and reviewers are needed, which is problem that is hard to solve without your help.&#xA;]]&gt;</description>
<content:encoded><![CDATA[<p>As usual we had an LTP release at the end of September. What was unusual, though, was the number of patches that went in: we got 483 patches, about 150 more than in each of the last three releases. And the number of patches had been growing slowly even before that.</p>

<p>While it&#39;s great and I&#39;m happy that the project is growing, there is a catch: growth like this puts additional strain on the maintainers, particularly on the patch reviewers. For me that meant more than 120 patches reviewed during the four-month period, and that only counts the final versions of patches that were accepted into the repository; it&#39;s not unusual to have three or more revisions before the work is ready to be merged.</p>

<p>While I managed to cope with it reasonably well, the work that I had on my TODO list for the project stalled. One of the things I finally want to move forward is making runltp-ng the official LTP test runner, but there is much more. So the obvious question is how to make things better, and one of the things we came up with was automation.</p>

<p>What we implemented for LTP is a &#39;make check&#39; target that runs different tools on the test source code and is supposed to be used before a patch is sent for review. For C code we use the well-known checkpatch.pl and a custom sparse-based checker to identify the most common problems. The tooling is set up automatically when you call &#39;make check&#39; for the first time, and we tried to make it as effortless as possible, so that there is no reason not to use it during development. We also use checkbashism.pl for shell code, and hopefully the number of checks will grow over time. This should eliminate, on average, at least one revision per patchset, which would amount to hundreds of patches over our development cycle.</p>

<p>Ideally this will fix the problem for a while and we will make more effective use of our resources, but eventually we will get to a point where more maintainers and reviewers are needed, which is a problem that is hard to solve without your help.</p>
]]></content:encoded>
      <author>metan&#39;s blog</author>
      <guid>https://people.kernel.org/read/a/56809dxdrn</guid>
      <pubDate>Tue, 12 Oct 2021 11:20:14 +0000</pubDate>
    </item>
    <item>
      <title>The state of LTP after ten years of development</title>
      <link>https://people.kernel.org/metan/the-state-of-ltp-after-ten-years-of-development</link>
      <description>&lt;![CDATA[We have reached an important milestone with latest LTP release - the amount of testcases written in the new test library finally outnumbers the amount of old library tests. Which is nice opportunity for a small celebration and also to look back a bit into a history and try to summarize what has happened over the last 10 years in LTP.&#xA;&#xA;I&#39;ve joined LTP development a bit more than 10 years ago in 2009. At that point we were really struggling with the basics. The build system was collection of random Makefiles and the build often failed for very random reasons. The were pieces of shell code embedded in Makefiles for instance to check for devel libraries, manually written shell loops over directories that prevented parallel build, and all kind of ugly mess like that. This has changed and at the end of 2009 as the build system was rewritten, with that LTP supported proper parallel build, started to use autoconf for feature checks, etc. We also switched from CVS to GIT at the end of the 2009, which was huge improvement as well.&#xA;&#xA;However that was only a start, LTP was easier to build, git was nicer to use, but we still had tests that were mostly failing and fair amount of the tests were producing nothing but noise. There were also tests that didn&#39;t produce real results and always passed but it&#39;s really hard to count these unless you review the code carefully one testcase at a time, which is part of what we still do even after ten years of work.&#xA;&#xA;From that point on it took us a few years to clear the worst parts and to deal with most of the troublemakers and the results from LTP were gradually getting greener and more stable as well. We are far from being bugless, there are still parts covered in dust that are waiting for attention, but we are getting there. 
For instance in this release we finally got a nice cgroup test library that simplifies cgroup testcases and we should fix rest of the cgroup tests ideally before the next one. Also I&#39;m quite happy that the manpower put into LTP development slowly increases, however compared to the efforts put into the kernel development the situation is still dire. I used to tell people that the amount of work put into Linux automated testing is a bad joke back then. These days it&#39;s much better but still hardly optimal as we struggle to keep up with covering the newly introduced kernel features.&#xA;&#xA;At the start I&#39;ve mentioned new test library so I should explain how we came to this and why it&#39;s superior to what we had previously. First of all there was a test library in LTP that could be traced back to SGI and was released under GPL more than 20 years ago, it&#39;s probably even older than that though. The main problems with the library was that it was cumbersome to use. There were some API reporting functions, but these were not thread safe nor could be used in child processes. You had to propagate test results manually in these two cases which was prone to errors. Even worse since the test implemented the main() function you had to return the overall result manually as well and forgetting to do so was one of the common mistakes. At a point where most of the broken tests were finally fixed I had a bit of time to invest into a future and after seven years of dealing with a common test mistakes and I had a pretty good picture of what a test library should look like and what should be avoided. Hence I&#39;ve sat down and designed library that is nice and fun to use and makes tests much easier to write. 
This library still evolves over the time, the version introduced in 2016 wasn&#39;t as nice as it is now, but even when it was introduced it included the most important bits, for instance thread safe and automatic test result propagation or synchronization primitives that could be used even to synchronize shell code against C binary.&#xA;&#xA;The old library is still present in LTP since we are a bit more than halfway done converting the tests, which is no easy task since we have still more than 600 tests to go. And as we are converting the test we are also reviewing them to ensure that the assertions are correct and the coverage isn&#39;t lacking. We still find tests that fail to report results from time to time even now, which only show how hard is to eliminate mistakes like this and why preventing them in the first place is right thing to do. And if things will go well the rest of tests should be converted in about 5 years and LTP should be finally free of the historical baggage. At that point I guess that I will throw a small celebration since that would conclude a huge task I&#39;ve been working on for a decade now.]]&gt;</description>
<content:encoded><![CDATA[<p>We have reached an important milestone with the <a href="https://github.com/linux-test-project/ltp/releases/tag/20210524" rel="nofollow">latest LTP release</a> – the number of testcases written in the new test library finally outnumbers the number of old-library tests. This is a nice opportunity for a small celebration, and also to look back into history and try to summarize what has happened in LTP over the last 10 years.</p>

<p>I joined LTP development a bit more than 10 years ago, in 2009. At that point we were really struggling with the basics. The build system was a collection of random Makefiles, and the build often failed for very random reasons. There were pieces of shell code embedded in Makefiles, for instance to check for devel libraries, manually written shell loops over directories that prevented parallel builds, and all kinds of ugly mess like that. This changed at the end of 2009 when the build system was rewritten; with that, LTP gained a proper parallel build, started to use autoconf for feature checks, etc. We also switched from CVS to git at the end of 2009, which was a huge improvement as well.</p>

<p>However, that was only a start: LTP was easier to build and git was nicer to use, but we still had tests that were mostly failing, and a fair amount of the tests produced nothing but noise. There were also tests that didn&#39;t produce real results and always passed, but it&#39;s really hard to count these unless you review the code carefully one testcase at a time, which is part of what we still do even after ten years of work.</p>

<p>From that point on it took us a few years to clear out the worst parts and deal with most of the troublemakers, and the results from LTP gradually got greener and more stable as well. We are far from being bugless; there are still parts covered in dust that are waiting for attention, but we are getting there. For instance, in this release we finally got a nice cgroup test library that simplifies cgroup testcases, and we should fix the rest of the cgroup tests, ideally before the next release. Also I&#39;m quite happy that the manpower put into LTP development is slowly increasing; however, compared to the effort put into kernel development, the situation is still dire. Back then I used to tell people that the amount of work put into automated Linux testing was a bad joke. These days it&#39;s much better, but still hardly optimal, as we struggle to keep up with covering newly introduced kernel features.</p>

<p>At the start I mentioned the new test library, so I should explain how we came to it and why it&#39;s superior to what we had previously. First of all, there was a test library in LTP that could be traced back to SGI and was released under the GPL more than 20 years ago; it&#39;s probably even older than that. The main problem with that library was that it was cumbersome to use. There were some API reporting functions, but these were not thread safe, nor could they be used in child processes. You had to propagate test results manually in those two cases, which was prone to errors. Even worse, since the test implemented the main() function, you had to return the overall result manually as well, and forgetting to do so was one of the common mistakes. At the point where most of the broken tests were finally fixed, I had a bit of time to invest in the future, and after seven years of dealing with common test mistakes I had a pretty good picture of what a test library should look like and what should be avoided. Hence I sat down and designed a library that is nice and fun to use and makes tests much easier to write. This library still evolves over time; the version introduced in 2016 wasn&#39;t as nice as it is now, but even then it included the most important bits, for instance thread-safe and automatic test result propagation, and synchronization primitives that can be used even to synchronize shell code against a C binary.</p>

<p>The old library is still present in LTP, since we are only a bit more than halfway done converting the tests, which is no easy task as we still have more than 600 tests to go. And as we convert the tests we also review them to ensure that the assertions are correct and the coverage isn&#39;t lacking. We still find tests that fail to report results from time to time even now, which only shows how hard it is to eliminate mistakes like this and why preventing them in the first place is the right thing to do. If things go well, the rest of the tests should be converted in about 5 years, and LTP should finally be free of its historical baggage. At that point I guess I will throw a small celebration, since that would conclude a huge task I&#39;ve been working on for a decade now.</p>
]]></content:encoded>
      <author>metan&#39;s blog</author>
      <guid>https://people.kernel.org/read/a/sisgg6avz0</guid>
      <pubDate>Thu, 27 May 2021 14:36:56 +0000</pubDate>
    </item>
    <item>
      <title>FOSDEM Testing and Automation CfP</title>
      <link>https://people.kernel.org/metan/fosdem-testing-and-automation-cfp</link>
      <description>&lt;![CDATA[FOSDEM Testing and Automation CfP&#xA;&#xA;We are organizing Testing and Automation devroom on FOSDEM this year again details at: https://fosdem-testingautomation.github.io/]]&gt;</description>
      <content:encoded><![CDATA[<p><a href="https://fosdem-testingautomation.github.io/" rel="nofollow"><img src="https://kiwitcms.eu/images/fosdem/2021/banner.png" alt="FOSDEM Testing and Automation CfP"></a></p>

<p>We are organizing Testing and Automation devroom on FOSDEM this year again details at: <a href="https://fosdem-testingautomation.github.io/" rel="nofollow">https://fosdem-testingautomation.github.io/</a></p>
]]></content:encoded>
      <author>metan&#39;s blog</author>
      <guid>https://people.kernel.org/read/a/8tx8sn0ydu</guid>
      <pubDate>Tue, 08 Dec 2020 14:23:17 +0000</pubDate>
    </item>
    <item>
      <title>C++ rvalue references</title>
      <link>https://people.kernel.org/joelfernandes/c-rvalue-references</link>
      <description>&lt;![CDATA[The writer works in the ChromeOS kernel team, where most of the system libraries, low-level components and user space is written in C++. Thus the writer has no choice but to be familiar with C++. It is not that hard, but some things are confusing. rvalue references are definitely confusing.&#xA;&#xA;In this post, I wish to document rvalue references by simple examples, before I forget it.&#xA;&#xA;Refer to this article for in-depth coverage on rvalue references.&#xA;&#xA;In a nutshell: An rvalue reference can be used to construct a C++ object efficiently using a &#34;move constructor&#34;. This efficiency is achieved by the object&#39;s move constructor by moving the underlying memory of the object efficiently to the destination instead of a full copy. Typically the move constructor of the object will copy pointers within the source object into the destination object, and null the pointer within the source object.&#xA;&#xA;An rvalue reference is denoted by a double ampersand (&amp;&amp;) when you want to create an rvalue reference as a variable.&#xA;&#xA;For example T &amp;&amp;y; defines a variable y which holds an rvalue reference of type T. I have almost never seen an rvalue reference variable created this way in real code. I also have no idea when it can be useful. Almost always they are created by either of the 2 methods in the next section. These methods create an &#34;unnamed&#34; rvalue reference which can be passed to a class&#39;s move constructor.&#xA;&#xA;When is an rvalue reference created?&#xA;&#xA;In the below example, we create an rvalue reference to a vector, and create another vector object from this.&#xA;&#xA;This can happen in 2 ways (that I know off):&#xA;1. 
Using std::move&#xA;This converts an lvalue reference to an rvalue reference.&#xA;&#xA;Example:&#xA;include iostream&#xA;include vector&#xA;&#xA;int main()&#xA;{&#xA;    int px, py;&#xA;    std::vectorint x = {4,3};&#xA;    px = &amp;(x[0]);&#xA; &#xA;    // Convert lvalue &#39;x&#39; to rvalue reference and pass&#xA;    // it to vector&#39;s overloaded move constructor.&#xA;    std::vectorint y(std::move(x)); &#xA;    py = &amp;(y[0]);&#xA;&#xA;    // Confirm the new vector uses same storage&#xA;    printf(&#34;same vector? : %d\n&#34;, px == py); // prints 1&#xA;}&#xA;&#xA;2. When returning something from a function&#xA;The returned object from the function can be caught as an rvalue reference to that object.&#xA;include iostream&#xA;include vector&#xA;&#xA;int pret;&#xA;int py;&#xA;&#xA;std::vectorint myf(int a)&#xA;{&#xA;    vectorint ret;&#xA;&#xA;    ret.pushback(a  a);&#xA;&#xA;    pret = &amp;(ret[0]);&#xA;&#xA;    // Return is caught as an rvalue ref: vectorint &amp;&amp;&#xA;    return ret;&#xA;}&#xA;&#xA;int main()&#xA;{&#xA;    // Invoke vector&#39;s move constructor.&#xA;    std::vectorint y(myf(4)); &#xA;    py = &amp;(y[0]);&#xA;&#xA;    // Confirm the vectors share the same underlying storage&#xA;    printf(&#34;same vector? : %d\n&#34;, pret == py); // prints 1&#xA;}&#xA;&#xA;Note on move asssignment&#xA;Interestingly, if you construct vector &#39;y&#39; using the assignment operator: std::vectorint y = myf(4);, the compiler may decide to use the move constructor automatically even though assignment is chosen. I believe this is because of vector&#39;s move assignment operator overload.&#xA;&#xA;Further, the compiler may even not invoke a constructor at all and just perform RVO (Return Value Optimization).&#xA;&#xA;Quiz&#xA;Question:&#xA;If I create a named rvalue reference using std::move and then use this to create a vector, the underlying storage of the new vector is different. 
Why?&#xA;&#xA;include iostream&#xA;include vector&#xA;&#xA;int pret;&#xA;int py;&#xA;&#xA;std::vectorint myf(int a)&#xA;{&#xA;    vectorint ret;&#xA;&#xA;    ret.push_back(a  a);&#xA;&#xA;    pret = &amp;(ret[0]);&#xA;&#xA;    // Return is caught as an rvalue ref: vectorint &amp;&amp;&#xA;    return ret;&#xA;}&#xA;&#xA;int main()&#xA;{&#xA;    // Invoke vector&#39;s move constructor.&#xA;    std::vectorint&amp;&amp; ref = myf(4);&#xA;    std::vectorint y(ref); &#xA;    py = &amp;(y[0]);&#xA;&#xA;    // Confirm the vectors share the same underlying storage&#xA;    printf(&#34;same vector? : %d\n&#34;, pret == py); // prints 0&#xA;}&#xA;Answer&#xA;The answer is: because the value category of the id-expression &#39;ref&#39; is lvalue, the copy constructor will be chosen. To use the move constructor, it has to be std::vectorint y(std::move(ref));.&#xA;&#xA;Conclusion&#xA;rvalue references are confusing and sometimes the compiler can do different optimizations to cause further confusion. It is best to follow well known design patterns when designing your code. It may be best to also try to avoid rvalue references altogether but hopefully this article helps you understand it a bit more when you come across large C++ code bases.&#xA;&#xA;]]&gt;</description>
<content:encoded><![CDATA[<p>The writer works in the ChromeOS kernel team, where most of the system libraries, low-level components and user space are written in C++. Thus the writer has no choice but to be familiar with C++. It is not that hard, but some things are confusing, and rvalue references are definitely confusing.</p>

<p>In this post, I wish to document rvalue references with simple examples, before I forget them.</p>

<p>Refer to <a href="https://www.chromium.org/rvalue-references" rel="nofollow">this article</a> for in-depth coverage on rvalue references.</p>

<p>In a nutshell: An rvalue reference can be used to construct a C++ object efficiently using a “move constructor”. This efficiency is achieved by the object&#39;s move constructor <em>moving</em> the underlying memory of the object to the destination instead of making a full copy. Typically the move constructor of the object will copy the pointers within the source object into the destination object, and null out the pointers within the source object.</p>

<p>An rvalue reference is denoted by a double ampersand (&amp;&amp;) when you want to create an rvalue reference as a variable.</p>

<p>For example <code>T &amp;&amp;y;</code> defines a variable y which holds an rvalue reference of type T (though note that, like any reference, it must actually be initialized at the point of declaration). I have almost never seen an rvalue reference variable created this way in real code, and I have no idea when it can be useful. Almost always they are created by one of the 2 methods in the next section. These methods create an “unnamed” rvalue reference which can be passed to a class&#39;s move constructor.</p>

<h2 id="when-is-an-rvalue-reference-created">When is an rvalue reference created?</h2>

<p>In the below example, we create an rvalue reference to a vector, and create another vector object from this.</p>

<p>This can happen in 2 ways (that I know of):</p>

<h3 id="1-using-std-move">1. Using std::move</h3>

<p>This converts an lvalue reference to an rvalue reference.</p>

<p>Example:</p>

<pre><code>#include &lt;iostream&gt;
#include &lt;vector&gt;

int main()
{
    int *px, *py;
    std::vector&lt;int&gt; x = {4,3};
    px = &amp;(x[0]);
 
    // Convert lvalue &#39;x&#39; to rvalue reference and pass
    // it to vector&#39;s overloaded move constructor.
    std::vector&lt;int&gt; y(std::move(x)); 
    py = &amp;(y[0]);

    // Confirm the new vector uses same storage
    printf(&#34;same vector? : %d\n&#34;, px == py); // prints 1
}
</code></pre>

<h3 id="2-when-returning-something-from-a-function">2. When returning something from a function</h3>

<p>The returned object from the function can be caught as an rvalue reference to that object.</p>

<pre><code>#include &lt;iostream&gt;
#include &lt;vector&gt;

int *pret;
int *py;

std::vector&lt;int&gt; myf(int a)
{
    std::vector&lt;int&gt; ret;

    ret.push_back(a * a);

    pret = &amp;(ret[0]);

    // Return is caught as an rvalue ref: vector&lt;int&gt; &amp;&amp;
    return ret;
}

int main()
{
    // Invoke vector&#39;s move constructor.
    std::vector&lt;int&gt; y(myf(4)); 
    py = &amp;(y[0]);

    // Confirm the vectors share the same underlying storage
    printf(&#34;same vector? : %d\n&#34;, pret == py); // prints 1
}
</code></pre>

<h3 id="note-on-move-asssignment">Note on move assignment</h3>

<p><a href="https://stackoverflow.com/questions/4986673/c11-rvalues-and-move-semantics-confusion-return-statement" rel="nofollow">Interestingly</a>, if you construct vector &#39;y&#39; with the <code>=</code> syntax: <code>std::vector&lt;int&gt; y = myf(4);</code>, the compiler still uses the move constructor (or elides the construction entirely). Despite the <code>=</code>, this is copy-initialization of a brand new object rather than assignment, so vector&#39;s <a href="https://en.cppreference.com/w/cpp/language/move_assignment" rel="nofollow">move assignment operator overload</a> is not involved; move assignment only comes into play when assigning to a vector that has already been constructed.</p>

<p>Further, the compiler may not invoke a move constructor at all and instead just perform RVO (Return Value Optimization), constructing the returned object directly in the caller&#39;s storage.</p>

<h2 id="quiz">Quiz</h2>

<h4 id="question">Question:</h4>

<p>If I create a named rvalue reference using std::move and then use this to create a vector, the underlying storage of the new vector is different. Why?</p>

<pre><code>#include &lt;iostream&gt;
#include &lt;vector&gt;

int *pret;
int *py;

std::vector&lt;int&gt; myf(int a)
{
    std::vector&lt;int&gt; ret;

    ret.push_back(a * a);

    pret = &amp;(ret[0]);

    // Return is caught as an rvalue ref: vector&lt;int&gt; &amp;&amp;
    return ret;
}

int main()
{
    // Invoke vector&#39;s move constructor.
    std::vector&lt;int&gt;&amp;&amp; ref = myf(4);
    std::vector&lt;int&gt; y(ref); 
    py = &amp;(y[0]);

    // Confirm the vectors share the same underlying storage
    printf(&#34;same vector? : %d\n&#34;, pret == py); // prints 0
}
</code></pre>

<h4 id="answer">Answer</h4>

<p>The answer is: because the value category of the id-expression &#39;ref&#39; is lvalue, the copy constructor will be chosen. To use the move constructor, it has to be <code>std::vector&lt;int&gt; y(std::move(ref));</code>. Even though &#39;ref&#39; is declared as an rvalue reference, any expression that names a variable is an lvalue.</p>

<h2 id="conclusion">Conclusion</h2>

<p>rvalue references are confusing, and the different optimizations the compiler can perform sometimes cause further confusion. It is best to follow well-known design patterns when designing your code. It may even be best to avoid rvalue references altogether, but hopefully this article helps you understand them a bit more when you come across large C++ code bases.</p>
]]></content:encoded>
      <author>joelfernandes</author>
      <guid>https://people.kernel.org/read/a/iko4coh2ge</guid>
      <pubDate>Mon, 26 Oct 2020 02:01:48 +0000</pubDate>
    </item>
    <item>
      <title>XDP vs OVS</title>
      <link>https://people.kernel.org/dsahern/xdp-vs-ovs</link>
      <description>&lt;![CDATA[Long overdue blog post on XDP; so many details uncovered during testing causing tests to be redone.&#xA;&#xA;This post focuses on a comparison of XDP and OVS in delivering packets to a VM from the perspective of CPU cycles spent by the host in processing those packets. There are a lot of variables at play, and changing any one of them radically affects the outcome, though it should be no surprise XDP is always lighter and faster.&#xA;&#xA;Setup&#xA;I believe I am covering all of the settings here that I discovered over the past few months that caused variations in the data.&#xA;&#xA;Host&#xA;The host is a standard, modern server (Dell PowerEdge R640) with an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz with 96 hardware threads (48 cores + hyper threading to yield 96 logical cpus in the host). The server is running Ubuntu 18.04 with the recently released 5.8.0 kernel. It has a Mellanox Connectx4-LX ethernet card with 2 25G ports into an 802.3ad (LACP) bond, and the bond is connected to an OVS bridge.&#xA;&#xA;                      Host setup&#xA;&#xA;As discussed in [1] to properly compare the CPU costs of the 2 networking solutions, we need to consolidate packet processing to a single CPU. 
Handling all packets destined to the same VM on the same CPU avoids lock contention on the tun ring, so consolidating packets to a single CPU is actually best case performance.&#xA;&#xA;Ensure RPS is disabled in the host:&#xA;for d in eth0 eth1; do&#xA;    find /sys/class/net/${d}/queues -name rps_cpus |&#xA;    while read f; do&#xA;            echo 0 | sudo tee ${f}&#xA;    done&#xA;done&#xA;and add flow rules in the NIC to push packets for the VM under test to a single CPU:&#xA;sudo ethtool -N eth0 flow-type ether dst 12:34:de:ad:ca:fe action 2&#xA;sudo ethtool -N eth1 flow-type ether dst 12:34:de:ad:ca:fe action 2&#xA;For this host and ethernet card, packets for queue 2 are handled on CPU 5 (consult /proc/interrupts for the mapping on your host).&#xA;  &#xA;XDP bypasses the qdisc layer, so to have a fair comparison make noqueue the default qdisc before starting the VM:&#xA;sudo sysctl -w net.core.default_qdisc=noqueue&#xA;(or add a udev rule [2]).&#xA;&#xA;Finally, the host is fairly quiet with only one VM running (the one under test) and very little network traffic outside of the VM under test and a few, low traffic ssh sessions used to run commands to collect data about the tests.&#xA;&#xA;Virtual Machine&#xA;&#xA;The VM has 8 cpus and is also running Ubuntu 18.04 with a 5.8.0 kernel. It uses tap+vhost for networking with the tap device a port in the OVS bridge as shown in the picture above. The tap device has a single queue, and RPS is also disabled in the guest:&#xA;echo 00 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus&#xA;The VM is also quiet with no load running in the guest OS.&#xA;&#xA;The point of this comparison is host side processing of packets, so packets are dropped in the guest as soon as possible using a bpf program [3] attached to eth0 as a tc filter.  (Note: Theoretically, XDP should be used to drop the packets in the guest OS since it truly is the fewest cycles per packet. 
However, XDP in the VM requires a multi-queue NIC[5], and adding queues to the guest NIC has a huge effect on the results.)&#xA;&#xA;In the host, the qemu threads corresponding to the guest CPUs (vcpus) are affined (as a set) to 8 hardware threads in the same NUMA node as CPU 5 (the host CPU processing packets per the RSS rules mentioned earlier). The vhost thread for the VM&#39;s tap device is also affined to a small set of host CPUs in the same NUMA node to avoid scheduling collisions with the vcpu threads, the CPU processing packets (5) and its sibling hardware thread (CPU 53 in my case) - all of which add variability to the results.&#xA;&#xA;Forwarding with XDP&#xA;&#xA;Packet forwarding with XDP is done by attaching an L2 forwarding program [4] to eth0 and eth1. The program pulls the VLAN and destination mac from the ethernet header, and uses the pair as a key for a lookup in a hash map. The lookup returns the next device index for the packet which for packets destined to the VM is the index of its tap device. If an entry is found, the packet is redirected to the device via XDP_REDIRECT. The use case was presented in depth at netdevconf 0x14 [5].&#xA;&#xA;Packet generator&#xA;&#xA;Packets are generated using 2 VMs on a server that is directly connected to the same TOR switches as the hypervisor running the VM under test. 
The point of the investigation is to measure the overhead of delivering packets to a VM, so memcpy is kept to a minimum by having the packet generator [6] in the VMs send 1-byte UDP packets.&#xA;&#xA;Test setup&#xA;&#xA;Each VM can generate a little over 1 million packets per sec (1M pps), for a maximum load of 2.2M pps based on 2 separate source addresses.&#xA;&#xA;CPU Measurement&#xA;&#xA;As discussed in [1] a fair number of packets are processed in the context of some interrupted, victim process or when handled on an idle CPU the cycles are not fully accounted in the softirq time shown in tools like mpstat.&#xA;&#xA;This test binds openssl speed, a purely userspace command[1], to the CPU handling packets to fully consume 100% of all CPU cycles which makes the division of CPU time between user, system and softirq more transparent. In this case, the output of mpstat -P 5 shows how all of the cycles for CPU 5 were spent (within the resolution of system accounting):&#xA;%softirq is the time spent handling packets. This data is shown in the graphs below.&#xA;%usr represents the usable CPU time for processes to make progress on their workload. In this test, it shows the percentage of CPU consumed by openssl and compares to the times shown by openssl within 1-2%.&#xA;%sys is the percentage of kernel time and for the data shown below was always &lt;0.2%.&#xA;&#xA;As an example, in this mpstat output openssl is only getting 14.2% of the CPU while 85.8% was spent handling the packet load:&#xA;CPU    %usr   %nice    %sys  %iowait   %irq   %soft   %idle&#xA;  5   14.20    0.00    0.00     0.00   0.00   85.80   0.00&#xA;(%steal, %guest and %gnice dropped were always 0 and dropped for conciseness.)&#xA;&#xA;Let&#39;s get to the data.&#xA;&#xA;CPU Comparison&#xA;&#xA;This chart shows a comparison of the %softirq required to handle various PPS rates for both OVS and XDP. 
Lower numbers are better (higher percentages mean more CPU cycles).&#xA;&#xA;1-VM softirq&#xA;&#xA;There is 1-2% variability in ksoftirqd percentages despite the 5-second averaging, but the variability does not really affect the important points of this comparison. &#xA;&#xA;The results should not be that surprising. OVS has well established scaling problems and the chart shows that as packet rates increase. In my tests it was not hard to saturate a CPU with OVS, reaching a maximum packet rate to the VM of 1.2M pps. The 100% softirq at 1.5M pps and up is saturation of ksoftirqd alone with nothing else running on that CPU. Running another process on CPU 5 immediately affects the throughput rate as the CPU splits time between processing packets and running that process. With openssl, the packet rate to the VM is cut in half with packet drops at the host ingress as it can no longer keep up with the packet rate given the overhead of OVS.&#xA;&#xA;XDP on the other hand could push 2M pps to the VM before the guest could no longer keep up with packet drops at the tap device (ie., no room in the tun ring meaning the guest has not processed the previous packets). As shown above, the host still has plenty of CPU to handle more packets or run workloads (preferred condition for a cloud host).&#xA;&#xA;One thing to notice about the chart above is the apparent flat lining of CPU usage between 50k pps and 500k pps. That is not a typo, and the results are very repeatable. This needs more investigation, but I believe it shows the efficiencies kicking in from a combination of more packets getting handled per napi poll cycle (closer to maximum of the netdev budget) and the kernel side bulking in XDP before a flush is required.&#xA;&#xA;Hosts typically run more than 1 VM, so let&#39;s see the effect of adding a second VM to the mix. For this case a second VM is started with the same setup as mentioned earlier, but now the traffic load is split equally between 2 VMs. 
The key point here is a single CPU processing interleaved network traffic for 2 different destinations.&#xA;&#xA;2-VM softirq&#xA;&#xA;For OVS, CPU saturation with ksoftirqd happens with a maximum packet rate to each VM of 800k pps (compared to 1.2M with only a single VM). The saturation is in the host with packet drops shown at host ingress, and again any competition for the CPU processing packets cuts the rate in half.&#xA;&#xA;Meanwhile, XDP is barely affected by the second VM with a modest increase of 3-4% in softirq at the upper packet rates. In this case, the redirected packets are just hitting separate bulking queues in the kernel. The two packet generators are not able to hit 4+M pps to find the maximum per-VM rate.&#xA;&#xA;Final Thoughts&#xA;&#xA;CPU cycles are only the beginning for comparing network solutions. A full OVS-vs-XDP comparison needs to consider all the resources consumed - e.g., memory as well as CPU. For example, OVS has ovs-vswitchd which consumes a high amount of memory (&gt;750MB RSS on this server with only the 2 VMs) and additional CPU cycles to handle upcalls (flow misses) and revalidate flow entries in the kernel which on an active hypervisor can easily consume 50+% cpu (not counting increased usage from various bugs[7]).&#xA;&#xA;Meanwhile, XDP is still early in its lifecycle. Right now, using XDP for this setup requires VLAN acceleration in the NIC [5] to be disabled meaning the VLAN header has to be removed by the ebpf program before forwarding to the VM. Using the proposed hardware hints solution reduces the softirq time by another 1-2% meaning 1-2% more usable CPU by leveraging hardware acceleration with XDP. 
This is just an example of how XDP will continue to get faster as it works better with hardware offloads.&#xA;&#xA;Acronyms&#xA;LACP   Link Aggregation Control Protocol&#xA;NIC    Network Interface Card&#xA;NUMA   Non-Uniform Memory Access&#xA;OVS    Open VSwitch&#xA;PPS    Packets per Second&#xA;RPS    Receive Packet Steering&#xA;RSS    Receive Side Scaling&#xA;TOR    Top-of-Rack&#xA;VM     Virtual Machine&#xA;XDP    Express Data Path in Linux&#xA;&#xA;References&#xA;[1] https://people.kernel.org/dsahern/the-cpu-cost-of-networking-on-a-host&#xA;[2] https://people.kernel.org/dsahern/rss-rps-locking-qdisc&#xA;[3] https://github.com/dsahern/bpf-progs/blob/master/ksrc/rx_acl.c&#xA;[4] https://github.com/dsahern/bpf-progs/blob/master/ksrc/xdp_l2fwd.c&#xA;[5] https://netdevconf.info/0x14/session.html?tutorial-XDP-and-the-cloud&#xA;[6] https://github.com/dsahern/random-cmds/blob/master/src/pktgen.c&#xA;[7] https://www.mail-archive.com/ovs-dev@openvswitch.org/msg39266.html]]&gt;</description>
      <content:encoded><![CDATA[<p>Long overdue blog post on XDP; so many details uncovered during testing causing tests to be redone.</p>

<p>This post focuses on a comparison of XDP and OVS in delivering packets <strong>to</strong> a VM from the perspective of CPU cycles spent by the host in processing those packets. There are a lot of variables at play, and changing any one of them radically affects the outcome, though it should be no surprise XDP is always lighter and faster.</p>

<h2 id="setup">Setup</h2>

<p>I believe I am covering all of the settings here that I discovered over the past few months that caused variations in the data.</p>

<h3 id="host">Host</h3>

<p>The host is a standard, modern server (Dell PowerEdge R640) with an Intel® Xeon® Platinum 8168 CPU @ 2.70GHz with 96 hardware threads (48 cores + hyper threading to yield 96 logical cpus in the host). The server is running Ubuntu 18.04 with the recently released 5.8.0 kernel. It has a Mellanox Connectx4-LX ethernet card with 2 25G ports into an 802.3ad (LACP) bond, and the bond is connected to an OVS bridge.</p>

<p>                      <img src="https://raw.githubusercontent.com/dsahern/blog/master/people.kernel.org/xdp-vs-ovs/host-setup.png" alt="Host setup"></p>

<p>As discussed in [1] to properly compare the CPU costs of the 2 networking solutions, we need to consolidate packet processing to a single CPU. Handling all packets destined to the same VM on the same CPU avoids lock contention on the tun ring, so consolidating packets to a single CPU is actually best case performance.</p>

<p>Ensure RPS is disabled in the host:</p>

<pre><code>for d in eth0 eth1; do
    find /sys/class/net/${d}/queues -name rps_cpus |
    while read f; do
            echo 0 | sudo tee ${f}
    done
done
</code></pre>

<p>and add flow rules in the NIC to push packets for the VM under test to a single CPU:</p>

<pre><code>sudo ethtool -N eth0 flow-type ether dst 12:34:de:ad:ca:fe action 2
sudo ethtool -N eth1 flow-type ether dst 12:34:de:ad:ca:fe action 2
</code></pre>

<p>For this host and ethernet card, packets for queue 2 are handled on CPU 5 (consult /proc/interrupts for the mapping on your host).</p>

<p>XDP bypasses the qdisc layer, so to have a fair comparison make <strong>noqueue</strong> the default qdisc <em>before</em> starting the VM:</p>

<pre><code>sudo sysctl -w net.core.default_qdisc=noqueue
</code></pre>

<p>(or add a udev rule [2]).</p>

<p>Finally, the host is fairly quiet with only one VM running (the one under test) and very little network traffic outside of the VM under test and a few, low traffic ssh sessions used to run commands to collect data about the tests.</p>

<h3 id="virtual-machine">Virtual Machine</h3>

<p>The VM has 8 cpus and is also running Ubuntu 18.04 with a 5.8.0 kernel. It uses tap+vhost for networking with the tap device a port in the OVS bridge as shown in the picture above. The tap device has a single queue, and RPS is also disabled in the guest:</p>

<pre><code>echo 00 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
</code></pre>

<p>The VM is also quiet with no load running in the guest OS.</p>

<p>The point of this comparison is host side processing of packets, so packets are dropped in the guest as soon as possible using a bpf program [3] attached to eth0 as a tc filter.  (Note: Theoretically, XDP should be used to drop the packets in the guest OS since it truly is the fewest cycles per packet. However, XDP in the VM requires a multi-queue NIC[5], and adding queues to the guest NIC has a <strong>huge</strong> effect on the results.)</p>

<p>In the host, the qemu threads corresponding to the guest CPUs (vcpus) are affined (as a set) to 8 hardware threads in the same NUMA node as CPU 5 (the host CPU processing packets per the RSS rules mentioned earlier). The vhost thread for the VM&#39;s tap device is also affined to a small set of host CPUs in the same NUMA node to avoid scheduling collisions with the vcpu threads, the CPU processing packets (5) and its sibling hardware thread (CPU 53 in my case) – all of which add variability to the results.</p>

<h2 id="forwarding-with-xdp">Forwarding with XDP</h2>

<p>Packet forwarding with XDP is done by attaching an L2 forwarding program [4] to eth0 and eth1. The program pulls the VLAN and destination mac from the ethernet header, and uses the pair as a key for a lookup in a hash map. The lookup returns the next device index for the packet which for packets destined to the VM is the index of its tap device. If an entry is found, the packet is redirected to the device via XDP_REDIRECT. The use case was presented in depth at netdevconf 0x14 [5].</p>

<h2 id="packet-generator">Packet generator</h2>

<p>Packets are generated using 2 VMs on a server that is directly connected to the same TOR switches as the hypervisor running the VM under test. The point of the investigation is to measure the overhead of delivering packets to a VM, so memcpy is kept to a minimum by having the packet generator [6] in the VMs send 1-byte UDP packets.</p>

<p><img src="https://raw.githubusercontent.com/dsahern/blog/master/people.kernel.org/xdp-vs-ovs/test-setup.png" alt="Test setup"></p>

<p>Each VM can generate a little over 1 million packets per sec (1M pps), for a maximum load of 2.2M pps based on 2 separate source addresses.</p>

<h2 id="cpu-measurement">CPU Measurement</h2>

<p>As discussed in [1], a fair number of packets are processed in the context of some interrupted, victim process or handled on an idle CPU, and in both cases the cycles are not fully accounted in the softirq time shown in tools like mpstat.</p>

<p>This test binds <code>openssl speed</code>, a purely userspace command[1], to the CPU handling packets to fully consume 100% of all CPU cycles which makes the division of CPU time between user, system and softirq more transparent. In this case, the output of <code>mpstat -P 5</code> shows how all of the cycles for CPU 5 were spent (within the resolution of system accounting):</p>
<ul><li>%softirq is the time spent handling packets. This data is shown in the graphs below.</li>
<li>%usr represents the usable CPU time for processes to make progress on their workload. In this test, it shows the percentage of CPU consumed by openssl and compares to the times shown by openssl within 1-2%.</li>
<li>%sys is the percentage of kernel time and for the data shown below was always &lt;0.2%.</li></ul>

<p>As an example, in this <code>mpstat</code> output openssl is only getting 14.2% of the CPU while 85.8% was spent handling the packet load:</p>

<pre><code>CPU    %usr   %nice    %sys  %iowait   %irq   %soft   %idle
  5   14.20    0.00    0.00     0.00   0.00   85.80   0.00
</code></pre>

<p>(%steal, %guest and %gnice were always 0 and dropped for conciseness.)</p>

<p>Let&#39;s get to the data.</p>

<h2 id="cpu-comparison">CPU Comparison</h2>

<p>This chart shows a comparison of the %softirq required to handle various PPS rates for both OVS and XDP. Lower numbers are better (higher percentages mean more CPU cycles).</p>

<p><img src="https://raw.githubusercontent.com/dsahern/blog/master/people.kernel.org/xdp-vs-ovs/softirq-pps-1vm.png" alt="1-VM softirq"></p>

<p>There is 1-2% variability in ksoftirqd percentages despite the 5-second averaging, but the variability does not really affect the important points of this comparison.</p>

<p>The results should not be that surprising. OVS has well established scaling problems and the chart shows that as packet rates increase. In my tests it was not hard to saturate a CPU with OVS, reaching a maximum packet rate to the VM of 1.2M pps. The 100% softirq at 1.5M pps and up is saturation of ksoftirqd alone with nothing else running on that CPU. Running another process on CPU 5 immediately affects the throughput rate as the CPU splits time between processing packets and running that process. With openssl, the packet rate to the VM is cut in half with packet drops at the host ingress as it can no longer keep up with the packet rate given the overhead of OVS.</p>

<p>XDP on the other hand could push 2M pps to the VM before the <strong>guest</strong> could no longer keep up with packet drops at the tap device (ie., no room in the tun ring meaning the guest has not processed the previous packets). As shown above, the host still has plenty of CPU to handle more packets or run workloads (preferred condition for a cloud host).</p>

<p>One thing to notice about the chart above is the apparent flat lining of CPU usage between 50k pps and 500k pps. That is not a typo, and the results are very repeatable. This needs more investigation, but I believe it shows the efficiencies kicking in from a combination of more packets getting handled per napi poll cycle (closer to maximum of the netdev budget) and the kernel side bulking in XDP before a flush is required.</p>

<p>Hosts typically run more than 1 VM, so let&#39;s see the effect of adding a second VM to the mix. For this case a second VM is started with the same setup as mentioned earlier, but now the traffic load is split equally between 2 VMs. The key point here is a single CPU processing interleaved network traffic for 2 different destinations.</p>

<p><img src="https://raw.githubusercontent.com/dsahern/blog/master/people.kernel.org/xdp-vs-ovs/softirq-pps-2vm.png" alt="2-VM softirq"></p>

<p>For OVS, CPU saturation with ksoftirqd happens with a maximum  packet rate to each VM of 800k pps (compared to 1.2M with only a single VM). The saturation is in the host with packet drops shown at host ingress, and again any competition for the CPU processing packets cuts the rate in half.</p>

<p>Meanwhile, XDP is barely affected by the second VM with a modest increase of 3-4% in softirq at the upper packet rates. In this case, the redirected packets are just hitting separate bulking queues in the kernel. The two packet generators are not able to hit 4+M pps to find the maximum per-VM rate.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>CPU cycles are only the beginning for comparing network solutions. A full OVS-vs-XDP comparison needs to consider all the resources consumed – e.g., memory as well as CPU. For example, OVS has ovs-vswitchd which consumes a high amount of memory (&gt;750MB RSS on this server with only the 2 VMs) and additional CPU cycles to handle upcalls (flow misses) and revalidate flow entries in the kernel which on an active hypervisor can easily consume 50+% cpu (not counting increased usage from various bugs[7]).</p>

<p>Meanwhile, XDP is still early in its lifecycle. Right now, using XDP for this setup requires VLAN acceleration in the NIC [5] to be disabled meaning the VLAN header has to be removed by the ebpf program before forwarding to the VM. Using the proposed hardware hints solution reduces the softirq time by another 1-2% meaning 1-2% more usable CPU by leveraging hardware acceleration with XDP. This is just an example of how XDP will continue to get faster as it works better with hardware offloads.</p>

<h2 id="acronyms">Acronyms</h2>

<p>LACP   Link Aggregation Control Protocol
NIC    Network Interface Card
NUMA   Non-Uniform Memory Access
OVS    Open VSwitch
PPS    Packets per Second
RPS    Receive Packet Steering
RSS    Receive Side Scaling
TOR    Top-of-Rack
VM     Virtual Machine
XDP    Express Data Path in Linux</p>

<h2 id="references">References</h2>

<p>[1] <a href="https://people.kernel.org/dsahern/the-cpu-cost-of-networking-on-a-host" rel="nofollow">https://people.kernel.org/dsahern/the-cpu-cost-of-networking-on-a-host</a>
[2] <a href="https://people.kernel.org/dsahern/rss-rps-locking-qdisc" rel="nofollow">https://people.kernel.org/dsahern/rss-rps-locking-qdisc</a>
[3] <a href="https://github.com/dsahern/bpf-progs/blob/master/ksrc/rx_acl.c" rel="nofollow">https://github.com/dsahern/bpf-progs/blob/master/ksrc/rx_acl.c</a>
[4] <a href="https://github.com/dsahern/bpf-progs/blob/master/ksrc/xdp_l2fwd.c" rel="nofollow">https://github.com/dsahern/bpf-progs/blob/master/ksrc/xdp_l2fwd.c</a>
[5] <a href="https://netdevconf.info/0x14/session.html?tutorial-XDP-and-the-cloud" rel="nofollow">https://netdevconf.info/0x14/session.html?tutorial-XDP-and-the-cloud</a>
[6] <a href="https://github.com/dsahern/random-cmds/blob/master/src/pktgen.c" rel="nofollow">https://github.com/dsahern/random-cmds/blob/master/src/pktgen.c</a>
[7] <a href="https://www.mail-archive.com/ovs-dev@openvswitch.org/msg39266.html" rel="nofollow">https://www.mail-archive.com/ovs-dev@openvswitch.org/msg39266.html</a></p>
]]></content:encoded>
      <author>David Ahern</author>
      <guid>https://people.kernel.org/read/a/3r68b853rg</guid>
      <pubDate>Mon, 17 Aug 2020 22:42:05 +0000</pubDate>
    </item>
    <item>
      <title>The Seccomp Notifier - Cranking up the crazy with bpf()</title>
      <link>https://people.kernel.org/brauner/the-seccomp-notifier-cranking-up-the-crazy-with-bpf</link>
      <description>&lt;![CDATA[In my last article I looked at the seccomp notifier in detail and how it allows us to make unprivileged containers way more capable (Sorry, kernel joke.). This is the (very) crazy (but very short) sequel. (Sorry Jon, no novella this time. :))&#xA;&#xA;Last time I mentioned two new features that we had landed:&#xA;&#xA;Retrieving file descriptors from another task via pidfd_getfd()&#xA;Injecting file descriptors via the new SECCOMP_IOCTL_NOTIF_ADDFD ioctl on the seccomp notifier&#xA;&#xA;The 2. feature just landed in the merge window for v5.9. So what better time than now to boot a v5.9 pre-rc1 kernel and play with the new features.&#xA;&#xA;I said that these features make it possible to intercept syscalls that return file descriptors or that pass file descriptors to the kernel. Syscalls that come to mind are open(), connect(), dup2(), but also bpf().&#xA;People that read the first blogpost might not have realized how crazy^serious one can get with these two new features so I thought it would be a good exercise to illustrate it. And what better victim than bpf().&#xA;&#xA;As we know, bpf() and unprivileged containers don&#39;t get along too well. But that doesn&#39;t need to be the case. For the demo you&#39;re about to see I enabled LXD to supervise the bpf() syscalls for tasks running in unprivileged containers. We will intercept the bpf() syscalls for the BPF_PROG_LOAD command for BPF_PROG_TYPE_CGROUP_DEVICE program types and the BPF_PROG_ATTACH, and BPF_PROG_DETACH commands for the BPF_CGROUP_DEVICE attach type. This allows a nested unprivileged container to load its own device profile in the cgroup2 hierarchy.&#xA;&#xA;This is just a tiny glimpse into how this can be used and extended. ;) The pull request for LXD is already up here. Let&#39;s see if the rest of the team thinks I&#39;m going crazy. :)&#xA;&#xA;asciicast]]&gt;</description>
      <content:encoded><![CDATA[<p>In my last article I looked at the <a href="https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development" rel="nofollow">seccomp notifier</a> in detail and how it allows us to make unprivileged containers way more capable (Sorry, kernel joke.). This is the (very) crazy (but very short) sequel. (Sorry Jon, no novella this time. :))</p>

<p>Last time I mentioned two new features that we had landed:</p>
<ol><li><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=83fa805bcbfc53ae82eedd65132794ae324798e5" rel="nofollow">Retrieving file descriptors from another task via <code>pidfd_getfd()</code></a></li>
<li><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ecc6ea491f0c0531ad81ef9466284df260b2227" rel="nofollow">Injecting file descriptors via the new <code>SECCOMP_IOCTL_NOTIF_ADDFD</code> ioctl on the seccomp notifier</a></li></ol>

<p>The second feature just landed in the merge window for <code>v5.9</code>. So what better time than now to boot a <code>v5.9</code> pre-rc1 kernel and play with the new features.</p>

<p>I said that these features make it possible to intercept syscalls that return file descriptors or that pass file descriptors to the kernel. Syscalls that come to mind are <code>open()</code>, <code>connect()</code>, <code>dup2()</code>, but also <code>bpf()</code>.
People that read the first blogpost might not have realized how crazy^serious one can get with these two new features so I thought it would be a good exercise to illustrate it. And what better victim than <code>bpf()</code>.</p>

<p>As we know, <code>bpf()</code> and unprivileged containers don&#39;t get along too well. But that doesn&#39;t need to be the case. For the demo you&#39;re about to see I enabled LXD to supervise the <code>bpf()</code> syscalls for tasks running in unprivileged containers. We will intercept the <code>bpf()</code> syscalls for the <code>BPF_PROG_LOAD</code> command for <code>BPF_PROG_TYPE_CGROUP_DEVICE</code> program types and the <code>BPF_PROG_ATTACH</code>, and <code>BPF_PROG_DETACH</code> commands for the <code>BPF_CGROUP_DEVICE</code> attach type. This allows a nested unprivileged container to load its own device profile in the cgroup2 hierarchy.</p>

<p>This is just a tiny glimpse into how this can be used and extended. ;) The pull request for LXD is already up <a href="https://github.com/lxc/lxd/pull/7743" rel="nofollow">here</a>. Let&#39;s see if the rest of the team thinks I&#39;m going crazy. :)</p>

<p><a href="https://asciinema.org/a/352191" rel="nofollow"><img src="https://asciinema.org/a/352181.svg" alt="asciicast"></a></p>
]]></content:encoded>
      <author>Christian Brauner</author>
      <guid>https://people.kernel.org/read/a/cbm5jfm473</guid>
      <pubDate>Fri, 07 Aug 2020 13:38:04 +0000</pubDate>
    </item>
    <item>
      <title>RSS/RPS + locking qdisc</title>
      <link>https://people.kernel.org/dsahern/rss-rps-locking-qdisc</link>
      <description>&lt;![CDATA[I recently learned this fun fact: With RSS or RPS enabled [1] and a lock-based qdisc on a VM&#39;s tap device (e.g., fq_codel) a UDP packet storm targeted at the VM can severely impact the entire server.&#xA;&#xA;The point of RSS/RPS is to distribute the packet processing load across all hardware threads (CPUs) in a server / host. However, when those packets are forwarded to a single device that has a lock-based qdisc (e.g., virtual machines and a tap device or a container and veth based device) that distributed processing causes heavy spinlock contention resulting in ksoftirqd spinning on all CPUs trying to handle the packet load.&#xA;&#xA;As an example, my server has 96 cpus and 1 million udp packets per second targeted at the VM is enough to push all of the ksoftirqd threads to near 100%:&#xA;&#xA;  PID %CPU COMMAND               P&#xA;   58 99.9 ksoftirqd/9           9&#xA;  128 99.9 ksoftirqd/23         23&#xA;  218 99.9 ksoftirqd/41         41&#xA;  278 99.9 ksoftirqd/53         53&#xA;  318 99.9 ksoftirqd/61         61&#xA;  328 99.9 ksoftirqd/63         63&#xA;  358 99.9 ksoftirqd/69         69&#xA;  388 99.9 ksoftirqd/75         75&#xA;  408 99.9 ksoftirqd/79         79&#xA;  438 99.9 ksoftirqd/85         85&#xA; 7411 99.9 CPU 7/KVM            64&#xA;   28 99.9 ksoftirqd/3           3&#xA;   38 99.9 ksoftirqd/5           5&#xA;   48 99.9 ksoftirqd/7           7&#xA;   68 99.9 ksoftirqd/11         11&#xA;   78 99.9 ksoftirqd/13         13&#xA;   88 99.9 ksoftirqd/15         15&#xA;   ...&#xA;perf top shows the spinlock contention:&#xA;    96.79%  [kernel]          [k] queued_spin_lock_slowpath&#xA;     0.40%  [kernel]          [k] _raw_spin_lock&#xA;     0.23%  [kernel]          [k] __netif_receive_skb_core&#xA;     0.23%  [kernel]          [k] __dev_queue_xmit&#xA;     0.20%  [kernel]          [k] __qdisc_run&#xA;With the callchain leading to&#xA;    94.25%  [kernel.vmlinux]    [k] queued_spin_lock_slowpath&#xA;            |&#xA;             
--94.25%--queued_spin_lock_slowpath&#xA;                       |&#xA;                        --93.83%--__dev_queue_xmit&#xA;                                  do_execute_actions&#xA;                                  ovs_execute_actions&#xA;                                  ovs_dp_process_packet&#xA;                                  ovs_vport_receive&#xA;A little code analysis shows this is the qdisc lock in __dev_xmit_skb.&#xA;&#xA;The overloaded ksoftirqd threads means it takes longer to process packets resulting in budget limits getting hit and packet drops at ingress. The packet drops can cause ssh sessions to stall or drop or cause disruptions in protocols like LACP.&#xA;&#xA;Changing the qdisc on the device to a lockless one (e.g., noqueue) dramatically lowers the ksoftirqd load.  perf top still shows the hot spot as a spinlock:&#xA;    25.62%  [kernel]          [k] queued_spin_lock_slowpath&#xA;     6.87%  [kernel]          [k] tasklet_action_common.isra.21&#xA;     3.28%  [kernel]          [k] _raw_spin_lock&#xA;     3.15%  [kernel]          [k] tun_net_xmit&#xA;but this time it is the lock for the tun ring:&#xA;    25.10%  [kernel.vmlinux]    [k] queued_spin_lock_slowpath&#xA;            |&#xA;             --25.05%--queued_spin_lock_slowpath&#xA;                       |&#xA;                        --24.93%--tun_net_xmit&#xA;                                  dev_hard_start_xmit&#xA;                                  __dev_queue_xmit&#xA;                                  do_execute_actions&#xA;                                  ovs_execute_actions&#xA;                                  ovs_dp_process_packet&#xA;                                  ovs_vport_receive&#xA;which is a much lighter lock in the sense of the amount of work done with the lock held.&#xA;&#xA;systemd commit e6c253e363dee, released in systemd 217, changed the default qdisc from pfifo_fast (kernel default) to fq_codel (/usr/lib/sysctl.d/50-default.conf for Ubuntu). 
As of v5.8 kernel fq_codel still has a lock to enqueue packets, so systems using fq_codel with RSS/RPS are hitting this lock contention which affects overall system performance. pfifo_fast is lockless as of v4.16 so for newer kernels the kernel&#39;s default is best.&#xA;&#xA;But, it begs the question why have a qdisc for a VM tap device (or a container&#39;s veth device) at all? To the VM a host is just part of the network. You would not want a top-of-rack switch to buffer packets for the server, so why have the host buffer packets for a VM? (The &#34;Tx&#34; path for a tap device represents packets going to the VM.)&#xA;&#xA;You can change the default via:&#xA;sysctl -w net.core.default_qdisc=noqueue&#xA;or add that to a sysctl file (e.g., /etc/sysctl.d/90-local.conf). sysctl changes affect new devices only.&#xA;&#xA;Alternatively, the default can be changed for selected devices via a udev rule:&#xA;cat &gt; /etc/udev/rules.d/90-tap.rules &lt;&lt;EOF&#xA;ACTION==&#34;add|change&#34;, SUBSYSTEM==&#34;net&#34;, KERNEL==&#34;tap*&#34;, PROGRAM=&#34;/sbin/tc qdisc add dev $env{INTERFACE} root handle 1000: noqueue&#34;&#xA;EOF&#xA;Running sudo udevadm trigger should update existing devices. Check using tc qdisc sh dev &lt;NAME&gt;:&#xA;$ tc qdisc sh dev tapext4798884&#xA;qdisc noqueue 1000: root refcnt 2&#xA;qdisc ingress ffff: parent ffff:fff1 ----------------&#xA;&#xA;[1] https://www.kernel.org/doc/Documentation/networking/scaling.txt]]&gt;</description>
      <content:encoded><![CDATA[<p>I recently learned this fun fact: With RSS or RPS enabled [1] and a lock-based qdisc on a VM&#39;s tap device (e.g., fq_codel) a UDP packet storm targeted at the VM can severely impact the entire server.</p>

<p>The point of RSS/RPS is to distribute the packet processing load across all hardware threads (CPUs) in a server / host. However, when those packets are forwarded to a single device that has a lock-based qdisc (e.g., virtual machines and a tap device or a container and veth based device) that distributed processing causes heavy spinlock contention resulting in ksoftirqd spinning on all CPUs trying to handle the packet load.</p>

<p>As an example, my server has 96 cpus and 1 million udp packets per second targeted at the VM is enough to push all of the ksoftirqd threads to near 100%:</p>

<pre><code>  PID %CPU COMMAND               P
   58 99.9 ksoftirqd/9           9
  128 99.9 ksoftirqd/23         23
  218 99.9 ksoftirqd/41         41
  278 99.9 ksoftirqd/53         53
  318 99.9 ksoftirqd/61         61
  328 99.9 ksoftirqd/63         63
  358 99.9 ksoftirqd/69         69
  388 99.9 ksoftirqd/75         75
  408 99.9 ksoftirqd/79         79
  438 99.9 ksoftirqd/85         85
 7411 99.9 CPU 7/KVM            64
   28 99.9 ksoftirqd/3           3
   38 99.9 ksoftirqd/5           5
   48 99.9 ksoftirqd/7           7
   68 99.9 ksoftirqd/11         11
   78 99.9 ksoftirqd/13         13
   88 99.9 ksoftirqd/15         15
   ...
</code></pre>

<p><code>perf top</code> shows the spinlock contention:</p>

<pre><code>    96.79%  [kernel]          [k] queued_spin_lock_slowpath
     0.40%  [kernel]          [k] _raw_spin_lock
     0.23%  [kernel]          [k] __netif_receive_skb_core
     0.23%  [kernel]          [k] __dev_queue_xmit
     0.20%  [kernel]          [k] __qdisc_run
</code></pre>

<p>With the callchain leading to</p>

<pre><code>    94.25%  [kernel.vmlinux]    [k] queued_spin_lock_slowpath
            |
             --94.25%--queued_spin_lock_slowpath
                       |
                        --93.83%--__dev_queue_xmit
                                  do_execute_actions
                                  ovs_execute_actions
                                  ovs_dp_process_packet
                                  ovs_vport_receive
</code></pre>

<p>A little code analysis shows this is the qdisc lock in __dev_xmit_skb.</p>

<p>The overloaded ksoftirqd threads mean it takes longer to process packets, resulting in budget limits getting hit and packet drops at ingress. The packet drops can cause ssh sessions to stall or drop, or cause disruptions in protocols like LACP.</p>

<p>Changing the qdisc on the device to a lockless one (e.g., noqueue) dramatically lowers the ksoftirqd load.  <code>perf top</code> still shows the hot spot as a spinlock:</p>

<pre><code>    25.62%  [kernel]          [k] queued_spin_lock_slowpath
     6.87%  [kernel]          [k] tasklet_action_common.isra.21
     3.28%  [kernel]          [k] _raw_spin_lock
     3.15%  [kernel]          [k] tun_net_xmit
</code></pre>

<p>but this time it is the lock for the tun ring:</p>

<pre><code>    25.10%  [kernel.vmlinux]    [k] queued_spin_lock_slowpath
            |
             --25.05%--queued_spin_lock_slowpath
                       |
                        --24.93%--tun_net_xmit
                                  dev_hard_start_xmit
                                  __dev_queue_xmit
                                  do_execute_actions
                                  ovs_execute_actions
                                  ovs_dp_process_packet
                                  ovs_vport_receive
</code></pre>

<p>which is a much lighter lock in the sense of the amount of work done with the lock held.</p>

<p>systemd commit e6c253e363dee, released in systemd 217, changed the default qdisc from pfifo_fast (kernel default) to fq_codel (/usr/lib/sysctl.d/50-default.conf for Ubuntu). As of v5.8 kernel fq_codel still has a lock to enqueue packets, so systems using fq_codel with RSS/RPS are hitting this lock contention which affects overall system performance. pfifo_fast is lockless as of v4.16 so for newer kernels the kernel&#39;s default is best.</p>

<p>But this raises the question: why have a qdisc for a VM tap device (or a container&#39;s veth device) at all? To the VM the host is just part of the network. You would not want a top-of-rack switch to buffer packets for the server, so why have the host buffer packets for a VM? (The “Tx” path for a tap device represents packets going <em>to</em> the VM.)</p>

<p>You can change the default via:</p>

<pre><code>sysctl -w net.core.default_qdisc=noqueue
</code></pre>

<p>or add that to a sysctl file (e.g., /etc/sysctl.d/90-local.conf). sysctl changes affect new devices only.</p>

<p>Alternatively, the default can be changed for selected devices via a udev rule:</p>

<pre><code>cat &gt; /etc/udev/rules.d/90-tap.rules &lt;&lt;EOF
ACTION==&#34;add|change&#34;, SUBSYSTEM==&#34;net&#34;, KERNEL==&#34;tap*&#34;, PROGRAM=&#34;/sbin/tc qdisc add dev $env{INTERFACE} root handle 1000: noqueue&#34;
EOF
</code></pre>

<p>Running <code>sudo udevadm trigger</code> <em>should</em> update existing devices. Check using <code>tc qdisc sh dev &lt;NAME&gt;</code>:</p>

<pre><code>$ tc qdisc sh dev tapext4798884
qdisc noqueue 1000: root refcnt 2
qdisc ingress ffff: parent ffff:fff1 ----------------
</code></pre>

<p>[1] <a href="https://www.kernel.org/doc/Documentation/networking/scaling.txt" rel="nofollow">https://www.kernel.org/doc/Documentation/networking/scaling.txt</a></p>
]]></content:encoded>
      <author>David Ahern</author>
      <guid>https://people.kernel.org/read/a/ycklgg1p8d</guid>
      <pubDate>Thu, 06 Aug 2020 21:40:28 +0000</pubDate>
    </item>
    <item>
      <title>Seccomp Notify - New Frontiers in Unprivileged Container Development</title>
      <link>https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development</link>
      <description>&lt;![CDATA[Introduction&#xA;&#xA;As most people know by now we do a lot of upstream kernel development.  This stretches over multiple areas and of course we also do a lot of kernel work around containers.  In this article I&#39;d like to take a closer look at the new seccomp notify feature we have been developing both in the kernel and in userspace and that is seeing more and more users.  I&#39;ve talked about this feature quite a few times at various conferences (just recently again at OSS NA) over the last two years but never actually sat down to write a blogpost about it.  This is something I had wanted to do for quite some time: first because it is a very exciting feature from a purely technical perspective, but also because of the new possibilities it opens up for (unprivileged) containers and other use-cases.&#xA;&#xA;The Limits of Unprivileged Containers&#xA;&#xA;That (Linux) Containers are a userspace fiction is a well-known dictum nowadays.  It simply expresses the fact that there is no container kernel object in the Linux kernel.  Instead, userspace is relatively free to define what a container is.  But for the most part userspace agrees that a container is somehow concerned with isolating a task or a task tree from the host system.  This is achieved by combining a multitude of Linux kernel features.  One of the better known kernel features that is used to build containers are namespaces.  The number of namespaces the kernel supports has grown over time and we are currently at eight.  
Before you go and look them up on namespaces(7) here they are:&#xA;&#xA;cgroup: cgroup_namespaces(7)&#xA;ipc: ipc_namespaces(7)&#xA;network: network_namespaces(7)&#xA;mount: mount_namespaces(7)&#xA;pid: pid_namespaces(7)&#xA;time: time_namespaces(7)&#xA;user: user_namespaces(7)&#xA;uts: uts_namespaces(7)&#xA;&#xA;Of these eight namespaces the user namespace is the only one concerned with isolating core privilege concepts on Linux such as user- and group ids, and capabilities.&#xA;&#xA;Quite often we see tasks in userspace that check whether they run as root or whether they have a specific capability (e.g. CAP_MKNOD is required to create device nodes) and it seems that when the answer is &#34;yes&#34; then the task is actually a privileged task.  But as usual things aren&#39;t that simple.  What the task thinks it&#39;s checking for and what the kernel really is checking for are possibly two very different things.  A naive task, i.e. a task not aware of user namespaces, might think it&#39;s asking whether it is privileged with respect to the whole system aka the host but what the kernel really checks for is whether the task has the necessary privileges relative to the user namespace it is located in.&#xA;&#xA;In most cases the kernel will not check whether the task is privileged with respect to the whole system.  Instead, it will almost always call a function called ns_capable() which is the kernel&#39;s way of checking whether the calling task has privilege in its current user namespace.&#xA;&#xA;For example, when a new user namespace is created by setting the CLONE_NEWUSER flag in unshare(2) or in clone3(2) the kernel will grant a full set of capabilities to the task that called unshare(2) or the newly created child task via clone3(2) within the new user namespace.  When this task now e.g. checks whether it has the CAP_MKNOD capability the kernel will report back that it indeed has that capability.  
The key point though is that this &#34;yes&#34; is not a global &#34;yes&#34;, i.e. the question &#34;Am I privileged enough to perform this operation?&#34; only applies to the current user namespace (and technically any nested user namespaces) not the host itself.&#xA;&#xA;This distinction is important when  trying to understand why a task running as root in a new user namespace with all capabilities raised will still see EPERM when e.g. trying to call mknod(&#34;/dev/mem&#34;, makedev(1, 1)) even though it seems to have all necessary privileges.  The reason for this counterintuitive behavior is that the kernel isn&#39;t always checking whether you are privileged against your current user namespace.  Instead, for any operation that it thinks is dangerous to expose to unprivileged users it will check whether the task is privileged in the initial user namespace, i.e. the host&#39;s user namespace.&#xA;&#xA;Creating device nodes is one such example: if a task running in a user namespace were to be able to create character or block device nodes it could e.g. create /dev/kmem or any other critical device and use the device to take over the host.  So the kernel simply blocks creating all device nodes in user namespaces by always performing the check for required privileges against the initial user namespace.  This is of course technically inconsistent since capabilities are per user namespace as we observed above.&#xA;&#xA;Other examples where the kernel requires privileges in the initial user namespace are mounting of block devices.  So simply making a disk device node available to an unprivileged container will still not make it useable since it cannot mount it.  On the other hand, some filesystems like cgroup, cgroup2, tmpfs, proc, sysfs, and fuse can be mounted in user namespace (with some caveats for proc and sys but we&#39;re ignoring those details for now) because the kernel can guarantee that this is safe.&#xA;&#xA;But of course these restrictions are annoying.  
Not being able to mount block devices or create device nodes means quite a few workloads are not able to run in containers even though they could be made to run safely.  Quite often a container manager like LXD will know better than the kernel when an operation that a container tries to perform is safe.&#xA;&#xA;A good example is device nodes.  Most containers bind-mount the set of standard devices into the container otherwise it would not work correctly:&#xA;&#xA;/dev/console&#xA;/dev/full&#xA;/dev/null&#xA;/dev/random&#xA;/dev/tty&#xA;/dev/urandom&#xA;/dev/zero&#xA;&#xA;Allowing a container to create these devices would be safe.  Of course, the container will simply bind-mount these devices during container startup into the container so this isn&#39;t really a serious problem.  But any program running inside the container that wants to create these harmless device nodes would fail.&#xA;&#xA;The other example that was mentioned earlier is mounting of block-based filesystems.  Our users often instruct LXD to make certain disk devices available to their containers because they know that it is safe.  For example, they could have a dedicated disk for the container or they want to share data with or among containers.  But the container could not mount any of those disks.&#xA;&#xA;For any use-case where the administrator is aware that a device node or disk device is missing from the container LXD provides the ability to hotplug them into one or multiple containers.  
For example, here is how you&#39;d hotplug /dev/zero into a running container:&#xA;&#xA; brauner@wittgenstein|~&#xA;  lxc exec f5 -- ls -al /my/zero&#xA;&#xA;brauner@wittgenstein|~&#xA;  lxc config device add f5 zero-device unix-char source=/dev/zero path=/my/zero&#xA;Device zero-device added to f5&#xA;&#xA;brauner@wittgenstein|~&#xA;  lxc exec f5 -- ls -al /my/zero&#xA;crw-rw---- 1 root root 1, 5 Jul 23 10:47 /my/zero&#xA;&#xA;But of course, that doesn&#39;t help at all when a random application inside the container calls mknod(2) itself.  In these cases LXD has no way of helping the application by hotplugging the device as it&#39;s unaware that a mknod syscall has been performed.&#xA;&#xA;So the root of the problem seems to be:&#xA;A task inside the container performs a syscall that will fail.&#xA;The syscall would not need to fail since the container manager knows that it is safe.&#xA;The container manager has no way of knowing when such a syscall is performed.&#xA;Even if the container manager would know when such a syscall is performed it has no way of inspecting it in detail.&#xA;&#xA;So a potential solution to this problem seems to be to enable the container manager or any sufficiently privileged task to take action on behalf of the container whenever it performs a syscall that would usually fail.  So somehow we need to be able to interact with the syscalls of another task.&#xA;&#xA;Seccomp - The Basics of Syscall Interception&#xA;&#xA;The obvious candidate to look at is seccomp.  Short for &#34;secure computing&#34; it provides a way of restricting the syscalls of a task either by allowing only a subset of the syscalls the kernel supports or by denying a set of syscalls it thinks would be unsafe for the task in question.  But seccomp allows even more advanced configurations through so-called &#34;filters&#34;.  Filters are BPF programs (Not to be equated with eBPF. BPF is a predecessor of eBPF.) 
that can be written in userspace and loaded into the kernel.  For example, a task could use a seccomp filter to only allow the mount() syscall and only those mount syscalls that create bind mounts.  This simple syscall management mechanism has made seccomp an essential security feature for a lot of userspace programs.  Nowadays it is considered good practice to restrict any critical program to only those syscalls it absolutely needs to run successfully.  Browser-based sandboxes and containers are prime examples, but even systemd services can be seccomp restricted.&#xA;&#xA;At its core seccomp is nothing but a syscall interception mechanism.  One way or another every operating system has something that is at least roughly comparable.  The way seccomp works is that it intercepts syscalls right in the architecture specific syscall entry paths.  So the seccomp invocations themselves live in the architecture specific codepaths although most of the logic around it is architecture agnostic.&#xA;&#xA;Usually, when a syscall is performed, and no seccomp filter has been applied to the task issuing the syscall the kernel will simply look up the syscall number in the architecture specific syscall table and if it is a known syscall will perform it reporting back the result to userspace.&#xA;&#xA;But when a seccomp filter is loaded for the task issuing the syscall instead of directly looking up the syscall number in the architecture&#39;s syscall table the kernel will first call into seccomp and run the loaded seccomp filter.&#xA;&#xA;Depending on whether a deny or allow approach is used for the seccomp filter any syscall that the filter is not handling specifically is either performed or denied reporting back a specified default value to the calling task.  If the requested syscall is supposed to be specifically handled by the seccomp filter the kernel can e.g. be caused to report back a specific error code.  
This way, it is for example possible to have the kernel pretend like it doesn&#39;t know the mount(2) syscall by creating a seccomp filter that reports back ENOSYS whenever the task tries to call mount(2).&#xA;&#xA;But the way seccomp used to work isn&#39;t very dynamic.  Specifically, once a filter is loaded the decision whether or not the syscall is successful is fixed based on the policy expressed by the filter.  So there is no way to make a case-by-case decision which might come in handy in some scenarios.&#xA;&#xA;In addition seccomp itself can&#39;t make a syscall actually succeed other than in the trivial way of reporting back success to the caller.  So seccomp will only allow the kernel to pretend that a syscall succeeded.  So while it is possible to instruct the kernel to return 0 for the mount(2) syscall it cannot actually be instructed to make the mount(2) syscall succeed.  So just making the seccomp filter return 0 for mounting a dedicated ext4 disk device to /mnt will still not actually mount it at /mnt; it just pretends to the caller that it did.  Of course that is in itself already a useful property for a bunch of use-cases but it doesn&#39;t really help with the mknod(2) or mount(2) problem outlined above.&#xA;&#xA;Extending Seccomp&#xA;&#xA;So from the section above it should be clear that seccomp provides a few desirable properties that make it a natural candidate to look at to help solve our mknod(2) and mount(2) problem.  Since seccomp intercepts syscalls early in the syscall path it already gives us a hook into the syscall path of a given task.  What is missing though is a way to bring another task such as the LXD container manager into the picture.  
Somehow we need to modify seccomp in a way that makes it possible for a container manager not just to be informed when a task inside the container performs a syscall it cares about, but also to block that task until the container manager instructs the kernel to allow it to proceed.&#xA;&#xA;The answer to these questions is seccomp notify.  This is as good a time as any to bring in some historical context.  The exact origins of the idea for a more dynamic way to intercept syscalls is probably not recoverable and it has been thrown around in unspecific form in various discussions but nothing serious ever materialized.  The first concrete details around seccomp notify were conceived in early 2017 in the LXD team.  The first public talk around the basic idea for this feature was given by Stéphane Graber at the Linux Plumbers Conference 2017 during the Container&#39;s Microconference in Los Angeles.  The details of this talk are still listed here and I&#39;m sure Stéphane can still provide the slides we came up with.  I didn&#39;t find a video recording even though I somehow thought we did have one.  If someone is really curious I can try to investigate with the Linux Plumbers committee.  After this talk implementation specifics were discussed in a hallway meeting later that day.  And after a long arduous journey the implementation was upstreamed by Tycho Andersen who used to be on the LXD team.  The rest is history^wchangelog.&#xA;&#xA;Seccomp Notify - Syscall Interception 2.0&#xA;&#xA;In its essence, the seccomp notify mechanism is simply a file descriptor (fd) for a specific seccomp filter.  When a container starts it will usually load a seccomp filter to restrict its attack surface.  
That is even done for unprivileged containers even though it is not strictly necessary.&#xA;&#xA;With the addition of seccomp notify a container wishing to have a subset of syscalls handled by another process can set the new SECCOMP_RET_USER_NOTIF flag on its seccomp filter.  This flag instructs the kernel to return a file descriptor to the calling task after having loaded its filter.  This file descriptor is a seccomp notify file descriptor.&#xA;&#xA;Of course, the seccomp notify fd is not very useful to the task itself.  First, since it doesn&#39;t make a lot of sense apart from very weird use-cases for a task to listen for its own syscalls.  Second, because the task would likely block itself indefinitely pretty quickly without taking extreme care.&#xA;&#xA;But what the task can do with the seccomp notify fd is to hand it to another task.  Usually the task that it will hand the seccomp notify fd to will be more privileged than itself.  For a container the most obvious candidate would be the container manager of course.&#xA;&#xA;Since the seccomp notify fd is pollable it is possible to put it into an event loop such as epoll(7), poll(2), or select(2) and wait for the file descriptor to become readable, i.e. for the kernel to return EPOLLIN to userspace.  For the seccomp notify fd to become readable means that the seccomp filter it refers to has detected that one of the tasks it has been applied to has performed a syscall that is part of the policy it implements.  This is a complicated way of saying the kernel is notifying the container manager that a task in the container has performed a syscall it cares about, e.g. mknod(2) or mount(2).&#xA;&#xA;Put another way, this means the container manager can listen for syscall events for tasks running in the container.  
Now instead of simply running the filter and immediately reporting back to the calling task the kernel will send a notification to the container manager on the seccomp notify fd and block the task performing the syscall.&#xA;&#xA;After the seccomp notify fd indicates that it is readable the container manager can use the new SECCOMP_IOCTL_NOTIF_RECV ioctl() associated with seccomp notify fds to read a struct seccomp_notif message for the syscall.  Currently the data to be read from the seccomp notify fd includes the following pieces.  But please be aware that we are in the process of discussing potentially intrusive changes for future versions:&#xA;&#xA;struct seccomp_notif {&#xA;&#x9;__u64 id;&#xA;&#x9;__u32 pid;&#xA;&#x9;__u32 flags;&#xA;&#x9;struct seccomp_data data;&#xA;};&#xA;&#xA;Let&#39;s look at this in a little more detail.  The pid field is the pid of the task that performed the syscall as seen in the caller&#39;s pid namespace.  To stay within the realm of our current examples, this is simply the pid of the task in the container that e.g. called mknod(2) as seen in the pid namespace of the container manager.  The id field is a unique identifier for the performed syscall.  This can be used to verify that the task is still alive and the syscall request still valid to avoid any race conditions caused by pid recycling.  The flags argument is currently unused and reserved for future extensions.&#xA;&#xA;The struct seccomp_data argument is probably the most interesting one as it contains the really exciting bits and pieces:&#xA;&#xA;struct seccomp_data {&#xA;&#x9;int nr;&#xA;&#x9;__u32 arch;&#xA;&#x9;__u64 instruction_pointer;&#xA;&#x9;__u64 args[6];&#xA;};&#xA;&#xA;The nr field is the syscall number which can only be correctly interpreted relative to the arch field.  The arch field is the (audit) architecture for which this syscall was made.  This field is very relevant since compatible architectures (For the x86 architectures this encompasses at least x32, i386, and x86_64. 
The arm, mips, and power architectures also have compatible &#34;sub&#34; architectures.) are stackable and the returned syscall number might be different than the current headers imply (For example, you could be making a syscall from a 32bit userspace on a 64bit kernel. If the intercepted syscall has different syscall numbers on 32 bit and on 64 bit, for example syscall foo() might have syscall number 1 on 32 bit and 2 on 64 bit.  So the task reading the seccomp data can&#39;t simply assume that since it itself is running in a 32 bit environment the syscall number must be 1.  Rather, it must check what the audit arch is and then check whether the syscall number is 1 on 32 bit or 2 on 64 bit.  Otherwise the container manager might end up emulating mount() when it should be emulating mknod().)  The instruction_pointer field is set to the address of the instruction that performed the syscall. This is of course also architecture specific.  And last, the args member holds the syscall arguments that the task performed the syscall with.&#xA;&#xA;The args need to be interpreted and treated differently depending on the syscall layout and their type.  If they are non-pointer arguments (unsigned int etc.) they can be copied into a local variable and interpreted right away.  But if they are pointer arguments they are offsets into the virtual memory of the task that performed the syscall.  In the latter case the memory needs to be read and copied before it can be interpreted.&#xA;&#xA;Let&#39;s look at a concrete example to figure out why it is vital to know the syscall layout other than for knowing the types of the syscall arguments.  Say the performed syscall was mount(2).  In order to interpret the args field correctly we look at the syscall layout of mount().  (Please note that I&#39;m stressing that we need to look at the layout of the syscall and the only reliable source for this is actually the kernel source code.  
The Linux manpages often list the wrapper provided by the system&#39;s libc and these wrappers do not necessarily line up with the syscall itself (compare the waitid() wrapper and the waitid() syscall or the various clone() syscall layouts).)  From the layout of mount(2) we see that args[0] is a pointer argument identifying the source path, args[1] is another pointer argument identifying the target path, args[2] is a pointer argument identifying the filesystem type, args[3] is a non-pointer argument identifying the options, and args[4] is another pointer argument identifying additional mount options.&#xA;&#xA;So if we were interested in the source path of this mount(2) syscall we would need to open the /proc/pid/mem file of the task that performed this syscall and e.g. use the pread(2) function with args[0] as the offset into the task&#39;s virtual memory and read it into a buffer at least the length of a standard path.  Alternatively, we can use a single syscall like process_vm_readv(2) to read multiple remote pointers at different locations all in one go.  Once we have done this we can interpret it.&#xA;&#xA;A piece of friendly advice: in general it is a good idea for the container manager to read all syscall arguments once into a local buffer and base its decisions on how to proceed on the data in this local buffer.  Not just because the container manager will otherwise not be able to interpret pointer arguments, but also because this closes a possible attack vector: a sufficiently privileged attacker (e.g. a thread in the same thread-group) can write to /proc/pid/mem and change the contents of e.g. args[0] or any other syscall argument.  
Also note that the container manager should ensure that /proc/pid still refers to the same task after opening it by checking the validity of the syscall request via the id field and the associated SECCOMP_IOCTL_NOTIF_ID_VALID ioctl() to exclude the possibility of the task having exited, been reaped and its pid having been recycled.&#xA;&#xA;But let&#39;s assume we have done all that.  Now that the container manager has the task&#39;s syscall arguments available in a local buffer it can interpret the syscall arguments.  While it is doing so the target task remains blocked waiting for the kernel to tell it to proceed.  After the container manager is done interpreting the arguments and has performed whatever action it wanted to perform it can use the SECCOMP_IOCTL_NOTIF_SEND ioctl() on the seccomp notify fd to tell the kernel what it should do with the blocked task&#39;s syscall.  The response is given in the form of a struct seccomp_notif_resp:&#xA;&#xA;struct seccomp_notif_resp {&#xA;&#x9;__u64 id;&#xA;&#x9;__s64 val;&#xA;&#x9;__s32 error;&#xA;&#x9;__u32 flags;&#xA;};&#xA;&#xA;Let&#39;s look at this struct in a little more detail too.  The id field is set to the id of the syscall request to respond to and should correspond to the received id in the struct seccomp_notif that the container manager read via the SECCOMP_IOCTL_NOTIF_RECV ioctl() when the seccomp notify fd became readable.  The val field is the return value of the syscall and is only used if the error field is set to 0.  The error field is the error to return from the syscall and should be set to a negative errno(3) code if the syscall is supposed to fail (For example, to trick the caller into thinking that mount(2) is not supported on this kernel set error to -ENOSYS.).  
The flags value can be used to tell the kernel to continue the syscall by setting the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag which I added to be able to intercept mount(2) and other syscalls that are difficult for seccomp to filter efficiently because of the restrictions around pointer arguments.  More on that in a little bit.&#xA;&#xA;With this machinery in place we are for now ;) done with the kernel bits.&#xA;&#xA;Emulating Syscalls In Userspace&#xA;&#xA;So what is the container manager supposed to do after having read and interpreted the syscall information for the task running in the container, before telling the kernel to let the task continue?  Probably emulate the syscall.  Otherwise we just have a fancy and less performant seccomp userspace policy (Please read my comments on why that is a very bad idea.).&#xA;&#xA;Emulating syscalls in userspace is not a very new thing to do.  It has been done for a long time.  For example, libcs can choose to emulate the execveat(2) syscall which allows a task to exec a program by providing a file descriptor to the binary instead of a path.  On a kernel that doesn&#39;t support the execveat(2) syscall the libc can emulate it by calling exec(3) with the path set to /proc/self/fd/nr.  The problem of course is that this emulation only works when the task in question actually uses the libc wrapper (fexecve(3) for our example).  Any task using syscall(__NR_execveat, [...]) to perform the syscall without going through the provided wrapper will be bypassing libc, and so libc doesn&#39;t know that the task wants to perform the execveat(2) syscall and will not be able to emulate it in case the kernel doesn&#39;t support it.&#xA;&#xA;Seccomp notify doesn&#39;t suffer from this problem since its syscall interception abilities aren&#39;t located in userspace at the library level but directly in the syscall path as we have seen.  
This greatly expands the abilities to emulate syscalls.&#xA;&#xA;So now we have all the kernel pieces in place to solve our mknod(2) and mount(2) problem in unprivileged containers.  Instead of simply letting the container fail on such harmless requests as creating the /dev/zero device node we can use seccomp notify to intercept the syscall and emulate it for the container in userspace by simply creating the device node for it.  Similarly, we can intercept mount(2) requests, requiring the user to e.g. give us a list of allowed filesystems to mount for the container, and perform the mount for the container.  We can even make this a lot safer by providing a user with the ability to specify a fuse binary that should be used when a task in the container tries to mount a filesystem.  We actually support this feature in LXD.  Since fuse is a safe way for unprivileged users to mount filesystems, rewriting mount(2) requests is a great way to expose filesystems to containers.&#xA;&#xA;In general, the possibilities of seccomp notify can&#39;t be overstated and we are extremely happy that this work is now not just fully integrated into the Linux kernel but also into both LXD and LXC.  As with many other technologies we have driven both in the upstream kernel and in userspace, it directly benefits not just our users but all of userspace, with seccomp notify seeing adoption in browsers and by other companies.  A whole range of Travis workloads can now run in unprivileged LXD containers thanks to seccomp notify.&#xA;&#xA;Seccomp Notify in action - LXD&#xA;&#xA;After finishing the kernel bits we implemented support for it in LXD and the LXC shared library it uses.  Instead of simply exposing the raw seccomp notify fd for the container&#39;s seccomp filter directly to LXD, each container connects to a multi-threaded socket that the LXD container manager exposes and on which it listens for new clients.  
Clients here are new containers that the administrator has signed up for syscall supervision through LXD.  Each container has a dedicated syscall supervisor which runs as a separate goroutine and stays around for as long as the container is running.&#xA;&#xA;When the container performs a syscall that the filter applies to, a notification is generated on the seccomp notify fd.  The container then forwards this request including some additional data on the socket it connected to during startup by sending a unix message including necessary credentials.  LXD then interprets the message, checking the validity of the request, verifying the credentials, and processing the syscall arguments.  If LXD can prove that the request is valid according to the policy the administrator specified for the container, LXD will proceed to emulate the syscall.  For mknod(2) it will create the device node for the container and for mount(2) it will mount the filesystem for the container; either by directly mounting it or by using a specified fuse binary for additional security.&#xA;&#xA;If LXD manages to emulate the syscall successfully it will prepare a response that it will forward on the socket to the container.  The container then parses the message, verifying the credentials, and will use the SECCOMP_IOCTL_NOTIF_SEND ioctl(), sending a struct seccomp_notif_resp, causing the kernel to unblock the task performing the syscall and to report back that the syscall succeeded.  Conversely, if LXD fails to emulate the syscall for whatever reason or the syscall is not allowed by the policy the administrator specified, it will prepare a message that instructs the container to report back that the syscall failed, unblocking the task.&#xA;&#xA;Show Me!&#xA;&#xA;Ok, enough talk.  Let&#39;s intercept some syscalls.  
The following demo shows how LXD uses the seccomp notify fd to emulate the mknod(2) and mount(2) syscalls for an unprivileged container:&#xA;&#xA;asciicast&#xA;&#xA;Current Work and Future Directions&#xA;&#xA;SECCOMP_USER_NOTIF_FLAG_CONTINUE&#xA;&#xA;After the initial support for the seccomp notify fd landed we ran into limitations pretty quickly.  We realized we couldn&#39;t intercept the mount syscall.  Since the mount syscall has various pointer arguments it is difficult to write highly specific seccomp filters such that we only accept syscalls that we intended to intercept.  This is caused by seccomp not being able to handle pointer arguments.  They are opaque for seccomp.  So while it is possible to tell seccomp to only intercept mount(2) requests for real filesystems by only intercepting mount(2) syscalls where the MS_BIND flag is not set in the flags argument, it is not possible to write a seccomp filter that only notifies the container manager about mount(2) syscalls for the ext4 or btrfs filesystems, because the filesystem argument is a pointer.&#xA;&#xA;But this means we will inadvertently intercept syscalls that we didn&#39;t intend to intercept.  That is a generic problem but for some syscalls it&#39;s not really a big deal.  For example, we know that mknod(2) fails for all character and block devices in unprivileged containers.  So as long as we write a seccomp filter that intercepts only character and block device mknod(2) syscalls but no socket or fifo mknod(2) syscalls we don&#39;t have a problem.  For any character or block device that is not in the list of allowed devices in LXD we can simply instruct LXD to prepare a seccomp message that tells the kernel to report EPERM, and since the syscalls would fail anyway there&#39;s no problem.&#xA;&#xA;But any system call that we intercepted as a consequence of seccomp not being able to filter on pointer arguments, and that would succeed in unprivileged containers, would need to be emulated in userspace.  
But this would of course include all mount(2) syscalls for filesystems that can be mounted in unprivileged containers.  I&#39;ve listed a subset of them above.  It includes at least tmpfs, proc, sysfs, devpts, cgroup, cgroup2 and probably a few others I&#39;m forgetting.  That&#39;s not ideal.  We only want to emulate syscalls that we really have to emulate, i.e. those that would actually fail.&#xA;&#xA;The solution to this problem was a patchset of mine that added the ability to continue an intercepted syscall.  To instruct the kernel to continue the syscall the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag can be set in struct seccomp_notif_resp&#39;s flags argument when instructing the kernel to unblock the task.&#xA;&#xA;This is of course a very exciting feature and has a few readers probably thinking &#34;Hm, I could implement a dynamic userspace seccomp policy.&#34; to which I want to very loudly respond &#34;No, you can&#39;t!&#34;.  In general, the seccomp notify fd cannot be used to implement any kind of security policy in userspace.  I&#39;m now going to mostly quote verbatim from my comment for the extension: The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with extreme caution!  If set by the task supervising the syscalls of another task the syscall will continue.  This is problematic because of an inherent TOCTOU (Time of Check-Time of Use) race.  An attacker can exploit the time while the supervised task is waiting on a response from the supervising task to rewrite syscall arguments which are passed as pointers of the intercepted syscall.  It should be absolutely clear that this means that seccomp notify cannot be used to implement a security policy on syscalls that read from dereferenced pointers in user space!  It should only ever be used in scenarios where a more privileged task supervises the syscalls of a lesser privileged task to get around kernel-enforced security restrictions when the privileged task deems this safe.  
In other words, in order to continue a syscall the supervising task should be sure that another security mechanism or the kernel itself will sufficiently block syscalls if arguments are rewritten to something unsafe.&#xA;&#xA;Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF or SECCOMP_RET_TRACE.  For SECCOMP_RET_USER_NOTIF filters acting on the same syscall, the most recently added filter takes precedence.  This means that the new SECCOMP_RET_USER_NOTIF filter can override any SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all such filtered syscalls to be executed by sending the response SECCOMP_USER_NOTIF_FLAG_CONTINUE.  Note that SECCOMP_RET_TRACE can equally be overridden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.&#xA;&#xA;Retrieving file descriptors with pidfd_getfd()&#xA;&#xA;Another extension that was added by Sargun Dhillon recently, building on top of my pidfd work, makes it possible to retrieve file descriptors from another task.  This works even without seccomp notify since it is a new syscall, but it is of course especially useful in conjunction with it.&#xA;&#xA;Often we would like to intercept syscalls such as connect(2).  For example, the container manager might want to rewrite the connect(2) request to something other than the task intended for security reasons or because the task lacks the necessary information about the networking layout to connect to the right endpoint.  In these cases pidfd_getfd(2) can be used to retrieve a copy of the file descriptor of the task and perform the connect(2) for it.  This unblocks another wide range of use-cases.&#xA;&#xA;For example, it can be used for further introspection into file descriptors than ss or netstat would typically give you, as you can do things like run getsockopt(2) on the file descriptor, and you can use options like TCP_INFO to fetch a significant amount of information about the socket. 
Not only can you fetch information about the socket, but you can also set fields like TCP_NODELAY, to tune the socket without requiring the user&#39;s intervention. This mechanism, in conjunction with seccomp notify, can be used to build a rudimentary layer 4 load balancer where connect(2) calls are intercepted and the destination is changed to a real server instead.&#xA;&#xA;Early results indicate that this method can yield incredibly good latency as compared to other layer 4 load balancing techniques.&#xA;&#xA;A graph of these latency measurements is available at https://plotly.com/~sargun/63/?share_key=TBxaZob2h9GiGD9LxVuFuE&#xA;&#xA;Injecting file descriptors with SECCOMP_NOTIFY_IOCTL_ADDFD&#xA;&#xA;Current work for the upcoming merge window is focussed on making it possible to inject file descriptors into a task.  As things stand, we are unable to intercept syscalls (Unless we share the file descriptor table with the task which is usually never the case for container managers and the containers they supervise.) such as open(2) that cause new file descriptors to be installed in the task performing the syscall.&#xA;&#xA;The new seccomp extension effectively allows the container manager to instruct the target task to install a set of file descriptors into its own file descriptor table before instructing it to move on.  This way it is possible to intercept syscalls such as open(2) or accept(2), and install (or replace, like dup2(2)) the container manager&#39;s resulting fd in the target task.&#xA;&#xA;This new technique opens the door to being able to make massive changes in userspace. 
For example, techniques such as enabling unprivileged access to perf_event_open(2), and bpf(2) for tracing, are available via this mechanism. The manager can inspect the program, and the way the perf events are being set up, to prevent the user from doing ill to the system. On top of that, various network techniques are being introduced, such as zero-cost IPv6 transition mechanisms in the future.&#xA;&#xA;Last, I want to note that Sargun Dhillon was kind enough to contribute paragraphs to the pidfd_getfd(2) and SECCOMP_NOTIFY_IOCTL_ADDFD sections. He also provided the graphic in the pidfd_getfd(2) section to illustrate the performance benefits of this solution.&#xA;&#xA;Christian]]&gt;</description>
      <content:encoded><![CDATA[<h4 id="introduction">Introduction</h4>

<p>As most people know by now we do a lot of upstream kernel development.  This stretches over multiple areas and of course we also do a lot of kernel work around containers.  In this article I&#39;d like to take a closer look at the new seccomp notify feature we have been developing both in the kernel and in userspace and that is seeing more and more users.  I&#39;ve talked about this feature quite a few times at various conferences (just recently again at <a href="https://ossna2020.sched.com/event/c3WE/making-unprivileged-containers-more-useable-christian-brauner-canonical" rel="nofollow">OSS NA</a>) over the last two years but never actually sat down to write a blogpost about it.  This is something I had wanted to do for quite some time: first, because it is a very exciting feature from a purely technical perspective, and second, because of the new possibilities it opens up for (unprivileged) containers and other use-cases.</p>

<h4 id="the-limits-of-unprivileged-containers">The Limits of Unprivileged Containers</h4>

<p>That (Linux) Containers are a userspace fiction is a well-known dictum nowadays.  It simply expresses the fact that there is no container kernel object in the Linux kernel.  Instead, userspace is relatively free to define what a container is.  But for the most part userspace agrees that a container is somehow concerned with isolating a task or a task tree from the host system.  This is achieved by combining a multitude of Linux kernel features.  One of the better known kernel features used to build containers is namespaces.  The number of namespaces the kernel supports has grown over time and we are currently at eight.  Before you go and look them up on <code>namespaces(7)</code> here they are:</p>
<ul><li>cgroup: <code>cgroup_namespaces(7)</code></li>
<li>ipc: <code>ipc_namespaces(7)</code></li>
<li>network: <code>network_namespaces(7)</code></li>
<li>mount: <code>mount_namespaces(7)</code></li>
<li>pid: <code>pid_namespaces(7)</code></li>
<li>time: <code>time_namespaces(7)</code></li>
<li>user: <code>user_namespaces(7)</code></li>
<li>uts: <code>uts_namespaces(7)</code></li></ul>

<p>Of these eight namespaces the user namespace is the only one concerned with isolating core privilege concepts on Linux such as user- and group ids, and capabilities.</p>

<p>Quite often we see tasks in userspace that check whether they run as root or whether they have a specific capability (e.g. <code>CAP_MKNOD</code> is required to create device nodes) and it seems that when the answer is “yes” then the task is actually a privileged task.  But as usual things aren&#39;t that simple.  What the task thinks it&#39;s checking for and what the kernel really is checking for are possibly two very different things.  A naive task, i.e. a task not aware of user namespaces, might think it&#39;s asking whether it is privileged with respect to the whole system aka the host but what the kernel really checks for is whether the task has the necessary privileges relative to the user namespace it is located in.</p>

<p>In most cases the kernel will not check whether the task is privileged with respect to the whole system.  Instead, it will almost always call a function called <code>ns_capable()</code> which is the kernel&#39;s way of checking whether the calling task has privilege in its current user namespace.</p>

<p>For example, when a new user namespace is created by setting the <code>CLONE_NEWUSER</code> flag in <code>unshare(2)</code> or in <code>clone3(2)</code> the kernel will grant a full set of capabilities to the task that called <code>unshare(2)</code> or the newly created child task via <code>clone3(2)</code> <em>within</em> the new user namespace.  When this task now e.g. checks whether it has the <code>CAP_MKNOD</code> capability the kernel will report back that it indeed has that capability.  The key point though is that this “yes” is not a global “yes”, i.e. the question “Am I privileged enough to perform this operation?” only applies to the current user namespace (and technically any nested user namespaces) not the host itself.</p>

<p>This distinction is important when trying to understand why a task running as root in a new user namespace with all capabilities raised will still see <code>EPERM</code> when e.g. trying to call <code>mknod(&#34;/dev/mem&#34;, S_IFCHR, makedev(1, 1))</code> even though it seems to have all necessary privileges.  The reason for this counterintuitive behavior is that the kernel isn&#39;t always checking whether you are privileged against your current user namespace.  Instead, for any operation that it thinks is dangerous to expose to unprivileged users it will check whether the task is privileged in the initial user namespace, i.e. the host&#39;s user namespace.</p>

<p>Creating device nodes is one such example: if a task running in a user namespace were to be able to create character or block device nodes it could e.g. create <code>/dev/kmem</code> or any other critical device and use the device to take over the host.  So the kernel simply blocks creating all device nodes in user namespaces by always performing the check for required privileges against the initial user namespace.  This is of course technically inconsistent since capabilities are per user namespace as we observed above.</p>

<p>Other examples where the kernel requires privileges in the initial user namespace are mounting of block devices.  So simply making a disk device node available to an unprivileged container will still not make it useable since it cannot mount it.  On the other hand, some filesystems like <code>cgroup</code>, <code>cgroup2</code>, <code>tmpfs</code>, <code>proc</code>, <code>sysfs</code>, and <code>fuse</code> can be mounted in user namespaces (with some caveats for <code>proc</code> and <code>sysfs</code> but we&#39;re ignoring those details for now) because the kernel can guarantee that this is safe.</p>

<p>But of course these restrictions are annoying.  Not being able to mount block devices or create device nodes means quite a few workloads are not able to run in containers even though they could be made to run safely.  Quite often a container manager like <code>LXD</code> will know better than the kernel when an operation that a container tries to perform is safe.</p>

<p>A good example is device nodes.  Most containers bind-mount the set of standard devices into the container; otherwise it would not work correctly:</p>

<pre><code>/dev/console
/dev/full
/dev/null
/dev/random
/dev/tty
/dev/urandom
/dev/zero
</code></pre>

<p>Allowing a container to create these devices would be safe.  Of course, the container runtime will simply bind-mount these devices into the container during startup so this isn&#39;t really a serious problem.  But any program running inside the container that wants to create these harmless device nodes itself would fail.</p>

<p>The other example that was mentioned earlier is mounting of block-based filesystems.  Our users often instruct LXD to make certain disk devices available to their containers because they know that it is safe.  For example, they could have a dedicated disk for the container or they want to share data with or among containers.  But the container could not mount any of those disks.</p>

<p>For any use-case where the administrator is aware that a device node or disk device is missing from the container LXD provides the ability to hotplug them into one or multiple containers.  For example, here is how you&#39;d hotplug <code>/dev/zero</code> into a running container:</p>

<pre><code> brauner@wittgenstein|~
&gt; lxc exec f5 -- ls -al /my/zero

brauner@wittgenstein|~
&gt; lxc config device add f5 zero-device unix-char source=/dev/zero path=/my/zero
Device zero-device added to f5

brauner@wittgenstein|~
&gt; lxc exec f5 -- ls -al /my/zero
crw-rw---- 1 root root 1, 5 Jul 23 10:47 /my/zero
</code></pre>

<p>But of course, that doesn&#39;t help at all when a random application inside the container calls <code>mknod(2)</code> itself.  In these cases LXD has no way of helping the application by hotplugging the device as it&#39;s unaware that a mknod syscall has been performed.</p>

<p>So the root of the problem seems to be:
– A task inside the container performs a syscall that will fail.
– The syscall would not need to fail since the container manager knows that it is safe.
– The container manager has no way of knowing when such a syscall is performed.
– Even if the container manager knew when such a syscall was performed it would have no way of inspecting it in detail.</p>

<p>So a potential solution to this problem seems to be to enable the container manager or any sufficiently privileged task to take action on behalf of the container whenever it performs a syscall that would usually fail.  So somehow we need to be able to interact with the syscalls of another task.</p>

<h4 id="seccomp-the-basics-of-syscall-interception">Seccomp – The Basics of Syscall Interception</h4>

<p>The obvious candidate to look at is seccomp.  Short for “secure computing” it provides a way of restricting the syscalls of a task either by allowing only a subset of the syscalls the kernel supports or by denying a set of syscalls it thinks would be unsafe for the task in question.  But seccomp allows even more advanced configurations through so-called “filters”.  Filters are BPF programs (Not to be equated with eBPF. BPF is a predecessor of eBPF.) that can be written in userspace and loaded into the kernel.  For example, a task could use a seccomp filter to only allow the <code>mount()</code> syscall and only those mount syscalls that create bind mounts.  This simple syscall management mechanism has made seccomp an essential security feature for a lot of userspace programs.  Nowadays it is considered good practice to restrict any critical program to only those syscalls it absolutely needs to run successfully.  Browser-based sandboxes and containers are prime examples, but even systemd services can be seccomp restricted.</p>

<p>At its core seccomp is nothing but a syscall interception mechanism.  One way or another every operating system has something that is at least roughly comparable.  The way seccomp works is that it intercepts syscalls right in the architecture specific syscall entry paths.  So the seccomp invocations themselves live in the architecture specific codepaths although most of the logic around it is architecture agnostic.</p>

<p>Usually, when a syscall is performed and no seccomp filter has been applied to the task issuing the syscall, the kernel will simply look up the syscall number in the architecture specific syscall table and, if it is a known syscall, perform it, reporting back the result to userspace.</p>

<p>But when a seccomp filter is loaded for the task issuing the syscall instead of directly looking up the syscall number in the architecture&#39;s syscall table the kernel will first call into seccomp and run the loaded seccomp filter.</p>

<p>Depending on whether a deny or allow approach is used for the seccomp filter any syscall that the filter is not handling specifically is either performed or denied reporting back a specified default value to the calling task.  If the requested syscall is supposed to be specifically handled by the seccomp filter the kernel can e.g. be caused to report back a specific error code.  This way, it is for example possible to have the kernel pretend like it doesn&#39;t know the <code>mount(2)</code> syscall by creating a seccomp filter that reports back <code>ENOSYS</code> whenever the task tries to call <code>mount(2)</code>.</p>

<p>But the way seccomp used to work isn&#39;t very dynamic.  Specifically, once a filter is loaded the decision whether or not the syscall is successful is fixed based on the policy expressed by the filter.  So there is no way to make a case-by-case decision which might come in handy in some scenarios.</p>

<p>In addition seccomp itself can&#39;t make a syscall actually succeed other than in the trivial way of reporting back success to the caller.  So seccomp will only allow the kernel to pretend that a syscall succeeded.  So while it is possible to instruct the kernel to return 0 for the <code>mount(2)</code> syscall it cannot actually be instructed to make the <code>mount(2)</code> syscall succeed.  So just making the seccomp filter return 0 for mounting a dedicated <code>ext4</code> disk device to <code>/mnt</code> will still not actually mount it at <code>/mnt</code>; it just pretends to the caller that it did.  Of course that is in itself already a useful property for a bunch of use-cases but it doesn&#39;t really help with the <code>mknod(2)</code> or <code>mount(2)</code> problem outlined above.</p>

<h4 id="extending-seccomp">Extending Seccomp</h4>

<p>So from the section above it should be clear that seccomp provides a few desirable properties that make it a natural candidate to look at to help solve our <code>mknod(2)</code> and <code>mount(2)</code> problem.  Since seccomp intercepts syscalls early in the syscall path it already gives us a hook into the syscall path of a given task.  What is missing though is a way to bring another task such as the LXD container manager into the picture.  Somehow we need to modify seccomp in a way that makes it possible for a container manager to not just be informed when a task inside the container performs a syscall it wants to be informed about, but also to block the task until the container manager instructs the kernel to allow it to proceed.</p>

<p>The answer to these questions is seccomp notify.  This is as good a time as any to bring in some historical context.  The exact origins of the idea for a more dynamic way to intercept syscalls are probably not recoverable; it has been thrown around in unspecific form in various discussions but nothing serious ever materialized.  The first concrete details around seccomp notify were conceived in early 2017 in the LXD team.  The first public talk around the basic idea for this feature was given by Stéphane Graber at the Linux Plumbers Conference 2017 during the Containers Microconference in Los Angeles.  The details of this talk are still listed <a href="https://blog.linuxplumbersconf.org/2017/ocw/sessions/4795.html" rel="nofollow">here</a> and I&#39;m sure Stéphane can still provide the slides we came up with.  I didn&#39;t find a video recording even though I somehow thought we did have one.  If someone is really curious I can try to investigate with the Linux Plumbers committee.  After this talk implementation specifics were discussed in a hallway meeting later that day.  And after a long arduous journey the implementation was upstreamed by Tycho Andersen who used to be on the LXD team.  The rest is history^wchangelog.</p>

<h4 id="seccomp-notify-syscall-interception-2-0">Seccomp Notify – Syscall Interception 2.0</h4>

<p>In its essence, the seccomp notify mechanism is simply a file descriptor (fd) for a specific seccomp filter.  When a container starts it will usually load a seccomp filter to restrict its attack surface.  This is done even for unprivileged containers, although it is not strictly necessary there.</p>

<p>With the addition of seccomp notify a container wishing to have a subset of syscalls handled by another process can set the new <code>SECCOMP_RET_USER_NOTIF</code> flag on its seccomp filter.  This flag instructs the kernel to return a file descriptor to the calling task after having loaded its filter.  This file descriptor is a seccomp notify file descriptor.</p>
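<p>As a rough sketch (this is not LXD&#39;s actual code), loading such a filter from C might look as follows.  The filter below only traps <code>mknodat(2)</code> on the native architecture and, for brevity, skips the <code>arch</code> check that any real filter must perform (more on that below):</p>

```c
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Load a filter that returns SECCOMP_RET_USER_NOTIF for mknodat(2) and
 * allows everything else.  Thanks to SECCOMP_FILTER_FLAG_NEW_LISTENER,
 * seccomp(2) returns the seccomp notify fd on success. */
static int install_notify_filter(void)
{
	struct sock_filter filter[] = {
		/* NOTE: a real filter must also check seccomp_data.arch
		 * before trusting the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mknodat, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Unprivileged callers must set no_new_privs first. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
		return -1;

	return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
		       SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

<p>On success the returned value is the seccomp notify fd, which the task can then hand to its supervisor.</p>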

<p>Of course, the seccomp notify fd is not very useful to the task itself.  First, because apart from very weird use-cases it doesn&#39;t make a lot of sense for a task to listen for its own syscalls.  Second, because without taking extreme care the task would likely block itself indefinitely pretty quickly.</p>

<p>But what the task can do with the seccomp notify fd is to hand it to another task.  Usually that will be a task more privileged than itself.  For a container the most obvious candidate is of course the container manager.</p>

<p>Since the seccomp notify fd is pollable it is possible to put it into an event loop such as <code>epoll(7)</code>, <code>poll(2)</code>, or <code>select(2)</code> and wait for the file descriptor to become readable, i.e. for the kernel to return <code>EPOLLIN</code> to userspace.  For the seccomp notify fd to become readable means that the seccomp filter it refers to has detected that one of the tasks it has been applied to has performed a syscall that is part of the policy it implements.  This is a complicated way of saying the kernel is notifying the container manager that a task in the container has performed a syscall it cares about, e.g. <code>mknod(2)</code> or <code>mount(2)</code>.</p>
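<p>For illustration, a minimal <code>poll(2)</code>-based wait might look like this (a hypothetical helper, not taken from LXD; an <code>epoll(7)</code> loop would work just as well):</p>

```c
#include <poll.h>
#include <unistd.h>

/* Wait for a file descriptor (e.g. a seccomp notify fd) to become
 * readable.  Returns 1 when readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = {
		.fd = fd,
		.events = POLLIN,
	};
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret <= 0)
		return ret;
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

<p>When this reports the fd as readable, a notification is waiting to be read.</p>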

<p>Put another way, this means the container manager can listen for syscall events for tasks running in the container.  Now instead of simply running the filter and immediately reporting back to the calling task the kernel will send a notification to the container manager on the seccomp notify fd and block the task performing the syscall.</p>

<p>After the seccomp notify fd indicates that it is readable the container manager can use the new <code>SECCOMP_IOCTL_NOTIF_RECV</code> <code>ioctl()</code> associated with seccomp notify fds to read a <code>struct seccomp_notif</code> message for the syscall.  Currently the data to be read from the seccomp notify fd includes the following pieces.  But please be aware that we are in the process of discussing potentially intrusive changes for future versions:</p>

<pre><code class="language-c">struct seccomp_notif {
	__u64 id;
	__u32 pid;
	__u32 flags;
	struct seccomp_data data;
};
</code></pre>

<p>Let&#39;s look at this in a little more detail.  The <code>pid</code> field is the pid of the task that performed the syscall as seen in the caller&#39;s pid namespace.  To stay within the realm of our current examples, this is simply the pid of the task in the container that e.g. called <code>mknod(2)</code>, as seen in the pid namespace of the container manager.  The <code>id</code> field is a unique identifier for the performed syscall.  It can be used to verify that the task is still alive and the syscall request still valid, to avoid any race conditions caused by pid recycling.  The <code>flags</code> argument is currently unused and reserved for future extensions.</p>

<p>The <code>struct seccomp_data</code> argument is probably the most interesting one as it contains the really exciting bits and pieces:</p>

<pre><code class="language-c">struct seccomp_data {
	int nr;
	__u32 arch;
	__u64 instruction_pointer;
	__u64 args[6];
};
</code></pre>

<p>The <code>nr</code> field is the syscall number, which can only be correctly interpreted relative to the <code>arch</code> field.  The <code>arch</code> field is the (audit) architecture for which this syscall was made.  This field is very relevant since compatible architectures (For the <code>x86</code> architectures this encompasses at least <code>x32</code>, <code>i386</code>, and <code>x86_64</code>. The <code>arm</code>, <code>mips</code>, and <code>power</code> architectures also have compatible “sub” architectures.) are stackable and the returned syscall number might be different than the current headers imply (For example, you could be making a syscall from a 32 bit userspace on a 64 bit kernel, and the intercepted syscall may have different syscall numbers on 32 bit and on 64 bit: syscall <code>foo()</code> might have syscall number 1 on 32 bit and 2 on 64 bit.  So the task reading the seccomp data can&#39;t simply assume that since it itself is running in a 32 bit environment the syscall number must be 1.  Rather, it must check what the audit <code>arch</code> is and then verify that the value of the syscall is 1 on 32 bit or 2 on 64 bit.  Otherwise the container manager might end up emulating <code>mount()</code> when it should be emulating <code>mknod()</code>.).  The <code>instruction_pointer</code> is set to the address of the instruction that performed the syscall. This is of course also architecture specific.  And last, the <code>args</code> member contains the arguments that the task performed the syscall with.</p>

<p>The <code>args</code> need to be interpreted and treated differently depending on the syscall layout and their type.  If they are non-pointer arguments (<code>unsigned int</code> etc.) they can be copied into a local variable and interpreted right away.  But if they are pointer arguments they are offsets into the virtual memory of the task that performed the syscall.  In the latter case the memory needs to be read and copied before it can be interpreted.</p>

<p>Let&#39;s look at a concrete example to figure out why it is vital to know the syscall layout other than for knowing the types of the syscall arguments.  Say the performed syscall was <code>mount(2)</code>.  In order to interpret the <code>args</code> field correctly we look at the <em>syscall</em> layout of <code>mount()</code>.  (Please note, that I&#39;m stressing that we need to look at the layout of the <em>syscall</em> and the only reliable source for this is actually the kernel source code.  The Linux manpages often list the wrapper provided by the system&#39;s libc and these wrappers do not necessarily line up with the syscall itself (compare the <code>waitid()</code> wrapper and the <code>waitid()</code> syscall or the various <code>clone()</code> syscall layouts).) From the layout of <code>mount(2)</code> we see that <code>args[0]</code> is a pointer argument identifying the source path, <code>args[1]</code> is another pointer argument identifying the target path, <code>args[2]</code> is a pointer argument identifying the filesystem type, <code>args[3]</code> is a non-pointer argument identifying the options, and <code>args[4]</code> is another pointer argument identifying additional mount options.</p>

<p>So if we were to be interested in the source path of this <code>mount(2)</code> syscall we would need to open the <code>/proc/&lt;pid&gt;/mem</code> file of the task that performed this syscall and e.g. use the <code>pread(2)</code> function with <code>args[0]</code> as the offset into the task&#39;s virtual memory and read it into a buffer at least the length of a standard path.  Alternatively, we can use a single syscall like <code>process_vm_readv(2)</code> to read multiple remote pointers at different locations all in one go. Once we have done this we can interpret it.</p>
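<p>As a sketch, a hypothetical helper that copies a string argument out of the target task&#39;s memory with <code>pread(2)</code> could look like this (the helper name and error handling are mine):</p>

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy a NUL-terminated string out of another task's virtual memory by
 * pread(2)ing from /proc/<pid>/mem at the pointer value taken from the
 * intercepted syscall's seccomp_data.args[].  Returns the string length
 * or -1 on error. */
static ssize_t read_remote_string(pid_t pid, unsigned long addr,
				  char *buf, size_t size)
{
	char mem_path[64];
	ssize_t n;
	int fd;

	snprintf(mem_path, sizeof(mem_path), "/proc/%d/mem", (int)pid);
	fd = open(mem_path, O_RDONLY);
	if (fd < 0)
		return -1;

	n = pread(fd, buf, size - 1, addr);
	close(fd);
	if (n <= 0)
		return -1;

	buf[n] = '\0';
	return strlen(buf);
}
```

<p>For <code>mount(2)</code> the container manager would call this with the <code>pid</code> from the notification and <code>args[0]</code> as the address to recover the source path.</p>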

<p>A friendly advice: in general it is a good idea for the container manager to read all syscall arguments <em>once</em> into a local buffer and base its decisions on how to proceed on the data in this local buffer.  Not just because the container manager will otherwise not be able to interpret pointer arguments, but also because leaving the arguments in place is a possible attack vector: a sufficiently privileged attacker (e.g. a thread in the same thread-group) can write to <code>/proc/&lt;pid&gt;/mem</code> and change the contents of e.g. <code>args[0]</code> or any other syscall argument.  Also note that the container manager should ensure that <code>/proc/&lt;pid&gt;</code> still refers to the same task after opening it, by checking the validity of the syscall request via the <code>id</code> field and the associated <code>SECCOMP_IOCTL_NOTIF_ID_VALID</code> <code>ioctl()</code>, to exclude the possibility of the task having exited, been reaped and its pid having been recycled.</p>
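<p>The validity check itself is essentially a one-liner around the <code>SECCOMP_IOCTL_NOTIF_ID_VALID</code> <code>ioctl()</code>; here is a small hypothetical wrapper:</p>

```c
#include <linux/seccomp.h>
#include <linux/types.h>
#include <stdbool.h>
#include <sys/ioctl.h>

/* Returns true while the intercepted syscall request is still alive,
 * i.e. the target task has not died (and possibly had its pid recycled)
 * while we were poking around in /proc/<pid>. */
static bool notif_id_valid(int notify_fd, __u64 id)
{
	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}
```

<p>The check should be performed after opening <code>/proc/&lt;pid&gt;</code> so that a recycled pid cannot slip in between.</p>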

<p>But let&#39;s assume we have done all that.  Now that the container manager has the task&#39;s syscall arguments available in a local buffer it can interpret the syscall arguments.  While it is doing so the target task remains blocked waiting for the kernel to tell it to proceed.  After the container manager is done interpreting the arguments and has performed whatever action it wanted to perform it can use the <code>SECCOMP_IOCTL_NOTIF_SEND</code> <code>ioctl()</code> on the seccomp notify fd to tell the kernel what it should do with the blocked task&#39;s syscall.  The response is given in the form <code>struct seccomp_notif_resp</code>:</p>

<pre><code class="language-c">struct seccomp_notif_resp {
	__u64 id;
	__s64 val;
	__s32 error;
	__u32 flags;
};
</code></pre>

<p>Let&#39;s look at this struct in a little more detail too.  The <code>id</code> field is set to the <code>id</code> of the syscall request to respond to and should correspond to the received <code>id</code> in the <code>struct seccomp_notif</code> that the container manager read via the <code>SECCOMP_IOCTL_NOTIF_RECV</code> <code>ioctl()</code> when the seccomp notify fd became readable.  The <code>val</code> field is the return value of the syscall and is only set if the <code>error</code> field is set to 0.  The <code>error</code> field is the error to return from the syscall and should be set to a negative <code>errno(3)</code> code if the syscall is supposed to fail (For example, to trick the caller into thinking that <code>mount(2)</code> is not supported on this kernel set <code>error</code> to <code>-ENOSYS</code>.).  The <code>flags</code> value can be used to tell the kernel to continue the syscall by setting the <code>SECCOMP_USER_NOTIF_FLAG_CONTINUE</code> flag which I added to be able to intercept <code>mount(2)</code> and other syscalls that are difficult for seccomp to filter efficiently because of the restrictions around pointer arguments.  More on that in a little bit.</p>
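<p>Putting the pieces together, here is a self-contained sketch (my own minimal demo, assuming the native architecture and a kernel with seccomp notify support) in which a supervisor intercepts a forked child&#39;s <code>mknodat(2)</code> and denies it with <code>EPERM</code> instead of emulating it:</p>

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/sysmacros.h>
#include <sys/wait.h>
#include <unistd.h>

/* Trap mknodat(2), allow everything else, and return the notify fd via
 * SECCOMP_FILTER_FLAG_NEW_LISTENER.  A real filter would also check
 * seccomp_data.arch. */
static int install_notify_filter(void)
{
	struct sock_filter filter[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mknodat, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
		return -1;
	return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
		       SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}

/* Intercept one mknodat(2) from a forked child and deny it with EPERM.
 * Returns 0 if the child saw exactly the error we injected. */
static int supervise_once(void)
{
	struct seccomp_notif_resp resp = { 0 };
	struct seccomp_notif req = { 0 };
	int notify_fd, wstatus;
	pid_t pid;

	notify_fd = install_notify_filter();
	if (notify_fd < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* The child inherited the filter: this call blocks until
		 * the supervisor below sends a response. */
		int ret = mknodat(AT_FDCWD, "/tmp/fake-zero",
				  S_IFCHR | 0666, makedev(1, 5));
		_exit(ret < 0 && errno == EPERM ? 0 : 1);
	}

	/* Read the request... */
	if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
		return -1;
	printf("pid %d performed syscall %d\n", req.pid, req.data.nr);

	/* ...and deny it instead of emulating it. */
	resp.id = req.id;
	resp.error = -EPERM;
	if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0)
		return -1;

	if (waitpid(pid, &wstatus, 0) < 0)
		return -1;
	close(notify_fd);
	return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

<p>A real supervisor such as LXD would of course emulate the syscall instead of denying it, and would validate the request <code>id</code> before touching <code>/proc/&lt;pid&gt;</code>.</p>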

<p>With this machinery in place we are for now ;) done with the kernel bits.</p>

<h4 id="emulating-syscalls-in-userspace">Emulating Syscalls In Userspace</h4>

<p>So what is the container manager supposed to do after having read and interpreted the syscall information for the task running in the container, before telling the kernel to let the task continue?  Probably emulate the syscall.  Otherwise we just have a fancy and less performant seccomp userspace policy (Please read my comments on why that is a <em>very</em> bad idea.).</p>

<p>Emulating syscalls in userspace is not a very new thing to do.  It has been done for a long time.  For example, a libc can choose to emulate the <code>execveat(2)</code> syscall, which allows a task to exec a program by providing a file descriptor to the binary instead of a path.  On a kernel that doesn&#39;t support the <code>execveat(2)</code> syscall the libc can emulate it by calling <code>exec(3)</code> with the path set to <code>/proc/self/fd/&lt;nr&gt;</code>.  The problem of course is that this emulation only works when the task in question actually uses the libc wrapper (<code>fexecve(3)</code> for our example).  Any task using <code>syscall(__NR_execveat, [...])</code> to perform the syscall without going through the provided wrapper will be bypassing libc, so libc doesn&#39;t know that the task wants to perform the <code>execveat(2)</code> syscall and will not be able to emulate it in case the kernel doesn&#39;t support it.</p>

<p>Seccomp notify doesn&#39;t suffer from this problem since its syscall interception abilities aren&#39;t located in userspace at the library level but directly in the syscall path as we have seen.  This greatly expands the abilities to emulate syscalls.</p>

<p>So now we have all the kernel pieces in place to solve our <code>mknod(2)</code> and <code>mount(2)</code> problem in unprivileged containers.  Instead of simply letting the container fail on such harmless requests as creating the <code>/dev/zero</code> device node we can use seccomp notify to intercept the syscall and emulate it for the container in userspace by simply creating the device node for it.  Similarly, we can intercept <code>mount(2)</code> requests requiring the user to e.g. give us a list of allowed filesystems to mount for the container and performing the mount for the container.  We can even make this a lot safer by providing a user with the ability to specify a fuse binary that should be used when a task in the container tries to mount a filesystem.  We actually support this feature in LXD.  Since fuse is a safe way for unprivileged users to mount filesystems rewriting <code>mount(2)</code> requests is a great way to expose filesystems to containers.</p>

<p>In general, the possibilities of seccomp notify can&#39;t be overstated and we are extremely happy that this work is now not just fully integrated into the Linux kernel but also into both LXD and LXC.  As with many other technologies we have driven both in the upstream kernel and in userspace it directly benefits not just our users but all of userspace with seccomp notify seeing adoption in browsers and by other companies. A whole range of Travis workloads can now run in unprivileged LXD containers thanks to seccomp notify.</p>

<h4 id="seccomp-notify-in-action-lxd">Seccomp Notify in action – LXD</h4>

<p>After finishing the kernel bits we implemented support for it in LXD and the LXC shared library it uses.  Instead of simply exposing the raw seccomp notify fd for the container&#39;s seccomp filter directly to LXD, each container connects to a multi-threaded socket that the LXD container manager exposes and on which it listens for new clients.  Clients here are new containers that the administrator has signed up for syscall supervision through LXD.  Each container has a dedicated syscall supervisor which runs as a separate go routine and stays around for as long as the container is running.</p>

<p>When the container performs a syscall that the filter applies to a notification is generated on the seccomp notify fd.  The container then forwards this request including some additional data on the socket it connected to during startup by sending a unix message including necessary credentials.  LXD then interprets the message, checking the validity of the request, verifying the credentials, and processing the syscall arguments.  If LXD can prove that the request is valid according to the policy the administrator specified for the container LXD will proceed to emulate the syscall.  For <code>mknod(2)</code> it will create the device node for the container and for <code>mount(2)</code> it will mount the filesystem for the container.  Either by directly mounting it or by using a specified fuse binary for additional security.</p>

<p>If LXD manages to emulate the syscall successfully it will prepare a response that it will forward on the socket to the container.  The container then parses the message, verifying the credentials, and will use the <code>SECCOMP_IOCTL_NOTIF_SEND</code> <code>ioctl()</code> sending a <code>struct seccomp_notif_resp</code> causing the kernel to unblock the task performing the syscall and reporting back that the syscall succeeded.  Conversely, if LXD fails to emulate the syscall for whatever reason or the syscall is not allowed by the policy the administrator specified, it will prepare a message that instructs the container to report back that the syscall failed and to unblock the task.</p>

<h4 id="show-me">Show Me!</h4>

<p>Ok, enough talk.  Let&#39;s intercept some syscalls.  The following demo shows how LXD uses the seccomp notify fd to emulate the <code>mknod(2)</code> and <code>mount(2)</code> syscalls for an unprivileged container:</p>

<p><a href="https://asciinema.org/a/285491" rel="nofollow"><img src="https://asciinema.org/a/285491.svg" alt="asciicast"></a></p>

<h4 id="current-work-and-future-directions">Current Work and Future Directions</h4>

<h5 id="seccomp-user-notif-flag-continue"><code>SECCOMP_USER_NOTIF_FLAG_CONTINUE</code></h5>

<p>After the initial support for the seccomp notify fd landed we ran into limitations pretty quickly.  We realized we couldn&#39;t intercept the mount syscall.  Since the mount syscall has various pointer arguments it is difficult to write highly specific seccomp filters such that we only accept syscalls that we intended to intercept.  This is caused by seccomp not being able to handle pointer arguments.  They are opaque for seccomp.  So while it is possible to tell seccomp to only intercept <code>mount(2)</code> requests for real filesystems by only intercepting <code>mount(2)</code> syscalls where the <code>MS_BIND</code> flag is not set in the flags argument it is not possible to write a seccomp filter that only notifies the container manager about <code>mount(2)</code> syscalls for the <code>ext4</code> or <code>btrfs</code> filesystem because the filesystem argument is a pointer.</p>

<p>But this means we will inadvertently intercept syscalls that we didn&#39;t intend to intercept.  That is a generic problem but for some syscalls it&#39;s not really a big deal.  For example, we know that <code>mknod(2)</code> fails for all character and block devices in unprivileged containers.  So as long as we write a seccomp filter that intercepts only character and block device <code>mknod(2)</code> syscalls but no socket or fifo <code>mknod()</code> syscalls we don&#39;t have a problem.  For any character or block device that is not in the list of allowed devices in LXD we can simply instruct LXD to prepare a seccomp message that tells the kernel to report <code>EPERM</code> and since the syscalls would fail anyway there&#39;s no problem.</p>

<p>But <em>any</em> system call that we intercepted as a consequence of seccomp not being able to filter on pointer arguments that would succeed in unprivileged containers would need to be emulated in userspace.  But this would of course include all <code>mount(2)</code> syscalls for filesystems that can be mounted in unprivileged containers.  I&#39;ve listed a subset of them above.  It includes at least <code>tmpfs</code>, <code>proc</code>, <code>sysfs</code>, <code>devpts</code>, <code>cgroup</code>, <code>cgroup2</code> and probably a few others I&#39;m forgetting.  That&#39;s not ideal.  We only want to emulate syscalls that we really have to emulate, i.e. those that would actually fail.</p>

<p>The solution to this problem was a patchset of mine that added the ability to continue an intercepted syscall.  To instruct the kernel to continue the syscall the <code>SECCOMP_USER_NOTIF_FLAG_CONTINUE</code> flag can be set in <code>struct seccomp_notif_resp</code>&#39;s flag argument when instructing the kernel to unblock the task.</p>
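<p>A hypothetical helper that continues a syscall would simply set the flag in the response (the kernel rejects the response with <code>EINVAL</code> unless <code>val</code> and <code>error</code> are 0 when this flag is set):</p>

```c
#include <linux/seccomp.h>
#include <linux/types.h>
#include <sys/ioctl.h>

/* Let the intercepted syscall proceed natively in the kernel.  val and
 * error are deliberately left at 0 as the kernel requires. */
static int notif_continue(int notify_fd, __u64 id)
{
	struct seccomp_notif_resp resp = {
		.id = id,
		.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE,
	};

	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
}
```

<p>Heed the warning in the next paragraph before ever using this in a security-sensitive path.</p>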

<p>This is of course a very exciting feature and has a few readers probably thinking “Hm, I could implement a dynamic userspace seccomp policy.” to which I want to very loudly respond “No, you can&#39;t!”.  In general, the seccomp notify fd cannot be used to implement any kind of security policy in userspace.  I&#39;m now going to mostly quote verbatim from my comment for the extension: The <code>SECCOMP_USER_NOTIF_FLAG_CONTINUE</code> flag must be used with extreme caution!  If set by the task supervising the syscalls of another task the syscall will continue.  This is inherently problematic because of TOCTOU (time-of-check to time-of-use) races.  An attacker can exploit the time while the supervised task is waiting on a response from the supervising task to rewrite syscall arguments which are passed as pointers of the intercepted syscall.  It should be absolutely clear that this means that seccomp notify <em>cannot</em> be used to implement a security policy on syscalls that read from dereferenced pointers in user space!  It should only ever be used in scenarios where a more privileged task supervises the syscalls of a lesser privileged task to get around kernel-enforced security restrictions when the privileged task deems this safe.  In other words, in order to continue a syscall the supervising task should be sure that another security mechanism or the kernel itself will sufficiently block syscalls if arguments are rewritten to something unsafe.</p>

<p>Similar precautions should be applied when stacking <code>SECCOMP_RET_USER_NOTIF</code> or <code>SECCOMP_RET_TRACE</code>.  For <code>SECCOMP_RET_USER_NOTIF</code> filters acting on the same syscall, the most recently added filter takes precedence.  This means that the new <code>SECCOMP_RET_USER_NOTIF</code> filter can override any <code>SECCOMP_IOCTL_NOTIF_SEND</code> from earlier filters, essentially allowing all such filtered syscalls to be executed by sending the response <code>SECCOMP_USER_NOTIF_FLAG_CONTINUE</code>.  Note that <code>SECCOMP_RET_TRACE</code> can equally be overridden by <code>SECCOMP_USER_NOTIF_FLAG_CONTINUE</code>.</p>

<h5 id="retrieving-file-descriptors-pidfd-getfd">Retrieving file descriptors <code>pidfd_getfd()</code></h5>

<p>Another extension that was added by <a href="https://twitter.com/sargun" rel="nofollow">Sargun Dhillon</a> recently building on top of my pidfd work was to make it possible to retrieve file descriptors from another task.  This works even without seccomp notify since it is a new syscall but is of course especially useful in conjunction with it.</p>

<p>Often we would like to intercept syscalls such as <code>connect(2)</code>.  For example, the container manager might want to rewrite the <code>connect(2)</code> request to something other than the task intended for security reasons or because the task lacks the necessary information about the networking layout to connect to the right endpoint.  In these cases <code>pidfd_getfd(2)</code> can be used to retrieve a copy of the file descriptor of the task and perform the <code>connect(2)</code> for it.  This unblocks another wide range of use-cases.</p>
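<p>A sketch of such a retrieval, using raw syscall numbers in case the libc headers are older (the helper name is mine; with seccomp notify, <code>pid</code> would be <code>req-&gt;pid</code> and <code>remote_fd</code> a syscall argument such as the sockfd of an intercepted <code>connect(2)</code>):</p>

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* These syscall numbers are shared across most major architectures since
 * the syscall tables were unified; define them for older libc headers. */
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_pidfd_getfd
#define __NR_pidfd_getfd 438
#endif

/* Grab a duplicate of another task's file descriptor.  The caller must
 * have ptrace access to the target task. */
static int steal_fd(pid_t pid, int remote_fd)
{
	int pidfd, fd;

	pidfd = syscall(__NR_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	fd = syscall(__NR_pidfd_getfd, pidfd, remote_fd, 0);
	close(pidfd);
	return fd;
}
```

<p>The returned fd behaves like one obtained via <code>dup(2)</code> across processes, so the supervisor can e.g. perform the <code>connect(2)</code> on the task&#39;s behalf.</p>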

<p>For example, it can be used for further introspection into file descriptors than ss or netstat would typically give you, as you can do things like run <code>getsockopt(2)</code> on the file descriptor, and you can use options like <code>TCP_INFO</code> to fetch a significant amount of information about the socket. Not only can you fetch information about the socket, but you can also set fields like <code>TCP_NODELAY</code> to tune the socket without requiring the user&#39;s intervention. This mechanism, in conjunction with seccomp notify, can be used to build a rudimentary layer 4 load balancer where <code>connect(2)</code> calls are intercepted and the destination is changed to a real server instead.</p>

<p>Early results indicate that this method can yield incredibly good latency as compared to other layer 4 load balancing techniques.</p>

<div>
    <a href="https://plotly.com/~sargun/63/?share_key=TBxaZob2h9GiGD9LxVuFuE" target="_blank" title="Plot 63" style="display: block; text-align: center;" rel="nofollow noopener"><img src="https://plotly.com/~sargun/63.png?share_key=TBxaZob2h9GiGD9LxVuFuE" alt="Plot 63" style="max-width: 100%;width: 600px;" width="600"/></a>
</div>

<h5 id="injecting-file-descriptors-seccomp-notify-ioctl-addfd">Injecting file descriptors <code>SECCOMP_NOTIFY_IOCTL_ADDFD</code></h5>

<p>Current work for the upcoming merge window is focussed on making it possible to inject file descriptors into a task.  As things stand, we are unable to intercept syscalls (Unless we share the file descriptor table with the task which is usually never the case for container managers and the containers they supervise.) such as <code>open(2)</code> that cause new file descriptors to be installed in the task performing the syscall.</p>

<p>The new seccomp extension effectively allows the container manager to instruct the target task to install a set of file descriptors into its own file descriptor table before instructing it to move on.  This way it is possible to intercept syscalls such as <code>open(2)</code> or <code>accept(2)</code>, and install (or replace, like <code>dup2(2)</code>) the container manager&#39;s resulting fd in the target task.</p>

<p>This new technique opens the door to being able to make massive changes in userspace. For example, techniques such as enabling unprivileged access to <code>perf_event_open(2)</code> and <code>bpf(2)</code> for tracing are available via this mechanism. The manager can inspect the program and the way the perf events are being set up to prevent the user from doing ill to the system. On top of that, various network techniques are being introduced, such as zero-cost IPv6 transition mechanisms in the future.</p>

<p>Last, I want to note that <a href="https://twitter.com/sargun" rel="nofollow">Sargun Dhillon</a> was kind enough to contribute paragraphs to the <code>pidfd_getfd(2)</code> and <code>SECCOMP_NOTIFY_IOCTL_ADDFD</code> sections. He also provided the graphic in the <code>pidfd_getfd(2)</code> section to illustrate the performance benefits of this solution.</p>

<p>Christian</p>
]]></content:encoded>
      <author>Christian Brauner</author>
      <guid>https://people.kernel.org/read/a/p4s7vjufjs</guid>
      <pubDate>Thu, 23 Jul 2020 13:26:33 +0000</pubDate>
    </item>
    <item>
      <title>The CPU Cost of Networking on a Host</title>
      <link>https://people.kernel.org/dsahern/the-cpu-cost-of-networking-on-a-host</link>
      <description>&lt;![CDATA[When evaluating networking for a host the focus is typically on latency, throughput or packets per second (pps) to see the maximum load a system can handle for a given configuration. While those are important and often telling metrics, results for such benchmarks do not tell you the impact processing those packets has on the workloads running on that system.&#xA;&#xA;This post looks at the cost of networking in terms of CPU cycles stolen from processes running in a host.&#xA;&#xA;Packet Processing in Linux&#xA;Linux will process a fair amount of packets in the context of whatever is running on the CPU at the moment the irq is handled. System accounting will attribute those CPU cycles to any process running at that moment even though that process is not doing any work on its behalf. For example, &#39;top&#39; can show a process appears to be using 99+% cpu but in reality 60% of that time is spent processing packets meaning the process is really only get 40% of the CPU to make progress on its workload. &#xA;&#xA;net\rx\action, the handler for network Rx traffic, usually runs really fast - like under 25 usecs[1] -  dealing with up to 64 packets per napi instance (NIC and RPS) at a time before deferring to another softirq cycle. softirq cycles can be back to back, up to 10 times or 2 msec (see \\do\softirq), before taking a break. If the softirq vector still has more work to do after the maximum number of loops or time is reached, it defers further work to the ksoftirqd thread for that CPU. 
When that happens the system is a bit more transparent about the networking overhead in the sense that CPU usage can be monitored (though with the assumption that it is packet handling versus other softirqs).&#xA;&#xA;One way to see the above description is using perf:&#xA;sudo perf record -a \&#xA;        -e irq:irqhandlerentry,irq:irqhandlerexit \&#xA;        -e irq:softirqentry --filter=&#34;vec == 3&#34; \&#xA;        -e irq:softirqexit --filter=&#34;vec == 3&#34;  \&#xA;        -e napi:napipoll \&#xA;        -- sleep 1&#xA;&#xA;sudo perf script&#xA;The output is something like:&#xA;swapper     0 [005] 176146.491879: irq:irqhandlerentry: irq=152 name=mlx5comp2@pci:0000:d8:00.0&#xA;swapper     0 [005] 176146.491880:  irq:irqhandlerexit: irq=152 ret=handled&#xA;swapper     0 [005] 176146.491880:     irq:softirqentry: vec=3 [action=NETRX]&#xA;swapper     0 [005] 176146.491942:        napi:napipoll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 64 budget 64&#xA;swapper     0 [005] 176146.491943:      irq:softirqexit: vec=3 [action=NETRX]&#xA;swapper     0 [005] 176146.491943:     irq:softirqentry: vec=3 [action=NETRX]&#xA;swapper     0 [005] 176146.491971:        napi:napipoll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 27 budget 64&#xA;swapper     0 [005] 176146.491971:      irq:softirqexit: vec=3 [action=NETRX]&#xA;swapper     0 [005] 176146.492200: irq:irqhandlerentry: irq=152 name=mlx5comp2@pci:0000:d8:00.0&#xA;In this case the cpu is idle (hence swapper for the process), an irq fired for an Rx queue on CPU 5, softirq processing looped twice handling 64 packets and then 27 packets before exiting with the next irq firing 229 usec later and starting the loop again.&#xA;&#xA;The above was recorded on an idle system. 
In general, any task can be running on the CPU in which case the above series of events plays out by interrupting that task, doing the irq/softirq dance and with system accounting attributing cycles to the interrupted process. Thus, processing packets is typically hidden from the usual CPU monitoring as it is done in the context of some random, victim process, so how do you view or quantify the time a process is interrupted handling packets? And how can you compare 2 different networking solutions to see which one is less disruptive to a workload?&#xA;&#xA;With RSS, RPS, and flow steering, packet processing is usually distributed across cores, so the packet processing sequence describe above is all per-CPU. As packet rates increase (think 100,000 pps and up) the load means 1000&#39;s to 10,000&#39;s of packets are processed per second per cpu. Processing that many packets will inevitably have an impact on the workloads running on those systems.&#xA;&#xA;Let&#39;s take a look at one way to see this impact.&#xA;&#xA;Undo the Distributed Processing&#xA;First, let&#39;s undo the distributed processing by disabling RPS and installing flow rules to force the processing of all packets for a specific MAC address on a single, known CPU. My system has 2 nics enslaved to a bond in an 802.3ad configuration with the networking load targeted at a single virtual machine running in the host.&#xA;&#xA;RPS is disabled on the 2 nics using&#xA;for d in eth0 eth1; do&#xA;    find /sys/class/net/${d}/queues -name rpscpus |&#xA;    while read f; do&#xA;            echo 0 | sudo tee ${f}&#xA;    done&#xA;done&#xA;Next, add flow rules to push packets for the VM under test to a single CPU&#xA;DMAC=12:34:de:ad:ca:fe&#xA;sudo ethtool -N eth0 flow-type ether dst ${DMAC} action 2&#xA;sudo ethtool -N eth1 flow-type ether dst ${DMAC} action 2&#xA;Together, lack of RPS + flow rules ensure all packets destined to the VM are processed on the same CPU. 
You can use a command like ethq[3] to verify packets are directed to the expected queue and then map that queue to a CPU using /proc/interrupts. In my case queue 2 is handled on CPU 5.&#xA;&#xA;openssl speed&#xA;I could use perf or a bpf program to track softirq entry and exit for network Rx, but that gets complicated quick, and the observation will definitely influence the results. A much simpler and more intuitive solution is to infer the networking overhead using a well known workload such as &#39;openssl speed&#39; and look at how much CPU access it really gets versus is perceived to get (recognizing the squishiness of process accounting). &#xA;&#xA;&#39;openssl speed&#39; is a nearly 100% userspace command and when pinned to a CPU will use all available cycles for that CPU for the duration of its tests. The command works by setting an alarm for a given interval (e.g., 10 seconds here for easy math), launches into its benchmark and then uses times() when the alarm fires as a way of checking how much CPU time it was actually given. From a syscall perspective it looks like this:&#xA;alarm(10)                               = 0&#xA;times({tmsutime=0, tmsstime=0, tmscutime=0, tmscstime=0}) = 1726601344&#xA;--- SIGALRM {sisigno=SIGALRM, sicode=SIKERNEL} ---&#xA;rtsigaction(SIGALRM, ...) = 0&#xA;rtsigreturn({mask=[]}) = 2782545353&#xA;times({tmsutime=1000, tmsstime=0, tmscutime=0, tmscstime=0}) = 1726602344&#xA;so very few system calls between the alarm and checking the results of times(). With no/few interruptions the tmsutime will match the test time (10 seconds in this case).&#xA;&#xA;Since it is is a pure userspace benchmark ANY system time that shows up in times() is overhead. openssl may be the process on the CPU, but the CPU is actually doing something else, like processing packets. 
For example:&#xA;alarm(10)                               = 0&#xA;times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726617896&#xA;--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---&#xA;rt_sigaction(SIGALRM, ...) = 0&#xA;rt_sigreturn({mask=[]}) = 4079301579&#xA;times({tms_utime=178, tms_stime=571, tms_cutime=0, tms_cstime=0}) = 1726618896&#xA;shows that openssl was on the cpu for 7.49 seconds (178 + 571 in .01 increments), but 5.71 seconds of that time was in system time. Since openssl is not doing anything in the kernel, that 5.71 seconds is all overhead - time stolen from this process for &#34;system needs.&#34;&#xA;&#xA;Using openssl to Infer Networking Overhead&#xA;With an understanding of how &#39;openssl speed&#39; works, let&#39;s look at a near idle server:&#xA;$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc &gt;/dev/null&#xA;Doing aes-256 cbc for 10s on 16 size blocks: 66675623 aes-256 cbc&#39;s in 9.99s&#xA;Doing aes-256 cbc for 10s on 64 size blocks: 18096647 aes-256 cbc&#39;s in 10.00s&#xA;Doing aes-256 cbc for 10s on 256 size blocks: 4607752 aes-256 cbc&#39;s in 10.00s&#xA;Doing aes-256 cbc for 10s on 1024 size blocks: 1162429 aes-256 cbc&#39;s in 10.00s&#xA;Doing aes-256 cbc for 10s on 8192 size blocks: 145251 aes-256 cbc&#39;s in 10.00s&#xA;Doing aes-256 cbc for 10s on 16384 size blocks: 72831 aes-256 cbc&#39;s in 10.00s&#xA;so in this case openssl reports 9.99 to 10.00 seconds of run time for each of the block sizes confirming no contention for the CPU. 
Let&#39;s add network load, netperf TCP_STREAM from 2 sources, and re-do the test:&#xA;$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc &gt;/dev/null&#xA;Doing aes-256 cbc for 10s on 16 size blocks: 12061658 aes-256 cbc&#39;s in 1.96s&#xA;Doing aes-256 cbc for 10s on 64 size blocks: 3457491 aes-256 cbc&#39;s in 2.10s&#xA;Doing aes-256 cbc for 10s on 256 size blocks: 893939 aes-256 cbc&#39;s in 2.01s&#xA;Doing aes-256 cbc for 10s on 1024 size blocks: 201756 aes-256 cbc&#39;s in 1.86s&#xA;Doing aes-256 cbc for 10s on 8192 size blocks: 25117 aes-256 cbc&#39;s in 1.78s&#xA;Doing aes-256 cbc for 10s on 16384 size blocks: 13859 aes-256 cbc&#39;s in 1.89s&#xA;Much different outcome. Each block size test wants to run for 10 seconds, but times() is reporting the actual user time to be between 1.78 and 2.10 seconds. Thus, the other 7.9 to 8.22 seconds was spent processing packets - either in the context of openssl or via ksoftirqd.&#xA;&#xA;Looking at top for the previous openssl run:&#xA; PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P &#xA; 8180 libvirt+  20   0 33.269g 1.649g 1.565g S 279.9  0.9  18:57.81 qemu-system-x86     75&#xA; 8374 root      20   0       0      0      0 R  99.4  0.0   2:57.97 vhost-8180          89&#xA; 1684 dahern    20   0   17112   4400   3892 R  73.6  0.0   0:09.91 openssl              5    &#xA;   38 root      20   0       0      0      0 R  26.2  0.0   0:31.86 ksoftirqd/5          5&#xA;one would think openssl is using ~73% of cpu 5 with ksoftirqd taking the rest, but in reality so many packets are getting processed in the context of openssl that it is effectively getting only 18-21% of the time on the cpu to make progress on its workload.&#xA;&#xA;If I drop the network load to just 1 stream, openssl appears to be running at 99% CPU:&#xA;  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P&#xA; 8180 libvirt+  20   0 33.269g 1.722g 1.637g S 325.1  0.9 166:38.12 
qemu-system-x86     29&#xA;44218 dahern    20   0   17112   4488   3996 R  99.2  0.0   0:28.55 openssl              5&#xA; 8374 root      20   0       0      0      0 R  64.7  0.0  60:40.50 vhost-8180          55&#xA;   38 root      20   0       0      0      0 S   1.0  0.0   4:51.98 ksoftirqd/5          5&#xA;but openssl reports ~4 seconds of userspace time:&#xA;Doing aes-256 cbc for 10s on 16 size blocks: 26596388 aes-256 cbc&#39;s in 4.01s&#xA;Doing aes-256 cbc for 10s on 64 size blocks: 7137481 aes-256 cbc&#39;s in 4.14s&#xA;Doing aes-256 cbc for 10s on 256 size blocks: 1844565 aes-256 cbc&#39;s in 4.31s&#xA;Doing aes-256 cbc for 10s on 1024 size blocks: 472687 aes-256 cbc&#39;s in 4.28s&#xA;Doing aes-256 cbc for 10s on 8192 size blocks: 59001 aes-256 cbc&#39;s in 4.46s&#xA;Doing aes-256 cbc for 10s on 16384 size blocks: 28569 aes-256 cbc&#39;s in 4.16s&#xA;Again, monitoring tools show a lot of CPU access, but reality is much different with 55-80% of the CPU spent processing packets. The throughput numbers look great (22+Gbps for a 25G link), but the impact on processes is huge.&#xA;&#xA;In this example, the process robbed of CPU cycles is a silly benchmark. On a fully populated host the interrupted process can be anything - virtual cpus for a VM, emulator threads for the VM, vhost threads for the VM, or host level system processes with varying degrees of impact on performance of those processes and the system.&#xA;&#xA;Up Next&#xA;This post is the basis for a follow up post where I will discuss a comparison of the overhead of &#34;full stack&#34; on a VM host to &#34;XDP&#34;.&#xA;&#xA;[1] Measured using ebpf program on entry and exit. See net_rx_action in https://github.com/dsahern/bpf-progs&#xA;&#xA;[2] Assuming no bugs in the networking stack and driver. 
I have examined systems where net_rx_action takes well over 20,000 usec to process less than 64 packets due to a combination of bugs in the NIC driver (ARFS path) and OVS (thundering herd wakeup).&#xA;&#xA;[3] https://github.com/isc-projects/ethq]]&gt;</description>
      <content:encoded><![CDATA[<p>When evaluating networking for a host the focus is typically on latency, throughput or packets per second (pps) to see the maximum load a system can handle for a given configuration. While those are important and often telling metrics, results for such benchmarks do not tell you the impact processing those packets has on the workloads running on that system.</p>

<p>This post looks at the cost of networking in terms of CPU cycles stolen from processes running in a host.</p>

<h2 id="packet-processing-in-linux">Packet Processing in Linux</h2>

<p>Linux will process a fair amount of packets in the context of whatever is running on the CPU at the moment the irq is handled. System accounting will attribute those CPU cycles to any process running at that moment even though that process is not doing any work on its behalf. For example, &#39;top&#39; can show a process appearing to use 99+% cpu when in reality 60% of that time is spent processing packets, meaning the process is really only getting 40% of the CPU to make progress on its workload.</p>

<p>net_rx_action, the handler for network Rx traffic, <em>usually</em> runs really fast – like under 25 usecs[1] –  dealing with up to 64 packets per napi instance (NIC and RPS) at a time before deferring to another softirq cycle. softirq cycles can be back to back, up to 10 times or 2 msec (see __do_softirq), before taking a break. If the softirq vector still has more work to do after the maximum number of loops or time is reached, it defers further work to the ksoftirqd thread for that CPU. When that happens the system is a bit more transparent about the networking overhead in the sense that CPU usage can be monitored (though with the assumption that it is packet handling versus other softirqs).</p>

<p>One way to see the above description is using perf:</p>

<pre><code>sudo perf record -a \
        -e irq:irq_handler_entry,irq:irq_handler_exit \
        -e irq:softirq_entry --filter=&#34;vec == 3&#34; \
        -e irq:softirq_exit --filter=&#34;vec == 3&#34;  \
        -e napi:napi_poll \
        -- sleep 1

sudo perf script
</code></pre>

<p>The output is something like:</p>

<pre><code>swapper     0 [005] 176146.491879: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0
swapper     0 [005] 176146.491880:  irq:irq_handler_exit: irq=152 ret=handled
swapper     0 [005] 176146.491880:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491942:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 64 budget 64
swapper     0 [005] 176146.491943:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491943:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491971:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 27 budget 64
swapper     0 [005] 176146.491971:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.492200: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0
</code></pre>

<p>In this case the cpu is idle (hence swapper for the process), an irq fired for an Rx queue on CPU 5, softirq processing looped twice handling 64 packets and then 27 packets before exiting with the next irq firing 229 usec later and starting the loop again.</p>
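<p>A lighter-weight check than perf is to watch the per-CPU NET_RX counters in /proc/softirqs. The snippet below is a sketch (it assumes a Linux /proc/softirqs and a POSIX shell): it samples the counters twice, one second apart, and prints how many NET_RX softirqs each CPU handled in between.</p>

<pre><code>s1=$(grep NET_RX: /proc/softirqs)
sleep 1
s2=$(grep NET_RX: /proc/softirqs)
printf &#39;%s\n%s\n&#39; &#34;$s1&#34; &#34;$s2&#34; | awk &#39;
NR==1 { for (i=2; i&lt;=NF; i++) a[i]=$i; next }   # first sample: remember counts
NR==2 { for (i=2; i&lt;=NF; i++)                   # second sample: print deltas
            printf &#34;CPU%d: %d NET_RX/sec\n&#34;, i-2, $i-a[i] }&#39;
</code></pre>

<p>This only shows softirq counts, not time spent, but a CPU that is busy here is a candidate for the accounting distortion described above.</p>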

<p>The above was recorded on an idle system. In general, any task can be running on the CPU in which case the above series of events plays out by interrupting that task, doing the irq/softirq dance and with system accounting attributing cycles to the interrupted process. Thus, processing packets is typically hidden from the usual CPU monitoring as it is done in the context of some random, victim process, so how do you view or quantify the time a process is interrupted handling packets? And how can you compare 2 different networking solutions to see which one is less disruptive to a workload?</p>

<p>With RSS, RPS, and flow steering, packet processing is usually distributed across cores, so the packet processing sequence described above is all per-CPU. As packet rates increase (think 100,000 pps and up) the load means 1000&#39;s to 10,000&#39;s of packets are processed per second per cpu. Processing that many packets will inevitably have an impact on the workloads running on those systems.</p>

<p>Let&#39;s take a look at one way to see this impact.</p>

<h2 id="undo-the-distributed-processing">Undo the Distributed Processing</h2>

<p>First, let&#39;s undo the distributed processing by disabling RPS and installing flow rules to force the processing of all packets for a specific MAC address on a single, known CPU. My system has 2 nics enslaved to a bond in an 802.3ad configuration with the networking load targeted at a single virtual machine running in the host.</p>

<p>RPS is disabled on the 2 nics using</p>

<pre><code>for d in eth0 eth1; do
    find /sys/class/net/${d}/queues -name rps_cpus |
    while read f; do
            echo 0 | sudo tee ${f}
    done
done
</code></pre>

<p>Next, add flow rules to push packets for the VM under test to a single CPU</p>

<pre><code>DMAC=12:34:de:ad:ca:fe
sudo ethtool -N eth0 flow-type ether dst ${DMAC} action 2
sudo ethtool -N eth1 flow-type ether dst ${DMAC} action 2
</code></pre>

<p>Together, lack of RPS + flow rules ensure all packets destined to the VM are processed on the same CPU. You can use a command like ethq[3] to verify packets are directed to the expected queue and then map that queue to a CPU using /proc/interrupts. In my case queue 2 is handled on CPU 5.</p>
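<p>To go from queue to CPU without reading /proc/interrupts by eye, a small awk sketch can report which CPU column carries the bulk of a queue&#39;s interrupts. The queue/IRQ name (&#34;eth0-2&#34; here) is an assumption; use whatever name your driver registers.</p>

<pre><code>awk -v q=&#34;eth0-2&#34; &#39;
NR==1 { for (i=1; i&lt;=NF; i++) cpu[i]=$i; ncpu=NF; next }   # header row: CPU0 CPU1 ...
index($0, q) {                                             # row for our queue
    best=0; bi=2
    for (i=2; i&lt;=ncpu+1; i++) if ($i+0 &gt; best) { best=$i+0; bi=i }
    print q, &#34;is mostly serviced on&#34;, cpu[bi-1]
}&#39; /proc/interrupts
</code></pre>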

<h2 id="openssl-speed">openssl speed</h2>

<p>I could use perf or a bpf program to track softirq entry and exit for network Rx, but that gets complicated quickly, and the observation will definitely influence the results. A much simpler and more intuitive solution is to infer the networking overhead using a well known workload such as &#39;openssl speed&#39; and look at how much CPU access it really gets versus what it is perceived to get (recognizing the squishiness of process accounting).</p>

<p>&#39;openssl speed&#39; is a nearly 100% userspace command and when pinned to a CPU will use all available cycles for that CPU for the duration of its tests. The command works by setting an alarm for a given interval (e.g., 10 seconds here for easy math), launching into its benchmark, and then calling times() when the alarm fires to check how much CPU time it was actually given. From a syscall perspective it looks like this:</p>

<pre><code>alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726601344
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 2782545353
times({tms_utime=1000, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726602344
</code></pre>

<p>so very few system calls between the alarm and checking the results of times(). With no/few interruptions the tms_utime will match the test time (10 seconds in this case).</p>

<p>Since it is a pure userspace benchmark <em>ANY</em> system time that shows up in times() is overhead. openssl may be the process on the CPU, but the CPU is actually doing something else, like processing packets. For example:</p>

<pre><code>alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726617896
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 4079301579
times({tms_utime=178, tms_stime=571, tms_cutime=0, tms_cstime=0}) = 1726618896
</code></pre>

<p>shows that openssl was on the cpu for 7.49 seconds (178 + 571 in .01 increments), but 5.71 seconds of that time was in system time. Since openssl is not doing anything in the kernel, that 5.71 seconds is all overhead – time stolen from this process for “system needs.”</p>
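<p>That bookkeeping is easy to script. The sketch below redoes the arithmetic from the trace above in plain shell (ticks are 1/100 second, matching the .01 increments noted above):</p>

<pre><code>utime=178 stime=571 wall=1000       # ticks from times(); 10 s test window
oncpu=$((utime + stime))            # 749 ticks: 7.49 s actually on the CPU
offcpu=$((wall - oncpu))            # 251 ticks given to other tasks (ksoftirqd et al)
overhead=${stime}                   # pure-userspace benchmark: all stime is stolen cycles
echo &#34;on-cpu=${oncpu} off-cpu=${offcpu} stolen=${overhead} (ticks)&#34;
</code></pre>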

<h2 id="using-openssl-to-infer-networking-overhead">Using openssl to Infer Networking Overhead</h2>

<p>With an understanding of how &#39;openssl speed&#39; works, let&#39;s look at a near idle server:</p>

<pre><code>$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc &gt;/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 66675623 aes-256 cbc&#39;s in 9.99s
Doing aes-256 cbc for 10s on 64 size blocks: 18096647 aes-256 cbc&#39;s in 10.00s
Doing aes-256 cbc for 10s on 256 size blocks: 4607752 aes-256 cbc&#39;s in 10.00s
Doing aes-256 cbc for 10s on 1024 size blocks: 1162429 aes-256 cbc&#39;s in 10.00s
Doing aes-256 cbc for 10s on 8192 size blocks: 145251 aes-256 cbc&#39;s in 10.00s
Doing aes-256 cbc for 10s on 16384 size blocks: 72831 aes-256 cbc&#39;s in 10.00s
</code></pre>

<p>so in this case openssl reports 9.99 to 10.00 seconds of run time for each of the block sizes confirming no contention for the CPU. Let&#39;s add network load, netperf TCP_STREAM from 2 sources, and re-do the test:</p>

<pre><code>$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc &gt;/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 12061658 aes-256 cbc&#39;s in 1.96s
Doing aes-256 cbc for 10s on 64 size blocks: 3457491 aes-256 cbc&#39;s in 2.10s
Doing aes-256 cbc for 10s on 256 size blocks: 893939 aes-256 cbc&#39;s in 2.01s
Doing aes-256 cbc for 10s on 1024 size blocks: 201756 aes-256 cbc&#39;s in 1.86s
Doing aes-256 cbc for 10s on 8192 size blocks: 25117 aes-256 cbc&#39;s in 1.78s
Doing aes-256 cbc for 10s on 16384 size blocks: 13859 aes-256 cbc&#39;s in 1.89s
</code></pre>

<p>Much different outcome. Each block size test wants to run for 10 seconds, but times() is reporting the actual user time to be between 1.78 and 2.10 seconds. Thus, the other 7.9 to 8.22 seconds was spent processing packets – either in the context of openssl or via ksoftirqd.</p>
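<p>Rather than eyeballing each line, the fraction of the requested interval that openssl actually ran can be pulled straight out of its output with a small awk helper. This is a sketch: the field positions assume the &#39;Doing aes-256 cbc ... in N.NNs&#39; format shown above, and note from the earlier runs that those lines still appear with stdout sent to /dev/null, i.e. they are written to stderr, hence the redirection.</p>

<pre><code>taskset -c 5 openssl speed -seconds 10 aes-256-cbc 2&gt;&amp;1 &gt;/dev/null | awk &#39;
/^Doing aes-256/ { t=$NF; sub(/s$/, &#34;&#34;, t)
                   printf &#34;%6s bytes: %4.0f%% of the CPU\n&#34;, $7, 100*t/10 }&#39;
</code></pre>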

<p>Looking at top for the previous openssl run:</p>

<pre><code> PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P 
 8180 libvirt+  20   0 33.269g 1.649g 1.565g S 279.9  0.9  18:57.81 qemu-system-x86     75
 8374 root      20   0       0      0      0 R  99.4  0.0   2:57.97 vhost-8180          89
 1684 dahern    20   0   17112   4400   3892 R  73.6  0.0   0:09.91 openssl              5    
   38 root      20   0       0      0      0 R  26.2  0.0   0:31.86 ksoftirqd/5          5
</code></pre>

<p>one would think openssl is using ~73% of cpu 5 with ksoftirqd taking the rest, but in reality so many packets are getting processed in the context of openssl that it is effectively getting only 18-21% of the time on the cpu to make progress on its workload.</p>

<p>If I drop the network load to just 1 stream, openssl appears to be running at 99% CPU:</p>

<pre><code>  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P
 8180 libvirt+  20   0 33.269g 1.722g 1.637g S 325.1  0.9 166:38.12 qemu-system-x86     29
44218 dahern    20   0   17112   4488   3996 R  99.2  0.0   0:28.55 openssl              5
 8374 root      20   0       0      0      0 R  64.7  0.0  60:40.50 vhost-8180          55
   38 root      20   0       0      0      0 S   1.0  0.0   4:51.98 ksoftirqd/5          5
</code></pre>

<p>but openssl reports ~4 seconds of userspace time:</p>

<pre><code>Doing aes-256 cbc for 10s on 16 size blocks: 26596388 aes-256 cbc&#39;s in 4.01s
Doing aes-256 cbc for 10s on 64 size blocks: 7137481 aes-256 cbc&#39;s in 4.14s
Doing aes-256 cbc for 10s on 256 size blocks: 1844565 aes-256 cbc&#39;s in 4.31s
Doing aes-256 cbc for 10s on 1024 size blocks: 472687 aes-256 cbc&#39;s in 4.28s
Doing aes-256 cbc for 10s on 8192 size blocks: 59001 aes-256 cbc&#39;s in 4.46s
Doing aes-256 cbc for 10s on 16384 size blocks: 28569 aes-256 cbc&#39;s in 4.16s
</code></pre>

<p>Again, monitoring tools show a lot of CPU access, but reality is much different with 55-80% of the CPU spent processing packets. The throughput numbers look great (22+Gbps for a 25G link), but the impact on processes is huge.</p>

<p>In this example, the process robbed of CPU cycles is a silly benchmark. On a fully populated host the interrupted process can be anything – virtual cpus for a VM, emulator threads for the VM, vhost threads for the VM, or host level system processes with varying degrees of impact on performance of those processes and the system.</p>

<h2 id="up-next">Up Next</h2>

<p>This post is the basis for a follow up post where I will discuss a comparison of the overhead of “full stack” on a VM host to “XDP”.</p>

<p>[1] Measured using ebpf program on entry and exit. See net_rx_action in <a href="https://github.com/dsahern/bpf-progs" rel="nofollow">https://github.com/dsahern/bpf-progs</a></p>

<p>[2] Assuming no bugs in the networking stack and driver. I have examined systems where net_rx_action takes well over 20,000 usec to process less than 64 packets due to a combination of bugs in the NIC driver (ARFS path) and OVS (thundering herd wakeup).</p>

<p>[3] <a href="https://github.com/isc-projects/ethq" rel="nofollow">https://github.com/isc-projects/ethq</a></p>
]]></content:encoded>
      <author>David Ahern</author>
      <guid>https://people.kernel.org/read/a/zp238xdjfx</guid>
      <pubDate>Fri, 15 May 2020 04:52:23 +0000</pubDate>
    </item>
    <item>
      <title>GUS (Global Unbounded Sequences)</title>
      <link>https://people.kernel.org/joelfernandes/gus-vs-rcu</link>
      <description>&lt;![CDATA[GUS is a memory reclaim algorithm used in FreeBSD, similar to RCU. It borrows concepts from Epoch and Parsec. A video of a presentation describing the integration of GUS with UMA (FreeBSD&#39;s slab implementation) is here: https://www.youtube.com/watch?v=ZXUIFj4nRjk&#xA;&#xA;The best description of GUS is in the FreeBSD code itself. It is based on the concept of a global write clock, with readers catching up to writers.&#xA;&#xA;Effectively, I see GUS as an implementation of light traveling from distant stars. When a photon leaves a star, it is no longer needed by the star and is ready to be reclaimed. However, on earth we can&#39;t see the photon yet, we can only see what we&#39;ve been shown so far, and in a way, if we&#39;ve not seen something because enough &#34;time&#34; has not passed, then we may not reclaim it yet. If we&#39;ve not seen something, we will see it at some point in the future. Till then we need to sit tight.&#xA;&#xA;Roughly, an implementation has 2+N counters (with N CPUs):&#xA;Global write sequence.&#xA;Global read sequence.&#xA;Per-cpu read sequence (read from #1 when a reader starts)&#xA;&#xA;On freeing, the object is tagged with the write sequence. Only once the global read sequence has caught up with the global write sequence is the object freed. Until then, the freeing is deferred. The poll() operation updates #2 by referring to #3 of all CPUs.  Whatever was tagged between the old read sequence and new read sequence can be freed. 
This is similar to synchronize_rcu() in the Linux kernel which waits for all readers to have finished observing the object being reclaimed.&#xA;&#xA;Note the scalability drawbacks of this reclaim scheme:&#xA;&#xA;Expensive poll operation if you have 1000s of CPUs.&#xA; (Note: Parsec uses a tree-based mechanism to improve the situation which GUS could consider)&#xA;&#xA;Heavy-weight memory barriers are needed (SRCU has a similar drawback) to ensure ordering properties of reader sections with respect to poll() operation.&#xA;&#xA;There can be a delay between reading the global write-sequence number and writing it into the per-cpu read-sequence number. This can cause the per-cpu read-sequence to advance past the global write-sequence. Special handling is needed.&#xA;&#xA;One advantage of the scheme could be implementation simplicity.&#xA;&#xA;RCU (not SRCU or Userspace RCU) doesn&#39;t suffer from these drawbacks. Reader-sections in Linux kernel RCU are extremely scalable and lightweight.]]&gt;</description>
      <content:encoded><![CDATA[<p>GUS is a memory reclaim algorithm used in FreeBSD, similar to RCU. It borrows concepts from Epoch and Parsec. A video of a presentation describing the integration of GUS with UMA (FreeBSD&#39;s slab implementation) is here: <a href="https://www.youtube.com/watch?v=ZXUIFj4nRjk" rel="nofollow">https://www.youtube.com/watch?v=ZXUIFj4nRjk</a></p>

<p>The best description of GUS is in the FreeBSD code <a href="http://bxr.su/FreeBSD/sys/kern/subr_smr.c#44" rel="nofollow">itself</a>. It is based on the concept of a global write clock, with readers catching up to writers.</p>

<p>Effectively, I see GUS as an implementation of light traveling from distant stars. When a photon leaves a star, it is no longer needed by the star and is ready to be reclaimed. However, on earth we can&#39;t see the photon yet, we can only see what we&#39;ve been shown so far, and in a way, if we&#39;ve not seen something because enough “time” has not passed, then we may not reclaim it yet. If we&#39;ve not seen something, we will see it at some point in the future. Till then we need to sit tight.</p>

<p>Roughly, an implementation has 2+N counters (with N CPUs):
1. Global write sequence.
2. Global read sequence.
3. Per-cpu read sequence (read from #1 when a reader starts)</p>

<p>On freeing, the object is tagged with the write sequence. Only once the global read sequence has caught up with the global write sequence is the object freed. Until then, the freeing is deferred. The poll() operation updates #2 by referring to #3 of all CPUs.  Whatever was tagged between the old read sequence and new read sequence can be freed. This is similar to synchronize_rcu() in the Linux kernel which waits for all readers to have finished observing the object being reclaimed.</p>

<p>Note the scalability drawbacks of this reclaim scheme:</p>
<ol><li><p>Expensive poll operation if you have 1000s of CPUs.
(Note: Parsec uses a tree-based mechanism to improve the situation which GUS could consider)</p></li>

<li><p>Heavy-weight memory barriers are needed (SRCU has a similar drawback) to ensure ordering properties of reader sections with respect to poll() operation.</p></li>

<li><p>There can be a delay between reading the global write-sequence number and writing it into the per-cpu read-sequence number. This can cause the per-cpu read-sequence to advance past the global write-sequence. Special handling is needed.</p></li></ol>

<p>One advantage of the scheme could be implementation simplicity.</p>

<p>RCU (not SRCU or Userspace RCU) doesn&#39;t suffer from these drawbacks. Reader-sections in Linux kernel RCU are extremely scalable and lightweight.</p>
]]></content:encoded>
      <author>joelfernandes</author>
      <guid>https://people.kernel.org/read/a/esc20hpt53</guid>
      <pubDate>Fri, 24 Apr 2020 20:57:47 +0000</pubDate>
    </item>
    <item>
      <title>Docker and Management VRF</title>
      <link>https://people.kernel.org/dsahern/docker-and-management-vrf</link>
      <description>&lt;![CDATA[Running docker service over management VRF requires the service to be started bound to the VRF. Since docker and systemd do not natively understand VRF, the vrf exec helper in iproute2 can be used.&#xA;&#xA;This series of steps worked for me on Ubuntu 19.10 and should work on 18.04 as well:&#xA;&#xA;Configure mgmt VRF and disable systemd-resolved as noted in a previous post about management vrf and DNS&#xA;&#xA;Install docker-ce&#xA;&#xA;Edit /lib/systemd/system/docker.service and add /usr/sbin/ip vrf exec mgmt to the Exec lines like this:&#xA;ExecStart=/usr/sbin/ip vrf exec mgmt /usr/bin/dockerd -H fd://&#xA;--containerd=/run/containerd/containerd.sock&#xA;&#xA;Tell systemd about the change and restart docker&#xA;systemctl daemon-reload&#xA;systemctl restart docker&#xA;&#xA;With that, docker pull should work fine - in mgmt vrf or default vrf.]]&gt;</description>
      <content:encoded><![CDATA[<p>Running the docker service over management VRF requires the service to be started bound to the VRF. Since docker and systemd do not natively understand VRFs, the vrf exec helper in iproute2 can be used.</p>

<p>This series of steps worked for me on Ubuntu 19.10 and should work on 18.04 as well:</p>
<ul><li><p>Configure mgmt VRF and disable systemd-resolved as noted in a previous post <a href="https://people.kernel.org/dsahern/management-vrf-and-dns" rel="nofollow">about management vrf and DNS</a></p></li>

<li><p>Install docker-ce</p></li>

<li><p>Edit /lib/systemd/system/docker.service and add <code>/usr/sbin/ip vrf exec mgmt</code> to the Exec lines like this:</p>

<pre><code>ExecStart=/usr/sbin/ip vrf exec mgmt /usr/bin/dockerd -H fd://
--containerd=/run/containerd/containerd.sock
</code></pre></li>

<li><p>Tell systemd about the change and restart docker</p>

<pre><code>systemctl daemon-reload
systemctl restart docker
</code></pre></li></ul>
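<p>One caveat with editing /lib/systemd/system/docker.service directly is that a package upgrade can overwrite it. A drop-in override keeps the change separate; the sketch below assumes a reasonably recent systemd, and the empty ExecStart= line is required to clear the packaged command before replacing it:</p>

<pre><code>sudo mkdir -p /etc/systemd/system/docker.service.d
cat &lt;&lt;&#39;EOF&#39; | sudo tee /etc/systemd/system/docker.service.d/vrf.conf
[Service]
ExecStart=
ExecStart=/usr/sbin/ip vrf exec mgmt /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
EOF
sudo systemctl daemon-reload &amp;&amp; sudo systemctl restart docker
</code></pre>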

<p>With that, <code>docker pull</code> should work fine – in mgmt vrf or default vrf.</p>
]]></content:encoded>
      <author>David Ahern</author>
      <guid>https://people.kernel.org/read/a/metxgfdlv5</guid>
      <pubDate>Fri, 13 Mar 2020 01:47:59 +0000</pubDate>
    </item>
    <item>
      <title>Management VRF and DNS</title>
      <link>https://people.kernel.org/dsahern/management-vrf-and-dns</link>
      <description>&lt;![CDATA[Someone recently asked me why apt-get was not working when he enabled management VRF on Ubuntu 18.04. After a few back and forths and a little digging I was reminded of why. The TL;DR is systemd-resolved. This blog post documents how I came to that conclusion and what you need to do to use management VRF with Ubuntu (or any OS using a DNS caching service such as systemd-resolved).&#xA;&#xA;The following example is based on a newly created Ubuntu 18.04 VM. The VM comes up with the 4.15.0-66-generic kernel which is missing the VRF module:&#xA;$ modprobe vrf&#xA;modprobe: FATAL: Module vrf not found in directory /lib/modules/4.15.0-66-generic&#xA;despite VRF being enabled and built:&#xA;$ grep VRF /boot/config-4.15.0-66-generic&#xA;CONFIG_NET_VRF=m&#xA;which is really weird.[4] So for this blog post I shifted to the v5.3 HWE kernel: &#xA;$ sudo apt-get install --install-recommends linux-generic-hwe-18.04&#xA;&#xA;although nothing about the DNS problem is kernel specific. A 4.14 or better kernel with VRF enabled and usable will work.&#xA;&#xA;First, let&#39;s enable Management VRF. All of the following commands need to be run as root. For simplicity getting started, you will want to enable this sysctl to allow sshd to work across VRFs:&#xA;    echo &#34;net.ipv4.tcp_l3mdev_accept=1&#34; &gt;&gt; /etc/sysctl.d/99-sysctl.conf&#xA;    sysctl -p /etc/sysctl.d/99-sysctl.conf&#xA;Advanced users can leave that disabled and use something like the systemd instances to run sshd in Management VRF only.[1]&#xA;&#xA;Ubuntu has moved to netplan for network configuration, and apparently netplan is still missing VRF support despite requests from multiple users since May 2018:&#xA;    https://bugs.launchpad.net/netplan/+bug/1773522&#xA;&#xA;One option to work around the problem is to put the following in /etc/networkd-dispatcher/routable.d/50-ifup-hooks:&#xA;#!/bin/bash&#xA;&#xA;ip link show dev mgmt 2&gt;/dev/null&#xA;if [ $? 
-ne 0 ]&#xA;then&#xA;        # capture default route&#xA;        DEF=$(ip ro ls default)&#xA;&#xA;        # only need to do this once&#xA;        ip link add mgmt type vrf table 1000&#xA;        ip link set mgmt up&#xA;        ip link set eth0 vrf mgmt&#xA;        sleep 1&#xA;&#xA;        # move the default route&#xA;        ip route add vrf mgmt ${DEF}&#xA;        ip route del default&#xA;&#xA;        # fix up rules to look in VRF table first&#xA;        ip ru add pref 32765 from all lookup local&#xA;        ip ru del pref 0&#xA;        ip -6 ru add pref 32765 from all lookup local&#xA;        ip -6 ru del pref 0&#xA;fi&#xA;ip route del default&#xA;The above assumes eth0 is the nic to put into Management VRF, and it has a static IP address. If using DHCP instead of a static route, create or update the dhclient-exit-hook to put the default route in the Management VRF table.[3] Another option is to use ifupdown2 for network management; it has good support for VRF.[1]&#xA;&#xA;Reboot node to make the changes take effect.&#xA;&#xA;WARNING: If you run these commands from an active ssh session, you will lose connectivity since you are shifting the L3 domain of eth0 and that impacts existing logins. You can avoid the reboot by running the above commands from console.&#xA;&#xA;After logging back in to the node with Management VRF enabled, the first thing to remember is that when VRF is enabled network addresses become relative to the VRF - and that includes loopback addresses (they are not that special).&#xA;&#xA;Any command that needs to contact a service over the Management VRF needs to be run in that context. If the command does not have native VRF support, then you can use &#39;ip vrf exec&#39; as a helper to do the VRF binding. 
&#39;ip vrf exec&#39; uses a small eBPF program to bind all IPv4 and IPv6 sockets opened by the command to the given device (&#39;mgmt&#39; in this case) which causes all routing lookups to go to the table associated with the VRF (table 1000 per the setting above).&#xA;&#xA;Let&#39;s see what happens:&#xA;$ ip vrf exec mgmt apt-get update&#xA;Err:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease&#xA;  Temporary failure resolving &#39;mirrors.digitalocean.com&#39;&#xA;Err:2 http://security.ubuntu.com/ubuntu bionic-security InRelease&#xA;  Temporary failure resolving &#39;security.ubuntu.com&#39;&#xA;Err:3 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease&#xA;  Temporary failure resolving &#39;mirrors.digitalocean.com&#39;&#xA;Err:4 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease&#xA;  Temporary failure resolving &#39;mirrors.digitalocean.com&#39;&#xA;Theoretically, this should Just Work, but it clearly does not. Why? &#xA;&#xA;Ubuntu uses the systemd-resolved service by default with /etc/resolv.conf configured to send DNS lookups to it:&#xA;$ ls -l /etc/resolv.conf&#xA;lrwxrwxrwx 1 root root 39 Oct 21 15:48 /etc/resolv.conf -&gt; ../run/systemd/resolve/stub-resolv.conf&#xA;&#xA;$ cat /etc/resolv.conf&#xA;...&#xA;nameserver 127.0.0.53&#xA;options edns0&#xA;So when a process (e.g., apt) does a name lookup, the message is sent to 127.0.0.53/53. In theory, systemd-resolved gets the request and attempts to contact the actual nameserver.&#xA;&#xA;Where does the theory break down? In 3 places.&#xA;&#xA;First, 127.0.0.53 is not configured for Management VRF, so attempts to reach it fail:&#xA;$ ip ro get vrf mgmt 127.0.0.53&#xA;127.0.0.53 via 157.245.160.1 dev eth0 table 1000 src 157.245.160.132 uid 0&#xA;    cache&#xA;That one is easy enough to fix. 
The VRF device is meant to be the loopback for a VRF, so let&#39;s add the loopback addresses to it:&#xA;    $ ip addr add dev mgmt 127.0.0.1/8&#xA;    $ ip addr add dev mgmt ::1/128&#xA;&#xA;    $ ip ro get vrf mgmt 127.0.0.53&#xA;    127.0.0.53 dev mgmt table 1000 src 127.0.0.1 uid 0&#xA;        cache&#xA;The second problem is systemd-resolved binds its socket to the loopback device:&#xA;    $ ss -apn | grep systemd-resolve&#xA;    udp  UNCONN   0      0       127.0.0.53%lo:53     0.0.0.0:     users:((&#34;systemd-resolve&#34;,pid=803,fd=12))&#xA;    tcp  LISTEN   0      128     127.0.0.53%lo:53     0.0.0.0:     users:((&#34;systemd-resolve&#34;,pid=803,fd=13))&#xA;The loopback device is in the default VRF and can not be moved to Management VRF. A process bound to the Management VRF can not communicate with a socket bound to the loopback device. &#xA;&#xA;The third issue is that systemd-resolved runs in the default VRF, so its attempts to reach the real DNS server happen over the default VRF. Those attempts fail since the servers are only reachable from the Management VRF and systemd-resolved has no knowledge of it.&#xA;&#xA;Since systemd-resolved is hardcoded (from a quick look at the source) to bind to the loopback device, there is no option but to disable it. 
It is not compatible with Management VRF - or VRF at all.&#xA;&#xA;$ rm /etc/resolv.conf&#xA;$ grep nameserver /run/systemd/resolve/resolv.conf   /etc/resolv.conf&#xA;$ systemctl stop systemd-resolved.service&#xA;$ systemctl disable systemd-resolved.service&#xA;&#xA;With that it works as expected:&#xA;$ ip vrf exec mgmt apt-get update&#xA;Get:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease [242 kB]&#xA;Get:2 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease [88.7 kB]&#xA;Get:3 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease [74.6 kB]&#xA;Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]&#xA;Get:5 http://mirrors.digitalocean.com/ubuntu bionic-updates/universe amd64 Packages [1054 kB]&#xA;&#xA;When using Management VRF, it is convenient (ie., less typing) to bind the shell to the VRF and let all commands run by it inherit the VRF binding:&#xA;$ ip vrf exec mgmt su - dsahern&#xA;&#xA;Now all commands run will automatically use Management VRF. This can be done at login using libpamscript[2].&#xA;&#xA;Personally, I like a reminder about the network bindings in my bash prompt. 
I do that by adding the following to my .bashrc:&#xA;NS=$(ip netns identify)&#xA;[ -n &#34;$NS&#34; ] &amp;&amp; NS=&#34;:${NS}&#34;&#xA;&#xA;VRF=$(ip vrf identify)&#xA;[ -n &#34;$VRF&#34; ] &amp;&amp; VRF=&#34;:${VRF}&#34;&#xA;And then adding &#39;${NS}${VRF}&#39; after the host in PS1:&#xA;PS1=&#39;${debianchroot:+($debianchroot)}\u@\h${NS}${VRF}:\w\$ &#39;&#xA;For example, now the prompt becomes:&#xA;dsahern@myhost:mgmt:~$&#xA;&#xA;References:&#xA;&#xA;[1] VRF tutorial, Open Source Summit, North America, Sept 2017&#xA;http://schd.ws/hosted_files/ossna2017/fe/vrf-tutorial-oss.pdf&#xA;&#xA;[2] VRF helpers, e.g., systemd instances for VRF&#xA;https://github.com/CumulusNetworks/vrf&#xA;&#xA;[3] Example using VRF in dhclient-exit-hook&#xA;https://github.com/CumulusNetworks/vrf/blob/master/etc/dhcp/dhclient-exit-hooks.d/vrf&#xA;&#xA;[4] Vincent Bernat informed me that some modules were moved to non-standard package; installing linux-modules-extra-$(uname -r)-generic provides the vrf module. Thanks Vincent.]]&gt;</description>
      <content:encoded><![CDATA[<p>Someone recently asked me why apt-get was not working when he enabled management VRF on Ubuntu 18.04. After a few back and forths and a little digging I was reminded of why. The TL;DR is systemd-resolved. This blog post documents how I came to that conclusion and what you need to do to use management VRF with Ubuntu (or any OS using a DNS caching service such as systemd-resolved).</p>

<p>The following example is based on a newly created Ubuntu 18.04 VM. The VM comes up with the 4.15.0-66-generic kernel which is missing the VRF module:</p>

<pre><code>$ modprobe vrf
modprobe: FATAL: Module vrf not found in directory /lib/modules/4.15.0-66-generic
</code></pre>

<p>despite VRF being enabled and built:</p>

<pre><code>$ grep VRF /boot/config-4.15.0-66-generic
CONFIG_NET_VRF=m
</code></pre>

<p>which is really weird.[4] So for this blog post I shifted to the v5.3 HWE kernel:
<code>$ sudo apt-get install --install-recommends linux-generic-hwe-18.04</code></p>

<p>although nothing about the DNS problem is kernel-specific. Any 4.14 or later kernel with VRF enabled and usable will work.</p>

<p>First, let&#39;s enable Management VRF. All of the following commands need to be run as root. For simplicity in getting started, you will want to enable this sysctl to allow sshd to work across VRFs:</p>

<pre><code>    echo &#34;net.ipv4.tcp_l3mdev_accept=1&#34; &gt;&gt; /etc/sysctl.d/99-sysctl.conf
    sysctl -p /etc/sysctl.d/99-sysctl.conf
</code></pre>

<p>Advanced users can leave that disabled and use something like the systemd instances to run sshd in Management VRF only.[1]</p>

<p>Ubuntu has moved to netplan for network configuration, and apparently netplan is still missing VRF support despite requests from multiple users since May 2018:
    <a href="https://bugs.launchpad.net/netplan/+bug/1773522" rel="nofollow">https://bugs.launchpad.net/netplan/+bug/1773522</a></p>

<p>One option to workaround the problem is to put the following in /etc/networkd-dispatcher/routable.d/50-ifup-hooks:</p>

<pre><code>#!/bin/bash

if ! ip link show dev mgmt &gt;/dev/null 2&gt;&amp;1
then
        # capture default route
        DEF=$(ip ro ls default)

        # only need to do this once
        ip link add mgmt type vrf table 1000
        ip link set mgmt up
        ip link set eth0 vrf mgmt
        sleep 1

        # move the default route
        ip route add vrf mgmt ${DEF}
        ip route del default

        # fix up rules to look in VRF table first
        ip ru add pref 32765 from all lookup local
        ip ru del pref 0
        ip -6 ru add pref 32765 from all lookup local
        ip -6 ru del pref 0
fi
</code></pre>

<p>The above assumes eth0 is the NIC to put into Management VRF and that it has a static IP address. If using DHCP instead of a static address, create or update the dhclient-exit-hook to put the default route in the Management VRF table.[3] Another option is to use ifupdown2 for network management; it has good support for VRF.[1]</p>
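<p>For reference, a minimal sketch of such a hook is shown below. This is an illustration only, not the canonical hook from [3]: it assumes the VRF device is named <code>mgmt</code> (table 1000) and that eth0 is the DHCP-managed interface. dhclient exposes the interface name, lease event, and router list to exit hooks via the <code>$interface</code>, <code>$reason</code>, and <code>$new_routers</code> variables.</p>

```shell
# /etc/dhcp/dhclient-exit-hooks.d/vrf -- illustrative sketch only; see [3]
# for the canonical version. Assumes VRF device "mgmt" (table 1000), eth0.
if [ "$interface" = "eth0" ] && [ -n "$new_routers" ]; then
        case "$reason" in
        BOUND|RENEW|REBIND|REBOOT)
                # dhclient installs the default route in the main table;
                # move it into the Management VRF table instead.
                ip route del default 2>/dev/null
                ip route add vrf mgmt default via "$new_routers" dev "$interface"
                ;;
        esac
fi
```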

<p>Reboot the node to make the changes take effect.</p>

<p>WARNING: If you run these commands from an active ssh session, you will lose connectivity, since you are shifting the L3 domain of eth0 and that impacts existing logins. You can avoid the reboot by running the above commands from the console.</p>

<p>After logging back in to the node with Management VRF enabled, the first thing to remember is that when VRF is enabled network addresses become relative to the VRF – and that includes loopback addresses (they are not that special).</p>

<p>Any command that needs to contact a service over the Management VRF needs to be run in that context. If the command does not have native VRF support, then you can use &#39;ip vrf exec&#39; as a helper to do the VRF binding. &#39;ip vrf exec&#39; uses a small eBPF program to bind all IPv4 and IPv6 sockets opened by the command to the given device (&#39;mgmt&#39; in this case) which causes all routing lookups to go to the table associated with the VRF (table 1000 per the setting above).</p>

<p>Let&#39;s see what happens:</p>

<pre><code>$ ip vrf exec mgmt apt-get update
Err:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease
  Temporary failure resolving &#39;mirrors.digitalocean.com&#39;
Err:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
  Temporary failure resolving &#39;security.ubuntu.com&#39;
Err:3 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease
  Temporary failure resolving &#39;mirrors.digitalocean.com&#39;
Err:4 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease
  Temporary failure resolving &#39;mirrors.digitalocean.com&#39;
</code></pre>

<p>Theoretically, this should Just Work, but it clearly does not. Why?</p>

<p>Ubuntu uses the systemd-resolved service by default, with /etc/resolv.conf configured to send DNS lookups to it:</p>

<pre><code>$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Oct 21 15:48 /etc/resolv.conf -&gt; ../run/systemd/resolve/stub-resolv.conf

$ cat /etc/resolv.conf
...
nameserver 127.0.0.53
options edns0
</code></pre>

<p>So when a process (e.g., apt) does a name lookup, the message is sent to 127.0.0.53/53. In theory, systemd-resolved gets the request and attempts to contact the actual nameserver.</p>

<p>Where does the theory break down? In three places.</p>

<p>First, 127.0.0.53 is not configured for Management VRF, so attempts to reach it fail:</p>

<pre><code>$ ip ro get vrf mgmt 127.0.0.53
127.0.0.53 via 157.245.160.1 dev eth0 table 1000 src 157.245.160.132 uid 0
    cache
</code></pre>

<p>That one is easy enough to fix. The VRF device is meant to be the loopback for a VRF, so let&#39;s add the loopback addresses to it:</p>

<pre><code>    $ ip addr add dev mgmt 127.0.0.1/8
    $ ip addr add dev mgmt ::1/128

    $ ip ro get vrf mgmt 127.0.0.53
    127.0.0.53 dev mgmt table 1000 src 127.0.0.1 uid 0
        cache
</code></pre>

<p>The second problem is systemd-resolved binds its socket to the loopback device:</p>

<pre><code>    $ ss -apn | grep systemd-resolve
    udp  UNCONN   0      0       127.0.0.53%lo:53     0.0.0.0:*     users:((&#34;systemd-resolve&#34;,pid=803,fd=12))
    tcp  LISTEN   0      128     127.0.0.53%lo:53     0.0.0.0:*     users:((&#34;systemd-resolve&#34;,pid=803,fd=13))
</code></pre>

<p>The loopback device is in the default VRF and can not be moved to Management VRF. A process bound to the Management VRF can not communicate with a socket bound to the loopback device.</p>

<p>The third issue is that systemd-resolved runs in the default VRF, so its attempts to reach the real DNS server happen over the default VRF. Those attempts fail since the servers are only reachable from the Management VRF and systemd-resolved has no knowledge of it.</p>

<p>Since systemd-resolved is hardcoded (from a quick look at the source) to bind to the loopback device, there is no option but to disable it. It is not compatible with Management VRF – or VRF at all.</p>

<pre><code>$ rm /etc/resolv.conf
$ grep nameserver /run/systemd/resolve/resolv.conf &gt; /etc/resolv.conf
$ systemctl stop systemd-resolved.service
$ systemctl disable systemd-resolved.service
</code></pre>

<p>With that it works as expected:</p>

<pre><code>$ ip vrf exec mgmt apt-get update
Get:1 http://mirrors.digitalocean.com/ubuntu bionic InRelease [242 kB]
Get:2 http://mirrors.digitalocean.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://mirrors.digitalocean.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:5 http://mirrors.digitalocean.com/ubuntu bionic-updates/universe amd64 Packages [1054 kB]
</code></pre>

<p>When using Management VRF, it is convenient (i.e., less typing) to bind the shell to the VRF and let all commands run by it inherit the VRF binding:</p>

<pre><code>$ ip vrf exec mgmt su - dsahern
</code></pre>

<p>Now all commands run will automatically use Management VRF. This can be done at login using libpamscript[2].</p>

<p>Personally, I like a reminder about the network bindings in my bash prompt. I do that by adding the following to my .bashrc:</p>

<pre><code>NS=$(ip netns identify)
[ -n &#34;$NS&#34; ] &amp;&amp; NS=&#34;:${NS}&#34;

VRF=$(ip vrf identify)
[ -n &#34;$VRF&#34; ] &amp;&amp; VRF=&#34;:${VRF}&#34;
</code></pre>

<p>And then adding &#39;${NS}${VRF}&#39; after the host in PS1:</p>

<pre><code>PS1=&#39;${debian_chroot:+($debian_chroot)}\u@\h${NS}${VRF}:\w\$ &#39;
</code></pre>

<p>For example, now the prompt becomes:</p>

<pre><code>dsahern@myhost:mgmt:~$
</code></pre>

<p>References:</p>

<p>[1] VRF tutorial, Open Source Summit, North America, Sept 2017
<a href="http://schd.ws/hosted_files/ossna2017/fe/vrf-tutorial-oss.pdf" rel="nofollow">http://schd.ws/hosted_files/ossna2017/fe/vrf-tutorial-oss.pdf</a></p>

<p>[2] VRF helpers, e.g., systemd instances for VRF
<a href="https://github.com/CumulusNetworks/vrf" rel="nofollow">https://github.com/CumulusNetworks/vrf</a></p>

<p>[3] Example using VRF in dhclient-exit-hook
<a href="https://github.com/CumulusNetworks/vrf/blob/master/etc/dhcp/dhclient-exit-hooks.d/vrf" rel="nofollow">https://github.com/CumulusNetworks/vrf/blob/master/etc/dhcp/dhclient-exit-hooks.d/vrf</a></p>

<p>[4] Vincent Bernat informed me that some modules were moved to a non-standard package; installing linux-modules-extra-$(uname -r)-generic provides the vrf module. Thanks Vincent.</p>
]]></content:encoded>
      <author>David Ahern</author>
      <guid>https://people.kernel.org/read/a/4ntwxbubz5</guid>
      <pubDate>Fri, 13 Mar 2020 01:12:22 +0000</pubDate>
    </item>
    <item>
      <title>SRCU state double scan</title>
      <link>https://people.kernel.org/joelfernandes/srcu-state-machine-and-double-scan</link>
      <description>&lt;![CDATA[The SRCU flavor of RCU uses per-cpu counters to detect that every CPU has passed through a quiescent state for a particular SRCU lock instance (srcustruct).&#xA;&#xA;There&#39;s are total of 4 counters per-cpu. One pair for locks, and another for unlocks. You can think of the SRCU instance to be split into 2 parts. The readers sample srcuidx and decided which part to use. Each part corresponds to one pair of lock and unlock counters. A reader increments a part&#39;s lock counter during locking and likewise for unlock.&#xA;&#xA;During an update, the updater flips srcuidx (thus attempting to force new readers to use the other part) and waits for the lock/unlock counters on the previous value of srcuidx to match.&#xA;Once the sum of the lock counters of all CPUs match that of unlock, the system knows all pre-existing read-side critical sections have completed.&#xA;&#xA;Things are not that simple, however. It is possible that a reader samples the srcuidx, but before it can increment the lock counter corresponding to it, it undergoes a long delay. We thus we end up in a situation where there are readers in both srcuidx = 0 and srcuidx = 1.&#xA;&#xA;To prevent such a situation, a writer has to wait for readers corresponding to both srcuidx = 0 and srcuidx = 1 to complete. 
This depicted with &#39;A MUST&#39; in the below pseudo-code:&#xA;        reader 1        writer                        reader 2&#xA;        -------------------------------------------------------&#xA;        // readlock&#xA;        // enter&#xA;        Read: idx = 0;&#xA;        long delay    // writelock&#xA;                        // enter&#xA;                        waitfor lock[1]==unlock[1]&#xA;                        idx = 1; / flip /&#xA;                        waitfor lock[0]==unlock[0]&#xA;                        done.&#xA;                                                      Read: idx = 1;&#xA;        lock[0]++;&#xA;                                                      lock[1]++;&#xA;                        // writelock&#xA;                        // return&#xA;        // readlock&#xA;        // return&#xA;        /* NOW BOTH lock[0] and lock[1] are non-zero!! /&#xA;                        // writelock&#xA;                        // enter&#xA;                        waitfor lock[0]==unlock[0] &lt;- A MUST!&#xA;                        idx = 0; / flip */&#xA;                        waitfor lock[1]==unlock[1] &lt;- A MUST!&#xA;NOTE: QRCU has a similar issue. However it overcomes such a race in the reader by retrying the sampling of its &#39;srcuidx&#39; equivalent.&#xA;&#xA;Q: If you have to wait for readers of both srcuidx = 0, and 1, then why not just have a single counter and do away with the &#34;flipping&#34; logic?&#xA;Ans: Because of updater forward progress. If we had a single counter, then it is possible that new readers would constantly increment the lock counter, thus updaters would be waiting all the time. By using the &#39;flip&#39; logic, we are able to drain pre-existing readers using the inactive part of srcu_idx to be drained in a bounded time. The number of readers of a &#39;flipped&#39; part would only monotonically decrease since new readers go to its counterpart.]]&gt;</description>
      <content:encoded><![CDATA[<p>The SRCU flavor of RCU uses per-cpu counters to detect that every CPU has passed through a quiescent state for a particular SRCU lock instance (<code>srcu_struct</code>).</p>

<p>There are a total of four counters per CPU: one pair for locks, and another for unlocks. You can think of the SRCU instance as being split into two parts. The readers sample <code>srcu_idx</code> and decide which part to use. Each part corresponds to one pair of lock and unlock counters. A reader increments a part&#39;s lock counter during locking and likewise for unlock.</p>

<p>During an update, the updater flips <code>srcu_idx</code> (thus attempting to force new readers to use the other part) and waits for the lock/unlock counters on the previous value of <code>srcu_idx</code> to match.
Once the sum of the lock counters of all CPUs match that of unlock, the system knows all pre-existing read-side critical sections have completed.</p>

<p>Things are not that simple, however. It is possible that a reader samples <code>srcu_idx</code>, but before it can increment the lock counter corresponding to it, it undergoes a long delay. We thus end up in a situation where there are readers in both <code>srcu_idx = 0</code> and <code>srcu_idx = 1</code>.</p>

<p>To prevent such a situation, a writer has to wait for readers corresponding to both <code>srcu_idx = 0</code> and <code>srcu_idx = 1</code> to complete. This is depicted with &#39;A MUST&#39; in the pseudo-code below:</p>

<pre><code>        reader 1        writer                        reader 2
        -------------------------------------------------------
        // read_lock
        // enter
        Read: idx = 0;
        &lt;long delay&gt;    // write_lock
                        // enter
                        wait_for lock[1]==unlock[1]
                        idx = 1; /* flip */
                        wait_for lock[0]==unlock[0]
                        done.
                                                      Read: idx = 1;
        lock[0]++;
                                                      lock[1]++;
                        // write_lock
                        // return
        // read_lock
        // return
        /**** NOW BOTH lock[0] and lock[1] are non-zero!! ****/
                        // write_lock
                        // enter
                        wait_for lock[0]==unlock[0] &lt;- A MUST!
                        idx = 0; /* flip */
                        wait_for lock[1]==unlock[1] &lt;- A MUST!
</code></pre>

<p>NOTE: QRCU has a similar issue. However it overcomes such a race in the reader by retrying the sampling of its &#39;srcu_idx&#39; equivalent.</p>

<p>Q: If you have to wait for readers of both <code>srcu_idx = 0</code> and <code>1</code>, then why not just have a single counter and do away with the “flipping” logic?
Ans: Because of updater forward progress. If we had a single counter, then it is possible that new readers would constantly increment the lock counter, so updaters would be waiting all the time. The &#39;flip&#39; logic lets us drain the pre-existing readers of the inactive part of <code>srcu_idx</code> in a bounded time: the number of readers of a &#39;flipped&#39; part can only monotonically decrease, since new readers go to its counterpart.</p>
]]></content:encoded>
      <author>joelfernandes</author>
      <guid>https://people.kernel.org/read/a/cq35nijdqh</guid>
      <pubDate>Fri, 06 Mar 2020 07:11:17 +0000</pubDate>
    </item>
    <item>
      <title>Modeling (lack of) store ordering using PlusCal - and a wishlist</title>
      <link>https://people.kernel.org/joelfernandes/modeling-store-ordering-or-the-lack-thereof-using-pluscal</link>
      <description>&lt;![CDATA[The Message Passing pattern (MP pattern) is shown in the snippet below (borrowed from LKMM docs). Here, P0 and P1 are 2 CPUs executing some code. P0 stores a message in buf and then signals to consumers like P1 that the message is available -- by doing a store to flag. P1 reads flag and if it is set, knows that some data is available in buf and goes ahead and reads it. However, if flag is not set, then P1 does nothing else. Without memory barriers between P0&#39;s stores and P1&#39;s loads, the stores can appear out of order to P1 (on some systems), thus breaking the pattern. The condition r1 == 0 and r2 == 1 is a failure in the below code and would violate the condition. Only after the flag variable is updated, should P1 be allowed to read the buf (&#34;message&#34;).&#xA;&#xA;        int buf = 0, flag = 0;&#xA;&#xA;        P0()&#xA;        {&#xA;                WRITEONCE(buf, 1);&#xA;                WRITEONCE(flag, 1);&#xA;        }&#xA;&#xA;        P1()&#xA;        {&#xA;                int r1;&#xA;                int r2 = 0;&#xA;&#xA;                r1 = READONCE(flag);&#xA;                if (r1)&#xA;                        r2 = READONCE(buf);&#xA;        }&#xA;&#xA;Below is a simple program in PlusCal to model the &#34;Message passing&#34; access pattern and check whether the failure scenario r1 == 0 and r2 == 1 could ever occur. In PlusCal, we can model the non deterministic out-of-order stores to buf and flag using an either or block. This makes PlusCal evaluate both scenarios of stores (store to buf first and then flag, or viceversa) during model checking. 
The technique used for modeling this non-determinism is similar to how it is done in Promela/Spin using an &#34;if block&#34; (Refer to Paul McKenney&#39;s perfbook for details on that).&#xA;&#xA;EXTENDS Integers, TLC&#xA;(--algorithm mppattern&#xA;variables&#xA;    buf = 0,&#xA;    flag = 0;&#xA;&#xA;process Writer = 1&#xA;variables&#xA;    begin&#xA;e0:&#xA;       either&#xA;e1:        buf := 1;&#xA;e2:        flag := 1;&#xA;        or&#xA;e3:        flag := 1;&#xA;e4:        buf := 1;&#xA;        end either;&#xA;end process;&#xA;&#xA;process Reader = 2&#xA;variables&#xA;    r1 = 0,&#xA;    r2 = 0;  &#xA;    begin&#xA;e5:     r1 := flag;&#xA;e6:     if r1 = 1 then&#xA;e7:         r2 := buf;&#xA;        end if;&#xA;e8:     assert r1 = 0 \/ r2 = 1;&#xA;end process;&#xA;&#xA;end algorithm;)&#xA;&#xA;Sure enough, the assert r1 = 0 \/ r2 = 1;  fires when the PlusCal program is run through the TLC model checker.&#xA;&#xA;I do find the either or block clunky, and wish I could just do something like:&#xA;nondeterministic {&#xA;        buf := 1;&#xA;        flag := 1;&#xA;}&#xA;And then, PlusCal should evaluate both store orders. In fact, if I wanted more than 2 stores, then it can get crazy pretty quickly without such a construct. I should try to hack the PlusCal sources soon if I get time, to do exactly this. Thankfully it is open source software.&#xA;&#xA;Other notes:&#xA;&#xA;PlusCal is a powerful language that translates to TLA+. TLA+ is to PlusCal what assembler is to C. I do find PlusCal&#39;s syntax to be non-intuitive but that could just be because I am new to it. In particular, I hate having to mark statements with labels if I don&#39;t want them to atomically execute with neighboring statements. In PlusCal, a label is used to mark a statement as an &#34;atomic&#34; entity. A group of statements under a label are all atomic. However, if you don&#39;t specific labels on every statement like I did above (eX), then everything goes under a neighboring label. 
I wish PlusCal had an option, where a programmer could add implict labels to all statements, and then add explicit atomic { } blocks around statements that were indeed atomic. This is similar to how it is done in Promela/Spin.&#xA;&#xA;I might try to hack up my own compiler to TLA+ if I can find the time to, or better yet modify PlusCal itself to do what I want. Thankfully the code for the PlusCal translator is open source software.]]&gt;</description>
      <content:encoded><![CDATA[<p>The Message Passing pattern (MP pattern) is shown in the snippet below (borrowed from LKMM docs). Here, P0 and P1 are 2 CPUs executing some code. P0 stores a message in <code>buf</code> and then signals to consumers like P1 that the message is available — by doing a store to <code>flag</code>. P1 reads <code>flag</code> and if it is set, knows that some data is available in <code>buf</code> and goes ahead and reads it. However, if <code>flag</code> is not set, then P1 does nothing else. Without memory barriers between P0&#39;s stores and P1&#39;s loads, the stores can appear out of order to P1 (on some systems), thus breaking the pattern. The condition <code>r1 == 0 and r2 == 1</code> is a failure in the below code and would violate the condition. Only after the <code>flag</code> variable is updated, should P1 be allowed to read the <code>buf</code> (“message”).</p>

<pre><code>        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2 = 0;

                r1 = READ_ONCE(flag);
                if (r1)
                        r2 = READ_ONCE(buf);
        }
</code></pre>

<p>Below is a simple program in PlusCal to model the “Message passing” access pattern and check whether the failure scenario <code>r1 == 1 and r2 == 0</code> could ever occur. In PlusCal, we can model the non-deterministic, out-of-order stores to <code>buf</code> and <code>flag</code> using an <code>either or</code> block. This makes PlusCal evaluate both store orders (store to <code>buf</code> first and then <code>flag</code>, or vice versa) during model checking. The technique used for modeling this non-determinism is similar to how it is done in Promela/Spin using an “if block” (refer to Paul McKenney&#39;s perfbook for details).</p>

<pre><code>EXTENDS Integers, TLC
(*--algorithm mp_pattern
variables
    buf = 0,
    flag = 0;

process Writer = 1
variables
    begin
e0:
       either
e1:        buf := 1;
e2:        flag := 1;
        or
e3:        flag := 1;
e4:        buf := 1;
        end either;
end process;

process Reader = 2
variables
    r1 = 0,
    r2 = 0;  
    begin
e5:     r1 := flag;
e6:     if r1 = 1 then
e7:         r2 := buf;
        end if;
e8:     assert r1 = 0 \/ r2 = 1;
end process;

end algorithm;*)
</code></pre>

<p>Sure enough, the assertion <code>assert r1 = 0 \/ r2 = 1;</code> fails when the PlusCal program is run through the TLC model checker.</p>
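<p>The same result can be reproduced outside TLC with a brute-force enumeration (illustrative Python, not part of the original post): with the writer&#39;s two stores issued in the reordered order (<code>flag</code> before <code>buf</code>), some interleaving lets the reader observe <code>r1 == 1, r2 == 0</code>.</p>

```python
from itertools import permutations

def run(schedule):
    """Execute one interleaving of writer (W) and reader (R) steps."""
    mem = {"buf": 0, "flag": 0}
    # Reordered writer: flag is stored before buf, modeling the
    # missing memory barrier between the two stores.
    writer = [("flag", 1), ("buf", 1)]
    wi = 0
    r1 = r2 = 0
    loaded_flag = False
    for who in schedule:
        if who == "W":
            var, val = writer[wi]
            mem[var] = val
            wi += 1
        elif not loaded_flag:
            r1 = mem["flag"]      # r1 = READ_ONCE(flag)
            loaded_flag = True
        elif r1 == 1:
            r2 = mem["buf"]       # r2 = READ_ONCE(buf)
    return (r1, r2)

# Enumerate every interleaving of two writer steps and two reader steps.
outcomes = {run(s) for s in set(permutations("WWRR"))}
print(sorted(outcomes))  # the MP failure (1, 0) is among the outcomes
```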

<p>I do find the <code>either or</code> block clunky, and wish I could just do something like:</p>

<pre><code>non_deterministic {
        buf := 1;
        flag := 1;
}
</code></pre>

<p>And then, PlusCal should evaluate both store orders. In fact, if I wanted more than 2 stores, then it can get crazy pretty quickly without such a construct. I should try to hack the PlusCal sources soon if I get time, to do exactly this. Thankfully it is open source software.</p>

<p>Other notes:</p>
<ul><li><p>PlusCal is a powerful language that translates to TLA+. TLA+ is to PlusCal what assembler is to C. I do find PlusCal&#39;s syntax to be non-intuitive, but that could just be because I am new to it. In particular, I hate having to mark statements with labels if I don&#39;t want them to execute atomically with neighboring statements. In PlusCal, a label is used to mark a statement as an “atomic” entity; a group of statements under a label are all atomic. However, if you don&#39;t specify labels on every statement like I did above (<code>eX</code>), then everything goes under a neighboring label. I wish PlusCal had an option where a programmer could add implicit labels to all statements, and then add explicit <code>atomic { }</code> blocks around statements that were indeed atomic. This is similar to how it is done in Promela/Spin.</p></li>

<li><p>I might try to hack up my own compiler to TLA+ if I can find the time to, or better yet modify PlusCal itself to do what I want. Thankfully the code for the PlusCal translator is open source software.</p></li></ul>
]]></content:encoded>
      <author>joelfernandes</author>
      <guid>https://people.kernel.org/read/a/ai18u7jixo</guid>
      <pubDate>Fri, 18 Oct 2019 22:30:43 +0000</pubDate>
    </item>
    <item>
      <title>Now how many USB-C™ to USB-C™ cables are there? (USB4™ Update,  September 12, 2019)</title>
      <link>https://people.kernel.org/bleung/now-how-many-usb-c-to-usb-c-cables-are-there-usb4-update-september-12</link>
      <description>&lt;![CDATA[tl;dr: There are now 8. Thunderbolt 3 cables officially count too. It&#39;s getting hard to manage, but help is on the way.&#xA;&#xA;Edited lightly 09-16-2019: Tables 3-1 and 5-1 from USB Type-C Spec reproduced as tables instead of images. Made an edit to clarify that Thunderbolt 3 passive cables have always been compliant USB-C cables.&#xA;&#xA;If you recall my first cable post, there were 6 kinds of cables with USB-C plugs on both ends. I was also careful to preface that it was true as of USB Type-C™ Specification 1.4 on June 2019.&#xA;&#xA;Last week, the USB-IF officially published the USB Type-C™ Specification Version Revision 2.0, August 29, 2019.&#xA;&#xA;This is a major update to USB-C and contains required amendments to support the new USB4™ Spec.&#xA;&#xA;One of those amendments? Introducing a new data rate, 20Gbps per lane, or 40Gbps total. This is called &#34;USB4 Gen 3&#34; in the new spec. One more data rate means the matrix of cables increases by a row, so we now have 8 C-to-C cable kinds, see Table 3-1:&#xA;&#xA;Table 3-1 USB Type-C Standard Cable Assemblies (all entries are C-to-C: Plug 1 = C, Plug 2 = C)&#xA;&#xA;Cable Ref | USB Version | Cable Length | Current Rating | USB Power Delivery | Electronically Marked&#xA;CC2-3 | USB 2.0 | ≤ 4 m | 3 A | Supported | Optional&#xA;CC2-5 | USB 2.0 | ≤ 4 m | 5 A | Supported | Required&#xA;CC3G1-3 | USB 3.2 Gen1 and USB4 Gen2 | ≤ 2 m | 3 A | Supported | Required&#xA;CC3G1-5 | USB 3.2 Gen1 and USB4 Gen2 | ≤ 2 m | 5 A | Supported | Required&#xA;CC3G2-3 | USB 3.2 Gen2 and USB4 Gen2 | ≤ 1 m | 3 A | Supported | Required&#xA;CC3G2-5 | USB 3.2 Gen2 and USB4 Gen2 | ≤ 1 m | 5 A | Supported | Required&#xA;CC3G3-3 | USB4 Gen3 | ≤ 0.8 m | 3 A | Supported | Required&#xA;CC3G3-5 | USB4 Gen3 | ≤ 0.8 m | 5 A | Supported | Required&#xA;&#xA;Listed, with new cables in bold:&#xA;USB 2.0 rated at 3A&#xA;USB 2.0 rated at 5A&#xA;USB 3.2 Gen 1  rated at 3A&#xA;USB 3.2 Gen 1  rated at 5A&#xA;USB 3.2 Gen 2  
rated at 3A&#xA;USB 3.2 Gen 2 rated at 5A&#xA;7. USB4 Gen 3 rated at 3A&#xA;8. USB4 Gen 3 rated at 5A&#xA;&#xA;New cables 7 and 8 have the same number of wires as cables 3 through 6, but are built to tolerances such that they can sustain 20Gbps per set of differential pairs, or 40Gbps for the whole cable. This is the maximum data rate in the USB4 Spec.&#xA;&#xA;Also, please notice in the table above that (informative) maximum cable length shrinks as speed increases. Gen 1 cables can be 2M long, while Gen 3 cables can be 0.8m. This is just a practical consequence of physics and signal integrity when it comes to passive cables.&#xA;&#xA;Data Rates &#xA;Data rates require some explanation too, as advancements since USB 3.1 means that the same physical cable is capable of way more when used in a USB4 system.&#xA;&#xA;A USB 3.1 Gen 1 cable built and sold in 2015 would have been advertised to support 5Gbps operation in 2015. Fast forward to 2019 or 2020, that exact same physical cable (Gen 1), will actually allow you to hit 20gbps using USB4. This is due to advancements in the underlying phy on the host and client-side, but also because USB4 uses all 8 SuperSpeed wires simultaneously, while USB 3.1 only used 4 (single lane operation versus dual-lane operation).&#xA;&#xA;The same goes for USB 3.1 Gen 2 cables, which would have been sold as 10gbps cables. 
They are able to support 20gbps operation in USB4, again, because of dual-lane.&#xA;&#xA;h3Table 5-1 Certified Cables Where USB4-compatible Operation is Expected/h3&#xA;table style=&#34;border-collapse:collapse;border-spacing:0;&#34;&#xA;  tr&#xA;    th style=&#34;padding:10px 5px;border:none;&#34;/th&#xA;    th style=&#34;padding:10px 5px;border:1px black solid;font-weight:bold;&#34;Cable Signaling/th&#xA;    th style=&#34;padding:10px 5px;border:1px black solid;font-weight:bold;&#34;USB4 Operation/th&#xA;    th style=&#34;padding:10px 5px;border:1px black solid;font-weight:bold;&#34;Notes/th&#xA;  /tr&#xA;  tr&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;font-weight:bold;&#34; rowspan=&#34;3&#34;USB Type-C Full-Featured Cables (Passive)/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;USB 3.2 Gen1/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;20 Gbps/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;This cable will indicate support for USB 3.2 Gen1 (001b) in the USB Signaling field of its Passive Cable VDO response. 
Note: even though this cable isn’t explicitly tested, certified or logo’ed for USB 3.2 Gen2 operation, USB4 Gen2 operation will generally work./td&#xA;  /tr&#xA;  tr&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;USB 3.2 Gen2 (USB4 Gen2)/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;20 Gbps/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;This cable will indicate support for USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response./td&#xA;  /tr&#xA;  tr&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;USB4 Gen3/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;40 Gbps/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;This cable will indicate support for USB4 Gen3 (011b) in the USB Signaling field of its Passive Cable VDO response./td&#xA;  /tr&#xA;  tr&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;font-weight:bold;&#34; rowspan=&#34;2&#34;Thunderbolt™ 3 Cables (Passive)/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;TBT3 Gen2/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;20 Gbps/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;This cable will indicate support for USB 3.2 Gen1 (001b) or USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response./td&#xA;  /tr&#xA;  tr&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;TBT3 Gen3/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;40 Gbps/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;In addition to indicating support for USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response, this cable will indicate that it supports TBT3 Gen3 in the Discover Mode VDO response./td&#xA;  /tr&#xA;  tr&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;font-weight:bold;&#34; rowspan=&#34;2&#34;USB 
Type-C Full-Featured Cables (Active)/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;USB4 Gen2/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;20 Gbps/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;This cable will indicate support for USB4 Gen2 (010b) in the USB Signaling field of its Active Cable VDO response./td&#xA;  /tr&#xA;  tr&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;USB4 Gen3/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;40 Gbps/td&#xA;    td style=&#34;padding:10px 5px;border:1px black solid;&#34;This cable will indicate support for USB4 Gen3 (011b) in the USB Signaling field of its Active Cable VDO response./td&#xA;  /tr&#xA;/table&#xA;&#xA;What about Thunderbolt 3 cables? Thunderbolt 3 cables physically look the same as a USB-C to USB-C cable and the passive variants of the cables comply with the existing USB-C spec and are to be regarded as USB-C cables of kinds 3 through 6. In addition to being compliant USB-C cables, Intel needed a way to mark some of their cables as 40Gbps capable, years before USB-IF defined the Gen 3 40gbps data rate level. They did so using extra alternate mode data objects in the Thunderbolt 3 cables&#39; electronic marker, amounting to extra registers that mark the cable as high speed capable.&#xA;&#xA;The good news is that since Intel decided to open up the Thunderbolt 3 spec, the USB-IF was able to completely take in and make Passive 20Gbps and 40Gbps Thunderbolt 3 cables supported by USB4 devices. A passive 40Gbps TBT3 cable you bought in 2016 or 2017 will just work at 40Gbps on a USB4 device in 2020.&#xA;&#xA;How Linux USB PD and USB4 systems can help identify cables for users &#xA;By now, you are likely ever so confused by this mess of cable and data rate possibilities. 
The fact that I need a matrix and a decoder ring to explain the landscape of USB-C cables is a bad sign.&#xA;&#xA;In the real world, your average user will pick a cable and will simply not be able to determine the capabilities of the cable by looking at it. Even if the cable has the appropriate logo to distinguish them, not every user will understand what the hieroglyphs mean.&#xA;&#xA;Software, however, and Power Delivery may very well help with this. I&#39;ve been looking very closely at the kernel&#39;s USB Type-C Connector Class.&#xA;&#xA;The connector class creates the following structure in sysfs, populating these nodes with important properties queried from the cable, the USB-C port, and the port&#39;s partner:&#xA;&#xA;/sys/class/typec/&#xA;/sys/class/typec/port0 &lt;---------------------------Me&#xA;/sys/class/typec/port0/port0-partner/ &lt;------------My Partner&#xA;/sys/class/typec/port0/port0-cable/ &lt;--------------Our Cable&#xA;/sys/class/typec/port0/port0-cable/port0-plug0 &lt;---Cable SOP&#39;&#xA;/sys/class/typec/port0/port0-cable/port0-plug1 &lt;---Cable SOP&#34;&#xA;You may see where I&#39;m going from here. Once user space is able to see what the cable and its e-marker chip has advertised, an App or Settings panel in the OS could tell the user what the cable is, and hopefully in clear language what the cable can do, even if the cable is unlabeled, or the user doesn&#39;t understand the obscure logos.&#xA;&#xA;Lots of work remains here. The present Type-C Connector class needs to be synced with the latest version of the USB-C and PD spec, but this gives me hope that users will have a tool (any USB-C phone with PD) in their pocket to quickly identify cables.]]&gt;</description>
      <content:encoded><![CDATA[<p><em>tl;dr</em>: There are now 8. Thunderbolt 3 cables officially count too. It&#39;s getting hard to manage, but help is on the way.</p>

<p><em>Edited lightly 09-16-2019: Tables 3-1 and 5-1 from USB Type-C Spec reproduced as tables instead of images. Made an edit to clarify that Thunderbolt 3 passive cables have always been compliant USB-C cables.</em></p>

<p>If you recall my first cable <a href="https://people.kernel.org/bleung/how-many-kinds-of-usb-c-to-usb-c-cables-are-there/" rel="nofollow">post</a>, there were 6 kinds of cables with USB-C plugs on both ends. I was also careful to preface that it was true as of USB Type-C™ Specification 1.4, dated June 2019.</p>

<p>Last week, the USB-IF officially published the <a href="https://www.usb.org/document-library/usb-type-cr-cable-and-connector-specification-revision-20-august-2019" rel="nofollow">USB Type-C™ Specification Revision 2.0, August 29, 2019.</a></p>

<p>This is a major update to USB-C and contains required amendments to support the new USB4™ Spec.</p>

<p>One of those amendments? Introducing a new data rate, 20Gbps per lane, or 40Gbps total. This is called “USB4 Gen 3” in the new spec. One more data rate means the matrix of cables increases by a row, so we now have <em>8</em> C-to-C cable kinds; see Table 3-1:</p>

<p><h3>Table 3-1 USB Type-C Standard Cable Assemblies</h3>
<table style="border-collapse:collapse;border-spacing:0;">
  <tr>
    <th style="padding:10px 5px;border:1px black solid;font-weight:bold;">Cable Ref</th>
    <th style="padding:10px 5px;border:1px black solid;font-weight:bold;">Plug 1</th>
    <th style="padding:10px 5px;border:1px black solid;font-weight:bold;">Plug 2</th>
    <th style="padding:10px 5px;border:1px black solid;font-weight:bold;">USB Version</th>
    <th style="padding:10px 5px;border:1px black solid;font-weight:bold;">Cable Length</th>
    <th style="padding:10px 5px;border:1px black solid;font-weight:bold;">Current Rating</th>
    <th style="padding:10px 5px;border:1px black solid;font-weight:bold;">USB Power Delivery</th>
    <th style="padding:10px 5px;border:1px black solid;font-weight:bold;">USB Type-C Electronically Marked</th>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">CC2-3</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">C</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">C</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">USB 2.0</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">≤ 4 m </td>
    <td style="padding:10px 5px;border:1px black solid;">3 A</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">Supported</td>
    <td style="padding:10px 5px;border:1px black solid;">Optional</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">CC2-5</td>
    <td style="padding:10px 5px;border:1px black solid;">5 A</td>
    <td style="padding:10px 5px;border:1px black solid;">Required</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">CC3G1-3</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">C</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">C</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">USB 3.2 Gen1 and USB4 Gen2</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">≤ 2 m</td>
    <td style="padding:10px 5px;border:1px black solid;">3 A</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">Supported</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">Required</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">CC3G1-5</td>
    <td style="padding:10px 5px;border:1px black solid;">5 A</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">CC3G2-3</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">C</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">C</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">USB 3.2 Gen2 and USB4 Gen2</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">≤ 1 m</td>
    <td style="padding:10px 5px;border:1px black solid;">3 A</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">Supported</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">Required</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">CC3G2-5</td>
    <td style="padding:10px 5px;border:1px black solid;">5 A</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">CC3G3-3</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">C</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">C</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">USB4 Gen3</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">≤ 0.8 m</td>
    <td style="padding:10px 5px;border:1px black solid;">3 A</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">Supported</td>
    <td style="padding:10px 5px;border:1px black solid;" rowspan="2">Required</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">CC3G3-5</td>
    <td style="padding:10px 5px;border:1px black solid;">5 A</td>
  </tr>
</table></p>

<p>Listed, with new cables in bold:
1. USB 2.0 rated at 3A
2. USB 2.0 rated at 5A
3. USB 3.2 Gen 1 rated at 3A
4. USB 3.2 Gen 1 rated at 5A
5. USB 3.2 Gen 2 rated at 3A
6. USB 3.2 Gen 2 rated at 5A
<strong>7. USB4 Gen 3 rated at 3A</strong>
<strong>8. USB4 Gen 3 rated at 5A</strong></p>

<p>New cables 7 and 8 have the same number of wires as cables 3 through 6, but are built to tolerances such that they can sustain 20Gbps per set of differential pairs, or 40Gbps for the whole cable. This is the maximum data rate in the USB4 Spec.</p>

<p>Also, please notice in the table above that (informative) maximum cable length shrinks as speed increases. Gen 1 cables can be up to 2 m long, while Gen 3 cables top out at 0.8 m. This is just a practical consequence of physics and signal integrity when it comes to passive cables.</p>

<h2 id="data-rates">Data Rates</h2>

<p>Data rates require some explanation too, as advancements since USB 3.1 mean that the same physical cable is capable of way more when used in a USB4 system.</p>

<p>A USB 3.1 Gen 1 cable built and sold in 2015 would have been advertised to support 5Gbps operation. Fast forward to 2019 or 2020: that exact same physical cable (Gen 1) will actually allow you to hit 20Gbps using USB4. This is due to advancements in the underlying PHYs on the host and device side, but also because USB4 uses all 8 SuperSpeed wires simultaneously, while USB 3.1 only used 4 (dual-lane versus single-lane operation).</p>

<p>The same goes for USB 3.1 Gen 2 cables, which would have been sold as 10Gbps cables. They are able to support 20Gbps operation in USB4, again because of dual-lane operation.</p>
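<p>The single-lane versus dual-lane arithmetic can be sketched in a few lines of Python. The per-lane rates are the ones quoted above; the helper name is mine:</p>

```python
# Aggregate USB data rate: per-lane signaling rate x number of lanes.
# USB 3.1 drove a single lane over the SuperSpeed pairs; USB4 drives
# the same wires dual-lane, doubling the total throughput.

def aggregate_gbps(per_lane_gbps: float, lanes: int) -> float:
    """Total raw data rate across all lanes, in Gbps."""
    return per_lane_gbps * lanes

# A Gen 1 cable on a USB 3.1 single-lane host:
usb31_gen1 = aggregate_gbps(5, 1)    # 5 Gbps
# The same cable driven dual-lane at Gen 2 rates by USB4:
usb4_gen2 = aggregate_gbps(10, 2)    # 20 Gbps
# A new Gen 3 cable, 20 Gbps per lane, dual-lane:
usb4_gen3 = aggregate_gbps(20, 2)    # 40 Gbps

print(usb31_gen1, usb4_gen2, usb4_gen3)
```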

<p><h3>Table 5-1 Certified Cables Where USB4-compatible Operation is Expected</h3>
<table style="border-collapse:collapse;border-spacing:0;">
  <tr>
    <th style="padding:10px 5px;border:none;"></th>
    <th style="padding:10px 5px;border:1px black solid;font-weight:bold;">Cable Signaling</th>
    <th style="padding:10px 5px;border:1px black solid;font-weight:bold;">USB4 Operation</th>
    <th style="padding:10px 5px;border:1px black solid;font-weight:bold;">Notes</th>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;font-weight:bold;" rowspan="3">USB Type-C Full-Featured Cables (Passive)</td>
    <td style="padding:10px 5px;border:1px black solid;">USB 3.2 Gen1</td>
    <td style="padding:10px 5px;border:1px black solid;">20 Gbps</td>
    <td style="padding:10px 5px;border:1px black solid;">This cable will indicate support for USB 3.2 Gen1 (001b) in the USB Signaling field of its Passive Cable VDO response. Note: even though this cable isn’t explicitly tested, certified or logo’ed for USB 3.2 Gen2 operation, USB4 Gen2 operation will generally work.</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">USB 3.2 Gen2 (USB4 Gen2)</td>
    <td style="padding:10px 5px;border:1px black solid;">20 Gbps</td>
    <td style="padding:10px 5px;border:1px black solid;">This cable will indicate support for USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response.</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">USB4 Gen3</td>
    <td style="padding:10px 5px;border:1px black solid;">40 Gbps</td>
    <td style="padding:10px 5px;border:1px black solid;">This cable will indicate support for USB4 Gen3 (011b) in the USB Signaling field of its Passive Cable VDO response.</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;font-weight:bold;" rowspan="2">Thunderbolt™ 3 Cables (Passive)</td>
    <td style="padding:10px 5px;border:1px black solid;">TBT3 Gen2</td>
    <td style="padding:10px 5px;border:1px black solid;">20 Gbps</td>
    <td style="padding:10px 5px;border:1px black solid;">This cable will indicate support for USB 3.2 Gen1 (001b) or USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response.</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">TBT3 Gen3</td>
    <td style="padding:10px 5px;border:1px black solid;">40 Gbps</td>
    <td style="padding:10px 5px;border:1px black solid;">In addition to indicating support for USB 3.2 Gen2 (010b) in the USB Signaling field of its Passive Cable VDO response, this cable will indicate that it supports TBT3 Gen3 in the Discover Mode VDO response.</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;font-weight:bold;" rowspan="2">USB Type-C Full-Featured Cables (Active)</td>
    <td style="padding:10px 5px;border:1px black solid;">USB4 Gen2</td>
    <td style="padding:10px 5px;border:1px black solid;">20 Gbps</td>
    <td style="padding:10px 5px;border:1px black solid;">This cable will indicate support for USB4 Gen2 (010b) in the USB Signaling field of its Active Cable VDO response.</td>
  </tr>
  <tr>
    <td style="padding:10px 5px;border:1px black solid;">USB4 Gen3</td>
    <td style="padding:10px 5px;border:1px black solid;">40 Gbps</td>
    <td style="padding:10px 5px;border:1px black solid;">This cable will indicate support for USB4 Gen3 (011b) in the USB Signaling field of its Active Cable VDO response.</td>
  </tr>
</table></p>
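<p>The three-bit USB Signaling values quoted in the Notes column can be turned into human-readable names with a small lookup. This is only a sketch covering the values Table 5-1 mentions; extracting the field from a full 32-bit Cable VDO (bit positions, passive versus active layouts) is defined in the USB PD spec and not reproduced here:</p>

```python
# Decode the 3-bit "USB Signaling" field values quoted in Table 5-1.
# Only the values the table mentions are mapped; anything else is
# reported as reserved/unknown.

USB_SIGNALING = {
    0b001: "USB 3.2 Gen1",
    0b010: "USB 3.2 Gen2 / USB4 Gen2",
    0b011: "USB4 Gen3",
}

def describe_signaling(field: int) -> str:
    """Map a 3-bit USB Signaling field value to a readable name."""
    return USB_SIGNALING.get(field, f"reserved/unknown ({field:03b}b)")

print(describe_signaling(0b011))  # USB4 Gen3
```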

<p>What about Thunderbolt 3 cables? Thunderbolt 3 cables physically look the same as USB-C to USB-C cables, and the passive variants comply with the existing USB-C spec: they are to be regarded as USB-C cables of kinds 3 through 6. In addition to being compliant USB-C cables, Intel needed a way to mark some of their cables as 40Gbps capable, years before the USB-IF defined the Gen 3 40Gbps data rate. They did so using extra <em>alternate mode</em> data objects in the Thunderbolt 3 cables&#39; electronic marker, amounting to extra registers that mark the cable as high-speed capable.</p>

<p>The good news is that since Intel decided to open up the Thunderbolt 3 spec, the USB-IF was able to fully adopt passive 20Gbps and 40Gbps Thunderbolt 3 cables into USB4. A <em>passive</em> 40Gbps TBT3 cable you bought in 2016 or 2017 will just work at 40Gbps on a USB4 device in 2020.</p>

<h2 id="how-linux-usb-pd-and-usb4-systems-can-help-identify-cables-for-users">How Linux USB PD and USB4 systems can help identify cables for users</h2>

<p>By now, you are likely ever so confused by this mess of cable and data rate possibilities. The fact that I need a matrix and a decoder ring to explain the landscape of USB-C cables is a bad sign.</p>

<p>In the real world, your average user will pick a cable and simply not be able to determine its capabilities by looking at it. Even if the cable carries the appropriate logo, not every user will understand what the hieroglyphs mean.</p>

<p>Software and Power Delivery, however, may very well help with this. I&#39;ve been looking very closely at the kernel&#39;s <a href="https://www.kernel.org/doc/html/latest/driver-api/usb/typec.html" rel="nofollow">USB Type-C Connector Class</a>.</p>

<p>The connector class creates the following structure in sysfs, populating these nodes with important properties queried from the cable, the USB-C port, and the port&#39;s partner:</p>

<pre><code>/sys/class/typec/
/sys/class/typec/port0 &lt;---------------------------Me
/sys/class/typec/port0/port0-partner/ &lt;------------My Partner
/sys/class/typec/port0/port0-cable/ &lt;--------------Our Cable
/sys/class/typec/port0/port0-cable/port0-plug0 &lt;---Cable SOP&#39;
/sys/class/typec/port0/port0-cable/port0-plug1 &lt;---Cable SOP&#34;
</code></pre>

<p>You may see where I&#39;m going with this. Once user space is able to see what the cable and its e-marker chip have advertised, an app or settings panel in the OS could tell the user what the cable is, and hopefully explain in clear language what the cable can do, even if the cable is unlabeled or the user doesn&#39;t understand the obscure logos.</p>

<p>Lots of work remains here. The present Type-C Connector class needs to be synced with the latest versions of the USB-C and PD specs, but this gives me hope that users will have a tool (any USB-C phone with PD) in their pocket to quickly identify cables.</p>
]]></content:encoded>
      <author>Benson Leung</author>
      <guid>https://people.kernel.org/read/a/dddbu2ti0h</guid>
      <pubDate>Fri, 06 Sep 2019 05:11:10 +0000</pubDate>
    </item>
    <item>
      <title>Notes about Netiquette</title>
      <link>https://people.kernel.org/tglx/notes-about-netiquette-qw89</link>
      <description>&lt;![CDATA[E-Mail interaction with the community&#xA;&#xA;You might have been referred to this page with a form letter reply. If so the form letter has been sent to you because you sent e-mail in a way which violates one or more of the common rules of email communication in the context of the Linux kernel or some other Open Source project.&#xA;&#xA;Private mail&#xA;&#xA;Help from the community is provided as a free of charge service on a best effort basis. Sending private mail to maintainers or developers is pretty much a guarantee for being ignored or redirected to this page via a form letter:&#xA;&#xA; Private e-mail does not scale &#xA;Maintainers and developers have limited time and cannot answer the same questions over and over.&#xA;&#xA; Private e-mail is limiting the audience&#xA;Mailing lists allow people other than the relevant maintainers or developers to answer your question. Mailing lists are archived so the answer to your question is available for public search and helps to avoid the same question being asked again and again.&#xA;Private e-mail is also limiting the ability to include the right experts into a discussion as that would first need your consent to give a person who was not included in your Cc list access to the content of your mail and also to your e-mail address. When you post to a public mailing list  then you already gave that consent by doing so.&#xA;It&#39;s usually not required to subscribe to a mailing list. Most mailing  lists are open. Those which are not are explicitly marked so. If you send e-mail to an open list the replies will have you in Cc as this is the general practice. &#xA;&#xA; Private e-mail might be considered deliberate disregard of documentation&#xA;The documentation of the Linux kernel and other Open Source projects   gives clear advice how to contact the community. It&#39;s clearly spelled out that the relevant mailing lists should always be included. 
Adding the relevant maintainers or developers to CC is good practice and usually helps to get the attention of the right people especially on high volume mailing lists like LKML.&#xA;&#xA; Corporate policies are not an excuse for private e-mail&#xA;If your company does not allow you to post on public mailing lists with your work e-mail address, please go and talk to your manager.&#xA;&#xA;Confidentiality disclaimers&#xA;&#xA;When posting to public mailing lists the boilerplate confidentiality disclaimers are not only meaningless, they are absolutely wrong for obvious reasons.&#xA;&#xA;If that disclaimer is automatically inserted by your corporate e-mail infrastructure, talk to your manager, IT department or consider to use a different e-mail address which is not affected by this. Quite some companies have dedicated e-mail infrastructure to avoid this problem.&#xA;&#xA;Reply to all&#xA;&#xA;Trimming Cc lists is usually considered a bad practice. Replying only to the sender of an e-mail immediately excludes all other people involved and defeats the purpose of mailing lists by turning a public discussion into a private conversation. See above.&#xA;&#xA;HTML e-mail&#xA;&#xA;HTML e-mail - even when it is a multipart mail with a corresponding plain/text section - is unconditionally rejected by mailing lists. The plain/text section of multipart HTML e-mail is generated by e-mail clients and often results in completely unreadable gunk.&#xA;&#xA;Multipart e-mail&#xA;&#xA;Again, use plain/text e-mail and not some magic format. Also refrain from attaching patches as that makes it impossible to reply to the patch directly. The kernel documentation contains elaborate explanations how to send patches.&#xA;&#xA;Text mail formatting&#xA;&#xA;Text-based e-mail should not exceed 80 columns per line of text. 
Consult the documentation of your e-mail client to enable proper line breaks around column 78.&#xA;&#xA;Top-posting&#xA;&#xA;If you reply to an e-mail on a mailing list do not top-post. Top-posting is the preferred style in corporate communications, but that does not make an excuse for it:&#xA;&#xA;  A: Because it messes up the order in which people normally read text.&#xA;  Q: Why is top-posting such a bad thing?&#xA;    A: Top-posting.&#xA;  Q: What is the most annoying thing in e-mail?&#xA;    A: No.&#xA;  Q: Should I include quotations after my reply?&#xA;&#xA;See also: http://daringfireball.net/2007/07/on_top&#xA;&#xA;Trim replies&#xA;&#xA;If you reply to an e-mail on a mailing list trim unneeded content of the e-mail you are replying to. It&#39;s an annoyance to have to scroll down through several pages of quoted text to find a single line of reply or to figure out that after that reply the rest of the e-mail is just useless ballast.&#xA;&#xA;Quoting code&#xA;&#xA;If you want to refer to code or a particular function then mentioning the file and function name is completely sufficient. Maintainers and developers surely do not need a link to a git-web interface or one of the source cross-reference sites. They are definitely able to find the code in question with their favorite editor.&#xA;&#xA;If you really need to quote code to illustrate your point do not copy that from some random web interface as that turns again into unreadable gunk. Insert the code snippet from the source file and only insert the absolute minimum of lines to make your point. 
Again people are able to find the context on their own and while your hint might be correct in many cases the issue you are looking into is root caused at a completely different place.&#xA;&#xA;Does not work for you?&#xA;&#xA;In case you can&#39;t follow the rules above and the documentation of the Open Source project you want to communicate with, consider to seek professional&#xA;help to solve your problem.&#xA;&#xA;Open Source consultants and service providers charge for their services and therefore are willing to deal with HTML e-mail, disclaimers, top-posting and other nuisances of corporate style communications.&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<h2 id="e-mail-interaction-with-the-community">E-Mail interaction with the community</h2>

<p>You might have been referred to this page with a form letter reply. If so, the form letter was sent to you because you sent e-mail in a way which violates one or more of the common rules of e-mail communication in the context of the Linux kernel or some other Open Source project.</p>

<h2 id="private-mail">Private mail</h2>

<p>Help from the community is provided as a free-of-charge service on a best-effort basis. Sending private mail to maintainers or developers is pretty much a guarantee for being ignored or redirected to this page via a form letter:</p>
<ul><li><p><strong>Private e-mail does not scale</strong>
Maintainers and developers have limited time and cannot answer the same questions over and over.</p></li>

<li><p><strong>Private e-mail is limiting the audience</strong>
Mailing lists allow people other than the relevant maintainers or developers to answer your question. Mailing lists are archived, so the answer to your question is available for public search, which helps avoid the same question being asked again and again.
Private e-mail also limits the ability to include the right experts in a discussion, as that would first need your consent to give a person who was not included in your Cc list access to the content of your mail and to your e-mail address. When you post to a public mailing list, you have already given that consent by doing so.
It&#39;s usually not required to subscribe to a mailing list. Most mailing lists are open. Those which are not are explicitly marked as such. If you send e-mail to an open list, the replies will have you in Cc, as this is the general practice.</p></li>

<li><p><strong>Private e-mail might be considered deliberate disregard of documentation</strong>
The documentation of the Linux kernel and other Open Source projects gives clear advice on how to contact the community. It&#39;s clearly spelled out that the relevant mailing lists should always be included. Adding the relevant maintainers or developers to Cc is good practice and usually helps to get the attention of the right people, especially on high-volume mailing lists like LKML.</p></li>

<li><p><strong>Corporate policies are not an excuse for private e-mail</strong>
If your company does not allow you to post on public mailing lists with your work e-mail address, please go and talk to your manager.</p></li></ul>

<h1 id="confidentiality-disclaimers">Confidentiality disclaimers</h1>

<p>When posting to public mailing lists the boilerplate confidentiality disclaimers are not only meaningless, they are absolutely wrong for obvious reasons.</p>

<p>If that disclaimer is automatically inserted by your corporate e-mail infrastructure, talk to your manager or IT department, or consider using a different e-mail address that is not affected by this. Quite a few companies have dedicated e-mail infrastructure to avoid this problem.</p>

<h1 id="reply-to-all">Reply to all</h1>

<p>Trimming Cc lists is usually considered a bad practice. Replying only to the sender of an e-mail immediately excludes all other people involved and defeats the purpose of mailing lists by turning a public discussion into a private conversation. See above.</p>

<h1 id="html-e-mail">HTML e-mail</h1>

<p>HTML e-mail – even when it is a multipart mail with a corresponding text/plain section – is unconditionally rejected by mailing lists. The text/plain section of multipart HTML e-mail is generated by e-mail clients and often results in completely unreadable gunk.</p>

<h1 id="multipart-e-mail">Multipart e-mail</h1>

<p>Again, use plain-text (text/plain) e-mail and not some magic format. Also refrain from attaching patches, as that makes it impossible to reply to the patch directly. The kernel documentation contains elaborate explanations of how to send patches.</p>

<h1 id="text-mail-formatting">Text mail formatting</h1>

<p>Text-based e-mail should not exceed 80 columns per line of text. Consult the documentation of your e-mail client to enable proper line breaks around column 78.</p>
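<p>For drafts composed outside the mail client, one way to get such wrapping (an illustrative suggestion, not part of the original advice) is the standard fmt(1) utility:</p>

```shell
# Re-wrap piped text to at most 78 columns with the standard fmt(1)
# utility; most e-mail clients offer an equivalent built-in setting.
printf '%s\n' "This sentence is deliberately long enough that it cannot possibly fit on one seventy-eight column line and so must be wrapped." | fmt -w 78
```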

<h1 id="top-posting">Top-posting</h1>

<p>If you reply to an e-mail on a mailing list, do not top-post. Top-posting is the preferred style in corporate communications, but that is no excuse for it:</p>

<blockquote><p> A: Because it messes up the order in which people normally read text.
 Q: Why is top-posting such a bad thing?</p>

<p> A: Top-posting.
 Q: What is the most annoying thing in e-mail?</p>

<p> A: No.
 Q: Should I include quotations after my reply?</p></blockquote>

<p>See also: <a href="http://daringfireball.net/2007/07/on_top" rel="nofollow">http://daringfireball.net/2007/07/on_top</a></p>

<h1 id="trim-replies">Trim replies</h1>

<p>If you reply to an e-mail on a mailing list trim unneeded content of the e-mail you are replying to. It&#39;s an annoyance to have to scroll down through several pages of quoted text to find a single line of reply or to figure out that after that reply the rest of the e-mail is just useless ballast.</p>

<h1 id="quoting-code">Quoting code</h1>

<p>If you want to refer to code or a particular function then mentioning the file and function name is completely sufficient. Maintainers and developers surely do not need a link to a git-web interface or one of the source cross-reference sites. They are definitely able to find the code in question with their favorite editor.</p>

<p>If you really need to quote code to illustrate your point, do not copy it from some random web interface, as that again turns into unreadable gunk. Insert the code snippet from the source file and include only the absolute minimum of lines needed to make your point. Again, people are able to find the context on their own, and while your hint might be correct, in many cases the issue you are looking into is root-caused in a completely different place.</p>

<h1 id="does-not-work-for-you">Does not work for you?</h1>

<p>In case you can&#39;t follow the rules above and the documentation of the Open Source project you want to communicate with, consider seeking professional help to solve your problem.</p>

<p>Open Source consultants and service providers charge for their services and therefore are willing to deal with HTML e-mail, disclaimers, top-posting and other nuisances of corporate style communications.</p>
]]></content:encoded>
      <author>tglx</author>
      <guid>https://people.kernel.org/read/a/eq3ihltyo1</guid>
      <pubDate>Tue, 03 Sep 2019 22:32:14 +0000</pubDate>
    </item>
    <item>
      <title>Notes about Netiquette</title>
      <link>https://people.kernel.org/tglx/notes-about-netiquette</link>
      <description>&lt;![CDATA[E-Mail interaction with the community&#xA;&#xA;You might have been referred to this page with a form letter reply. If so the form letter has been sent to you because you sent e-mail in a way which violates one or more of the common rules of email communication in the context of the Linux kernel or some other Open Source project.&#xA;&#xA;Private mail&#xA;&#xA;Help from the community is provided as a free of charge service on a best effort basis. Sending private mail to maintainers or developers is pretty much a guarantee for being ignored or redirected to this page via a form letter:&#xA;&#xA; Private e-mail does not scale &#xA;Maintainers and developers have limited time and cannot answer the same questions over and over.&#xA;&#xA; Private e-mail is limiting the audience&#xA;Mailing lists allow people other than the relevant maintainers or developers to answer your question. Mailing lists are archived so the answer to your question is available for public search and helps to avoid the same question being asked again and again.&#xA;Private e-mail is also limiting the ability to include the right experts into a discussion as that would first need your consent to give a person who was not included in your Cc list access to the content of your mail and also to your e-mail address. When you post to a public mailing list  then you already gave that consent by doing so.&#xA;It&#39;s usually not required to subscribe to a mailing list. Most mailing  lists are open. Those which are not are explicitly marked so. If you send e-mail to an open list the replies will have you in Cc as this is the general practice. &#xA;&#xA; Private e-mail might be considered deliberate disregard of documentation&#xA;The documentation of the Linux kernel and other Open Source projects   gives clear advice how to contact the community. It&#39;s clearly spelled out that the relevant mailing lists should always be included. 
Adding the relevant maintainers or developers to CC is good practice and usually helps to get the attention of the right people especially on high volume mailing lists like LKML.&#xA;&#xA; Corporate policies are not an excuse for private e-mail&#xA;If your company does not allow you to post on public mailing lists with your work e-mail address, please go and talk to your manager.&#xA;&#xA;Confidentiality disclaimers&#xA;&#xA;When posting to public mailing lists the boilerplate confidentiality disclaimers are not only meaningless, they are absolutely wrong for obvious reasons.&#xA;&#xA;If that disclaimer is automatically inserted by your corporate e-mail infrastructure, talk to your manager, IT department or consider to use a different e-mail address which is not affected by this. Quite some companies have dedicated e-mail infrastructure to avoid this problem.&#xA;&#xA;Reply to all&#xA;&#xA;Trimming Cc lists is usually considered a bad practice. Replying only to the sender of an e-mail immediately excludes all other people involved and defeats the purpose of mailing lists by turning a public discussion into a private conversation. See above.&#xA;&#xA;HTML e-mail&#xA;&#xA;HTML e-mail - even when it is a multipart mail with a corresponding plain/text section - is unconditionally rejected by mailing lists. The plain/text section of multipart HTML e-mail is generated by e-mail clients and often results in completely unreadable gunk.&#xA;&#xA;Multipart e-mail&#xA;&#xA;Again, use plain/text e-mail and not some magic format. Also refrain from attaching patches as that makes it impossible to reply to the patch directly. The kernel documentation contains elaborate explanations how to send patches.&#xA;&#xA;Text mail formatting&#xA;&#xA;Text-based e-mail should not exceed 80 columns per line of text. 
Consult the documentation of your e-mail client to enable proper line breaks around column 78.&#xA;&#xA;Top-posting&#xA;&#xA;If you reply to an e-mail on a mailing list do not top-post. Top-posting is the preferred style in corporate communications, but that does not make an excuse for it:&#xA;&#xA;  A: Because it messes up the order in which people normally read text.&#xA;  Q: Why is top-posting such a bad thing?&#xA;    A: Top-posting.&#xA;  Q: What is the most annoying thing in e-mail?&#xA;    A: No.&#xA;  Q: Should I include quotations after my reply?&#xA;&#xA;See also: http://daringfireball.net/2007/07/on_top&#xA;&#xA;Trim replies&#xA;&#xA;If you reply to an e-mail on a mailing list trim unneeded content of the e-mail you are replying to. It&#39;s an annoyance to have to scroll down through several pages of quoted text to find a single line of reply or to figure out that after that reply the rest of the e-mail is just useless ballast.&#xA;&#xA;Quoting code&#xA;&#xA;If you want to refer to code or a particular function then mentioning the file and function name is completely sufficient. Maintainers and developers surely do not need a link to a git-web interface or one of the source cross-reference sites. They are definitely able to find the code in question with their favorite editor.&#xA;&#xA;If you really need to quote code to illustrate your point do not copy that from some random web interface as that turns again into unreadable gunk. Insert the code snippet from the source file and only insert the absolute minimum of lines to make your point. 
Again people are able to find the context on their own and while your hint might be correct in many cases the issue you are looking into is root caused at a completely different place.&#xA;&#xA;Does not work for you?&#xA;&#xA;In case you can&#39;t follow the rules above and the documentation of the Open Source project you want to communicate with, consider to seek professional&#xA;help to solve your problem.&#xA;&#xA;Open Source consultants and service providers charge for their services and therefore are willing to deal with HTML e-mail, disclaimers, top-posting and other nuisances of corporate style communications.&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<h2 id="e-mail-interaction-with-the-community">E-Mail interaction with the community</h2>

<p>You might have been referred to this page with a form letter reply. If so, the form letter was sent to you because you sent e-mail in a way that violates one or more of the common rules of e-mail communication in the context of the Linux kernel or some other Open Source project.</p>

<h2 id="private-mail">Private mail</h2>

<p>Help from the community is provided as a free-of-charge service on a best-effort basis. Sending private mail to maintainers or developers is pretty much a guarantee that you will be ignored or redirected to this page via a form letter:</p>
<ul><li><p><strong>Private e-mail does not scale</strong>
Maintainers and developers have limited time and cannot answer the same questions over and over.</p></li>

<li><p><strong>Private e-mail limits the audience</strong>
Mailing lists allow people other than the relevant maintainers or developers to answer your question. Mailing lists are archived, so the answer to your question is available for public search, which helps avoid the same question being asked again and again.
Private e-mail also limits the ability to include the right experts in a discussion, as that would first require your consent to give a person who was not on your Cc list access to the content of your mail and to your e-mail address. When you post to a public mailing list you have already given that consent by doing so.
It&#39;s usually not required to subscribe to a mailing list. Most mailing lists are open; those which are not are explicitly marked as such. If you send e-mail to an open list, the replies will have you in Cc, as this is the general practice.</p></li>

<li><p><strong>Private e-mail might be considered deliberate disregard of documentation</strong>
The documentation of the Linux kernel and other Open Source projects gives clear advice on how to contact the community. It&#39;s clearly spelled out that the relevant mailing lists should always be included. Adding the relevant maintainers or developers to Cc is good practice and usually helps to get the attention of the right people, especially on high-volume mailing lists like LKML.</p></li>

<li><p><strong>Corporate policies are not an excuse for private e-mail</strong>
If your company does not allow you to post on public mailing lists with your work e-mail address, please go and talk to your manager.</p></li></ul>

<h1 id="confidentiality-disclaimers">Confidentiality disclaimers</h1>

<p>When posting to public mailing lists the boilerplate confidentiality disclaimers are not only meaningless, they are absolutely wrong for obvious reasons.</p>

<p>If that disclaimer is automatically inserted by your corporate e-mail infrastructure, talk to your manager or IT department, or consider using a different e-mail address that is not affected by this. Quite a few companies have dedicated e-mail infrastructure to avoid this problem.</p>

<h1 id="reply-to-all">Reply to all</h1>

<p>Trimming Cc lists is usually considered a bad practice. Replying only to the sender of an e-mail immediately excludes all other people involved and defeats the purpose of mailing lists by turning a public discussion into a private conversation. See above.</p>

<h1 id="html-e-mail">HTML e-mail</h1>

<p>HTML e-mail – even when it is a multipart mail with a corresponding text/plain section – is unconditionally rejected by mailing lists. The text/plain section of multipart HTML e-mail is generated by e-mail clients and often results in completely unreadable gunk.</p>

<h1 id="multipart-e-mail">Multipart e-mail</h1>

<p>Again, use plain-text (text/plain) e-mail and not some magic format. Also refrain from attaching patches, as that makes it impossible to reply to the patch directly. The kernel documentation contains elaborate explanations of how to send patches.</p>

<h1 id="text-mail-formatting">Text mail formatting</h1>

<p>Text-based e-mail should not exceed 80 columns per line of text. Consult the documentation of your e-mail client to enable proper line breaks around column 78.</p>
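<p>For drafts composed outside the mail client, one way to get such wrapping (an illustrative suggestion, not part of the original advice) is the standard fmt(1) utility:</p>

```shell
# Re-wrap piped text to at most 78 columns with the standard fmt(1)
# utility; most e-mail clients offer an equivalent built-in setting.
printf '%s\n' "This sentence is deliberately long enough that it cannot possibly fit on one seventy-eight column line and so must be wrapped." | fmt -w 78
```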

<h1 id="top-posting">Top-posting</h1>

<p>If you reply to an e-mail on a mailing list, do not top-post. Top-posting is the preferred style in corporate communications, but that is no excuse for it:</p>

<blockquote><p> A: Because it messes up the order in which people normally read text.
 Q: Why is top-posting such a bad thing?</p>

<p> A: Top-posting.
 Q: What is the most annoying thing in e-mail?</p>

<p> A: No.
 Q: Should I include quotations after my reply?</p></blockquote>

<p>See also: <a href="http://daringfireball.net/2007/07/on_top" rel="nofollow">http://daringfireball.net/2007/07/on_top</a></p>

<h1 id="trim-replies">Trim replies</h1>

<p>If you reply to an e-mail on a mailing list trim unneeded content of the e-mail you are replying to. It&#39;s an annoyance to have to scroll down through several pages of quoted text to find a single line of reply or to figure out that after that reply the rest of the e-mail is just useless ballast.</p>

<h1 id="quoting-code">Quoting code</h1>

<p>If you want to refer to code or a particular function then mentioning the file and function name is completely sufficient. Maintainers and developers surely do not need a link to a git-web interface or one of the source cross-reference sites. They are definitely able to find the code in question with their favorite editor.</p>

<p>If you really need to quote code to illustrate your point, do not copy it from some random web interface, as that again turns into unreadable gunk. Insert the code snippet from the source file and include only the absolute minimum of lines needed to make your point. Again, people are able to find the context on their own, and while your hint might be correct, in many cases the issue you are looking into is root-caused in a completely different place.</p>

<h1 id="does-not-work-for-you">Does not work for you?</h1>

<p>In case you can&#39;t follow the rules above and the documentation of the Open Source project you want to communicate with, consider seeking professional help to solve your problem.</p>

<p>Open Source consultants and service providers charge for their services and therefore are willing to deal with HTML e-mail, disclaimers, top-posting and other nuisances of corporate style communications.</p>
]]></content:encoded>
      <author>tglx</author>
      <guid>https://people.kernel.org/read/a/81o5d4pt32</guid>
      <pubDate>Tue, 03 Sep 2019 22:13:49 +0000</pubDate>
    </item>
    <item>
      <title>Slides for Open Source Summit (OSS) North America, San Diego 2019: New Container Kernel Features</title>
      <link>https://people.kernel.org/brauner/slides-for-open-source-summit-oss-north-america-san-diego-2019-new</link>
      <description>&lt;![CDATA[slides]]&gt;</description>
      <content:encoded><![CDATA[<p><a href="https://brauner.github.io/_img/2019_oss_na_new_container_kernel_features.pdf" rel="nofollow">slides</a></p>
]]></content:encoded>
      <author>Christian Brauner</author>
      <guid>https://people.kernel.org/read/a/87ef1jgeox</guid>
      <pubDate>Mon, 26 Aug 2019 15:32:31 +0000</pubDate>
    </item>
    <item>
      <title>kdevops: a devops framework for Linux kernel development</title>
      <link>https://people.kernel.org/mcgrof/kdevops-a-devops-framework-for-linux-kernel-development</link>
      <description>&lt;![CDATA[I&#39;m announcing the release of kdevops which aims at making setting up and testing the Linux kernel for any project as easy as possible. Note that setting up testing for a subsystem and testing a subsystem are two separate operations, however we strive for both. This is not a new test framework, it allows you to use existing frameworks, and set those frameworks up as easily can humanly be possible. It relies on a series of modern hip devops frameworks, it relies on ansible, vagrant and terraform, ansible roles through the Ansible Galaxy, and terraform modules.&#xA;&#xA;Three example demo projects are released which demo it&#39;s use:&#xA;&#xA;  kdevops - skeleton generic example using linux-stable&#xA;  fw-kdevops - used for testing firmware loading using linux-next. This example demo was written in about one hour tops by forking kdevops, trimming it, adding a new ansible galaxy for selftests. You are expected to be able to fork it and add your respective kernel selftest fork in a minute&#xA;  oscheck - actively being used to test and advance the XFS filesystem for stable kernel releases. If you fork this to try to add support for testing a new filesystem under a new project, please let me know how long it took you to do that.&#xA;&#xA;Fancy pictures in a nutshell&#xA;&#xA;Of course you just want pictures and the ability to go home after seeing them. Should these be on instagram as well? 
Gosh.&#xA;&#xA;A first run of kdevops&#xA;&#xA;On a first run:&#xA;&#xA;figure&#xA;iframe src=&#34;https://drive.google.com/file/d/1fOkYPyFfVvnM1Fz7TGeMuOfrurTuSEuM/preview&#34; width=&#34;640&#34; height=&#34;480&#34;/iframe&#xA;/figure&#xA;&#xA;Running the bootlinux role on just one host&#xA;&#xA;Example run of just running the ansible bootlinux role on just one host:&#xA;&#xA;figure&#xA;iframe src=&#34;https://drive.google.com/file/d/1VzwXNmNN3CSFNjLwqN0lAwQSHFF7Lx4V/preview&#34; width=&#34;640&#34; height=&#34;480&#34;/iframe&#xA;/figure&#xA;&#xA;End of running the bootlinx ansible role on just one host&#xA;&#xA;This shows what it looks like at the end of running the ansible bootlinux role after the host has booted into the new shiny kernel:&#xA;&#xA;figure&#xA;iframe src=&#34;https://drive.google.com/file/d/1RboLSIJwq-4ETMkwtRBWOQxER29yPb4/preview&#34; width=&#34;640&#34; height=&#34;480&#34;/iframe&#xA;/figure&#xA;&#xA;Logging into test test systems&#xA;&#xA;Well, since we set up your ~/ssh/.config for you, all you gotta do now is just ssh in to the target host you want to test, it will already have the shiny new kernel installed and booted into it:&#xA;&#xA;figure&#xA;iframe src=&#34;https://drive.google.com/file/d/1a55tvfyV1GQ3nkSg7NOeS5lcmqrEAJY/preview&#34; width=&#34;640&#34; height=&#34;480&#34;/iframe&#xA;/figure&#xA;&#xA;Motivations for kdevops&#xA;&#xA;Below I&#39;ll document just a bit of the motivation behind this project. The documentation and demo projects should hopefully suffice for how to use all this.&#xA;&#xA;Testing ain&#39;t easy, brah!&#xA;&#xA;Getting contributors to your subsystem / driver in Linux is wonderful, however ensuring it doesn&#39;t break anything is a completely separate matter. 
It is my belief that testing a patch to ensure no testable regressions exist should be painless, and simple, however that has never been the case.&#xA;&#xA;Testing frameworks ain&#39;t easy to setup, brah!&#xA;&#xA;Linux kernel testing frameworks should also be really easy to set up. But that is typically never the case either. One example case of complexity in setting a test framework is fstests used to tests Linux kernel filesystems, and to ensure to the best of our ability that a new patch doesn&#39;t regress the kernel against a baseline. But wait, what is the baseline?&#xA;&#xA;Setting up test systems ain&#39;t easy to ramp up, brah!&#xA;&#xA;Another difficulty with testing the Linux kernel comes with the fact that you don&#39;t want to test the kernel on same kernel you&#39;re laptop is running on, otherwise you&#39;d crash it, and if you&#39;re testing filesystems you may even end up corrupting your filesystem. So typically folks end up using virtualization technologies to setup virtual machines, boot into them, and then use the virtualized hosts as test vehicles. Another alternative is to use cloud service providers such as OpenStack, Azure, Amazon Web Services, Google Cloud Compute to create hosts on the cloud and use these instead. But I&#39;ve heard complaints about how even setting up KVM can be complex, even from kernel developers! Even some kernel developers don&#39;t want to know how to set up a virtual environment to test things.&#xA;&#xA;I hear ya, brah!&#xA;&#xA;My litmus test for a full set up complexity is all the work required to setup fstests to test a Linux filesystem. If a solution for all the woes above were to ever be provided, I figured it&#39;d have to allow to you easily setup  fstests to test XFS without you doing much work.&#xA;&#xA;I started looking into this effort first by trying to provide my own set of wrappers around KVM to let you easily setup KVM. Then I extended this effort to easily setup fstests. 
Both efforts were all shell hacks... It worked for me, but I was still not really happy with it all. It seemed hacky.&#xA;&#xA;Ted Ts&#39;o&#39;s xfstests-bld.git provided a cloud environment solution for using setting up fstests on Google Cloud Compute for ext filesystemes (ext2, ext3, ext4), however I was not satisfied with this given I wanted it easy to allow you to test any filesystem, and be Cloud provider agnostic.&#xA;&#xA;ansible provides a proper replacement for shell hacks, in a distribution agnostic manner, and even OS agnostic manner. Vagrant lets me replace all those terrible original bash hacks to setup KVM with simple elegant descriptions of what I want a set of target set of hosts to look like. It also lets me support not only KVM but also Virtualbox, and even support Mac OS X. Terraform accomplishes the same but for cloud environments, and supports different providers.&#xA;&#xA;Feedback and rants welcomed&#xA;&#xA;So, give the repositories a shot, I welcome feedback and rants.&#xA;&#xA;kdevops is intended to be used as the de-facto example for all of the ansible roles, and terraform modules.&#xA;&#xA;fw-kdevops  is intended to be forked by folks wanting a simple two host test setup where all you need is linux-next and to run selftests.&#xA;&#xA;oscheck is already actively used to help advance XFS on the stable kernel releases, and is intended to be forked by folks who want to use  fstests to test any filesystem on any kernel release.]]&gt;</description>
<content:encoded><![CDATA[<p>I&#39;m announcing the release of <a href="https://github.com/mcgrof/kdevops/" rel="nofollow">kdevops</a>, which aims to make setting up <em>and</em> testing the Linux kernel for any project as easy as possible. Note that <em>setting up</em> testing for a subsystem and <em>testing</em> a subsystem are two separate operations; however, we strive for both. This is not a new test framework: it allows you to <em>use</em> existing frameworks and to <em>set those frameworks up</em> as easily as humanly possible. It relies on a series of modern hip devops frameworks: <a href="https://www.ansible.com/" rel="nofollow">ansible</a>, <a href="https://www.vagrantup.com" rel="nofollow">vagrant</a> and <a href="https://www.terraform.io/" rel="nofollow">terraform</a>, <a href="https://galaxy.ansible.com/" rel="nofollow">ansible roles</a> through the Ansible Galaxy, and <a href="https://registry.terraform.io/" rel="nofollow">terraform modules</a>.</p>

<p>Three example demo projects are released which demo its use:</p>
<ul><li><a href="https://github.com/mcgrof/kdevops/" rel="nofollow">kdevops</a> – skeleton generic example using linux-stable</li>
<li><a href="https://github.com/mcgrof/fw-kdevops/" rel="nofollow">fw-kdevops</a> – used for testing firmware loading using linux-next. This example demo was written in about one hour tops by forking kdevops, trimming it, and adding a new ansible galaxy role for selftests. You are expected to be able to fork it and add your respective kernel selftest fork in a minute.</li>
<li><a href="https://github.com/mcgrof/oscheck" rel="nofollow">oscheck</a> – actively being used to test and advance the XFS filesystem for stable kernel releases. If you fork this to try to add support for testing a new filesystem under a new project, please let me know how long it took you to do that.</li></ul>

<h2 id="fancy-pictures-in-a-nutshell">Fancy pictures in a nutshell</h2>

<p>Of course you just want pictures and the ability to go home after seeing them. Should these be on instagram as well? Gosh.</p>

<h3 id="a-first-run-of-kdevops">A first run of kdevops</h3>

<p>On a first run:</p>

<figure>
<iframe src="https://drive.google.com/file/d/1fOkYPyFfVvnM1Fz7TGeMuOfrurTuSEuM/preview" width="640" height="480"></iframe>
</figure>

<h3 id="running-the-bootlinux-role-on-just-one-host">Running the bootlinux role on just one host</h3>

<p>Example run of just running the ansible bootlinux role on just one host:</p>

<figure>
<iframe src="https://drive.google.com/file/d/1VzwXNmNN3CSFNjLwqN0lAwQSHFF7Lx4V/preview" width="640" height="480"></iframe>
</figure>

<h3 id="end-of-running-the-bootlinx-ansible-role-on-just-one-host">End of running the bootlinux ansible role on just one host</h3>

<p>This shows what it looks like at the end of running the ansible bootlinux role after the host has booted into the new shiny kernel:</p>

<figure>
<iframe src="https://drive.google.com/file/d/1RboLSIJwq-4ETMk_wtRBWOQxER29yPb4/preview" width="640" height="480"></iframe>
</figure>

<h3 id="logging-into-test-test-systems">Logging into the test systems</h3>

<p>Well, since we set up your ~/.ssh/config for you, all you gotta do now is just ssh into the target host you want to test; it will already have the shiny new kernel installed and booted:</p>

<figure>
<iframe src="https://drive.google.com/file/d/1a55tvfy_V1GQ3nkSg7NOeS5lcmqrEAJY/preview" width="640" height="480"></iframe>
</figure>

<h2 id="motivations-for-kdevops">Motivations for kdevops</h2>

<p>Below I&#39;ll document just a bit of the motivation behind this project. The documentation and demo projects should hopefully suffice for how to use all this.</p>

<h3 id="testing-ain-t-easy-brah">Testing ain&#39;t easy, brah!</h3>

<p>Getting contributors to your subsystem / driver in Linux is wonderful; however, ensuring their changes don&#39;t break anything is a completely separate matter. It is my belief that testing a patch to ensure no testable regressions exist should be painless and simple; however, that has never been the case.</p>

<h3 id="testing-frameworks-ain-t-easy-to-setup-brah">Testing frameworks ain&#39;t easy to setup, brah!</h3>

<p>Linux kernel testing frameworks should also be really easy to set up. But that is typically never the case either. One example of the complexity of setting up a test framework is <a href="https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/" rel="nofollow">fstests</a>, used to test Linux kernel filesystems and to ensure, to the best of our ability, that a new patch doesn&#39;t regress the kernel against a baseline. But wait, what is the baseline?</p>

<h3 id="setting-up-test-systems-ain-t-easy-to-ramp-up-brah">Setting up test systems ain&#39;t easy to ramp up, brah!</h3>

<p>Another difficulty with testing the Linux kernel comes from the fact that you don&#39;t want to test the kernel on the same kernel your laptop is running on; otherwise you&#39;d crash it, and if you&#39;re testing filesystems you may even end up corrupting your filesystem. So typically folks end up using virtualization technologies to set up virtual machines, boot into them, and then use the virtualized hosts as test vehicles. Another alternative is to use cloud service providers such as OpenStack, Azure, Amazon Web Services or Google Cloud Compute to create hosts in the cloud and use these instead. But I&#39;ve heard complaints about how even setting up KVM can be complex, even from kernel developers! Some kernel developers simply don&#39;t want to know how to set up a virtual environment to test things.</p>

<h2 id="i-hear-ya-brah">I hear ya, brah!</h2>

<p>My litmus test for full setup complexity is all the work required to set up <a href="https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/" rel="nofollow">fstests</a> to test a Linux filesystem. If a solution for all the woes above were ever to be provided, I figured it&#39;d have to let you easily set up <a href="https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/" rel="nofollow">fstests</a> to test XFS without you doing much work.</p>

<p>I started looking into this effort first by trying to provide my own set of wrappers around KVM to let you easily set it up. Then I extended this effort to easily set up <a href="https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/" rel="nofollow">fstests</a>. Both efforts were all shell hacks... They worked for me, but I was still not really happy with it all. It seemed hacky.</p>

<p>Ted Ts&#39;o&#39;s <a href="https://git.kernel.org/pub/scm/fs/ext2/xfstests-bld.git/" rel="nofollow">xfstests-bld.git</a> provided a cloud-environment solution for setting up <a href="https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/" rel="nofollow">fstests</a> on Google Cloud Compute for the ext filesystems (ext2, ext3, ext4); however, I was not satisfied with this, given I wanted to make it easy to test <em>any</em> filesystem and to be cloud-provider agnostic.</p>

<p>ansible provides a proper replacement for shell hacks, in a distribution-agnostic and even OS-agnostic manner. Vagrant lets me replace all those terrible original bash hacks to set up KVM with simple, elegant descriptions of what I want a target set of hosts to look like. It also lets me support not only KVM but also VirtualBox, and even Mac OS X. Terraform accomplishes the same for cloud environments, and supports different providers.</p>
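<p>To give a feel for the kind of description meant here, a hypothetical minimal Vagrantfile sketch (the box name, hostname and resource sizes are invented for illustration; kdevops&#39; real configuration is richer):</p>

```ruby
# Hypothetical sketch: one test host described declaratively for Vagrant,
# using the vagrant-libvirt provider to back it with KVM.
Vagrant.configure("2") do |config|
  config.vm.box = "debian/buster64"            # example box, an assumption
  config.vm.define "kdevops-test" do |node|
    node.vm.hostname = "kdevops-test"
    node.vm.provider :libvirt do |lv|
      lv.memory = 2048                         # MiB for the guest
      lv.cpus   = 2
    end
  end
end
```

<p>The same host description then works whether the backing provider is libvirt/KVM or VirtualBox, which is the point being made above.</p>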

<h2 id="feedback-and-rants-welcomed">Feedback and rants welcomed</h2>

<p>So, give the repositories a shot, I welcome feedback and rants.</p>

<p><a href="https://github.com/mcgrof/kdevops/" rel="nofollow">kdevops</a> is intended to be used as the de facto example for all of the ansible roles and terraform modules.</p>

<p><a href="https://github.com/mcgrof/fw-kdevops/" rel="nofollow">fw-kdevops</a> is intended to be forked by folks wanting a simple two-host test setup where all you need is linux-next and to run selftests.</p>

<p><a href="https://github.com/mcgrof/oscheck" rel="nofollow">oscheck</a> is already actively used to help advance XFS on the stable kernel releases, and is intended to be forked by folks who want to use  <a href="https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/" rel="nofollow">fstests</a> to test <em>any</em> filesystem on <em>any</em> kernel release.</p>
]]></content:encoded>
      <author>mcgrof</author>
      <guid>https://people.kernel.org/read/a/srh813tbtc</guid>
      <pubDate>Fri, 16 Aug 2019 07:56:32 +0000</pubDate>
    </item>
    <item>
      <title>LTS kernel release for 2019</title>
      <link>https://people.kernel.org/gregkh/next-long-term-supported-kernel-release</link>
      <description>&lt;![CDATA[As I had this asked to me 3 times today (once in irc, and twice in email), no, the 5.3 kernel release is NOT the next planned Long Term Supported (LTS) release.&#xA;&#xA;I&#39;ve been saying for a few years now that I would pick the &#34;last released&#34; kernel of the year to be the next LTS release.  And as per the wonderful pointy-hair-crystal-ball, that looks to be the 5.4 kernel release this year.&#xA;&#xA;So, count on it being 5.4, unless something really bad happens in that release, such as people throwing in loads of crud because they &#34;need&#34; it for the LTS release.  If that happens again, I&#39;ll just have to pick a different release...]]&gt;</description>
      <content:encoded><![CDATA[<p>As I was asked this 3 times today (once on irc, and twice in email), no, the 5.3 kernel release is NOT the next planned Long Term Supported (LTS) release.</p>

<p>I&#39;ve been saying for a few years now that I would pick the “last released” kernel of the year to be the next LTS release.  And as per the wonderful <a href="http://phb-crystal-ball.org/" rel="nofollow">pointy-hair-crystal-ball</a>, that looks to be the 5.4 kernel release this year.</p>

<p>So, count on it being 5.4, unless something really bad happens in that release, such as people throwing in loads of crud because they “need” it for the LTS release.  If that happens again, I&#39;ll just have to pick a different release...</p>
]]></content:encoded>
      <author>Greg Kroah-Hartman</author>
      <guid>https://people.kernel.org/read/a/nw3vk1krq6</guid>
      <pubDate>Wed, 14 Aug 2019 21:10:57 +0000</pubDate>
    </item>
    <item>
      <title>Why I can&#39;t use a web email client for real work</title>
      <link>https://people.kernel.org/gregkh/why-i-cant-use-a-web-email-client-for-real-work</link>
      <description>&lt;![CDATA[On my personal blog I spend over 4000 words describing just a part of my normal kernel development workflow, all from within the mutt email client.&#xA;&#xA;It&#39;s long, and probably boring to anyone who isn&#39;t used to text-mode email clients, but at least I have a place to point people why they ask why kernel developers don&#39;t use gmail...]]&gt;</description>
      <content:encoded><![CDATA[<p>On my <a href="http://www.kroah.com/log/blog/2019/08/14/patch-workflow-with-mutt-2019/" rel="nofollow">personal blog</a> I spend over 4000 words describing just a part of my normal kernel development workflow, all from within the mutt email client.</p>

<p>It&#39;s long, and probably boring to anyone who isn&#39;t used to text-mode email clients, but at least I have a place to point people to when they ask why kernel developers don&#39;t use gmail...</p>
]]></content:encoded>
      <author>Greg Kroah-Hartman</author>
      <guid>https://people.kernel.org/read/a/ixkmadcwlw</guid>
      <pubDate>Wed, 14 Aug 2019 10:55:18 +0000</pubDate>
    </item>
    <item>
      <title>How to design a proper USB-C™ power sink (hint, not the way Raspberry Pi 4 did it)</title>
      <link>https://people.kernel.org/bleung/how-to-design-a-proper-usb-c-power-sink-hint-not-the-way-raspberry-pi-4</link>
      <description>&lt;![CDATA[This issue came up recently for a high profile new gadget that has made the transition from Micro-USB to USB-C in its latest version, the Raspberry Pi 4. See the excellent blog post by Tyler (aka scorpia): https://www.scorpia.co.uk/2019/06/28/pi4-not-working-with-some-chargers-or-why-you-need-two-cc-resistors/&#xA;&#xA;The short summary is that bad things (no charging) happens if the CC1 and CC2 pins are shorted together anywhere in a USB-C system that is not an audio accessory. When combined with more capable cables (handling SuperSpeed data, or 5A power) this configuration will cause compliant chargers to provide 0V instead of 5V to the Pi.&#xA;&#xA;The Raspberry Pi folks made a very common USB-C hardware design mistake that I have personally encountered dozens of times in prototype hardware and in real gear that was sold to consumers.&#xA;&#xA;What this unique about this case is that Raspberry Pi has posted schematics (thanks open hardware!) of their board that very clearly show the error.&#xA;&#xA;Image&#xA;Excerpt from the reduced Pi4 Model B schematics, from https://www.scorpia.co.uk/wp-content/uploads/2019/06/image-300x292.png&#xA;&#xA;Both of the CC pins in the Pi4 schematic above are tied together on one end of resistor R79, which is a 5.1 kΩ pulldown.&#xA;&#xA;Contrast that to what the USB Type-C Specification mandates must be done in this case.&#xA;&#xA;Image&#xA;USB Type-C&#39;s Sink Functional Model for CC1 and CC2, from USB Type-C Specification 1.4, Section 4.5.1.3.2&#xA;&#xA;Each CC gets its own distinct Rd (5.1 kΩ), and it is important that they are distinct.&#xA;&#xA;The Raspberry Pi team made two critical mistakes here.&#xA;The first is that they designed this circuit themselves, perhaps trying to do something clever with current level detection, but failing to do it right. Instead of trying to come up with some clever circuit, hardware designers should simply copy the figure from the USB-C Spec exactly. 
The Figure 4–9 I posted above isn&#39;t simply a rough guideline of one way of making a USB-C receptacle. It&#39;s actually normative*, meaning mandatory, required by the spec in order to call your system a compliant USB-C power sink. Just copy it.&#xA;&#xA;The second mistake is that they didn&#39;t actually test their Pi4 design with advanced cables. I get it, the USB-C cable situation is confusing and messy, and I&#39;ve covered it in detail here that there are numerous different cables. However, cables with e-marker chips (the kind that would cause problems with Pi4&#39;s mistake) are not that uncommon. Every single Apple MacBook since 2016 has shipped with a cable with an e-marker chip. The fact that no QA team inside of Raspberry Pi&#39;s organization caught this bug indicates they only tested with one kind (the simplest) of USB-C cable.&#xA;&#xA;Raspberry Pi, you can do better. I urge you to correct your design as soon as you can so you can be USB-C compliant.]]&gt;</description>
      <content:encoded><![CDATA[<p>This issue came up recently for a high profile new gadget that has made the transition from Micro-USB to USB-C in its latest version, the Raspberry Pi 4. See the excellent blog post by Tyler (aka scorpia): <a href="https://www.scorpia.co.uk/2019/06/28/pi4-not-working-with-some-chargers-or-why-you-need-two-cc-resistors/" rel="nofollow">https://www.scorpia.co.uk/2019/06/28/pi4-not-working-with-some-chargers-or-why-you-need-two-cc-resistors/</a></p>

<p>The short summary is that bad things (no charging) happen if the CC1 and CC2 pins are shorted together anywhere in a USB-C system that is not an audio accessory. When combined with more capable cables (handling SuperSpeed data, or 5A power), this configuration will cause compliant chargers to provide 0V instead of 5V to the Pi.</p>

<p>The Raspberry Pi folks made a very common USB-C hardware design mistake that I have personally encountered dozens of times in prototype hardware and in real gear that was sold to consumers.</p>

<p>What is unique about this case is that Raspberry Pi has posted schematics (thanks, open hardware!) of their board that very clearly show the error.</p>

<p><img src="https://www.scorpia.co.uk/wp-content/uploads/2019/06/image-300x292.png" alt="Image"></p>

<h6 id="excerpt-from-the-reduced-pi4-model-b-schematics-from-https-www-scorpia-co-uk-wp-content-uploads-2019-06-image-300x292-png">Excerpt from the reduced Pi4 Model B schematics, from <a href="https://www.scorpia.co.uk/wp-content/uploads/2019/06/image-300x292.png" rel="nofollow">https://www.scorpia.co.uk/wp-content/uploads/2019/06/image-300x292.png</a></h6>

<p>Both of the CC pins in the Pi4 schematic above are tied together on one end of resistor R79, which is a 5.1 kΩ pulldown.</p>

<p>Contrast that to what the USB Type-C Specification mandates must be done in this case.</p>

<p><img src="https://cdn-images-1.medium.com/max/800/1*Sh12bTBeMjkxFMtREBNxyg.png" alt="Image"></p>

<h6 id="usb-type-c-s-sink-functional-model-for-cc1-and-cc2-from-usb-type-c-specification-1-4-section-4-5-1-3-2">USB Type-C&#39;s Sink Functional Model for CC1 and CC2, from USB Type-C Specification 1.4, Section 4.5.1.3.2</h6>

<p>Each CC gets its own distinct Rd (5.1 kΩ), and it is important that they are distinct.</p>
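<p>A quick back-of-the-envelope calculation shows why a shared pulldown fails with e-marked cables. In such a cable one plug pin carries the cable&#39;s own Ra termination (roughly 1 kΩ to ground), and with CC1 and CC2 tied together that Ra lands in parallel with the sink&#39;s Rd on the very node the charger is probing. (This is an illustrative sketch only; the exact source currents and detection thresholds are defined in the spec, and the numbers below are approximations.)</p>

```python
def parallel(r1, r2):
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)

RD = 5100.0   # sink pulldown Rd, ohms (mandated by the spec)
RA = 1000.0   # nominal VCONN/e-marker load inside an e-marked cable, ohms

# Default-USB source: roughly an 80 uA current source per CC pin
# (approximate value for illustration).
IP = 80e-6

# Approximate threshold below which the source reads the termination
# as Ra rather than Rd.
V_RA_MAX = 0.2

# Correct sink: the CC wire sees only Rd.
v_correct = IP * RD                 # ~0.41 V -> detected as Rd, VBUS enabled

# Shared pulldown plus an e-marked cable: the cable's Ra ends up on the
# same node as Rd, so the CC wire sees Rd || Ra.
v_shared = IP * parallel(RD, RA)    # ~0.07 V -> mistaken for Ra, no VBUS

print(f"distinct Rd: {v_correct:.3f} V")
print(f"shared pulldown + e-marked cable: {v_shared:.3f} V")
```

<p>~0.41 V reads as a proper Rd sink, while ~0.07 V falls into the range a source treats as Ra, so a compliant charger concludes there is no sink attached and leaves VBUS at 0V.</p>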

<p>The Raspberry Pi team made two critical mistakes here.
The first is that they designed this circuit themselves, perhaps trying to do something clever with current level detection, but failing to do it right. Instead of trying to come up with some clever circuit, hardware designers should simply <em>copy the figure from the USB-C Spec exactly</em>. Figure 4–9, which I posted above, isn&#39;t simply a rough guideline of one way of making a USB-C receptacle. It&#39;s actually <em>normative</em>, meaning mandatory: required by the spec in order to call your system a compliant USB-C power sink. Just copy it.</p>

<p>The second mistake is that they didn&#39;t actually test their Pi4 design with advanced cables. I get it, the USB-C cable situation is confusing and messy, and I&#39;ve covered in detail <a href="https://people.kernel.org/bleung/how-many-kinds-of-usb-c-to-usb-c-cables-are-there" rel="nofollow">here</a> just how many different cables there are. However, cables with e-marker chips (the kind that would cause problems with the Pi4&#39;s mistake) are not that uncommon. Every single Apple MacBook since 2016 has shipped with a cable with an e-marker chip. The fact that no QA team inside Raspberry Pi&#39;s organization caught this bug indicates they only tested with one kind (the simplest) of USB-C cable.</p>

<p>Raspberry Pi, you can do better. I urge you to correct your design as soon as you can so you can be USB-C compliant.</p>
]]></content:encoded>
      <author>Benson Leung</author>
      <guid>https://people.kernel.org/read/a/lchr2of316</guid>
      <pubDate>Fri, 05 Jul 2019 20:40:35 +0000</pubDate>
    </item>
    <item>
      <title>Runtimes And the Curse of the Privileged Container</title>
      <link>https://people.kernel.org/brauner/runtimes-and-the-curse-of-the-privileged-container</link>
      <description>&lt;![CDATA[Introduction (CVE-2019-5736)&#xA;&#xA;Today, Monday, 2019-02-11, 14:00:00 CET CVE-2019-5736 was released:&#xA;&#xA;  The vulnerability allows a malicious container to (with minimal user interaction) overwrite the host runc binary and thus gain root-level code execution on the host. The level of user interaction is being able to run any command (it doesn&#39;t matter if the command is not attacker-controlled) as root within a container in either of these contexts:&#xA;    Creating a new container using an attacker-controlled image.&#xA;  Attaching (docker exec) into an existing container which the attacker had previous write access to.&#xA;&#xA;I&#39;ve been working on a fix for this issue over the last couple of weeks together with Aleksa a friend of mine and maintainer of runC. When he notified me about the issue in runC we tried to come up with an exploit for LXC as well and though harder it is doable. I was interested in the issue for technical reasons and figuring out how to reliably fix it was quite fun (with a proper dose of pure hatred). It also caused me to finally write down some personal thoughts I had for a long time about how we are running containers.&#xA;&#xA;What are Privileged Containers?&#xA;&#xA;At a first glance this is a question that is probably trivial to anyone who has a decent low-level understanding of containers. Maybe even most users by now will know what a privileged container is. A first pass at defining it would be to say that a privileged container is a container that is owned by root.&#xA;Looking closer this seems an insufficient definition. What about containers using user namespaces that are started as root? It seems we need to distinguish between what ids a container is running with. So we could say a privileged container is a container that is running as root. However, this is still wrong. 
Because &#34;running as root&#34; can either be seen as meaning &#34;running as root as seen from the outside&#34; or &#34;running as root from the inside&#34; where &#34;outside&#34; means &#34;as seen from a task outside the container&#34; and &#34;inside&#34; means &#34;as seen from a task inside the container&#34;.&#xA;&#xA;What we really mean by a privileged container is a container where the semantics for id 0 are the same inside and outside of the container ceteris paribus. I say &#34;ceteris paribus&#34; because using LSMs, seccomp or any other security mechanism will not cause a change in the meaning of id 0 inside and outside the container. For example, a breakout caused by a bug in the runtime implementation will give you root access on the host.&#xA;&#xA;An unprivileged container then simply is any container in which the semantics for id 0 inside the container are different from id 0 outside the container. For example, a breakout caused by a bug in the runtime implementation will not give you root access on the host by default. This should only be possible if the kernel&#39;s user namespace implementation has a bug.&#xA;&#xA;The reason why I like to define privileged containers this way is that it also lets us handle edge cases. Specifically, the case where a container is using a user namespace but a hole is punched into the idmapping at id 0 aka where id 0 is mapped through. Consider a container that uses the following idmappings:&#xA;id: 0 100000 100000&#xA;This instructs the kernel to setup the following mapping:&#xA;id: containerid(0) -  hostid(100000)&#xA;id: containerid(1) -  hostid(100001)&#xA;id: containerid(2) -  hostid(100002)&#xA;.&#xA;.&#xA;.&#xA;&#xA;containerid(100000) -  hostid(200000)&#xA;With this mapping it&#39;s evident that containerid(0) != hostid(0). 
But now consider the following mapping:&#xA;id: 0 0 1&#xA;id: 1 100001 99999&#xA;This instructs the kernel to setup the following mapping:&#xA;id: containerid(0) -  hostid(0)&#xA;id: containerid(1) -  hostid(100001)&#xA;id: containerid(2) -  hostid(100002)&#xA;.&#xA;.&#xA;.&#xA;&#xA;containerid(99999) -  hostid(199999)&#xA;In contrast to the first example this has the consequence that containerid(0) == hostid(0). I would argue that any container that at least punches a hole for id 0 into its idmapping up to specifying an identity mapping is to be considered a privileged&#xA;container.&#xA;&#xA;As a sidenote, Docker containers run as privileged containers by default. There is usually some confusion where people think because they do not use the --privileged flag that Docker containers run unprivileged. This is wrong. What the --privileged flag does is to give you even more permissions by e.g. not dropping (specific or even any) capabilities. One could say that such containers are almost &#34;super-privileged&#34;.&#xA;&#xA;The Trouble with Privileged Containers&#xA;&#xA;The problem I see with privileged containers is essentially captured by LXC&#39;s and LXD&#39;s upstream security position which we have held since at least 2015 but probably even earlier. I&#39;m quoting from our notes about privileged containers:&#xA;&#xA;  Privileged containers are defined as any container where the container uid 0 is mapped to the host&#39;s uid 0. 
In such containers, protection of the host and prevention of escape is entirely done through Mandatory Access Control (apparmor, selinux), seccomp filters, dropping of capabilities and namespaces.&#xA;    Those technologies combined will typically prevent any accidental damage of the host, where damage is defined as things like reconfiguring host hardware, reconfiguring the host kernel or accessing the host filesystem.&#xA;    LXC upstream&#39;s position is that those containers aren&#39;t and cannot be root-safe.&#xA;    They are still valuable in an environment where you are running trusted workloads or where no untrusted task is running as root in the container.&#xA;    We are aware of a number of exploits which will let you escape such containers and get full root privileges on the host. Some of those exploits can be trivially blocked and so we do update our different policies once made aware of them. Some others aren&#39;t blockable as they would require blocking so many core features that the average container would become completely unusable.&#xA;  [...]&#xA;&#xA;  As privileged containers are considered unsafe, we typically will not consider new container escape exploits to be security issues worthy of a CVE and quick fix. We will however try to mitigate those issues so that accidental damage to the host is prevented.&#xA;&#xA;LXC&#39;s upstream position for a long time has been that privileged containers are not and cannot be root safe. For something to be considered root safe it should be safe to hand root access to third parties or tasks.&#xA;&#xA;Running Untrusted Workloads in Privileged Containers&#xA;&#xA;is insane. That&#39;s about everything that this paragraph should contain. 
The fact that the semantics for id 0 inside and outside the container are identical entails that any meaningful container escape will have the attacker gain root on the host.&#xA;&#xA;CVE-2019-5736 Is a Very Very Very Bad Privilege Escalation to Host Root&#xA;&#xA;CVE-2019-5736 is an excellent illustration of such an attack. Think about it: a process running inside a privileged container can rather trivially corrupt the binary that is used to attach to the container. This allows an attacker to create a custom ELF binary on the host. That binary could do anything it wants:&#xA;&#xA;could just be a binary that calls poweroff&#xA;could be a binary that spawns a root shell&#xA;could be a binary that kills other containers when called again to attach&#xA;could be suid cat&#xA;.&#xA;.&#xA;.&#xA;&#xA;The attack vector is actually slightly worse for runC due to its architecture. Since runC exits after spawning the container it can also be attacked through a malicious container image. Which is super bad given that a lot of container workload workflows rely on downloading images from the web.&#xA;&#xA;LXC cannot be attacked through a malicious image since the monitor process (a singleton per-container) never exits during the containers life cycle. Since the kernel does not allow modifications to running binaries it is not possible for the attacker to corrupt it. When the container is shutdown or killed the attacking task will be killed before it can do any harm. Only when the last process running inside the container has exited will the monitor itself exit. This has the consequence, that if you run privileged OCI containers via our oci template with LXC your are not vulnerable to malicious images. Only the vector through the attaching binary still applies.&#xA;&#xA;The Lie that Privileged Containers can be safe&#xA;&#xA;Aside from mostly working on the Kernel I&#39;m also a maintainer of LXC and LXD alongside Stéphane Graber. 
We are responsible for LXC - the low-level container runtime - and LXD - the container management daemon using LXC. We have made a very conscious decision to consider privileged containers not root safe. Two main corollaries follow from this:&#xA;&#xA;Privileged containers should never be used to run untrusted workloads.&#xA;Breakouts from privileged containers are not considered CVEs by our security policy. It still seems a common belief that if we all just try hard enough using privileged containers for untrusted workloads is safe. This is not a promise that can be made good upon. A privileged container is not a security boundary. The reason for this is simply what we looked at above: containerid(0) == hostid(0).&#xA;It is therefore deeply troubling that this industry is happy to let users believe that they are safe and secure using privileged containers.&#xA;&#xA;Unprivileged Containers as Default&#xA;&#xA;As upstream for LXC and LXD we have been advocating the use of unprivileged containers by default for years. Way ahead before anyone else did. Our low-level library LXC has supported unprivileged containers since 2013 when user namespaces were merged into the kernel. With LXD we have taken it one step further and made unprivileged containers the default and privileged containers opt-in for that very matter: privileged containers aren&#39;t safe. We even allow you to have per-container idmappings to make sure that not just each container is isolated from the host but also all containers from each other.&#xA;&#xA;For years we have been advocating for unprivileged containers on conferences, in blogposts, and whenever we have spoken to people but somehow this whole industry has chosen to rely on privileged containers.&#xA;&#xA;The good news is that we are seeing changes as people become more familiar with the perils of privileged containers. 
Let this recent CVE be another reminder that unprivileged containers need to be the default.&#xA;&#xA;Are LXC and LXD affected?&#xA;&#xA;I have seen this question asked all over the place so I guess I should add&#xA;a section about this too:&#xA;&#xA;Unprivileged LXC and LXD containers are not affected.&#xA;&#xA;Any privileged LXC and LXD container running on a read-only rootfs is not affected.&#xA;&#xA;Privileged LXC containers in the definition provided above are affected. Though the attack is more difficult than for runC. The reason for this is that the lxc-attach binary does not exit before the program in the container has finished executing. This means an attacker would need to open an OPATH file descriptor to /proc/self/exe, fork() itself into the background and re-open the OPATH file descriptor through /proc/self/fd/OPATH-nr in a loop as OWRONLY and keep trying to write to the binary until such time as lxc-attach exits. Before that it will not succeed since the kernel will not allow modification of a running binary.&#xA;&#xA;Privileged LXD containers are only affected if the daemon is restarted other than for upgrade reasons. This should basically never happen. The LXD daemon never exits so any write will fail because the kernel   does not allow modification of a running binary. If the LXD daemon is restarted because of an upgrade the binary will be swapped out and the file descriptor used for the attack will write to the old in-memory binary and not to the new binary.&#xA;&#xA;Chromebooks with Crostini using LXD are not affected&#xA;&#xA;Chromebooks use LXD as their default container runtime are not affected. First of all, all binaries reside on a read-only filesystem and second, LXD does not allow running privileged containers on Chromebooks through the LXDUNPRIVILEGEDONLY flag. 
For more details see this link.&#xA;&#xA;Fixing CVE-2019-5736&#xA;&#xA;To prevent this attack, LXC has been patched to create a temporary copy of the calling binary itself when it attaches to containers (cf.6400238d08cdf1ca20d49bafb85f4e224348bf9d). To do this LXC can be instructed to create an anonymous, in-memory file using the memfdcreate() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. LXC then executes this sealed, in-memory file instead of the original on-disk binary. Any compromising write operations from a privileged container to the host LXC binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host LXC binary. Also as the temporary, in-memory LXC binary is sealed, writes to this will also fail. To not break downstream users of the shared library this is opt-in by setting LXCMEMFDREXEC in the environment. For our lxc-attach binary which is the only attack vector this is now done by default.&#xA;&#xA;Workloads that place the LXC binaries on a read-only filesystem or prevent running privileged containers can disable this feature by passing --disable-memfd-rexec during the configure stage when compiling LXC.&#xA;&#xA;[1]: https://www.cyphar.com/&#xA;[2]: https://seclists.org/oss-sec/2019/q1/119&#xA;[3]: https://github.com/lxc/lxc&#xA;[4]: https://github.com/lxc/lxd&#xA;[5]: https://github.com/lxc/lxd&#xA;[6]: https://linuxcontainers.org/lxc/security/#privileged-containers&#xA;[7]: https://www.reddit.com/r/Crostini/comments/apkz8t/crostinicontainerslikelyvulnerable_to/&#xA;[8]: https://github.com/lxc/linuxcontainers.org/commit/b1a45aef6abc885594aab2ce6bdeb2186c5e0973&#xA;[9]: https://github.com/lxc/lxc/commit/6400238d08cdf1ca20d49bafb85f4e224348bf9d]]&gt;</description>
      <content:encoded><![CDATA[<h4 id="introduction-cve-2019-5736-2">Introduction (<a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">CVE-2019-5736</a>)</h4>

<p>Today, Monday, 2019-02-11, 14:00:00 CET <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">CVE-2019-5736</a> was released:</p>

<blockquote><p>The vulnerability allows a malicious container to (with minimal user interaction) overwrite the host runc binary and thus gain root-level code execution on the host. The level of user interaction is being able to run any command (it doesn&#39;t matter if the command is not attacker-controlled) as root within a container in either of these contexts:</p>
<ul><li>Creating a new container using an attacker-controlled image.</li>
<li>Attaching (docker exec) into an existing container which the attacker had previous write access to.</li></ul>
</blockquote>

<p>I&#39;ve been working on a fix for this issue over the last couple of weeks together with <a href="https://www.cyphar.com/" rel="nofollow">Aleksa</a>, a friend of mine and maintainer of runC. When he notified me about the issue in runC we tried to come up with an exploit for <a href="https://github.com/lxc/lxc" rel="nofollow">LXC</a> as well, and though it is harder, it is doable. I was interested in the issue for technical reasons and figuring out how to reliably fix it was quite fun (with a proper dose of pure hatred). It also caused me to finally write down some personal thoughts I have had for a long time about how we are running containers.</p>

<h4 id="what-are-privileged-containers">What are Privileged Containers?</h4>

<p>At first glance this is a question that is probably trivial to anyone who has a decent low-level understanding of containers. Maybe even most users by now will know what a privileged container is. A first pass at defining it would be to say that a privileged container is a container that is owned by root.
Looking closer, this seems an insufficient definition. What about containers using user namespaces that are started as root? It seems we need to distinguish between what ids a container is running with. So we could say a privileged container is a container that is running as root. However, this is still wrong. Because “running as root” can either be seen as meaning “running as root as seen from the outside” or “running as root from the inside”, where “outside” means “as seen from a task outside the container” and “inside” means “as seen from a task inside the container”.</p>

<p>What we really mean by a privileged container is a container where the semantics for id 0 are the same inside and outside of the container ceteris paribus. I say “ceteris paribus” because using LSMs, seccomp or any other security mechanism will not cause a change in the meaning of id 0 inside and outside the container. For example, a breakout caused by a bug in the runtime implementation will give you root access on the host.</p>

<p>An unprivileged container then simply is any container in which the semantics for id 0 inside the container are different from id 0 outside the container. For example, a breakout caused by a bug in the runtime implementation will not give you root access on the host by default. This should only be possible if the kernel&#39;s user namespace implementation has a bug.</p>

<p>The reason why I like to define privileged containers this way is that it also lets us handle edge cases. Specifically, the case where a container is using a user namespace but a hole is punched into the idmapping at id 0 aka where id 0 is mapped through. Consider a container that uses the following idmappings:</p>

<pre><code>id: 0 100000 100000
</code></pre>

<p>This instructs the kernel to setup the following mapping:</p>

<pre><code>id: container_id(0) -&gt; host_id(100000)
id: container_id(1) -&gt; host_id(100001)
id: container_id(2) -&gt; host_id(100002)
.
.
.

container_id(100000) -&gt; host_id(200000)
</code></pre>

<p>With this mapping it&#39;s evident that <code>container_id(0) != host_id(0)</code>. But now consider the following mapping:</p>

<pre><code>id: 0 0 1
id: 1 100001 99999
</code></pre>

<p>This instructs the kernel to setup the following mapping:</p>

<pre><code>id: container_id(0) -&gt; host_id(0)
id: container_id(1) -&gt; host_id(100001)
id: container_id(2) -&gt; host_id(100002)
.
.
.

container_id(99999) -&gt; host_id(199999)
</code></pre>

<p>In contrast to the first example this has the consequence that <code>container_id(0) == host_id(0)</code>. I would argue that any container that punches a hole for id 0 into its idmapping, up to and including a full identity mapping, is to be considered a privileged container.</p>
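
<p>This definition can be checked mechanically. A process&#39;s idmapping is visible in <code>/proc/&lt;pid&gt;/uid_map</code>, using the same three-column “inside outside count” format as the examples above; the sketch below (illustrative only, not any runtime&#39;s actual code) flags a mapping as privileged when container id 0 maps through to host id 0:</p>

```python
def maps_root_through(uid_map: str) -> bool:
    """Return True if container id 0 maps to host id 0 under this
    idmapping, i.e. the container is privileged in the sense above.

    uid_map uses the kernel's per-line three-column format:
    "id-inside-namespace id-outside-namespace range-length".
    """
    for line in uid_map.strip().splitlines():
        inside, outside, count = (int(x) for x in line.split()[:3])
        if inside <= 0 < inside + count:
            # Translate container id 0 into its host id.
            return outside - inside == 0
    return False  # id 0 not mapped at all: unprivileged

# The two example mappings from above:
print(maps_root_through("0 100000 100000"))        # unprivileged
print(maps_root_through("0 0 1\n1 100001 99999"))  # hole punched at id 0
```

<p>Reading <code>/proc/self/uid_map</code> on the host itself gives <code>0 0 4294967295</code>, the identity mapping, for which this also returns true, as expected.</p>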

<p>As a sidenote, Docker containers run as privileged containers by default. There is usually some confusion where people think because they do not use the <code>--privileged</code> flag that Docker containers run unprivileged. This is wrong. What the <code>--privileged</code> flag does is to give you even more permissions by e.g. not dropping (specific or even any) capabilities. One could say that such containers are almost “super-privileged”.</p>

<h4 id="the-trouble-with-privileged-containers">The Trouble with Privileged Containers</h4>

<p>The problem I see with privileged containers is essentially captured by <a href="https://github.com/lxc/lxc" rel="nofollow">LXC</a>&#39;s and <a href="https://github.com/lxc/lxd" rel="nofollow">LXD</a>&#39;s upstream security position which we have held since at least <a href="https://github.com/lxc/linuxcontainers.org/commit/b1a45aef6abc885594aab2ce6bdeb2186c5e0973" rel="nofollow">2015</a> but probably even earlier. I&#39;m quoting from our <a href="https://linuxcontainers.org/lxc/security/#privileged-containers" rel="nofollow">notes about privileged containers</a>:</p>

<blockquote><p>Privileged containers are defined as any container where the container uid 0 is mapped to the host&#39;s uid 0. In such containers, protection of the host and prevention of escape is entirely done through Mandatory Access Control (apparmor, selinux), seccomp filters, dropping of capabilities and namespaces.</p>

<p>Those technologies combined will typically prevent any accidental damage of the host, where damage is defined as things like reconfiguring host hardware, reconfiguring the host kernel or accessing the host filesystem.</p>

<p>LXC upstream&#39;s position is that those containers aren&#39;t and cannot be root-safe.</p>

<p>They are still valuable in an environment where you are running trusted workloads or where no untrusted task is running as root in the container.</p>

<p>We are aware of a number of exploits which will let you escape such containers and get full root privileges on the host. Some of those exploits can be trivially blocked and so we do update our different policies once made aware of them. Some others aren&#39;t blockable as they would require blocking so many core features that the average container would become completely unusable.</p></blockquote>

<p>[...]</p>

<blockquote><p>As privileged containers are considered unsafe, we typically will not consider new container escape exploits to be security issues worthy of a CVE and quick fix. We will however try to mitigate those issues so that accidental damage to the host is prevented.</p></blockquote>

<p>LXC&#39;s upstream position for a long time has been that privileged containers are not and cannot be root-safe. For something to be considered root-safe it should be safe to hand root access to third parties or tasks.</p>

<h4 id="running-untrusted-workloads-in-privileged-containers">Running Untrusted Workloads in Privileged Containers</h4>

<p>is insane. That&#39;s about everything that this paragraph should contain. The fact that the semantics for id 0 inside and outside the container are identical entails that any meaningful container escape will have the attacker gain root on the host.</p>

<h4 id="cve-2019-5736-2-is-a-very-very-very-bad-privilege-escalation-to-host-root"><a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">CVE-2019-5736</a> Is a Very Very Very Bad Privilege Escalation to Host Root</h4>

<p><a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">CVE-2019-5736</a> is an excellent illustration of such an attack. Think about it: a process running <strong>inside</strong> a privileged container can rather trivially corrupt the binary that is used to attach to the container. This allows an attacker to create a custom ELF binary on the host. That binary could do anything it wants:</p>
<ul><li>could just be a binary that calls <code>poweroff</code></li>
<li>could be a binary that spawns a root shell</li>
<li>could be a binary that kills other containers when called again to attach</li>
<li>could be <code>suid</code> <code>cat</code></li>
<li>.</li>
<li>.</li>
<li>.</li></ul>

<p>The attack vector is actually slightly worse for runC due to its architecture. Since runC exits after spawning the container, it can also be attacked through a malicious container image. This is especially bad given that a lot of container workflows rely on downloading images from the web.</p>

<p><a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> cannot be attacked through a malicious image since the monitor process (a singleton per container) never exits during the container&#39;s life cycle. Since the kernel does not allow modifications to running binaries, it is not possible for the attacker to corrupt it. When the container is shut down or killed, the attacking task will be killed before it can do any harm. Only when the last process running inside the container has exited will the monitor itself exit. This has the consequence that if you run privileged OCI containers via our <code>oci</code> template with <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> you are not vulnerable to malicious images. Only the vector through the attaching binary still applies.</p>

<h4 id="the-lie-that-privileged-containers-can-be-safe">The Lie That Privileged Containers Can Be Safe</h4>

<p>Aside from mostly working on the Kernel I&#39;m also a maintainer of <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> and <a href="https://github.com/lxc/lxd" rel="nofollow">LXD</a> alongside <a href="https://stgraber.org/" rel="nofollow">Stéphane Graber</a>. We are responsible for <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> – the low-level container runtime – and <a href="https://github.com/lxc/lxd" rel="nofollow">LXD</a> – the container management daemon using <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a>. We have made a very conscious decision to consider privileged containers not root safe. Two main corollaries follow from this:</p>
<ol><li>Privileged containers should never be used to run untrusted workloads.</li>
<li>Breakouts from privileged containers are not considered CVEs by our security policy. It still seems to be a common belief that if we all just try hard enough, using privileged containers for untrusted workloads is safe. This is not a promise that can be made good upon. A privileged container is not a security boundary. The reason for this is simply what we looked at above: <code>container_id(0) == host_id(0)</code>.
It is therefore deeply troubling that this industry is happy to let users believe that they are safe and secure using privileged containers.</li></ol>

<h4 id="unprivileged-containers-as-default">Unprivileged Containers as Default</h4>

<p>As upstream for <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> and <a href="https://github.com/lxc/lxd" rel="nofollow">LXD</a> we have been advocating the use of unprivileged containers by default for years, long before anyone else did. Our low-level library <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> has supported unprivileged containers since 2013, when user namespaces were merged into the kernel. With <a href="https://github.com/lxc/lxd" rel="nofollow">LXD</a> we have taken it one step further and made unprivileged containers the default and privileged containers opt-in for that very reason: privileged containers aren&#39;t safe. We even allow you to have per-container idmappings to make sure that not just each container is isolated from the host but also all containers from each other.</p>

<p>For years we have been advocating for unprivileged containers at conferences, in blog posts, and whenever we have spoken to people, but somehow this whole industry has chosen to rely on privileged containers.</p>

<p>The good news is that we are seeing changes as people become more familiar with the perils of privileged containers. Let this recent CVE be another reminder that unprivileged containers need to be the default.</p>

<h4 id="are-lxc-and-lxd-affected">Are LXC and LXD affected?</h4>

<p>I have seen this question asked all over the place so I guess I should add
a section about this too:</p>
<ul><li><p>Unprivileged <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> and <a href="https://github.com/lxc/lxc" rel="nofollow">LXD</a> containers are not affected.</p></li>

<li><p>Any privileged <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> and <a href="https://github.com/lxc/lxc" rel="nofollow">LXD</a> container running on a read-only rootfs is not affected.</p></li>

<li><p>Privileged <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> containers in the definition provided above are affected. Though the attack is more difficult than for runC. The reason for this is that the <code>lxc-attach</code> binary does not exit before the program in the container has finished executing. This means an attacker would need to open an <code>O_PATH</code> file descriptor to <code>/proc/self/exe</code>, <code>fork()</code> itself into the background and re-open the <code>O_PATH</code> file descriptor through <code>/proc/self/fd/&lt;O_PATH-nr&gt;</code> in a loop as <code>O_WRONLY</code> and keep trying to write to the binary until such time as <code>lxc-attach</code> exits. Before that it will not succeed since the kernel will not allow modification of a running binary.</p></li>

<li><p>Privileged <a href="https://github.com/lxc/lxc" rel="nofollow">LXD</a> containers are only affected if the daemon is restarted for reasons other than an upgrade. This should basically never happen. The <a href="https://github.com/lxc/lxc" rel="nofollow">LXD</a> daemon never exits, so any write will fail because the kernel does not allow modification of a running binary. If the <a href="https://github.com/lxc/lxc" rel="nofollow">LXD</a> daemon is restarted because of an upgrade, the binary will be swapped out and the file descriptor used for the attack will write to the old in-memory binary and not to the new binary.</p></li></ul>

<h4 id="chromebooks-with-crostini-using-lxd-are-not-affected">Chromebooks with Crostini using LXD are not affected</h4>

<p>Chromebooks, which use <a href="https://github.com/lxc/lxc" rel="nofollow">LXD</a> as their default container runtime, are not affected. First of all, all binaries reside on a read-only filesystem and second, <a href="https://github.com/lxc/lxc" rel="nofollow">LXD</a> does not allow running privileged containers on Chromebooks through the <code>LXD_UNPRIVILEGED_ONLY</code> flag. For more details see this <a href="https://www.reddit.com/r/Crostini/comments/apkz8t/crostini_containers_likely_vulnerable_to/" rel="nofollow">link</a>.</p>

<h4 id="fixing-cve-2019-5736">Fixing CVE-2019-5736</h4>

<p>To prevent this attack, <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> has been patched to create a temporary copy of the calling binary itself when it attaches to containers (cf. <a href="https://github.com/lxc/lxc/commit/6400238d08cdf1ca20d49bafb85f4e224348bf9d" rel="nofollow">6400238d08cdf1ca20d49bafb85f4e224348bf9d</a>). To do this, <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> can be instructed to create an anonymous, in-memory file using the <code>memfd_create()</code> system call and to copy itself into this temporary in-memory file, which is then sealed to prevent further modifications. <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> then executes this sealed, in-memory file instead of the original on-disk binary. Any compromising write operations from a privileged container to the host <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> binary will then land in the temporary in-memory copy rather than in the host binary on disk, preserving the integrity of the host <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> binary. And since the temporary in-memory <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> binary is sealed, writes to it will fail as well. To not break downstream users of the shared library, this behavior is opt-in and enabled by setting <code>LXC_MEMFD_REXEC</code> in the environment. For our <code>lxc-attach</code> binary, which is the only attack vector, this is now done by default.</p>

<p>Workloads that place the <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a> binaries on a read-only filesystem or prevent running privileged containers can disable this feature by passing <code>--disable-memfd-rexec</code> during the <code>configure</code> stage when compiling <a href="https://seclists.org/oss-sec/2019/q1/119" rel="nofollow">LXC</a>.</p>
]]></content:encoded>
      <author>Christian Brauner</author>
      <guid>https://people.kernel.org/read/a/rozfujb0s2</guid>
      <pubDate>Tue, 18 Jun 2019 15:54:33 +0000</pubDate>
    </item>
    <item>
      <title>Replacing offlineimap with mbsync</title>
      <link>https://people.kernel.org/mcgrof/replacing-offlineimap-with-mbsync</link>
      <description>&lt;![CDATA[The offlineimap woes&#xA;&#xA;A long term goal I&#39;ve had for a while now was finding a reasonable replacement for offlineimap to get all my email for my development purposes. I knew offlineimap kept dying on me with out of memory (OOM) errors however it was not clear how bad the issue was. It was also not clear what I&#39;d replace it with until now. At least for now... I&#39;ve replaced offlineimap with mbsync. Below are some details comparing both, with shiny graphs of system utilization on both, I&#39;ll provide my recipes for fetching gmail nested labels over IMAP, glance over my systemd user unit files and explain why I use them, and hint what I&#39;m asking Santa for in the future.&#xA;&#xA;System setup when home is $HOME&#xA;&#xA;I used to host my mail fetching system at home, however, $HOME can get complicated if you travel often, and so for more flexibility I rely now on a digital ocean droplet with a small dedicated volume pool for storage for mail. This lets me do away with the stupid host whenever I&#39;m tired of it, and lets me collect nice  system utilization graphs without much effort.&#xA;&#xA;Graphing use on offlineimap&#xA;&#xA;Every now and then I&#39;d check my logs and see how offlineimap tends to run out of memory, and would tend to barf. A temporary solution I figure would work was to disable autorefresh, and instead run offlineimap once in a controlled timely loop using systemd unit timers. That solution didn&#39;t help in the end.  I finally had a bit of time to check my logs carefully and also check system utilization graphs on the sytem over time and to my surprise offlineimap was running out of memory every single damn time. Here&#39;s what I saw from results of running offlineimap for a full month:&#xA;&#xA;Full month graph of offlineimap&#xA;&#xA;Those spikes are a bit concerning, it&#39;s likely the system running out of memory. 
But let&#39;s zoom in to see how often with an hourly graph:&#xA;&#xA;Hourly graph of offlineimap&#xA;&#xA;Pretty much, I was OOM&#39;ing every single damn time! The lull you see towards the end was me just getting fed up and killing offlineimap until I found a replacement.&#xA;&#xA;The OOM risks&#xA;&#xA;Running out of memory every now and then is one thing, but every single time is just insanity. A system always running low on memory while doing writes is an effective way to stress test a kernel, and if the stars align against you, you might even end up with a corrupted filesystem. Fortunately this puny single threaded application is simple enough so I didn&#39;t run into that issue. But it was a risk.&#xA;&#xA;mbsync&#xA;&#xA;mbsync is written in C, actively maintained and has mutt code pedigree. Need I say more? Hell, I&#39;m only sad it took me so long to find out about it. mbsync works with idea of channels, for each it would have a master and local store. The master is where we fetch data from, and the local where we stash things locally.&#xA;&#xA;But in reading its documentation it was not exactly clear how I&#39;d use it for my development purpose to fetch email off of my gmail where I used nested labels for different public mailing lists.&#xA;&#xA;The documentation was also not clear on what to do when migrating and keeping old files.&#xA;&#xA;mbsync migration&#xA;&#xA;Yes in theory you could keep the old IMAP folder, but in practice I ran into a lot of issues. So much so, my solution to the problem was:&#xA;&#xA;$ rm -rf Mail/&#xA;&#xA;And just start fresh... Afraid to make the jump due to the amount of time it may take to sync one of your precious labels? Well, evaluate my different timer solution below.&#xA;&#xA;mbsync for nested gmail labels&#xA;&#xA;Here&#39;s what I ended up with. 
It demos getting mail to say my linux-kernel/linux-xfs and linux-kernel/linux-fsdevel mailing lists, and includes some empirical throttling to ensure you don&#39;t get punted by gmail for going over some sort of usage quota they&#39;ve concocted for an IMAP connection.&#xA;&#xA;A gmail example&#xA;&#xA;First generic defaults&#xA;This example was updated on 2021-05-27 to account&#xA;for the isync rename of Master/Slave for Far/Near&#xA;Create Near&#xA;SyncState &#xA;&#xA;IMAPAccount gmail&#xA;SSLType IMAPS&#xA;Host imap.gmail.com&#xA;User user@gmail.com&#xA;Must be an application specific password, otherwise google will deny access.&#xA;Pass example&#xA;Throttle mbsync so we don&#39;t go over gmail&#39;s quota: OVERQUOTA error would&#xA;eventually be returned otherwise. For more details see:&#xA;https://sourceforge.net/p/isync/mailman/message/35458365/&#xA;PipelineDepth 50&#xA;&#xA;MaildirStore gmail-local&#xA;The trailing &#34;/&#34; is important&#xA;Path ~/Mail/&#xA;Inbox ~/Mail/Inbox&#xA;Subfolders Verbatim&#xA;&#xA;IMAPStore gmail-remote&#xA;Account gmail&#xA;&#xA;emails sent directly to my kernel.org address&#xA;are stored in my gmail label &#34;korg&#34;&#xA;Channel korg&#xA;Far :gmail-remote:&#34;korg&#34;&#xA;Near :gmail-local:korg&#xA;&#xA;An example of nested labels on gmail, useful for large projects with&#xA;many mailing lists. 
We have to flatten out the structure locally.&#xA;Channel linux-xfs&#xA;Far :gmail-remote:&#34;linux-kernel/linux-xfs&#34;&#xA;Near :gmail-local:linux-kernel.linux-xfs&#xA;&#xA;Channel linux-fsdevel&#xA;Far :gmail-remote:&#34;linux-kernel/linux-fsdevel&#34;&#xA;Near :gmail-local:linux-kernel.linux-fsdevel&#xA;&#xA;Get all the gmail channels together into a group.&#xA;Group googlemail&#xA;Channel korg&#xA;Channel linux-xfs&#xA;Channel linux-fsdevel&#xA;&#xA;mbsync systemd unit files&#xA;&#xA;Now, some of these mailing lists (channels in mbsync lingo) have heavy traffic, and I don&#39;t need to be fetching email off of them that often. I also have a channel dedicated solely for emails sent directly to me, those I want right away. But also... since I&#39;m starting fresh, if I ran mbsync to fetch all my email it would mean that at one point mbsync would stall for any large label I&#39;d have. I&#39;d have to wait for those big labels before getting new email for smaller labels.  For this reason, ideally I &#39;d want to actually call mbsync at different intervals depending on the mailing list / mbsync channel. Fortunately mbsync locks per target local directory, and so the only missing piece was a way to configure timers / calls for mbsync in such a way I could still journal calls / issues.&#xA;&#xA;I ended up writing a systemd timer and a service unit file per mailing list. The nice thing about this, in favor over using good &#39;ol cron, is OnUnitInactiveSec=4m, for instance will call mbsync 4 minutes after it last finished. 
I also end up with a central place to collect logs:&#xA;&#xA;journalctl --user&#xA;&#xA;Or if I want to monitor:&#xA;&#xA;journalctl --user -f&#xA;&#xA;For my korg label, patches / rants sent directly to me, I want to fetch mail every minute:&#xA;&#xA;$ cat .config/systemd/user/mbsync-korg.timer&#xA;[Unit]&#xA;Description=mbsync query timer [0000-korg]&#xA;ConditionPathExists=%h/.mbsyncrc&#xA;&#xA;[Timer]&#xA;OnBootSec=1m&#xA;OnUnitInactiveSec=1m&#xA;&#xA;[Install]&#xA;WantedBy=default.target&#xA;&#xA;$ cat .config/systemd/user/mbsync-korg.service&#xA;[Unit]&#xA;Description=mbsync service [korg]&#xA;Documentation=man:mbsync(1)&#xA;ConditionPathExists=%h/.mbsyncrc&#xA;&#xA;[Service]&#xA;Type=oneshot&#xA;ExecStart=/usr/local/bin/mbsync 0000-korg&#xA;&#xA;[Install]&#xA;WantedBy=mail.target&#xA;&#xA;However for my linux-fsdevel... I could wait at least 30 minutes for a refresh:&#xA;&#xA;$ cat .config/systemd/user/mbsync-linux-fsdevel.timer&#xA;[Unit]&#xA;Description=mbsync query timer [linux-fsdevel]&#xA;ConditionPathExists=%h/.mbsyncrc&#xA;&#xA;[Timer]&#xA;OnBootSec=5m&#xA;OnUnitInactiveSec=30m&#xA;&#xA;[Install]&#xA;WantedBy=default.target&#xA;&#xA;And the service unit:&#xA;&#xA;$ cat .config/systemd/user/mbsync-linux-fsdevel.service&#xA;[Unit]&#xA;Description=mbsync service [linux-fsdevel]&#xA;Documentation=man:mbsync(1)&#xA;ConditionPathExists=%h/.mbsyncrc&#xA;&#xA;[Service]&#xA;Type=oneshot&#xA;ExecStart=/usr/local/bin/mbsync linux-fsdevel&#xA;&#xA;[Install]&#xA;WantedBy=mail.target&#xA;&#xA;Enabling and starting systemd user unit files&#xA;&#xA;To enable these unit files I just run for each, for instance for linux-fsdevel:&#xA;&#xA;The first command is now required on more recent versions&#xA;of systemd. 
Only older versions of systemd&#xA;you just need to enable the timer and start it&#xA;systemctl --user enable mbsync-linux-fsdevel.service&#xA;systemctl --user enable mbsync-linux-fsdevel.timer&#xA;systemctl --user start  mbsync-linux-fsdevel.timer&#xA;&#xA;Graphing mbsync&#xA;&#xA;So... how did it do?&#xA;&#xA;I currently have enabled 5 mbsync channels, all fetching my email in the background for me. And not a single one goes on puking with OOM. Here&#39;s what life is looking like now:&#xA;&#xA;mbsync hourly&#xA;&#xA;Peachy.&#xA;&#xA;Long term ideals&#xA;&#xA;IMAP does the job for email, it just seems utterly stupid for public mailing lists and I figure we can do much better. This is specially true in light of the fact of how much simpler it is for me to follow public code Vs public email threads these days. Keep in mind how much more complicated code management is over the goal of just wanting to get a simple stupid email Message ID onto my local Maildir directory. I really had my hopes on public-inbox but after looking into it, it seems clear now that its main objectives are for archiving -- not local storage / MUA use. For details refer to this [linux-kernel discussion on public-inbox &#xA; with a MUA focus](https://lore.kernel.org/lkml/20190307034453.pbhmllhmsjy6kvtc@dcvr/).&#xA;&#xA;If the issue with using public-inbox for local MUA usage was that archive was too big... it seems sensible to me to evaluate trying an even smaller epoch size, and default clients to fetch only one* epoch, the latest one. That alone wouldn&#39;t solve the issue though. How data files are stored on Maildir makes using git almost incompatible. A proper evaluation of using mbox would be in order.&#xA;&#xA;The social lubricant is out on the idea though, and I&#39;m in hopes a proper simple git Mail solution is bound to find us soon for public emails.]]&gt;</description>
      <content:encoded><![CDATA[<h2 id="the-offlineimap-woes">The offlineimap woes</h2>

<p>A long-term goal I&#39;ve had for a while now was finding a reasonable replacement for offlineimap to fetch all my email for my development purposes. I knew offlineimap kept dying on me with out-of-memory (OOM) errors; however, it was not clear <em>how</em> bad the issue was. It was also not clear what I&#39;d replace it with until now. At least for now... I&#39;ve replaced offlineimap with <a href="http://isync.sourceforge.net/mbsync.html" rel="nofollow">mbsync</a>. Below are some details comparing both, with shiny graphs of system utilization for each; I&#39;ll provide my recipes for fetching gmail nested labels over IMAP, glance over my systemd user unit files and explain why I use them, and hint at what I&#39;m asking Santa for in the future.</p>

<h2 id="system-setup-when-home-is-home">System setup when home is $HOME</h2>

<p>I used to host my mail fetching system at home; however, $HOME can get complicated if you travel often, so for more flexibility I now rely on a DigitalOcean droplet with a small dedicated volume pool for mail storage. This lets me do away with the stupid host whenever I&#39;m tired of it, and lets me collect nice system utilization graphs without much effort.</p>

<h2 id="graphing-use-on-offlineimap">Graphing use on offlineimap</h2>

<p>Every now and then I&#39;d check my logs and see that offlineimap had run out of memory and barfed. A temporary solution I figured would work was to disable <code>autorefresh</code>, and instead run offlineimap <em>once</em> in a controlled timely loop using systemd unit timers. That solution didn&#39;t help in the end. I finally had a bit of time to check my logs carefully and also check system utilization graphs on the system over time, and to my surprise offlineimap was running out of memory every single damn time. Here&#39;s what I saw from results of running offlineimap for a full month:</p>

<p><img src="https://lh3.googleusercontent.com/KT0ZeAPoGweP15DZmPXPEFh5aDPKDUlV5bXO3mdLT3IJoWAUI7zX3OALvA7dee76SvtvvDoDtM7HML94jo072unlSO9xAJUeDKJkNNYJrw6ERCvNOfEi5C7W7RWGAeliXoFKJshsU8jrgbNpq2OLnBen_nHLQzIRMEjxLwLJWnS4YpQVWD0rNYX4l6OMFxJ6jbDmydn4NmYaW4JZBDrWQ6loMgoPdKR2ug6oA7GiQO34Ik5EZ3LLyAphcj__yiB4TJ3oeDfdKMDg1K0Qyp2UGXsHt7CO4MvWk7TvM4LtEi9tZ_Ojf3JJ1ThxJZbKi0okRYECDtD4_x6IXUv_9kiycKsEd2Ih49om3a30jbeGyDQms1e3O7uIG1yWMUMnPqb5oLElSw6ax56a7kSw17BU-5e4H_R4dwpb-e-pN3iwE0GaXTAHjRCJsgp5--vF9ZaqI8BSceiYfJ2MK21-RZ2HXodoWqdlHibvBS3gXkt5fHTWaCFDwKjjyS018VVy3QJt_xpuBnZ8bjfAmBHzOMR_KOluNNN6ZnErhBDfF5GUe6cdBAMTl-M_2OjzFddd8zopTmarz0gUmIcCoIPHYyLpZvq1yFcWcYHIH2496tzxMaTdOvLZMlioLnBzStRoIpisg550wChbUOXERVp2ufCcVI_ZfElnCGL9qTuBOCu2yFXp-V9rn1TkkZDtc5X9o1xXJ13W2rLQV5eB5zypPtEV2BdpqQ=w2029-h1626-no" alt="Full month graph of offlineimap"></p>

<p>Those spikes are a bit concerning, it&#39;s likely the system running out of memory. But let&#39;s zoom in to see how often with an hourly graph:</p>

<p><img src="https://lh3.googleusercontent.com/AkcLGsh3y1j6bJwJkeGcdWyauvzALDuxWHnkSk7oEAKr7k6lYaINfE50SeR4HO1VcFIaUYpvhmMsuNlSMHtdiiybtVi4XjpSrFNb5eqTorfTBYUNz6bkgulERDJrHKkBqz0utWRg_njFJsj3HW0EAeIZTSSivbPeQH8ebj0IA_nz3qzhs6lZIMqfiVddxjToKeqaMj_iv9nsphtLE_U87Lif_GIQTFxW82ETx7EVIY8OqVVW7Igqvv5TUpxuZiCgTXPlcWnJjK-BSHtHlItbW9av8liBxWpMOT_gNEK3N9x0e5lp5KeWJhpsb97153wkc6KzkATb-fimDp4dh2FXLNCur9wggu_ZG5D0lkKTTmGfDOAgrKQl7dTjm9c0_W27njDDhMfTuVKFd76H_xXjwRrie6msSNk9o26LAlNY7Tayx5TQzy6uPrDeNsEgR1oUnmFbZrgGUEX4We0kpsVBLg-h_gBbs9pyCg6I2wx3v9wFacMRIgeeCGStXnM_xOq1TLDzo4qb5LSi3ql9iesKmiR4dcmK_B3qNGMkdL0VqcvJMtsymh7cUp4MUCeXDj_cUQJSK-2j5Vy3ixwz7LouC3r7karfVUWMQbgiaE3kfxheX41FzI0SoVyaiPNVM758WCabNHWb3Z5ugLtzEJc6Z1ItVXeTd1e3=w2055-h1097-no" alt="Hourly graph of offlineimap"></p>

<p>Pretty much, I was OOM&#39;ing every single damn time! The lull you see towards the end was me just getting fed up and killing offlineimap until I found a replacement.</p>

<h3 id="the-oom-risks">The OOM risks</h3>

<p>Running out of memory every now and then is one thing, but <em>every</em> single time is just insanity. A system <em>always</em> running low on memory while doing writes is an effective way to stress test a kernel, and if the stars align against you, you might even end up with a corrupted filesystem. Fortunately this puny single threaded application is simple enough so I didn&#39;t run into that issue. But it was a risk.</p>

<h2 id="mbsync">mbsync</h2>

<p>mbsync is written in C, actively maintained and has mutt code pedigree. Need I say more? Hell, I&#39;m only sad it took me so long to find out about it. mbsync works with the idea of channels; for each it has a master and a local store. The master is where we fetch data from, and the local store is where we stash things locally.</p>

<p>But in reading its documentation it was not exactly clear how I&#39;d use it for my development purpose to fetch email off of my gmail where I used nested labels for different public mailing lists.</p>

<p>The documentation was also not clear on what to do when migrating and keeping old files.</p>

<h3 id="mbsync-migration">mbsync migration</h3>

<p>Yes in theory you could keep the old IMAP folder, but in practice I ran into a lot of issues. So much so, my solution to the problem was:</p>

<pre><code class="language-bash">$ rm -rf Mail/
</code></pre>

<p>And just start fresh... Afraid to make the jump due to the amount of time it may take to sync one of your precious labels? Well, evaluate my different timer solution below.</p>

<h2 id="mbsync-for-nested-gmail-labels">mbsync for nested gmail labels</h2>

<p>Here&#39;s what I ended up with. It demos getting mail to say my linux-kernel/linux-xfs and linux-kernel/linux-fsdevel mailing lists, and includes some empirical throttling to ensure you don&#39;t get punted by gmail for going over some sort of usage quota they&#39;ve concocted for an IMAP connection.</p>

<pre><code class="language-bash"># A gmail example
#
# First generic defaults
# This example was updated on 2021-05-27 to account
# for the isync rename of Master/Slave for Far/Near
Create Near
SyncState *

IMAPAccount gmail
SSLType IMAPS
Host imap.gmail.com
User user@gmail.com
# Must be an application specific password, otherwise google will deny access.
Pass example
# Throttle mbsync so we don&#39;t go over gmail&#39;s quota: OVERQUOTA error would
# eventually be returned otherwise. For more details see:
# https://sourceforge.net/p/isync/mailman/message/35458365/
PipelineDepth 50

MaildirStore gmail-local
# The trailing &#34;/&#34; is important
Path ~/Mail/
Inbox ~/Mail/Inbox
Subfolders Verbatim

IMAPStore gmail-remote
Account gmail

# emails sent directly to my kernel.org address
# are stored in my gmail label &#34;korg&#34;
Channel korg
Far :gmail-remote:&#34;korg&#34;
Near :gmail-local:korg

# An example of nested labels on gmail, useful for large projects with
# many mailing lists. We have to flatten out the structure locally.
Channel linux-xfs
Far :gmail-remote:&#34;linux-kernel/linux-xfs&#34;
Near :gmail-local:linux-kernel.linux-xfs

Channel linux-fsdevel
Far :gmail-remote:&#34;linux-kernel/linux-fsdevel&#34;
Near :gmail-local:linux-kernel.linux-fsdevel

# Get all the gmail channels together into a group.
Group googlemail
Channel korg
Channel linux-xfs
Channel linux-fsdevel
</code></pre>

<h2 id="mbsync-systemd-unit-files">mbsync systemd unit files</h2>

<p>Now, some of these mailing lists (channels in mbsync lingo) have heavy traffic, and I don&#39;t need to be fetching email off of them that often. I also have a channel dedicated solely to emails sent directly to me; those I want right away. But also... since I&#39;m starting fresh, if I ran mbsync to fetch all my email it would mean that at some point mbsync would stall on any large label I have. I&#39;d have to wait for those big labels before getting new email for smaller labels. For this reason, ideally I&#39;d want to call mbsync at different intervals depending on the mailing list / mbsync channel. Fortunately mbsync locks per target local directory, and so the only missing piece was a way to configure timers / calls for mbsync in such a way that I could still journal calls / issues.</p>

<p>I ended up writing a <em>systemd</em> timer and a service unit file per mailing list. The nice thing about this, compared to using good ol&#39; cron, is that OnUnitInactiveSec=4m, for instance, will call mbsync 4 minutes <em>after</em> it last finished. I also end up with a central place to collect logs:</p>

<pre><code>journalctl --user
</code></pre>

<p>Or if I want to monitor:</p>

<pre><code class="language-bash">journalctl --user -f
</code></pre>

<p>For my korg label, patches / rants sent directly to me, I want to fetch mail every minute:</p>

<pre><code class="language-bash">$ cat .config/systemd/user/mbsync-korg.timer
[Unit]
Description=mbsync query timer [0000-korg]
ConditionPathExists=%h/.mbsyncrc

[Timer]
OnBootSec=1m
OnUnitInactiveSec=1m

[Install]
WantedBy=default.target

$ cat .config/systemd/user/mbsync-korg.service
[Unit]
Description=mbsync service [korg]
Documentation=man:mbsync(1)
ConditionPathExists=%h/.mbsyncrc

[Service]
Type=oneshot
ExecStart=/usr/local/bin/mbsync 0000-korg

[Install]
WantedBy=mail.target
</code></pre>

<p>However for my linux-fsdevel... I could wait at least 30 minutes for a refresh:</p>

<pre><code>$ cat .config/systemd/user/mbsync-linux-fsdevel.timer
[Unit]
Description=mbsync query timer [linux-fsdevel]
ConditionPathExists=%h/.mbsyncrc

[Timer]
OnBootSec=5m
OnUnitInactiveSec=30m

[Install]
WantedBy=default.target
</code></pre>

<p>And the service unit:</p>

<pre><code>$ cat .config/systemd/user/mbsync-linux-fsdevel.service
[Unit]
Description=mbsync service [linux-fsdevel]
Documentation=man:mbsync(1)
ConditionPathExists=%h/.mbsyncrc

[Service]
Type=oneshot
ExecStart=/usr/local/bin/mbsync linux-fsdevel

[Install]
WantedBy=mail.target
</code></pre>
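
<p>A possible refinement I haven&#39;t tried (sketch only): since the service units differ only in the channel name, a single systemd template unit could replace all of them, with <code>%i</code> expanding to the instance name:</p>

<pre><code>$ cat .config/systemd/user/mbsync@.service
[Unit]
Description=mbsync service [%i]
Documentation=man:mbsync(1)
ConditionPathExists=%h/.mbsyncrc

[Service]
Type=oneshot
ExecStart=/usr/local/bin/mbsync %i
</code></pre>

<p>Each list would still get its own timer, since the intervals differ; a timer such as <code>mbsync-linux-fsdevel.timer</code> would then point at the instance via <code>Unit=mbsync@linux-fsdevel.service</code> in its <code>[Timer]</code> section.</p>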

<h3 id="enabling-and-starting-systemd-user-unit-files">Enabling and starting systemd user unit files</h3>

<p>To enable these unit files I just run for each, for instance for linux-fsdevel:</p>

<pre><code class="language-bash"># On more recent versions of systemd the service unit must be
# enabled as well; on older versions, enabling and starting
# the timer alone was enough
systemctl --user enable mbsync-linux-fsdevel.service
systemctl --user enable mbsync-linux-fsdevel.timer
systemctl --user start  mbsync-linux-fsdevel.timer
</code></pre>

<h2 id="graphing-mbsync">Graphing mbsync</h2>

<p>So... how did it do?</p>

<p>I currently have enabled 5 mbsync channels, all fetching my email in the background for me. And not a single one goes on puking with OOM. Here&#39;s what life is looking like now:</p>

<p><img src="https://lh3.googleusercontent.com/ANEO4h0p98KWqcm4VTvRFWk83r-KFx8M2CbM1ihcOvFElojSa-u9EuM2z6A_hN3pjfvql6ffVqLmyXzrNeKmUSPuVeNr-HEFpz3Jcb5eREkqY84E6eqttu1GK_Ir6rH7DsZXGwBep2e0Hqr3deqw1fSo1l680wp4PVXC8ah4Mxq2_aAdic46d5LgaGlOr53TjISijVVDvLzMrmq6Ta0i6XXSgD8cjH3Ehj8-1yq8BMjJuWrJjs_Z00oMODhRL8o5Qm6a3JRjHocxV53OqZE4IS2QAFXZxsQ3931C6zlAkGS7mxQ4As36VP4R5WwmiqhqyQtlTSvLcx0NQPKyccpx4uwdT7wbHIHg2GesoDMSRV17jej1kTDityA5oTDN4SzEvStI54WFxyxj0Tj0QpKYsigQte8v2_hxG2uPXEL4_YpdvjSR84awkJNOV2tE23A3u3hYAa0tsj1oTUXyN63bvD0QPrPJ_JcFNjkBB-qV78TfKuClddnFNSk8qh24fohlRWHVkTD_E2N6Ayh7pgdLKawarKIHOFpXL8Sg5xLGl6YvXPSC_oDF1S_h-jF6mylYPXbpl3qZ6J1F6usvSShFYIH383TVt4qzAsvP-Vyu6OOxOxzn_MfrHSfnqn199UkpYvnbavbmaRoetM8porBccFcmrEFNrJa7atgm8ODaKeRNItWl8OX92ecmjSb1yhQC0_pHTkIMxeIVCiAB58F6xrhhSw=w2024-h1723-no" alt="mbsync hourly"></p>

<p>Peachy.</p>

<h1 id="long-term-ideals">Long term ideals</h1>

<p>IMAP does the job for email, it just seems utterly <em>stupid</em> for public mailing lists, and I figure we can do much better. This is especially true in light of how much simpler it is for me to follow public code vs. public email threads these days. Keep in mind how much more complicated code management is, compared to the goal of just wanting to get a simple stupid email Message ID onto my local Maildir directory. I <em>really</em> had my hopes on <a href="https://public-inbox.org/" rel="nofollow">public-inbox</a>, but after looking into it, it seems clear now that its main objectives are archiving — not local storage / MUA use. For details refer to this <a href="https://lore.kernel.org/lkml/20190307034453.pbhmllhmsjy6kvtc@dcvr/" rel="nofollow">linux-kernel discussion on public-inbox with a MUA focus</a>.</p>

<p>If the issue with using public-inbox for local MUA usage was that the archive was too big... it seems sensible to me to evaluate an even smaller epoch size, and to default clients to fetch only <em>one</em> epoch, the latest. That alone wouldn&#39;t solve the issue, though: the way Maildir stores its data files makes it nearly incompatible with git. A proper evaluation of using mbox would be in order.</p>

<p>The social lubricant is out on the idea, though, and I&#39;m hopeful a proper, simple git mail solution for public emails is bound to find us soon.</p>
]]></content:encoded>
      <author>mcgrof</author>
      <guid>https://people.kernel.org/read/a/1rua7sg4z8</guid>
      <pubDate>Tue, 18 Jun 2019 04:24:22 +0000</pubDate>
    </item>
    <item>
      <title>Linux stable tree mirror at github</title>
      <link>https://people.kernel.org/gregkh/linux-stable-tree-mirror-at-github</link>
      <description>&lt;![CDATA[As everyone seems to like to put kernel trees up on github for random projects (based on the crazy notifications I get all the time), I figured it was time to put up a &#34;semi-official&#34; mirror of all of the stable kernel releases on github.com&#xA;&#xA;It can be found at:&#xA;  https://github.com/gregkh/linux&#xA;&#xA;It differs from Linus&#39;s tree at:&#xA;  https://github.com/torvalds/linux&#xA;in that it contains all of the different stable tree branches and stable releases and tags, which many devices end up building on top of.&#xA;&#xA;So, mirror away!&#xA;&#xA;Also note, this is a read-only mirror, any pull requests created on it will be gleefully ignored, just like happens on Linus&#39;s github mirror.]]&gt;</description>
      <content:encoded><![CDATA[<p>As everyone seems to like to put kernel trees up on github for random projects (based on the crazy notifications I get all the time), I figured it was time to put up a “semi-official” mirror of all of the stable kernel releases on github.com</p>

<p>It can be found at:
  <a href="https://github.com/gregkh/linux" rel="nofollow">https://github.com/gregkh/linux</a></p>

<p>It differs from Linus&#39;s tree at:
  <a href="https://github.com/torvalds/linux" rel="nofollow">https://github.com/torvalds/linux</a>
in that it contains all of the different stable tree branches and stable releases and tags, which many devices end up building on top of.</p>

<p>So, mirror away!</p>

<p>Also note: this is a read-only mirror; any pull requests created on it will be gleefully ignored, just like what happens on Linus&#39;s github mirror.</p>
]]></content:encoded>
      <author>Greg Kroah-Hartman</author>
      <guid>https://people.kernel.org/read/a/vq7e48b1on</guid>
      <pubDate>Sat, 15 Jun 2019 15:18:58 +0000</pubDate>
    </item>
    <item>
      <title>How many kinds of USB-C™ to USB-C™ cables are there?</title>
      <link>https://people.kernel.org/bleung/how-many-kinds-of-usb-c-to-usb-c-cables-are-there</link>
      <description>&lt;![CDATA[tl;dr: There are 6, it&#39;s unfortunately very confusing to the end user.&#xA;&#xA;Classic USB from the 1.1, 2.0, to 3.0 generations using USB-A and USB-B connectors have a really nice property in that cables were directional and plugs and receptacles were physically distinct to specify a different capability. A USB 3.0 capable USB-B plug was physically larger than a 2.0 plug and would not fit into a USB 2.0-only receptacle. For the end user, this meant that as long as they have a cable that would physically connect to both the host and the device, the system would function properly, as there is only ever one kind of cable that goes from one A plug to a particular flavor of B plug.&#xA;&#xA;Does the same hold for USB-C™?&#xA;&#xA;Sadly, the answer is no. Cables with a USB-C plug on both ends (C-to-C), hitherto referred to as &#34;USB-C cables&#34;, come in several varieties. Here they are, current as of the USB Type-C™ Specification 1.4 on June 2019:&#xA;&#xA;USB 2.0 rated at 3A&#xA;USB 2.0 rated at 5A&#xA;USB 3.2 Gen 1 (5gbps) rated at 3A&#xA;USB 3.2 Gen 1 (5gbps) rated at 5A&#xA;USB 3.2 Gen 2 (10gbps) rated at 3A&#xA;USB 3.2 Gen 2 (10gpbs) rated at 5A&#xA;&#xA;We have a matrix of 2 x 3, with 2 current rating levels (3A max current, or 5A max current), and 3 data speeds (480mbps, 5gbps, 10gpbs).&#xA;&#xA;Adding a bit more detail, cables 3-6, in fact, have 10 more wires that connect end-to-end compared to the USB 2.0 ones in order to handle SuperSpeed data rates. Cables 3-6 are called &#34;Full-Featured Type-C Cables&#34; in the spec, and the extra wires are actually required for more than just faster data speeds.&#xA;&#xA;&#34;Full-Featured Type-C Cables&#34; are required for the most common USB-C Alternate Mode used on PCs and many phones today, VESA DisplayPort Alternate Mode. 
VESA DP Alt mode requires most of the 10 extra wires present in a Full-Featured USB-C cable.&#xA;&#xA;My new Pixelbook, for example, does not have a dedicated physical DP or HDMI port and relies on VESA DP Alt Mode in order to connect to any monitor. Brand new monitors and docking stations may have a USB-C receptacle in order to allow for a DisplayPort, power and USB connection to the laptop.&#xA;&#xA;Suddenly, with a USB-C receptacle on both the host and the device (the monitor), and a range of 6 possible USB-C cables, the user may encounter a pitfall: They may try to use the USB 2.0 cable that came with their laptop with the display and the display doesn&#39;t work, despite the plugs fitting on both sides because 10 wires aren&#39;t there.&#xA;&#xA;Why did it come to this? This problem was created because the USB-C connectors were designed to replace all of the previous USB connectors at the same time as vastly increasing what the cable could do in power, data, and display dimensions. The new connector may be and virtually impossible to plug in improperly (no USB superposition problem, no grabbing the wrong end of the cable), but sacrificed for that simplicity is the ability to intuitively know whether the system you&#39;ve connected together has all of the functionality possible. The USB spec also cannot simply mandate that all USB-C cables have the maximum number of wires all the time because that would vastly increase BOM cost for cases where the cable is just used for charging primarily.&#xA;&#xA;How can we fix this? Unfortunately, it&#39;s a tough problem that has to involve user education. USB-C cables are mandated by USB-IF to bear a particular logo in order to be certified:&#xA;&#xA;Image&#xA;&#xA;Collectively, we have to teach users that if they need DisplayPort to work, they need to find cables with the two logos on the right.&#xA;&#xA;Technically, there is something that software can do to help the education problem. 
Cables 2-6 are required by the USB specification to include an electronic marker chip which contains vital information about the cable. The host should be able to read that eMarker, and identify what its data and power capabilities are. If the host sees that the user is attempting to use DisplayPort Alternate Mode with the wrong cable, rather than a silent failure (ie, the external display doesn&#39;t light up), the OS should tell the user via a notification they may be using the wrong cable, and educate the user about cables with the right logo.&#xA;&#xA;This is something that my team is actively working on, and I hope to be able to show the kernel pieces necessary soon.]]&gt;</description>
      <content:encoded><![CDATA[<p><em>tl;dr</em>: There are 6, it&#39;s unfortunately very confusing to the end user.</p>

<p>Classic USB, from the 1.1 and 2.0 through the 3.0 generations using USB-A and USB-B connectors, had a really nice property: cables were directional, and plugs and receptacles were physically distinct to indicate different capabilities. A USB 3.0 capable USB-B plug was physically larger than a 2.0 plug and would not fit into a USB 2.0-only receptacle. For the end user, this meant that as long as a cable physically connected to both the host and the device, the system would function properly, as there was only ever one kind of cable going from an A plug to a particular flavor of B plug.</p>

<p>Does the same hold for USB-C™?</p>

<p>Sadly, the answer is no. Cables with a USB-C plug on both ends (C-to-C), hitherto referred to as “USB-C cables”, come in several varieties. Here they are, current as of the USB Type-C™ Specification 1.4 on June 2019:</p>
<ol><li>USB 2.0 rated at 3A</li>
<li>USB 2.0 rated at 5A</li>
<li>USB 3.2 Gen 1 (5gbps) rated at 3A</li>
<li>USB 3.2 Gen 1 (5gbps) rated at 5A</li>
<li>USB 3.2 Gen 2 (10gbps) rated at 3A</li>
<li>USB 3.2 Gen 2 (10gbps) rated at 5A</li></ol>

<p>We have a matrix of 2 x 3, with 2 current rating levels (3A max current, or 5A max current), and 3 data speeds (480mbps, 5gbps, 10gbps).</p>

<p>Adding a bit more detail, cables 3-6, in fact, have 10 more wires that connect end-to-end compared to the USB 2.0 ones in order to handle SuperSpeed data rates. Cables 3-6 are called “Full-Featured Type-C Cables” in the spec, and the extra wires are actually required for more than just faster data speeds.</p>

<p>“Full-Featured Type-C Cables” are required for the most common USB-C <em>Alternate Mode</em> used on PCs and many phones today, <em>VESA DisplayPort Alternate Mode</em>. VESA DP Alt mode requires most of the 10 extra wires present in a Full-Featured USB-C cable.</p>

<p>My new Pixelbook, for example, does not have a dedicated physical DP or HDMI port and relies on VESA DP Alt Mode in order to connect to any monitor. Brand new monitors and docking stations may have a USB-C receptacle in order to allow for a DisplayPort, power and USB connection to the laptop.</p>

<p>Suddenly, with a USB-C receptacle on both the host and the device (the monitor), and a range of 6 possible USB-C cables, the user may encounter a pitfall: They may try to use the USB 2.0 cable that came with their laptop with the display and the display doesn&#39;t work, despite the plugs fitting on both sides because 10 wires aren&#39;t there.</p>
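
<p>To make the pitfall concrete, the cable matrix above can be sketched as a tiny lookup (a hypothetical helper, not any real tool; the variety names here are made up): only the four Full-Featured varieties have the SuperSpeed wires that DP Alt Mode needs.</p>

<pre><code class="language-bash"># Hypothetical sketch: which of the 6 USB-C cable varieties can
# carry DP Alt Mode? Only "Full-Featured" (USB 3.2) cables have
# the 10 extra end-to-end wires.
supports_dp_alt_mode() {
    case "$1" in
        usb2-*)      echo "no"  ;;  # USB 2.0 cables lack the SuperSpeed wires
        usb32gen?-*) echo "yes" ;;  # Full-Featured Type-C cables
        *)           echo "unknown" ;;
    esac
}

supports_dp_alt_mode "usb2-5a"       # prints "no": a charging-oriented cable
supports_dp_alt_mode "usb32gen2-3a"  # prints "yes": a Full-Featured cable
</code></pre>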

<p>Why did it come to this? This problem was created because the USB-C connectors were designed to replace all of the previous USB connectors at the same time as vastly increasing what the cable could do in power, data, and display dimensions. The new connector is reversible and virtually impossible to plug in improperly (no USB superposition problem, no grabbing the wrong end of the cable), but sacrificed for that simplicity is the ability to intuitively know whether the system you&#39;ve connected together has all of the functionality possible. The USB spec also cannot simply mandate that all USB-C cables have the maximum number of wires all the time, because that would vastly increase BOM cost for cables primarily used just for charging.</p>

<p>How can we fix this? Unfortunately, it&#39;s a tough problem that has to involve user education. USB-C cables are mandated by USB-IF to bear a particular logo in order to be certified:</p>

<p><img src="https://www.applegazette.com/wp-content/uploads/2018/05/3644.USB-C-image-2.png" alt="Image"></p>

<p>Collectively, we have to teach users that if they need DisplayPort to work, they need to find cables with the two logos on the right.</p>

<p>Technically, there is something that software can do to help the education problem. Cables 2-6 are required by the USB specification to include an electronic marker chip which contains vital information about the cable. The host should be able to read that eMarker and identify its data and power capabilities. If the host sees that the user is attempting to use DisplayPort Alternate Mode with the wrong cable, then rather than a silent failure (i.e., the external display doesn&#39;t light up), the OS should tell the user via a notification that they may be using the wrong cable, and educate the user about cables with the right logo.</p>

<p>This is something that my team is actively working on, and I hope to be able to show the kernel pieces necessary soon.</p>
]]></content:encoded>
      <author>Benson Leung</author>
      <guid>https://people.kernel.org/read/a/ngzg4ilp2h</guid>
      <pubDate>Fri, 14 Jun 2019 22:47:12 +0000</pubDate>
    </item>
    <item>
      <title>My experiences upgrading to Fedora 30</title>
      <link>https://people.kernel.org/mchehab/my-experiences-upgrading-to-fedora-30</link>
      <description>&lt;![CDATA[Having a certain number of machines here with Fedora, I started working on April, 30 with the migration of those to use Fedora’s latest version: Fedora 30.&#xA;&#xA;  Note: this is a re-post of a blog entry I wrote back on May, 1st: https://linuxkernel.home.blog/2019/05/01/fedora-30-installation/ with one update at the end made on Jun, 26.&#xA;&#xA;First machine: a multi-monitor desktop&#xA;&#xA;I started the migration on a machine with multiple monitors connected on it. Originally, when Fedora was installed on it, the GPU Kernel driver for the chipset (called DRM KMS – Kernel ModeSet) was not available yet at Fedora’s Kernel. So, Fedora installer (Anaconda) added a nomodeset option to the Kernel parameters.&#xA;&#xA;As there was KMS support was just arriving upstream, I built my own Kernel on that time and removed the nomodeset option.&#xA;&#xA;By the time I did the upgrade, maybe except for the rescue mode, all Kernels were using KMS.&#xA;&#xA;I did the upgrade the same way I did in the past (as described here), e. g. by calling:&#xA;dnf system-upgrade --release 30 --allowerasing download&#xA;dnf system-upgrade reboot&#xA;The system-upgrade had to remove pgp-tools, with currently has a broken dependency, and eclipse. The last one was due to the fact that, on Fedora 29, I was with modular support enabled, with made it depend on a Java modular set of packages.&#xA;&#xA;After booting the Kernel, I had the first problem with the upgrade: Fedora now uses BootLoaderSpec - BLS by default, converting the old grub.cfg file to the new BLS mode. Well, the conversion simply re-added the nomodeset option to all Kernels, causing it to disable the extra monitors, as X11/Wayland would need to setup the video mode via the old way. 
On that time, I wasn’t aware of BLS, so I just ran this command:&#xA;cd /boot/efi/EFI/fedora/ &amp;&amp; cp grub.cfg.rpmsave grub.cfg&#xA;In order to restore the working grub.cfg file.&#xA;&#xA;Later, in order to avoid further problems on Kernel upgrades, I installed grubby-deprecated, as recommended at https://fedoraproject.org/wiki/Changes/BootLoaderSpecByDefault#Upgrade.2Fcompatibilityimpact, and manually edited /etc/default/grub in order to comment out the line with GRUBENABLE_BLSCFG. I probably could just fix the BLS setup instead, but I opted to be conservative here.&#xA;&#xA;After that, I worked to re-install eclipse. For that, I had to disable modular support, as eclipse depends on an ant package version that was not there yet inside Fedora modular repositories by the time I did the upgrade.&#xA;&#xA;In summary, my first install didn’t went smoothly.&#xA;&#xA;Second machine: a laptop&#xA;&#xA;At the second machine, I ran the same dnf system-upgrade commands as did at the first machine. As this laptop had a Fedora 29 installed last month from scratch, I was expecting a better luck.&#xA;&#xA;Guess what…&#xA;&#xA;… it ended to be an even worse upgrade… machine crashed after boot!&#xA;&#xA;Basically, systemd doesn’t want to mount a rootfs image if it doesn’t contain a valid file at /usr/lib/os-release. On Fedora 29, this is a soft link to another file inside /usr/lib/os.release.d. The specific file name depends if you installed Fedora Workstation, Fedora Server, …&#xA;&#xA;During the upgrade, the directory /usr/lib/os.release.d got removed, causing the soft link to point to nowhere. Due to that, after boot, systemd crashes the machine with a “brilliant” message, saying that it was generating a rdsosreport.txt, crowded of information that one would need to copy to some place else in order to analyze. 
Well, as it didn’t mount the rootfs, copying it would be tricky, without network nor the usual commands found at /bin and /sbin directories.&#xA;&#xA;So, instead, I just looked at the journal file, where it said that the failure was at /lib/systemd/system/initrd-switch-root.service. That basically calls systemctl, asking it to switch the rootfs to /sysroot (with is the root filesystem as listed at /etc/fstab). Well, systemctl checks if it recognizes os-release. If not, instead of mounting it, producing a warning and hoping for the best, it simply crashes the system!&#xA;&#xA;In order to fix it, I had to use vi to manually create a Fedora 30 release. Thankfully, I had already a valid os-release from my first upgraded machine. So, I just manually typed it.&#xA;&#xA;After that, the system booted smoothly.&#xA;&#xA;Other machines&#xA;&#xA;Knowing that Fedora 30 install was not trivial, I decided to go one step back, learning from my past mistakes.&#xA;&#xA;So, I decided to write a small “script” with the steps to be done for the upgrade. Instead of running it as a script, you may instead run it line by line (after the set -e line). Here it is:&#xA;/bin/bash&#xA;&#xA;should run as root&#xA;&#xA;If one runs it as a script, makes it abort on errors&#xA;set -e&#xA;&#xA;dnf config-manager --set-disabled fedora-modular&#xA;dnf config-manager --set-disabled updates-modular&#xA;dnf config-manager --set-disabled updates-testing-modular&#xA;dnf distro-sync&#xA;dnf upgrade --refresh&#xA;(cd /usr/lib/ &amp;&amp; cp $(readlink -f os-release) /tmp/os-release &amp;&amp; rm os-release &amp;&amp; cp /tmp/os-release os-release)&#xA;dnf system-upgrade --release 30 --allowerasing download&#xA;dnf system-upgrade reboot&#xA;Please notice that the scripts will removes os-release and copies the one from the linked file. 
Please check if it went well, as if the logic fails, you may end crashing your machine at the next boot.&#xA;&#xA;Also, please notice that it will disable Fedora modular support. Well, I don’t need anything there, so it works pretty fine for me.&#xA;&#xA;Post-install steps&#xA;&#xA;Please notice that, after an upgrade, Fedora may re-enable Fedora modular. That happened to me on one machine with had Fedora 26. If you don&#39;t want to keep it enabled, you should do:&#xA;dnf config-manager --set-disabled fedora-modular&#xA;dnf config-manager --set-disabled updates-modular&#xA;dnf config-manager --set-disabled updates-testing-modular&#xA;dnf distro-sync&#xA;&#xA;Results&#xA;&#xA;I repeated the same procedure on several other machines, one being a Fedora Server, using the above scripts. On all, it went smoothly.&#xA;]]&gt;</description>
      <content:encoded><![CDATA[<p>Having a certain number of machines here with Fedora, I started working on April, 30 with the migration of those to use Fedora’s latest version: Fedora 30.</p>

<blockquote><p>Note: this is a re-post of a blog entry I wrote back on May 1st: <a href="https://linuxkernel.home.blog/2019/05/01/fedora-30-installation/" rel="nofollow">https://linuxkernel.home.blog/2019/05/01/fedora-30-installation/</a>, with one update at the end made on June 26.</p></blockquote>

<h2 id="first-machine-a-multi-monitor-desktop">First machine: a multi-monitor desktop</h2>

<p>I started the migration on a machine with multiple monitors connected to it. Originally, when Fedora was installed on it, the GPU Kernel driver for the chipset (called DRM KMS – Kernel ModeSet) was not yet available in Fedora’s Kernel, so the Fedora installer (Anaconda) added a <code>nomodeset</code> option to the Kernel parameters.</p>

<p>As KMS support was just arriving upstream at that time, I built my own Kernel and removed the <code>nomodeset</code> option.</p>

<p>By the time I did the upgrade, all Kernels, except maybe the rescue one, were using KMS.</p>

<p>I did the upgrade the same way I did in the past (as described <a href="https://fedoramagazine.org/upgrading-fedora-29-to-fedora-30/" rel="nofollow">here</a>), i.e. by calling:</p>

<pre><code>dnf system-upgrade --release 30 --allowerasing download
dnf system-upgrade reboot
</code></pre>

<p>The system-upgrade had to remove <code>pgp-tools</code>, which currently has a broken dependency, and <code>eclipse</code>. The latter was because, on Fedora 29, I had modular support enabled, which made eclipse depend on a modular set of Java packages.</p>

<p>After booting the Kernel, I had the first problem with the upgrade: Fedora now uses <a href="https://www.freedesktop.org/wiki/Specifications/BootLoaderSpec/" rel="nofollow">BootLoaderSpec – BLS</a> by default, converting the old <code>grub.cfg</code> file to the new BLS mode. Well, the conversion simply re-added the <code>nomodeset</code> option to all Kernels, causing it to disable the extra monitors, as X11/Wayland would then need to set up the video mode the old way. At that time, I wasn’t aware of BLS, so I just ran this command:</p>

<pre><code>cd /boot/efi/EFI/fedora/ &amp;&amp; cp grub.cfg.rpmsave grub.cfg
</code></pre>

<p>in order to restore the working <code>grub.cfg</code> file.</p>

<p>Later, in order to avoid further problems on Kernel upgrades, I installed <code>grubby-deprecated</code>, as recommended at <a href="https://fedoraproject.org/wiki/Changes/BootLoaderSpecByDefault#Upgrade.2Fcompatibility_impact" rel="nofollow">https://fedoraproject.org/wiki/Changes/BootLoaderSpecByDefault#Upgrade.2Fcompatibility_impact</a>, and manually edited <code>/etc/default/grub</code> in order to comment out the line with <code>GRUB_ENABLE_BLSCFG</code>. I probably could just fix the BLS setup instead, but I opted to be conservative here.</p>

<p>After that, I worked to re-install <code>eclipse</code>. For that, I had to disable modular support, as eclipse depends on an <code>ant</code> package version that was not yet in the Fedora modular repositories at the time I did the upgrade.</p>

<p>In summary, my first install didn’t go smoothly.</p>

<h2 id="second-machine-a-laptop">Second machine: a laptop</h2>

<p>On the second machine, I ran the same <code>dnf system-upgrade</code> commands as on the first. As this laptop had had Fedora 29 installed from scratch just a month earlier, I was expecting better luck.</p>

<p>Guess what…</p>

<p>… it turned out to be an even worse upgrade… the machine crashed after boot!</p>

<p>Basically, systemd doesn’t want to mount a <code>rootfs</code> image if it doesn’t contain a valid file at <code>/usr/lib/os-release</code>. On Fedora 29, this is a soft link to another file inside <code>/usr/lib/os.release.d</code>. The specific file name depends on whether you installed Fedora Workstation, Fedora Server, …</p>

<p>During the upgrade, the directory <code>/usr/lib/os.release.d</code> got removed, causing the soft link to point to nowhere. Due to that, after boot, systemd crashes the machine with a “brilliant” message saying that it was generating an <code>rdsosreport.txt</code>, crowded with information that one would need to copy somewhere else in order to analyze. Well, as it didn’t mount the rootfs, copying it would be tricky: no network, and none of the usual commands found in <code>/bin</code> and <code>/sbin</code>.</p>

<p>So, instead, I just looked at the journal file, which said that the failure was at <code>/lib/systemd/system/initrd-switch-root.service</code>. That basically calls <code>systemctl</code>, asking it to switch the <code>rootfs</code> to <code>/sysroot</code> (which is the root filesystem as listed in <code>/etc/fstab</code>). Well, <code>systemctl</code> checks whether it recognizes <code>os-release</code>. If not, instead of mounting it anyway, producing a warning and hoping for the best, it simply crashes the system!</p>

<p>In order to fix it, I had to use vi to manually create a Fedora 30 <code>os-release</code> file. Thankfully, I already had a valid <code>os-release</code> on my first upgraded machine, so I just typed it in manually.</p>
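
<p>For reference, the file I recreated looked roughly like this (abridged, from memory; the exact fields vary between Fedora editions, so copy it from a working Fedora 30 install if you can):</p>

<pre><code>NAME=Fedora
VERSION="30 (Workstation Edition)"
ID=fedora
VERSION_ID=30
PRETTY_NAME="Fedora 30 (Workstation Edition)"
</code></pre>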

<p>After that, the system booted smoothly.</p>

<h2 id="other-machines">Other machines</h2>

<p>Knowing by now that a Fedora 30 upgrade was not trivial, I decided to take a step back, learning from my past mistakes.</p>

<p>So, I decided to write a small “script” with the steps to be done for the upgrade. Instead of running it as a script, you may instead run it line by line (after the <code>set -e</code> line). Here it is:</p>

<pre><code>#!/bin/bash

# Should be run as root

# If one runs it as a script, this makes it abort on errors
set -e

dnf config-manager --set-disabled fedora-modular
dnf config-manager --set-disabled updates-modular
dnf config-manager --set-disabled updates-testing-modular
dnf distro-sync
dnf upgrade --refresh
(cd /usr/lib/ &amp;&amp; cp $(readlink -f os-release) /tmp/os-release &amp;&amp; rm os-release &amp;&amp; cp /tmp/os-release os-release)
dnf system-upgrade --release 30 --allowerasing download
dnf system-upgrade reboot
</code></pre>

<p>Please notice that the script removes <code>os-release</code> and replaces it with a copy of the file it linked to. Please check that this went well: if the logic fails, you may end up crashing your machine at the next boot.</p>

<p>Also, please notice that it disables Fedora modular support. Well, I don’t need anything from there, so that works fine for me.</p>

<h2 id="post-install-steps">Post-install steps</h2>

<p>Please notice that, after an upgrade, Fedora may re-enable the modular repositories. That happened to me on one machine which had Fedora 26. If you don&#39;t want to keep them enabled, you should do:</p>

<pre><code>dnf config-manager --set-disabled fedora-modular
dnf config-manager --set-disabled updates-modular
dnf config-manager --set-disabled updates-testing-modular
dnf distro-sync
</code></pre>

<h2 id="results">Results</h2>

<p>I repeated the same procedure on several other machines, one of them a Fedora Server, using the above script. On all of them, it went smoothly.</p>
]]></content:encoded>
      <author>Mauro Carvalho Chehab</author>
      <guid>https://people.kernel.org/read/a/5qcxei3xoq</guid>
      <pubDate>Wed, 01 May 2019 09:52:15 +0000</pubDate>
    </item>
  </channel>
</rss>