people.kernel.org


from Valentin Schneider

https://lore.kernel.org/all/20230612093537.614161713@infradead.org/T/

Intro

As I was catching up with the scheduler's “change pattern” sched_change patches, I figured it was time I got up to speed with the guard zoology.

The first part of this post is a code exploration with some of my own musings; the second part is a TL;DR covering what each helper does and when you should use them (*).

(*) according to my own understanding, provided as-is without warranties of any kind, batteries not included.

It's __cleanup__ all the way down

The docstring for __cleanup kindly points us to the relevant gcc/clang documentation. Given I don't really speak clang, here's the relevant GCC bit:

https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-cleanup-variable-attribute

cleanup (cleanup_function)
    The cleanup attribute runs a function when the variable goes out of
    scope. This attribute can only be applied to auto function scope variables;
    it may not be applied to parameters or variables with static storage
    duration. The function must take one parameter, a pointer to a type
    compatible with the variable. The return value of the function (if any) is
    ignored.

    When multiple variables in the same scope have cleanup attributes, at exit
    from the scope their associated cleanup functions are run in reverse order
    of definition (last defined, first cleanup).

So we get to write a function that takes a pointer to the variable and does cleanup for it whenever the variable goes out of scope. Neat.
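
Before we get to the kernel's wrappers, here is a minimal userspace sketch of the raw attribute in action. The cleanup_fclose() and read_first_line() names are made up for illustration; the point is simply that the compiler calls the cleanup function with a pointer to fp on every path out of the scope.

#include <stdio.h>

static void cleanup_fclose(FILE **fpp)
{
	if (*fpp)
		fclose(*fpp);
}

static int read_first_line(const char *path, char *buf, int len)
{
	FILE *fp __attribute__((__cleanup__(cleanup_fclose))) = fopen(path, "r");

	if (!fp)
		return -1;

	if (!fgets(buf, len, fp))
		return -1;

	/* fp is fclose()'d automatically on every return path */
	return 0;
}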

DEFINE_FREE

That's the first one we'll meet in include/linux/cleanup.h and the most straightforward.

 * DEFINE_FREE(name, type, free):
 *	simple helper macro that defines the required wrapper for a __free()
 *	based cleanup function. @free is an expression using '_T' to access the
 *	variable. @free should typically include a NULL test before calling a
 *	function, see the example below.

Long story short, that's a __cleanup variable definition with some extra sprinkles on top:

#define __cleanup(func)			__attribute__((__cleanup__(func)))
#define __free(_name)	__cleanup(__free_##_name)

#define DEFINE_FREE(_name, _type, _free) \
	static __always_inline void __free_##_name(void *p) { _type _T = *(_type *)p; _free; }

So we can e.g. define a kfree() cleanup type and stick that onto any kmalloc()'d variable to get automagic cleanup without any goto's. Some languages call that a smart pointer.

DEFINE_FREE(kfree, void *, if (_T) kfree(_T))

void *alloc_obj(...)
{
     struct obj *p __free(kfree) = kmalloc(...);
     if (!p)
	return NULL;

     if (!init_obj(p))
	return NULL;

     return_ptr(p); // This does a pointer shuffle to prevent the kfree() from happening
}

I won't get into the return_ptr() faff, but if you have a look at it and wonder what's going on, it's mostly going to be because of having to do the shuffle with no double evaluation. This is relevant: https://lore.kernel.org/lkml/CAHk-=wiOXePAqytCk6JuiP6MeePL6ksDYptE54hmztiGLYihjA@mail.gmail.com/
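
If you do want a rough mental model anyway: read the pointer once, NULL out the variable so the __cleanup handler has nothing to free, and hand the original value back to the caller. A simplified sketch of that idea follows (the my_-prefixed names are mine; the real macros in include/linux/cleanup.h add __must_check plumbing and deal with the double-evaluation concerns from the thread above):

#define my_no_free_ptr(p)			\
	({					\
		__auto_type __ptr = &(p);	\
		__auto_type __val = *__ptr;	\
		*__ptr = NULL;			\
		__val;				\
	})

#define my_return_ptr(p)	return my_no_free_ptr(p)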

DEFINE_CLASS

This one is pretty much going to be DEFINE_FREE() but with an added quality of life feature in the form of a constructor:

 * DEFINE_CLASS(name, type, exit, init, init_args...):
 *	helper to define the destructor and constructor for a type.
 *	@exit is an expression using '_T' -- similar to FREE above.
 *	@init is an expression in @init_args resulting in @type
#define DEFINE_CLASS(_name, _type, _exit, _init, _init_args...)		\
typedef _type class_##_name##_t;					\
static __always_inline void class_##_name##_destructor(_type *p)	\
{ _type _T = *p; _exit; }						\
static __always_inline _type class_##_name##_constructor(_init_args)	\
{ _type t = _init; return t; }

#define CLASS(_name, var)						\
	class_##_name##_t var __cleanup(class_##_name##_destructor) =	\
		class_##_name##_constructor

You'll note that yes, it can be expressed purely as a DEFINE_FREE(), but it saves us from a bit of repetition, and will enable us to craft stuff involving locks later on:

DEFINE_CLASS(fdget, struct fd, fdput(_T), fdget(fd), int fd)
void foo(void)
{
	fd = ...;
	CLASS(fdget, f)(fd);
	if (fd_empty(f))
		return -EBADF;

	// use 'f' without concern
}

DEFINE_FREE(fdput, struct fd *, if (_T) fdput(_T))
void foo(void)
{
	fd = ...;
	struct fd *f __free(fdput) = fdget(fd);
	if (fd_empty(f))
		return -EBADF;

	// use 'f' without concern
}

Futex_hash_bucket case

For a more complete example:

DEFINE_CLASS(hb, struct futex_hash_bucket *,
	     if (_T) futex_hash_put(_T),
	     futex_hash(key), union futex_key *key);

int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
	union futex_key key = FUTEX_KEY_INIT;
	DEFINE_WAKE_Q(wake_q);
	int ret;

	ret = get_futex_key(uaddr, flags, &key, FUTEX_READ);
	if (unlikely(ret != 0))
		return ret;

	CLASS(hb, hb)(&key);

	/* Make sure we really have tasks to wakeup */
	if (!futex_hb_waiters_pending(hb))
		return ret;

	spin_lock(&hb->lock);

	/* ... */
}

Using gcc -E to stop compilation after the preprocessor has expanded all of our fancy macros (*), the resulting code is fairly readable modulo the typedef:

typedef struct futex_hash_bucket * class_hb_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void
class_hb_destructor(struct futex_hash_bucket * *p)
{
	struct futex_hash_bucket * _T = *p;
	if (_T)
		futex_hash_put(_T);
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
struct futex_hash_bucket *
class_hb_constructor(union futex_key *key)
{
	struct futex_hash_bucket * t = futex_hash(key);
	return t;
}
int futex_wake(u32 *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
	struct futex_q *this, *next;
	union futex_key key = (union futex_key) { .both = { .ptr = 0ULL } };
	struct wake_q_head wake_q = { ((struct wake_q_node *) 0x01), &wake_q.first };
	int ret;

	if (!bitset)
		return -22;

	ret = get_futex_key(uaddr, flags, &key, FUTEX_READ);
	if (__builtin_expect(!!(ret != 0), 0))
		return ret;

	if ((flags & 0x0100) && !nr_wake)
		return 0;

	class_hb_t hb __attribute__((__cleanup__(class_hb_destructor))) = class_hb_constructor(&key);


	if (!futex_hb_waiters_pending(hb))
		return ret;

	spin_lock(&hb->lock);

	/* ... */
}

(*) I use make V=1 on the file I want to expand, copy the big command producing the .o, ditch the -Wp,-MMD,**.o.d part and add a -E to it.

DEFINE_GUARD

For now, ignore the CONDITIONAL and LOCK_PTR stuff; it's only relevant to the scoped & conditional guards, which we'll get to later.

#define DEFINE_CLASS_IS_GUARD(_name) \
	__DEFINE_CLASS_IS_CONDITIONAL(_name, false); \
	__DEFINE_GUARD_LOCK_PTR(_name, _T)

#define DEFINE_GUARD(_name, _type, _lock, _unlock) \
	DEFINE_CLASS(_name, _type, if (!__GUARD_IS_ERR(_T)) { _unlock; }, ({ _lock; _T; }), _type _T); \
	DEFINE_CLASS_IS_GUARD(_name)

#define guard(_name) \
	CLASS(_name, __UNIQUE_ID(guard))

So it's a CLASS with a constructor and destructor, but the added bonus is the automagic __cleanup variable definition.

Why is that relevant? Well, consider locks. You don't declare a variable for a lock acquisition & release, you manipulate an already-allocated object (e.g. a mutex). However, no variable declaration means no __cleanup. So this just declares a variable to slap __cleanup onto it and have an automagic out-of-scope cleanup callback.

Let's have a look at an example in the thermal subsystem with a mutex critical section:

DEFINE_GUARD(cooling_dev, struct thermal_cooling_device *, mutex_lock(&_T->lock),
	     mutex_unlock(&_T->lock))

static ssize_t
cur_state_store(struct device *dev, struct device_attribute *attr,
		const char *buf, size_t count)
{
	struct thermal_cooling_device *cdev = to_cooling_device(dev);
	unsigned long state;
	int result;
	/* ... */
	if (state > cdev->max_state)
		return -EINVAL;

	guard(cooling_dev)(cdev);

	result = cdev->ops->set_cur_state(cdev, state);
	if (result)
		return result;

	/* ... */
}

The preprocessor output looks like so:

typedef struct thermal_cooling_device * class_cooling_dev_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__)) void
class_cooling_dev_destructor(struct thermal_cooling_device * *p)
{
	struct thermal_cooling_device * _T = *p;
	if (!({
				unsigned long _rc = (unsigned long)(_T);
				__builtin_expect(!!((_rc - 1) >= -4095 - 1), 0);
			})) {
		mutex_unlock(&_T->lock);
	};
}
static inline __attribute__((__gnu_inline__))
__attribute__((__unused__)) __attribute__((no_instrument_function))
__attribute__((__always_inline__)) struct thermal_cooling_device *
class_cooling_dev_constructor(struct thermal_cooling_device * _T)
{
	struct thermal_cooling_device * t =
		({ mutex_lock(&_T->lock); _T; });
	return t;
}

static __attribute__((__unused__)) const bool class_cooling_dev_is_conditional = false;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__)) void *
class_cooling_dev_lock_ptr(class_cooling_dev_t *_T)
{
	void *_ptr = (void *)(unsigned long)*(_T);
	if (IS_ERR(_ptr)) {
		_ptr = ((void *)0);
	}
	return _ptr;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__)) int
class_cooling_dev_lock_err(class_cooling_dev_t *_T)
{
	long _rc = (unsigned long)*(_T);
	if (!_rc) {
		_rc = -16;
	}
	if (!__builtin_expect(!!((unsigned long)(void *)(_rc) >= (unsigned long)-4095), 0)) {
		_rc = 0;
	}
	return _rc;
}
static ssize_t
cur_state_store(struct device *dev, struct device_attribute *attr,
		const char *buf, size_t count)
{
	struct thermal_cooling_device *cdev = ({ void *__mptr = (void *)(dev); _Static_assert(__builtin_types_compatible_p(typeof(*(dev)), typeof(((struct thermal_cooling_device *)0)->device)) || __builtin_types_compatible_p(typeof(*(dev)), typeof(void)), "pointer type mismatch in container_of()"); ((struct thermal_cooling_device *)(__mptr - __builtin_offsetof(struct thermal_cooling_device, device))); });
	unsigned long state;
	int result;

	if (sscanf(buf, "%ld\n", &state) != 1)
		return -22;

	if ((long)state < 0)
		return -22;

	if (state > cdev->max_state)
		return -22;

	class_cooling_dev_t __UNIQUE_ID_guard_435 __attribute__((__cleanup__(class_cooling_dev_destructor))) = class_cooling_dev_constructor(cdev);

	result = cdev->ops->set_cur_state(cdev, state);
	if (result)
		return result;

	thermal_cooling_device_stats_update(cdev, state);

	return count;
}

DEFINE_LOCK_GUARD

Okay, we have sort-of-smart pointers, classes, guards for locks, what's next? Well, certain locks need more than just a pointer for the lock & unlock operations. For instance, the scheduler's runqueue locks need both a struct rq pointer and a struct rq_flags pointer.

So LOCK_GUARD's are going to be enhanced GUARD's manipulating a composite type instead of a single pointer:

#define __DEFINE_UNLOCK_GUARD(_name, _type, _unlock, ...)		\
typedef struct {							\
	_type *lock;							\
	__VA_ARGS__;							\
} class_##_name##_t;							\

Note that there is also the “no pointer” special case, which is when there is no accessible type for the manipulated lock – think preempt_disable(), migrate_disable(), rcu_read_lock(). Just like for GUARD, we still declare a variable to slap __cleanup onto it.

Let's look at the RCU case:

DEFINE_LOCK_GUARD_0(rcu,
	do {
		rcu_read_lock();
		/*
		 * sparse doesn't call the cleanup function,
		 * so just release immediately and don't track
		 * the context. We don't need to anyway, since
		 * the whole point of the guard is to not need
		 * the explicit unlock.
		 */
		__release(RCU);
	} while (0),
	rcu_read_unlock())
void wake_up_if_idle(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	guard(rcu)();
	if (is_idle_task(rcu_dereference(rq->curr))) {
		// ....
	}
}
The preprocessor output:

static __attribute__((__unused__)) const bool class_rcu_is_conditional = false;
typedef struct {
	void *lock;
} class_rcu_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void
class_rcu_destructor(class_rcu_t *_T)
{
	if (!({
				unsigned long _rc = ( unsigned long)(_T->lock);
				__builtin_expect(!!((_rc - 1) >= -4095 - 1), 0);
			})) {
		rcu_read_unlock();
	}
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void *
class_rcu_lock_ptr(class_rcu_t *_T)
{
	void *_ptr = (void *)( unsigned long)*(&_T->lock);
	if (IS_ERR(_ptr)) {
		_ptr = ((void *)0);
	}
	return _ptr;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
int
class_rcu_lock_err(class_rcu_t *_T)
{
	long _rc = ( unsigned long)*(&_T->lock);
	if (!_rc) {
		_rc = -16;
	}
	if (!__builtin_expect(!!((unsigned long)(void *)(_rc) >= (unsigned long)-4095), 0)) {
		_rc = 0;
	}
	return _rc;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
class_rcu_t
class_rcu_constructor(void)
{
	class_rcu_t _t = { .lock = (void*)1 }, *_T __attribute__((__unused__)) = &_t;
	do {
		rcu_read_lock();
		(void)0; // __release(RCU); just for sparse, see comment in definition
	} while (0);
	return _t;
}
void wake_up_if_idle(int cpu)
{
	struct rq *rq = (&(*({ do { const void __seg_gs *__vpp_verify = (typeof((&(runqueues)) + 0))((void *)0); (void)__vpp_verify; } while (0); ({ unsigned long __ptr; __asm__ ("" : "=r"(__ptr) : "0"((__typeof_unqual__(*((&(runqueues)))) *)(( unsigned long)((&(runqueues)))))); (typeof((__typeof_unqual__(*((&(runqueues)))) *)(( unsigned long)((&(runqueues)))))) (__ptr + (((__per_cpu_offset[((cpu))])))); }); })));
	class_rcu_t __UNIQUE_ID_guard_1486 __attribute__((__cleanup__(class_rcu_destructor))) =
		class_rcu_constructor();

	if (is_idle_task(...)) {
		// ...
	}
}

Let's look at the runqueue lock:

DEFINE_LOCK_GUARD_1(rq_lock_irqsave, struct rq,
		    rq_lock_irqsave(_T->lock, &_T->rf),
		    rq_unlock_irqrestore(_T->lock, &_T->rf),
		    struct rq_flags rf)
static void sched_balance_update_blocked_averages(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	guard(rq_lock_irqsave)(rq);
	update_rq_clock(rq);
	__sched_balance_update_blocked_averages(rq);
}
And again the preprocessor output:

static __attribute__((__unused__)) const bool class_rq_lock_irqsave_is_conditional = false;

typedef struct {
	struct rq *lock;
	struct rq_flags rf;
} class_rq_lock_irqsave_t;

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void
class_rq_lock_irqsave_destructor(class_rq_lock_irqsave_t *_T)
{
	if (!({
				unsigned long _rc = ( unsigned long)(_T->lock);
				__builtin_expect(!!((_rc - 1) >= -4095 - 1), 0);
			})) {
		rq_unlock_irqrestore(_T->lock, &_T->rf);
	}
}
static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
void *
class_rq_lock_irqsave_lock_ptr(class_rq_lock_irqsave_t *_T)
{
	void *_ptr = (void *)( unsigned long)*(&_T->lock);
	if (IS_ERR(_ptr)) {
		_ptr = ((void *)0);
	}
	return _ptr;
}
static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
int
class_rq_lock_irqsave_lock_err(class_rq_lock_irqsave_t *_T)
{
	long _rc = ( unsigned long)*(&_T->lock);
	if (!_rc) {
		_rc = -16;
	}
	if (!__builtin_expect(!!((unsigned long)(void *)(_rc) >= (unsigned long)-4095), 0)) {
		_rc = 0;
	}
	return _rc;
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) __attribute__((__always_inline__))
class_rq_lock_irqsave_t
class_rq_lock_irqsave_constructor(struct rq *l)
{
	class_rq_lock_irqsave_t _t = { .lock = l }, *_T = &_t;
	rq_lock_irqsave(_T->lock, &_T->rf);
	return _t;
}
static void sched_balance_update_blocked_averages(int cpu)
{
 struct rq *rq = (&(*({ do { const void __seg_gs *__vpp_verify = (typeof((&(runqueues)) + 0))((void *)0); (void)__vpp_verify; } while (0); ({ unsigned long __ptr; __asm__ ("" : "=r"(__ptr) : "0"((__typeof_unqual__(*((&(runqueues)))) *)(( unsigned long)((&(runqueues)))))); (typeof((__typeof_unqual__(*((&(runqueues)))) *)(( unsigned long)((&(runqueues)))))) (__ptr + (((__per_cpu_offset[((cpu))])))); }); })));

 class_rq_lock_irqsave_t __UNIQUE_ID_guard_1377
	 __attribute__((__cleanup__(class_rq_lock_irqsave_destructor))) =
	 class_rq_lock_irqsave_constructor(rq);

 update_rq_clock(rq);
 __sched_balance_update_blocked_averages(rq);
}

SCOPES

Scope creation is slightly different for classes and guards, but both follow the same principle.

Class

#define __scoped_class(_name, var, _label, args...)        \
	for (CLASS(_name, var)(args); ; ({ goto _label; })) \
		if (0) {                                   \
_label:                                                    \
			break;                             \
		} else

#define scoped_class(_name, var, args...) \
	__scoped_class(_name, var, __UNIQUE_ID(label), args)

That for+if+goto trinity looks a bit unholy at first, but let's look at what the requirements are for a macro that lets us create a new scope:

  • create a new scope
  • declare the __cleanup variable in that new scope
  • make the macro usable either with a single statement, or with curly braces

A for loop gives us the declaration and the scope. However that for loop needs to run once, and it'd be a shame to have to declare a loop counter. The “run exactly once” mechanism is thus encoded in the form of the if+goto.

Consider:

	for (CLASS(_name, var)(args); ; ({ goto _label; }))
		if (0) {
_label:
			break;
		} else {
		   stmt;
		}

The execution order will be:

CLASS(_name, var)(args);
stmt;
goto _label;
break;

We thus save ourselves the need for an extra variable at the cost of mild code reader confusion, a common trick used in the kernel.

Guards

For guard scopes, we find the same for+if+goto construct but with some added checks. For regular (unconditional) guards, this is pretty much the same as for CLASS'es:

/*
 * Helper macro for scoped_guard().
 *
 * Note that the "!__is_cond_ptr(_name)" part of the condition ensures that
 * compiler would be sure that for the unconditional locks the body of the
 * loop (caller-provided code glued to the else clause) could not be skipped.
 * It is needed because the other part - "__guard_ptr(_name)(&scope)" - is too
 * hard to deduce (even if could be proven true for unconditional locks).
 */
#define __scoped_guard(_name, _label, args...)				\
	for (CLASS(_name, scope)(args);					\
	     __guard_ptr(_name)(&scope) || !__is_cond_ptr(_name);	\
	     ({ goto _label; }))					\
		if (0) {						\
_label:									\
			break;						\
		} else

#define scoped_guard(_name, args...)	\
	__scoped_guard(_name, __UNIQUE_ID(label), args)

For conditional guards, we mainly factor in the fact that the constructor can “fail”. This is relevant for e.g. trylocks where the lock acquisition isn't guaranteed to succeed.

#define __scoped_cond_guard(_name, _fail, _label, args...)		\
	for (CLASS(_name, scope)(args); true; ({ goto _label; }))	\
		if (!__guard_ptr(_name)(&scope)) {			\
			BUILD_BUG_ON(!__is_cond_ptr(_name));		\
			_fail;						\
_label:									\
			break;						\
		} else

#define scoped_cond_guard(_name, _fail, args...)	\
	__scoped_cond_guard(_name, _fail, __UNIQUE_ID(label), args)
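
As a usage sketch: the kernel defines conditional variants of the mutex guard (e.g. mutex_try, built on mutex_trylock() via DEFINE_GUARD_COND() in include/linux/mutex.h), and with one of those scoped_cond_guard() lets you express the failure path inline. The struct and do_work() below are placeholders of mine:

/* Hypothetical example: 'mutex_try' is the trylock-based conditional guard. */
static int try_do_work(struct my_thing *t)
{
	scoped_cond_guard(mutex_try, return -EBUSY, &t->lock) {
		/* t->lock is held here and dropped when the scope ends */
		do_work(t);
	}

	return 0;
}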

So in the end, that __DEFINE_CLASS_IS_CONDITIONAL() faff is there:

  • To help optimize unconditional guard scopes
  • To ensure conditional guard scopes are used correctly (i.e. the lock acquisition failure is expected)

Debuggability

You'll note that while guards delete an entire class of errors associated with goto's, they shuffle the code around.

From my experimentation, if you put the constructor and the destructor on a separate line in the CLASS/GUARD definition, you'll at least be able to tell them apart during a splat:

DEFINE_LOCK_GUARD_1(raw_spinlock_irqsave_bug, raw_spinlock_t,
		    raw_spin_lock_irqsave(_T->lock, _T->flags),
spinlock.h:571:	    raw_spin_unlock_irqrestore_bug(_T->lock, _T->flags),
		    unsigned long flags)

int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
	        ...
core.c:4108	scoped_guard (raw_spinlock_irqsave_bug, &p->pi_lock) {
	        }
	        ...
}
[    0.216287] kernel BUG at ./include/linux/spinlock.h:571!
[    0.217115] Oops: invalid opcode: 0000 [#1] SMP PTI
[    0.217285] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    0.217285] RIP: 0010:try_to_wake_up (./include/linux/spinlock.h:569 (discriminator 6) kernel/sched/core.c:4108 (discriminator 6))
[    0.217285] Call Trace:
[    0.217285]  <TASK>
[    0.217285]  ? __pfx_kthread_worker_fn (kernel/kthread.c:966)
[    0.217285]  __kthread_create_on_node (kernel/kthread.c:535)
[    0.217285]  kthread_create_worker_on_node (kernel/kthread.c:1043 (discriminator 1) kernel/kthread.c:1073 (discriminator 1))
[    0.217285]  ? vprintk_emit (kernel/printk/printk.c:4625 kernel/printk/printk.c:2433)
[    0.217285]  workqueue_init (kernel/workqueue.c:7873 kernel/workqueue.c:7922)
[    0.217285]  kernel_init_freeable (init/main.c:1675)
[    0.217285]  ? __pfx_kernel_init (init/main.c:1570)
[    0.217285]  kernel_init (init/main.c:1580)
[    0.217285]  ret_from_fork (arch/x86/kernel/process.c:164)
[    0.217285]  ? __pfx_kernel_init (init/main.c:1570)
[    0.217285]  ret_from_fork_asm (arch/x86/entry/entry_64.S:259)
[    0.217285]  </TASK>

TL;DR

  • DEFINE_FREE()

    • Sort-of-smart pointer
    • Definition tied to the freeing function, e.g. DEFINE_FREE(kfree, ...)
  • DEFINE_CLASS()

    • Like DEFINE_FREE() but with factorized initialization.
  • DEFINE_GUARD()

    • Like DEFINE_CLASS() but you don't need the underlying variable
    • e.g. locks don't require declaring a variable, you just lock and unlock them.
  • DEFINE_LOCK_GUARD()

    • Like DEFINE_GUARD() but when a single pointer isn't sufficient for lock/unlock operations.
    • Also for “special” locks with no underlying type, such as RCU, preempt_disable() or migrate_disable().
 

from Konstantin Ryabitsev

TLDR: use korgalore to bypass mailing list delivery problems

If you're a Gmail or Outlook user and you're subscribed to high-volume mailing lists, you're probably routinely missing mail. Korgalore is a tool that monitors mailing lists via lore.kernel.org and can import mail directly into your inbox so you don't miss any of it. You can also couple korgalore with lei for powerful filtering features that can reduce the firehose to what you'd actually find useful.

The problem with the “big 3”

If you're a user of Gmail or Outlook trying to participate in Linux kernel development, you're probably aware that it's... not great. Truth is, it's nearly impossible these days to run a technical mailing list and expect that it will be successfully delivered to the “big 3” consumer-grade mailbox providers — Gmail, Outlook, or Yahoo.

There are many reasons for this, but the primary one is that technical mail looks nothing like 99.99% of the mail traffic that their filters are trained on. When a technical message arrives, especially one that includes a patch, the automation decides it's likely spam or something potentially unsafe. If you're not checking your junk folder daily, you're probably missing a lot of legitimate email.

Worst of all, if you're trying to subscribe to a high-volume mailing list using Gmail or Outlook, you can forget it — you will hit delivery quotas almost instantly. Our outgoing mail nodes routinely hit queues of 100,000+ messages, all because of “temporary delivery quotas” while trying to deliver mail to Gmail subscribers.

Korgalore is a tool that can help. It fetches messages directly from public-inbox archives (like lore.kernel.org) and delivers them directly to your mailbox, bypassing all the problematic mail routing that causes messages to go missing.

How korgalore helps

We cannot fix email delivery, but we can sidestep it entirely. Public-inbox archives like lore.kernel.org store all mailing list traffic in git repositories. In its simplest configuration, korgalore can shallow-clone these repositories directly and upload any new messages straight to your mailbox using the provider's API.

This approach has several advantages:

  • Nothing gets lost — you get every message that was posted to the list
  • You control the labels/folders — organize messages however you want
  • Works with your existing workflow — messages appear in your regular inbox

Korgalore currently supports these delivery targets:

  • Gmail (via API with OAuth2)
  • Microsoft 365 (via IMAP with OAuth2)
  • Generic IMAP servers
  • JMAP servers (Fastmail, etc.)
  • Local maildir
  • Pipe to external command (e.g. so you can feed it to fetchmail)

Installing korgalore

The easiest way to install korgalore is via pipx:

$ pipx install korgalore
[...]
$ kgl --version
kgl, version 0.4

For the GUI application, you'll also need GTK and AppIndicator libraries. On Fedora:

$ sudo dnf install python3-gobject gtk3 libappindicator-gtk3
$ pipx install 'korgalore[gui]'

Getting started with Gmail

This is the hardest part of the process, because Google makes it unreasonably hard to get API access to your own inbox. It's like they don't want you to even try it.

Getting API access credentials

You will need to start by getting OAuth2 client credentials.

If you are a kernel maintainer with an active kernel.org account, you can run the following command to get what you'll need directly from us:

$ ssh git@gitolite.kernel.org get-kgl-creds

If you're not a kernel maintainer, then I'm afraid you're going to have to jump through a bajillion hoops. The process is described on this page:

Authenticating with Google

Once you have the json file with your credentials, run kgl edit-config. An editor will open with the following content:

### Targets ###

[targets.personal]
type = 'gmail'
credentials = '~/.config/korgalore/credentials.json'
# token = '~/.config/korgalore/token.json'

### Deliveries ###

# [deliveries.lkml]
# feed = 'https://lore.kernel.org/lkml'
# target = 'personal'
# labels = ['INBOX', 'UNREAD']

Just save it for now without any edits, but make a note where the credentials path is. Save the json file you got from Google (or from us) to that location: ~/.config/korgalore/credentials.json.

Next, authenticate with your Gmail account:

$ kgl auth personal

This opens a browser window for OAuth2 authentication, so it needs to run on your workstation, because it will need to talk to localhost to complete the authentication.

Once you have obtained the token, it is stored locally and refreshed automatically unless revalidation is required (once a week for “testing” applications).

Configure a delivery

Let's say you want to subscribe to the netdev list.

Edit the configuration file again:

$ kgl edit-config

Add a delivery that maps the netdev feed to your Gmail account:

[deliveries.netdev]
feed = 'https://lore.kernel.org/netdev'
target = 'personal'
labels = ['INBOX', 'UNREAD']

Run kgl pull once after this to initialize the subscription. No mail will be delivered on the first run, so it's just like subscribing to a real list.

$ kgl pull
Updating feeds  [####################################]  1/1
Initialized new feed: netdev
Pull complete with no updates.

You can add any list hosted on lore.kernel.org (or on any other public-inbox server) as a separate delivery.

Periodic pulls

Next time you run kgl pull you will see something like this, assuming there is new mail to be delivered:

$ kgl pull
Updating feeds  [####################################]  1/1
Delivering to personal  [####################################]  2/2
Pull complete with updates:
  netdev: 2

If all went well, messages will appear in your Gmail inbox.

Targets other than gmail

Korgalore will happily deliver to the following targets:

  • Gmail
  • Outlook 365
  • Generic IMAP
  • JMAP (Fastmail)
  • Maildir
  • Pipe

Refer to the korgalore documentation for configuration details.

Yanking a random thread

See a thread on lore that you just really want to answer? You can yank it into your inbox by just pasting the URL to the message you want (or use --thread for the whole thread):

$ kgl yank --thread https://lore.kernel.org/netdev/CAFfO_h4cX0+L=ieA_JF7QBvH-dDYsHnTUuN4gApguqxVpWyy2g@mail.gmail.com
Found 5 messages in thread
Uploading thread  [####################################]  5/5
Successfully uploaded 5 messages from thread

Doing a lot more with lei

The lei tool is the client-side utility for querying and interacting with public-inbox servers. It should be installable on most distributions these days. For Fedora:

$ sudo dnf install lei

This will pull in a large number of Perl dependencies, but they are all fairly tiny.

Yank and track a thread

Sometimes you don't want to follow an entire list, just a specific hot topic discussion. The track command lets you yank a thread and then receive any follow-ups to it. Korgalore lets you do that easily:

$ kgl track add https://lore.kernel.org/lkml/20260116.feegh2ohQuae@digikod.net/
Creating lei search for thread: 20260116.feegh2ohQuae@digikod.net
Populating lei search repository...
Started tracking thread track-bcd0dc0604fd: Re: [GIT PULL] Landlock fix for v6.19-rc6
Now tracking thread track-bcd0dc0604fd: Re: [GIT PULL] Landlock fix for v6.19-rc6
Target: personal, Labels: INBOX, UNREAD
Delivering 5 messages to target...
Delivered 5 messages.

This creates a persistent search that monitors lore.kernel.org for replies to that thread. New messages are automatically delivered during regular pull operations.

Tired of tracking a thread? Find it with kgl track list and then stop following it:

$ kgl track stop track-bcd0dc0604fd

Threads automatically expire after 30 days of inactivity, but can be resumed if the discussion picks up again.

Tracking messages for a specific subsystem

If you're a maintainer, you can track your entire subsystem using the track-subsystem command. This parses the kernel's MAINTAINERS file and creates queries for all relevant mailing list traffic:

$ kgl track-subsystem -m MAINTAINERS 'SELINUX SECURITY MODULE'
Found subsystem: SELINUX SECURITY MODULE
Creating mailinglist query: l:selinux.vger.kernel.org AND d:7.days.ago..
Creating patches query: (dfn:... OR dfn:... [...]) AND d:7.days.ago..
Created 2 lei queries for subsystem "SELINUX SECURITY MODULE"
Configuration written to: /home/user/.config/korgalore/conf.d/selinux_security_module.toml
Target: personal, Labels: INBOX, UNREAD

This effectively subscribes you to the selinux mailing list, plus creates a query that will match the patches touching that subsystem, using the patterns defined in MAINTAINERS.

The next time you run kgl pull, it will upload the last 7 days of messages matching both queries:

$ kgl pull
Updating feeds  [####################################]  3/3
Delivering to personal  [####################################]  37/37
Pull complete with updates:
  selinux_security_module-mailinglist: 33
  selinux_security_module-patches: 4

Arbitrary lei queries

Korgalore will happily follow arbitrary lei queries that you have defined. For example, if you want to receive a copy of all mail sent by a co-maintainer, you can run the following:

$ lei q --only https://lore.kernel.org/all \
    -o v2:/home/user/.lei/comaintainer-spying \
    f:torvalds@linux-foundation.org AND d:7.days.ago..

Then you can add the following section to korgalore.toml:

[deliveries.comaintainer-spying]
feed = 'lei:/home/user/.lei/comaintainer-spying'
target = 'personal'
labels = ['INBOX', 'UNREAD']

Filtering unwanted senders

Korgalore doesn't come with complicated filtering — lei is much more suited for that purpose. However, if there is someone whose mail you absolutely never want to see, you can add them to the bozofilter.

$ kgl bozofilter --add bozo@example.com --reason "off-topic noise"

Blocked messages are silently skipped during delivery.

Using the GUI taskbar app for background syncing

For day-to-day use, the GUI application runs in your system tray and syncs automatically:

$ kgl gui

The GUI provides:

  • Automatic background syncing at configurable intervals
  • Manual “Sync Now” when you want immediate updates
  • “Yank” dialog to fetch specific messages by URL or Message-ID
  • Network awareness — pauses sync when offline and resumes when connected
  • Re-authentication prompts when OAuth tokens expire
  • Quick editing of the config or the bozofilter

A couple of videos demonstrating the GUI app in action are available in the original post.

Documentation and source

Full documentation is available at:

https://korgalore.docs.kernel.org/

Source repository:

https://git.kernel.org/pub/scm/utils/korgalore/korgalore.git

If you run into issues or have feature requests, please send them to tools@kernel.org.

 

from paulmck

TL;DR: Unless you are doing very strange things with RCU (read-copy update), not much!!!

So why has the guy most responsible for Linux-kernel RCU spent so much time over the past five years working on the provenance-related lifetime-end pointer-zap issue within the C++ Standards Committee?

But first...

What is Pointer Provenance?

Back in the old days, provenance was for objets d'art and the like, and we did not need them for our pointers, no sirree!!! Pointers had bits, those bits formed memory addresses, and as often as not we didn't even need to worry about these addresses being translated. But life is more complicated now. On the other hand, computing life is also much bigger, faster, more reliable, and (usually) more productive, so be extremely careful what you wish for from back in the Good Old Days!

These days, pointers have provenance as well as addresses, and this has consequences. The C++ Standard (recent draft) states that when an object's storage duration ends, any pointers to that object become invalid. For its part, the C Standard states that when an object's storage duration ends, any pointers to that object become indeterminate. In both standards, the wording is more precise, but this will serve for our purposes.

For the remainder of this document, we will follow C++ and say “invalid”, which is shorter than “indeterminate”. We will balance this out by using C-language example code. Those preferring C++ will be happy to hear that this is the language that I use in my upcoming CPPCON presentation.

Neither standard places any constraints on what a compiler can do with an invalid pointer value, even if all you are doing is loading or storing that value.

Those of us who cut our teeth on assembly language might quite reasonably ask why anyone would even think to make pointers so grossly invalid that you cannot even load or store them. To see the historical reasons, let's start by looking at pointer comparisons using this code fragment:

p = kmalloc(...);
might_kfree(p);         // Pointer might become invalid (AKA "zapped")
q = kmalloc(...);       // Assume that the addresses of p and q are equal.
if (p == q)             // Compiler can optimize as "if (false)"!!!
    do_something();

Both p and q contain addresses, but the compiler also keeps track of the fact that their values were obtained from different invocations of kmalloc(). This information forms part of each pointer's provenance. This means that p and q have different provenance, which in turn means that the compiler does not need to generate any code for the p == q comparison. The two pointers' provenance differs, so no matter what the addresses might be, the result cannot be anything other than false.

And this is one motivation for pointer provenance and invalidity: The results of operations on invalid pointers are not guaranteed, which provides additional opportunities for optimization. This example perhaps seems a bit silly, but modern compilers can use pointer provenance and invalidity to carry out serious points-to and aliasing analysis.

Yes, you can have hardware provenance. Examples include ARM MTE, the CHERI research prototype (which last I checked had issues with C++'s requirement that pointers be trivially copyable), and the venerable IBM System i. Conventional systems provide pointer provenance of a sort via their page tables, which is used by a variety of memory-allocation-use debuggers, for but one example, the efence library. The pointer-provenance features of ARM MTE and IBM System i are not problematic, but last I checked, the jury was still out on CHERI.

Of course, using invalid (AKA “dangling”) pointers is known to be a bad idea. So why are we even talking about it???

Why Would Anyone Use Invalid/Dangling Pointers?

Please allow me to introduce you to the famous and frequently re-invented LIFO Push algorithm. You can find this in many places, but let's focus on the Linux kernel's llist_add_batch() and llist_del_all() functions. The former atomically pushes a list of elements on a linked-list stack, and the latter just as atomically removes the entire contents of the stack:

static inline bool llist_add_batch(struct llist_node *new_first,
                                   struct llist_node *new_last,
                                   struct llist_head *head)
{
    struct llist_node *first = READ_ONCE(head->first);

    do {
        new_last->next = first;
    } while (!try_cmpxchg(&head->first, &first, new_first));

    return !first;
}

static inline struct llist_node *llist_del_all(struct llist_head *head)
{
    return xchg(&head->first, NULL);
}

As lockless concurrent algorithms go, this one is pretty straightforward. The llist_add_batch() function reads the list header, fills in the ->next pointer, then does a compare-and-exchange operation to point the list header at the new first element. The llist_del_all() function is even simpler, doing a single atomic exchange operation to NULL out the list header and returning the elements that were previously on the list. This algorithm also has excellent forward-progress properties: the llist_add_batch() function is lock-free and the llist_del_all() function is wait-free.
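
For reference, the kernel's common single-element push, llist_add(), is just the batch variant with the new node passed as both the first and the last element:

static inline bool llist_add(struct llist_node *new, struct llist_head *head)
{
	return llist_add_batch(new, new, head);
}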

So what is not to like?

In assembly language, or with a simple compiler, not much. But more heavily optimized languages have serious pointer-provenance issues with this code. To see them, consider the following sequence of events:

  1. CPU 0 allocates an llist_node B and passes it via both the new_first and new_last parameters of llist_add_batch().
  2. CPU 0 picks up the head->first pointer and places it in the first local variable, then assigns it to new_last->next. This new_last->next pointer now references llist_node A.
  3. CPU 1 invokes llist_del_all(), which returns a list containing llist_node A. The caller of llist_del_all() processes A and passes it to kfree().
  4. CPU 0's new_last->next pointer is now invalid due to llist_node A having been freed. But CPU 0 does not know this, though a sufficiently all-knowing compiler just might.
  5. CPU 1 allocates an llist_node C that happens to have the same address as the old llist_node A. It passes C via both the new_first and new_last parameters of llist_add_batch(), which runs to completion. The head pointer now points to llist_node C, which happens to have the same address as the now storage-duration-ended llist_node A. However, the two pointers reference objects created by different memory-allocation calls, and thus have different provenance, and thus are not necessarily equal.
  6. CPU 0 finally gets around to executing its try_cmpxchg(), which will succeed, courtesy of the fact that try_cmpxchg() compares only the bits actually represented in the pointer, and not any implicit pointer provenance (and please note that the same is true of both the C and C++ compare-and-exchange operations). The llist now contains an llist_node B that contains an invalid pointer to dead llist_node A, but whose address happens to reference the shiny new llist_node C. (We term this invalid pointer a “zombie pointer” because it has in some assembly-language sense come back from the dead.)
  7. Some CPU invokes llist_del_all() and gets back an llist containing an invalid ->next pointer.

One could argue that the Linux-kernel implementation of LIFO Push is simply buggy and should be fixed. Except that there is no reasonable way to fix it. Which of course raises the question...

What Are Unreasonable Fixes?

We can protect pointers from invalidity by storing them as integers, but:

  1. Suppose someone has an element that they are passing to a library function. They should not be required to convert all their ->next pointers to integer just because the library's developers decide to switch to the LIFO Push algorithm for some obscure internal operation.
  2. In addition, switching to integer defeats type-checking, because integers are integers no matter what type of pointer they came from.
  3. We could restore some type-checking capability by wrapping the integer into a differently named struct for each pointer type. Except that this requires a struct with some particular name to be treated as compatible with pointers of some type corresponding to that name, a notion that current compilers do not support.
  4. In C++, we could use template metaprogramming to wrap an integer into a class that converts automatically to and from compatibly typed pointers. But there would then be windows of time in which there was a real pointer, and at that time there would still be the possibility of pointer invalidity.
  5. All of the above hack-arounds put additional obstacles in the way of developers of concurrent software.
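
To make the first two objections concrete, here is a minimal userspace sketch of the “stash the link as an integer” workaround, using C11 atomics rather than the kernel's primitives (all names are mine). The ->next link never exists as a pointer while it sits in the structure, which dodges lifetime-end zap on that field, but as noted above the integer carries no type information whatsoever:

#include <stdatomic.h>
#include <stdint.h>

struct node {
	uintptr_t next;		/* really a struct node *, stashed as an integer */
};

struct stack {
	_Atomic uintptr_t first;
};

static void push(struct stack *head, struct node *n)
{
	uintptr_t first = atomic_load(&head->first);

	do {
		n->next = first;
	} while (!atomic_compare_exchange_weak(&head->first, &first,
					       (uintptr_t)n));
}

static struct node *pop_all(struct stack *head)
{
	/* pointers are reconstituted only when the caller walks the list */
	return (struct node *)atomic_exchange(&head->first, (uintptr_t)0);
}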

Alternatively, in environments such as the Linux kernel that provide their own memory allocators, we can hide those allocators from the compiler. But this is not free; in fact, the patch that exposed the Linux kernel's memory allocators to the compiler resulted in a small but significant improvement.

However, it is fair to ask...

Why Do We Care About Strange New Algorithms???

Let's take a look at the history, courtesy of Maged Michael's diligent software archaeology.

In 1986, R. K. Treiber presented an assembly language implementation of the LIFO Push algorithm in technical report RJ 5118 entitled “Systems Programming: Coping with Parallelism” while at the IBM Almaden Research Center.

In 1975, an assembly language implementation of this same algorithm (except with pop() instead of popall(), but still having the same ABA properties) was presented in the IBM System 370 Principles of Operation as a method for managing a concurrent freelist.

US Patent 3,886,525 was filed in June 1973, just a few months before I wrote my first line of code, and contains a prior-art reference to the LIFO Push algorithm (again with pop() instead of popall()) as follows: “Conditional swapping of a single address is sufficient to program a last-in, first-out single-user-at-a-time sequencing mechanism.” (If you were to ask a patent attorney, you would likely be told that this 50-year-old patent has long since expired. Which should be no surprise, given that it is even older than Dennis Ritchie's setuid Patent 4,135,240.)

All three of these references describe LIFO push as if it was straightforward and well known.

So we don’t know who first invented LIFO Push or when they invented it, but it was well known in 1973. Which is well over a decade before C was first standardized, more than two decades before C++ was first standardized, and even longer before Rust was even thought of.

And its combination of (relative) simplicity and excellent forward-progress properties just might be why this algorithm was anonymously invented so long ago and why it is so persistently and repeatedly reinvented. This frequent reinvention puts paid to any notion that LIFO Push is strange.

So sorry, but LIFO Push is neither new nor strange.

Nor is it the only situation where lifetime-end pointer zap causes problems. Please see the “Zap-Susceptible Algorithms” section of P1726R5 (“Pointer lifetime-end zap and provenance, too”) for additional use cases.

So What Do We Do?

The lifetime-end pointer-zap story is not yet over, and we are in fact currently pushing for the changes in four working papers.

Nondeterministic Pointer Provenance

P2434R4 (“Nondeterministic pointer provenance”) is the basis for the other three papers. It asks that when converting a pointer to an integer and back, the implementation must choose a qualifying pointed-to object (if there is one) whose storage duration began before or concurrently with the conversion back to a pointer. In particular, the implementation is free to ignore a qualifying pointed-to object when the conversion to pointer happens before the beginning of that object’s storage duration.

The “qualifying” qualifier includes compatible type, as well as sufficiently early and long storage duration.
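
A hypothetical illustration of what that wording permits (variable names are mine):

#include <stdint.h>
#include <stdlib.h>

void example(void)
{
	void *p = malloc(16);
	uintptr_t ip = (uintptr_t)p;	/* pointer-to-integer conversion */

	free(p);			/* p's storage duration ends */
	void *q = malloc(16);		/* may well reuse the same address */

	void *r = (void *)ip;		/* integer-to-pointer conversion */
	/*
	 * Under P2434R4, this conversion must resolve to a qualifying object
	 * whose storage duration began before or concurrently with the
	 * conversion: the object q points to is such a candidate (if the
	 * addresses match), whereas an object allocated only after this
	 * conversion may be ignored.
	 */
	(void)r;
	free(q);
}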

But why restrict the qualifying pointed-to object's storage duration to begin before or concurrently with the conversion back to a pointer?

An instructive example by Hans Boehm may be found in P2434R4, which shows that reasonable (and more important, very heavily used) optimizations would be invalidated by this approach. Several examples that manage to be even more sobering may be found in David Goldblatt's P3292R0 (“Provenance and Concurrency”).

Pointer Lifetime-End Zap Proposed Solutions: Atomics and Volatile

P2414R10 (“Pointer lifetime-end zap proposed solutions: Atomics and volatile”) is motivated by the observation that atomic pointers are subject to update at any time by any thread, which means that the compiler cannot reasonably do much in the way of optimization. This paper therefore asks (1) that atomic operations be redefined to yield and to store prospective pointer values and (2) that operations on volatile pointers be defined to yield and to store prospective pointer values. The effect is as if atomic pointers were stored internally as integers. This includes the “old” pointer passed by reference to compare_exchange().

This helps, but is not a full solution because atomic pointers are converted to non-atomic pointers prior to use, at which point they are subject to lifetime-end pointer zap. And the standard does not guarantee that a zapped pointer can even be loaded, stored, passed to a function, or returned from a function. Which brings us to the next paper.

Pointer Lifetime-End Zap Proposed Solutions: Tighten IDB for Invalid Pointers

P3347R4 (“Pointer lifetime-end zap proposed solutions: Tighten IDB for invalid pointers”) therefore asks that all non-comparison, non-arithmetic, non-dereference computations involving pointers, specifically including normal loads and stores, be fully defined even if the pointers are invalid. This permits invalid pointers to be loaded, stored, passed as arguments, and returned. Fully defining comparisons would rule out optimizations, and fully defining arithmetic would be complex and thus far unneeded. Fully defining dereferencing of invalid pointers would of course be problematic.
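
In code terms (names are mine), the first three operations below would become well defined even when p is dangling, while comparison, arithmetic, and dereference remain as they are today:

struct node;
struct node *stash;

static struct node *handle(struct node *p)	/* p may be dangling */
{
	struct node *q = p;	/* copying/loading an invalid pointer: OK */
	stash = q;		/* storing it: OK                         */
	return q;		/* passing/returning it: OK               */
	/*
	 * Still not blessed by P3347R4:
	 *   p == stash   (comparison)
	 *   p + 1        (arithmetic)
	 *   p->field     (dereference)
	 */
}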

If these first three papers are accepted into the standard, the C++ implementation of LIFO Push shown above becomes valid code. This is important because this algorithm has been re-invented many times over the past half century, and is often open coded. This frequent open coding makes it infeasible to construct tools that find LIFO Push implementations in existing code.

P3790R1: Pointer Lifetime-End Zap Proposed Solutions: Bag-of-Bits Pointer Class

P3790R1 (“Pointer lifetime-end zap proposed solutions: Bag-of-bits pointer class”) asks for (1) the addition to the C++ standard library of the function launder_ptr_bits() that takes a pointer argument and returns a prospective pointer value corresponding to its argument; and (2) the addition to the C++ standard library of the class template std::ptr_bits<T> that is a pointer-like type that is still usable after the pointed-to object’s lifetime has ended. Of course, such a pointer still cannot be dereferenced unless there is a live object at that pointer's address. Furthermore, some systems, such as ARMv9 with memory tagging extensions (MTE) enabled, have provenance as well as address bits in the pointer, and on such systems dereferencing will fail unless the pointer's provenance bits happen to match those of the pointed-to object.

This function and class template are nevertheless quite useful; for example, they may be used to maintain hash maps keyed by pointers after the pointed-to object's lifetime has ended. These can be extremely useful for debugging, especially in cases where the overhead of full-up address sanitizers cannot be tolerated.

Unlike LIFO Push, source-code changes are required for these use cases. This is unfortunate, but we have thus far been unable to come up with a same-source-code approach.

Those who have participated in standards work (or even open-source work) will understand that the names launder_ptr_bits() and std::ptr_bits<T> just might still be subject to bikeshedding.

A Happy Lifetime-End Pointer Zap Ending?

It is still too early to say for certain, but thus far these proposals are making much better progress than did their predecessors. So who knows? Perhaps C++29 will address lifetime-end pointer zap.

 

from paulmck

This is part of the Kernel Recipes 2025 blog series.

The other posts in this series help with small improvements over a long time. But what do you do if you only have a few weeks until your presentation? Yes, it is best to avoid procrastination, but sometimes you simply don't have all that much notice.

First, have a very clear picture of what you want the audience to gain from your presentation. A carefully chosen and tight focus will save you time that might otherwise have been wasted on irrelevant details.

Second, do dry-run presentations, preferably to people who won't be shy about giving you honest feedback. If your dry-run audience has shy people, you can ask them questions to see if they picked up on the key points of your presentation. If you cannot scare up a human audience on short notice, record your presentation (on your smartphone if nothing else) and review it. In the old pre-smartphone days, we would do our audience-free dry runs in front of a mirror, which can still be useful, for example, if your smartphone's battery is empty.

Third, repeat the important portions of your presentation, which usually includes the opening, the conclusion, and any surprise “reveals” in the middle of the presentation. If it is an important presentation (but aren't they all?), do about 20 repetitions of the important portions. If it is an extremely important presentation, dry-run the entire presentation about 20 times. Yes, this can take time, but on the other hand, most of my extremely important presentations were quite short, on the order of 3-5 minutes.

Fourth and finally, get a good night's sleep before the day of the presentation.

 

from paulmck

This is part of the Kernel Recipes 2025 blog series.

I have been consciously working on speaking skills for more than half a century.  This section lists a few of the experiences along the way. My hope is that this motivates you to take the easier and faster approaches laid out in the rest of this blog series.

Comic Relief

A now-disgraced comedian who was immensely popular in the 1960s was said to have learned his craft at school.  They said that he discovered that if he could make the schoolyard bullies laugh, they would often forget about roughing him up.  I tried the same approach, though with just barely enough success to persist.  Part of my problem was that I spent most of my time focusing on academic skills, which certainly proved to be a wise choice longer term, but did limit the time available to improve my comedic capabilities.  I was also limited by my not-so-wise insistence on taking myself too seriously.  Choices, choices!

My classmates often told very funny jokes, and I firmly believed that making up jokes was a cognitive skill, and I just as firmly believed (and with some reason) that I was a cognitive standout.  If they could do it, so could I!!!

But for a very long time, my jokes were extremely weak compared to theirs.

Until one day, I told a joke that everyone laughed at.  Hard.  For a long time.  (And no, I do not remember that joke, but then again, it was a joke targeted towards seventh graders and you most likely are not in seventh grade.)

Once they recovered, one of them asked “What show did you see that on?”

Suddenly the awful truth dawned on me.  My classmates were not making up these jokes.  They were seeing them on television, and rushing to be the first to repeat them the next day.  Why was this not obvious to me?  Because my family did not have a television.

My surprise did not prevent me from replying “The Blank Wall”.  Which was the honest truth: I had in fact been staring at a blank wall the previous evening while composing my first successful joke.

The next day, my classmates asked me what channel “The Blank Wall” was on.  I of course gave evasive answers, but in a few minutes they figured out that I meant a literal blank wall.  They were not impressed with my attitude.  You saw jokes on television, after all, and no one in their right mind would even try to make one up!

I also did some grade-school acting, though my big role was Jonathan Brewster in a seventh-grade production of “Arsenic and Old Lace” rather than anything comedic.  The need to work prevented my acting in any high-school plays, though to be fair it is not clear that my acting abilities would have kept up with those of my classmates in any case.

Besides, those working in retail can attest that carefully deployed humor can be extremely useful.  So my high-school grocery-store job likely provided me with more and better experience than the high-school plays could possibly have done.  At least that is what I keep telling myself!

Speech Team

For reasons that were never quite clear to me, the high-school speech-team coach asked me to try out. I probably would have ignored her, but I well recalled my father telling me that those who have nothing to say, but can say it well, will often do better than those who have something to say but cannot say it. So, against my better 13-year-old judgment, I signed up.

I did quite well in extemporaneous speech during my first year due to my relatively deep understanding of the science behind the hot topic of that time, namely the energy crisis. During later years, the hot topics reverted to the usual political and evening-news fare, so the remaining three years were good practice, but did not result in wins. Until the end of my senior year, when the coach suggested that I try radio commentary, which had the great advantage of hiding my horribly geeky teenaged face from the judges. I did quite well, qualifying for district-level competition on the strength of my first-ever radio-commentary speech.

But I can only be thankful that my 17-year-old self decided to go to an engineering university as opposed to seeking employment at a local radio station.

University Coursework

I tested out of Freshman English Composition, but I did take a couple of courses on technical writing and technical presentation. A ca. 1980 mechanical-engineering presentation on ground-loop heat pumps featured my first use of cartoons in a technical presentation, courtesy of a teammate who knew a professional cartoonist. The four of us were quite proud of having kept the class’s attention during the full duration of our talk, which took place only a few days before the start of Christmas holidays.

1980s and 1990s Presentations

I did impromptu work-related presentations for my contract-programming work in the early 1980s. In the late 1980s, I joined a research institute where I was expected to do formal presentations, including at academic venues. I joined a startup in 1990, where I continued academic presentations, but focused mainly on internal training presentations.

Toastmasters

I became a founding member of a local Toastmasters club in 1993, and during the next seven years received CTM (“Competent Toastmaster”) and ATM (“Advanced Toastmaster”) certifications. There is very likely a Toastmasters club near you, and you can search here: https://www.toastmasters.org/.

The purpose of Toastmasters is to help people develop public-speaking skills in a friendly environment. The members of the club help each other, evaluating each others’ short speeches and providing topics for even shorter impromptu speeches. The CTM and ATM certifications each have a manual that guides the member through a series of different types of speeches. For example, the 1990s CTM manual starts with a 4-6-minute speech in which the member introduces themselves. This has the benefit of ensuring that the speaker is expert on the topic, though I have come across an amnesiac who was an exception that proves this rule.

For me, the best of Toastmasters was “table topics”, in which someone is designated to bring a topic to the next meeting. The topic is called out, and people are expected to volunteer to give a short speech (a minute or two) on that topic. This is excellent preparation for those times when someone calls you out during a meeting.

Benchmarking

By the year 2000, I felt very good about my speaking ability. I was aware of some shortcomings (for example, I had difficulty with audiences larger than about 100 people), but was doing quite well, both in my own estimation and that of others. In short, it was time to benchmark myself against a professional speaker.

In that year, I attended an event whose keynote was given by none other than one of the least articulate of the US Presidents, George H. W. Bush. Now, Bush’s speaking abilities might have been unfairly compared to the larger-than-life capabilities of his predecessor (Ronald Reagan, AKA “The Great Communicator”) and his successor (Bill Clinton, whose command of people skills is the stuff of legends). In contrast, here is Ann Richards’s assessment of Bush’s skills: “born with a silver foot in his mouth”.

As noted above, I had just completed seven years in Toastmasters, so I was more than ready to do a Toastmasters-style evaluation of Bush’s keynote. I would record all the defects in this speech and email it to my Toastmasters group for their amusement.

Except that it didn’t turn out that way.

Bush gave a one-hour speech during which he did everything that I knew how to do, and did it effortlessly. Not only that, there were instances where he clearly expected a reaction from the audience, and got that reaction. I was watching him like a hawk the whole time and had absolutely no idea how he had made it happen.

Bush might well have been the most inarticulate of the US Presidents, but he was incomparably better than this software developer will ever be.

But that does not mean that I cannot continue to improve. In fact, I can now do a better job of presenting than Bush can. Not just due to my having spent the intervening decades practicing (practice makes perfect!), but mostly due to the fact that Bush has since passed away.

Linux Community

I joined the Linux community in 2001, where I faced large and diverse audiences. It quickly became obvious that I needed to apply my youthful Warner Brothers lessons, especially given that I was presenting things like RCU to audiences that were mostly innocent of any knowledge of or experience in concurrency.

This experience also gave me much-needed practice dealing with larger audiences, in a few cases, on the order of 1,000.

So I continue to improve, but there is much more for me to learn.

 
Read more...

from paulmck

This is part of the Kernel Recipes 2025 blog series.

This blog series has covered why public speaking is important, ways and means, building bridges from your audience to where they need to go, who owns your words, telling stories, knowing your destination, use of humor, and speaking on short notice.

But if you would rather learn about what I actually did rather than what I advise you to do, please see here.

I close this series by reiterating the value and ubiquity of Toastmasters and the usefulness of both dry runs and reviewing videos of your past talks.

Best of everything in your presentations!

Acknowledgments

And last, but definitely not least, a big “thank you” (in chronological order) to Anne Nicolas, Willy Tarreau, Steven Rostedt, Gregory Price, and Michael Opendacker for their careful review of early versions of this series.

 
Read more...

from paulmck

This is part of the Kernel Recipes 2025 blog series.

Humor is both difficult and dangerous, especially in a large and diverse group such as the audience for Kernel Recipes. My advice is to do many formal presentations before attempting much in the way of humor.

This section will nevertheless talk about use of humor in technical presentations.

One issue is that audience members have a wide range of languages and dialects, and a given joke in (say) American English might not go over well with (say) Welsh English speakers. And it might be completely mangled in translation to another language. For example, during a 1980s visit to China, George Bush Senior is said to have quipped “We are oriented to the Orient.” This translates to something like “我们面向东方”, which translates back to something like “We face East”, completely destroying Bush’s oriented/Orient pun. So what did the poor translator say? “是笑话,笑吧”, which translates to something like “It is a joke, laugh.”

So if you tell jokes, keep translations to other cultures and languages firmly in mind. (To be fair, this is advice that I could do well to better heed myself!)

In addition, jokes make fun of some person or group or are based on what is considered to be abnormal, excessive, or unacceptable, all of which differ greatly across cultures. Besides which, given a large and diverse audience such as that of Kernel Recipes, there will almost certainly be someone in attendance who identifies with the person or group in question or who has strong feelings about the joke’s implications about abnormality, excessiveness, or unacceptability. That someone just might have a strong negative reaction. And this should be absolutely no surprise, given that humor is used with great effect as a weapon in social conflicts.

In my youth, there were outgroups that were frequently the butt of jokes. These were often groups that were not represented in my small community, but were just as often a single-person outgroup made up of some hapless fellow student. Then as now, the most cruel jokes all too often get the best laughs.

Yet humor can also make a speech much more enjoyable. So what is a speaker to do?

Outgroups are often used, with technical talks making jokes at the expense of managers, salespeople, marketing departments, lawyers, users, and occasionally even an especially incompetent techie. But these jokes always eventually find their way to the outgroup in question, sometimes with devastating consequences to the hapless speaker.

It is better to tell jokes where you yourself are the butt of the joke. This can be difficult at first: Let’s face it, most of us would prefer to be taken seriously. However, becoming comfortable with this is well worth the effort. For one thing, once you have demonstrated a willingness to make a joke at your own expense, the audience will usually be much more willing to accept their own shortcomings and need for improvement. Such an audience will usually also be more willing to learn, and the best technical talks are after all those that audiences learn from.

What jokes should you tell on yourself? I paraphrase advice from the late humorist Patrick McManus: The worst day of your life will make the audience laugh the hardest.

That said, you need to make sure that the audience can relate to the challenges you faced on that day. For example, my interactions with the legal profession would likely seem strange and irrelevant to a general audience. However, almost all members of a Kernel Recipes audience will have chased down a difficult bug, so a story about some idiotic mistake I made while chasing down an RCU bug will likely resonate. And this might be one way of entertaining a general audience while providing needed information to those wanting an RCU deep dive.

Or maybe you can figure out how to work some bathroom humor into your talk. Who is the butt of this joke? You decide! ;–)

Adding humor to your talk often does not come for free. Time spent telling jokes is not available for presenting on technology. This tradeoff can be tricky: Too much humor makes for a lightweight talk, and too little for a dry talk. Especially if you are just starting out, I strongly advise you to err in the direction of dryness. Instead, make your technical content be the source of your audience’s excitement.

Use of humor in technical talks is both difficult and dangerous, but careful use of humor can be a very powerful public-speaking tool.

Perhaps some day I, too, will master the use of humor.

 
Read more...

from linusw

As I was working my way toward a mergeable version of generic entry for ARM32, there was an especially nasty bug that I could not for the life of me iron out: when booting Debian for armhf I just kept running into a boot splat, while everything else worked fine. It would look something like this:

8<--- cut here ---
Unable to handle kernel paging request at virtual address eaffff76 when execute
[eaffff76] *pgd=eae1141e(bad)
Internal error: Oops: 8000000d [#1] SMP ARM
CPU: 0 UID: 997 PID: 304 Comm: sd-resolve Not tainted 6.13.0-rc1+ #22
Hardware name: ARM-Versatile Express
PC is at 0xeaffff76
LR is at __invoke_syscall_ret+0x0/0x18
pc : [<eaffff76>]    lr : [<80100a68>]    psr: a0030013
sp : fbc11f68  ip : fbc11e78  fp : 76539420
r10: 10c5387d  r9 : 841f4ec0  r8 : 80100284
r7 : ffffffff  r6 : 7653941c  r5 : 76cb6000  r4 : 00000000
r3 : 00000000  r2 : 00000000  r1 : 00080003  r0 : ffffff9f
Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 8222006a  DAC: 00000051
Register r0 information: non-paged memory
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: NULL pointer
Register r5 information: non-paged memory
Register r6 information: non-paged memory
Register r7 information: non-paged memory
Register r8 information: non-slab/vmalloc memory
Register r9 information: slab task_struct start 841f4ec0 pointer offset 0 size 2240
Register r10 information: non-paged memory
Register r11 information: non-paged memory
Register r12 information: 2-page vmalloc region starting at 0xfbc10000 allocated at copy_process+0x150/0xd88
Process sd-resolve (pid: 304, stack limit = 0xbab1c12b)
Stack: (0xfbc11f68 to 0xfbc12000)
1f60:                   00000000 76cb6000 fbc11fb0 ffffffff 80100284 80cdd330
1f80: 80100284 841f4ec0 10c5387d 80111eac 00000000 76cb6000 7653941c 00000119
1fa0: 76539420 80100280 00000000 76cb6000 ffffff9f 00080003 00000000 00000000
1fc0: 00000000 76cb6000 7653941c 00000119 76539c48 76539c44 76539b6c 76539420
1fe0: 76f3a450 765392c4 76c72a4d 76c60108 20030030 00000010 00000000 00000000
Call trace: 
Code: 00000000 00000000 00000000 00000000 (00000000) 
---[ end trace 0000000000000000 ]---

The paging request means that we are in kernel mode, and we have tried to page in a page that does not exist, such as reading from random uninitialized memory somewhere. If this was userspace, we would get a “segmentation fault”. So this is a pretty common error in C programs.

Notice the following: no call trace. This always happens when you least want it: how am I supposed to know how we got here?

But the die() splat has this helpful information: PID: 304 Comm: sd-resolve which reads: this was caused by the process initiated by the command sd-resolve executing as PID 304. But sd-resolve is just something systemd fires temporarily when bringing up some other service, so it must be part of a service. Luckily we also have dmesg:

       Starting systemd-timesyncd… - Network Time Synchronization...
8<--- cut here ---
Unable to handle kernel paging request at virtual address eaffff76 when execute

Aha it's the NTP service. We can verify that this process is causing the mess by issuing:

systemctl restart systemd-timesyncd

And indeed, we get a second reproducible splat. OK great, let's use ftrace to tell us what happened.

The excellent article Secrets of the Ftrace function tracer tells us that it's as simple as echoing the PID of the process into set_ftrace_pid in the kernel debugfs tracing filetree (/sys/kernel/debug/tracing).

There is a problem though: this process is by its very nature transient. I don't know the PID! After some googling it turns out you can ask systemd what PID a certain service is running as:

systemctl show --property MainPID --value systemd-timesyncd

So ... we can echo this into set_ftrace_pid and then start the trace. But the service is transient: how can I restart the service, obtain the PID, start tracing, wait for the service to finish restarting, and then end the trace? That's a tall order.

After some fooling around with trying to restart the service in one window, then quickly switching to another window and start the trace while the restart is happening, I realized what everyone should already know: never put a person to do a machine's job.

I had to write a script that would restart the service, start the trace, wait for the restart to finish and then stop the trace. It ended up looking like this:

#!/bin/bash
TRACEDIR=/sys/kernel/debug/tracing
SERVICE=systemd-timesyncd
TRACEFILE=/root/trace.dat

echo 0 > ${TRACEDIR}/tracing_on
echo "function" > ${TRACEDIR}/current_tracer
(systemctl restart ${SERVICE})& PID=$!
echo 1 > ${TRACEDIR}/tracing_on
echo "Wait for restart to commence"
wait "${PID}"
echo 0 > ${TRACEDIR}/tracing_on
echo "Restarted "
trace-cmd extract -o ${TRACEFILE}
scp ${TRACEFILE} linus@169.254.1.2:/tmp/trace.dat

This does what we want: turn off the tracing, activate the function tracer, restart the systemd-timesyncd service and capture its PID, start tracing, wait for the restart to complete, then extract the trace and copy it to the development system. No need to figure out the PID of the forked sd-resolve, just capture everything: the window will be small enough that we can capture all the relevant trace information.

After this I brought up kernelshark to inspect the resulting tracefile trace.dat:

Kernelshark looking at logs

We search for the die() invocation (here I had a pr_info() added with the word “CRASH” as well). Sure enough there it is and the task is indeed sd-resolve. But what happens before? We need to know why we crashed here.

For that we need to re-run the trace but now with the function_graph tracer so we can see the program flow, the indentation helps us to follow the messiness:

Kernelshark looking at logs for function graph

So we switch to the function_graph view and start off from die(), then we just move upward: first we find a prefetch abort, and that is what we already knew: we are getting a page fault, in kernel mode, for a page that does not have backing storage.

So browse backwards, and:

Kernelshark looking at logs for function graph

Aha. We are invoking some syscall, and that doesn't really work. So what kind of odd syscall can we invoke that just crashes on us like that? We instrument invoke_syscall() to print that out for us, not for all invocations but just for the task we are interested in, and we know that is sd-resolve:

if (!strcmp(current->comm, "sd-resolve"))
   pr_info("%s invoke syscall %d\n", current->comm, scno);

We take advantage of the fact that the global variable current in the kernel always points to the currently active task. And the field .comm should contain the command name that it was invoked with, such as “sd-resolve”.

Then we run this and oh:

[  OK  ] Started systemd-timesyncd - Network Time Synchronization.
sd-resolve invoke syscall 291
sd-resolve invoke syscall -1
8<--- cut here ---
Unable to handle kernel paging request at virtual address e7f001f2 when execute
[e7f001f2] *pgd=e7e1141e(bad)

So system call -1 was invoked. This may seem weird, but it is actually a legal value: when tracing system calls (such as with strace) the kernel will filter system calls, and the filter returns -1 to indicate that the system call should not be taken at all, i.e. “skipped”. My new generic entry code was not taking this properly into account, so the low-level assembly tried to vector -1 into a table and failed miserably, vectoring us off into the unknown.
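
To make the missing check concrete, here is a minimal sketch of the dispatch logic described above, using illustrative names rather than the actual arch/arm entry code: a negative syscall number coming back from the tracing/seccomp machinery must never be used as an index into the syscall table.

/* Illustrative sketch only -- not the actual ARM32 entry code. */
static long dispatch_syscall(struct pt_regs *regs, int scno)
{
	if (scno < 0)
		/* A tracer or seccomp filter returned -1: skip the
		 * syscall entirely and keep whatever return value was
		 * already set up for userspace. */
		return regs->ARM_r0;

	if (scno >= NR_syscalls)
		/* Unknown syscall number: report -ENOSYS. */
		return sys_ni_syscall();

	/* Only now is scno a safe index into the syscall table. */
	return invoke_syscall(regs, scno);
}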

At this point I could quickly patch up the code and call it a day.

I have no idea why sd-resolve turns on system call tracing by default, but it obviously does. It could be related to some seccomp security features, implemented as BPF programs that are called prior to every system call in the same code path; I think those need to intercept the system calls anyway. Not particularly efficient, but I suppose quite secure.

 
Read more...

from Jakub Kicinski

Developments in Linux kernel networking accomplished by many excellent developers and as remembered by Andrew L, Eric D, Jakub K and Paolo A.

Another busy year has passed so let us punctuate the never ending stream of development with a retrospective of our accomplishments over the last 12 months. The previous, 2023 retrospective covered changes from Linux v6.3 to v6.8; for 2024 we will cover Linux v6.9 to v6.13, one release fewer, as Linux releases don’t align with calendar years. We will focus on the work happening directly on the netdev mailing list, having neither space nor expertise to do justice to developments within sub-subsystems like WiFi, Bluetooth, BPF etc.

Core

After months of work and many patch revisions we have finally merged support for Device Memory TCP, which allows TCP payloads to be placed directly in accelerator (GPU, TPU, etc.) or user space memory while still using the kernel stack for all protocol (header) processing (v6.12). The immediate motivation for this work is obviously the GenAI boom, but some of the components built to enable Device Memory TCP, for example queue control API (v6.10), should be more broadly applicable.

The second notable area of development was busy polling. Additions to the epoll API allow enabling and configuring network busy polling on a per-epoll-instance basis, making the feature far easier to deploy in a single application (v6.9). Even more significant was the addition of a NAPI suspension mechanism which allows for efficient and automatic switching between busy polling and IRQ-driven operation, as most real life applications are not constantly running under highest load (v6.12). Once again the work was preceded by paying off technical debt, it is now possible to configure individual NAPI instances rather than an entire network interface (v6.13).

Work on relieving the rtnl_lock pressure has continued throughout the year. The rtnl_lock is often mentioned as one of the biggest global locks in the kernel, as it protects all of the network configuration and state. The efforts can be divided into two broad categories – converting read operations to rely on RCU protection or other fine grained locking (v6.9, v6.10), and splitting the lock into per-network namespace locks (preparations for which started in v6.13).

Following discussions during last year’s LPC, the Real Time developers have contributed changes which make network processing more RT-friendly by allowing all packet processing to be executed in dedicated threads, instead of the softirq thread (v6.10). They also replaced implicit Bottom Half protection (the fact that code in BH context can’t be preempted, or migrated between CPUs) with explicit local locks (v6.11).

The routing stack has seen a number of small additions for ECMP forwarding, which underpins all modern datacenter network fabrics. ECMP routing can now maintain per-path statistics to allow detecting unbalanced use of paths (v6.9), and to reseed the hashing key to remediate the poor traffic distribution (v6.11). The weights used in ECMP’s consistent hashing have been widened from 8 bits to 16 bits (v6.12).

The ability to schedule sending packets at a particular time in the future has been extended to survive network namespace traversal (v6.9), and now supports using the TAI clock as a reference (v6.11). We also gained the ability to explicitly supply the timestamp ID via a cmsg during a sendmsg call (v6.13).

The number of “drop reasons”, helping to easily identify and trace packet loss in the stack is steadily increasing. Reason codes are now also provided when TCP RST packets are generated (v6.10).

Protocols

The protocol development wasn’t particularly active in 2024. As we close off the year, 3 large protocol patch sets are being actively reviewed, but let us not steal 2025’s thunder, and limit ourselves to changes present in Linus’s tree by the end of 2024.

AF_UNIX socket family has a new garbage collection algorithm (v6.10). Since AF_UNIX supports file descriptor passing, sockets can hold references to each other, forming reference cycles etc. The old home-grown algorithm, which was a constant source of bugs, has been replaced by one with more theoretical backing (Tarjan’s algorithm).

TCP SYN cookie generation and validation can now be performed from the TC subsystem hooks, enabling scaling out SYN flood handling across multiple machines (v6.9). User space can peek into data queued to a TCP socket at a specified offset (v6.10). It is also now possible to set min_rto for all new sockets using a sysctl, a patch which was reportedly maintained downstream by multiple hyperscalers for years (v6.11).

UDP segmentation now works even if the underlying device doesn’t support checksum offload, e.g. TUN/TAP (v6.11). A new hash table was added for connected UDP sockets (4-tuple based), significantly speeding-up connected socket lookup (v6.13).

MPTCP gained TCP_NOTSENT_LOWAT support (v6.9), and automatic tracking of destinations which blackhole MPTCP traffic (v6.12).

IPsec stack now adheres to RFC 4301 when it comes to forwarding ICMP Error messages (v6.9).

Bonding driver supports independent control state machine in addition to the traditional coupled one, per IEEE 802.1AX-2008 5.4.15 (v6.9).

The GTP protocol gained IPv6 support (v6.10).

The High-availability Seamless Redundancy (HSR) protocol implementation gained the ability to work as a proxy node connecting non-HSR-capable nodes to an HSR network (RedBox mode) (v6.11).

The netconsole driver can attach arbitrary metadata to the log messages (v6.9).

The work on making Netlink easier to interface with in modern languages continued. The Netlink protocol descriptions in YAML can now express Netlink “polymorphism” (v6.9), i.e. a situation where parsing of one attribute depends on the value of another attribute (e.g. link type determines how link attributes are parsed). 7 new specs have been added, as well as a lot of small spec and code generation improvements. Sadly we still only have bindings/codegen for C, C++ and Python.

Device APIs

The biggest addition to the device-facing APIs in 2024 was the HW traffic shaping interface (v6.13). Over the years we have accumulated a plethora of single-vendor, single-use case rate control APIs. The new API promises to express most use cases, ultimately unifying the configuration from the user perspective. The immediate use for the new API is rate limiting traffic from a group of Tx queues. Somewhat related to this work was the revamp of the RSS context API which allows directing Rx traffic to a group of queues (v6.11, v6.12, v6.13). Together the HW rate limiting and RSS context APIs will hopefully allow container networking to leverage HW capabilities, without the need for complex full offloads.

A new API for reporting device statistics has been created (qstat) within the netdev netlink family (v6.9). It allows reporting more detailed driver-level stats than old interfaces, and breaking down the stats by Rx/Tx queue.

Packet processing in the presence of TC classifier offloads has been sped up: the software processing is now fully skipped if all rules are installed in HW-only mode (v6.10).

Ethtool gained support for flashing firmware to SFP modules, and configuring thresholds used by automatic IRQ moderation (v6.11). The most significant change to ethtool APIs in 2024 was, however, the ability to interact with multiple PHYs for a single network interface (v6.12).

Work continues on adding configuration interfaces for supplying power over network wiring. Ethtool APIs have been extended with Power over Ethernet (PoE) support (v6.10). The APIs have been extended to allow reporting more information about the devices and failure reasons, as well as setting power limits (v6.11).

Configuration of Energy Efficient Ethernet is being reworked because the old API did not have enough bits to cover new link modes (2.5GE, 5GE), but we also used this as an opportunity to share more code between drivers (especially those using phylib), and encourage more uniform behavior (v6.9).

Testing

2024 was the year of improving our testing. We spent the previous winter break building out an automated testing system, and have been running the full suite of networking selftests on all code merged since January. The pre-merge tests are catching roughly one bug a day.

We added a handful of simple libraries and infrastructure for writing tests in Python, crucially allowing easy use of Netlink YAML bindings, and supporting tests for NIC drivers (v6.10).

Later in the year we added native integration of packetdrill tests into kselftest, and started importing batches of tests from the packetdrill library (v6.12).

Community and process

The maintainers, developers and community members have met at two conferences, the netdev track at Linux Plumbers and netconf in Vienna, and the netdev.conf 0x18 conference in Santa Clara.

We have removed the historic requirement for special formatting of multi-line comments in netdev (although it is still the preferred style), documented our guidance on the use of automatic resource cleanup, as well as sending cleanup patches (such as “fixing” checkpatch warnings in existing code).

In April, we announced the redefinition of the “Supported” status for NIC drivers, to try to nudge vendors towards more collaboration and better testing. Whether this change has the desired effect remains to be seen.

Last but not least Andrew Lunn and Simon Horman have joined the netdev maintainer group.

6.9: https://lore.kernel.org/20240312042504.1835743-1-kuba@kernel.org
6.10: https://lore.kernel.org/20240514231155.1004295-1-kuba@kernel.org
6.11: https://lore.kernel.org/20240716152031.1288409-1-kuba@kernel.org
6.12: https://lore.kernel.org/20240915172730.2697972-1-kuba@kernel.org
6.13: https://lore.kernel.org/20241119161923.29062-1-pabeni@redhat.com

 
Read more...

from kees

Or, how to break all the tools that parse the “Fixes:” tag

Kees Cook

There was a recent discussion about how Linux's “Fixes” tag, which traditionally uses the 12 character commit SHA prefix, has an ever increasing chance of collisions. There are already 11-character collisions, and Geert wanted to raise the minimum short id to 16 characters. This was met with push-back for various reasons. One aspect that bothered me was some people still treating this like a theoretical “maybe in the future” problem. To clear up that problem, I generated a 12-character prefix collision against the start of Git history, commit 1da177e4c3f4 (“Linux-2.6.12-rc2”), which shows up in “Fixes” tags very often:

$ git log --no-merges --oneline --grep 'Fixes: 1da177e4c3f4' | wc -l
590

Tools like linux-next's “Fixes tag checker”, the Linux CNA's commit parser, and my own CVE lifetime analysis scripts do programmatic analysis of the “Fixes” tag and had no support for collisions (even shorter existing collisions).

So, in an effort to fix these tools, I broke them with commit 1da177e4c3f4 (“docs: git SHA prefixes are for humans”):

$ git show 1da177e4c3f4
error: short object ID 1da177e4c3f4 is ambiguous
hint: The candidates are:
hint:   1da177e4c3f41 commit 2005-04-16 - Linux-2.6.12-rc2
hint:   1da177e4c3f47 commit 2024-12-14 - docs: git SHA prefixes are for humans

This is not yet in the upstream Linux tree, for fear of breaking countless other tools out in the wild. But it can serve as a test commit for those that want to get this fixed ahead of any future collisions (or this commit actually landing).

Lots of thanks to the lucky-commit project, which will grind trailing commit message whitespace in an attempt to find collisions. Doing the 12-character prefix collision took about 6 hours on my OpenCL-enabled RTX 3080 GPU.

For any questions, comments, etc, see this thread.

 
Read more...

from linusw

In kernel v6.10 we managed to merge two security hardening patches to the ARM32 architecture:

  • PAN for LPAE CONFIG_CPU_TTBR0_PAN
  • KCFI on ARM32 CONFIG_CFI_CLANG

As of kernel v6.12 these seem sufficiently stable for users such as distributions and embedded systems to look closer at. Below are the technical details!

A good rundown of these and other historically interesting security features can be found in Russell Currey's abridged history of kernel hardening which sums up what has been done up to now in a very approachable form.

PAN for LPAE

PAN is an abbreviation for the somewhat grammatically incorrect Privileged Access Never.

The fundamental idea with PAN on different architectures is to disable any access from kernelspace to the userspace memory, unless explicitly requested using the dedicated accessors such as get_user()/put_user() and copy_from_user()/copy_to_user(). Attackers may want to compromise userspace from the kernel to access things such as keys, and we want to make this hard for them; in general it protects userspace memory from corruption from kernelspace.

In some architectures such as S390 the userspace memory is completely separate from the kernel memory, but most simpler CPUs will just map the userspace into low memory (address 0x00000000 and up) and there it is always accessible from the kernel.

The ARM32 kernel has for a few years had a config option named CONFIG_CPU_SW_DOMAIN_PAN which uses a hardware feature (memory domains) whereby userspace memory is made inaccessible from kernelspace. There is a special bit in the page descriptors saying that a certain page or segment etc belongs to userspace, so this is possible for the hardware to deduce.

For modern ARM32 systems with large memories configured to use LPAE nothing like PAN was available: this version of the MMU simply did not implement a PAN option.

As of the patch originally developed by Catalin Marinas, we deploy a scheme that uses the fact that LPAE has two separate translation table base registers (TTBRs): one for userspace (TTBR0) and one for kernelspace (TTBR1).

By simply disabling the use of any translations (page walks) on TTBR0 when executing in kernelspace – unless explicitly enabled in those user-access helpers – we achieve the same effect as PAN. This is now turned on by default for LPAE configurations.
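
As a sketch of what this protects against (illustrative kernel-style code, not taken from any real driver), consider a handler that receives a pointer to userspace memory:

static long example_ioctl(struct file *file, unsigned int cmd,
			  unsigned long arg)
{
	u32 val;

	/* With PAN enabled, directly dereferencing the userspace
	 * pointer from kernel context faults instead of silently
	 * reading user memory:
	 *
	 *	val = *(u32 __user *)arg;
	 */

	/* The dedicated accessor briefly re-enables userspace access
	 * (on LPAE, by re-allowing page walks through TTBR0) around
	 * the actual copy. */
	if (get_user(val, (u32 __user *)arg))
		return -EFAULT;

	return val;
}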

KCFI on ARM32

The Kernel Control Flow Integrity is a “forward edge control flow checker”, which in practice means that the compiler stores a hash of the function prototype in memory right in front of every function that can be called indirectly, and every indirect call site checks that hash before jumping, so that an attacker cannot easily redirect execution to an unintended call target.
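
Here is a minimal userspace simulation of that idea, purely to illustrate the mechanism; the real hashes and checks are emitted by clang as instructions at the function entry and at every indirect call site, not written by hand like this:

#include <stdio.h>
#include <stdint.h>

/* Pretend hash of the prototype "int (int)". */
#define PROTO_HASH_INT_INT 0x1badc0deU

struct kcfi_target {
	uint32_t proto_hash;	/* stored in front of the function */
	int (*fn)(int);
};

static int double_it(int x) { return 2 * x; }

static int checked_indirect_call(const struct kcfi_target *t, int arg)
{
	/* Forward-edge check: the callee's recorded prototype hash must
	 * match what this call site expects, otherwise trap. */
	if (t->proto_hash != PROTO_HASH_INT_INT) {
		fprintf(stderr, "CFI failure: unexpected call target\n");
		return -1;
	}
	return t->fn(arg);
}

int main(void)
{
	struct kcfi_target ok = { PROTO_HASH_INT_INT, double_it };
	printf("%d\n", checked_indirect_call(&ok, 21));	/* prints 42 */
	return 0;
}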

KCFI is currently only implemented in the LLVM CLANG compiler, so the kernel needs to be compiled using CLANG. This is typically achieved by passing the build flag LLVM=1 to the kernel build. As the CLANG compiler is universal for all targets, the build system will figure out the rest.

Further, to support KCFI a fairly recent version of CLANG is needed. The kernel build will check whether the compiler is new enough to support the option -fsanitize=kcfi; otherwise the option will be disabled.

The patch set is pretty complex but gives you an overview of how the feature was implemented on ARM32. It involved patching the majority of functions written in assembly and called from C with the special SYM_TYPED_FUNC_START() and SYM_FUNC_END() macros, inserting KCFI hashes also before functions written in assembly.

The overhead of this feature seems to be small so I recommend checking it out if you are able to use the CLANG compiler.

 
Read more...

from Gustavo A. R. Silva

The counted_by attribute

The counted_by attribute was introduced in Clang-18 and will soon be available in GCC-15. Its purpose is to associate a flexible-array member with a struct member that will hold the number of elements in this array at some point at run-time. This association is critical for enabling runtime bounds checking via the array bounds sanitizer and the __builtin_dynamic_object_size() built-in function. In user-space, this extra level of security is enabled by -D_FORTIFY_SOURCE=3. Therefore, using this attribute correctly enhances C codebases with runtime bounds-checking coverage on flexible-array members.

Here is an example of a flexible array annotated with this attribute:

struct bounded_flex_struct {
    ...
    size_t count;
    struct foo flex_array[] __attribute__((__counted_by__(count)));
};

In the above example, count is the struct member that will hold the number of elements of the flexible array at run-time. We will call this struct member the counter.

In the Linux kernel, this attribute facilitates bounds-checking coverage through fortified APIs such as the memcpy() family of functions, which internally use __builtin_dynamic_object_size() (CONFIG_FORTIFY_SOURCE), as well as through the array-bounds sanitizer (CONFIG_UBSAN_BOUNDS).
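
To get a feel for what this buys, here is a small userspace example (assuming a counted_by-capable compiler such as Clang 18 or newer, or GCC 15) where __builtin_dynamic_object_size() derives the size of the flexible array from the counter at run time:

#include <stdio.h>
#include <stdlib.h>

struct flex {
	size_t count;
	int data[] __attribute__((__counted_by__(count)));
};

int main(void)
{
	struct flex *p = malloc(sizeof(*p) + 8 * sizeof(int));

	if (!p)
		return 1;
	p->count = 8;	/* the counter must be set before data[] is used */

	/* Without counted_by this reports (size_t)-1, i.e. "unknown";
	 * with it, the bound 8 * sizeof(int) is known, which is what
	 * lets FORTIFY_SOURCE and UBSAN_BOUNDS flag out-of-bounds
	 * accesses to data[]. */
	printf("%zu\n", __builtin_dynamic_object_size(p->data, 1));

	free(p);
	return 0;
}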

The __counted_by() macro

In the kernel we wrap the counted_by attribute in the __counted_by() macro, as shown below.

#if __has_attribute(__counted_by__)
# define __counted_by(member)  __attribute__((__counted_by__(member)))
#else
# define __counted_by(member)
#endif
  • c8248faf3ca27 (“Compiler Attributes: counted_by: Adjust name...“)

And with this we have been annotating flexible-array members across the whole kernel tree over the last year.

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.h b/drivers/net/ethernet/chelsio/cxgb4/sched.h
index 5f8b871d79afac..6b3c778815f09e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sched.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.h
@@ -82,7 +82,7 @@ struct sched_class {
 
 struct sched_table {      /* per port scheduling table */
 	u8 sched_size;
-	struct sched_class tab[];
+	struct sched_class tab[] __counted_by(sched_size);
 };
  • ceba9725fb45 (“cxgb4: Annotate struct sched_table with ...“)

However, as we are about to see, not all __counted_by() annotations are always as straightforward as the one above.

__counted_by() annotations in the kernel

There are a number of requirements to properly use the counted_by attribute. One crucial requirement is that the counter must be initialized before the first reference to the flexible-array member. Another requirement is that the array must always contain at least as many elements as indicated by the counter. Below you can see an example of a kernel patch addressing these requirements.

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
index dac7eb77799bd1..68960ae9898713 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
@@ -33,7 +33,7 @@ struct brcmf_fweh_queue_item {
 	u8 ifaddr[ETH_ALEN];
 	struct brcmf_event_msg_be emsg;
 	u32 datalen;
-	u8 data[];
+	u8 data[] __counted_by(datalen);
 };
 
 /*
@@ -418,17 +418,17 @@ void brcmf_fweh_process_event(struct brcmf_pub *drvr,
 	    datalen + sizeof(*event_packet) > packet_len)
 		return;
 
-	event = kzalloc(sizeof(*event) + datalen, gfp);
+	event = kzalloc(struct_size(event, data, datalen), gfp);
 	if (!event)
 		return;
 
+	event->datalen = datalen;
 	event->code = code;
 	event->ifidx = event_packet->msg.ifidx;
 
 	/* use memcpy to get aligned event message */
 	memcpy(&event->emsg, &event_packet->msg, sizeof(event->emsg));
 	memcpy(event->data, data, datalen);
-	event->datalen = datalen;
 	memcpy(event->ifaddr, event_packet->eth.h_dest, ETH_ALEN);
 
 	brcmf_fweh_queue_event(fweh, event);
  • 62d19b358088 (“wifi: brcmfmac: fweh: Add __counted_by...“)

In the patch above, datalen is the counter for the flexible-array member data. Notice how the assignment to the counter event->datalen = datalen had to be moved to before the call to memcpy(event->data, data, datalen); this ensures the counter is initialized before the first reference to the flexible array. Otherwise, the compiler would complain about trying to write into a flexible array of size zero, due to datalen being zeroed out by a previous call to kzalloc(). This assignment-after-memcpy pattern has been quite common in the Linux kernel. However, when dealing with counted_by annotations, this pattern should be changed. Therefore, we have to be careful when doing these annotations. We should audit all instances of code that reference both the counter and the flexible array and ensure they meet the proper requirements.

In the kernel, we've been learning from our mistakes and have fixed some buggy annotations we made in the beginning. Here are a couple of bugfixes to make you aware of these issues:

  • 6dc445c19050 (“clk: bcm: rpi: Assign ->num before accessing...“)

  • 9368cdf90f52 (“clk: bcm: dvp: Assign ->num before accessing...“)

Another common issue is when the counter is updated inside a loop. See the patch below.

diff --git a/drivers/net/wireless/ath/wil6210/cfg80211.c b/drivers/net/wireless/ath/wil6210/cfg80211.c
index 8993028709ecfb..e8f1d30a8d73c5 100644
--- a/drivers/net/wireless/ath/wil6210/cfg80211.c
+++ b/drivers/net/wireless/ath/wil6210/cfg80211.c
@@ -892,10 +892,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	struct wil6210_priv *wil = wiphy_to_wil(wiphy);
 	struct wireless_dev *wdev = request->wdev;
 	struct wil6210_vif *vif = wdev_to_vif(wil, wdev);
-	struct {
-		struct wmi_start_scan_cmd cmd;
-		u16 chnl[4];
-	} __packed cmd;
+	DEFINE_FLEX(struct wmi_start_scan_cmd, cmd,
+		    channel_list, num_channels, 4);
 	uint i, n;
 	int rc;
 
@@ -977,9 +975,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	vif->scan_request = request;
 	mod_timer(&vif->scan_timer, jiffies + WIL6210_SCAN_TO);
 
-	memset(&cmd, 0, sizeof(cmd));
-	cmd.cmd.scan_type = WMI_ACTIVE_SCAN;
-	cmd.cmd.num_channels = 0;
+	cmd->scan_type = WMI_ACTIVE_SCAN;
+	cmd->num_channels = 0;
 	n = min(request->n_channels, 4U);
 	for (i = 0; i < n; i++) {
 		int ch = request->channels[i]->hw_value;
@@ -991,7 +988,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 			continue;
 		}
 		/* 0-based channel indexes */
-		cmd.cmd.channel_list[cmd.cmd.num_channels++].channel = ch - 1;
+		cmd->num_channels++;
+		cmd->channel_list[cmd->num_channels - 1].channel = ch - 1;
 		wil_dbg_misc(wil, "Scan for ch %d  : %d MHz\n", ch,
 			     request->channels[i]->center_freq);
 	}
@@ -1007,16 +1005,15 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 	if (rc)
 		goto out_restore;
 
-	if (wil->discovery_mode && cmd.cmd.scan_type == WMI_ACTIVE_SCAN) {
-		cmd.cmd.discovery_mode = 1;
+	if (wil->discovery_mode && cmd->scan_type == WMI_ACTIVE_SCAN) {
+		cmd->discovery_mode = 1;
 		wil_dbg_misc(wil, "active scan with discovery_mode=1\n");
 	}
 
 	if (vif->mid == 0)
 		wil->radio_wdev = wdev;
 	rc = wmi_send(wil, WMI_START_SCAN_CMDID, vif->mid,
-		      &cmd, sizeof(cmd.cmd) +
-		      cmd.cmd.num_channels * sizeof(cmd.cmd.channel_list[0]));
+		      cmd, struct_size(cmd, channel_list, cmd->num_channels));
 
 out_restore:
 	if (rc) {
diff --git a/drivers/net/wireless/ath/wil6210/wmi.h b/drivers/net/wireless/ath/wil6210/wmi.h
index 71bf2ae27a984f..b47606d9068c8b 100644
--- a/drivers/net/wireless/ath/wil6210/wmi.h
+++ b/drivers/net/wireless/ath/wil6210/wmi.h
@@ -474,7 +474,7 @@ struct wmi_start_scan_cmd {
 	struct {
 		u8 channel;
 		u8 reserved;
-	} channel_list[];
+	} channel_list[] __counted_by(num_channels);
 } __packed;
 
 #define WMI_MAX_PNO_SSID_NUM	(16)
  • 34c34c242a1b (“wifi: wil6210: cfg80211: Use __counted_by...“)

The patch above does a bit more than merely annotating the flexible array with the __counted_by() macro, but that's material for a future post. For now, let's focus on the following excerpt.

-	cmd.cmd.scan_type = WMI_ACTIVE_SCAN;
-	cmd.cmd.num_channels = 0;
+	cmd->scan_type = WMI_ACTIVE_SCAN;
+	cmd->num_channels = 0;
 	n = min(request->n_channels, 4U);
 	for (i = 0; i < n; i++) {
 		int ch = request->channels[i]->hw_value;
@@ -991,7 +988,8 @@ static int wil_cfg80211_scan(struct wiphy *wiphy,
 			continue;
 		}
 		/* 0-based channel indexes */
-		cmd.cmd.channel_list[cmd.cmd.num_channels++].channel = ch - 1;
+		cmd->num_channels++;
+		cmd->channel_list[cmd->num_channels - 1].channel = ch - 1;
 		wil_dbg_misc(wil, "Scan for ch %d  : %d MHz\n", ch,
 			     request->channels[i]->center_freq);
 	}
 ...
--- a/drivers/net/wireless/ath/wil6210/wmi.h
+++ b/drivers/net/wireless/ath/wil6210/wmi.h
@@ -474,7 +474,7 @@ struct wmi_start_scan_cmd {
 	struct {
 		u8 channel;
 		u8 reserved;
-	} channel_list[];
+	} channel_list[] __counted_by(num_channels);
 } __packed;

Notice that in this case, num_channels is our counter, and it's set to zero before the for loop. Inside the for loop, the original code used this variable as an index to access the flexible array, then updated it via a post-increment, all in one line: cmd.cmd.channel_list[cmd.cmd.num_channels++]. The issue is that once channel_list was annotated with the __counted_by() macro, the compiler enforces dynamic array indexing of channel_list to stay below num_channels. Since num_channels holds a value of zero at the moment of the array access, this leads to undefined behavior and may trigger a compiler warning.

As shown in the patch, the solution is to increment num_channels before accessing the array, and then access the array at index num_channels - 1, which keeps the index below the counter.

Another option is to avoid using the counter as an index for the flexible array altogether. This can be done by using an auxiliary variable instead. See an excerpt of a patch below.

diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h
index 38eb7ec86a1a65..21ebd70f3dcc97 100644
--- a/include/net/bluetooth/hci.h
+++ b/include/net/bluetooth/hci.h
@@ -2143,7 +2143,7 @@ struct hci_cp_le_set_cig_params {
 	__le16  c_latency;
 	__le16  p_latency;
 	__u8    num_cis;
-	struct hci_cis_params cis[];
+	struct hci_cis_params cis[] __counted_by(num_cis);
 } __packed;

@@ -1722,34 +1717,33 @@ static int hci_le_create_big(struct hci_conn *conn, struct bt_iso_qos *qos)
 
 static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 {
 ...

+	u8 aux_num_cis = 0;
 	u8 cis_id;
 ...

 	for (cis_id = 0x00; cis_id < 0xf0 &&
-	     pdu.cp.num_cis < ARRAY_SIZE(pdu.cis); cis_id++) {
+	     aux_num_cis < pdu->num_cis; cis_id++) {
 		struct hci_cis_params *cis;
 
 		conn = hci_conn_hash_lookup_cis(hdev, NULL, 0, cig_id, cis_id);
@@ -1758,7 +1752,7 @@ static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 
 		qos = &conn->iso_qos;
 
-		cis = &pdu.cis[pdu.cp.num_cis++];
+		cis = &pdu->cis[aux_num_cis++];
 		cis->cis_id = cis_id;
 		cis->c_sdu  = cpu_to_le16(conn->iso_qos.ucast.out.sdu);
 		cis->p_sdu  = cpu_to_le16(conn->iso_qos.ucast.in.sdu);
@@ -1769,14 +1763,14 @@ static int set_cig_params_sync(struct hci_dev *hdev, void *data)
 		cis->c_rtn  = qos->ucast.out.rtn;
 		cis->p_rtn  = qos->ucast.in.rtn;
 	}
+	pdu->num_cis = aux_num_cis;
 
 ...
  • ea9e148c803b (“Bluetooth: hci_conn: Use __counted_by() and...“)

Again, the entire patch does more than merely annotate the flexible-array member, but let's just focus on how aux_num_cis is used to access flexible array pdu->cis[].

In this case, the counter is num_cis. As in our previous example, originally, the counter is used to directly access the flexible array: &pdu.cis[pdu.cp.num_cis++]. However, the patch above introduces a new variable aux_num_cis to be used instead of the counter: &pdu->cis[aux_num_cis++]. The counter is then updated after the loop: pdu->num_cis = aux_num_cis.

Both solutions are acceptable, so use whichever is convenient for you. :)

Here, you can see a recent bugfix for some buggy annotations that missed the details discussed above:

  • [PATCH] wifi: iwlwifi: mvm: Fix _counted_by usage in cfg80211_wowlan_nd*

In a future post, I'll address the issue of annotating flexible arrays of flexible structures. Spoiler alert: don't do it!

Latest version: How to use the new counted_by attribute in C (and Linux)

 
Read more...

from Konstantin Ryabitsev

Message-ID's are used to identify and retrieve messages from the public-inbox archive on lore.kernel.org, so it's only natural to want to use memorable ones. Or maybe it's just me.

Regardless, here's what I do with neomutt and coolname:

  1. If coolname isn't yet packaged for your distro, you can install it with pip:

    pip install --user coolname
    
  2. Create this file as ~/bin/my-msgid.py:

    #!/usr/bin/python3
    import sys
    import random
    import string
    import datetime
    import platform
    
    from coolname import generate_slug
    
    parts = []
    parts.append(datetime.datetime.now().strftime('%Y%m%d'))
    parts.append(generate_slug(3))
    parts.append(''.join(random.choices(string.hexdigits, k=6)).lower())
    
    sys.stdout.write('-'.join(parts) + '@' + platform.node().split('.')[0])
    
  3. Create this file as ~/.mutt-fix-msgid:

    my_hdr Message-ID: <`/path/to/my/bin/my-msgid.py`>
    
  4. Add this to your .muttrc (works with mutt and neomutt):

    send-hook . "source ~/.mutt-fix-msgid"
    
  5. Enjoy funky message-id's like 20240227-flawless-capybara-of-drama-e09653@lemur. :)

 
Read more...

from Jakub Kicinski

Developments in Linux kernel networking accomplished by many excellent developers and as remembered by Andrew L, Eric D, Jakub K and Paolo A.

Intro

The end of the Linux v6.2 merge window coincided with the end of 2022, and as 2023 ends the v6.8 merge window has just begun, meaning that during 2023 we developed for 6 kernel releases (v6.3 – v6.8). Throughout those releases netdev patch handlers (DaveM, Jakub, Paolo) applied 7243 patches, and the resulting pull requests to Linus described the changes in 6398 words. Given the volume of work we cannot go over every improvement, or even cover networking sub-trees in much detail (BPF enhancements… wireless work on WiFi 7…). We instead try to focus on major themes, and developments we subjectively find interesting.

Core and protocol stack

Some kernel-wide winds of development have blown our way in 2023. In v6.5 we saw an addition of SCM_PIDFD and SO_PEERPIDFD APIs for credential passing over UNIX sockets. The APIs duplicate existing ones but are using pidfds rather than integer PIDs. We have also seen a number of real-time related patches throughout the year.

v6.5 has brought a major overhaul of the socket splice implementation. Instead of feeding data into sockets page by page via a .sendpage callback, the socket .sendmsg handlers were extended to allow taking a reference on the data in struct msghdr. Continuing with the category of “scary refactoring work” we have also merged an overhaul of locking in two subsystems – the wireless stack and devlink.

Early in the year we saw the tail end of the BIG TCP development (the ability to send chunks of more than 64kB of data through the stack at a time). v6.3 added support for BIG TCP over IPv4; the initial implementation in 2021 supported only IPv6, as the IPv4 packet header has no way of expressing lengths which don’t fit on 16 bits. The v6.4 release also made the size of the “page fragment” array in the skb configurable at compilation time. A larger array increases the packet metadata size, but also increases the chances of being able to use BIG TCP when data is scattered across many pages.

Networking needs to allocate (and free) packet buffers at a staggering rate, and we see a continuous stream of improvements in this area. Most of the work these days centers on the page_pool infrastructure. v6.5 enabled recycling freed pages back to the pool without using any locks or atomic operations (when recycling happens in the same softirq context in which we expect the allocator to run). v6.7 reworked the API making allocation of arbitrary-size buffers (rather than pages) easier, also allowing removal of PAGE_SIZE-dependent logic from some drivers (16kB pages on ARM64 are increasingly important). v6.8 added uAPI for querying page_pool statistics over Netlink. Looking forward – there’s ongoing work to allow page_pools to allocate either special (user-mapped, or huge page backed) pages or buffers without struct page (DMABUF memory). In the non-page_pool world – a new slab cache was also added to avoid having to read struct page associated with the skb heads at freeing time, avoiding potential cache misses.

A number of key networking data structures (skb, netdevice, page_pool, sock, netns, mibs, nftables, fq scheduler) have been reorganized to optimize cacheline consumption and avoid cache misses. This reportedly improved TCP RPC performance with many connections on some AMD systems by as much as 40%.

In v6.7 the commonly used Fair Queuing (FQ) packet scheduler has gained built-in support for 3 levels of priority and ability to bypass queuing completely if the packet can be sent immediately (resulting in a 5% speedup for TCP RPCs).

Notable TCP developments this year include TCP Auth Option (RFC 5925) support, support for microsecond resolution of timestamps in the TimeStamp Option, and ACK batching optimizations.

Multi-Path TCP (MPTCP) is slowly coming to maturity, with most development effort focusing on reducing the features gap with plain TCP in terms of supported socket options, and increasing observability and introspection via native diag interface. Additionally, MPTCP has gained eBPF support to implement custom packet schedulers and simplify the migration of existing TCP applications to the multi-path variant.

Transport encryption continues to be very active as well. An increasing number of NICs support some form of crypto offload (TLS, IPsec, MACsec). This year notably we gained in-kernel users (NFS, NVMe, i.e. storage) of TLS encryption. Because the kernel doesn’t have support for performing the TLS handshake by itself, a new mechanism was developed to hand over kernel-initiated TCP sockets to user space temporarily, where a well-tested user space library like OpenSSL or GnuTLS can perform a TLS handshake and negotiation, and then hand the connection back over to the kernel, with the keys installed.

The venerable bridge implementation has gained a few features. The majority of bridge development these days is driven by offloads (controlling hardware switches) and, in the case of data center switches, EVPN support. Users can now limit the number of FDB and MDB auto-learned entries and selectively flush them in both bridge and VxLAN tunnels. v6.5 added the ability to selectively forward packets to VxLAN tunnels depending on whether they had missed the FDB in the lower bridge.

Among changes which may be more immediately visible to users – starting from v6.5 the IPv6 stack no longer prints the “link becomes ready” message when interface is brought up.

The AF_XDP zero-copy sockets have gained two major features in 2023. In v6.6 we gained multi-buffer support which allows transferring packets which do not fit in a single buffer (scatter-gather). v6.8 added Tx metadata support, enabling NIC Tx offloads on packets sent on AF_XDP sockets (checksumming, segmentation) as well as timestamping.

Early in the year we merged specifications and tooling for describing Netlink messages in YAML format. This work has grown to cover most major Netlink families (both legacy and generic). The specs are used to generate kernel ops/parsers, the uAPI headers, and documentation. User space can leverage the specs to serialize/deserialize Netlink messages without having to manually write parsers (C and Python have the support so far).

Device APIs

Apart from describing existing Netlink families, the YAML specs were put to use in defining new APIs. The “netdev” family was created to expose network device internals (BPF/XDP capabilities, information about device queues, NAPI instances, interrupt mapping etc.)

In the “ethtool” family – v6.3 brought APIs for configuring Ethernet Physical Layer Collision Avoidance (PLCA) (802.3cg-2019, a modern version of shared medium Ethernet) and the MAC Merge layer (IEEE 802.3-2018 clause 99, allowing preemption of low priority frames by high priority frames).

After many attempts we have finally gained solid integration between the networking and the LED subsystems, allowing hardware-driven blinking of LEDs on Ethernet ports and SFPs to be configured using Linux LED APIs. Driver developers are working through the backlog of all devices which need this integration.

In general, automotive Ethernet-related contributions grew significantly in 2023, and with it, more interest in “slow” networking like 10Mbps over a single pair. Although the Data Center tends to dominate Linux networking events, the community as a whole is very diverse.

Significant development work went into refactoring and extending time-related networking APIs. Time stamping and time-based scheduling of packets has wide use across network applications (telcos, industrial networks, data centers). The most user visible addition is likely the DPLL subsystem in v6.7, used to configure and monitor atomic clocks and machines which need to forward clock phase between network ports.

Last but not least, late in the year the networking subsystem gained the first Rust API, for writing PHY drivers, as well as a driver implementation (duplicating an existing C driver, for now).

Removed

Inspired by the returning discussion about code removal at the Maintainer Summit let us mention places in the networking subsystem where code was retired this year. First and foremost in v6.8 wireless maintainers removed a lot of very old WiFi drivers, earlier in v6.3 they have also retired parts of WEP security. In v6.7 some parts of AppleTalk have been removed. In v6.3 (and v6.8) we have retired a number of packet schedulers and packet classifiers from the TC subsystem (act_ipt, act_rsvp, act_tcindex, sch_atm, sch_cbq, sch_dsmark). This was partially driven by an influx of syzbot and bug-bounty-driven security reports (there are many ways to earn money with Linux, turns out 🙂). Finally, the kernel parts of the bpfilter experiment were removed in v6.8, as the development effort had moved to user space.

Community & process

The maintainers, developers and community members had a chance to meet at the BPF/netdev track at Linux Plumbers in Richmond, and the netdev.conf 0x17 conference in Vancouver. 2023 was also the first time since the COVID pandemic when we organized the small netconf gathering – thanks to Meta for sponsoring and Kernel Recipes for hosting us in Paris!

We have made minor improvements to the mailing list development process by allowing a wider set of folks to update patch status using simple “mailbot commands”. Patch authors and anyone listed in MAINTAINERS for file paths touched by a patch series can now update the submission state in patchwork themselves.

The per-release development statistics, started late in the previous year, are now an established part of the netdev process, marking the end of each development cycle. They proved to be appreciated by the community and, more importantly, to somewhat steer some of the less participatory citizens towards better and more frequent contributions, especially on the review side.

A small but growing number of silicon vendors have started to try to mainline drivers without having the necessary experience, or mentoring needed to effectively participate in the upstream process. Some without consulting any of our documentation, others without consulting teams within their organization with more upstream experience. This has resulted in poor quality patch sets, taken up valuable time from the reviewers and led to reviewer frustration.

Much like the kernel community at large, we have been steadily shifting the focus on kernel testing, or integrating testing into our development process. In the olden days the kernel tree did not carry many tests, and testing had been seen as something largely external to the kernel project. The tools/testing/selftests directory was only created in 2012, and lib/kunit in 2019! We have accumulated a number of selftests for networking over the years; in 2023 there were multiple large selftest refactoring and speed-up efforts. Our netdev CI started running all kunit tests and networking selftests on posted patches (although, to be honest, the selftest runner only started working in January 2024 🙂).

syzbot stands out among “external” test projects which are particularly valuable for networking. We have fixed roughly 200 syzbot-reported bugs. This took a significant amount of maintainer work but in general we find syzbot bug reports to be useful, high quality and a pleasure to work on.

6.3: https://lore.kernel.org/all/20230221233808.1565509-1-kuba@kernel.org/
6.4: https://lore.kernel.org/all/20230426143118.53556-1-pabeni@redhat.com/
6.5: https://lore.kernel.org/all/20230627184830.1205815-1-kuba@kernel.org/
6.6: https://lore.kernel.org/all/20230829125950.39432-1-pabeni@redhat.com/
6.7: https://lore.kernel.org/all/20231028011741.2400327-1-kuba@kernel.org/
6.8: https://lore.kernel.org/all/20240109162323.427562-1-pabeni@redhat.com/

 

from arnd

Most compilers have an option to warn about a function that has a global definition but no declaration: gcc has had -Wmissing-prototypes as far back as the 1990s, and the sparse checker introduced -Wdecl back in 2005. Ensuring that each function has a declaration helps validate that the caller and the callee expect the same argument types; it can help find unused functions, and it helps mark functions as static where possible to improve inter-function optimizations.
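
As a minimal, made-up illustration of the pattern (foo_compute() is a hypothetical name, not taken from any of the actual fixes): a function defined with external linkage but no visible prototype triggers the warning, and the usual remedies are either declaring it in a header shared with its callers or marking it static when it is only used locally.

/* foo.c: with -Wmissing-prototypes, gcc warns roughly
 *   "no previous prototype for 'foo_compute' [-Wmissing-prototypes]"
 */
int foo_compute(int x)
{
	return x * 2;
}

/* Fix 1: declare it in a header included by both foo.c and the callers */
/* foo.h */
int foo_compute(int x);

/* Fix 2: if foo.c is the only user, give the function internal linkage */
static int foo_compute(int x)
{
	return x * 2;
}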

The warnings are not enabled in a default build, but they are part of both the make W=1 and make C=1 builds; in fact, they used to account for most of the W=1 output. As a number of subsystems have moved to eliminating all the W=1 warnings in their code, and the 0-day bot warns about newly introduced warnings, the amount of warning output from this source has gone down over time.

After I saw a few patches addressing individual warnings in this area, I had a look at what actually remains. For my soc tree maintenance, I already run my own build bot that checks the output of “make randconfig” builds for 32-bit and 64-bit arm as well as x86, and I apply local bugfixes to address any warning or error I get. I then enabled -Wmissing-prototypes unconditionally and added patches to address every single new warning I found, around 140 in total.

I uploaded the patches to https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=missing-prototypes and am sending them to the respective maintainers separately. Once all of these, or some other fix for each warning, have been merged into the mainline kernel, the warning option can be moved from W=1 to the default set.

The patches are all independent of one another, so I hope that most of them can get applied to subsystems directly as soon as I post them.

Some of the remaining architectures are already clean, while others will need follow-up patches. Another possible follow-up is to also address -Wmissing-variable-declarations warnings. That option is understood by clang but not currently enabled by the kernel build system, and it is not implemented by gcc, with the corresponding feature request open since 2017.

 

from Jakub Kicinski

LWN's development statistics have been published at the end of each release cycle for as long as I can remember (Linux 6.3 stats). Thinking back, I can divide the stages of my career based on my relationship with those stats: fandom; aspiring; success; cynicism; professionalism (showing the stats to my manager). The last one gave me the most pause.

Developers will agree (I think) that patch count is not a great metric for the value of the work. Yet, most of my managers had a distinct spark in their eye when I shared the fact that some random API refactoring landed me in the top 10.

Understanding the value of independently published statistics, and putting in the work to calculate them release after release, is one of the many things we should be thankful to LWN for.

Local stats

With that in mind it's only logical to explore calculating local subsystem statistics. Global kernel statistics can only go so far. The top 20 can only, by definition, highlight the work of 20 people, and we have thousands of developers working on each release. The networking list alone sees around 700 people participate in discussions for each release.

Another relatively recent development which opens up opportunities is the creation of the lore archive – specifically, how easy it is now to download and process any mailing list's history. LWN stats are generated primarily based on git logs. Without going into too much of a sidebar – if we care about the kernel community, not how much code various corporations can ship into the kernel – mailing list data mining is a better approach than git data mining. Global mailing list stats would be a challenge, but subsystems are usually tied to a single list.
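
As a rough sketch of the download side (the epoch numbering varies per list and is shown on each list's lore page, so treat the exact URL as an approximation): lore exposes every list as a set of public-inbox git repositories, so the complete netdev history can be mirrored with plain git and then parsed with whatever tooling one likes:

git clone --mirror https://lore.kernel.org/netdev/0 netdev-archive/0.git

Each commit in such a repository adds a single message, which is what makes counting tags, authors and reviewers per release fairly straightforward.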

netdev stats

During the 6.1 merge window I could no longer resist the temptation and I threw some Python and the lore archive of netdev into a blender. My initial goal was to highlight the work of people who review patches, rather than only ship code or bombard the mailing list with trivial patches of varying quality. I compiled stats for the last 4 release cycles (6.1, 6.2, 6.3, and 6.4), each with more data and metrics. Kernel developers are, outside of matters relating to their code, generally quiet beasts, so I haven't received a ton of feedback. If we trust the statistics themselves, however, the share of patches applied directly by networking maintainers that carry review tags has increased from around 30% to an unbelievable 65%.

We've also seen a significant decrease in the number of trivial patches sent by semi-automated bots (possibly to game the git-based stats). It may be a result of other pushback against such efforts, so I can't take full credit :)

Random example

I should probably give some more example stats. The individual and company stats generated for netdev are likely not that interesting to a reader outside of netdev, but perhaps the “developer tenure” stats will be. I calculated those to see whether we have a healthy number of new members.

Time since first commit in the git history for reviewers
 0- 3mo   |   2 | *
 3- 6mo   |   3 | **
6mo-1yr   |   9 | *******
 1- 2yr   |  23 | ******************
 2- 4yr   |  33 | ##########################
 4- 6yr   |  43 | ##################################
 6- 8yr   |  36 | #############################
 8-10yr   |  40 | ################################
10-12yr   |  31 | #########################
12-14yr   |  33 | ##########################
14-16yr   |  31 | #########################
16-18yr   |  46 | #####################################
18-20yr   |  49 | #######################################

Time since first commit in the git history for authors
 0- 3mo   |  40 | **************************
 3- 6mo   |  15 | **********
6mo-1yr   |  23 | ***************
 1- 2yr   |  49 | ********************************
 2- 4yr   |  47 | ###############################
 4- 6yr   |  50 | #################################
 6- 8yr   |  31 | ####################
 8-10yr   |  33 | #####################
10-12yr   |  19 | ############
12-14yr   |  25 | ################
14-16yr   |  22 | ##############
16-18yr   |  32 | #####################
18-20yr   |  31 | ####################

As I shared on the list – the “recent” buckets are sparse for reviewers and more filled for authors, as expected. What I haven't said is that if one steps away from the screen and looks at the general shape of the histograms, things are not perfect. The author and the reviewer histograms seem to skew in opposite directions. I'll leave it to the reader to ponder what the perfect shape of such a graph would be for a project; I have my hunch. Regardless, I'm hoping we can learn something by tracking its changes over time.

Fin

To summarize – I think that spending a day in each release cycle hacking on and generating development stats for the community is a good investment of a maintainer's time. They let us show appreciation, check our own biases and, by carefully selecting the metrics, encourage good behavior. My hacky code is available on GitHub, FWIW, but using mine may go against the benefits of locality? LWN's code is also available publicly (search for gitdm, IIRC).

 