# Christian Brauner

## The Seccomp Notifier – Cranking up the crazy with bpf()

In my last article I looked at the seccomp notifier in detail and how it allows us to make unprivileged containers way more capable (Sorry, kernel joke.). This is the (very) crazy (but very short) sequel. (Sorry Jon, no novella this time. :))

Last time I mentioned two new features that we had landed:

The 2. feature just landed in the merge window for v5.9. So what better time than now to boot a v5.9 pre-rc1 kernel and play with the new features.

I said that these features make it possible to intercept syscalls that return file descriptors or that pass file descriptors to the kernel. Syscalls that come to mind are open(), connect(), dup2(), but also bpf(). People that read the first blogpost might not have realized how crazy^serious one can get with these two new features so I thought it be a good exercise to illustrate it. And what better victim than bpf().

As we know, bpf() and unprivileged containers don't get along too well. But that doesn't need to be the case. For the demo you're about to see I enabled LXD to supervise the bpf() syscalls for tasks running in unprivileged containers. We will intercept the bpf() syscalls for the BPF_PROG_LOAD command for BPF_PROG_TYPE_CGROUP_DEVICE program types and the BPF_PROG_ATTACH, and BPF_PROG_DETACH commands for the BPF_CGROUP_DEVICE attach type. This allows a nested unprivileged container to load its own device profile in the cgroup2 hierarchy.

This is just a tiny glimpse into how this can be used and extended. ;) The pull request for LXD is already up here. Let's see if the rest of the team thinks I'm going crazy. :)

## Seccomp Notify – New Frontiers in Unprivileged Container Development

#### Introduction

As most people know by know we do a lot of upstream kernel development. This stretches over multiple areas and of course we also do a lot of kernel work around containers. In this article I'd like to take a closer look at the new seccomp notify feature we have been developing both in the kernel and in userspace and that is seeing more and more users. I've talked about this feature quite a few times at various conferences (just recently again at OSS NA) over the last two years but never actually sat down to write a blogpost about it. This is something I had wanted to do for quite some time. First, because it is a very exciting feature from a purely technical perspective but also from the new possibilities it opens up for (unprivileged) containers and other use-cases.

#### The Limits of Unprivileged Containers

That (Linux) Containers are a userspace fiction is a well-known dictum nowadays. It simply expresses the fact that there is no container kernel object in the Linux kernel. Instead, userspace is relatively free to define what a container is. But for the most part userspace agrees that a container is somehow concerned with isolating a task or a task tree from the host system. This is achieved by combining a multitude of Linux kernel features. One of the better known kernel features that is used to build containers are namespaces. The number of namespaces the kernel supports has grown over time and we are currently at eight. Before you go and look them up on namespaces(7) here they are:

• cgroup: cgroup_namespaces(7)
• ipc: ipc_namespaces(7)
• network: network_namespaces(7)
• mount: mount_namespaces(7)
• pid: pid_namespaces(7)
• time: time_namespaces(7)
• user: user_namespaces(7)
• uts: uts_namespaces(7)

Of these eight namespaces the user namespace is the only one concerned with isolating core privilege concepts on Linux such as user- and group ids, and capabilities.

Quite often we see tasks in userspace that check whether they run as root or whether they have a specific capability (e.g. CAP_MKNOD is required to create device nodes) and it seems that when the answer is “yes” then the task is actually a privileged task. But as usual things aren't that simple. What the task thinks it's checking for and what the kernel really is checking for are possibly two very different things. A naive task, i.e. a task not aware of user namespaces, might think it's asking whether it is privileged with respect to the whole system aka the host but what the kernel really checks for is whether the task has the necessary privileges relative to the user namespace it is located in.

In most cases the kernel will not check whether the task is privileged with respect to the whole system. Instead, it will almost always call a function called ns_capable() which is the kernel's way of checking whether the calling task has privilege in its current user namespace.

For example, when a new user namespace is created by setting the CLONE_NEWUSER flag in unshare(2) or in clone3(2) the kernel will grant a full set of capabilities to the task that called unshare(2) or the newly created child task via clone3(2) within the new user namespace. When this task now e.g. checks whether it has the CAP_MKNOD capability the kernel will report back that it indeed has that capability. The key point though is that this “yes” is not a global “yes”, i.e. the question “Am I privileged enough to perform this operation?” only applies to the current user namespace (and technically any nested user namespaces) not the host itself.

This distinction is important when trying to understand why a task running as root in a new user namespace with all capabilities raised will still see EPERM when e.g. trying to call mknod("/dev/mem", makedev(1, 1)) even though it seems to have all necessary privileges. The reason for this counterintuitive behavior is that the kernel isn't always checking whether you are privileged against your current user namespace. Instead, for any operation that it thinks is dangerous to expose to unprivileged users it will check whether the task is privileged in the initial user namespace, i.e. the host's user namespace.

Creating device nodes is one such example: if a task running in a user namespace were to be able to create character or block device nodes it could e.g. create /dev/kmem or any other critical device and use the device to take over the host. So the kernel simply blocks creating all device nodes in user namespaces by always performing the check for required privileges against the initial user namespace. This is of course technically inconsistent since capabilities are per user namespace as we observed above.

Other examples where the kernel requires privileges in the initial user namespace are mounting of block devices. So simply making a disk device node available to an unprivileged container will still not make it useable since it cannot mount it. On the other hand, some filesystems like cgroup, cgroup2, tmpfs, proc, sysfs, and fuse can be mounted in user namespace (with some caveats for proc and sys but we're ignoring those details for now) because the kernel can guarantee that this is safe.

But of course these restrictions are annoying. Not being able to mount block devices or create device nodes means quite a few workloads are not able to run in containers even though they could be made to run safely. Quite often a container manager like LXD will know better than the kernel when an operation that a container tries to perform is safe.

A good example are device nodes. Most containers bind-mount the set of standard devices into the container otherwise it would not work correctly:

/dev/console
/dev/full
/dev/null
/dev/random
/dev/tty
/dev/urandom
/dev/zero


Allowing a container to create these devices would be safe. Of course, the container will simply bind-mount these devices during container startup into the container so this isn't really a serious problem. But any program running inside the container that wants to create these harmless devices nodes would fail.

The other example that was mentioned earlier is mounting of block-based filesystems. Our users often instruct LXD to make certain disk devices available to their containers because they know that it is safe. For example, they could have a dedicated disk for the container or they want to share data with or among containers. But the container could not mount any of those disks.

For any use-case where the administrator is aware that a device node or disk device is missing from the container LXD provides the ability to hotplug them into one or multiple containers. For example, here is how you'd hotplug /dev/zero into a running container:

 brauner@wittgenstein|~
> lxc exec f5 -- ls -al /my/zero

brauner@wittgenstein|~
> lxc config device add f5 zero-device unix-char source=/dev/zero path=/my/zero

brauner@wittgenstein|~
> lxc exec f5 -- ls -al /my/zero
crw-rw---- 1 root root 1, 5 Jul 23 10:47 /my/zero


But of course, that doesn't help at all when a random application inside the container calls mknod(2) itself. In these cases LXD has no way of helping the application by hotplugging the device as it's unaware that a mknod syscall has been performed.

So the root of the problem seems to be: – A task inside the container performs a syscall that will fail. – The syscall would not need to fail since the container manager knows that it is safe. – The container manager has no way of knowing when such a syscall is performed. – Even if the the container manager would know when such a syscall is performed it has no way of inspecting it in detail.

So a potential solution to this problem seems to be to enable the container manager or any sufficiently privileged task to take action on behalf of the container whenever it performs a syscall that would usually fail. So somehow we need to be able to interact with the syscalls of another task.

#### Seccomp – The Basics of Syscall Interception

The obvious candidate to look at is seccomp. Short for “secure computing” it provides a way of restricting the syscalls of a task either by allowing only a subset of the syscalls the kernel supports or by denying a set of syscalls it thinks would be unsafe for the task in question. But seccomp allows even more advanced configurations through so-called “filters”. Filters are BPF programs (Not to be equated with eBPF. BPF is a predecessor of eBPF.) that can be written in userspace and loaded into the kernel. For example, a task could use a seccomp filter to only allow the mount() syscall and only those mount syscalls that create bind mounts. This simple syscall management mechanism has made seccomp an essential security feature for a lot of userspace programs. Nowadays it is considered good practice to restrict any critical programs to only those syscalls it absolutely needs to run successfully. Browser-based sandboxes and containers being prime examples but even systemd services can be seccomp restricted.

At its core seccomp is nothing but a syscall interception mechanism. One way or another every operating system has something that is at least roughly comparable. The way seccomp works is that it intercepts syscalls right in the architecture specific syscall entry paths. So the seccomp invocations themselves live in the architecture specific codepaths although most of the logical around it is architecture agnostic.

Usually, when a syscall is performed, and no seccomp filter has been applied to the task issuing the syscall the kernel will simply lookup the syscall number in the architecture specific syscall table and if it is a known syscall will perform it reporting back the result to userspace.

But when a seccomp filter is loaded for the task issuing the syscall instead of directly looking up the syscall number in the architecture's syscall table the kernel will first call into seccomp and run the loaded seccomp filter.

Depending on whether a deny or allow approach is used for the seccomp filter any syscall that the filter is not handling specifically is either performed or denied reporting back a specified default value to the calling task. If the requested syscall is supposed to be specifically handled by the seccomp filter the kernel can e.g. be caused to report back a specific error code. This way, it is for example possible to have the kernel pretend like it doesn't know the mount(2) syscall by creating a seccomp filter that reports back ENOSYS whenever the task tries to call mount(2).

But the way seccomp used to work isn't very dynamic. Specifically, once a filter is loaded the decision whether or not the syscall is successful or not is fixed based on the policy expressed by the filter. So there is no way to make a case-by-case decision which might come in handy in some scenarios.

In addition seccomp itself can't make a syscall actually succeed other than in the trivial way of reporting back success to the caller. So seccomp will only allow the kernel to pretend that a syscall succeeded. So while it is possible to instruct the kernel to return 0 for the mount(2) syscall it cannot actually be instructed to make the mount(2) syscall succeed. So just making the seccomp filter return 0 for mounting a dedicated ext4 disk device to /mnt will still not actually mount it at /mnt; it just pretends to the caller that it did. Of course that is in itself already a useful property for a bunch of use-cases but it doesn't really help with the mknod(2) or mount(2) problem outlined above.

#### Extending Seccomp

So from the section above it should be clear that seccomp provides a few desirable properties that make it a natural candiate to look at to help solve our mknod(2) and mount(2) problem. Since seccomp intercepts syscalls early in the syscall path it already gives us a hook into the syscall path of a given task. What is missing though is a way to bring another task such as the LXD container manager into the picture. Somehow we need to modify seccomp in a way that makes it possible for a container manager to not just be informed when a task inside the container performs a syscall it wants to be informed about but also how to make it possible to block the task until the container manager instructs the kernel to allow it to proceed.

The answer to these questions is seccomp notify. This is as good a time as any to bring in some historical context. The exact origins of the idea for a more dynamic way to intercept syscalls is probably not recoverable and it has been thrown around in unspecific form in various discussions but nothing serious every materialized. The first concrete details around seccomp notify were conceived in early 2017 in the LXD team. The first public talk around the basic idea for this feature was given by Stéphane Graber at the Linux Plumbers Conference 2017 during the Container's Microconference in Los Angeles. The details of this talk are still listed here here and I'm sure Stéphane can still provide the slides we came up with. I didn't find a video recording even though I somehow thought we did have one. If someone is really curious I can try to investigate with the Linux Plumbers committee. After this talk implementation specifics were discussed in a hallway meeting later that day. And after a long arduous journey the implementation was upstreamed by Tycho Andersen who used to be on the LXD team. The rest is history^wchangelog.

#### Seccomp Notify – Syscall Interception 2.0

In its essence, the seccomp notify mechanism is simply a file descriptor (fd) for a specific seccomp filter. When a container starts it will usually load a seccomp filter to restrict its attack surface. That is even done for unprivileged containers even though it is not strictly necessary.

With the addition of seccomp notify a container wishing to have a subset of syscalls handled by another process can set the new SECCOMP_RET_USER_NOTIF flag on its seccomp filter. This flag instructs the kernel to return a file descriptor to the calling task after having loaded its filter. This file descriptor is a seccomp notify file descriptor.

Of course, the seccomp notify fd is not very useful to the task itself. First, since it doesn't make a lot of sense apart from very weird use-cases for a task to listen for its own syscalls. Second, because the task would likely block itself indefinitely pretty quickly without taking extreme care.

But what the task can do with the seccomp notify fd is to hand to another task. Usually the task that it will hand the seccomp notify fd to will be more privileged than itself. For a container the most obvious candidate would be the container manager of course.

Since the seccomp notify fd is pollable it is possible to put it into an event loop such as epoll(7), poll(2), or select(2) and wait for the file descriptor to become readable, i.e. for the kernel to return EPOLLIN to userspace. For the seccomp notify fd to become readable means that the seccomp filter it refers to has detected that one of the tasks it has been applied to has performed a syscall that is part of the policy it implements. This is a complicated way of saying the kernel is notifying the container manager that a task in the container has performed a syscall it cares about, e.g. mknod(2) or mount(2).

Put another way, this means the container manager can listen for syscall events for tasks running in the container. Now instead of simply running the filter and immediately reporting back to the calling task the kernel will send a notification to the container manager on the seccomp notify fd and block the task performing the syscall.

After the seccomp notify fd indicates that it is readable the container manager can use the new SECCOMP_IOCTL_NOTIF_RECV ioctl() associated with seccomp notify fds to read a struct seccomp_notif message for the syscall. Currently the data to be read from the seccomp notify fd includes the following pieces. But please be aware that we are in the process of discussing potentially intrusive changes for future versions:

struct seccomp_notif {
__u64 id;
__u32 pid;
__u32 flags;
struct seccomp_data data;
};


Let's look at this in a little more detail. The pid field is the pid of the task that performed the syscall as seen in the caller's pid namespace. To stay within the realm of our current examples, this is simply the pid of the task in the container the e.g. called mknod(2) as seen in the pid namespace of the container manager. The id field is a unique identifier for the performed syscall. This can be used to verify that the task is still alive and the syscall request still valid to avoid any race conditions caused by pid recycling. The flags argument is currently unused and reserved for future extensions.

The struct seccomp_data argument is probably the most interesting one as it contains the really exciting bits and pieces:

struct seccomp_data {
int nr;
__u32 arch;
__u64 instruction_pointer;
__u64 args[6];
};


The int field is the syscall number which can only be correctly interpreted relative to the arch field. The arch field is the (audit) architecture for which this syscall was made. This field is very relevant since compatible architectures (For the x86 architectures this encompasses at least x32, i386, and x86_64. The arm, mips, and power architectures also have compatible “sub” architectures.) are stackable and the returned syscall number might be different than the current headers imply (For example, you could be making a syscall from a 32bit userspace on a 64bit kernel. If the intercepted syscall has different syscall numbers on 32 bit and on 64bit, for example syscall foo() might have syscall number 1 on 32 bit and 2 on 64 bit. So the task reading the seccomp data can't simply assume that since it itself is running in a 32 bit environment the syscall number must be 1. Rather, it must check what the audit arch is and then either check that the value of the syscall is 1 on 32 bit and 2 on 64 bit. Otherwise the container manager might end up emulating mount() when it should be emulating mknod().). The instruction_pointer is set to the address of the instruction that performed the syscall. This is of course also architecture specific. And last the args member are the syscall arguments that the task performed the syscall with.

The args need to be interpreted and treated differently depending on the syscall layout and their type. If they are non-pointer arguments (unsigned int etc.) they can be copied into a local variable and interpreted right away. But if they are pointer arguments they are offsets into the virtual memory of the task that performed the syscall. In the latter case the memory needs to be read and copied before it can be interpreted.

Let's look at a concrete example to figure out why it is vital to know the syscall layout other than for knowing the types of the syscall arguments. Say the performed syscall was mount(2). In order to interpret the args field correctly we look at the syscall layout of mount(). (Please note, that I'm stressing that we need to look at the layout of syscall and the only reliable source for this is actually the kernel source code. The Linux manpages often list the wrapper provided by the system's libc and these wrapper do not necessarily line-up with the syscall itself (compare the waitid() wrapper and the waitid() syscall or the various clone() syscall layouts).) From the layout of mount(2) we see that args[0] is a pointer argument identifying the source path, args[1] is another pointer argument identifying the target path, args[2] is a pointer argument identifying the filesystem type, args[3] is a non-pointer argument identifying the options, and args[4] is another pointer argument identifying additional mount options.

So if we were to be interested in the source path of this mount(2) syscall we would need to open the /proc/<pid>/mem file of the task that performed this syscall and e.g. use the pread(2) function with args[0] as the offset into the task's virtual memory and read it into a buffer at least the length of a standard path. Alternatively, we can use a single syscall like process_vm_readv(2) to read multiple remote pointers at different locations all in one go. Once we have done this we can interpret it.

A friendly advice: in general it is a good idea for the container manager to read all syscall arguments once into a local buffer and base its decisions on how to proceed on the data in this local buffer. Not just because it will otherwise not be able for the container manager to interpret pointer arguments but it's also a possible attack vector since a sufficiently privileged attacker (e.g. a thread in the same thread-group) can write to /proc/<pid>/mem and change the contents of e.g. args[0] or any other syscall argument. Also note, that the container manager should ensure that /proc/<pid> still refers to the same task after opening it by checking the validity of the syscall request via the id field and the associated SECCOMP_IOCTL_NOTIF_ID_VALID ioctl() to exclude the possibility of the task having exited, been reaped and its pid having been recycled.

But let's assume we have done all that. Now that the container manager has the task's syscall arguments available in a local buffer it can interpret the syscall arguments. While it is doing so the target task remains blocked waiting for the kernel to tell it to proceed. After the container manager is done interpreting the arguments and has performed whatever action it wanted to perform it can use the SECCOMP_IOCTL_NOTIF_SEND ioctl() on the seccomp notify fd to tell the kernel what it should do with the blocked task's syscall. The response is given in the form struct seccomp_notif_resp:

struct seccomp_notif_resp {
__u64 id;
__s64 val;
__s32 error;
__u32 flags;
};


Let's look at this struct in a little more detail too. The id field is set to the id of the syscall request to respond to and should correspond to the received id in the struct seccomp_notif that the container manager read via the SECCOMP_IOCTL_NOTIF_RECV ioctl() when the seccomp notify fd became readable. The val field is the return value of the syscall and is only set if the error field is set to 0. The error field is the error to return from the syscall and should be set to a negative errno(3) code if the syscall is supposed to fail (For example, to trick the caller into thinking that mount(2) is not supported on this kernel set error to -ENOSYS.). The flags value can be used to tell the kernel to continue the syscall by setting the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag which I added to be able to intercept mount(2) and other syscalls that are difficult for seccomp to filter efficiently because of the restrictions around pointer arguments. More on that in a little bit.

With this machinery in place we are for now ;) done with the kernel bits.

#### Emulating Syscalls In Userspace

So what is the container manager supposed to do after having read and interpreted the syscall information for the task running in the container and telling the kernel to let the task continue. Probably emulate it. Otherwise we just have a fancy and less performant seccomp userspace policy (Please read my comments on why that is a very bad idea.).

Emulating syscalls in userspace is not a very new thing to do. It has been done for a long time. For example, libc's can choose to emulate the execveat(2) syscall which allows a task to exec a program by providing a file descriptor to the binary instead of a path. On a kernel that doesn't support the execveat(2) syscall the libc can emulate it by calling exec(3) with the path set to /proc/self/fd/<nr>. The problem of course is that this emulation only works when the task in question actually uses the libc wrapper (fexecve(3) for our example). Any task using syscall(__NR_execveat, [...]) to perform the syscall without going through the provided wrapper will be bypassing libc and so libc doesn't know that the task wants to perform the execveat(2) syscall and will not be able to emulate it in case the kernel doesn't support it.

Seccomp notify doesn't suffer from this problem since its syscall interception abilities aren't located in userspace at the library level but directly in the syscall path as we have seen. This greatly expands the abilities to emulate syscalls.

So now we have all the kernel pieces in place to solve our mknod(2) and mount(2) problem in unprivileged containers. Instead of simply letting the container fail on such harmless requests as creating the /dev/zero device node we can use seccomp notify to intercept the syscall and emulate it for the container in userspace by simply creating the device node for it. Similarly, we can intercept mount(2) requests requiring the user to e.g. give us a list of allowed filesystems to mount for the container and performing the mount for the container. We can even make this a lot safer by providing a user with the ability to specify a fuse binary that should be used when a task in the container tries to mount a filesystem. We actually support this feature in LXD. Since fuse is a safe way for unprivileged users to mount filesystems rewriting mount(2) requests is a great way to expose filesystems to containers.

In general, the possibilities of seccomp notify can't be overstated and we are extremely happy that this work is now not just fully integrated into the Linux kernel but also into both LXD and LXC. As with many other technologies we have driven both in the upstream kernel and in userspace it directly benefits not just our users but all of userspace with seccomp notify seeing adoption in browsers and by other companies. A whole range of Travis workloads can now run in unprivileged LXD containers thanks to seccomp notify.

#### Seccomp Notify in action – LXD

After finishing the kernel bits we implemented support for it in LXD and the LXC shared library it uses. Instead of simply exposing the raw seccomp notify fd for the container's seccomp filter directly to LXD each container connects to a multi-threaded socket that the LXD container manager exposes and on which it listens for new clients. Clients here are new containers who the administrator has signed up for syscall supervisions through LXD. Each container has a dedicated syscall supervisor which runs as a separate go routine and stays around for as long as the container is running.

When the container performs a syscall that the filter applies to a notification is generated on the seccomp notify fd. The container then forwards this request including some additional data on the socket it connected to during startup by sending a unix message including necessary credentials. LXD then interprets the message, checking the validity of the request, verifying the credentials, and processing the syscall arguments. If LXD can prove that the request is valid according to the policy the administrator specified for the container LXD will proceed to emulate the syscall. For mknod(2) it will create the device node for the container and for mount(2) it will mount the filesystem for the container. Either by directly mounting it or by using a specified fuse binary for additional security.

If LXD manages to emulate the syscall successfully it will prepare a response that it will forward on the socket to the container. The container then parses the message, verifying the credentials and will use the SECCOMP_IOCTL_NOTIF_SEND ioctl() sending a struct seccomp_notif_resp causing the kernel to unblock the task performing the syscall and reporting back that the syscall succeeded. Conversely, if LXD fails to emulate the syscall for whatever reason or the syscall is not allowed by the policy the administrator specified it will prepare a message that instructs the container to report back that the syscall failed and unblocking the task.

#### Show Me!

Ok, enough talk. Let's intercept some syscalls. The following demo shows how LXD uses the seccomp notify fd to emulate the mknod(2) and mount(2) syscalls for an unprivileged container:

#### Current Work and Future Directions

##### SECCOMP_USER_NOTIF_FLAG_CONTINUE

After the initial support for the seccomp notify fd landed we ran into limitations pretty quickly. We realized we couldn't intercept the mount syscall. Since the mount syscall has various pointer arguments it is difficult to write highly specific seccomp filters such that we only accept syscalls that we intended to intercept. This is caused by seccomp not being able to handle pointer arguments. They are opaque for seccomp. So while it is possible to tell seccomp to only intercept mount(2) requests for real filesystems by only intercepting mount(2) syscalls where the MS_BIND flag is not set in the flags argument it is not possible to write a seccomp filter that only notifies the container manager about mount(2) syscalls for the ext4 or btrfs filesystem because the filesystem argument is a pointer.

But this means we will inadvertently intercept syscalls that we didn't intend to intercept. That is a generic problem but for some syscalls it's not really a big deal. For example, we know that mknod(2) fails for all character and block devices in unprivileged containers. So as long was we write a seccomp filter that intercepts only character and block device mknod(2) syscalls but no socket or fifo mknod() syscalls we don't have a problem. For any character or block device that is not in the list of allowed devices in LXD we can simply instruct LXD to prepare a seccomp message that tells the kernel to report EPERM and since the syscalls would fail anyway there's no problem.

But any system call that we intercepted as a consequence of seccomp not being able to filter on pointer arguments that would succeed in unprivileged containers would need to be emulated in userspace. But this would of course include all mount(2) syscalls for filesystems that can be mounted in unprivileged containers. I've listed a subset of them above. It includes at least tmpfs, proc, sysfs, devpts, cgroup, cgroup2 and probably a few others I'm forgetting. That's not ideal. We only want to emulate syscalls that we really have to emulate, i.e. those that would actually fail.

The solution to this problem was a patchset of mine that added the ability to continue an intercepted syscall. To instruct the kernel to continue the syscall the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag can be set in struct seccomp_notif_resp's flag argument when instructing the kernel to unblock the task.

This is of course a very exciting feature and has a few readers probably thinking “Hm, I could implement a dynamic userspace seccomp policy.” to which I want to very loudly respond “No, you can't!”. In general, the seccomp notify fd cannot be used to implement any kind of security policy in userspace. I'm now going to mostly quote verbatim from my comment for the extension: The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with extreme caution! If set by the task supervising the syscalls of another task the syscall will continue. This is problematic is inherent because of TOCTOU (Time of Check-Time of Use). An attacker can exploit the time while the supervised task is waiting on a response from the supervising task to rewrite syscall arguments which are passed as pointers of the intercepted syscall. It should be absolutely clear that this means that seccomp notify cannot be used to implement a security policy on syscalls that read from dereferenced pointers in user space! It should only ever be used in scenarios where a more privileged task supervises the syscalls of a lesser privileged task to get around kernel-enforced security restrictions when the privileged task deems this safe. In other words, in order to continue a syscall the supervising task should be sure that another security mechanism or the kernel itself will sufficiently block syscalls if arguments are rewritten to something unsafe.

Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the same syscall, the most recently added filter takes precedence. This means that the new SECCOMP_RET_USER_NOTIF filter can override any SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all such filtered syscalls to be executed by sending the response SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.

##### Retrieving file descriptors pidfd_getfd()

Another extension that was added by Sargun Dhillon recently building on top of my pidfd work was to make it possible to retrieve file descriptors from another task. This works even without seccomp notify since it is a new syscall but is of course especially useful in conjunction with it.

Often we would like to intercept syscalls such as connect(2). For example, the container manager might want to rewrite the connect(2) request to something other than the task intended for security reasons or because the task lacks the necessary information about the networking layout to connect to the right endpoint. In these cases pidfd_getfd(2) can be used to retrieve a copy of the file descriptor of the task and perform the connect(2) for it. This unblocks another wide range of use-cases.

For example, it can be used for further introspection into file descriptors than ss, or netstat would typically give you, as you can do things like run getsockopt(2) on the file descriptor, and you can use options like TCP_INFO to fetch a significant amount of information about the socket. Not only can you fetch information about the socket, but you can also set fields like TCP_NODELAY, to tune the socket without requiring the user's intervention. This mechanism, in conjunction can be used to build a rudimentary layer 4 load balancer where connect(2) calls are intercepted, and the destination is changed to a real server instead.

Early results indicate that this method can yield incredibly good latency as compared to other layer 4 load balancing techniques.

##### Injecting file descriptors SECCOMP_NOTIFY_IOCTL_ADDFD

Current work for the upcoming merge window is focussed on making it possible to inject file descriptors into a task. As things stand, we are unable to intercept syscalls (Unless we share the file descriptor table with the task which is usually never the case for container managers and the containers they supervise.) such as open(2) that cause new file descriptors to be installed in the task performing the syscall.

The new seccomp extension effectively allows the container manager to instructs the target task to install a set of file descriptors into its own file descriptor table before instructing it to move on. This way it is possible to intercept syscalls such as open(2) or accept(2), and install (or replace, like dup2(2)) the container manager's resulting fd in the target task.

This new technique opens the door to being able to make massive changes in userspace. For example, techniques such as enabling unprivileged access to perf_event_open(2), and bpf(2) for tracing are available via this mechanism. The manager can inspect the program, and the way the perf events are being setup to prevent the user from doing ill to the system. On top of that, various network techniques are being introducd, such as zero-cost IPv6 transition mechanisms in the future.

Last, I want to note that Sargun Dhillon was kind enough to contribute paragraphs to the pidfd_getfd(2) and SECCOMP_NOTIFY_IOCTL_ADDFD sections. He also provided the graphic in the pidfd_getfd(2) sections to illustrate the performance benefits of this solution.

Christian

## Runtimes And the Curse of the Privileged Container

#### Introduction (CVE-2019-5736)

Today, Monday, 2019-02-11, 14:00:00 CET CVE-2019-5736 was released:

The vulnerability allows a malicious container to (with minimal user interaction) overwrite the host runc binary and thus gain root-level code execution on the host. The level of user interaction is being able to run any command (it doesn't matter if the command is not attacker-controlled) as root within a container in either of these contexts:

• Creating a new container using an attacker-controlled image.

I've been working on a fix for this issue over the last couple of weeks together with Aleksa a friend of mine and maintainer of runC. When he notified me about the issue in runC we tried to come up with an exploit for LXC as well and though harder it is doable. I was interested in the issue for technical reasons and figuring out how to reliably fix it was quite fun (with a proper dose of pure hatred). It also caused me to finally write down some personal thoughts I had for a long time about how we are running containers.

#### What are Privileged Containers?

At a first glance this is a question that is probably trivial to anyone who has a decent low-level understanding of containers. Maybe even most users by now will know what a privileged container is. A first pass at defining it would be to say that a privileged container is a container that is owned by root. Looking closer this seems an insufficient definition. What about containers using user namespaces that are started as root? It seems we need to distinguish between what ids a container is running with. So we could say a privileged container is a container that is running as root. However, this is still wrong. Because “running as root” can either be seen as meaning “running as root as seen from the outside” or “running as root from the inside” where “outside” means “as seen from a task outside the container” and “inside” means “as seen from a task inside the container”.

What we really mean by a privileged container is a container where the semantics for id 0 are the same inside and outside of the container ceteris paribus. I say “ceteris paribus” because using LSMs, seccomp or any other security mechanism will not cause a change in the meaning of id 0 inside and outside the container. For example, a breakout caused by a bug in the runtime implementation will give you root access on the host.

An unprivileged container then simply is any container in which the semantics for id 0 inside the container are different from id 0 outside the container. For example, a breakout caused by a bug in the runtime implementation will not give you root access on the host by default. This should only be possible if the kernel's user namespace implementation has a bug.

The reason why I like to define privileged containers this way is that it also lets us handle edge cases. Specifically, the case where a container is using a user namespace but a hole is punched into the idmapping at id 0 aka where id 0 is mapped through. Consider a container that uses the following idmappings:

id: 0 100000 100000


This instructs the kernel to setup the following mapping:

id: container_id(0) -> host_id(100000)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.

container_id(100000) -> host_id(200000)


With this mapping it's evident that container_id(0) != host_id(0). But now consider the following mapping:

id: 0 0 1
id: 1 100001 99999


This instructs the kernel to setup the following mapping:

id: container_id(0) -> host_id(0)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.

container_id(99999) -> host_id(199999)


In contrast to the first example this has the consequence that container_id(0) == host_id(0). I would argue that any container that at least punches a hole for id 0 into its idmapping up to specifying an identity mapping is to be considered a privileged container.

As a sidenote, Docker containers run as privileged containers by default. There is usually some confusion where people think because they do not use the --privileged flag that Docker containers run unprivileged. This is wrong. What the --privileged flag does is to give you even more permissions by e.g. not dropping (specific or even any) capabilities. One could say that such containers are almost “super-privileged”.

#### The Trouble with Privileged Containers

The problem I see with privileged containers is essentially captured by LXC's and LXD's upstream security position which we have held since at least 2015 but probably even earlier. I'm quoting from our notes about privileged containers:

Privileged containers are defined as any container where the container uid 0 is mapped to the host's uid 0. In such containers, protection of the host and prevention of escape is entirely done through Mandatory Access Control (apparmor, selinux), seccomp filters, dropping of capabilities and namespaces.

Those technologies combined will typically prevent any accidental damage of the host, where damage is defined as things like reconfiguring host hardware, reconfiguring the host kernel or accessing the host filesystem.

LXC upstream's position is that those containers aren't and cannot be root-safe.

They are still valuable in an environment where you are running trusted workloads or where no untrusted task is running as root in the container.

We are aware of a number of exploits which will let you escape such containers and get full root privileges on the host. Some of those exploits can be trivially blocked and so we do update our different policies once made aware of them. Some others aren't blockable as they would require blocking so many core features that the average container would become completely unusable.

[...]

As privileged containers are considered unsafe, we typically will not consider new container escape exploits to be security issues worthy of a CVE and quick fix. We will however try to mitigate those issues so that accidental damage to the host is prevented.

LXC's upstream position for a long time has been that privileged containers are not and cannot be root safe. For something to be considered root safe it should be safe to hand root access to third parties or tasks.

#### Running Untrusted Workloads in Privileged Containers

is insane. That's about everything that this paragraph should contain. The fact that the semantics for id 0 inside and outside the container are identical entails that any meaningful container escape will have the attacker gain root on the host.

#### CVE-2019-5736 Is a Very Very Very Bad Privilege Escalation to Host Root

CVE-2019-5736 is an excellent illustration of such an attack. Think about it: a process running inside a privileged container can rather trivially corrupt the binary that is used to attach to the container. This allows an attacker to create a custom ELF binary on the host. That binary could do anything it wants:

• could just be a binary that calls poweroff
• could be a binary that spawns a root shell
• could be a binary that kills other containers when called again to attach
• could be suid cat
• .
• .
• .

The attack vector is actually slightly worse for runC due to its architecture. Since runC exits after spawning the container it can also be attacked through a malicious container image. Which is super bad given that a lot of container workload workflows rely on downloading images from the web.

LXC cannot be attacked through a malicious image since the monitor process (a singleton per-container) never exits during the containers life cycle. Since the kernel does not allow modifications to running binaries it is not possible for the attacker to corrupt it. When the container is shutdown or killed the attacking task will be killed before it can do any harm. Only when the last process running inside the container has exited will the monitor itself exit. This has the consequence, that if you run privileged OCI containers via our oci template with LXC your are not vulnerable to malicious images. Only the vector through the attaching binary still applies.

#### The Lie that Privileged Containers can be safe

Aside from mostly working on the Kernel I'm also a maintainer of LXC and LXD alongside Stéphane Graber. We are responsible for LXC – the low-level container runtime – and LXD – the container management daemon using LXC. We have made a very conscious decision to consider privileged containers not root safe. Two main corollaries follow from this:

1. Privileged containers should never be used to run untrusted workloads.
2. Breakouts from privileged containers are not considered CVEs by our security policy. It still seems a common belief that if we all just try hard enough using privileged containers for untrusted workloads is safe. This is not a promise that can be made good upon. A privileged container is not a security boundary. The reason for this is simply what we looked at above: container_id(0) == host_id(0). It is therefore deeply troubling that this industry is happy to let users believe that they are safe and secure using privileged containers.

#### Unprivileged Containers as Default

As upstream for LXC and LXD we have been advocating the use of unprivileged containers by default for years. Way ahead before anyone else did. Our low-level library LXC has supported unprivileged containers since 2013 when user namespaces were merged into the kernel. With LXD we have taken it one step further and made unprivileged containers the default and privileged containers opt-in for that very matter: privileged containers aren't safe. We even allow you to have per-container idmappings to make sure that not just each container is isolated from the host but also all containers from each other.

For years we have been advocating for unprivileged containers on conferences, in blogposts, and whenever we have spoken to people but somehow this whole industry has chosen to rely on privileged containers.

The good news is that we are seeing changes as people become more familiar with the perils of privileged containers. Let this recent CVE be another reminder that unprivileged containers need to be the default.

#### Are LXC and LXD affected?

• Unprivileged LXC and LXD containers are not affected.

• Any privileged LXC and LXD container running on a read-only rootfs is not affected.

• Privileged LXC containers in the definition provided above are affected. Though the attack is more difficult than for runC. The reason for this is that the lxc-attach binary does not exit before the program in the container has finished executing. This means an attacker would need to open an O_PATH file descriptor to /proc/self/exe, fork() itself into the background and re-open the O_PATH file descriptor through /proc/self/fd/<O_PATH-nr> in a loop as O_WRONLY and keep trying to write to the binary until such time as lxc-attach exits. Before that it will not succeed since the kernel will not allow modification of a running binary.

• Privileged LXD containers are only affected if the daemon is restarted other than for upgrade reasons. This should basically never happen. The LXD daemon never exits so any write will fail because the kernel does not allow modification of a running binary. If the LXD daemon is restarted because of an upgrade the binary will be swapped out and the file descriptor used for the attack will write to the old in-memory binary and not to the new binary.

#### Chromebooks with Crostini using LXD are not affected

Chromebooks use LXD as their default container runtime are not affected. First of all, all binaries reside on a read-only filesystem and second, LXD does not allow running privileged containers on Chromebooks through the LXD_UNPRIVILEGED_ONLY flag. For more details see this link.

#### Fixing CVE-2019-5736

To prevent this attack, LXC has been patched to create a temporary copy of the calling binary itself when it attaches to containers (cf.6400238d08cdf1ca20d49bafb85f4e224348bf9d). To do this LXC can be instructed to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. LXC then executes this sealed, in-memory file instead of the original on-disk binary. Any compromising write operations from a privileged container to the host LXC binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host LXC binary. Also as the temporary, in-memory LXC binary is sealed, writes to this will also fail. To not break downstream users of the shared library this is opt-in by setting LXC_MEMFD_REXEC in the environment. For our lxc-attach binary which is the only attack vector this is now done by default.

Workloads that place the LXC binaries on a read-only filesystem or prevent running privileged containers can disable this feature by passing --disable-memfd-rexec during the configure stage when compiling LXC.

## Android Binderfs

#### Introduction

Android Binder is an inter-process communication (IPC) mechanism. It is heavily used in all Android devices. The binder kernel driver has been present in the upstream Linux kernel for quite a while now.

Binder has been a controversial patchset (see this lwn article as an example). Its design was considered wrong and to violate certain core kernel design principles (e.g. a task should never touch another tasks file descriptor table). Most kernel developers were not a fan of binder.

Recently, the upstream binder code has fortunately been reworked significantly (e.g. it does not touch another tasks file descriptor table anymore, the locking is very fine-grained now, etc.).

With Android being one of the major operating systems (OS) for a vast number of devices there is simply no way around binder.

#### The Android Service Manager

The binder IPC mechanism is accessible from userspace through device nodes located at /dev. A modern Android system will allocate three device nodes:

• /dev/binder
• /dev/hwbinder
• /dev/vndbinder

serving different purposes. However, the logic is the same for all three of them. A process can call open(2) on those device nodes to receive an fd which it can then use to issue requests via ioctl(2)s. Android has a service manager which is used to translate addresses to bus names and only the address of the service manager itself is well-known. The service manager is registered through an ioctl(2) and there can only be a single service manager. This means once a service manager has grabbed hold of binder devices they cannot be (easily) reused by a second service manager.

#### Running Android in Containers

This matters as soon as multiple instances of Android are supposed to be run. Since they will all need their own private binder devices. This is a use-case that arises pretty naturally when running Android in system containers. People have been doing this for a long time with LXC. A project that has set out to make running Android in LXC containers very easy is Anbox. Anbox makes it possible to run hundreds of Android containers.

To properly run Android in a container it is necessary that each container has a set of private binder devices.

#### Statically Allocating binder Devices

Binder devices are currently statically allocated at compile time. Before compiling a kernel the CONFIG_ANDROID_BINDER_DEVICES option needs to bet set in the kernel config (Kconfig) containing the names of the binder devices to allocate at boot. By default it is set as:

CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder"


To allocate additional binder devices the user needs to specify them with this Kconfig option. This is problematic since users need to know how many containers they will run at maximum and then to calculate the number of devices they need so they can specify them in the Kconfig. When the maximum number of needed binder devices changes after kernel compilation the only way to get additional devices is to recompile the kernel.

#### Problem 1: Using the misc major Device Number

This situation is aggravated by the fact that binder devices use the misc major number in the kernel. Each device node in the Linux kernel is identified by a major and minor number. A device can request its own major number. If it does it will have an exclusive range of minor numbers it doesn't share with anything else and is free to hand out. Or it can use the misc major number. The misc major number is shared amongst different devices. However, that also means the number of minor devices that can be handed out is limited by all users of misc major. So if a user requests a very large number of binder devices in their Kconfig they might make it impossible for anyone else to allocate minor numbers. Or there simply might not be enough to allocate for itself.

#### Problem 2: Containers and IPC namespaces

All of those binder devices requested in the Kconfig via CONFIG_ANDROID_BINDER_DEVICES will be allocated at boot and be placed in the hosts devtmpfs mount usually located at /dev or – depending on the udev(7) implementation – will be created via mknod(2) – by udev(7) at boot. That means all of those devices initially belong to the host IPC namespace. However, containers usually run in their own IPC namespace separate from the host's. But when binder devices located in /dev are handed to containers (e.g. with a bind-mount) the kernel driver will not know that these devices are now used in a different IPC namespace since the driver is not IPC namespace aware. This is not a serious technical issue but a serious conceptual one. There should be a way to have per-IPC namespace binder devices.

#### Enter binderfs

To solve both problems we came up with a solution that I presented at the Linux Plumbers Conference in Vancouver this year. There's a video of that presentation available on Youtube:

Android binderfs is a tiny filesystem that allows users to dynamically allocate binder devices, i.e. it allows to add and remove binder devices at runtime. Which means it solves problem 1. Additionally, binder devices located in a new binderfs instance are independent of binder devices located in another binderfs instance. All binder devices in binderfs instances are also independent of the binder devices allocated during boot specified in CONFIG_ANDROID_BINDER_DEVICES. This means, binderfs solves problem 2.

Android binderfs can be mounted via:

mount -t binder binder /dev/binderfs


at which point a new instance of binderfs will show up at /dev/binderfs. In a fresh instance of binderfs no binder devices will be present. There will only be a binder-control device which serves as the request handler for binderfs:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:07 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 6 Jan 10 15:07 binder-control


#### binderfs: Dynamically Allocating a New binder Device

To allocate a new binder device in a binderfs instance a request needs to be sent through the binder-control device node. A request is sent in the form of an ioctl(2). Here's an example program:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/android/binder.h>
#include <linux/android/binderfs.h>

int main(int argc, char *argv[])
{
int fd, ret, saved_errno;
size_t len;
struct binderfs_device device = { 0 };

if (argc != 3)
exit(EXIT_FAILURE);

len = strlen(argv[2]);
if (len > BINDERFS_MAX_NAME)
exit(EXIT_FAILURE);

memcpy(device.name, argv[2], len);

fd = open(argv[1], O_RDONLY | O_CLOEXEC);
if (fd < 0) {
printf("%s - Failed to open binder-control device\n",
strerror(errno));
exit(EXIT_FAILURE);
}

saved_errno = errno;
close(fd);
errno = saved_errno;
if (ret < 0) {
printf("%s - Failed to allocate new binder device\n",
strerror(errno));
exit(EXIT_FAILURE);
}

printf("Allocated new binder device with major %d, minor %d, "
"and name %s\n", device.major, device.minor,
device.name);

exit(EXIT_SUCCESS);
}


What this program simply does is to open the binder-control device node and sending a BINDER_CTL_ADD request to the kernel. Users of binderfs need to tell the kernel which name the new binder device should get. By default a name can only contain up to 256 chars including the terminating zero byte. The struct which is used is:

/**
* struct binderfs_device - retrieve information about a new binder device
* @name:   the name to use for the new binderfs binder device
* @major:  major number allocated for binderfs binder devices
* @minor:  minor number allocated for the new binderfs binder device
*
*/
struct binderfs_device {
char name[BINDERFS_MAX_NAME + 1];
__u32 major;
__u32 minor;
};


and is defined in linux/android/binderfs.h. Once the request is made via an ioctl(2) passing a struct binder_device with the name to the kernel it will allocate a new binder device and return the major and minor number of the new device in the struct (This is necessary because binderfs allocated a major device number dynamically at boot.). After the ioctl(2) returns there will be a new binder device located under /dev/binderfs with the chosen name:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder
crw-------  1 root root 242, 2 Jan 10 15:19 my-binder1


#### binderfs: Deleting a binder Device

Deleting binder devices does not involve issuing another ioctl(2) request through binder-control. They can be deleted via unlink(2). This means that the rm(1) tool can be used to delete them:

root@edfu:~# rm /dev/binderfs/my-binder1
root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder


Note that the binder-control device cannot be deleted since this would make the binderfs instance unuseable. The binder-control device will be deleted when the binderfs instance is unmounted and all references to it have been dropped.

#### binderfs: Mounting Multiple Instances

Mounting another binderfs instance at a different location will create a new and separate instance from all other binderfs mounts. This is identical to the behavior of devpts, tmpfs, and also – even though never merged in the kernel – kdbusfs:

root@edfu:~# mkdir binderfs1
root@edfu:~# mount -t binder binder binderfs1
root@edfu:~# ls -al binderfs1/
total 4
drwxr-xr-x  2 root   root        0 Jan 10 15:23 .
drwxr-xr-x 72 ubuntu ubuntu   4096 Jan 10 15:23 ..
crw-------  1 root   root   242, 2 Jan 10 15:23 binder-control


There is no my-binder device in this new binderfs instance since its devices are not related to those in the binderfs instance at /dev/binderfs. This means users can easily get their private set of binder devices.

#### binderfs: Mounting binderfs in User Namespaces

The Android binderfs filesystem can be mounted and used to allocate new binder devices in user namespaces. This has the advantage that binderfs can be used in unprivileged containers or any user-namespace-based sandboxing solution:

ubuntu@edfu:~\$ unshare --user --map-root --mount
root@edfu:~# mkdir binderfs-userns
root@edfu:~# mount -t binder binder binderfs-userns/
root@edfu:~# The "bfs" binary used here is the compiled program from above
root@edfu:~# ./bfs binderfs-userns/binder-control my-user-binder
Allocated new binder device with major 242, minor 4, and name my-user-binder
root@edfu:~# ls -al binderfs-userns/
total 4
drwxr-xr-x  2 root root      0 Jan 10 15:34 .
drwxr-xr-x 73 root root   4096 Jan 10 15:32 ..
crw-------  1 root root 242, 3 Jan 10 15:34 binder-control
crw-------  1 root root 242, 4 Jan 10 15:36 my-user-binder


#### Kernel Patchsets

The binderfs patchset is merged upstream and will be available when Linux 5.0 gets released. There are a few outstanding patches that are currently waiting in Greg's tree (cf. binderfs: remove wrong kern_mount() call and binderfs: make each binderfs mount a new instancechar-misc-linus) and some others are queued for the 5.1 merge window. But overall it seems to be in decent shape.