people.kernel.org


from Vlastimil Babka

In this post I would like to raise a bit of awareness about an effort to reduce the limitations of anonymous VMA merging, in the form of an ongoing master's thesis by Jakub Matena, which I'm supervising. I suspect there might be userspace projects that would benefit, and maybe their authors are aware of the current limitations and would welcome it if they were relaxed, but they don't read the linux-mm mailing list – the last version of the RFC posted there is here.

As a high-level summary, merging of anonymous VMAs in Linux generally happens as soon as they become adjacent in the address space and have compatible access protection bits (and also mempolicies etc.). However, due to internal implementation details (involving VMA and page offsets), some operations such as mremap() that move VMAs around the address space can cause anonymous VMAs not to merge even if everything else is compatible. This is then visible as extra entries in /proc/pid/maps that could in theory be one larger entry, as larger memory and CPU overhead of VMA operations, or even as hitting the limit of VMAs per process, set by the vm.max_map_count sysctl. A related issue is that the mremap() syscall itself cannot currently process multiple VMAs, so a process that needs to further mremap() the non-merged areas would need to somehow learn the extra boundaries first and perform a sequence of multiple mremap() calls to achieve its goal.
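To get a rough feel for whether a process is affected, one can scan its maps file for neighbouring anonymous entries that look mergeable. The following is only a sketch: the helper names are made up for illustration, and since /proc/pid/maps shows only addresses and permissions (not mempolicies or the internal offsets mentioned above), the pairs it reports are candidates, not proof of a missed merge.

```python
def parse_maps(maps_text):
    """Parse /proc/<pid>/maps text into (start, end, perms, path) tuples."""
    vmas = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        vmas.append((int(start_s, 16), int(end_s, 16), fields[1], path))
    return vmas


def unmerged_anonymous_neighbours(vmas):
    """Return pairs of adjacent anonymous VMAs that share permissions."""
    pairs = []
    for prev, cur in zip(vmas, vmas[1:]):
        adjacent = prev[1] == cur[0]
        both_anonymous = prev[3] == "" and cur[3] == ""
        same_perms = prev[2] == cur[2]
        if adjacent and both_anonymous and same_perms:
            pairs.append((prev, cur))
    return pairs


# Synthetic sample in the /proc/<pid>/maps layout (not real process data).
SAMPLE = """\
7f0000000000-7f0000001000 rw-p 00000000 00:00 0
7f0000001000-7f0000003000 rw-p 00000000 00:00 0
7f0000003000-7f0000004000 r--p 00000000 00:00 0
7f0000004000-7f0000005000 r--p 00000000 08:01 42 /usr/lib/libfoo.so
"""

for prev, cur in unmerged_anonymous_neighbours(parse_maps(SAMPLE)):
    print(f"{prev[0]:x}-{prev[1]:x} and {cur[0]:x}-{cur[1]:x} look mergeable")
```

On a real system the same functions can be fed the contents of /proc/self/maps or /proc/<pid>/maps instead of the sample string.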

Does any of the above sound familiar because you found that out already while working on a Linux application? Then we would love your feedback on the RFC linked above (or privately). The issue is that while the RFC can lift the merging limitations in many scenarios, it doesn't come for free: there is some added overhead in e.g. mremap(), and especially extra complexity in already complex code. Thus identifying workloads that would benefit a lot would be helpful. Thanks!

 

from metan's blog

The new LTP release will include changes that introduce the concept of a maximal test runtime, so let me briefly explain what exactly that is. To begin with, let's make an observation about LTP test duration. Most LTP tests fall into two categories when the duration of the test is considered. The first type of test is fast, generally under a second or two and most of the time even a fraction of that. These tests mostly prepare a simple environment, call a syscall or two, clean up and are done. The second type of test runs for longer, and its duration is usually counted in minutes. These tests include I/O stress tests, various regression tests that loop in order to hit a race, timer precision tests that have to sample time intervals, and so on.

Historically in LTP the test duration was limited by a single value called timeout, which defaulted to a compromise of 5 minutes, which is the worst value for both classes of tests. That is because it's clearly too long for short running tests and at the same time too short for a significant fraction of the long running tests. This was clear just by checking the tests that actually adjusted the default timeout. Quite a few short running tests that were prone to deadlocks decreased the default timeout to a much shorter interval, and at the same time quite a few long running tests increased it.

But back to how the test duration was handled in the long running tests. The test duration for long running tests is usually bounded by a time limit as well as a limit on the number of iterations, and the test exits on whichever is hit first. In order to exit before the timeout, these tests watched the elapsed runtime and exited the main loop if the runtime got close enough to the test timeout. The problem was that “close enough” was loosely defined and implemented differently in each test. That obviously led to a number of problems. For instance, if a test looped until there were 10 seconds left to the timeout and the test cleanup took more than 10 seconds on slower hardware, there was no way to avoid triggering the timeout, which resulted in a test failure. If the test timeout was increased, the test simply ran for a longer duration and hit the timeout at the end either way. At the same time, if the test reserved a proportion of the timeout for the test cleanup, things didn't work out when the timeout was scaled down in order to shorten the test duration.

After careful analysis it became clear that the test duration has to be bounded by two distinct values. The new values are now called timeout and max_runtime, and the test duration is bounded by the sum of these two. The idea behind this should be clear to the reader at this point. The max_runtime limits the active part of the test, that is the part where the actual test loop is executed, and the timeout covers the test setup and cleanup and all inaccuracies in the accounting. Each of them can be scaled separately, which gives us enough flexibility to scale from small embedded boards all the way up to supercomputers. This change also allowed us to change the default test timeout to 30 seconds. And if you are asking yourself how max_runtime is set for short running tests, the answer is simple: it's set to zero, since the default timeout is more than enough to cope with these.
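The split between the two values can be modelled in a few lines. This is a sketch in Python, not LTP code: the function names and the two multiplier parameters are made up for illustration, and the loop helper models only the long running case where the test body repeats until its runtime budget or its iteration limit is used up.

```python
import time


def total_time_limit(timeout, max_runtime, timeout_mul=1.0, runtime_mul=1.0):
    """Upper bound on test duration: scaled timeout plus scaled max_runtime."""
    return timeout * timeout_mul + max_runtime * runtime_mul


def run_test_loop(step, max_runtime, max_iterations, clock=time.monotonic):
    """Call step() until the runtime budget or the iteration limit is hit."""
    start = clock()
    iterations = 0
    while iterations < max_iterations and clock() - start < max_runtime:
        step()
        iterations += 1
    return iterations


# A short test: no runtime budget, only the 30 second default timeout.
print(total_time_limit(timeout=30, max_runtime=0))
# A long test on a slow board: setup/cleanup budget doubled, loop time halved.
print(total_time_limit(timeout=30, max_runtime=120, timeout_mul=2, runtime_mul=0.5))
```

Because the loop only ever looks at max_runtime, the time needed for setup and cleanup no longer has to be guessed inside each test; it is covered by the separately scaled timeout.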

All of this also helps to kill misbehaving tests much faster, since we have a much better estimate of the expected test duration. And yes, this is a big deal when you are running thousands of testcases; it may speed up the testrun quite significantly even with a few deadlocked tests.

But things do not end here; there is a bit of added complexity on top of this. Some of the testcases will call the main test loop more than once. That is because we have a few “multiplier” flags that can increase test coverage quite a bit. For instance, we have the so-called .all_filesystems flag that, when set, will execute the test on top of the most commonly used filesystems. There is also a flag that can run the test for different variants, which is sometimes used to run the test for more than one syscall variant, e.g. for clock_gettime() we run the same test for both the syscall and VDSO. All these multipliers have to be taken into account when the overall test duration is computed. However, we do have all these flags in the metadata file now, hence we are getting really close to a state where we will have a tool that can compute an accurate upper bound on the duration of a given test. That, however, is a completely different story for a different short article.
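How the multipliers enter the estimate can be sketched as follows. This rests on an assumption that the text does not spell out, namely that every repetition of the main loop gets the full timeout plus max_runtime budget; the function name and parameters are hypothetical, and a real tool would read the flags from the metadata file.

```python
def duration_upper_bound(timeout, max_runtime, filesystems=1, variants=1):
    """Worst-case duration when multipliers repeat the main test loop.

    Assumption: each (filesystem, variant) combination may use the whole
    timeout + max_runtime budget.
    """
    runs = filesystems * variants
    return runs * (timeout + max_runtime)


# Hypothetical test: 30 s timeout, 60 s runtime, 5 filesystems, 2 variants.
print(duration_upper_bound(30, 60, filesystems=5, variants=2))
```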

 

from Konstantin Ryabitsev

Once every couple of years someone unfailingly takes advantage of the following two facts:

  1. most large git hosting providers set up object sharing between forks of the same repository in order to both save storage space and improve user experience
  2. git's loose internal structure allows any shared object to be accessed from any other repository

Thus, hilarity ensues on a fairly regular basis:

Every time this happens, many wonder how come this isn't treated like a nasty security bug, and the answer, inevitably, is “it's complicated.”

Blobs, trees, commits, oh my

Under the hood, git repositories are a bunch of objects — blobs, trees, and commits. Blobs are file contents, trees are directory listings that establish the relationship between file names and the blobs, and commits are like still frames in a movie reel that show where all the trees and blobs were at a specific point in time. Each next commit refers to the hash of the previous commit, which is how we know in what order these still frames should be put together to make a movie.

Each of these objects has a hash value, which is how they are stored inside the git directory itself (look in .git/objects). When git was originally designed, over a decade ago, it didn't really have a concept of “branches” — there was just a symlink HEAD pointing to the latest commit. If you wanted to work on several things at once, you simply cloned the repository and did it in a separate directory with its own HEAD. Cloning was a very efficient operation, as through the magic of hardlinking, hundreds of clones would take up about as much room on your disk as a single one.
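To make the hashing concrete, here is a small sketch of how an object ID is derived and where a loose object with that ID lives under .git/objects. It assumes the classic SHA-1 object format (repositories created with the SHA-256 format differ), and the function names are made up for illustration.

```python
import hashlib


def git_object_id(obj_type, payload):
    """SHA-1 object ID: hash of '<type> <size>\\0' followed by the payload."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def loose_object_path(object_id):
    """Relative path under .git/objects where a loose object is stored."""
    return f"{object_id[:2]}/{object_id[2:]}"


blob_id = git_object_id("blob", b"hello world\n")
print(blob_id)
print(loose_object_path(blob_id))
```

The first value should match what `echo 'hello world' | git hash-object --stdin` reports for the same content.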

Fast-forward to today

Git is a lot more complicated these days, but the basic concepts are the same. You still have blobs, trees, commits, and they are all still stored internally as hashes. Under the hood, git has developed quite a bit over the past decade to make it more efficient to store and retrieve millions and tens of millions of repository objects. Most of them are now stored inside special pack files, which are organized rather similarly to compressed video clips: formats like webm don't really store each frame in a separate image, as there is usually very little difference between any two adjacent frames. It makes much more sense to store just the difference (“delta”) between two still images until you come to a designated “key frame”.

Similarly, when generating pack files, git will try to calculate the deltas between objects and only store their incremental differences, at least until it decides that it's time to start from a new “key frame” just so checking out a tag from a year ago doesn't require replaying a year's worth of diffs. At the same time, there has been a lot of work to make the act of pushing/pulling objects more efficient. When someone sends you a pull request and you want to review their changes, you don't want to download their entire tree. Your git client and the remote git server compare what objects they already have on each end, with the goal to send you just the objects that you are lacking.
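The delta and “key frame” idea can be illustrated with a toy model. This is deliberately not git's pack format or its delta algorithm: the delta here is just a shared-prefix length plus a new tail, each version is stored against the previous one, and max_chain is a made-up parameter that forces a full copy every few entries.

```python
def make_delta(base, target):
    """Toy delta: length of the shared prefix plus the differing tail."""
    prefix = 0
    limit = min(len(base), len(target))
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    return (prefix, target[prefix:])


def apply_delta(base, delta):
    """Rebuild the target from its base and a toy delta."""
    prefix, tail = delta
    return base[:prefix] + tail


def pack(versions, max_chain=3):
    """Store a full copy every max_chain entries and deltas in between."""
    entries = []
    for index, data in enumerate(versions):
        if index % max_chain == 0:
            entries.append(("full", data))  # the "key frame"
        else:
            entries.append(("delta", make_delta(versions[index - 1], data)))
    return entries


def unpack(entries, index):
    """Replay deltas starting from the nearest preceding full copy."""
    start = index
    while entries[start][0] != "full":
        start -= 1
    data = entries[start][1]
    for _kind, delta in entries[start + 1:index + 1]:
        data = apply_delta(data, delta)
    return data


VERSIONS = [b"a", b"ab", b"abc", b"abd", b"xbd"]
PACKED = pack(VERSIONS)
print([kind for kind, _ in PACKED])
print(unpack(PACKED, 4))
```

Bounding the chain length means that rebuilding any version replays at most max_chain - 1 deltas, which is the trade-off the video analogy describes.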

Optimizing public forks

If you look at the GitHub links above, check out how many forks torvalds/linux has on that hosting service. Right now, that number says “41.1k”. With the best kinds of optimizations in place, a bare linux.git repository takes up roughly 3 GB on disk. Doing quick math, if each one of these 41.1k forks were completely standalone, that would require about 125 TB of disk storage. Throw in a few hundred terabytes for all the forks of Chromium, Android, and Gecko, and soon you're talking Real Large Numbers. Which is why nobody actually does it this way.

Remember how I said that git forks were designed to be extremely efficient and reuse the objects between clones? This is how forks are actually organized on GitHub (and git.kernel.org, for that matter), except it's a bit more complicated these days than simply hardlinking the contents of .git/objects around.

On the git.kernel.org side of things, we store the objects from all forks of linux.git in a single “object storage” repository (see https://pypi.org/project/grokmirror/ for the gory details). This has many positive side-effects:

  • all of git.kernel.org, with its hundreds of linux.git forks, takes up just 30G of disk space
  • when Linus merges his usual set of pull requests and performs “git push”, he only has to send a very small subset of those objects, because we probably already have most of them
  • similarly, when maintainers pull, rebase, and push their own forks, they don't have to send any of the objects back to us, as we already have them

Object sharing allows us to greatly improve not only the backend infrastructure on our end, but also the experience of git's end-users, who directly benefit from not having to push around nearly as many bits.

The dark side of object sharing

With all the benefits of object sharing comes one important downside: namely, you can access any shared object through any of the forks. So, if you fork linux.git and push your own commit into it, any of the 41.1k forks will have access to the objects referenced by your commit. If you know the hash of that object, and if the web UI allows accessing arbitrary repository objects by their hash, you can even view and link to it from any of the forks, making it look as if that object is actually part of that particular repository (which is how we get the links at the start of this article).

So, why can't GitHub (or git.kernel.org) prevent this from happening? Remember when I said that a git repository is like a movie full of adjacent still frames? When you look at a scene in a movie, it is very easy for you to identify all objects in any given still frame — there is a street, a car, and a person. However, if I show you a picture of a car and ask you “does this car show up in this movie,” the only way you can answer this question is by watching the entire thing from the beginning to the end, carefully scrutinizing every shot.

In just the same way, to check if a blob from the shared repository actually belongs in a fork, git has to look at all that repository's tips and work its way backwards, commit by commit, to see if any of the tree objects reference that particular blob. Needless to say, this is an extremely expensive operation, which, if enabled, would allow anyone to easily DoS a git server with only a handful of requests.
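The walk described above can be sketched against a toy in-memory object store. The dictionaries and IDs below are invented for illustration and stand in for real commit and tree objects; the point is only that answering the question means visiting every commit reachable from the tips and every tree those commits reference.

```python
from collections import deque


def blob_reachable_from_tips(tips, blob_id, commits, trees):
    """Walk history backwards from the tips, looking for blob_id.

    commits maps commit id -> (root tree id, [parent commit ids]);
    trees maps tree id -> list of (kind, object id), kind is "blob" or "tree".
    """
    seen_commits, seen_trees = set(), set()
    queue = deque(tips)
    while queue:
        commit_id = queue.popleft()
        if commit_id in seen_commits:
            continue
        seen_commits.add(commit_id)
        root_tree, parents = commits[commit_id]
        stack = [root_tree]
        while stack:
            tree_id = stack.pop()
            if tree_id in seen_trees:
                continue
            seen_trees.add(tree_id)
            for kind, object_id in trees[tree_id]:
                if kind == "blob" and object_id == blob_id:
                    return True
                if kind == "tree":
                    stack.append(object_id)
        queue.extend(parents)
    return False


# Toy shared storage: two forks share objects but have different tips.
COMMITS = {"c1": ("t1", []), "c2": ("t2", ["c1"]), "x1": ("t3", ["c1"])}
TREES = {
    "t1": [("blob", "b1")],
    "t2": [("blob", "b1"), ("tree", "t4")],
    "t3": [("blob", "b1"), ("blob", "evil")],
    "t4": [("blob", "b2")],
}
print(blob_reachable_from_tips(["c2"], "evil", COMMITS, TREES))  # upstream tip
print(blob_reachable_from_tips(["x1"], "evil", COMMITS, TREES))  # fork tip
```

In the negative case the function cannot stop early: it has to exhaust the whole history before it can say “no”, which is what makes the check so costly on a repository the size of linux.git.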

This may change in the future, though. For example, if you access a commit that is not part of a repository, GitHub will now show you a warning message:

Looking up “does this commit belong in this repository” used to be a very expensive operation, too, until git learned to generate commit graphs (see man git-commit-graph). It is possible that at some point in the future a similar feature will land that will make it easy to perform a similar check for the blob, which will allow GitHub to show a similar warning when someone accesses shared blobs by their hash from the wrong repo.

Why this isn't a security bug

Just because an object is part of the shared storage doesn't really have any impact on the forks. When you perform a git-aware operation like “git clone” or “git pull,” git-daemon will only send the objects actually belonging to that repository. Furthermore, your git client deliberately doesn't trust the remote to send the right stuff, so it will perform its own connectivity checks before accepting anything from the server.

If you're extra paranoid, you're encouraged to set receive.fsckObjects for some additional protection against in-flight object corruption, and if you're really serious about securing your repositories, then you should set up and use git object signing:

This is, incidentally, also how you would be able to verify whether commits were made by the actual Linus Torvalds or merely by someone pretending to be him.

Parting words

This neither proves nor disproves the identity of “Satoshi.” However, given Linus's widely known negative opinions of C++, it's probably not very likely that it's the language he'd pick to write some proof of concept code.

 

from metan's blog

Unfortunately FOSDEM is going to be virtual again this year, but that does not stop us from organizing the testing and automation devroom. Have a look at our CfP and if you have something interesting to present go ahead and fill in a submission!

 

from nmenon

One of the cool things with kernel.org is the fact that we can rotate maintainership depending on workload. So, https://git.kernel.org/pub/scm/linux/kernel/git/nmenon/linux.git/ is now my personal tree, and we have picked up https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux.git/ as a co-maintained TI tree where Vignesh and I rotate responsibilities, with Tony Lindgren and Tero as backup.

Thanks to Konstantin and Stephen for making this happen!

NOTE: No change in Tony's tree @ https://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap.git/

 

from Konstantin Ryabitsev

This is the second installment in the series where we're looking at using the public-inbox lei tool for interacting with remote mailing list archives such as lore.kernel.org. In the previous article we looked at delivering your search results locally, and today let's look at doing the same, but with remote IMAP folders. For feedback, send a follow-up to this message on the workflows list:

For our example query today, we'll do some stargazing. The following will show us all mail sent by Linus Torvalds:

f:torvalds AND rt:1.month.ago..

I'm mostly using it because it's short, but you may want to use something similar if you have co-maintainer duties and want to automatically receive a copy of all mail sent by your fellow subsystem maintainers.

Note on saving credentials

When accessing IMAP folders, lei will require a username and password. Unless you really like typing them in manually every time you run lei up, you will probably want to have them cached on your local system. Lei will defer to git-credential-helper for this purpose, so if you haven't already set this up, you will want to do that now.

The two commonly used credential storage backends on Linux are “libsecret” and “store”:

  • libsecret is the preferred mechanism, as it will work with your Desktop Environment's keyring manager to store the credentials in a relatively safe fashion (encrypted at rest).

  • store should only be used if you don't have any other option, as it will record the credentials without any kind of encryption in the ~/.git-credentials file. However, if nothing else is working for you and you are fairly confident in the security of your system, it's something you can use.

Simply run the following command to configure the credential helper globally for your environment:

git config --global credential.helper libsecret

For more in-depth information about this topic, see man git-credential.

Getting your IMAP server ready

Before you start, you should get some information about your IMAP server, such as your login information. For my examples, I'm going to use Gmail, Migadu, and a generic Dovecot IMAP server installation, which should hopefully cover enough ground to be useful for the vast majority of cases.

What you will need beforehand:

  • the IMAP server hostname and port (if it's not 993)
  • the IMAP username
  • the IMAP password

It will also help to know the folder hierarchy. Some IMAP servers create all subfolders below INBOX, while others don't really care.

Generic Dovecot

We happen to be running Dovecot on mail.codeaurora.org, so I'm going to use it as my “generic Dovecot” system and run the following command:

lei q -I https://lore.kernel.org/all/ -d mid \
  -o imaps://mail.codeaurora.org/INBOX/torvalds \
  <<< 'f:torvalds AND rt:1.month.ago..'

The <<< bit above is a Bash-ism, so if you're using a different shell, you can use the POSIX-compliant heredoc format instead:

lei q -I https://lore.kernel.org/all/ -d mid \
  -o imaps://mail.codeaurora.org/INBOX/torvalds <<EOF
f:torvalds AND rt:1.month.ago..
EOF

The first time you run it, you should get a username: and password: prompt, but after that the credentials should be cached and no longer required on each repeated access to the same imaps server.

NOTE: some IMAP servers use the dot . instead of the slash / for indicating folder hierarchy, so if INBOX/torvalds is not working for you, try INBOX.torvalds instead.
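If you drive lei from a script instead of a shell, the hierarchy separator can be made a parameter. The sketch below only assembles the same command line as shown above and feeds the query on standard input; the two helper names are made up, and nothing here goes beyond the flags already used in this article.

```python
import subprocess


def lei_imap_query_argv(host, folder_parts, separator="/",
                        external="https://lore.kernel.org/all/"):
    """Build the 'lei q' command line for an IMAP destination folder."""
    folder = separator.join(folder_parts)
    return ["lei", "q", "-I", external, "-d", "mid",
            "-o", f"imaps://{host}/{folder}"]


def run_lei_query(argv, query):
    """Run lei with the query on stdin, like the heredoc examples above."""
    return subprocess.run(argv, input=query + "\n", text=True, check=True)


# Same destination as above, but for a server that uses '.' as separator.
argv = lei_imap_query_argv("mail.codeaurora.org", ["INBOX", "torvalds"], ".")
print(" ".join(argv))
# run_lei_query(argv, "f:torvalds AND rt:1.month.ago..")  # needs lei installed
```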

Refreshing and subscribing to IMAP folders

If the above command succeeded, then you should be able to view the IMAP folder in your mail client. If you cannot see torvalds in your list of available folders, then you may need to refresh and/or subscribe to the newly created folder. The process will be different for every mail client, but it shouldn't be too hard to find.

The same with Migadu

If you have a linux.dev account (see https://korg.docs.kernel.org/linuxdev.html), then you probably already know that we ask you not to use your account for subscribing to busy mailing lists. This is due to Migadu imposing soft limits on how much incoming email is allowed for each hosted domain — so using lei + IMAP is an excellent alternative.

To set this up with your linux.dev account (or any other account hosted on Migadu), use the following command:

lei q -I https://lore.kernel.org/all/ -d mid \
  -o imaps://imap.migadu.com/lei/torvalds \
  <<< 'f:torvalds AND rt:1.month.ago..'

Again, you will need to subscribe to the new lei/torvalds folder to see it in your mail client.

The same with Gmail

If you are a Gmail user and aren't already using IMAP, then you will need to jump through a few additional hoops before you are able to get going. Google is attempting to enhance the security of your account by restricting how much can be done with just your Google username and password, so services like IMAP are not available without setting up a special “app password” that can only be used for mail access.

Enabling app passwords requires that you first enable 2-factor authentication, and then generate a random app password to use with IMAP. Please follow the process described in the following Google document: https://support.google.com/mail/answer/185833

Once you have the app password for use with IMAP, you can use lei and imaps just like with any other IMAP server:

lei q -I https://lore.kernel.org/all/ -d mid \
  -o imaps://imap.gmail.com/lei/torvalds \
  <<< 'f:torvalds AND rt:1.month.ago..'

It requires a browser page reload for the folder to show up in your Gmail web UI.

Automating lei up runs

If you're setting up IMAP access, then you probably want IMAP updates to happen behind the scenes without your direct involvement. All you need to do is periodically run lei up --all (plus -q if you don't want non-critical output).

If you're just getting started, then you can set up a simple screen session with a watch command at a 10-minute interval, like so:

watch -n 600 lei up --all

You can then detach from the screen terminal and let that command continue behind the scenes. The main problem with this approach is that it won't survive a system reboot, so if everything is working well and you want to make the command a bit more permanent, you can set up a systemd user timer.

Here's the service file to put in ~/.config/systemd/user/lei-up-all.service:

[Unit]
Description=lei up --all service
ConditionPathExists=%h/.local/share/lei

[Service]
Type=oneshot
ExecStart=/usr/bin/lei up --all -q

[Install]
WantedBy=mail.target

And the timer file to put in ~/.config/systemd/user/lei-up-all.timer:

[Unit]
Description=lei up --all timer
ConditionPathExists=%h/.local/share/lei

[Timer]
OnUnitInactiveSec=10m

[Install]
WantedBy=default.target

Enable the timer:

systemctl --user enable --now lei-up-all.timer

You can use journalctl -xn to view the latest journal messages and make sure that the timer is running well.

CAUTION: user timers only run when the user is logged in. This is not actually that bad, as your keyring is not going to be unlocked unless you are logged into the desktop session. If you want to run lei up as a background process on some server, you should set up a system-level timer and use a different git-credential mechanism (e.g. store) — and you probably shouldn't do this on a shared system where you have to worry about your account credentials being stolen.

Coming up next

In the next installment we'll look at some other part of lei and public-inbox... I haven't yet decided which. :)

 

from metan's blog

As usual we had an LTP release at the end of September. What was unusual, though, is the number of patches that went in: we got 483 patches, which is about 150 more than in the last three releases. And the number of patches was slowly growing even before that.

While it's great and I'm happy that the project is growing, there is a catch: growth like this puts additional strain on the maintainers, particularly on the patch reviewers. For me it was +120 patches reviewed during the four-month period, and that only counts the final versions of patches that were accepted into the repository; it's not unusual to have three or more revisions before the work is ready to be merged.

While I managed to cope with it reasonably well, the work that I had on my TODO list for the project stalled. One of the things I finally want to move forward is making runltp-ng the official LTP test runner, but there is much more. So the obvious question is how to make things better, and one of the things we came up with was automation.

What we implemented for LTP is 'make check', which runs different tools on the test source code and is supposed to be used before a patch is sent for review. For C code we use the well known checkpatch.pl and a custom sparse based checker to identify the most common problems. The tooling is set up automatically when you call 'make check' for the first time, and we tried to make it as effortless as possible, so that there is no reason not to use it during development. We also use checkbashism.pl for shell code, and hopefully the number of checks will grow over time. Hopefully this should eliminate on average at least one revision per patchset, which would be hundreds of patches during our development cycle.

Ideally this will fix the problem for a while and we will make more effective use of our resources, but eventually we will get to a point where more maintainers and reviewers are needed, which is a problem that is hard to solve without your help.

 

from metan's blog

We have reached an important milestone with the latest LTP release – the number of testcases written in the new test library finally outnumbers the number of old library tests. This is a nice opportunity for a small celebration and also to look back a bit into history and try to summarize what has happened over the last 10 years in LTP.

I joined LTP development a bit more than 10 years ago, in 2009. At that point we were really struggling with the basics. The build system was a collection of random Makefiles and the build often failed for very random reasons. There were pieces of shell code embedded in Makefiles, for instance to check for devel libraries, manually written shell loops over directories that prevented parallel builds, and all kinds of ugly mess like that. This changed at the end of 2009 when the build system was rewritten; with that, LTP supported proper parallel builds, started to use autoconf for feature checks, etc. We also switched from CVS to Git at the end of 2009, which was a huge improvement as well.

However, that was only a start. LTP was easier to build and git was nicer to use, but we still had tests that were mostly failing, and a fair number of the tests were producing nothing but noise. There were also tests that didn't produce real results and always passed, but it's really hard to count these unless you review the code carefully one testcase at a time, which is part of what we still do even after ten years of work.

From that point on it took us a few years to clear the worst parts and to deal with most of the troublemakers, and the results from LTP were gradually getting greener and more stable as well. We are far from being bug-free, and there are still parts covered in dust that are waiting for attention, but we are getting there. For instance, in this release we finally got a nice cgroup test library that simplifies cgroup testcases, and we should fix the rest of the cgroup tests, ideally before the next one. I'm also quite happy that the manpower put into LTP development is slowly increasing; however, compared to the effort put into kernel development, the situation is still dire. Back then I used to tell people that the amount of work put into Linux automated testing was a bad joke. These days it's much better but still hardly optimal, as we struggle to keep up with covering newly introduced kernel features.

At the start I mentioned the new test library, so I should explain how we came to this and why it's superior to what we had previously. First of all, there was a test library in LTP that could be traced back to SGI and was released under the GPL more than 20 years ago; it's probably even older than that, though. The main problem with the library was that it was cumbersome to use. There were some result reporting functions in the API, but these were not thread safe, nor could they be used in child processes. You had to propagate test results manually in these two cases, which was prone to errors. Even worse, since the test implemented the main() function, you had to return the overall result manually as well, and forgetting to do so was one of the common mistakes. At the point where most of the broken tests were finally fixed, I had a bit of time to invest into the future, and after seven years of dealing with common test mistakes I had a pretty good picture of what a test library should look like and what should be avoided. Hence I sat down and designed a library that is nice and fun to use and makes tests much easier to write. This library is still evolving over time; the version introduced in 2016 wasn't as nice as it is now, but even when it was introduced it included the most important bits, for instance thread safe and automatic test result propagation, or synchronization primitives that could be used even to synchronize shell code against a C binary.

The old library is still present in LTP since we are a bit more than halfway done converting the tests, which is no easy task since we still have more than 600 tests to go. And as we are converting the tests we are also reviewing them to ensure that the assertions are correct and the coverage isn't lacking. We still find tests that fail to report results from time to time even now, which only shows how hard it is to eliminate mistakes like this and why preventing them in the first place is the right thing to do. And if things go well, the rest of the tests should be converted in about 5 years and LTP should finally be free of the historical baggage. At that point I guess that I will throw a small celebration, since that would conclude a huge task I've been working on for a decade now.

 

from linusw

This is a retrospective of my work on the KASan Kernel Address Sanitizer for the ARM32 platform. The name is a pun on the diving decompression stop, which is something you perform after going down below the surface to avoid decompression sickness.

Where It All Began

The AddressSanitizer (ASan) is a really clever invention by Google, hats off. It is one of those development tools that, like git, just take on the world in a short time. It was invented by some smart Russians, especially Андрей Коновалов (Andrey Konovalov) and Дмитрий Вьюков (Dmitry Vyukov). It appears to be not just funded by Google but also part of a PhD thesis work.

The idea with ASan is to help ensure memory safety by intercepting all memory accesses through compiler instrumentation, and consequently providing “ASan splats” (runtime problem detections) while stressing the code. Code instrumented with ASan gets significantly slower than normal and uses up a bunch of memory for “shadowing” (I will explain this) making it a pure development tool: it is not intended to be enabled on production systems.

The way that ASan instruments code is by linking every load and store into symbols like these:

void __asan_load1(unsigned long addr);
void __asan_store1(unsigned long addr);
void __asan_load2(unsigned long addr);
void __asan_store2(unsigned long addr);
void __asan_load4(unsigned long addr);
void __asan_store4(unsigned long addr);
void __asan_load8(unsigned long addr);
void __asan_store8(unsigned long addr);
void __asan_load16(unsigned long addr);
void __asan_store16(unsigned long addr);

As you can guess, these calls load or store 1, 2, 4, 8 or 16 bytes of memory in a chunk at the virtual address addr, and reflect how the compiler thinks that the compiled code (usually C) thinks about these memory accesses. ASan intercepts all reads and writes by placing itself between the executing program and any memory management. The above symbols can be implemented by any runtime environment. The address will reflect what the runtime environment assumed by the compiler thinks about the (usually virtual) memory where the program will execute.

You will instrument your code with ASan, toss heavy test workloads on the code, and see if something breaks. If something breaks, you go and investigate the breakage to find the problem. The problem will often be one or another instance of buffer overflow, out-of-bounds array or string access, or the use of a dangling pointer such as use-after-free. These problems are a side effect of using the C programming language.

When resolving the mentioned load/store symbols, ASan instrumentation is based on shadow memory, and this is in turn based on the idea that one bit shadows one byte, so a single shadow byte “shadows” 8 bytes of memory. You therefore allocate 1/8 the amount of memory that your instrumented program will use and shadow that to some other memory using an offset calculation like this:

shadow = (address >> 3) + offset

The shadow memory is located at offset, and if our instrumented memory is N bytes then we need to allocate N/8 = N >> 3 bytes to be used as shadow memory. Notice that I say instrumented memory not code: ASan shadows not only the actual compiled code but mainly and most importantly any allocations and referenced pointers the code maintains. Also the DATA (constants) and BSS (global variables) parts of the executable image are shadowed. To achieve this the userspace links to a special malloc() implementation that overrides the default and manages all of this behind the scenes. One aspect of it is that malloc() will of course return chunks of memory naturally aligned to 8, so that the shadow memory will be on an even byte boundary.
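
The offset calculation can be sketched in a few lines of plain C. This is only an illustration and not the ASan runtime: the helper names mem_to_shadow() and shadow_size() are made up for the example, and the offset is whatever the runtime environment picks.

```c
#include <assert.h>
#include <stdint.h>

/* One shadow byte covers 2^3 = 8 bytes of instrumented memory. */
#define SHADOW_SCALE_SHIFT 3

/* Hypothetical helper: shadow = (address >> 3) + offset */
static uintptr_t mem_to_shadow(uintptr_t addr, uintptr_t shadow_offset)
{
	return (addr >> SHADOW_SCALE_SHIFT) + shadow_offset;
}

/* Hypothetical helper: shadow bytes needed for n instrumented bytes. */
static uintptr_t shadow_size(uintptr_t n)
{
	/* n is expected to be a multiple of 8, like the aligned allocations */
	assert((n & 7) == 0);
	return n >> SHADOW_SCALE_SHIFT;
}
```

With an example offset of 0x20000000, the addresses 0x1000 through 0x1007 all land on the same shadow byte at 0x20000200, and 0x1008 lands on the next one.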

Figure: ASan shadow memory. The ASan shadow memory shadows the memory you're interested in inspecting for memory safety.

The helper library will allocate shadow memory as part of the runtime, and use it to shadow the code, data and runtime allocations of the program. Some will be allocated up front as the program is started, some will be allocated to shadow allocations at runtime, such as dynamically allocated buffers or anything else you malloc().

The error detection was based on the observation that a shadowing byte with each bit representing an out-of-bounds access error can have a “no error” state (0x00) and 8 error states, in total 9 states. Later on a more elaborate scheme was adopted. Values 1..7 indicate how many of the bytes are valid for access (if you malloc() just 5 bytes then it will be 5) and then there are magic bytes for different conditions.

When a piece of memory is legally allocated and accessed the corresponding bits are zeroed. Uninitialized memory is “poisoned”, i.e. set to a completely illegal value != 0. Further SLAB allocations are padded with “red zones” poisoning memory in front and behind of every legal allocation. When accessing a byte in memory, it is easy to verify that the access is legal: is the shadow byte == 0? That means all 8 bytes can be freely accessed and we can quickly proceed. Else we need a closer look. Values 1 through 7 mean that only the first 1 through 7 bytes are valid for access (partly addressable), so we check that, and any other value means uh oh.

  • 0xFA and 0xFB mean we have hit a heap left/right redzone so an out-of-bounds access has happened
  • 0xFD means access to a freed heap region, so use-after-free
  • etc

Decoding the hex values gives a clear insight into what access violation we should be looking for.
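
The check described above can be sketched as a small C function. The name access_allowed() is hypothetical and this is not the ASan source, just the logic from the text: zero means fine, 1 through 7 means partly addressable, anything else is a magic value. The sketch assumes the access does not straddle two 8-byte granules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether an access of `size` bytes at `addr`
 * is allowed, given the shadow byte covering that 8-byte granule.
 */
static bool access_allowed(uint8_t shadow, uintptr_t addr, size_t size)
{
	size_t last = (addr & 7) + size - 1;	/* offset of last byte touched */

	assert(size >= 1 && last <= 7);

	if (shadow == 0)	/* all 8 bytes addressable: the fast path */
		return true;
	if (shadow <= 7)	/* only the first `shadow` bytes are valid */
		return last < shadow;
	return false;		/* magic value: redzone, freed memory, ... */
}
```

For a 5-byte allocation the shadow byte is 5, so a 4-byte read at offset 0 passes while a 1-byte read at offset 5 is flagged.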

To be fair the one bit per byte (8-to-1) mapping is not compulsory. This should be pretty obvious. Other schemes such as mapping even 32 bytes to one byte have been discussed for memory-constrained systems.

All memory access calls (such as any instance of dereferencing a pointer) and all functions in the library such as all string functions are patched to check for these conditions. It's easy when you have compiler instrumentation. We check it all. It is not very fast but it's bearable for testing.

Researchers in one camp argue that we should all be writing software (including operating systems) in the programming language Rust in order to avoid the problems ASan is trying to solve altogether. This is a good point, but rewriting large existing software such as the Linux kernel in Rust is not seen as realistic. Thus we paper over the problem instead of using the silver bullet. Hybrid approaches to using Rust in kernel development are being discussed but so far not much has materialized.

KASan Arrives

The AddressSanitizer (ASan) was written with userspace in mind, and the userspace project is very much alive.

As the mechanism used by ASan was quite simple, and the compilers were already patched to handle shadow memory, the Kernel Address Sanitizer (KASan) was a natural step. At this point (2020) the original authors seem to spend a lot of time with the kernel, so the kernel hardening project has likely outgrown the userspace counterpart.

The magic values assigned to shadow memory used by KASan are different:

  • 0xFA means the memory has been freed so accessing it means use-after-free.
  • 0xFB is a freed managed resource (devm_* accessors) in the Linux kernel.
  • 0xFC and 0xFE mean we access a kmalloc() redzone indicating an out-of-bounds access.

This is why these values often occur in KASan splats. The full list of specials (not very many) can be found in mm/kasan/kasan.h.

The crucial piece to create KASan was a compiler flag to indicate where to shadow the kernel memory: when the kernel Image is linked, addresses are resolved to absolute virtual memory locations, and naturally all of these, plus the area where kernel allocates memory (SLABs) at runtime need to be shadowed. As can be seen in the kernel Makefile.kasan include, this boils down to passing the flags -fsanitize=kernel-address and -asan-mapping-offset=$(KASAN_SHADOW_OFFSET) when linking the kernel.

The kernel already had some related tools, notably kmemcheck which can detect some cases of use-after-free and references to uninitialized memory. It was based on a slower mechanism so KASan has since effectively superseded it, and kmemcheck was removed.

KASan was added to the kernel in a commit dated February 2015 along with support for the x86_64 architecture.

To exercise the kernel to find interesting bugs, the inventors were often using syzkaller, a tool similar to the older Trinity: it bombs the kernel with fuzzy system calls to try to provoke undefined and undesired behaviours yielding KASan splats and revealing vulnerabilities.

Since the kernel is the kernel we need to explicitly assign memory for shadowing. Since we are the kernel we need to do some maneuvers that userspace cannot do or does not need to do:

  • During early initialization of the kernel we point all shadow memory to a single page of just zeroes making all accesses seem fine until we have proper memory management set up. Userspace programs do not need this phase as “someone else” (the C standard library) handles all memory set up for them.
  • Memory areas which are just big chunks of code and data can all point to a single physical page with poison. In the virtual memory it might look like kilobytes and megabytes of poison bytes but it all points to the same physical page of 4KB.
  • We selectively de-instrument code as well: code like KASan itself, the memory manager per se, or the code that patches the kernel for ftrace, or the code that unwinds the stack pointer for a kernel splat clearly cannot be instrumented with KASan: it is part of the design of these facilities to poke around at random locations in memory, it's not a bug. Since KASan was added all of these sites in the generic kernel code have been de-instrumented, more or less.

Once these generic kernel instrumentations were in place, other architectures could be amended with KASan support, with ARM64 following x86 soon in the autumn of 2015.

Some per-architecture code, usually found in arch/xxxx/mm/kasan_init.c is needed for KASan. What this code does is to initialize the shadow memory during early initialization of the virtual memory to point to a “zero page” and later on to populate all the shadow memory with poisoned shadow pages.

The shadow memory is special and needs to be populated accessing the very lowest layer of the virtual memory abstraction: we manipulate the page tables from the top to bottom: pgd, p4d, pud, pmd, pte to make sure that the $(KASAN_SHADOW_OFFSET) points to memory that has valid page table entries.

We need to use the kernel memblock early memory management to set up memory to hold the page tables themselves in some cases. The memblock memory manager also provides us with a list of all the kernel RAM: we loop over it using for_each_mem_range() and populate the shadow memory for each range. As mentioned we first point all shadows to a zero page, and later on to proper KASan shadow memory, and then KASan kicks into action.

A special case happens when moving from using the “zero page” KASan memory to proper shadow memory: we would risk running kernel threads into partially initialized shadow memory and pull the ground out under ourselves. Not good. Therefore the global page table for the entire kernel (the one that has all shadow memory pointing to a zero page) is copied and used during this phase. It is then replaced, finally, with the proper KASan-instrumented page table with pointers to the shadow memory in a single atomic operation.

Further all optimized memory manipulation functions from the standard library need to be patched as these often have assembly-optimized versions in the architecture. This concerns memcpy(), memmove() and memset() especially. While the quick optimized versions are nice for production systems, we replace these with open-coded variants that do proper memory accesses in C and therefore will be intercepted by KASan.
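
To illustrate what “open-coded” means here, this is a rough sketch of byte-by-byte variants written in plain C. The names plain_memcpy() and plain_memset() are made up for the example and this is not the kernel's actual code: the point is only that every load and store is an ordinary C memory access the compiler can instrument.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name: a byte-by-byte memcpy() written in plain C. */
static void *plain_memcpy(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	assert(n == 0 || (d != NULL && s != NULL));

	while (n--)
		*d++ = *s++;	/* each load and store can be instrumented */
	return dest;
}

/* Hypothetical name: a byte-by-byte memset() written in plain C. */
static void *plain_memset(void *dest, int c, size_t n)
{
	unsigned char *d = dest;

	while (n--)
		*d++ = (unsigned char)c;
	return dest;
}
```

Compared to a hand-written assembly routine this is slow, which is an acceptable trade-off in a debug build.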

All architectures follow this pattern, more or less. ARM64 supports hardware tags, which essentially means that the architecture supports hardware acceleration of KASan. As this is pretty fast, there is a discussion about using KASan even on production systems to capture problems seen very seldom.

KASan on ARM32

Then there was the attempt to provide KASan for ARM32.

The very first posting of KASan in 2014 was actually targeting x86 and ARM32 and was already working-kind-of-prototype-ish on ARM32. This did not proceed. The main reason was that when using modules, these are loaded into a designated virtual memory area rather than the kernel “vmalloc area” which is the main area used for memory allocations and what most architectures use. So when trying to use loadable modules the code would crash as this RAM was not shadowed.

The developers tried to create the option to move modules into the vmalloc area and enable this by default when using KASan to work around this limitation.

The special module area is however used for special reasons. Since it was placed in close proximity to the main kernel TEXT segment, the code could be accessed using short jumps rather than long jumps: no need to load the whole 32-bit program counter anew whenever a piece of code loaded from a module was accessed. It made code in modules as quick as normal compiled-in kernel code +/– cache effects. This provided serious performance benefits.

As a result KASan support for ARM was dropped from the initial KASan proposal and the scope was limited to x86, then followed by ARM64. “We will look into this later”.

In the spring of 2015 I started looking into KASan and was testing the patches on ARM64 for Linaro. In June I tried to get KASan working on ARM32. Andrey Ryabinin pointed out that he actually had KASan running on ARM32. After some iterations we got it working on some ARM32 platforms and I was successfully stressing it a bit using the Trinity syscall fuzzer. This version solved the problem of shadowing the loadable modules by simply shadowing all that memory as well.

The central problem with running KASan on a 32-bit platform as opposed to a 64-bit platform was that the simplest approach used up 1/8 of the whole address space which was not a problem for 64-bit platforms that have ample virtual address space available. (Notice that the amount of physical memory doesn't really matter, as KASan will use the trick to just point large chunks of virtual memory to a single physical poison page.) On 32-bit platforms this approach ate our limited address space for lunch.

We were setting aside several static assigned allocations in the virtual address space, so we needed to make sure that we only shadow the addresses actually used by the kernel. We would not need to shadow the addresses used by userspace and the shadow memory virtual range requirement could thus be shrunk from 512 MB to 130 MB for the traditional 3/1 GB kernel/userspace virtual address split used on ARM32. (To understand this split, read my article How the ARM32 Kernel Starts which tries to tell the story.)
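
The numbers can be checked with a bit of arithmetic. The sketch below uses a hypothetical helper shadow_bytes(). Shadowing the full 4 GB address space costs 512 MB, and the 1 GB of kernel space alone costs 128 MB. The 130 MB figure is consistent with also shadowing a 16 MB module area just below 0xC0000000, but that module area size and placement is an assumption of this example, not something stated above.

```c
#include <assert.h>
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/* Hypothetical helper: shadow bytes needed to cover [start, end] inclusive. */
static uint64_t shadow_bytes(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return (end - start + 1) >> 3;	/* one shadow byte per 8 bytes */
}
```

Here shadow_bytes(0xBF000000, 0xFFFFFFFF) covers 1040 MB of virtual addresses and comes out at 130 MB.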

Sleeping Beauty

This more fine-grained approach to assigning shadow memory would create some devil-in-the-details bugs that will not come out if you shadow the whole virtual address space, as the 64-bit platforms do.

A random access to some memory that should not be poked (and thus lacking shadow memory) will lead to a crash. While QEMU and some hardware was certainly working, there were some real hardware platforms that just would not boot. Things were getting tedious.

KASan for ARM32 development ground to a halt because we were unable to hash out the bugs. The initial patches from Andrey started trading hands and these out-of-tree patches were used by some vendors to test code on some hardware.

Personally, I had other assignments and could not take over and develop KASan at this point. I'm not saying that I was necessarily a good candidate at the time either, I was just testing and tinkering with KASan because ARM32 vendors had a generic interest in the technology.

As a result, KASan for ARM32 was pending out-of-tree for almost 5 years. In 2017 Abbot Liu was working on it and fixed up the support for LPAE (large physical address extension) and in 2019 Florian Fainelli picked up where Abbot left off.

Some things were getting fixed along the road, but what was needed was some focused attention and these people had other things on their plate as well.

Finally Fixing the Bugs

In April 2020 I decided to pick up the patches and have a go at it. I sloppily named my first iteration “v2” while it was something like v7.

I quickly got support from two key people: Florian Fainelli and Ard Biesheuvel. Florian had some systems with the same odd behaviour of just not working as my pet Qualcomm APQ8060 DragonBoard that I had been using all along for testing. Ard was using the patches for developing and debugging things like EFI and KASLR.

During successive iterations we managed to find and patch the remaining bugs, one by one:

  • A hard-coded bitmask assuming thread size order to be 1 (8192 bytes) on ARMv4 and ARMv5 silicon made the kernel crash when entering userspace. KASan increases the thread order so that there would be space for redzones before and after allocations, so it needed more space. After reading assembly one line at a time I finally figured this out and patched it.
  • The code was switching MMU table by simply altering the TTBR0 register. This worked in some machines, especially ARMv7 silicon, but the right way to do it was to use the per-CPU macro cpu_switch_mm() which looks intuitive but is an ARM32-ism which is why the original KASan authors didn't know about it. This macro accounts for tiny differences between different ARM cores, some even custom to certain vendors.
  • Something fishy was going on with the attached device tree. It turns out, after much debugging, that the attached device tree could end up in memory that was outside of the kernel 1:1 physical-to-virtual mapping. The page table entries that would have translated the physical memory area where the device tree was stored were wiped clean, yielding a page fault. The problem was not caused by KASan per se: it was a result of the kernel getting over a certain size, and all the instrumentation added to the kernel makes it bigger to the point that it revealed the bug. We were en route to fix a bug related to big compressed kernel images. I developed debugging code specifically to find this bug and then made a patch for this making sure not to wipe that part of the mapping. (This post gives a detailed explanation of the problem.) Ard quickly came up with a better fix: let's move the device tree to a determined place in the fixed mappings and handle it as if it was a ROM.

These were the major roadblocks. Fixing these bugs created new bugs which we also fixed. Ard and Florian mopped up the fallout.

In the middle of development, five level page tables were introduced and Mike Rapoport made some generic improvements and cleanup to the core memory management code, so I had to account for these changes as well, effectively rewriting the KASan ARM32 shadow memory initialization code. At one point I also broke the LPAE support and had to repair it.

Eventually the v16 patch set was finalized in October 2020 and submitted to Russell King's patch tracker and he merged it for Linux v5.11-rc1.

Retrospect

After the fact three things came out nice in the design of KASan for ARM32:

  • We do not shadow or intercept highmem allocations, which is nice because we want to get rid of highmem altogether.
  • We do not shadow the userspace memory, which is nice because we want to move userspace to its own address space altogether.
  • Personally I finally got a detailed idea of how the ARM32 kernel decompresses and starts, and the abstract concepts of highmem, lowmem, and the rest of those wild animals. I have written three different articles on this blog as a result, with ideas for even more of them. By explaining how things work to others I realize what I can't explain and as a result I go and research it.

Andrey and Dmitry have since worked on not just ASan and KASan but also on what was initially called the KernelThreadSanitizer (KTSAN) but which was eventually merged under the name KernelConcurrencySanitizer (KCSAN). The idea is again to use shadow memory, but now for concurrency debugging at runtime. I do not know more than this.

 

from linusw

This is the continuation of Setting Up the ARM32 Architecture, part 1.

As a recap, we have achieved the following:

  • We are executing in virtual memory
  • We figured out how to get the execution context of the CPU from the init task with task ID 0 using the sp register and nothing else
  • We have initialized the CPU
  • We have identified what type of machine (ARM system) we are running on
  • We have enumerated and registered the memory blocks available for the kernel to use with the primitive memblock memory manager
  • We processed the early parameters earlyparams
  • We have provided early fixmaps and early ioremaps
  • We have identified lowmem and highmem bounds

Memory Blocks – Part 2

We now return to the list of memory available for the Linux kernel.

arm_memblock_init() in arch/arm/mm/init.c is called, resulting in a number of memory reservations of physical memory the Linux memory allocator can NOT use, given as physical start address and size. So we saw earlier that memblock stores a list of available blocks of memory, and in addition to that it can set aside reserved memory.

For example the first thing that happens is this:

memblock_reserve(__pa(KERNEL_START), KERNEL_END - KERNEL_START);

It makes perfect sense that the kernel cannot use the physical memory occupied by the kernel – the code of the kernel we are executing. KERNEL_END is set to _end which we know from previous investigation to cover not only the TEXT segment but also BSS of the kernel, i.e. all the memory the kernel is using.

Next we call arm_mm_memblock_reserve() which will reserve the memory used by the kernel initial page table, also known as swapper_pg_dir. It would be unfortunate if we overwrote that with YouTube videos.

Finally we reserve the memory used by the device tree (at the moment I write this there is a patch pending to fix a bug here) and any other memory reservations defined in the device tree.

A typical example of a memory reservation in the device tree is if you have a special video ram (VRAM). The following would be a typical example:

        reserved-memory {
                #address-cells = <1>;
                #size-cells = <1>;
                ranges;
 
                /* Chipselect 3 is physically at 0x4c000000 */
                vram: vram@4c000000 {
                            /* 8 MB of designated video RAM */
                            compatible = "shared-dma-pool";
                            reg = <0x4c000000 0x00800000>;
                            no-map;
                };
        };

This specific memory block (taken from the Versatile Express reference design) will be outside of the physical RAM memory and not disturb any other allocations, but it uses the very same facility in the device tree: anything with compatible “shared-dma-pool” will be set aside for special use.

When chunks of common (non-special-purpose) RAM are set aside these chunks are referred to as “carveouts”. A typical use of such carveouts is media buffers for video and audio.

Next, before we have started to allocate any memory on the platform we set aside memory to be used for contiguous memory allocation (CMA) if this memory manager is in use. The CMA memory pool can be used for other things than contiguous memory, but we cannot have unmovable allocations in there, so we better flag this memory as “no unmoveable allocations in here” as soon as possible.

As CMA is so good at handling contiguous memory it will be used to handle the random carveouts and special memory areas indicated by “shared-dma-pool” as well. So be sure to select the Kconfig symbols CMA and DMA_CMA if you use any of these.

Next we call memblock_dump_all() which will show us nothing, normally. However if we pass the command line parameter memblock=debug to the kernel we will get a view of what things look like, first how much memory is available in total and how much is reserved in total for the things we outlined above, and then a detailed list of the memory available and set aside, similar to this:

MEMBLOCK configuration:
 memory size = 0x08000000 reserved size = 0x007c1256
 memory.cnt  = 0x1
 memory[0x0]    [0x00000000-0x07ffffff], 0x08000000 bytes flags: 0x0
 reserved.cnt  = 0x3
 reserved[0x0]    [0x00004000-0x00007fff], 0x00004000 bytes flags: 0x0
 reserved[0x1]    [0x00008400-0x007c388b], 0x007bb48c bytes flags: 0x0
 reserved[0x2]    [0x00c38140-0x00c39f09], 0x00001dca bytes flags: 0x0
 reserved[0x3]    [0xef000000-0xefffffff], 0x01000000 bytes flags: 0x0

This is taken from the ARM Versatile reference platform and shows that we have one big chunk of memory that is 0x08000000 (128 MB) in size. In this memory we have chiseled out three reservations of in total 0x007C1256 (a bit more than 7 MB). The first reservation is the page table (swapper_pg_dir), the second is the kernel TEXT and BSS, and the third is the DTB. The last reservation listed is the CMA pool of 16 MB.
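
The sizes in the dump can be cross-checked with a small sketch: each range is inclusive, so its size is last - first + 1, and the first three reserved ranges add up to exactly the reported reserved size (the 16 MB range at 0xef000000 is not part of that figure). The struct and function names are toy stand-ins for this example, not the kernel's memblock types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for one line of the dump; not the kernel's struct. */
struct region {
	uint32_t first;	/* first byte of the range */
	uint32_t last;	/* last byte of the range, inclusive */
};

static uint32_t region_size(const struct region *r)
{
	assert(r->last >= r->first);
	return r->last - r->first + 1;
}

static uint32_t total_size(const struct region *r, size_t count)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < count; i++)
		sum += region_size(&r[i]);
	return sum;
}

/* The first three reserved ranges from the dump above. */
static const struct region reserved[] = {
	{ 0x00004000, 0x00007fff },	/* initial page table */
	{ 0x00008400, 0x007c388b },	/* kernel TEXT and BSS */
	{ 0x00c38140, 0x00c39f09 },	/* DTB */
};
```

Calling total_size(reserved, 3) gives 0x007c1256, matching the “reserved size” line.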

After this point, we know what memory in the system we can and cannot use: early memory management using memblock is available. This is a crude memory management mechanism that will let you do some rough memory reservations before the “real” memory manager comes up. We can (and will) call memblock_alloc() and memblock_phys_alloc() to allocate virtual and physical memory respectively.

If you grep the kernel for memblock_alloc() calls you will see that this is not a common practice: a few calls here and there. The main purpose of this mechanism is for the real memory manager to bootstrap itself: we need to allocate memory to be used for the final all-bells-and-whistles memory management: data structures to hold the bitmaps of allocated memory, page tables and so on.

At the end of this bootstrapping mem_init() will be called, and that in turn calls memblock_free_all() and effectively shuts down the memblock mechanism. But we are not there yet.

After this excursion among the memblocks we are back in setup_arch() and we call adjust_lowmem_bounds() a second time, as the reservations and carveouts in the device tree may have removed memory from underneath the kernel. Well we better take that into account. Redo the whole thing.

Setting up the paging

We now initialize the page table. early_ioremap_reset() is called, turning off the early ioremap facility (it cannot be used while the paging proper is being set up) and then we enable proper paging with the call to paging_init(). This call is really interesting and important: this is where we set up the system to perform the lower levels of proper memory management.

This is a good time to recap the inner workings of the ARM32 page tables before you move along. The relationship between kernel concepts such as PGD, PMD and PTE and the corresponding ARM level-1, level-2 and on LPAE level-3 page table descriptors needs to be familiar.

The first thing we do inside paging_init() is to call prepare_page_table(). What it does is to go into the PGD and clear all the PMDs that are not in use by the kernel. As we know, the PGD has 4096 32bit/4-bytes entries on the classic ARM32 MMU grouped into 2048 PMDs and 512 64bit/8 byte entries on LPAE corresponding to one PMD each. Each of them corresponds to a 2MB chunk of memory. The action is going to hit the swapper_pg_dir at 0xC0004000 on classic MMUs or a combination of PGD pointers at 0xC0003000 and PMD pointers at 0xC0004000 on LPAE MMUs.
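
As a sketch of how an address maps to a PMD on the classic MMU, assuming swapper_pg_dir at 0xC0004000 as quoted above: each PMD covers 2 MB and occupies 8 bytes (two 4-byte level-1 entries), so the PMD for a virtual address sits at the table base plus (address >> 21) * 8. The helper name toy_pmd_addr() is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SWAPPER_PG_DIR	0xC0004000UL	/* classic MMU location quoted above */
#define PMD_SHIFT	21		/* 2 MB per PMD */
#define PMD_BYTES	8		/* two 4-byte level-1 entries */

/* Hypothetical helper: virtual address of the PMD covering `vaddr`. */
static uint32_t toy_pmd_addr(uint32_t vaddr)
{
	uint32_t index = vaddr >> PMD_SHIFT;

	assert(index < 2048);
	return (uint32_t)(SWAPPER_PG_DIR + index * PMD_BYTES);
}
```

With these numbers the PMD covering the start of the kernel mapping at 0xC0000000 ends up at 0xC0007000.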

Our 1 MB section mappings currently covering the code we are running and all other memory we use are first level page table entries, so these are covering 1 MB of virtual memory on the classic MMU and 2 MB of memory on LPAE systems. However the size we advance with is defined as PMD_SIZE which will always be 2 MB so the loop clearing the PMDs looks like this:

for (addr = 0; addr < PAGE_OFFSET; addr += PMD_SIZE)
        pmd_clear(pmd_off_k(addr));

(Here I simplified the code by removing the execute-in-place (XIP) case.)

We advance one PMD_SIZE (2 MB) chunk at a time and clear all PMDs that are not used, here we clear the PMDs covering userspace up to the point in memory where the linear kernel mapping starts at PAGE_OFFSET.
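
To get a feel for the size of this loop, here is a toy userspace model of it. The array toy_pmd and the helpers are invented for the illustration and only mimic the shape of the kernel code; PAGE_OFFSET is assumed to be 0xC0000000.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET	0xC0000000UL	/* assumed start of the linear kernel mapping */
#define PMD_SIZE	0x00200000UL	/* 2 MB */
#define NUM_PMDS	2048		/* 4 GB / 2 MB */

/* Toy page directory: one word per PMD, nonzero meaning "mapped". */
static uint32_t toy_pmd[NUM_PMDS];

/* Toy version of pmd_off_k(): index the table by virtual address. */
static uint32_t *toy_pmd_off_k(unsigned long addr)
{
	assert(addr / PMD_SIZE < NUM_PMDS);
	return &toy_pmd[addr / PMD_SIZE];
}

/* Clear every PMD below PAGE_OFFSET; returns how many were cleared. */
static unsigned int toy_clear_user_pmds(void)
{
	unsigned int cleared = 0;

	for (unsigned long addr = 0; addr < PAGE_OFFSET; addr += PMD_SIZE) {
		*toy_pmd_off_k(addr) = 0;
		cleared++;
	}
	return cleared;
}
```

With a 3 GB userspace range the loop clears 1536 of the 2048 PMDs and leaves the ones from PAGE_OFFSET upward alone.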

We have stopped using sections of 1 MB or any other ARM32-specific level-1 descriptor code directly. We are using PMDs of size 2 MB and the generic kernel abstractions. We talk to the kernel about PMDs and not “level-1 descriptors”. We are one level up in the abstractions, removed from the mundane internals of the ARM32 MMU.

pmd_clear() will in practice set the entry to point at physical address zero and all MMU attributes will also be set to zero so the memory becomes inaccessible for read/write and any other operations, and then flush the translation lookaside buffer and the L2 cache (if present) so that we are sure that all this virtual memory now points to inaccessible zero – a known place. If we try to obtain an instruction or data from one of these addresses we will generate a prefetch abort or data abort.

Among the PMDs that get wiped out in this process are the two section-mapping PMDs that were 1-to-1-mapping __turn_mmu_on earlier in the boot, so we sweep the floor clean of some bootstrapping.

Next we clear all PMDs from the end of the first block of lowmem up until VMALLOC_START:

end = memblock.memory.regions[0].base + memblock.memory.regions[0].size;
if (end >= arm_lowmem_limit)
    end = arm_lowmem_limit;

for (addr = __phys_to_virt(end); addr < VMALLOC_START; addr += PMD_SIZE)
    pmd_clear(pmd_off_k(addr));

What happens here? We take the first memory block registered with the memblock mechanism handling physical memory, then we cap that off at arm_lowmem_limit. As we saw earlier, arm_lowmem_limit is in most cases set to the end of the highest memory block, so the two will mostly be the same thing unless we have several memory blocks or someone passed in command line parameters reserving a lot of vmalloc space. This procedure assumes that the kernel is loaded into the first available (physical) memory block. We then start at the end of that memory block (clearly above the kernel image) and clear all PMDs until we reach VMALLOC_START.

VMALLOC_START is the end of the virtual 1-to-1 mapping of the physical memory + 8 MB. If we have 512 MB of physical memory at physical address 0x00000000 then that ends at 0x1FFFFFFF and VMALLOC_START will be at 0x20000000 + PAGE_OFFSET + 0x00800000 = 0xE0800000. If we have 1 GB of physical memory VMALLOC_START will run into the highmem limit at 0xF0000000 and the end of the 1-to-1 physical mapping will naturally be there, so the VMALLOC_START will be at 0xF0800000 for anything with more than 768 MB memory. The 8 MB between the end of lowmem and VMALLOC_START is a “buffer” to catch stray references.
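
The arithmetic can be sketched as follows. This is a simplified model of the description above and not the kernel's macro: the helper toy_vmalloc_start() is hypothetical, RAM is assumed to start at physical address 0, PAGE_OFFSET is assumed to be 0xC0000000, and the 0xF0000000 cap is the limit quoted in the text.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET	0xC0000000UL
#define LOWMEM_CAP	0xF0000000UL	/* highmem limit quoted in the text */
#define VMALLOC_OFFSET	0x00800000UL	/* the 8 MB guard gap */

/*
 * Hypothetical helper: where vmalloc space starts when `ram_size` bytes of
 * RAM sit at physical address 0 (so virtual = physical + PAGE_OFFSET).
 */
static uint32_t toy_vmalloc_start(uint32_t ram_size)
{
	uint64_t lowmem_end = (uint64_t)PAGE_OFFSET + ram_size;

	assert(ram_size > 0);
	if (lowmem_end > LOWMEM_CAP)
		lowmem_end = LOWMEM_CAP;	/* the rest becomes highmem */
	return (uint32_t)(lowmem_end + VMALLOC_OFFSET);
}
```

With 512 MB of RAM this gives 0xE0800000, and from 768 MB upward the result stays at 0xF0800000.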

For example if a system has 512 MB of RAM starting at 0x00000000 in a single memory block and the kernel zImage is 8 MB in size and gets loaded into memory at 0x10008000-0x107FFFFF with the PGD (page global directory) at 0x10004000, we will start at address 0x20000000 translated to virtual memory 0xE0000000 and clear all PMDs for the virtual memory up until we reach VMALLOC_START at 0xE0800000.

The physical memory in the memblock that sits above and below the start of the kernel, in this example the physical addresses 0x00000000-0x0FFFFFFF and 0x10800000-0x1FFFFFFF will be made available for allocation as we initialize the memory manager: we have made a memblock_reserve() over the kernel and that is all that will actually persist – memory in lowmem (above the kernel image) and highmem (above the arm_lowmem_limit, if we have any) will be made available to the memory allocator as well.

This code only initializes the PMDs, i.e. the entries in the first level of the page table, to zero memory with zero access to protect us from hurting ourselves.

Figure: Clearing the PMDs. This image illustrates what happens when we initialize the PMDs (the picture is not to scale). The black blocks are the PMDs we clear, making them unavailable for any references for now. We have one block of physical memory from 0x00000000-0x1FFFFFFF (512 MB) and we clear out from the end of that block in virtual memory until we reach VMALLOC_START.

Mapping lowmem

Next we call map_lowmem() which is pretty self-describing. Notice that we are talking about lowmem here: the linear kernelspace memory map that is accessed by adding or subtracting an offset from/to the physical or virtual memory address. We are not dealing with the userspace view of the memory at all, only the kernel view of the memory.

We round two physical address pointers to the start and end PMDs (we round on SECTION_SIZE, i.e. 1 MB bounds) of the executable portions of the kernel. The lower part of the kernel (typically starting at address 0xC0008000) is the executable TEXT segment so the start of this portion of the virtual memory is assigned to pointer kernel_x_start and kernel_x_end is put rounded up to the next section for the symbol __init_end, which is the end of the executable part of the kernel.

The BSS segment and other non-executable segments are linked after the executable part of the kernel, so they sit in virtual memory above the executable part of the kernel.

Then we loop over all the memory blocks we have in the system and call create_mapping() where we first check the following conditions:

  • If the end of the memblock is below the start of the kernel the memory is mapped as readable/writeable/executable MT_MEMORY_RWX. This is a whole memory block up in userspace memory and similar. We want to be able to execute code up there. I honestly do not know why, but I can think about things such as small firmware areas that need to be executable, registered somewhere in a very low memblock.
  • If the start of the memblock is above kernel_x_end the memory is mapped as readable/writeable. No execution shall happen in the linear map above the executable kernel memory.

Next we reach the situation where the memblock is covering both the executable and the non-executable part of the kernel image. There is even a comment saying “this better cover the entire kernel” here: the whole kernel has to be inside one memory block under these circumstances or the logic will not work.

This is the most common scenario under all circumstances, such as my example with a single physical memory block. Most ARM32 systems are like this.

We then employ the following pretty intuitive mapping:

  • If the memblock starts below the executable part of the kernel kernel_x_start we chop off that part and map it as readable/writeable with MT_MEMORY_RW.
  • Then we map kernel_x_start to kernel_x_end as readable/writeable/executable with MT_MEMORY_RWX.
  • Then we map the last part of the kernel above kernel_x_end to the end of the memblock as readable/writeable with MT_MEMORY_RW.
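The decision logic described above can be sketched as a small standalone function. This is only an illustration of the three cases under my reading of the text: the struct, the enum and the example_ names are hypothetical, and the real map_lowmem() works on memblocks and struct map_desc rather than on this simplified output array.

```c
#include <stddef.h>
#include <stdint.h>

enum example_type { EX_RW, EX_RWX };

struct example_map {
    uint32_t start;
    uint32_t end;
    enum example_type type;
};

/*
 * Split the memory block [start, end) against the executable kernel
 * window [x_start, x_end). Writes at most three mappings into out[]
 * and returns how many were written.
 */
static size_t example_split_memblock(uint32_t start, uint32_t end,
                                     uint32_t x_start, uint32_t x_end,
                                     struct example_map out[3])
{
    size_t n = 0;

    if (end < x_start) {
        /* Whole block is below the kernel: read/write/execute. */
        out[n++] = (struct example_map){ start, end, EX_RWX };
    } else if (start >= x_end) {
        /* Whole block is above the executable part: read/write only. */
        out[n++] = (struct example_map){ start, end, EX_RW };
    } else {
        /* The block covers the kernel: chop it into up to three parts. */
        if (start < x_start)
            out[n++] = (struct example_map){ start, x_start, EX_RW };
        out[n++] = (struct example_map){ x_start, x_end, EX_RWX };
        if (x_end < end)
            out[n++] = (struct example_map){ x_end, end, EX_RW };
    }
    return n;
}
```

For a single 128 MB block that contains the kernel this yields the three mappings from the list: RW below the kernel text, RWX over it, and RW for the rest of the block.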

Remapping the kernel image: This illustrates how the memory above the kernel is readable/writeable, the upper part of the kernel image with the text segment is readable/writeable/executable while the lower .data and .bss part is just readable/writeable, then the rest of lowmem is also just readable/writeable. In this example we have 128MB (0x20000000) of memory and in the kernel image lowmem is mapped 0x00000000 -> 0xB0000000, 0x10000000 -> 0xC0000000 and 0x20000000 -> 0xD0000000. The granularity may require individual 4K pages so this will use elaborate page mapping.

The executable kernel memory is also writable because the PGD is in the first PMD sized chunk here – we sure need to be able to write to that – and several kernel mechanisms actually rely on being able to runtime-patch the executable kernel, even if we have already finalized the crucial physical-to-virtual patching. One example would be facilities such as ftrace.

Everything that fits in the linearly mapped lowmem above and below the kernel will be readable and writable by the kernel, which leads to some optimization opportunities – especially during context switches – but also to some problems, especially regarding the highmem. But let’s not discuss that right now.

map_lowmem() will employ create_mapping() which in turn will find out the PGD entries (in practice, on a classical MMU in this case that will be one entry of 32 bits/4 bytes in the level-1 page table) for the address we pass in (which will be PMD-aligned at 2 MB), and then call alloc_init_p4d() on that pgd providing the start and end address. We know very well that ARM32 does not have any five- or four-level page tables but this is where the generic nature of the memory manager comes into play: let’s pretend we do. alloc_init_p4d() will traverse the page table ladder with alloc_init_pud() (we don’t use that either) and then alloc_init_pmd() which we actually use, and then at the end if we need it alloc_init_pte().

What do I mean by “if we need it”? Why would we not allocate PTEs, real page-to-page tables for every 0x1000 chunk of physical-to-virtual map?

It’s because alloc_init_pmd() will first see if the stuff we map is big enough to use one or more section mappings, i.e. just 32 bits/4 bytes in the PGD at 0xC0004000-somewhere to map the memory. In our case that will mostly be the case! The kernel will if possible remain section mapped, and we call __map_init_section() which will create and write the exact same value into the PGD as we had put in there before (well, maybe it was all executable up to this point, so at least some small bits will change). But we try our best to use the big and fast section maps if we can without unnecessarily creating a myriad of PTE-level objects that turn into level-2 descriptors that need to be painstakingly traversed.

However on the very last page of the executable part of the kernel and very first page of the data segments it is very likely that we cannot use a section mapping, and for the first time the generic kernel fine-granular paging using 4 KB pages will kick in to provide this mapping. We learnt before that the linker file has been carefully tailored to make sure that each segment of code starts on an even page, so we know that we can work at page granularity here.
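The "can we use a section here?" question boils down to an alignment test. The following is a rough sketch assuming the classic MMU with 1 MB sections; the macro and function names are made up for the example and the real check in the kernel also looks at the memory type.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_SECTION_SIZE 0x00100000u /* 1 MB, classic MMU assumed */
#define EXAMPLE_SECTION_MASK (~(EXAMPLE_SECTION_SIZE - 1u))

/*
 * A range can be section mapped only if the virtual start, the virtual
 * end and the physical start are all aligned to a section boundary.
 * OR-ing them together lets us test all three low bit fields at once.
 */
static bool example_can_use_section(uint32_t virt, uint32_t virt_end,
                                    uint32_t phys)
{
    return ((virt | virt_end | phys) & ~EXAMPLE_SECTION_MASK) == 0;
}
```

A range ending in the middle of a megabyte, such as the border between the executable and non-executable part of the kernel, fails this test and has to fall back to 4 KB pages.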

In create_mapping(), if the start and end of the memory we try to map is not possible to fit perfectly into a section, we will call alloc_init_pte() for the same address range. This will allocate and initialize a PTE, which is the next level down in the page table hierarchy.

phys_addr_t start, end;
struct map_desc map;

map.pfn = __phys_to_pfn(start);
map.virtual = __phys_to_virt(start);
map.length = end - start;
map.type = MT_MEMORY_RWX;

create_mapping(&map);

A typical case of creating a mapping with create_mapping(). We set the page frame number (pfn) to the page we want to start the remapping at, then the virtual address we want the remapping to appear at, and the size and type of the remapping, then we call create_mapping().

So Let’s Map More Stuff

We now know what actually happens when we call create_mapping() and that call is used a lot in the early architecture set-up. We know that map_lowmem() will chop up the border between the executable and non-executable part of the kernel using the just described paging layout of our MMU using either the classic complicated mode or the new shiny LPAE mode.

The early section mappings we put over the kernel during boot and start of execution from virtual memory are now gone. We have overwritten them all with the new, proper mappings, including the very memory we executed the remappings in. Nothing happens, but the world is now under generic kernel memory management control. The state of the pages is, by the way, maintained in the page tables themselves.

If we create new PTEs these will be allocated into some new available memory page as well using alloc(). At this point that means that the memblock allocator will be used, since the proper memory management with kmalloc() is not yet operational. The right type of properties (such as read/write/execute or other MT_* characteristics) will be used however so we could say that we have a “halfway” memory manager: the hardware is definitely doing the right thing now, and the page tables are manipulated the right way using generic code.

map_lowmem() is done. We call memblock_set_current_limit(arm_lowmem_limit) because from now on we only want memblock allocations to end up in lowmem proper, we have just mapped it in properly and all. In most cases this is the same as before, but in some corner cases we cannot put this restriction until now.

We remap the contiguous memory allocation area if CMA is in use, again using all the bells and whistles of the kernel’s generic memory manager. MMU properties for DMA memory are set to MT_MEMORY_DMA_READY which is very close to normal read/writeable memory.

Next we shut down the fixmap allocations. The remapped memory that was using fixmaps earlier gets mapped like the kernel itself and everything else using create_mapping() instead of the earlier hacks poking directly into the page tables. They are all one page each and use MT_DEVICE, i.e. write-through, uncached registers of memory-mapped I/O, such as used by the earlyconsole UART registers.

Next we set up some special mappings in devicemaps_init() that apart from the early ones we just reapplied add some new ones, i.e. some not-so-early mappings. They are mostly not devices either so the function name is completely misleading. The function is called like this because it at one point calls the machine-specific ->map_io() callback for the identified machine descriptor, which on pure device tree systems isn’t even used.

Inside devicemaps_init() we clear some more PMDs. Now we start at VMALLOC_START: the place where we previously stopped the PMD clearing, advancing 2 MB at a time up to FIXADDR_TOP which is the location of the fixmaps. Those have been redefined using the generic kernel paging engine, but they are still there, so we must not overwrite them.

Next follow some special mappings, then we get to something really interesting: the vectors.

Setting Up and Mapping the Exception Vector Table

The vector table is a page of memory where the ARM32 CPUs will jump when an exception occurs. Exception vectors can be one of:

  • Reset exception vector: the address the PC is set to if the RESET line to the CPU is asserted.
  • Undefined instruction exception vector: the address we jump to if an undefined instruction is executed. This is used for example to emulate floating point instructions if your CPU does not have them.
  • Software Interrupt (SWI) exception vector: also called a “trap”, is a way to programmatically interrupt the program flow and execute a special handler. In Linux this is used by userspace programs to execute system calls: to call on the kernel to respond to needs of a userspace process.
  • Prefetch abort exception vector: this happens when the CPU tries to fetch an instruction from an illegal address, such as those addresses where we have cleared the level-1 page table descriptor (PMD) so there is no valid physical memory underneath. This is also called a page fault and is used for implementing demand paging, a central concept in Unix-like operating systems.
  • Data abort exception vector: same thing but the CPU is trying to fetch data rather than an instruction. This is also used for demand paging.
  • Address exception vector: this is described in the source as “this should never happen” and some manuals describe it as “unused” or “reserved”. This is actually an architectural leftover from the ARM26 (ARMv1, v2, v3) no longer supported by Linux. In older silicon the virtual address space was 26 bits rather than the 32 bits in later architectures, and this exception would be triggered when an address outside the 26 bit range was accessed. (This information came from LWN reader farnz as a reply to this article.) On full 32-bit silicon it should indeed never happen.
  • Interrupt Request IRQ exception vector: the most natural type of exception in response to the IRQ line into the CPU. In later ARM32 CPUs this usually comes from the standard GIC (Generic Interrupt Controller) but in earlier silicon such as ARMv4 or ARMv5 some custom interrupt controller is usually connected to this line. The origin of the line was a discrete signal routed out on the CPU package, but in modern SoCs these are usually synthesized into the same silicon so the line is not visible to the outside, albeit the concept is the same.
  • Fast Interrupt Request FIQ exception vector: this is mostly unused in Linux, and on ARMv7 silicon often used to trap into the secure world interrupt handlers and thus not even accessible by the normal world where Linux is running.

These eight vectors in this order are usually all we ever need on any ARM32 CPU. They are one 32-bit word each, so the PC is for example set to address 0xFFFF0000 when reset occurs and whatever is there is executed.

The way the vector table/page works is that the CPU will store the program counter and processor state in internal registers and put the program counter at the corresponding vector address. The vector table can be put in two locations in memory: either at address 0x00000000 or address 0xFFFF0000. The location is selected with a single bit in the CP15 control register 1. Linux supports putting the vectors in either place with a preference for 0xFFFF0000. Using address 0x00000000 is typically most helpful if the MMU is turned off and you have a 1-to-1 mapping to a physical memory that starts at address 0x00000000. If the MMU is turned on, which it is for us, the address used is the virtual one, even the vector table goes through MMU translation, and it is customary to use the vectors high up in memory, at address 0xFFFF0000.
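As a small worked example of the layout, this sketch computes the address the program counter is set to for each exception. The enum and function names are hypothetical. The bit position used for the high vectors selection (bit 13 of the control register) comes from my understanding of the ARM architecture and is not stated in the text above.

```c
#include <stdint.h>

/* The eight vectors in table order, one 32-bit word each. */
enum example_vector {
    EX_VEC_RESET = 0,
    EX_VEC_UNDEF,
    EX_VEC_SWI,
    EX_VEC_PREFETCH_ABORT,
    EX_VEC_DATA_ABORT,
    EX_VEC_ADDRESS,
    EX_VEC_IRQ,
    EX_VEC_FIQ
};

/* Assumption: the "V" bit that selects high vectors is bit 13. */
#define EXAMPLE_CR_V (1u << 13)

/* Where the PC ends up for a given exception and control register value. */
static uint32_t example_vector_address(uint32_t control_reg,
                                       enum example_vector v)
{
    uint32_t base = (control_reg & EXAMPLE_CR_V) ? 0xFFFF0000u : 0x00000000u;

    return base + 4u * (uint32_t)v;
}
```

With high vectors selected an IRQ lands on 0xFFFF0018, and with low vectors a software interrupt lands on 0x00000008.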

As we noted much earlier, each exception context has its own copy of the sp register and thus is assigned an exception-specific stack. Any other registers need to be spooled out and back in by code before returning from the exception.

Exception vectors on the ARM CPUs: The ARM32 exception vector table is a 4KB page where the first 8 32-bit words are used as vectors. The address exception should not happen and in some manuals is described as “unused”. The program counter is simply set to this location when any of these exceptions occur. The remainder of the page is “poisoned” with the word 0xE7FDDEF1.

ARM Linux uses two consecutive pages of memory for exception handling: the first page is the vectors, the second page is called stubs. The vectors will typically be placed at 0xFFFF0000 and the stubs at the next page at 0xFFFF1000. If you use the low vectors these will instead be at 0x00000000 and 0x00001000 respectively. The actual physical pages backing these locations are simply obtained from the rough memblock allocator using memblock_alloc(PAGE_SIZE * 2).

The stubs page requires a bit of explanation: since each vector is just 32 bits, we simply cannot just jump off to a desired memory location from it. Long jumps require many more bits than 32! Instead we do a relative jump into the next page, and either handle the whole exception there (if it’s a small thing) or dispatch by jumping to some other kernel code. The whole vector and stubs code is inside the files arch/arm/kernel/entry-armv.S and arch/arm/kernel/traps.c. (The “armv” portion of the filename is misleading, this is used for pretty much all ARM32 machines. “Entry” means exception entry point.)

The vector and stub pages are set up in the function early_trap_init() by first filling the page with the 32-bit word 0xE7FDDEF1 which is an undefined instruction on all ARM32 CPUs. This process is called “poisoning”, and makes sure the CPU locks up if it ever would put the program counter here. Poisoning is done to make sure wild running program counters stop running around, and as a security vulnerability countermeasure: overflow attacks and other program counter manipulations often try to lead the program counter astray. Next we copy the vectors and stubs to their respective pages.
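The poison-then-copy sequence can be illustrated with a plain array standing in for the vector page. This is a simplified sketch, not the code from early_trap_init(): the function name is hypothetical and the caller passes in whatever words should go first in the page.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_PAGE_WORDS (4096u / sizeof(uint32_t))
#define EXAMPLE_POISON     0xE7FDDEF1u /* undefined instruction */

/*
 * Fill the whole page with the poison word first, then copy the real
 * vectors over the start of it. Everything that is not overwritten
 * stays poisoned.
 */
static void example_install_vectors(uint32_t page[EXAMPLE_PAGE_WORDS],
                                    const uint32_t *vectors,
                                    size_t nvectors)
{
    for (size_t i = 0; i < EXAMPLE_PAGE_WORDS; i++)
        page[i] = EXAMPLE_POISON;

    memcpy(page, vectors, nvectors * sizeof(uint32_t));
}
```

Doing the poisoning before the copy means there is never a window where a part of the page is left with stale contents.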

The code in the vectors and stubs has been carefully tailored to be position-independent so we can just copy it and execute it wherever we want to. The code will execute fine at address 0xFFFF0000-0xFFFF1FFF or 0x00000000-0x00001FFF alike.

You can inspect the actual vector table between symbols __vector_start and __vector_end at the end of entry-armv.S: there are 8 32-bit vectors named vector_rst, vector_und … etc.

Clearing out PMDs, setting vectors: We cleared some more PMDs between VMALLOC_START and the fixmaps, so now the black blocks are bigger in the virtual memory space. At 0xFFFF0000 we install vectors and at 0xFFFF1000 we install stubs.

We will not delve into the details of ARM32 exception handling right now: it will suffice to know that this is where we set it up, and henceforth we can deal with them in the sense that ARM32 will be able to define exception handlers after this point. And that is pretty useful. We have not yet defined any generic kernel exception or interrupt interfaces.

The vectors are flushed to memory and we are ready to roll: exceptions can now be handled!

At the end of devicemaps_init() we call early_abt_enable() which enables us to handle some critical abort exceptions during the remaining start-up sequence of the kernel. The most typical case would be a secondary CPU stuck in an abort exception when brought online, we need to cope with that and recover or it will bring the whole system down when we enable it.

The only other notable thing happening in devicemaps_init() is a call to the machine descriptor-specific .map_io() or if that is undefined, to debug_ll_io_init(). This used to be used to set up fixed memory mappings of some device registers for the machine, since at this point the kernel could do that properly using create_mapping(). Nowadays, using device trees, this callback will only be used to remap a debug UART for LL_DEBUG (all other device memory is remapped on-demand) which is why the new function name debug_ll_io_init() which isn’t even using the machine descriptor is preferred.

Make no mistake: DEBUG_LL is already assuming a certain virtual address for the UART I/O-port up until this point and it better remain there. We mapped that much earlier in head.S using a big fat section mapping of 1MB physical-to-virtual memory.

What happens in debug_ll_io_init() is that the same memory window is remapped properly with create_mapping() using a fine-granular map of a single page of memory using the right kernel abstractions. We obtain the virtual address for the UART using the per-serialport assembly macro addruart from the assembly file for the corresponding UART in arch/arm/include/debug/*.

This will overwrite the level-1 section mapping descriptor used for debug prints up until this point with a proper level-1 to level-2 pointer using PMDs and PTEs.

Initializing the Real Memory Manager

Back in paging_init() we call kmap_init() that initializes the mappings used for highmem and then tcm_init() which maps some very tiny on-chip RAMs if we have them. The TCM (tightly coupled memory) is small SRAMs that are as fast as cache, that some vendors synthesize on their SoCs.

Finally we set top_pmd to point at address 0xFFFF0000 (the vector space) and we allocate a page called the empty_zero_page which will be a page filled with zeroes. This is sometimes very helpful for the kernel when referencing a “very empty page”.

We call bootmem_init() which brings extended memblock page handling online: it allows resizing of memblock allocations, finds the lowest and highest page frame numbers (pfns), performs an early memory test (if compiled in) and initializes sparse memory handling in the generic virtual memory manager.

We then chunk the physical memory into different memory zones with the final call to free_area_init() providing the maximum page frame numbers for the different memory zones: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM are used on ARM32. The zones are assumed to be consecutive in physical memory, so only the maximum page frame numbers for each zone are given.

  • ZONE_DMA is for especially low physical memory that can be accessed by DMA by some devices. There are machines with limitations on which addresses some devices can access when performing DMA bus mastering, so these need special restrictions on memory allocation.
  • ZONE_NORMAL is what we refer to as lowmem on ARM32: the memory that the kernel or userspace can use for anything.
  • ZONE_HIGHMEM is used with the ARM32 definition of highmem, which we have discussed in detail above: memory physically above the 1-to-1-mapped lowmem.
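Since the zones are consecutive, an array of maximum page frame numbers is enough to tell which zone a page belongs to. The sketch below shows that idea with made-up names and the assumption that each maximum is exclusive, i.e. the first pfn that is no longer in the zone. It is not the generic kernel code.

```c
#include <stdint.h>

enum example_zone {
    EX_ZONE_DMA,
    EX_ZONE_NORMAL,
    EX_ZONE_HIGHMEM,
    EX_MAX_ZONES
};

/*
 * Walk the zones from the lowest to the highest and return the first
 * one whose (exclusive) maximum pfn is above the pfn we look up.
 * Returns EX_MAX_ZONES if the pfn is beyond the last zone.
 */
static enum example_zone example_zone_of(uint32_t pfn,
                                         const uint32_t max_zone_pfn[EX_MAX_ZONES])
{
    for (int z = 0; z < EX_MAX_ZONES; z++) {
        if (pfn < max_zone_pfn[z])
            return (enum example_zone)z;
    }
    return EX_MAX_ZONES;
}
```

No start pfn per zone is needed: each zone implicitly starts where the previous one ended.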

After returning from free_area_init() the generic kernel virtual memory pager is finally initialized. The memory management is not yet online: we cannot use kmalloc() and friends, but we will get there. We still have to use memblock to allocate memory.

We call request_standard_resources(), which is a call to register to the kernel what purpose specific memory areas have. Here we loop over the memblocks and request them as System RAM if nothing else applies. The resource request facility is hierarchical (resources can be requested inside resources) so the kernel memory gets requested inside the memory block where it resides. This resource allocation provides the basic output from the file /proc/iomem such as the location of the kernel in memory. This facility is bolted on top of the actual memory mapping and just works as an optional protection mechanism.

Finalizing Architecture Setup

We are getting close to the end of ARM32's setup_arch() call.

If the machine descriptor has a restart hook we assign the global function pointer arm_pm_restart to this. Nominally drivers to restart modern platforms should not use this: they should provide a restart handler in drivers/power/reset/* registering themselves using register_restart_handler(), but we have a bit of legacy code to handle restarts, and that will utilize this callback.

Next we unflatten the device tree, if the machine uses this, which all sufficiently modern ARM32 machines should. The device tree provided from boot is compact, binary and read only, so we need to process it so that boot code and device drivers can traverse the device tree easily. An elaborate data structure is parsed out from the device tree blob and allocated into free pages, again using the crude memblock allocator.

So far we only used very ad hoc device tree inspection to find memory areas and memory reservations.

Now we can inspect the device tree in a more civilized manner to find out some very basic things about the platform. The first thing we will actually do is to read the CPU topology information out of the device tree and build a list of available CPUs on the system. But that will happen later on during boot.

Finally we check if the machine has defined ->init_early() and needs some other early work. If it does then we call this callback. Else we are done with setup_arch().

After this we return to the function start_kernel() in init/main.c again, where we will see how the kernel builds zones of the memory blocks, initializes the page allocator and finally gets to call mm_init() which brings the proper memory management with kmalloc() and friends online. We will set up SMP, timekeeping and call back into the architecture to finalize the deal.

But this is all a topic for another time.

 

from joelfernandes

The writer works in the ChromeOS kernel team, where most of the system libraries, low-level components and user space is written in C++. Thus the writer has no choice but to be familiar with C++. It is not that hard, but some things are confusing. rvalue references are definitely confusing.

In this post, I wish to document rvalue references by simple examples, before I forget it.

Refer to this article for in-depth coverage on rvalue references.

In a nutshell: An rvalue reference can be used to construct a C++ object efficiently using a “move constructor”. This efficiency is achieved by the object's move constructor by moving the underlying memory of the object efficiently to the destination instead of a full copy. Typically the move constructor of the object will copy pointers within the source object into the destination object, and null the pointer within the source object.

An rvalue reference is denoted by a double ampersand (&&) when you want to create an rvalue reference as a variable.

For example T &&y; defines a variable y which holds an rvalue reference of type T. I have almost never seen an rvalue reference variable created this way in real code. I also have no idea when it can be useful. Almost always they are created by either of the 2 methods in the next section. These methods create an “unnamed” rvalue reference which can be passed to a class's move constructor.

When is an rvalue reference created?

In the below example, we create an rvalue reference to a vector, and create another vector object from this.

This can happen in 2 ways (that I know of):

1. Using std::move

This casts an lvalue to an rvalue reference.

Example:

#include <cstdio>   // printf
#include <utility>  // std::move
#include <vector>

int main()
{
    int *px, *py;
    std::vector<int> x = {4,3};
    px = &(x[0]);
 
    // Convert lvalue 'x' to rvalue reference and pass
    // it to vector's overloaded move constructor.
    std::vector<int> y(std::move(x)); 
    py = &(y[0]);

    // Confirm the new vector uses same storage
    printf("same vector? : %d\n", px == py); // prints 1
}

2. When returning something from a function

The returned object from the function can be caught as an rvalue reference to that object.

#include <cstdio>   // printf
#include <vector>

int *pret;
int *py;

std::vector<int> myf(int a)
{
    std::vector<int> ret;

    ret.push_back(a * a);

    pret = &(ret[0]);

    // Return is caught as an rvalue ref: vector<int> &&
    return ret;
}

int main()
{
    // Invoke vector's move constructor.
    std::vector<int> y(myf(4)); 
    py = &(y[0]);

    // Confirm the vectors share the same underlying storage
    printf("same vector? : %d\n", pret == py); // prints 1
}

Note on move assignment

Interestingly, if you construct vector 'y' using the = syntax: std::vector<int> y = myf(4);, the compiler still uses the move constructor and not an assignment operator. This is because = in a declaration is an initialization, not an assignment: vector's move assignment operator only comes into play when assigning to an already constructed vector, as in y = myf(4);.

Further, the compiler may even not invoke a constructor at all and just perform RVO (Return Value Optimization).

Quiz

Question:

If I create a named rvalue reference using std::move and then use this to create a vector, the underlying storage of the new vector is different. Why?

#include <cstdio>   // printf
#include <vector>

int *pret;
int *py;

std::vector<int> myf(int a)
{
    std::vector<int> ret;

    ret.push_back(a * a);

    pret = &(ret[0]);

    // Return is caught as an rvalue ref: vector<int> &&
    return ret;
}

int main()
{
    // Invoke vector's move constructor.
    std::vector<int>&& ref = myf(4);
    std::vector<int> y(ref); 
    py = &(y[0]);

    // Confirm the vectors share the same underlying storage
    printf("same vector? : %d\n", pret == py); // prints 0
}

Answer

The answer is: because the value category of the id-expression 'ref' is lvalue, the copy constructor will be chosen. To use the move constructor, it has to be std::vector<int> y(std::move(ref));.

Conclusion

rvalue references are confusing and sometimes the compiler can do different optimizations to cause further confusion. It is best to follow well known design patterns when designing your code. It may be best to also try to avoid rvalue references altogether but hopefully this article helps you understand it a bit more when you come across large C++ code bases.

 

from linusw

After we have considered how the ARM32 kernel is uncompressed and the early start-up, when the kernel jumps from executing in physical memory to executing in virtual memory, we now want to see what happens next all the way until the kernel sets up the proper page tables and starts executing from properly paged virtual memory.

To provide a specific piece of the story that does not fit into this linear explanation of things, I have also posted a separate article on how the ARM32 page tables work. This will be referenced in the text where you might need to recap that part.

To repeat: we have a rough initial section mapping of 1 MB sections covering the kernel RAM and the provided or attached Device Tree Blob (DTB) or ATAGs if we use a legacy system that is not yet using device tree. We have started executing from virtual memory in arch/arm/kernel/head-common.S, from the symbol __mmap_switched where we set up the C runtime environment and jump to start_kernel() in init/main.c. The page table is at a pointer named swapper_pg_dir.

Initial page table layout: The initial page table swapper_pg_dir and the 1:1 mapped one-page-section __turn_mmu_on alongside the physical to virtual memory mapping at early boot. In this example we are not using LPAE so the initial page table is -0x4000 from (PAGE_OFFSET + TEXT_OFFSET), usually at 0xC0004000 thru 0xC0007FFF and memory ends at 0xFFFFFFFF.

We are executing in virtual memory, but interrupts and caches are disabled and absolutely no device drivers are available, except the initial debug console. The initial debug console can be enabled with CONFIG_DEBUG_LL and selecting the appropriate debug UART driver for your system. This makes the kernel completely non-generic and custom for your system but is great if you need to debug before the device drivers come up. We discussed how you can insert a simple print in start_kernel() using this facility.

In the following text we start at start_kernel() and move down the setup_arch() call. When the article ends, we are not yet finished with setup_arch() so there will be a second part to how we set up the architecture. (I admit I ran over the maximum size for a post, else it would be one gigantic post.)

In the following I will not discuss the “nommu” (uClinux) set-up where we do not use virtual memory, but just a 1-to-1 physical-to-virtual map with cache and memory protection. It is certainly an interesting case, but should be the topic for a separate discussion. We will describe setting up Linux on ARM32 with full classic or LPAE MMU support.

Setting Up the Stack Pointer and Memory for the init Task

In the following section the words task and thread are used to indicate the same thing: an execution context of a process, in this case the init process.

Before we start executing in virtual memory we need to figure out where our stack pointer is set. __mmap_switched in head-common.S also initializes the ARM stack pointer:

   ARM( ldmia   r4!, {r0, r1, sp} )
 THUMB( ldmia   r4!, {r0, r1, r3} )
 THUMB( mov     sp, r3 )

r4 in this case contains __mmap_switched_data where the third variable is:

.long   init_thread_union + THREAD_START_SP

THREAD_START_SP is defined as (THREAD_SIZE - 8), so 8 bytes backward from the end of the THREAD_SIZE number of bytes forward from the pointer stored in the init_thread_union variable. This is the first word on the stack that the stack will use, which means that bytes at offset THREAD_SIZE - 8, -7, -6, -5 will be used by the first write to the stack: only one word (4 bytes) is actually left unused at the end of the THREAD_SIZE memory chunk.

The init_thread_union is a global kernel variable pointing to the task information for the init process. You find the actual memory for this defined in the generic linker file for the kernel in include/asm-generic/vmlinux.lds.h where it is defined as a section of size THREAD_SIZE in the INIT_TASK_DATA section definition helper. ARM32 does not do any special tricks with this and it is simply included into the RW_DATA section, which you find in the linker file for the ARM kernel in arch/arm/kernel/vmlinux.lds.S surrounded by the labels _sdata and _edata and immediately followed by the BSS section:

        _sdata = .;
        RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
        _edata = .;

        BSS_SECTION(0, 0, 0)

This will create a section which is aligned at a page boundary, beginning with INIT_TASK_DATA of size THREAD_SIZE, initialized to the values assigned during compilation and linking and writable by the kernel.

The union thread_union is what is actually stored in this variable, and in the ARM32 case it actually looks like this after preprocessing:

union thread_union {
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};

It is just an array of unsigned long 4 byte chunks, as the word length on ARM32 is, well 32 bits, and sized so that it will fit a THREAD_SIZE stack. The name stack is deceptive: this chunk of THREAD_SIZE bytes stores all ARM-specific context information for the task, and the stack, with the stack in the tail of it. The remaining Linux generic accounting details are stored in a struct task_struct which is elsewhere in memory.

THREAD_SIZE is defined in arch/arm/include/asm/thread_info.h to be (PAGE_SIZE << THREAD_SIZE_ORDER) where PAGE_SIZE is defined in arch/arm/include/asm/page.h to be (1 << 12). THREAD_SIZE usually resolves to (1 << 13), so it is actually 0x2000 (8192) bytes, i.e. 2 consecutive pages in memory. These 0x2000 bytes hold the ARM-specific context and the stack for the task.
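The arithmetic is easy to check with a few lines of C. The EXAMPLE_ macros below mirror the definitions quoted in the text, with the assumption that THREAD_SIZE_ORDER is 1 (which is what a THREAD_SIZE of 1 << 13 implies); the base address passed to the helper is just an invented value.

```c
#include <stdint.h>

#define EXAMPLE_PAGE_SHIFT        12
#define EXAMPLE_PAGE_SIZE         (1u << EXAMPLE_PAGE_SHIFT)
#define EXAMPLE_THREAD_SIZE_ORDER 1 /* assumed, gives two pages */
#define EXAMPLE_THREAD_SIZE       (EXAMPLE_PAGE_SIZE << EXAMPLE_THREAD_SIZE_ORDER)
#define EXAMPLE_THREAD_START_SP   (EXAMPLE_THREAD_SIZE - 8u)

/* Initial stack pointer for a thread union located at the given base. */
static uint32_t example_initial_sp(uint32_t thread_union_base)
{
    return thread_union_base + EXAMPLE_THREAD_START_SP;
}
```

So 4096 << 1 gives 8192 (0x2000) bytes, and the initial stack pointer ends up at offset 0x1FF8 from the start of the chunk.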

The struct thread_info is the architecture-specific context for the task and for the init task this is stored in a global variable called init_thread_info that you find defined at the end of init/init_task.c:

struct thread_info init_thread_info __init_thread_info = INIT_THREAD_INFO(init_task);

The macro __init_thread_info expands to a linker directive that puts this into the .data.init_thread_info section during linking, which is defined in the INIT_TASK_DATA we just discussed, so this section is only there to hold the init task thread_info. 8 bytes from the end of this array of unsigned longs that make up the INIT_TASK_DATA is where we put the stack pointer.

Where is this init_thread_union initialized anyway?

It is pretty hard to spot actually, but the include/asm-generic/vmlinux.lds.h INIT_TASK_DATA linker macro does this. It contains these statements:

init_thread_union = .;
init_stack = .;

So the memory assigned to the pointers init_thread_union and init_stack is strictly following each other in memory. When the .stack member of init_task is assigned to init_stack during linking, it will resolve to a pointer just a little further ahead in memory, right after the memory chunk set aside for the init_thread_union. This is logical since the stack grows toward lower addresses on ARM32 systems: .stack will point to the bottom of the stack.

Init thread During the early start of the kernel we take extra care to fill out these two thread_info and task_struct data structures and the pointers to different offsets inside them. Notice the sp (stack pointer) pointing 8 bytes up from the end of the two pages assigned as memory to hold the task information. The first word on the stack will be written at sp, typically at offset 0x1FF8 .. 0x1FFB.

This init_stack is assigned to .stack of struct task_struct init_task in init/init_task.c, where the rest of the task information for the init task is hardcoded. This task struct for the init task is in another place than the thread_info for the init task, they just point back and forth to each other. The task_struct is the generic kernel part of the per-task information, while the struct thread_info is an ARM32-specific information container that is stored together with the stack.

Here we see how generic and architecture-specific code connect: the init_thread_info is something architecture-specific and stores the state and stack of the init task that is ARM32-specific, while the task_struct for the init task is something completely generic, all architectures use the same task_struct.

This init task is task 0. It is not identical to task 1, which will be the init process. That is a completely different task that gets forked in userspace later on. This task is only about providing context for the kernel itself, and a point for the first task (task 1) to fork from. The kernel is very dependent on context as we shall see, and that is why its thread/task information and even the stack pointer for this “task zero” is hardcoded into the kernel like this. This “zero task” does not even appear to userspace if you type ps aux, it is hidden inside the kernel.

Initializing the CPU

The very first thing the kernel does in start_kernel() is to initialize the stack of the init task with set_task_stack_end_magic(&init_task). This puts a STACK_END_MAGIC token (0x57AC6E9D) where the stack ends. Since the ARM stack grows downwards, this will be the last usable unsigned long before we hit the thread_info on the bottom of the THREAD_SIZE memory chunk associated with our init task.

init thread stack end marker We insert a 0x57AC6E9D token so we can see if the last word of per-task stack ever gets overwritten and corrupted. The ARM32 stack grows towards the lower addresses. (Up along the arrow in the picture.)
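To make the placement of the marker concrete, here is a minimal sketch in plain C. The struct thread_info_sketch below is a heavily shortened, hypothetical stand-in for the real ARM32 struct thread_info, and all the *_sketch helpers are invented for this illustration; only the THREAD_SIZE and STACK_END_MAGIC values come from the text.

```c
#define THREAD_SIZE     0x2000UL
#define STACK_END_MAGIC 0x57AC6E9DUL

/* Hypothetical, shortened stand-in for the ARM32 struct thread_info. */
struct thread_info_sketch {
        unsigned long flags;
        int preempt_count;
        unsigned int cpu;
};

/* thread_info at the bottom, the stack filling the rest of the chunk. */
union thread_union_sketch {
        struct thread_info_sketch thread_info;
        unsigned long stack[THREAD_SIZE / sizeof(unsigned long)];
};

/* The last usable stack word is the first word after thread_info. */
unsigned long *end_of_stack_sketch(union thread_union_sketch *tu)
{
        return (unsigned long *)(&tu->thread_info + 1);
}

void set_stack_end_magic_sketch(union thread_union_sketch *tu)
{
        *end_of_stack_sketch(tu) = STACK_END_MAGIC;
}

/* Non-zero if something has grown down into the marker word. */
int stack_end_corrupted_sketch(union thread_union_sketch *tu)
{
        return *end_of_stack_sketch(tu) != STACK_END_MAGIC;
}
```

If the stack ever grows far enough down to overwrite that word, the check reports corruption before the thread_info itself gets trampled.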

Next smp_setup_processor_id() is called which is a weak symbol that each architecture can override. ARM does this: arch/arm/kernel/setup.c contains this function and if we are running on an SMP system, we execute read_cpuid_mpidr() to figure out the ID of the CPU we are currently running on and initialize the cpu_logical_map() array such that the current CPU is at index 0 and we print the very first line of kernel log which will typically be something like:

Booting Linux on physical CPU 0x0

If you are running on a uniprocessor system, there will be no print like this.
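The logical map set-up can be sketched roughly as follows. This is a simplified illustration, not the kernel code: the array size, the *_sketch names and the idea of passing the physical CPU ID as a plain parameter (instead of reading it from the MPIDR) are all assumptions made for the example.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 4        /* assumed number of CPUs */

/* Logical-to-physical CPU map; index 0 is always the booting CPU. */
uint32_t cpu_logical_map_sketch[NR_CPUS_SKETCH];

/*
 * boot_cpu is the physical ID of the CPU we are running on, which the
 * real code derives from the MPIDR register on SMP systems.
 */
void setup_logical_map_sketch(uint32_t boot_cpu)
{
        uint32_t i;

        cpu_logical_map_sketch[0] = boot_cpu;
        /* The slot the boot CPU would have had gets physical CPU 0. */
        for (i = 1; i < NR_CPUS_SKETCH; i++)
                cpu_logical_map_sketch[i] = (i == boot_cpu) ? 0 : i;
}
```

Booting on physical CPU 2 of 4 would, under these assumptions, give the map 2, 1, 0, 3: the current CPU always ends up at index 0.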

The kernel then sets up some debug objects and control group information that is needed early. We then reach local_irq_disable(). Interrupts are already disabled (at least they should be) but we exercise this code anyways. local_irq_disable() is defined to arch_local_irq_disable() which in the ARM case can be found in arch/arm/include/asm/irqflags.h. As expected it resolves to the assembly instruction cpsid i which will disable any ordinary IRQ in the CPU. This is short for change processor state interrupt disable i. It is also possible to issue cpsid f to disable FIQ, which is an interrupt which the ARM operating systems seldom make use of.

ARM systems usually also have an interrupt controller: this is of no concern here: we are disabling the line from the interrupt controller (such as the GIC) to the CPU: this is the big main switch, like going down in the basement and cutting the power to an entire house. Any other lightswitches in the house are of no concern at this point, we haven’t loaded a driver for the interrupt controller so we just ignore any interrupt from any source in the system.

Local IRQ big powerswitch The local_irq_disable() results in cpsid i which will cut the main interrupt line to the CPU core.

This move makes a lot of sense, because at this point we have not even set up the exception vectors, which is what all IRQs have to jump through to get to the destined interrupt handler. We will get to this in due time.

Next we call boot_cpu_init(). As the name says this will initialize the one and only CPU we are currently running the kernel on. Normally any other CPUs in the system are just standing by with frozen instruction counters at this point. This does some internal kernel bookkeeping. We note that smp_processor_id() is called and __boot_cpu_id is assigned in the SMP case. smp_processor_id() is assigned to raw_smp_processor_id() in include/linux/smp.h and that will go back into the architecture in arch/arm/include/asm/smp.h and reference current_thread_info()->cpu, while on uniprocessor (UP) systems it is simply defined to 0.

Let’s see where that takes us!

In the SMP case, current_thread_info() is defined in arch/arm/include/asm/thread_info.h and is dereferenced from the current_stack_pointer like this:

static inline struct thread_info *current_thread_info(void)
{
        return (struct thread_info *)
                (current_stack_pointer & ~(THREAD_SIZE - 1));
}

current_stack_pointer in turn is defined in arch/arm/include/asm/percpu.h and implemented as the register sp. We remember that we initialized this to point 8 bytes below the end of the THREAD_SIZE chunk of memory reserved for init_thread_info. The stack grows backward. This memory follows right after the thread_info for the init task so by doing this arithmetic, we get a pointer to the current struct thread_info, which will be that of the init_task. Further we see that the link file contains this:

. = ALIGN(THREAD_SIZE);
(...)
RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)

So the linker takes care to put thread info into a THREAD_SIZE-aligned chunk in memory. This means that if THREAD_SIZE is 0x2000, then the init task thread_info will always be on addresses like 0x10000000, 0x10002000, 0x10004000...

For example if THREAD_SIZE is 0x2000 then (THREAD_SIZE - 1) is 0x00001FFF and ~0x00001FFF is 0xFFFFE000 and since we know that the struct thread_info always starts on a THREAD_SIZE-aligned boundary (an even page), this arithmetic will give us the address of the task information from the stack pointer. If the init_thread_info ends up at 0x10002000 then sp points to 0x10003FF8 and we calculate 0x10003FF8 & 0xFFFFE000 = 0x10002000 and we have a pointer to our thread_info, which in turn has a pointer to init_task: thus we know all about our context from just using sp.
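The same masking can be tried out in ordinary C. The sketch below works on plain 32-bit numbers rather than real pointers, and thread_info_base() is a name made up for this illustration; the THREAD_SIZE value is the 0x2000 assumed above.

```c
#include <stdint.h>

#define THREAD_SIZE 0x2000u     /* assumed, as in the example above */

/* Mask off the low bits of sp to find the THREAD_SIZE-aligned base. */
uint32_t thread_info_base(uint32_t sp)
{
        return sp & ~(THREAD_SIZE - 1);
}
```

Feeding it 0x10003FF8 gives back 0x10002000, and any other sp value inside the same 0x2000-byte chunk gives the same answer, which is the whole point of the trick.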

Thus sp is everything we need to keep around to quickly look up the information of the currently running task, i.e. the process context. This is a central ARM32 Linux idea.

Since the kernel is now busy booting we are dealing with the init task, task 0, but all tasks that ever get created on the system will follow the same pattern: we go into the task context and use sp to find any information associated with the task. The init_task provides context so we don't crash. Now it becomes evident why this special “task zero” is needed so early.

So to conclude: when raw_smp_processor_id() inspects current_thread_info()->cpu to figure out what CPU we are running on, this thread info is going to be the init_thread_info and ->cpu is going to be zero because the compiler has assigned the default value zero to this member during linking (we haven’t assigned anything explicitly). ->cpu is an index and at index zero in the cpu_logical_map() is where we poked in our own CPU ID a little earlier using smp_setup_processor_id() so we can now find ourselves through the sp. We have come full circle.

As I pointed out this clever sp mechanism isn’t just used for the init task. It is used for all tasks on the system, which means that any allocated thread info must strictly be on a memory boundary evenly divisible by THREAD_SIZE. This works fine because kmalloc() that is used to allocate kernel memory returns chunks that are “naturally aligned”, which means “aligned to the object size if that size is a power of two” – this will ensure that the allocation ends up on a boundary evenly divisible by the size of the object allocated, and the current_thread_info() call will always work fine. This is one of the places where the kernel assumes this kind of natural alignment.

Hello World

Next we call page_address_init() which mostly does nothing, but if the system is using highmem (i.e. has a lot of memory) this will initialize a page hash table. We will explain more about how highmem is handled later in this article.

Next we print the Linux banner. The familiar first text of the kernel appears in the kernel print buffer stating what version and git hash we built from and what compiler and linker were used to build this kernel:

Linux version 5.9.0-rc1-00021-gbbe281ed6cfe-dirty (...)

Even if we have an early console defined for the system, i.e. if we defined CONFIG_DEBUG_LL these first few messages will still NOT be hammered out directly on the console. They will not appear until we have initialized the actual serial console driver much later on. Console messages from CONFIG_DEBUG_LL only come out on the console as a result of printascii() calls. This banner is just stashed aside in the printk memory buffer until we get to a point where there is a serial device that can actually output this text. If we use earlyprintk and the serial driver has the proper callbacks, it will be output slightly earlier than when the memory management is properly up and running. Otherwise you will not see this text until the serial driver actually gets probed much later in the kernel start-up process.

But since you might have enabled CONFIG_DEBUG_LL there is actually a trick to get these prints as they happen: go into kernel/printk/printk.c and in the function vprintk_store() insert some code like this at the end of the function, right before the call to log_output():

#if defined(CONFIG_ARM) && defined(CONFIG_DEBUG_LL)
       {
               extern void printascii(char *);
               printascii(textbuf);
       }
#endif

Alternatively, after kernel version v5.11-rc1:

if (dev_info)
                memcpy(&r.info->dev_info, dev_info, sizeof(r.info->dev_info));
#if defined(CONFIG_ARM) && defined(CONFIG_DEBUG_LL)
       {
               extern void printascii(char *);
               printascii(&r.text_buf[0]);
       }
#endif

This will make all printk:s get hammered out on the console immediately as they happen, with the side effect that once the serial port is up you get a conflict over the hardware and double prints of everything. However if the kernel grinds to a halt before that point, this hack can be really handy: even the very first prints to the console come out immediately. It’s one of the ARM32-specific hacks that can come in handy.

Next we call early_security_init() that will call ->init() on any Linux Security Modules (LSMs) that are enabled. If CONFIG_SECURITY is not enabled, this means nothing happens.

Setting Up the Architecture

The next thing that happens in start_kernel() is more interesting: we call setup_arch() passing a pointer to the command line. setup_arch() is declared in <linux/init.h> and will be implemented by each architecture as it sees fit. The architecture initializes itself and passes any command line from whatever mechanism it has to provide a command line. In the case of ARM32 this is implemented in arch/arm/kernel/setup.c and the command line can be passed from ATAGs or the device tree blob (DTB).

Everything in this article from this point on will concern what happens in setup_arch(). Indeed this function has given the name to this whole article. We will conclude the article once we return out of setup_arch(). The main focus will be memory management in the ARM32 MMUs.

Setting Up the CPU

ARM32 first sets up the processor. This section details what happens in the setup_processor() call which is the first thing the setup_arch() calls. This code will identify the CPU we are running on, determine its capabilities such as cache type and prepare the exception stacks. It will however not enable the caches or any other MMU functionality such as paging, that comes later.

We call read_cpuid_id() which in most cases results in a CP15 assembly instruction to read out the CPU ID: mrc p15, 0, <Rd>, c0, c0, 0 (some silicon such as the v7m family require special handling).

Using this ID we cross reference a struct with information about the CPU, struct proc_info_list. This is actually the implementation in assembly in head-common.S that we have seen earlier, so we know that this will retrieve information about the CPU from the files in arch/arm/mm/proc-*.S for example arch/arm/mm/proc-v7.S for all the ARMv7 processors. With some clever linkage the information about the CPU is now assigned into struct proc_info_list from arch/arm/include/asm/procinfo.h, so that any low-level information about the CPU and a whole set of architecture-specific assembly functions for the CPU can be readily accessed from there. A pointer to access these functions is set up in a global vector table named, very clearly, processor.

We print the CPU banner which can look something like this (Qualcomm APQ8060, a dual-core Cortex A9 SMP system):

CPU: ARMv7 Processor [510f02d2] revision 2 (ARMv7), cr=10c5787d
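The hexadecimal number in brackets is the raw CPU ID we just read out. As a sketch, assuming the usual layout of the main ID register on these cores (implementer in bits 31..24, variant in 23..20, architecture in 19..16, part number in 15..4 and revision in 3..0), it can be picked apart like this. The struct and function names are made up for the illustration.

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields. */
struct midr_fields_sketch {
        unsigned int implementer;
        unsigned int variant;
        unsigned int architecture;
        unsigned int part;
        unsigned int revision;
};

/* Split the raw ID value into its bit fields (assumed layout). */
struct midr_fields_sketch decode_midr_sketch(uint32_t midr)
{
        struct midr_fields_sketch f;

        f.implementer  = (midr >> 24) & 0xff;
        f.variant      = (midr >> 20) & 0xf;
        f.architecture = (midr >> 16) & 0xf;
        f.part         = (midr >> 4) & 0xfff;
        f.revision     = midr & 0xf;
        return f;
}
```

For 0x510f02d2 this yields implementer 0x51, architecture 0xf, part 0x02d and revision 2, the last of which matches the "revision 2" in the banner.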

From the per-processor information struct we also set up elf_hwcap and elf_hwcap2. These flags, which can be found in arch/arm/include/uapi/asm/hwcap.h, tell the ELF (Executable and Linkable Format) parser in the kernel what kind of executable files we can deal with. For example if the executable is using hardware floating point operations or NEON instructions, a flag is set in the ELF header, and compared to this when we load the file so that we can determine if we can even execute the file.

Just reading that out isn’t enough though so we call in succession cpuid_init_hwcaps() to make a closer inspection of the CPU, disable execution of thumb binaries if the kernel was not compiled with thumb support, and we later also call elf_hwcap_fixup() to get these flags right for different fine-granular aspects of the CPU.

Next we check what cache type this CPU has in cacheid_init(). You will find these in arch/arm/include/asm/cachetype.h with the following funny names tagged on: VIVT, VIPT, ASID, PIPT. Naturally, this isn’t very helpful. These things are however clearly defined in Wikipedia, so go and read.

We then call cpu_init() which calls cpu_proc_init() which will execute the per-cputype callback cpu_*_proc_init from the CPU information containers in arch/arm/mm/proc-*.S. For example, the Faraday FA526 ARMv4 type CPU cpu_fa526_proc_init() in arch/arm/mm/proc-fa526.S will be executed.

Lastly there is a piece of assembly: five invocations of the msr cpsr_c assembly instruction, setting up the stacks for the five different exception modes of an ARM CPU: IRQ, ABT, UND, FIQ and SVC. msr cpsr_c can switch the CPU into different contexts and set up the hardware-specific sp for each of these contexts. The CPU stores these copies of the sp internally.

The first assembly instruction loads the address of the variable stk, then this is offset for the struct members irq[0], abt[0], und[0], and fiq[0]. The struct stack where this is all stored was located in the beginning of cpu_init() with these two lines of code:

unsigned int cpu = smp_processor_id();
struct stack *stk = &stacks[cpu];

stacks is just a file-local array of a cacheline-aligned struct:

struct stack {
    u32 irq[3];
    u32 abt[3];
    u32 und[3];
    u32 fiq[3];
} ____cacheline_aligned;

static struct stack stacks[NR_CPUS];

In the ARM32 case cacheline-aligned means this structure will start on an even 32, 64 or 128-byte boundary. Most commonly a 32-byte boundary, as this is the most common cacheline size on ARM32.

So we define as many exception callstacks as there are CPUs in the system. Interrupts and other exceptions thus have 3 words of callstack. We will later in this article describe how we set up the exception vectors for these exceptions as well.
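To get a feeling for the layout, the struct can be reproduced outside the kernel. This is a sketch only: the GCC/Clang aligned attribute stands in for ____cacheline_aligned, the 32-byte cacheline and the CPU count are assumptions, and mode_stack_sketch() is a helper invented here to mimic "base address of stk plus member offset".

```c
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS_SKETCH   2      /* assumed CPU count */
#define CACHELINE_SKETCH 32     /* assumed cacheline size in bytes */

struct stack_sketch {
        uint32_t irq[3];
        uint32_t abt[3];
        uint32_t und[3];
        uint32_t fiq[3];
} __attribute__((aligned(CACHELINE_SKETCH)));

struct stack_sketch stacks_sketch[NR_CPUS_SKETCH];

/* Address that would be loaded into sp for one mode on one CPU. */
uint32_t *mode_stack_sketch(unsigned int cpu, size_t member_offset)
{
        return (uint32_t *)((char *)&stacks_sketch[cpu] + member_offset);
}
```

The four members sit at byte offsets 0, 12, 24 and 36, so one struct holds 48 bytes of payload and is padded out to 64 bytes when aligned to a 32-byte cacheline.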

Setting up the machine

We have managed to set up the CPU per se and we are back in the setup_arch() function. We now need to know what kind of machine we are running on to get further.

To do this we first look for a flattened device tree (DTB or device tree blob) by calling setup_machine_fdt(). If this fails we will fall back to ATAGs by calling setup_machine_tags() which was the mechanism used for boardfiles before we had device trees.

We have touched upon this machine characteristic before: the most important difference is that in device trees the hardware on the system is described using an abstract hierarchical tree structure and in ATAGs the hardware is defined in C code in a so-called boardfile, that will get called later during this initialization. There are actually to this day three ways that hardware is described in the ARM systems:

  1. Compile-time hardcoded: you will find that some systems such as the RISC PC or StrongARM EBSA110 (evaluation board for StrongARM 110) will have special sections even in head.S using #ifdef CONFIG_ARCH_RPC and similar constructs to kick in machine-specific code at certain stages of the boot. This was how the ARM32 kernel was started in 1994-1998 and at the time very few machines were supported. This was the state of the ARM32 Linux kernel as it was merged into the mainline in kernel v2.1.80.
  2. Machine number + ATAGs-based: as the number of ARM machines grew quickly following the successful Linux port, the kernel had to stop relying on compile-time constants and needed a way for the kernel to identify which machine it was running on at boot time rather than at compile time. For this reason ATAGs were introduced in 2002. The ATAGs are also called the kernel tagged list and they are a simple linked list in memory passed to the kernel at boot, and most importantly provide information about the machine type, physical memory location and size. The machine type is what we need to call into the right boardfile and perform the rest of the device population at runtime. The ATAGs are however not used to identify the very machine itself. To identify the machine the boot loader also passes a 32-bit number in r1, then a long list of numbers identifying each machine, called mach-types, is compiled into the kernel to match this number.
  3. Device Tree-based: the Android heist in 2007 and onward started to generate a constant influx of new machine types to the ARM kernel, and the board files were growing wild. In 2010 Grant Likely proposed that ARM follow PowerPC and switch to using device trees to describe the hardware, rather than ATAGs and boardfiles. In 2011, following an outburst from Torvalds in March, the situation was becoming unmaintainable and the ARM community started to accelerate to consolidate the kernel around using device trees rather than boardfiles to cut down on kernel churn. Since then, the majority of ARM32 boards use the device tree to describe the hardware. This pattern has been followed by new architectures: Device Tree is established as the current most viable system description model for new machines.

All of the approaches use the kernel device and driver model to eventually represent the devices in the kernel – however during the very early stages of boot the most basic building blocks of the kernel may need to shortcut this to some extent, as we will see.

Whether we use ATAGs or a DTB, a pointer to this data structure is passed in register r2 when booting the kernel and stored away in the global variable __atags_pointer. The name is a bit confusing since it can also point to a DTB. When using ATAGs a second global variable named __machine_arch_type is also used, and this contains register r1 as passed from the boot loader. (This value is unused when booting from a device tree.)

Using the ATAGs or the DTB, a machine descriptor is located. This is found either from the numerical value identifying the machine in __machine_arch_type or by parsing the DTB to inspect the .compatible value of the machine, at the very top node of the device tree.

While ARM32 device tree systems are mainly relying on the device tree to boot, they also still need to match against a machine descriptor in some file under arch/arm/mach-*/*.c, defined with the macro DT_MACHINE_START. A simple grep DT_MACHINE_START gives you an idea of how many basic ARM32 machine types boot from the device tree. The ARM64 (Aarch64) kernel, by contrast, has been engineered from start not to require any such custom machine descriptors and that is why it is not found in that part of the kernel. On ARM64, the device tree is the sole machine description.

If you inspect the arch/arm/kernel/devtree.c file you will see that a default device tree machine named Generic DT based system is defined if we are booting a multiplatform image. If no machine in any subdirectory matches the compatible-string of the device tree, this one will kick in. This way it is possible to rely on defaults and boot an ARM32 system with nothing but a device tree, provided you do not need any fix-ups of any kind. Currently quite a lot of ARM32 machines have some quirks though. ARM64, again, has no quirks on the machine level: if quirks are needed for some hardware, these will normally go into the drivers for the machine, and get detected on a more fine-granular basis using the compatible-string or data found in hardware registers. Sometimes the ARM64 machines use firmware calls.
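The matching with a generic fallback can be pictured with a small sketch. Everything in it is hypothetical: the descriptor struct is reduced to a name and one compatible string, the two board entries are invented, and a plain string compare stands in for the device tree matching that the kernel actually performs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced machine descriptor. */
struct machine_desc_sketch {
        const char *name;
        const char *dt_compat;  /* one compatible string, for brevity */
};

/* Invented boards; real ones are declared with DT_MACHINE_START. */
static const struct machine_desc_sketch machines_sketch[] = {
        { "Hypothetical Board A", "vendor,board-a" },
        { "Hypothetical Board B", "vendor,board-b" },
};

/* Default descriptor used when no board-specific one matches. */
static const struct machine_desc_sketch generic_sketch = {
        "Generic DT based system", NULL
};

const struct machine_desc_sketch *match_machine_sketch(const char *root_compat)
{
        size_t i;
        size_t n = sizeof(machines_sketch) / sizeof(machines_sketch[0]);

        for (i = 0; i < n; i++)
                if (strcmp(machines_sketch[i].dt_compat, root_compat) == 0)
                        return &machines_sketch[i];
        return &generic_sketch;
}
```

A device tree whose top node says "vendor,board-b" gets the board-specific descriptor with whatever quirks it carries, while an unknown compatible string falls through to the generic one.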

In either case the physical address pointing to the data structure for ATAGs or DTB is converted to a virtual address using __phys_to_virt() which involves a bit of trickery as we have described in an earlier article about phys_to_virt patching.

Memory Blocks – Part 1

One of the more important effects of calling either setup_machine_fdt() or setup_machine_tags() depending on machine, is the population of a list of memory blocks, memblocks or regions, which is the basic boot-time unit of physical RAM in Linux. Memblocks is the early memory abstraction in Linux and should be contiguous (follow each other strictly in memory). The implementation as used by all Linux architectures can be found in mm/memblock.c.

Either parser (device tree or ATAGs) identifies blocks of physical memory that are added to a list of memory blocks using the function arm_add_memory(). The memory blocks (regions) are identified with start and size, for example a common case is 128MB of memory starting at 0x00000000 so that will be a memory block with start = 0x00000000 and size = 0x08000000 (128 MB).

The memory blocks are actually stored in struct memblock_region which have members .base and .size for these two numbers, plus a flag field.

The ATAG parser calls arm_add_memory() which will adjust the memory block a bit if it has odd size: the memory block must start at a page boundary, be inside a 32bit physical address space (if we are not using LPAE), and it must certainly be on or above PHYS_OFFSET.
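The page adjustment amounts to rounding the start up and trimming the size down. A minimal sketch of that idea, assuming 4 KB pages and using a made-up region struct and function name (the PHYS_OFFSET and 32-bit checks are left out):

```c
#include <stdint.h>

#define PAGE_SIZE 0x1000ULL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Hypothetical stand-in with the two numbers that matter here. */
struct memblock_region_sketch {
        uint64_t base;
        uint64_t size;
};

/*
 * Round the base up to a page boundary and shrink the size to a whole
 * number of pages. Returns 0 if no usable memory remains.
 */
int page_align_region_sketch(struct memblock_region_sketch *r)
{
        uint64_t aligned = (r->base + PAGE_SIZE - 1) & PAGE_MASK;
        uint64_t lost = aligned - r->base;

        r->base = aligned;
        if (lost >= r->size) {
                r->size = 0;
                return 0;
        }
        r->size = (r->size - lost) & PAGE_MASK;
        return r->size != 0;
}
```

The 128 MB block from the example above passes through unchanged, while an odd block such as start 0x800 and size 0x3000 shrinks to start 0x1000 and size 0x2000.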

When arm_add_memory() has aligned a memory block it will call a facility in the generic memory manager of the Linux kernel with memblock_add() which will in effect store the list of memory blocks in a kernel-wide global list variable with the helpful name memory.

Why must the memory blocks be at or above PHYS_OFFSET (i.e. at the same or a higher address)?

This is because of what has been said earlier about boot: we must load the kernel into the first block of physical memory. However the effect of this check is usually nothing. PHYS_OFFSET is, on all modern platforms, set to 0x00000000. This is because they all use physical-to-virtual patching at runtime. So on modern systems, as far as memory blocks are concerned: anything goes as long as they fit inside a 32-bit physical memory address space.

Consequently, the device tree parser we enter through setup_machine_fdt() does not care about adjusting the memory blocks to PHYS_OFFSET: it will end up in early_init_dt_scan_memory() in drivers/of/fdt.c which calls early_init_dt_add_memory_arch() which just aligns the memory block to a page and adds it to the list with memblock_add(). This device tree parsing code is shared among other Linux architectures such as ARC or ARM64 (Aarch64).

When we later want to inspect these memory blocks, we just use the iterator for_each_mem_range() to loop over them. The vast majority of ARM32 systems have exactly one memory block/region, starting at 0x00000000, but oddities exist and these are handled by this code. An example I will use later involves two memory blocks 0x00000000-0x20000000 and 0x20000000-0x40000000 comprising 2 x 512 MB of memory resulting in 1 GB of physical core memory.
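Looping over the blocks is as simple as it sounds. The sketch below uses a plain array and a for loop in place of the kernel's list and iterator macro; the struct and function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one memory block: start and size. */
struct mem_region_sketch {
        uint64_t base;
        uint64_t size;
};

/* Sum up the sizes of all registered memory blocks. */
uint64_t total_memory_sketch(const struct mem_region_sketch *regions, size_t n)
{
        uint64_t total = 0;
        size_t i;

        for (i = 0; i < n; i++)
                total += regions[i].size;
        return total;
}
```

With the two blocks { 0x00000000, 0x20000000 } and { 0x20000000, 0x20000000 } from the example, the total comes out as 0x40000000, i.e. 1 GB.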

Initial Virtual Memory init_mm

After some assigning of variables for the machine, and setting up the reboot mode from the machine descriptor we then hit this:

init_mm.start_code = (unsigned long) _text;
init_mm.end_code   = (unsigned long) _etext;
init_mm.end_data   = (unsigned long) _edata;
init_mm.brk       = (unsigned long) _end;

init_mm is the initial memory management context and this is a compile-time prepared global variable in mm/init-mm.c.

As you can see we assign some variables to the (virtual) addresses of the kernel .text and .data segments. Actually this also covers the BSS section, that we have initialized to zero earlier in the boot process, as _end follows after the BSS section. The initial memory management context needs to know these things to properly set up the virtual memory. This will not be used until much later, after we have exited setup_arch() and get to the call to the function mm_init(). (Remember that we are currently still running in a rough 1MB-chunk-section mapping.)

init MM set-up Here we can see the area where the kernel memory starts at PAGE_OFFSET and how we align in the different sections into init_mm. The swapper_pg_dir is actually the page global directory of the init_mm structure as we will see later.

The Early Fixmap and early_mm_init()

Next we initialize the early fixmap. The fixmap is a virtual memory area from 0xFFC00000-0xFFF00000 where some fixed physical-to-virtual mappings can be specified. They are used for remapping some crucial parts of the memory as well as some I/O memory before the proper paging is up, so we poke around in the page table in a simplified manner and we can do a few necessary I/O operations in the virtual memory even before the proper set-up of the virtual memory.

There are four types of fixmaps on ARM32, all found in arch/arm/include/asm/fixmap.h where enum fixed_addresses defines slots to be used for different early I/O maps:

  1. FIX_EARLYCON: this is a slot used by the early console TTY driver. Some serial line drivers have a special early console callback that can be used to get an early console before the actual serial driver framework has started. This is supported on ARM32 but it has a very limited value, because on ARM32 we have CONFIG_DEBUG_LL which provides a hardcoded serial port at compile time, which makes it possible to get debug output on the serial port even before we are running in virtual memory, as we have seen earlier. CONFIG_DEBUG_LL cannot be used on a multiplatform image, however. The early console has the upside of using the standard serial port driver callbacks: the serial port defines an EARLYCON_DECLARE() callback and assigns functions to ->con->write/read to get both read and write support on the early console.
  2. FIX_KMAP: As can be seen from the code, this is KM_TYPE_NR * NR_CPUS. KM_TYPE_NR is tied down to 16 for ARM32, so this will be 16 maps for each CPU. This area is used for high memory “highmem”. Highmem on ARM32 is a whole story on its own and we will detail it later, but it relates to the way that the kernel uses a linear map of memory (by patching physical to virtual memory mapping as we have seen), but for now it will suffice to say that this is where memory that cannot be accessed by using the linear map (a simple addition or subtraction to get between physical and virtual memory) is temporarily mapped in so that the kernel can make use of it. The typical use case will be page cache.
  3. FIX_TEXT_POKE[0|1]: These two slots are used by debug code to make some kernel text segments writable, such as when inserting breakpoints into the code. The fixmap will open a “window” over the code that it needs to patch, modify the code and close the window again.
  4. FIX_BTMAPS: This is parameterization for 32 * 7 slots of mappings used by early ioremap, see below.

Those early fixmaps are set up during boot as we call:

pmd_t *pmd;

pmd = fixmap_pmd(FIXADDR_TOP);
pmd_populate_kernel(&init_mm, pmd, bm_pte);

The abbreviations used here include kernel page table idiosyncracies so it might be a good time to read my article on how the ARM32 page tables work.

This will create an entry in the page middle directory (PMD) and sufficient page table entries (PTE) referring to the memory at FIXADDR_TOP, and populate it with the finer granular page table entries found in bm_pte. The page table entries in bm_pte use PAGE_SIZE granularity so usually this is a number of 0x1000 sized windows from virtual to physical memory. FIXADDR_TOP points to the last page in the FIXADDR address space, so this will be at 0xFFF00000 - PAGE_SIZE so typically at 0xFFEFF000..0xFFEFFFFF. The &init_mm parameter is actually ignored, we just set up this PMD using the PTEs in bm_pte. Finally the memory storing the PMD itself is flushed in the translation lookaside buffer so we know the MMU has the right picture of the world.

Did we initialize the init_mm->pgd member now again? Nope. init_mm is a global variable declared in <linux/mm_types.h> as extern struct mm_struct init_mm, and the struct mm_struct itself is also defined in this header. To find init_mm you need to look into the core kernel virtual memory management code in mm/init-mm.c and there it is:

struct mm_struct init_mm = {
        .mm_rb          = RB_ROOT,
        .pgd            = swapper_pg_dir,
        (...)
};

So this is assigned at compile time to point to swapper_pg_dir, which we know already to contain our crude 1MB section mappings. We cross our fingers, hope that our fixmap will not collide with any of the existing 1MB section mappings, and just push a more complex, proper “PMD” entry into this PGD area called swapper_pg_dir. It will work fine.

So we still have our initial 1MB-sized section mappings, and to this we have added an entry to this new “PMD”, which in turn points to the page table bm_pte which stores our fixmaps.

Init MM context This illustrates the world of the init_mm memory management context. It is getting crowded in this picture. You can see the fixmap entries piling up around 0xFFF00000 and the serial port mapped in as one of the fixmaps.

Code will have to assign the physical base address (aligned to a page boundary) to use for each slot in the fixmap, and the kernel will assign and map a suitable virtual address for the physical address, for example the early console does this:

set_fixmap(FIX_EARLYCON, <physical address>);

This just loops back to __set_fixmap() in arch/arm/mm/mmu.c that looks up which virtual address has been assigned for this index (in this case the index is FIX_EARLYCON) by using __fix_to_virt() from the generic part of the fixmap in include/asm-generic/fixmap.h. There are two functions for cross referencing virtual to physical memory and vice versa that look like this:

#define __fix_to_virt(x)        (FIXADDR_TOP - ((x) << PAGE_SHIFT))
#define __virt_to_fix(x)        ((FIXADDR_TOP - ((x)&PAGE_MASK)) >> PAGE_SHIFT)

As you can see, the fixmap remappings are done one page at a time (so each remapping is one single page), backwards from FIXADDR_TOP (on ARM32 0xFFEFF000): the first index, used for FIX_EARLYCON, is index 0 and maps to the page at 0xFFEFF000 itself, index 1 maps to the page at 0xFFEFE000, index 2 to the page at 0xFFEFD000 etc.
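The two macros are easy to try out in isolation. The sketch below turns them into ordinary functions with made-up names, assuming 4 KB pages and the ARM32 FIXADDR_TOP value of 0xFFEFF000 mentioned in the text.

```c
#define PAGE_SHIFT  12
#define PAGE_MASK   (~((1UL << PAGE_SHIFT) - 1))
#define FIXADDR_TOP 0xFFEFF000UL        /* ARM32 value from the text */

/* Fixmap index to virtual address: one page per index, going down. */
unsigned long fix_to_virt_sketch(unsigned int idx)
{
        return FIXADDR_TOP - ((unsigned long)idx << PAGE_SHIFT);
}

/* Virtual address (anywhere inside the page) back to fixmap index. */
unsigned int virt_to_fix_sketch(unsigned long vaddr)
{
        return (unsigned int)((FIXADDR_TOP - (vaddr & PAGE_MASK)) >> PAGE_SHIFT);
}
```

Index 0 gives FIXADDR_TOP itself, every following index lands one page lower, and converting any address inside a page back gives the index of that page.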

It then looks up the PTE for this virtual address using pte_offset_fixmap():

pte_t *pte = pte_offset_fixmap(pmd_off_k(vaddr), vaddr);

Which at this point is a pointer to pte_offset_early_fixmap():

static pte_t * __init pte_offset_early_fixmap(pmd_t *dir, unsigned long addr)
{
    return &bm_pte[pte_index(addr)];
}

So we simply get a pointer to the PTE inside the bm_pte page table, which makes sense since this is the early fixmap page table where these page table entries physically reside.

Then set_fixmap() will modify the page table entry in the already registered and populated bm_pte page table by calling set_pte_at() which calls set_pte_ext() which will eventually call down into the per-CPU assembly symbol set_pte_ext in arch/arm/mm/proc-*.S. This will conjure the right value for the PTE and write that back to the page table in the right slot, so that the virtual-to-physical mapping actually happens.

After this we call local_flush_tlb_kernel_range() just to make sure we don’t have this entry stored in the translation lookaside buffer for the CPU. We can now access the just remapped physical memory for the early console in the virtual address space. It better not be more than one page, but it’s cool: there is no memory mapped serial port that uses more than one page of memory. It will “just work”.

The early fixmaps will eventually be converted to proper (non-fixed) mappings once we call early_fixmap_shutdown() inside paging_init() which will be described later. Curiously only I/O memory is supported here. We better not use any fixmaps before early_fixmap_shutdown() that are not I/O memory and expect them to still be around after this point. This should be safe: the patching in the poke windows and the highmem business should not happen until later anyways.

As I noted the FIX_BTMAPS inside the early fixmaps are used for I/O memory. So this comes next, as we call early_ioremap_init(). As ARM32 is using the generic early ioremap code this just calls early_ioremap_setup() in mm/early_ioremap.c. This makes it possible to use early calls to ioremap(). As noted we have a define for NR_FIX_BTMAPS which we use to parameterize the generic early ioremap code. We can early ioremap 32 different memory areas.

For drivers it will transparently provide a back-end for ioremap() so that a piece of I/O memory can be remapped to a virtual address already at this point, and stay there, usually for the uptime of the system. A driver requesting an ioremap at this point will get a virtual-to-physical mapping in the assigned virtual memory area somewhere in the range 0xFFC00000-0xFFF00000 and it stays there.

Which device drivers will use these early ioremaps to get to the memory-mapped I/O? The early console is using its own fixmap so not that one. Well, actually not many as it looks. But it’s available.

Later, at runtime, the fixmaps will find another good use: they are used to map in highmem: memory that the kernel cannot handle due to being outside of the linear kernel map. Such areas will be mapped in here one piece at a time. This is causing complexities in the highmem handling, and is why we are investigating an end to highmem. If you do not know what highmem means in this context – do not worry! – it will be explained in more detail below.

Early Parameters

We have just set up some early mappings for the early console, so we next call parse_early_param() in init/main.c to read a few command line parameters that just cannot wait. As the documentation in <linux/init.h> says: “only for really core code”.

Before we set up the early fixmap we actually copied the boot_command_line. This was already present since the call to setup_machine_fdt() or setup_machine_tags(). During the parsing of ATAGs or the DTB, boot_command_line is assigned from either source. So a kernel command line can be passed in from each of these two facilities. Parameters to the kernel can naturally be passed in on this command line.

The normal way to edit and pass these command line arguments is by using a facility in U-Boot, UEFI or GRUB to set them up in the boot loader before booting the kernel. U-Boot for example will either pass them in a special ATAG (old way) or by modifying the chosen node of the device tree in memory before passing a pointer to the device tree to the kernel in r2 when booting.

The early params are defined all over the kernel compiled-in code (not in modules, naturally) using the macro early_param(). Each of these results in a struct obs_kernel_param with a callback ->setup_func() associated with it, that gets stored in a table section named .init.setup by the linker, and these will be called one by one at this point.
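The table-of-callbacks idea can be sketched in plain userspace C. This is only an illustration under simplifying assumptions: the struct and function names below are hypothetical stand-ins, the table is an ordinary array instead of a linker-collected section, and the command line is split on spaces only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct obs_kernel_param: a name plus a callback. */
struct early_entry {
    const char *name;
    int (*setup_func)(const char *val);
};

static char mem_arg[32];      /* last value seen for mem= */
static int seen_earlycon;     /* set when earlycon was on the command line */

static int setup_mem(const char *val)
{
    strncpy(mem_arg, val, sizeof(mem_arg) - 1);
    mem_arg[sizeof(mem_arg) - 1] = '\0';
    return 0;
}

static int setup_earlycon(const char *val)
{
    (void)val;
    seen_earlycon = 1;
    return 0;
}

/* In the kernel the linker collects these entries; here it is a plain array. */
static const struct early_entry early_table[] = {
    { "mem",      setup_mem },
    { "earlycon", setup_earlycon },
};

/* Split the command line on spaces and run every matching callback.
 * Returns the number of callbacks that were invoked. */
static int parse_early(const char *cmdline)
{
    char buf[256];
    int calls = 0;

    strncpy(buf, cmdline, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *tok = strtok(buf, " "); tok; tok = strtok(NULL, " ")) {
        char *val = strchr(tok, '=');   /* "name=value" or bare "name" */
        if (val)
            *val++ = '\0';
        for (size_t i = 0; i < sizeof(early_table) / sizeof(early_table[0]); i++) {
            if (strcmp(tok, early_table[i].name) == 0) {
                early_table[i].setup_func(val ? val : "");
                calls++;
            }
        }
    }
    return calls;
}
```

Calling parse_early("console=ttyAMA0 mem=64M earlycon") runs two callbacks here: the console= option has no entry in this toy table and is simply skipped.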

Examples of things that get set up from early params are cache policy, extra memory segments passed with mem=..., parameters to the IRQ controllers such as noapic to turn off the x86 APIC, initrd to point out the initial RAM disk.

mem is an interesting case: this is parsed by an early_param() inside setup.c itself, and if the user passed some valid mem= options on the command line, these will be added to the list of available memory blocks that we constructed from the ATAGs or DTB.

earlycon is also parsed, which will activate the early console if this string is passed on the command line. So we just enabled the special FIX_EARLYCON early fixmap, and now the early console can use this and dump out the kernel log so far. All this happens in drivers/tty/serial/earlycon.c by utilizing the early console callbacks of the currently active serial port.

If you are using device tree, early_init_dt_scan_chosen_stdout() will be called, which will call of_setup_earlycon() on the serial driver selected in the stdout-path in the chosen section in the root of the device tree, such as this:

/ {
    chosen {
        stdout-path = "uart0:19200n8";
    };
(...)

Or like this:

/ {
    chosen {
        stdout-path = &serial2;
    };
(...)
    serial2: uart@80007000 {
        compatible = "arm,pl011", "arm,primecell";
        reg = <0x80007000 0x1000>;
    };
};

The device tree parsing code will follow the phandle or use the string to locate the serial port. For this reason, to get an early console on device tree systems, all you need to pass on the command line is earlycon. The rest will be figured out by the kernel by simply inspecting the device tree.

On older systems it is possible to pass the name of the driver and a memory address of a serial port, for example earlycon=pl011,0x80007000 but this is seldom used these days. Overall the earlycon, as noted, is seldom used on ARM32. It is because we have the even more powerful DEBUG_LL that can always hammer out something on the serial port. If you need really early debugging information, my standard go-to solution is DEBUG_LL and using printascii() until the proper kernel prints have come up. If earlycon is early enough, putting the stdout-path into your device tree and specifying the earlycon parameter should do the trick on modern systems.

Early Memory Management

As we reach early_mm_init() we issue build_mem_type_table() and early_paging_init() in sequence. This is done so that we can map the Linux memory the way it is supposed to be mapped using all features of the MMU to the greatest extent possible. In reality, this involves setting up the “extra bits” in the level-2/3 page descriptors corresponding to the page table pointer entries (PTE:s) so the MMU knows what to do.

build_mem_type_table() will fill in the mem_types[] array of memory types with the appropriate protection and access settings for the CPU we are running on. mem_types[] is an array of some 16 different memory types that can be found inside arch/arm/mm/mmu.c and lists all the memory types that may exist in an ARM32 system, for example MT_DEVICE for memory-mapped I/O or MT_MEMORY_RWX for RAM that can be read, written and executed. The memory type will determine how the page descriptors for a certain type of memory get set up. At the end of building this array, the kernel will print out some basic information about the current main memory policy, such as:

Memory policy: ECC enabled, Data cache, writeback

This tells us that ECC was turned on by passing ecc=on on the command line, with the writeback data cache policy, which is the default. The message is a bit misleading since it actually just talks about a few select aspects of the memory policy regarding the use of ECC and the data cache. There are many other aspects to the cache policy that can be seen by inspecting the mmu.c file. It is not normal to pass in ecc=on, so this is just an example.

early_paging_init() then, is a function that calls ->pv_fixup() on the machine, and this is currently only used on the Texas Instruments Keystone 2 and really just does something useful when the symbol CONFIG_ARM_PV_FIXUP is defined. What happens then is that an address space bigger than 4GB is enabled using LPAE. This is especially complicated since on this machine even the physical addresses change as part of the process and PHYS_OFFSET is moved upwards. The majority of ARM32 machines never do this.

We do some minor initialization calls that are required before kicking in the proper paging such as setting up the DMA zone, early calls to Xen (virtualization) and EFI. Then we get to the core of the proper Linux paging.

Paging Initialization: Lowmem and Highmem

We now reach the point where the kernel will initialize the virtual memory handling proper, using all the bells and whistles of the MMU.

The first thing we do is to identify the bounds of lowmem. ARM32 differentiates between lowmem and highmem like this:

  • Lowmem is the physical memory used by the kernel linear mapping. This is typically the virtual memory between 0xC0000000-0xF0000000, 768 MB. With an alternative virtual memory split, such as VMSPLIT_1G giving userspace 0x40000000 bytes (1 GB) and kernelspace 0xC0000000 bytes (3 GB), lowmem would instead be 0x40000000-0xF0000000, about 2.75 GB. But this is uncommon: almost everyone and their dog uses the default VMSPLIT with 768 MB lowmem.
  • Highmem is any memory at higher physical addresses, that we cannot fit inside the lowmem linear physical-to-virtual map.

You might remember that this linear map between physical and virtual memory was important for ARM32. It is achieved by physical-to-virtual runtime patching as explained in a previous article, with the goal of being as efficient as possible.

As you can see, on ARM32 both lowmem and highmem have a very peculiar definition and those are just conventions, they have very little to do with any hardware limitations of the architecture or any other random definitions of “lowmem” and “highmem” that are out there.

A system can certainly have less than 768 MB (0x30000000) physical memory as well: then all is fine. The kernel can map in and access any memory and everyone is happy. For 14 years all ARM32 systems were like this, and “highmem” did not even exist as a concept. Highmem was added to ARM32 by Nicolas Pitre in September 2008. The use case was a Marvell DB-78x00-BP development board that happened to have 2 GB of RAM. Highmem requires fixmap support so that was added at the same time.

The kernel currently relies on being able to map all of the core memory – the memory used by the kernel itself, and all the userspace memory on the machine – into its own virtual memory space. This means that the kernel cannot readily handle a physical memory bigger than 768 MB, with the standard VMSPLIT at 0xC0000000. To handle any memory outside of these 768 MB, the FIX_KMAP windows in the fixmaps we discussed earlier are used.

Let’s inspect how this 768 MB limitation comes about.

To calculate the end of lowmem we call adjust_lowmem_bounds(), but first notice this compiled-in constant a little bit above that function:

static void * __initdata vmalloc_min =
    (void *)(VMALLOC_END - (240 << 20) - VMALLOC_OFFSET);

vmalloc is shorthand for virtual memory allocation area, and indicates the memory area where the kernel allocates memory windows. It has nothing to do with the physical memory of the kernel at this point: it is an area of virtual addresses that the kernel can use to place mappings of RAM or memory-mapped I/O. It will be used by SLAB and other kernel memory allocators as well as by ioremap(), it is a number of addresses in the kernel’s virtual memory that it will be using to access random stuff in physical memory. As we noted, we had to use early fixmaps up to this point, and the reason is exactly this: there is no vmalloc area to map stuff into yet.

The variables used for this pointer can be found in arch/arm/include/asm/pgtable.h:

#define VMALLOC_OFFSET          (8*1024*1024)
#define VMALLOC_START           (((unsigned long)high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1))
#define VMALLOC_END             0xff800000UL

Never mind the high_memory variable in VMALLOC_START: it has not been assigned yet. We have no idea about where the virtual memory allocation will start at this point.

VMALLOC_END is hardcoded to 0xFF800000, (240 << 20) is an interesting way of writing 0x0F000000 and VMALLOC_OFFSET is 0x00800000. So vmalloc_min will be a pointer to the virtual address 0xF0000000. This lower bound has been chosen for historical reasons, which can be studied in the commit. Above this, all the way to 0xFF800000 we will find our vmalloc area.
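The subtraction is easy to check outside the kernel. The sketch below reuses the two constants quoted above and returns a plain 32-bit integer instead of a pointer; the function name is mine.

```c
#include <stdint.h>

#define VMALLOC_OFFSET  (UINT32_C(8) * 1024 * 1024)   /* 0x00800000 */
#define VMALLOC_END     UINT32_C(0xFF800000)

/* Mirrors the vmalloc_min initializer, with an integer in place of the pointer. */
static uint32_t vmalloc_min_addr(void)
{
    return VMALLOC_END - (UINT32_C(240) << 20) - VMALLOC_OFFSET;
}
```

Step by step: 0xFF800000 minus 0x0F000000 is 0xF0800000, and taking away another 0x00800000 leaves 0xF0000000.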

From this we calculate the vmalloc_limit as the physical address corresponding to vmalloc_min, i.e. what address in the physical memory the virtual address 0xF0000000 corresponds to in the linear map of physical-to-virtual memory. If our kernel was loaded/decompressed at 0x10008000 for example, that usually corresponds to 0xC0008000 and so the offset is 0xB0000000 and the corresponding physical address for 0xF0000000 would be 0xF0000000 - 0xB0000000 = 0x40000000. So this is our calculated vmalloc_limit.

We arrive at the following conclusion: since the kernel starts at 0x10000000 (in this example), the linear map of the kernel stretches from 0x10000000-0x40000000 (in virtual address space 0xC0000000-0xF0000000). That would be the equivalent of 0x30000000 bytes of memory, i.e. 768MB. This holds in general: the linear map of the kernel memory is 768 MB. The kernel and all memory it directly references must fit in this area.
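The same reasoning as a few lines of C, using the example numbers from the text. The constants are fixed here for illustration only (the real kernel patches the offset at boot), and the helper names are hypothetical.

```c
#include <stdint.h>

/* Example values from the text, not universal constants. */
#define PAGE_OFFSET  UINT32_C(0xC0000000)  /* start of the kernel linear map */
#define PHYS_OFFSET  UINT32_C(0x10000000)  /* start of RAM in this example */

/* Linear map: virtual and physical addresses differ by a constant offset. */
static uint32_t linear_virt_to_phys(uint32_t v)
{
    return v - PAGE_OFFSET + PHYS_OFFSET;
}

static uint32_t linear_phys_to_virt(uint32_t p)
{
    return p - PHYS_OFFSET + PAGE_OFFSET;
}

/* Size of the window between the start of the linear map and vmalloc_min. */
static uint32_t lowmem_max_bytes(uint32_t vmalloc_min)
{
    return vmalloc_min - PAGE_OFFSET;
}
```

Feeding in 0xF0000000 for vmalloc_min gives the physical limit 0x40000000 and a window of 0x30000000 bytes, which is the 768 MB figure.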

In adjust_lowmem_bounds() we first loop over all memblocks to skip over those that are not “PMD-aligned” which means they do not start at a physical address evenly divisible by 0x00200000 (2 MB) on ARM systems. These are marked nomap in the memblock mechanism, so they will not be considered for kernel mappings.

The patch-phys-to-virt mechanism and the way we map page tables over the kernel relies on the kernel being at a PMD boundary so we will not be able to use memory blocks that do not start at a PMD-aligned memory address. If the memory block is such an odd thing, such as when a machine has some small memories at random locations for things like graphics or other buffers, these will not be considered as a candidate for the core of the system. You can think about this as a heuristic that says: “small memories are probably not supposed to be used as main RAM memory”.

The next loop is more interesting. Here we again loop over the memblocks, but skip over those we just marked as unfit for the core memory. For those that remain we check if they start below vmalloc_limit (in our example at 0x40000000). As the memory we are executing inside certainly must start below vmalloc_limit we have a core memory candidate and inspect it further: if the block ends above the current lowmem_limit we bump lowmem_limit to the end of the block, unless it would go past vmalloc_limit, in which case we will truncate it at vmalloc_limit. This will in effect fit the core memory candidate over the 768 MB available for lowmem, and make sure that lowmem_limit is at the end of the highest memblock we can use for the kernel, or alternatively at the hard vmalloc_limit if that is lower than the end of the memblock. vmalloc_limit in turn is just calculated by subtracting vmalloc_reserve from VMALLOC_END, and that is parsed from a command line argument, so if someone passed in a minimum size of vmalloc reservation space, that will be respected.

In most cases arm_lowmem_limit will be at the end of the highest located memblock. Very often there is just one big enough memblock on a given system.

The loop is necessary: there could be two blocks of physical memory adjacent to each other. Say one memblock at 0x00000000-0x20000000 and another memblock at 0x20000000-0x40000000 giving 2 x 512 MB = 1 GB of core memory. Both must be mapped to access the maximum of core memory. The latter will however be truncated by the code so that it ends at vmalloc_limit, which will be at 0x30000000 as we can only map 768 MB of memory. Oops.
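The two passes can be condensed into a short userspace sketch. This is a simplification of what the text describes, not the kernel function itself: the struct is a hypothetical stand-in for a memblock region, and unaligned blocks are simply skipped instead of being marked nomap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memblock region. */
struct memblk {
    uint64_t base;
    uint64_t size;
};

#define PMD_SIZE UINT64_C(0x00200000)   /* 2 MB, as stated in the text */

/* Walk the blocks and return the physical end of lowmem. */
static uint64_t find_lowmem_limit(const struct memblk *blk, size_t n,
                                  uint64_t vmalloc_limit)
{
    uint64_t lowmem_limit = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t start = blk[i].base;
        uint64_t end = start + blk[i].size;

        if (start % PMD_SIZE)          /* not PMD-aligned: not core memory */
            continue;
        if (start >= vmalloc_limit)    /* starts above the linear map window */
            continue;
        if (end > lowmem_limit)        /* bump, but never past vmalloc_limit */
            lowmem_limit = end < vmalloc_limit ? end : vmalloc_limit;
    }
    return lowmem_limit;
}
```

For the two 512 MB blocks in the example and a vmalloc_limit of 0x30000000, the first block moves the limit to 0x20000000 and the second is truncated at 0x30000000.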

The remainder of the second memblock will be called highmem if we have enabled highmem support, else we will just call memblock_remove() to delete the remainder from the system, and in this case the remaining memory will be unused so we print a warning about this: “Consider using a HIGHMEM enabled kernel.”

Lowmem and highmem on 1 GB This shows how a 1 GB physical memory in two memblocks of 512 MB each starting at physical address 0x00000000 gets partitioned into a lowmem of 768 MB and highmem of 256 MB. The upper 256 MB cannot fit into the linear kernel memory map!

Lowmem and highmem on 2 GB If we instead have four memblocks of 512 MB physical memory comprising a total of 2 GB, the problem becomes ever more pronounced: 1.2 GB of memory is now in highmem. This reflects the situation in the first Marvell board with 2 GB of memory that initiated the work on highmem support for ARM32.

After exiting the function we set arm_lowmem_limit to the lowmem_limit we found. In our example with these two 512 MB banks, it will be set at physical address 0x30000000. This is aligned to the PMD granularity (0x00200000) which will again be 0x30000000, and set as a limit for memory blocks using memblock_set_current_limit(). The default limit is 0xFFFFFFFF so this lowers the upper bound on where in physical memory we can make allocations with memblocks for the kernel: the kernel can now only allocate in physical memory between 0x00000000-0x30000000.

We immediately also assign high_memory to the virtual memory address right above arm_lowmem_limit, in this case 0xF0000000 of course. That means that this definition from arch/arm/include/asm/pgtable.h is now resolved:

#define VMALLOC_START (((unsigned long)high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1))

And we find that VMALLOC_START is set to 0xF0800000. The VMALLOC area is now defined to be 0xF0800000-0xFF800000, 0x0F000000 bytes (240 MB). This matches the (240 << 20) statement in the vmalloc_min definition.
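As a quick check of the macro, here is the same expression as a userspace function that takes high_memory as a parameter (an integer here, a pointer in the kernel). The function name is mine.

```c
#include <stdint.h>

#define VMALLOC_OFFSET  (UINT32_C(8) * 1024 * 1024)   /* 0x00800000 */
#define VMALLOC_END     UINT32_C(0xFF800000)

/* Same expression as the VMALLOC_START macro: add the 8 MB guard offset,
 * then round down to an 8 MB boundary. */
static uint32_t vmalloc_start(uint32_t high_memory)
{
    return (high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET - 1);
}
```

With high_memory at 0xF0000000 this yields 0xF0800000, leaving 0x0F000000 bytes up to VMALLOC_END. A machine with less RAM has a lower high_memory and therefore a larger vmalloc area.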

Next

This exploration will be continued in Setting Up the ARM32 Architecture, part 2 where we will take a second round around the memory blocks and set up the paging.

 
Read more...

from David Ahern

Long overdue blog post on XDP; so many details uncovered during testing causing tests to be redone.

This post focuses on a comparison of XDP and OVS in delivering packets to a VM from the perspective of CPU cycles spent by the host in processing those packets. There are a lot of variables at play, and changing any one of them radically affects the outcome, though it should be no surprise XDP is always lighter and faster.

Setup

I believe I am covering all of the settings here that I discovered over the past few months that caused variations in the data.

Host

The host is a standard, modern server (Dell PowerEdge R640) with an Intel® Xeon® Platinum 8168 CPU @ 2.70GHz with 96 hardware threads (48 cores + hyper threading to yield 96 logical cpus in the host). The server is running Ubuntu 18.04 with the recently released 5.8.0 kernel. It has a Mellanox Connectx4-LX ethernet card with 2 25G ports into an 802.3ad (LACP) bond, and the bond is connected to an OVS bridge.

Host setup

As discussed in [1] to properly compare the CPU costs of the 2 networking solutions, we need to consolidate packet processing to a single CPU. Handling all packets destined to the same VM on the same CPU avoids lock contention on the tun ring, so consolidating packets to a single CPU is actually best case performance.

Ensure RPS is disabled in the host:

for d in eth0 eth1; do
    find /sys/class/net/${d}/queues -name rps_cpus |
    while read f; do
            echo 0 | sudo tee ${f}
    done
done

and add flow rules in the NIC to push packets for the VM under test to a single CPU:

sudo ethtool -N eth0 flow-type ether dst 12:34:de:ad:ca:fe action 2
sudo ethtool -N eth1 flow-type ether dst 12:34:de:ad:ca:fe action 2

For this host and ethernet card, packets for queue 2 are handled on CPU 5 (consult /proc/interrupts for the mapping on your host).

XDP bypasses the qdisc layer, so to have a fair comparison make noqueue the default qdisc before starting the VM:

sudo sysctl -w net.core.default_qdisc=noqueue

(or add a udev rule [2]).

Finally, the host is fairly quiet with only one VM running (the one under test) and very little network traffic outside of the VM under test and a few, low traffic ssh sessions used to run commands to collect data about the tests.

Virtual Machine

The VM has 8 cpus and is also running Ubuntu 18.04 with a 5.8.0 kernel. It uses tap+vhost for networking with the tap device a port in the OVS bridge as shown in the picture above. The tap device has a single queue, and RPS is also disabled in the guest:

echo 00 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus

The VM is also quiet with no load running in the guest OS.

The point of this comparison is host side processing of packets, so packets are dropped in the guest as soon as possible using a bpf program [3] attached to eth0 as a tc filter. (Note: Theoretically, XDP should be used to drop the packets in the guest OS since it truly is the fewest cycles per packet. However, XDP in the VM requires a multi-queue NIC[5], and adding queues to the guest NIC has a huge effect on the results.)

In the host, the qemu threads corresponding to the guest CPUs (vcpus) are affined (as a set) to 8 hardware threads in the same NUMA node as CPU 5 (the host CPU processing packets per the RSS rules mentioned earlier). The vhost thread for the VM's tap device is also affined to a small set of host CPUs in the same NUMA node to avoid scheduling collisions with the vcpu threads, the CPU processing packets (5) and its sibling hardware thread (CPU 53 in my case) – all of which add variability to the results.

Forwarding with XDP

Packet forwarding with XDP is done by attaching an L2 forwarding program [4] to eth0 and eth1. The program pulls the VLAN and destination mac from the ethernet header, and uses the pair as a key for a lookup in a hash map. The lookup returns the next device index for the packet which for packets destined to the VM is the index of its tap device. If an entry is found, the packet is redirected to the device via XDP_REDIRECT. The use case was presented in depth at netdevconf 0x14 [5].
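The lookup step can be illustrated with a plain C sketch. To be clear, this is not the program in [4]: the struct layout and function name are hypothetical, and a linear scan over an array stands in for the BPF hash map.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key and table layout; the real program uses a BPF hash map. */
struct fwd_key {
    uint16_t vlan;
    uint8_t  dmac[6];
};

struct fwd_entry {
    struct fwd_key key;
    int ifindex;          /* device index to redirect to */
};

/* Return the next device index for (vlan, dmac), or -1 when no entry exists. */
static int fwd_lookup(const struct fwd_entry *tbl, size_t n,
                      uint16_t vlan, const uint8_t dmac[6])
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].key.vlan == vlan &&
            memcmp(tbl[i].key.dmac, dmac, 6) == 0)
            return tbl[i].ifindex;
    }
    return -1;
}
```

A hit corresponds to the XDP_REDIRECT case described above. The VLAN is part of the key, so the same MAC address on a different VLAN does not match.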

Packet generator

Packets are generated using 2 VMs on a server that is directly connected to the same TOR switches as the hypervisor running the VM under test. The point of the investigation is to measure the overhead of delivering packets to a VM, so memcpy is kept to a minimum by having the packet generator [6] in the VMs send 1-byte UDP packets.

Test setup

Each VM can generate a little over 1 million packets per sec (1M pps), for a maximum load of 2.2M pps based on 2 separate source addresses.

CPU Measurement

As discussed in [1], a fair number of packets are processed in the context of some interrupted victim process, or are handled on an idle CPU, so the cycles are not fully accounted in the softirq time shown in tools like mpstat.

This test binds openssl speed, a purely userspace command[1], to the CPU handling packets to fully consume 100% of all CPU cycles which makes the division of CPU time between user, system and softirq more transparent. In this case, the output of mpstat -P 5 shows how all of the cycles for CPU 5 were spent (within the resolution of system accounting):

  • %softirq is the time spent handling packets. This data is shown in the graphs below.
  • %usr represents the usable CPU time for processes to make progress on their workload. In this test, it shows the percentage of CPU consumed by openssl and compares to the times shown by openssl within 1-2%.
  • %sys is the percentage of kernel time and for the data shown below was always <0.2%.

As an example, in this mpstat output openssl is only getting 14.2% of the CPU while 85.8% was spent handling the packet load:

CPU    %usr   %nice    %sys  %iowait   %irq   %soft   %idle
  5   14.20    0.00    0.00     0.00   0.00   85.80   0.00

(%steal, %guest and %gnice were always 0 and dropped for conciseness.)

Let's get to the data.

CPU Comparison

This chart shows a comparison of the %softirq required to handle various PPS rates for both OVS and XDP. Lower numbers are better (higher percentages mean more CPU cycles).

1-VM softirq

There is 1-2% variability in ksoftirqd percentages despite the 5-second averaging, but the variability does not really affect the important points of this comparison.

The results should not be that surprising. OVS has well established scaling problems and the chart shows that as packet rates increase. In my tests it was not hard to saturate a CPU with OVS, reaching a maximum packet rate to the VM of 1.2M pps. The 100% softirq at 1.5M pps and up is saturation of ksoftirqd alone with nothing else running on that CPU. Running another process on CPU 5 immediately affects the throughput rate as the CPU splits time between processing packets and running that process. With openssl, the packet rate to the VM is cut in half with packet drops at the host ingress as it can no longer keep up with the packet rate given the overhead of OVS.

XDP on the other hand could push 2M pps to the VM before the guest could no longer keep up, with packet drops at the tap device (i.e., no room in the tun ring meaning the guest has not processed the previous packets). As shown above, the host still has plenty of CPU to handle more packets or run workloads (preferred condition for a cloud host).

One thing to notice about the chart above is the apparent flat lining of CPU usage between 50k pps and 500k pps. That is not a typo, and the results are very repeatable. This needs more investigation, but I believe it shows the efficiencies kicking in from a combination of more packets getting handled per napi poll cycle (closer to maximum of the netdev budget) and the kernel side bulking in XDP before a flush is required.

Hosts typically run more than 1 VM, so let's see the effect of adding a second VM to the mix. For this case a second VM is started with the same setup as mentioned earlier, but now the traffic load is split equally between 2 VMs. The key point here is a single CPU processing interleaved network traffic for 2 different destinations.

2-VM softirq

For OVS, CPU saturation with ksoftirqd happens with a maximum packet rate to each VM of 800k pps (compared to 1.2M with only a single VM). The saturation is in the host with packet drops shown at host ingress, and again any competition for the CPU processing packets cuts the rate in half.

Meanwhile, XDP is barely affected by the second VM with a modest increase of 3-4% in softirq at the upper packet rates. In this case, the redirected packets are just hitting separate bulking queues in the kernel. The two packet generators are not able to hit 4+M pps to find the maximum per-VM rate.

Final Thoughts

CPU cycles are only the beginning for comparing network solutions. A full OVS-vs-XDP comparison needs to consider all the resources consumed – e.g., memory as well as CPU. For example, OVS has ovs-vswitchd which consumes a high amount of memory (>750MB RSS on this server with only the 2 VMs) and additional CPU cycles to handle upcalls (flow misses) and revalidate flow entries in the kernel which on an active hypervisor can easily consume 50+% cpu (not counting increased usage from various bugs[7]).

Meanwhile, XDP is still early in its lifecycle. Right now, using XDP for this setup requires VLAN acceleration in the NIC [5] to be disabled meaning the VLAN header has to be removed by the ebpf program before forwarding to the VM. Using the proposed hardware hints solution reduces the softirq time by another 1-2% meaning 1-2% more usable CPU by leveraging hardware acceleration with XDP. This is just an example of how XDP will continue to get faster as it works better with hardware offloads.

Acronyms

  • LACP: Link Aggregation Control Protocol
  • NIC: Network Interface Card
  • NUMA: Non-Uniform Memory Access
  • OVS: Open vSwitch
  • PPS: Packets per Second
  • RPS: Receive Packet Steering
  • RSS: Receive Side Scaling
  • TOR: Top-of-Rack
  • VM: Virtual Machine
  • XDP: Express Data Path in Linux

References

[1] https://people.kernel.org/dsahern/the-cpu-cost-of-networking-on-a-host
[2] https://people.kernel.org/dsahern/rss-rps-locking-qdisc
[3] https://github.com/dsahern/bpf-progs/blob/master/ksrc/rx_acl.c
[4] https://github.com/dsahern/bpf-progs/blob/master/ksrc/xdp_l2fwd.c
[5] https://netdevconf.info/0x14/session.html?tutorial-XDP-and-the-cloud
[6] https://github.com/dsahern/random-cmds/blob/master/src/pktgen.c
[7] https://www.mail-archive.com/ovs-dev@openvswitch.org/msg39266.html

 
Read more...

from Christian Brauner

In my last article I looked at the seccomp notifier in detail and how it allows us to make unprivileged containers way more capable (Sorry, kernel joke.). This is the (very) crazy (but very short) sequel. (Sorry Jon, no novella this time. :))

Last time I mentioned two new features that we had landed:

  1. Retrieving file descriptors from another task via pidfd_getfd()
  2. Injecting file descriptors via the new SECCOMP_IOCTL_NOTIF_ADDFD ioctl on the seccomp notifier

The second feature just landed in the merge window for v5.9. So what better time than now to boot a v5.9 pre-rc1 kernel and play with the new features.

I said that these features make it possible to intercept syscalls that return file descriptors or that pass file descriptors to the kernel. Syscalls that come to mind are open(), connect(), dup2(), but also bpf(). People that read the first blogpost might not have realized how crazy^serious one can get with these two new features so I thought it would be a good exercise to illustrate it. And what better victim than bpf().

As we know, bpf() and unprivileged containers don't get along too well. But that doesn't need to be the case. For the demo you're about to see I enabled LXD to supervise the bpf() syscalls for tasks running in unprivileged containers. We will intercept the bpf() syscalls for the BPF_PROG_LOAD command for BPF_PROG_TYPE_CGROUP_DEVICE program types and the BPF_PROG_ATTACH, and BPF_PROG_DETACH commands for the BPF_CGROUP_DEVICE attach type. This allows a nested unprivileged container to load its own device profile in the cgroup2 hierarchy.

This is just a tiny glimpse into how this can be used and extended. ;) The pull request for LXD is already up here. Let's see if the rest of the team thinks I'm going crazy. :)

asciicast

 
Read more...

from David Ahern

I recently learned this fun fact: With RSS or RPS enabled [1] and a lock-based qdisc on a VM's tap device (e.g., fq_codel) a UDP packet storm targeted at the VM can severely impact the entire server.

The point of RSS/RPS is to distribute the packet processing load across all hardware threads (CPUs) in a server / host. However, when those packets are forwarded to a single device that has a lock-based qdisc (e.g., virtual machines and a tap device or a container and veth based device) that distributed processing causes heavy spinlock contention resulting in ksoftirqd spinning on all CPUs trying to handle the packet load.

As an example, my server has 96 CPUs, and 1 million UDP packets per second targeted at the VM are enough to push all of the ksoftirqd threads to near 100%:

  PID %CPU COMMAND               P
   58 99.9 ksoftirqd/9           9
  128 99.9 ksoftirqd/23         23
  218 99.9 ksoftirqd/41         41
  278 99.9 ksoftirqd/53         53
  318 99.9 ksoftirqd/61         61
  328 99.9 ksoftirqd/63         63
  358 99.9 ksoftirqd/69         69
  388 99.9 ksoftirqd/75         75
  408 99.9 ksoftirqd/79         79
  438 99.9 ksoftirqd/85         85
 7411 99.9 CPU 7/KVM            64
   28 99.9 ksoftirqd/3           3
   38 99.9 ksoftirqd/5           5
   48 99.9 ksoftirqd/7           7
   68 99.9 ksoftirqd/11         11
   78 99.9 ksoftirqd/13         13
   88 99.9 ksoftirqd/15         15
   ...

perf top shows the spinlock contention:

    96.79%  [kernel]          [k] queued_spin_lock_slowpath
     0.40%  [kernel]          [k] _raw_spin_lock
     0.23%  [kernel]          [k] __netif_receive_skb_core
     0.23%  [kernel]          [k] __dev_queue_xmit
     0.20%  [kernel]          [k] __qdisc_run

With the callchain leading to

    94.25%  [kernel.vmlinux]    [k] queued_spin_lock_slowpath
            |
             --94.25%--queued_spin_lock_slowpath
                       |
                        --93.83%--__dev_queue_xmit
                                  do_execute_actions
                                  ovs_execute_actions
                                  ovs_dp_process_packet
                                  ovs_vport_receive

A little code analysis shows this is the qdisc lock in __dev_xmit_skb.

The overloaded ksoftirqd threads mean it takes longer to process packets, resulting in budget limits getting hit and packet drops at ingress. The packet drops can cause ssh sessions to stall or drop, or cause disruptions in protocols like LACP.
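
One way to watch for these symptoms is /proc/net/softnet_stat. The sketch below is an illustration rather than the tooling used for the measurements above: parse_softnet_stat is a hypothetical helper, and it assumes the common layout where each row is one CPU's counters in hex, with the first three columns being packets processed, packets dropped because the backlog was full, and the number of times the softirq ran out of budget or time ("time squeeze").

```python
import sys
from pathlib import Path


def parse_softnet_stat(text: str) -> list[dict[str, int]]:
    """Parse the first three hex columns of /proc/net/softnet_stat."""
    rows = []
    for index, line in enumerate(text.splitlines()):
        fields = [int(field, 16) for field in line.split()]
        if len(fields) < 3:
            continue  # skip blank or unexpectedly short lines
        rows.append({
            "row": index,  # row order follows online CPUs, not raw CPU ids
            "processed": fields[0],
            "dropped": fields[1],
            "time_squeeze": fields[2],
        })
    return rows


if __name__ == "__main__" and len(sys.argv) == 1:
    stat = Path("/proc/net/softnet_stat")
    if stat.exists():
        for row in parse_softnet_stat(stat.read_text()):
            if row["dropped"] or row["time_squeeze"]:
                print(row)
```

The counters are cumulative since boot, so sampling twice and comparing the values is more telling than a single reading.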

Changing the qdisc on the device to a lockless one (e.g., noqueue) dramatically lowers the ksoftirqd load. perf top still shows the hot spot as a spinlock:

    25.62%  [kernel]          [k] queued_spin_lock_slowpath
     6.87%  [kernel]          [k] tasklet_action_common.isra.21
     3.28%  [kernel]          [k] _raw_spin_lock
     3.15%  [kernel]          [k] tun_net_xmit

but this time it is the lock for the tun ring:

    25.10%  [kernel.vmlinux]    [k] queued_spin_lock_slowpath
            |
             --25.05%--queued_spin_lock_slowpath
                       |
                        --24.93%--tun_net_xmit
                                  dev_hard_start_xmit
                                  __dev_queue_xmit
                                  do_execute_actions
                                  ovs_execute_actions
                                  ovs_dp_process_packet
                                  ovs_vport_receive

which is a much lighter lock in the sense of the amount of work done with the lock held.

systemd commit e6c253e363dee, released in systemd 217, changed the default qdisc from pfifo_fast (the kernel default) to fq_codel (/usr/lib/sysctl.d/50-default.conf on Ubuntu). As of the v5.8 kernel, fq_codel still takes a lock to enqueue packets, so systems using fq_codel with RSS/RPS hit this lock contention, which affects overall system performance. pfifo_fast is lockless as of v4.16, so on newer kernels the kernel's own default is the better choice.
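
The rule of thumb in the previous paragraph can be written down as a small check. This is a sketch that only encodes the statements made in this post (fq_codel locks on enqueue as of v5.8, pfifo_fast is lockless from v4.16, noqueue is lockless); the function names are hypothetical and any other qdisc is reported as unknown.

```python
import platform
import re
from pathlib import Path
from typing import Optional


def kernel_version(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.8.0-rc1'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def takes_qdisc_lock(qdisc: str, release: str) -> Optional[bool]:
    """True if the qdisc takes the qdisc lock on enqueue, None if unknown."""
    if qdisc == "noqueue":
        return False
    if qdisc == "pfifo_fast":
        return kernel_version(release) < (4, 16)
    if qdisc == "fq_codel":
        return True  # per the post, still the case as of v5.8
    return None


if __name__ == "__main__":
    sysctl = Path("/proc/sys/net/core/default_qdisc")
    if sysctl.exists():
        default = sysctl.read_text().strip()
        print(default, takes_qdisc_lock(default, platform.release()))
```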

But this raises the question: why have a qdisc for a VM tap device (or a container's veth device) at all? To the VM, a host is just part of the network. You would not want a top-of-rack switch to buffer packets for the server, so why have the host buffer packets for a VM? (The “Tx” path for a tap device represents packets going to the VM.)

You can change the default via:

sysctl -w net.core.default_qdisc=noqueue

or add that to a sysctl file (e.g., /etc/sysctl.d/90-local.conf). sysctl changes affect new devices only.

Alternatively, the default can be changed for selected devices via a udev rule:

cat > /etc/udev/rules.d/90-tap.rules <<'EOF'
ACTION=="add|change", SUBSYSTEM=="net", KERNEL=="tap*", PROGRAM="/sbin/tc qdisc add dev $env{INTERFACE} root handle 1000: noqueue"
EOF

Running sudo udevadm trigger should update existing devices. Check using tc qdisc sh dev <NAME>:

$ tc qdisc sh dev tapext4798884
qdisc noqueue 1000: root refcnt 2
qdisc ingress ffff: parent ffff:fff1 ----------------
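
With many tap devices, the same check can be scripted. The sketch below is only an illustration: root_qdisc and show_qdisc are hypothetical helper names, and the parser assumes the text format shown above, where the root qdisc line reads "qdisc <kind> <handle> root ...".

```python
import subprocess
import sys
from typing import Optional


def root_qdisc(output: str) -> Optional[tuple[str, str]]:
    """Return (kind, handle) of the root qdisc from `tc qdisc show` output."""
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "qdisc" and fields[3] == "root":
            return fields[1], fields[2]
    return None  # no root qdisc line found


def show_qdisc(dev: str) -> str:
    """Run `tc qdisc show dev <dev>` and return its text output."""
    result = subprocess.run(
        ["tc", "qdisc", "show", "dev", dev],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 check_qdisc.py tapext4798884 [more devices...]
    for dev in sys.argv[1:]:
        print(dev, root_qdisc(show_qdisc(dev)))
```

For the device above this would report the noqueue qdisc with handle 1000:.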

[1] https://www.kernel.org/doc/Documentation/networking/scaling.txt

 