metan's blog

Tales from kernel testing

We have reached an important milestone with the latest LTP release – the number of testcases written with the new test library finally outnumbers the number of old library tests. That is a nice opportunity for a small celebration, and also a chance to look back a bit into history and try to summarize what has happened in LTP over the last 10 years.

I joined LTP development a bit more than 10 years ago, in 2009. At that point we were really struggling with the basics. The build system was a collection of random Makefiles and the build often failed for rather random reasons. There were pieces of shell code embedded in Makefiles, for instance to check for devel libraries, manually written shell loops over directories that prevented parallel builds, and all kinds of ugly mess like that. This changed at the end of 2009 when the build system was rewritten; with that LTP gained proper parallel builds, started to use autoconf for feature checks, etc. We also switched from CVS to GIT at the end of 2009, which was a huge improvement as well.

However that was only a start. LTP was easier to build and git was nicer to use, but we still had tests that were mostly failing and a fair amount of tests that produced nothing but noise. There were also tests that didn't produce real results and always passed, but it's really hard to count these unless you review the code carefully one testcase at a time, which is part of what we still do even after ten years of work.

From that point on it took us a few years to clear out the worst parts and deal with most of the troublemakers, and the results from LTP gradually got greener and more stable as well. We are far from being bugless, there are still parts covered in dust that are waiting for attention, but we are getting there. For instance in this release we finally got a nice cgroup test library that simplifies cgroup testcases, and we should fix the rest of the cgroup tests, ideally before the next release. I'm also quite happy that the manpower put into LTP development is slowly increasing, however compared to the effort put into kernel development the situation is still dire. Back then I used to tell people that the amount of work put into Linux automated testing was a bad joke. These days it's much better, but still hardly optimal, as we struggle to keep up with covering newly introduced kernel features.

At the start I mentioned the new test library, so I should explain how we came to it and why it's superior to what we had previously. First of all there was a test library in LTP that could be traced back to SGI and was released under the GPL more than 20 years ago; it's probably even older than that. The main problem with that library was that it was cumbersome to use. There were API reporting functions, but they were not thread safe, nor could they be used in child processes. In those two cases you had to propagate test results manually, which was prone to errors. Even worse, since the test implemented the main() function, you had to return the overall result manually as well, and forgetting to do so was one of the common mistakes.

At the point where most of the broken tests were finally fixed I had a bit of time to invest into the future, and after seven years of dealing with common test mistakes I had a pretty good picture of what a test library should look like and what should be avoided. Hence I sat down and designed a library that is nice and fun to use and makes tests much easier to write. This library still evolves over time; the version introduced in 2016 wasn't as nice as it is now, but even then it included the most important bits, for instance thread safe and automatic test result propagation, or synchronization primitives that can be used even to synchronize shell code against a C binary.

The old library is still present in LTP since we are only a bit more than halfway done with converting the tests, which is no easy task given that we still have more than 600 tests to go. And as we convert the tests we also review them to ensure that the assertions are correct and the coverage isn't lacking. We still find tests that fail to report results from time to time even now, which only shows how hard it is to eliminate mistakes like this and why preventing them in the first place is the right thing to do. If things go well, the rest of the tests should be converted in about 5 years and LTP should finally be free of this historical baggage. At that point I guess I will throw a small celebration, since that would conclude a huge task I've been working on for a decade now.

FOSDEM Testing and Automation CfP

We are organizing the Testing and Automation devroom at FOSDEM again this year, details at: https://fosdem-testingautomation.github.io/

My colleague wrote a bit more in-depth article and also recorded a talk on the fuzzy sync library. If you have read my introduction on the topic and enjoyed it, you may want to have a look.

What's a race anyways?

Most readers, for sure, do know what a race condition is. Let me however include a short description just for the sake of completeness. A race condition, in terms of computer programming, is a bug where two pieces of code cause an error if executed concurrently.

As a kernel QA I'm mostly interested in writing testcases that can reproduce already fixed races in kernel code, in order to avoid regressions and also to make sure that all code streams, such as stable kernels, are bug free.

The main problem with these tests is that they are notoriously unreliable. Especially when the race window is very small, just a few instructions in length, it's nearly impossible to write a test that triggers the problem reasonably reliably.

Naive approach to races

Race reproducer tests are usually implemented so that two threads run two different loops, each with a different piece of code, in the hope that the race will trigger. In the case of the kernel this usually means two threads each calling a different syscall in a loop. The problem with this approach is that we depend on system jitter in order to hit the race window. And as you may know, computers are mostly deterministic, hence we are strongly depending on luck in this case.
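
To make this concrete, here is a minimal sketch of what such a naive reproducer typically looks like; it is not taken from LTP, and the file, syscalls and iteration count are made up for illustration. One thread hammers read() while the other keeps closing and reopening the same file descriptor, and we simply hope the scheduler interleaves the two loops inside the race window.

#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>

static int fd;
static volatile int stop;

/* Thread B: keeps closing and reopening the file descriptor. */
static void *closer(void *unused)
{
        while (!stop) {
                close(fd);
                fd = open("/dev/null", O_RDONLY);
        }
        return unused;
}

int main(void)
{
        char buf[64];
        pthread_t thread;
        int i;

        fd = open("/dev/null", O_RDONLY);
        pthread_create(&thread, NULL, closer, NULL);

        /* Thread A: hammers the racing syscall and hopes for the best. */
        for (i = 0; i < 1000000; i++)
                read(fd, buf, sizeof(buf));

        stop = 1;
        pthread_join(thread, NULL);
        return 0;
}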

Fuzzy sync to the rescue

We started to ponder if we could do better while we were converting CVE proofs of concept into LTP testcases. We ended up, after three redesigns and rewrites, with a nice library that greatly improves the likelihood of triggering a race successfully. As a side effect we need far fewer iterations and our testcases finish much faster on race-free systems.

The overall idea is quite simple. First we sample how long it takes for the two pieces of code that cause the race to run. Once that has settled, all we need to do is to synchronize the code sections accordingly. As we do not know which exact parts of the two sections need to be aligned to trigger the race, we synchronize the code sections so that the alignment is random in each iteration. But we also make sure that the two sections overlap, otherwise there is no chance of triggering the race.

The implementation is a bit more complicated though. In the sampling phase we use a moving average and wait for the deviation to settle down so that we have a reasonable approximation of the race sections' duration. Synchronization of the sections depends on atomic increments and spinlocks, and the delay, used for alignment randomization, is introduced by calibrated busy loops.
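
To illustrate the idea, here is a simplified sketch of the alignment logic, not the actual code from the LTP fuzzy sync library; all the names and constants are made up. The sampling uses an exponential moving average, the random offset is picked so that the two sections still overlap, and the offset is burned in a calibrated busy loop.

#include <stdlib.h>

/* Exponential moving average used while sampling section durations. */
static float moving_avg(float avg, float sample, float alpha)
{
        return alpha * sample + (1 - alpha) * avg;
}

/* Calibrated busy loop; loops_per_ns is measured once at startup. */
static void delay_ns(long ns, long loops_per_ns)
{
        volatile long i;

        for (i = 0; i < ns * loops_per_ns; i++)
                ;
}

/*
 * Picks a random offset for the start of section A relative to the
 * start of section B from the range (-dur_a, dur_b), which is exactly
 * the range for which the two sections still overlap. A negative value
 * means that thread B is delayed instead of thread A.
 */
static long random_offset_ns(long dur_a, long dur_b)
{
        return rand() % (dur_a + dur_b) - dur_a;
}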

Problem solved?

No, not completely. The real world problem is a bit more complex than that. It's true that the fuzzy sync library applies well to many race reproducers, but some cases do not work at all. One of the problems we found is that the syscall duration may vary considerably depending on the alignment of the second piece of the racing code.

Consider for example recvmsg() racing with close(). If we align these two syscalls in a way that the file descriptor is closed at the beginning of recvmsg(), the syscall quickly returns EBADF, which has zero chance of hitting the race. This was the case for CVE-2016-7117, so we had to introduce a function to bias the offset of the code sections in order to finish the sampling phase successfully in this case.

Real world example

A piece of code that attempts to reproduce d90a10e2444b “fsnotify: Fix fsnotify_mark_connector race” follows.

static struct tst_fzsync_pair fzsync_pair;
static int fd;

static void *write_seek(void *unused)
{
        char buf[64];

        while (tst_fzsync_run_b(&fzsync_pair)) {
                tst_fzsync_start_race_b(&fzsync_pair);
                SAFE_WRITE(0, fd, buf, sizeof(buf));
                SAFE_LSEEK(fd, 0, SEEK_SET);
                tst_fzsync_end_race_b(&fzsync_pair);
        }
        return unused;
}

static void setup(void)
{
        fd = SAFE_OPEN(FNAME, O_CREAT | O_RDWR, 0600);
        tst_fzsync_pair_init(&fzsync_pair);
}

static void cleanup(void)
{
        if (fd > 0)
                SAFE_CLOSE(fd);

        tst_fzsync_pair_cleanup(&fzsync_pair);
}

static void verify_inotify(void)
{
        int inotify_fd;
        int wd;

        inotify_fd = SAFE_MYINOTIFY_INIT1(0);

        tst_fzsync_pair_reset(&fzsync_pair, write_seek);
        while (tst_fzsync_run_a(&fzsync_pair)) {
                wd = SAFE_MYINOTIFY_ADD_WATCH(inotify_fd, FNAME, IN_MODIFY);

                tst_fzsync_start_race_a(&fzsync_pair);
                wd = myinotify_rm_watch(inotify_fd, wd);
                tst_fzsync_end_race_a(&fzsync_pair);
                if (wd < 0)
                        tst_brk(TBROK | TERRNO, "inotify_rm_watch() failed.");
        }
        SAFE_CLOSE(inotify_fd);
        /* We survived for given time - test succeeded */
        tst_res(TPASS, "kernel survived inotify beating");
}

static struct tst_test test = {
        .needs_tmpdir = 1,
        .setup = setup,
        .cleanup = cleanup,
        .test_all = verify_inotify,
};

As you can see the initialization and exit are handled in the setup() and cleanup() functions.

The syscalls that are racing here are enclosed between tst_fzsync_start_race_*() and tst_fzsync_end_race_*().

The tst_fzsync_pair_reset() function clears the counters and also starts a thread that is racing against the main thread.

And that's about it, all the complexity is neatly packed in the fzsync library.

When we were designing the new LTP test library we ended up with something that could be called a test driver model. There are plenty of reasons why tests should have a well defined structure and why a declarative style for test requirements and dependencies is better than conditions buried deep in a chain of function calls. Not only does that simplify the test code and help the author focus on the actual test code, this arrangement also makes it possible to export the test metadata into a machine parsable format, which I've been writing about in my previous posts.

So what does the LTP driver model look like? First of all, LTP tests consist of three functions. There is a setup(), which is called once at the start of the test, a test() function which may be called repeatedly, and a cleanup() that may be called asynchronously. Asynchronously means that the test library will call cleanup() before the test exits, which may be triggered prematurely by a fatal failure in the middle of the test run.

Apart from these functions there are plenty of bit flags, strings, and arrays that describe what the test requires and what should be prepared before these functions are called. There are plenty of basic bit flags such as needs_root, needs_tmpdir, and so on. Then there are a few more advanced flags that can prepare a block device prior to the test and clean it up automatically, or run the test() function for all filesystems supported by the kernel.

Basic LTP test

#include "tst_test.h"

static void test(void)
{
        tst_res(TPASS, "Test passed");
}

static struct tst_test test = {
        .test_all = test,
};

This is the most basic LTP test that succeeds. There are a few default test parameters, such as -i, which allows the test to be executed repeatedly, so the test function has to be able to cope with that. The actual test also runs in a forked process, while the test library process watches for timeouts and cleans up once the test is done. However all of that is invisible to the test itself.

Tests that run on all supported filesystems

#include "tst_test.h"

#define MNTPOINT "mntpoint"

static void test(void)
{
        /* Actual test is done here and the device is mounted at MNTPOINT */
}

static void setup(void)
{
        tst_res(TINFO, "Running test on %s device %s filesystem",  
                tst_device.dev, tst_device.fs_type);
}

static struct tst_test test = {
        .setup = setup,
        .test_all = test,
        .mount_device = 1,
        .all_filesystems = 1,
        .mntpoint = MNTPOINT,
};

This is a bit more complex example of the test library. We ask for the test to be executed for each supported filesystem. Supported means that the kernel is able to mount it and mkfs is installed so that the device can be formatted. The test library also supports FUSE, which needs special handling since we need kernel support for fuse and a user-space binary that implements the filesystem.

Once the test is started, the test library compiles a list of supported filesystems and creates a unique test temporary directory, because .needs_tmpdir is implied by the .mount_device flag. For each filesystem the device is formatted and mounted at .mntpoint, which is relative to the test temporary directory. After that a new process, which will run the test, is forked. The main library process then starts a timeout timer and waits. The child runs the setup() function, if set, which is usually used to populate the newly formatted device with files and so on. Then finally the test() function is executed, possibly repeatedly. Once the test is done the child runs cleanup(), if set. But since the device is mounted by the test library and cleaned up automatically, as is the temporary directory, there is nothing to do. Also, if the child that runs the test crashes, the device is unmounted regardless.

You may ask where we got the device for the test in the first place. Most of the time this is a loopback device, created by the test library automatically at the start of the test and released once it finishes.

This sums up a very shallow look at the test library. There is much more implemented in there, for example we have a flag for restoring the system wall clock after a test, support for parsing the kernel .config, and many more, but I guess I should stop now because the blog post is already longer than I wanted it to be.

First of all, what is result propagation and what is wrong with it? Result propagation happens when a test does a function call and the test result depends on the return value, or when a test executes a sub-process and the result depends on its exit value. Sometimes the propagation is quite simple, but more often the chain is complicated and the code is prone to errors. I've seen quite a few testcases that were failing but whose results were being ignored because the failure was lost in propagation.

Naturally I wanted to avoid these problems when designing the LTP test library. The main requirements for the solution were:

  • Keep it as simple as possible
  • No need to propagate results even from processes started by exec()
  • Thread safe

In the end the solution was quite simple: the functions that report test results in LTP tests use atomic increments on counters stored in a piece of shared memory.

When an LTP test starts, the library allocates a page of shared memory backed by a unique file on tmpfs, and the path to it is stored in an environment variable. That also means you can use this interface from basically any programming language, including shell, since all you need is the environment variable and a small C helper that increments the counters.
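
As a rough illustration, and not the actual LTP implementation, the mechanism can be sketched like this; the environment variable name, the structure layout and the helper names are assumptions made for the example.

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* Layout of the shared page; the real structure differs. */
struct results {
        int passed;
        int failed;
        int broken;
};

static struct results *results;

static void results_init(void)
{
        /* The path to the tmpfs-backed file is inherited via the environment. */
        const char *path = getenv("LTP_IPC_PATH");
        int fd;

        if (!path) {
                fprintf(stderr, "shared memory path not set\n");
                exit(1);
        }

        fd = open(path, O_RDWR);
        results = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);
}

static void report_pass(void)
{
        /* Atomic increment is safe from any thread or exec()ed child. */
        __atomic_fetch_add(&results->passed, 1, __ATOMIC_SEQ_CST);
}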

The shared page of memory can also be used for synchronization. Once we have the page in place in all tests, there is plenty of room to be used by futexes, which is what the checkpoint synchronization primitives in LTP are based on. And again, since the path to the shared memory is available even to processes started by exec(), we can synchronize shell parts of the tests against C code, which I think is a pretty cool feature.
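
A checkpoint on top of that page can be built from a futex word; the following is a rough sketch of the principle rather than the actual LTP code, with the shared integer assumed to live somewhere in the mapping described above.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(int *uaddr, int op, int val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Sleep until someone changes *addr away from val and wakes us up. */
static void checkpoint_wait(int *addr, int val)
{
        while (__atomic_load_n(addr, __ATOMIC_SEQ_CST) == val)
                futex(addr, FUTEX_WAIT, val);
}

/* Bump the value and wake up a single waiter. */
static void checkpoint_wake(int *addr)
{
        __atomic_fetch_add(addr, 1, __ATOMIC_SEQ_CST);
        futex(addr, FUTEX_WAKE, 1);
}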

What is wrong with sleep() then?

First of all, this is something I've had to fight off a lot and still have to from time to time. In most cases sleep() has been misused to avoid the need for proper synchronization, which is wrong for at least two reasons.

The first is that it may and will introduce very rare test failures, which means somebody has to spend time looking into them, which is wasted effort. Also I'm pretty sure that nobody likes tests that fail rarely for no good reason. Even more, you cannot run such tests with a background load to ensure that everything works correctly on a busy system, because that would increase the likelihood of a failure.

The second is that this wastes resources and slows down a test run. If you think that adding a sleep to a test is not a big deal, let me put things into perspective. There are about 1600 syscall tests in the Linux Test Project (LTP); if 7.5% of them slept for just one second, we would end up with two minutes of wasted time per testrun. In practice most of the tests I've seen waited for much longer, just to be sure that things work even on slower hardware. With sleeps between 2 and 5 seconds that puts us somewhere between 4 and 10 minutes, which is between 13% and 33% of the syscall runtime on my dated thinkpad, where the run finishes in a bit less than half an hour. It's even worse on newer hardware, because this slowdown will not change no matter how fast your machine is, which is maybe the reason why this was acceptable twenty years ago but isn't now.

When is sleep() acceptable then?

So far in my ten years of test development I've met only a few cases where sleep() in test code was appropriate. Off the top of my head I remember:

  • Filesystem tests for file timestamps, atime, mtime, etc.
  • Timer related tests where we sample timer in a loop
  • alarm() and timer_create() test where we wait for the timer to fire
  • Leap second tests

How to fix the problem?

Unfortunately there is no silver bullet since there are plenty of reasons for a race condition to happen and each class has to be dealt with differently.

Fortunately there are quite a few very common classes that can be dealt with quite easily. So in LTP we wrote a few synchronization primitives and helper functions that can be used by a test, so there is no longer any excuse to use sleep() instead.

The most common case was the need to synchronize between parent and child processes. There are actually two different cases that needed to be solved. The first is a case where the child has to execute a certain piece of code before the parent can continue. For that the LTP library implements checkpoints, with simple wait and wake functions based on futexes on a piece of shared memory set up by the test library. The second case is where the child has to sleep in a syscall before the parent can continue, for which we have a helper that polls /proc/$PID/stat. Also, sometimes tests can be fixed just by adding a waitpid() in the parent, which ensures that the child has finished before the parent continues.
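
As an example of the first case, a parent/child synchronization with the checkpoint API could look roughly like the sketch below. The macro and flag names (TST_CHECKPOINT_WAIT(), TST_CHECKPOINT_WAKE(), .needs_checkpoints, .forks_child) are written from memory, so treat them as assumptions and check the current LTP documentation.

#include <stdlib.h>
#include "tst_test.h"

static void run(void)
{
        pid_t pid = SAFE_FORK();

        if (!pid) {
                /* Child: do whatever the parent depends on... */
                TST_CHECKPOINT_WAKE(0);
                /* ...then finish. */
                exit(0);
        }

        /* Parent: blocks until the child reaches the checkpoint. */
        TST_CHECKPOINT_WAIT(0);

        /* Now it's safe to run the actual assertions. */
        tst_res(TPASS, "child and parent were synchronized");

        SAFE_WAITPID(pid, NULL, 0);
}

static struct tst_test test = {
        .test_all = run,
        .forks_child = 1,
        .needs_checkpoints = 1,
};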

There are other, even more complex cases where a particular action is done asynchronously, or a kernel resource deallocation is deferred to a later time. In such cases quite often the best we can do is to poll. In LTP we ended up with a macro that polls by calling a piece of code in a loop, with exponentially increasing sleeps between retries. That means that instead of sleeping for the maximal time the event could possibly take, the sleep is capped at twice the optimal sleeping time, while we also avoid polling too aggressively.
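
A generic version of such a macro, not the one from the LTP library, could look something like this; the name and interface are made up for the example.

#include <unistd.h>

/*
 * Evaluates expr in a loop until it is nonzero, sleeping 1, 2, 4, ...
 * milliseconds between retries up to max_ms. Evaluates to 1 on success
 * and 0 if the condition never became true.
 */
#define POLL_EXP_BACKOFF(expr, max_ms) \
({ \
        unsigned int delay_ms = 1; \
        int ret; \
        while (!(ret = !!(expr)) && delay_ms <= (max_ms)) { \
                usleep(delay_ms * 1000); \
                delay_ms *= 2; \
        } \
        ret; \
})

A test would then replace a fixed sleep() with something like POLL_EXP_BACKOFF(resource_is_gone(), 1000), where resource_is_gone() stands for whatever check the test actually needs, and report a failure if the macro returns 0.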

We are trying to resurrect the automated testing track at FOSDEM along with Anders and Matt from Linaro. Now that will not happen, and does not even make sense to happen, without you. I can talk and drink beer in the evening with Matt and Anders just fine, we do not need a devroom for that.

We also have a few people prepared to give talks on various topics, but not nearly enough to fill and justify a room for a day. So if you are interested in attending, or even better giving a talk or driving a discussion, please let us know.

The problem

More than ten years ago even consumer grade hardware started to have two and more CPU cores. These days you can easily buy a PC with eight cores even in local computer shops. The hardware has evolved quite fast during the last decade, but the software that does kernel testing hasn't. These days we still run LTP testcases sequentially, even on beefy servers with 64+ cores and terabytes of RAM. That means that a syscalls testrun takes 30+ minutes while it could be reduced to less than 10 minutes, which would significantly shorten the feedback loop for any continuous integration.

The naive solution

The naive solution is obviously to run $NCPU+1 tests in parallel. But there is a catch, some of the tests utilize global system resources or state, and running two such tests in parallel would lead to false negatives. If you think this situation is rare you are mistaken, there are plenty of tests that need a block device, change the system wall clock, sample kernel timers in a loop, play with SysV IPC, networking, etc. In short we would end up with many hard to reproduce race conditions and mostly useless results.

The proper solution

The proper solution would require:

  1. Make tests declare which resources they utilize
  2. Export this data to the testrunner
  3. Make the testrunner use that information when running tests

Happily the first part is mostly done for LTP, we have switched to the new test library I wrote a few years ago. The new library defines a “driver model” for a testcase. At this time about half of the tests are converted to the new library and the progress is steady.

In the new library the resources a test needs are listed in a declarative way as a C structure, so for most cases we already have this information in place. If you want to know more you can have a look at our documentation.
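
As a rough illustration of what a scheduler would have to work with (the exact flag names should be checked against the documentation, and verify_foo() is a made up placeholder), a test declaring its resources could look like this:

#include "tst_test.h"

static void verify_foo(void)
{
        /* Hypothetical test body that needs a block device and sets the clock. */
        tst_res(TPASS, "placeholder");
}

static struct tst_test test = {
        .test_all = verify_foo,
        .needs_root = 1,
        .needs_device = 1,
        .restore_wallclock = 1,
};

A testrunner can tell from the structure alone that this test touches the system wall clock and needs a block device, so it only has to serialize it against tests that declare the same resources and can run everything else in parallel.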

At this point I'm trying to solve the second part, making the data available to the testrunner, which mostly works now. Once this part is finished, the missing piece will be writing a scheduler that takes this information into account.

Unfortunately the way LTP runs tests is also dated; there is an old runltp script which I would call an ugly hack at best. Adding features such as parallel test runs to that code is close to impossible. Which is the reason I've started to work on a new LTP testrunner, which I guess could be the subject for a second blog post.