metan's blog

Tales from kernel testing

The new LTP release will include changes that introduce the concept of a test maximal runtime, so let me briefly explain what exactly that is. To begin with, let's make an observation about LTP test duration. Most LTP tests fall into one of two categories. The first type of test is fast, generally finishing within a second or two and most of the time in a fraction of that. These tests mostly prepare a simple environment, call a syscall or two, clean up and are done. The second type of test runs for longer and its duration is usually counted in minutes. These include I/O stress tests, various regression tests that loop in order to hit a race, timer precision tests that have to sample time intervals, and so on.

Historically in LTP the test duration was limited by a single value called timeout, which defaulted to a compromise of 5 minutes. That is the worst value for both classes of tests: it's clearly too long for short running tests and at the same time too short for a significant fraction of the long running ones. This was clear just from checking the tests that actually adjusted the default timeout. Quite a few short running tests that were prone to deadlocks decreased the default timeout to a much shorter interval, while quite a few long running tests increased it.

But back to how test duration was handled in the long running tests. The duration of a long running test is usually bounded by a time limit as well as a limit on the number of iterations, and the test exits on whichever is hit first. In order to exit before the timeout, these tests watched the elapsed runtime and left the main loop once the runtime got close enough to the test timeout. The problem was that "close enough" was loosely defined and implemented differently in each test. That obviously leads to different problems. For instance, if a test looped until there were 10 seconds left to the timeout and the cleanup took more than 10 seconds on slower hardware, there was no way to avoid triggering the timeout, which resulted in a test failure. If the test timeout was increased, the test simply ran for longer and hit the timeout at the end either way. At the same time, if a test reserved a proportion of the timeout for the cleanup, things didn't work out when the timeout was scaled down in order to shorten the test duration.

After careful analysis it became clear that the test duration has to be bound by two distinct values. The new values are now called timeout and max_runtime, and the test duration is bound by the sum of the two. The idea behind this should be clear to the reader at this point. The max_runtime limits the active part of the test, that is the part where the actual test loop is executed, while the timeout covers the test setup and cleanup and all inaccuracies in the accounting. Each of them can be scaled separately, which gives us enough flexibility to scale from small embedded boards all the way up to supercomputers. This change also allowed us to change the default test timeout to 30 seconds. And if you are asking yourself how max_runtime is set for short running tests, the answer is simple: it's set to zero, since the default timeout is more than enough to cope with these.
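
To give an idea how this looks from the test author's point of view, here is a minimal sketch of a long running test (assuming the .max_runtime field and a tst_remaining_runtime() helper that reports how much of the runtime budget is left; the details may differ slightly from the actual release):

#include "tst_test.h"

static void run(void)
{
        /*
         * Loop until the runtime budget is exhausted; the library accounts
         * the elapsed time for us.
         */
        while (tst_remaining_runtime()) {
                /* one iteration of the stress or regression loop goes here */
        }

        tst_res(TPASS, "survived the whole runtime");
}

static struct tst_test test = {
        .test_all = run,
        /* the active part of the test is bound by 60 seconds */
        .max_runtime = 60,
};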

All of this also helps to kill misbehaving tests much faster, since we now have a much better estimate of the expected test duration. And yes, this is a big deal when you are running thousands of testcases; it can speed up the testrun quite significantly even with just a few deadlocked tests.

But things do not end here, there is a bit of added complexity on top of this. Some testcases will call the main test loop more than once. That is because we have a few "multiplier" flags that can increase test coverage quite a bit. For instance we have the .all_filesystems flag which, when set, executes the test on top of the most commonly used filesystems. There is also a flag that runs the test for different variants, which is sometimes used to run the same test for more than one syscall variant, e.g. for clock_gettime() we run the same test for both the syscall and the VDSO, as sketched below. All these multipliers have to be taken into account when the overall test duration is computed. However, we now have all these flags in the metadata file, hence we are getting really close to a state where we will have a tool that can compute an accurate upper bound on the duration of a given test. But that is a completely different story for a different short article.
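
For illustration, a variants based test roughly looks like this from the test author's point of view (a minimal sketch assuming the .test_variants field and the tst_variant variable; the real clock_gettime() tests in the tree are more involved):

#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#include "tst_test.h"

static void run(void)
{
        struct timespec ts;
        int ret;

        /*
         * tst_variant tells the test which variant is currently being run:
         * variant 0 calls the raw syscall, variant 1 goes through the libc
         * wrapper, which typically ends up in the VDSO.
         */
        if (tst_variant == 0)
                ret = syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
        else
                ret = clock_gettime(CLOCK_MONOTONIC, &ts);

        if (ret)
                tst_res(TFAIL | TERRNO, "clock_gettime() failed");
        else
                tst_res(TPASS, "clock_gettime() works");
}

static struct tst_test test = {
        .test_all = run,
        /* the whole test is executed once per variant */
        .test_variants = 2,
};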

Unfortunately FOSDEM is going to be virtual again this year, but that does not stop us from organizing the testing and automation devroom. Have a look at our CfP and if you have something interesting to present go ahead and fill in a submission!

As usual we had an LTP release at the end of September. What was unusual though is the number of patches that went in: we got 483 patches, which is about 150 more than in each of the last three releases. And the number of patches had been slowly growing even before that.

While it's great and I'm happy that the project is growing, there is a catch: growth like this puts additional strain on the maintainers, particularly on the patch reviewers. For me it was 120+ patches reviewed during the four month period, and that only counts the final versions of patches that were accepted into the repository; it's not unusual to have three or more revisions before the work is ready to be merged.

While I managed to cope with it reasonably well, the work that I had on my TODO list for the project stalled. One of the things I finally want to move forward is making runltp-ng the official LTP test runner, but there is much more. So the obvious question is how to make things better, and one of the answers we came up with was automation.

What we implemented for LTP is 'make check', which runs different tools on the test source code and is supposed to be used before a patch is sent for review. For C code we use the well known checkpatch.pl and a custom sparse based checker to identify the most common problems. The tooling is set up automatically when you call 'make check' for the first time and we tried to make it as effortless as possible, so that there is no reason not to use it during development. We also use checkbashism.pl for shell code, and hopefully the number of checks will grow over time. With luck this should eliminate on average at least one revision per patchset, which would amount to hundreds of patches during our development cycle.

Ideally this will fix the problem for a while and we will make more effective use of our resources, but eventually we will get to a point where more maintainers and reviewers are needed, which is a problem that is hard to solve without your help.

We have reached an important milestone with the latest LTP release: the number of testcases written with the new test library finally outnumbers the tests still using the old library. This is a nice opportunity for a small celebration, and also to look back into history a bit and try to summarize what has happened in LTP over the last 10 years.

I joined LTP development a bit more than 10 years ago, in 2009. At that point we were really struggling with the basics. The build system was a collection of random Makefiles and the build often failed for very random reasons. There were pieces of shell code embedded in Makefiles, for instance to check for devel libraries, manually written shell loops over directories that prevented parallel builds, and all kinds of ugly mess like that. This changed at the end of 2009 when the build system was rewritten; with that LTP supported proper parallel builds, started to use autoconf for feature checks, etc. We also switched from CVS to Git at the end of 2009, which was a huge improvement as well.

However that was only a start. LTP was easier to build and Git was nicer to use, but we still had tests that were mostly failing and a fair amount of tests that produced nothing but noise. There were also tests that didn't produce real results and always passed, but these are really hard to count unless you review the code carefully one testcase at a time, which is part of what we still do even after ten years of work.

From that point on it took us a few years to clear the worst parts and deal with most of the troublemakers, and the results from LTP were gradually getting greener and more stable. We are far from being bugless, there are still parts covered in dust that are waiting for attention, but we are getting there. For instance, in this release we finally got a nice cgroup test library that simplifies cgroup testcases, and we should fix the rest of the cgroup tests ideally before the next one. Also I'm quite happy that the manpower put into LTP development is slowly increasing, however compared to the effort put into kernel development the situation is still dire. Back then I used to tell people that the amount of work put into automated Linux testing was a bad joke. These days it's much better, but still hardly optimal, as we struggle to keep up with covering newly introduced kernel features.

At the start I mentioned the new test library, so I should explain how we came to it and why it's superior to what we had previously. First of all, there was a test library in LTP that could be traced back to SGI and was released under the GPL more than 20 years ago; it's probably even older than that. The main problem with that library was that it was cumbersome to use. There were some API reporting functions, but they were not thread safe, nor could they be used in child processes. In these two cases you had to propagate test results manually, which was prone to errors. Even worse, since the test implemented the main() function, you had to return the overall result manually as well, and forgetting to do so was one of the common mistakes. At the point where most of the broken tests were finally fixed I had a bit of time to invest into the future, and after seven years of dealing with common test mistakes I had a pretty good picture of what a test library should look like and what should be avoided. Hence I sat down and designed a library that is nice and fun to use and makes tests much easier to write. The library still evolves over time; the version introduced in 2016 wasn't as nice as it is now, but even then it included the most important bits, for instance thread safe and automatic test result propagation, or synchronization primitives that can be used even to synchronize shell code against a C binary.

The old library is still present in LTP since we are only a bit more than halfway done converting the tests, which is no easy task as we still have more than 600 tests to go. And as we convert the tests we also review them to ensure that the assertions are correct and the coverage isn't lacking. We still find tests that fail to report results from time to time even now, which only shows how hard it is to eliminate mistakes like this and why preventing them in the first place is the right thing to do. If things go well, the rest of the tests should be converted in about 5 years and LTP should finally be free of this historical baggage. At that point I guess I will throw a small celebration, since that would conclude a huge task I've been working on for a decade.

FOSDEM Testing and Automation CfP

We are organizing the Testing and Automation devroom at FOSDEM again this year, details at: https://fosdem-testingautomation.github.io/

My colleague wrote a more in-depth article and also recorded a talk on the fuzzy sync library. If you have read my introduction to the topic and enjoyed it, you may want to have a look.

What's a race anyways?

Most readers, for sure, know what a race condition is. Let me however include a short description just for the sake of completeness. A race condition, in terms of computer programming, is a bug where two pieces of code cause an error if executed concurrently.

As a kernel QA I'm mostly interested in writing testcases that can reproduce races that have already been fixed in the kernel code, in order to avoid regressions and also to make sure all code streams, such as stable kernels, are free of the bug.

The main problem with these tests is that they are notoriously unreliable. Especially when the race window is very small, just a few instructions in length, it's nearly impossible to write a test that can reasonably reliably trigger the problem.

Naive approach to races

Race reproducer tests are usually implemented so that two threads run two different loops, each with a different piece of code, in the hope that the race will trigger. In the case of the kernel this usually means two threads each calling a different syscall in a loop. The problem with this approach is that we depend on system jitter in order to hit the race window. And as you may know, computers are mostly deterministic, hence we are strongly depending on luck in this case.
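
In code, the naive approach boils down to something like the following sketch (a generic illustration, not an actual LTP test; the syscalls are just placeholders):

/*
 * Naive race reproducer: two threads hammer two different syscalls on the
 * same file descriptor and hope that system jitter eventually aligns them
 * inside the race window.
 */
#include <pthread.h>
#include <unistd.h>
#include <fcntl.h>

#define LOOPS 1000000

static int fd;

static void *loop_b(void *unused)
{
        int i;

        for (i = 0; i < LOOPS; i++)
                lseek(fd, 0, SEEK_SET);         /* second half of the race */

        return unused;
}

int main(void)
{
        pthread_t thread;
        char buf[64];
        int i;

        fd = open("/dev/zero", O_RDONLY);
        pthread_create(&thread, NULL, loop_b, NULL);

        for (i = 0; i < LOOPS; i++)
                read(fd, buf, sizeof(buf));     /* first half of the race */

        pthread_join(thread, NULL);
        close(fd);
        return 0;
}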

Fuzzy sync to the rescue

We started to ponder if we could do better while we were converting CVE proof of concept reproducers into LTP testcases. After three redesigns and rewrites we ended up with a nice library that can greatly improve the likelihood of triggering a race successfully. As a side effect we need far fewer iterations and our testcases finish much faster on race-free systems.

The overall idea is quite simple. First we sample how long it takes to run the two pieces of code that cause the race. Once that is settled, all we need to do is to synchronize the code sections accordingly. As we do not know which exact parts of the two sections need to be aligned to trigger the race, we synchronize the code sections so that the alignment is random in each iteration. But we also make sure that the two sections overlap, otherwise there is no chance to trigger the race.

The implementation is a bit more complicated though. In the sampling phase we use a moving average and wait for the deviation to settle down, so that we have a reasonable approximation of the duration of the racing sections. Synchronization of the sections relies on atomic increments and spinlocks, and the delay, used to randomize the alignment, is introduced by calibrated busy loops.

Problem solved?

No, not completely. The real world problem is a bit more complex than that. It's true that the fuzzy sync library applies well to many race reproducers, but some cases do not work at all. One of the problems we found is that the syscall duration may vary considerably depending on the alignment of the second piece of the racing code.

Consider for example recvmsg() racing with close(). If we align these two syscalls in a way that the file descriptor is closed at the beginning of recvmsg(), the syscall quickly returns EBADF, which has zero chance of hitting the race. This was the case for CVE-2016-7117, so we had to introduce a function to bias the offset of the code sections in order to finish the sampling phase successfully in this case.

Real world example

A piece of code that attempts to reproduce kernel commit d90a10e2444b ("fsnotify: Fix fsnotify_mark_connector race") follows.

#include <sys/inotify.h>

#include "tst_test.h"
#include "tst_fuzzy_sync.h"
/*
 * The myinotify_*()/SAFE_MYINOTIFY_*() wrappers below come from a small
 * helper header shipped alongside LTP's inotify testcases.
 */
#include "inotify.h"

/* test file created inside the test's temporary directory */
#define FNAME "fname"

static struct tst_fzsync_pair fzsync_pair;
static int fd;

static void *write_seek(void *unused)
{
        char buf[64];

        while (tst_fzsync_run_b(&fzsync_pair)) {
                tst_fzsync_start_race_b(&fzsync_pair);
                SAFE_WRITE(0, fd, buf, sizeof(buf));
                SAFE_LSEEK(fd, 0, SEEK_SET);
                tst_fzsync_end_race_b(&fzsync_pair);
        }
        return unused;
}

static void setup(void)
{
        fd = SAFE_OPEN(FNAME, O_CREAT | O_RDWR, 0600);
        tst_fzsync_pair_init(&fzsync_pair);
}

static void cleanup(void)
{
        if (fd > 0)
                SAFE_CLOSE(fd);

        tst_fzsync_pair_cleanup(&fzsync_pair);
}

static void verify_inotify(void)
{
        int inotify_fd;
        int wd;

        inotify_fd = SAFE_MYINOTIFY_INIT1(0);

        tst_fzsync_pair_reset(&fzsync_pair, write_seek);
        while (tst_fzsync_run_a(&fzsync_pair)) {
                wd = SAFE_MYINOTIFY_ADD_WATCH(inotify_fd, FNAME, IN_MODIFY);

                tst_fzsync_start_race_a(&fzsync_pair);
                wd = myinotify_rm_watch(inotify_fd, wd);
                tst_fzsync_end_race_a(&fzsync_pair);
                if (wd < 0)
                        tst_brk(TBROK | TERRNO, "inotify_rm_watch() failed.");
        }
        SAFE_CLOSE(inotify_fd);
        /* We survived for given time - test succeeded */
        tst_res(TPASS, "kernel survived inotify beating");
}

static struct tst_test test = {
        .needs_tmpdir = 1,
        .setup = setup,
        .cleanup = cleanup,
        .test_all = verify_inotify,
};

As you can see the initialization and exit are handled in the setup() and cleanup() functions.

The syscalls that are racing here are enclosed between tst_fzsync_start_race_*() and tst_fzsync_end_race_*().

The tst_fzsync_pair_reset() function clears the counters and also starts a thread that races against the main thread.

And that's about it, all the complexity is neatly packed in the fzsync library.

When we were designing the new LTP test library we ended up with something that could be called a test driver model. There are plenty of reasons why tests should have a well defined structure and why a declarative style for test requirements and dependencies is better than conditions buried deep in a chain of function calls. Not only does this simplify the test code and help the author to focus on the actual test, this arrangement also makes it possible to export the test metadata into a machine parsable format, which I've been writing about in my previous posts.

So what does the LTP driver model look like? First of all, LTP tests consist of three functions. There is setup(), which is called once at the start of the test, a test() function which may be called repeatedly, and a cleanup() that may be called asynchronously. Asynchronously means that the test library will call cleanup() before the test exits, which may be triggered prematurely by a fatal failure in the middle of the test run.

Apart from these functions there are plenty of bit flags, strings, and arrays that describe what the test requires and what should be prepared before these functions are called. There are basic bit flags such as needs_root, needs_tmpdir, and so on. Then there are a bit more advanced flags that can prepare a block device prior to the test and clean it up automatically, or run the test() function for all filesystems supported by the kernel.

Basic LTP test

#include "tst_test.h"

static void test(void)
{
        tst_res(TPASS, "Test passed");
}

static struct tst_test test = {
        .test_all = test,
};

This is the most basic LTP test that succeeds. There are a few default test parameters, such as -i, which allows the test to be executed repeatedly, so the test function has to be able to cope with that. The actual test also runs in a forked process, while the test library process watches for timeouts and cleans up once the test is done. However, all of that is invisible to the test itself.

Tests that run on all supported filesystems

#include "tst_test.h"

#define MNTPOINT "mntpoint"

static void test(void)
{
        /* Actual test is done here and the device is mounted at MNTPOINT */
}

static void setup(void)
{
        tst_res(TINFO, "Running test on %s device %s filesystem",
                tst_device->dev, tst_device->fs_type);
}

static struct tst_test test = {
        .setup = setup,
        .test_all = test,
        .mount_device = 1,
        .all_filesystems = 1,
        .mntpoint = MNTPOINT,
};

This is a bit more complex example of the test library usage. We ask for the test to be executed for each supported filesystem. Supported means that the kernel is able to mount it and mkfs is installed so that the device can be formatted. The test library also supports FUSE, which needs special handling since we need kernel support for FUSE and a user-space binary that implements the filesystem.

Once the test is started the test library compiles a list of supported filesystems and creates a unique test temporary directory, because .needs_tmpdir is implied by the .mount_device flag. For each filesystem the device is formatted and mounted at .mntpoint, which is relative to the test temporary directory. After that a new process, which will run the test, is forked. The main library process then starts a timeout timer and waits. The child runs the setup() function, if set, which is usually used to populate the newly formatted device with files and so on. Then finally the test() function is executed, possibly repeatedly. Once the test is done the child runs cleanup(), if set. But since the device is mounted by the test library and cleaned up automatically, as is the temporary directory, there is nothing to do here. Also, if the child that runs the test crashes, the device is unmounted regardless.

You may ask where we got the device for the test in the first place. Most of the time this is a loopback device, created by the test library automatically at the start of the test and released once it finishes.

This sums up a very shallow look at the test library. There is much more implemented in there, for example a flag for restoring the system wall clock after a test, support for parsing the kernel .config, and many more, but I guess that I should stop now because the blog post is already longer than I wanted it to be.

First of all, what is result propagation and what is wrong with it? Result propagation happens when a test does a function call and the test result depends on the return value, or when a test executes a sub-process and the result depends on its exit value. Sometimes the propagation is quite simple, but more often the chain is complicated and the code is prone to errors. I've seen quite a few testcases that were failing while the test results were being ignored, because the failure was lost somewhere in the propagation. A sketch of the problem follows.
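
To make the problem concrete, here is a small sketch of the fragile, return value based propagation (not code from any actual test):

/*
 * Return value based result propagation: every level of the call chain has
 * to remember to check and pass the failure up. One forgotten check or
 * return and a failing test reports success.
 */
#include <stdio.h>

static int check_value(int value)
{
        if (value != 42) {
                fprintf(stderr, "FAIL: got %i\n", value);
                return 1;               /* the failure starts here... */
        }

        return 0;
}

static int run_subtest(void)
{
        return check_value(13);         /* ...has to be passed along... */
}

int main(void)
{
        return run_subtest();           /* ...and turned into the exit code */
}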

Naturally I wanted to avoid these problems when designing the LTP test library. The main requirements for the solution were:

  • Keep it as simple as possible
  • No need to propagate results even from processes started by exec()
  • Thread safe

In the end the solution was quite simple: the functions that report test results in LTP tests use atomic increments on counters stored in a piece of shared memory.

When an LTP test starts, the library allocates a page of shared memory, backed by a unique file on tmpfs, and stores the path to it in an environment variable. This also means that you can use this interface from basically any programming language, including shell, since all you need is the environment variable and a small C helper that increments the counters.
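
The idea can be sketched roughly like this (a simplified illustration rather than the actual LTP implementation; the environment variable name and the counter layout here are assumptions):

/*
 * Simplified illustration of shared result counters. The LTP_IPC_PATH
 * variable name and the struct layout are assumptions made for the sketch.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

struct results {
        int passed;
        int failed;
        int skipped;
};

static struct results *results;

static void results_setup(void)
{
        /* the test library exports the path to the tmpfs backed file */
        const char *path = getenv("LTP_IPC_PATH");
        int fd;

        if (!path) {
                fprintf(stderr, "result counters not set up\n");
                exit(1);
        }

        fd = open(path, O_RDWR);
        if (fd < 0) {
                perror("open");
                exit(1);
        }

        results = mmap(NULL, sizeof(*results), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
}

static void report_pass(void)
{
        /* atomic increment is safe from any thread or (grand)child process */
        __atomic_add_fetch(&results->passed, 1, __ATOMIC_SEQ_CST);
}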

The shared page of memory can also be used for synchronization. Once we have the page in place in all tests there is plenty of room to be used by futexes, which is what the checkpoint synchronization primitives in LTP are based on. And again, since the path to the shared memory is available even to processes started by exec(), we can synchronize shell parts of the tests against C code, which I think is a pretty cool feature.

What is wrong with sleep() then?

First of all, this is something I had to fight off a lot and still have to from time to time. In most cases sleep() has been misused to avoid the need for proper synchronization, which is wrong for at least two reasons.

The first is that it may and will introduce very rare test failures, which means somebody has to spend time looking into them, which is wasted effort. Also I'm pretty sure that nobody likes tests that fail rarely for no good reason. Even more so, you cannot run such tests with a background load to ensure that everything works correctly on a busy system, because that would increase the likelihood of a failure.

The second is that this wastes resources and slows down a test run. If you think that adding a sleep to a test is not a big deal, let me put things into perspective. There are about 1600 syscall tests in the Linux Test Project (LTP); if 7.5% of them slept for just one second, we would end up with two minutes of wasted time per testrun. In practice most of the tests I've seen waited for much longer, just to be sure that things work even on slower hardware. With sleeps between 2 and 5 seconds that puts us somewhere between 4 and 10 minutes, which is between 13% and 33% of the syscall runtime on my dated thinkpad, where the run finishes in a bit less than half an hour. It's even worse on newer hardware, because this slowdown does not change no matter how fast your machine is, which is maybe the reason why this was acceptable twenty years ago but is not now.

When is sleep() acceptable then?

So far, in my ten years of test development, I have met only a few cases where sleep() in test code was appropriate. Off the top of my head I remember:

  • Filesystem tests for file timestamps, atime, mtime, etc.
  • Timer related tests where we sample timer in a loop
  • alarm() and timer_create() test where we wait for the timer to fire
  • Leap second tests

How to fix the problem?

Unfortunately there is no silver bullet since there are plenty of reasons for a race condition to happen and each class has to be dealt with differently.

Fortunately there are quite a few very common classes that could be dealt with quite easily. So in LTP we wrote a few synchronization primitives and helper functions that could be used by a test, so there is no longer any excuse to use sleep() instead.

The most common case was the need to synchronize between parent and child processes. There are actually two different cases that needed to be solved. The first is a case where the child has to execute a certain piece of code before the parent can continue. For that the LTP library implements checkpoints, with simple wait and wake functions based on futexes on a piece of shared memory set up by the test library, as sketched below. The second case is where the child has to sleep in a syscall before the parent can continue, for which we have a helper that polls /proc/$PID/stat. Also, sometimes tests can be fixed just by adding a waitpid() in the parent, which ensures that the child finishes before the parent runs.
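
The first case roughly looks like this from the test's point of view (a minimal sketch, assuming the TST_CHECKPOINT_WAIT()/TST_CHECKPOINT_WAKE() macros and the .needs_checkpoints flag; only the synchronization is shown, the actual tested functionality is left out):

#include <stdlib.h>

#include "tst_test.h"

static void run(void)
{
        pid_t pid = SAFE_FORK();

        if (!pid) {
                /* child: do the work the parent depends on... */
                TST_CHECKPOINT_WAKE(0);         /* ...then let the parent go */
                exit(0);
        }

        /* parent: blocks on a futex until the child hits the checkpoint */
        TST_CHECKPOINT_WAIT(0);

        /* whatever depends on the child's work is safe to do from here on */

        SAFE_WAITPID(pid, NULL, 0);
        tst_res(TPASS, "parent continued only after the child was ready");
}

static struct tst_test test = {
        .test_all = run,
        .forks_child = 1,
        .needs_checkpoints = 1,
};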

There are other, even more complex cases where a particular action is done asynchronously, or a kernel resource deallocation is deferred to a later time. In such cases quite often the best we can do is to poll. In LTP we ended up with a macro that polls by calling a piece of code in a loop with exponentially increasing sleeps between retries. This means that, instead of sleeping for the maximal time the event can possibly take, the time spent sleeping is capped at roughly twice the optimal sleeping time, while we still avoid polling too aggressively.
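
The pattern behind the macro boils down to something like this (a generic sketch rather than the actual LTP implementation; the function and parameter names are made up):

/*
 * Poll with exponential backoff: retry the check with exponentially growing
 * sleeps, so the total time spent sleeping is at most about twice the time
 * it actually takes for the event to happen.
 */
#include <unistd.h>

static int wait_for_condition(int (*cond)(void), unsigned int max_delay_us)
{
        unsigned int delay_us = 1;

        for (;;) {
                if (cond())
                        return 0;       /* the event we waited for happened */

                if (delay_us > max_delay_us)
                        return -1;      /* give up, report a failure */

                usleep(delay_us);
                delay_us *= 2;          /* exponential backoff */
        }
}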