How to trigger races reliably

May 5, 2020

What's a race anyways?

Most of the readers, for sure, do know what a race condition is. Let me however include short description just for the sake of completeness. Race condition, in terms of a computer programming, is a bug where two pieces of code cause an error if executed concurrently.

As a kernel QA I'm mostly interested in writing testcases that can reproduce once fixed races in kernel code in order to avoid regressions and also to make sure all code streams, such as stable kernels, are bug free.

The main problem with these tests is that they are notoriously unreliable. Especially when the race window is very small, just a few instructions in length, it's nearly impossible to write a test that can reasonably reliably trigger the problem.

Naive approach to races

Race reproducer tests are usually implemented so that two threads runs two different loops each with a different piece of code in a hope that the race would trigger. In a case of a kernel this usually means two threads each calling a different syscall in a loop. The problem with this approach is that we are depending on system jitter in order to hit the race window. And as you may know computers are mostly deterministic, hence we are strongly depending on luck in this case.

Fuzzy sync to the rescue

We started to ponder if we can do better while we were converting a CVE proof of concepts into a LTP testcases. We ended up, after three redesigns and rewrites, with a nice library that can greatly improve the likelihood of triggering a race successfully. As a side effect we need much less iterations and our testcases finish much faster on race-free systems.

The overall idea is quite simple. First we sample how long it takes for the two pieces of code that cause the race to run. Once that is settled all we need to do is to synchronize the code sections accordingly. As we do not know which exact parts of the two sections needs to be aligned to trigger the race, we synchronize the code sections so that the alignment is random in each iteration. But we also make sure that the two sections overlap, otherwise there is no chance to trigger the race.

The implementation is a bit more complicated though. In the sampling phase we are using moving average and we wait for the deviation to settle down so that we have reasonable approximation of the race sections duration. Synchronization of the sections depends on atomic increments and spinlocks and the delay, used for alignment randomization, is introduced by a calibrated busy loops.

Problem solved?

No, not completely. The real world problem is a bit more complex than that. It's true that fuzzy sync library applies well to many race reproducers but some cases does not work at all. One of the problems we found is that the syscall duration may vary considerably depending on the alignment of the second piece of the racing code.

Consider for example recvmsg() racing with close(). If we align these two syscalls in a way that the file descriptor would be closed at the beginning of recvmsg() the syscall will quickly return EBADF which has zero chance of hitting the race. This was the case for CVE-2016-7117 so had to introduce a function to bias the offset of the code sections in order to finish sampling phase successfully in this case.

Real world example

Piece of code that attempts to reproduce d90a10e2444b “fsnotify: Fix fsnotify_mark_connector race” follows.

static struct tst_fzsync_pair fzsync_pair;
static int fd;

static void *write_seek(void *unused)
{
        char buf[64];

        while (tst_fzsync_run_b(&fzsync_pair)) {
                tst_fzsync_start_race_b(&fzsync_pair);
                SAFE_WRITE(0, fd, buf, sizeof(buf));
                SAFE_LSEEK(fd, 0, SEEK_SET);
                tst_fzsync_end_race_b(&fzsync_pair);
        }
        return unused;
}

static void setup(void)
{
        fd = SAFE_OPEN(FNAME, O_CREAT | O_RDWR, 0600);
        tst_fzsync_pair_init(&fzsync_pair);
}

static void cleanup(void)
{
        if (fd > 0)
                SAFE_CLOSE(fd);

        tst_fzsync_pair_cleanup(&fzsync_pair);
}

static void verify_inotify(void)
{
        int inotify_fd;
        int wd;

        inotify_fd = SAFE_MYINOTIFY_INIT1(0);

        tst_fzsync_pair_reset(&fzsync_pair, write_seek);
        while (tst_fzsync_run_a(&fzsync_pair)) {
                wd = SAFE_MYINOTIFY_ADD_WATCH(inotify_fd, FNAME, IN_MODIFY);

                tst_fzsync_start_race_a(&fzsync_pair);
                wd = myinotify_rm_watch(inotify_fd, wd);
                tst_fzsync_end_race_a(&fzsync_pair);
                if (wd < 0)
                        tst_brk(TBROK | TERRNO, "inotify_rm_watch() failed.");
        }
        SAFE_CLOSE(inotify_fd);
        /* We survived for given time - test succeeded */
        tst_res(TPASS, "kernel survived inotify beating");
}

static struct tst_test test = {
        .needs_tmpdir = 1,
        .setup = setup,
        .cleanup = cleanup,
        .test_all = verify_inotify,
};

As you can see the initialization and exit are handled in the setup() and cleanup() functions.

The syscalls that are racing here are enclosed between tst_fzsync_start_race_*() and tst_fzsync_end_race_*().

The tst_fzync_pair_reset() functions clears the counters and also starts a thread that is racing against the main thread.

And that's about it, all the complexity is neatly packed in the fzsync library.