metan's blog

Tales from kernel testing

Along with Anders and Matt from Linaro, we are trying to resurrect the automated testing track at FOSDEM. That will not happen, and does not even make sense, without you: I can talk and drink beer with Matt and Anders in the evening just fine; we do not need a devroom for that.

We also have a few people prepared to give talks on various topics, but not nearly enough to fill and justify a room for a whole day. So if you are interested in attending, or better yet in giving a talk or driving a discussion, please let us know.

The problem

More than ten years ago, even consumer-grade hardware started to ship with two or more CPU cores. These days you can easily buy an eight-core PC in a local computer shop. Hardware has evolved quickly over the last decade, but the software that does kernel testing has not. We still run LTP testcases sequentially, even on beefy servers with 64+ cores and terabytes of RAM. That means a syscalls test run takes 30+ minutes when it could finish in less than 10, which would significantly shorten the feedback loop for any continuous integration.

The naive solution

The naive solution is obviously to run $NCPU+1 tests in parallel. But there is a catch: some of the tests utilize global system resources or state, and running two such tests in parallel leads to false negatives. If you think this situation is rare, you are mistaken; there are plenty of tests that need a block device, change the system wall clock, sample kernel timers in a loop, play with SysV IPC, networking, and so on. In short, we would end up with many hard-to-reproduce race conditions and mostly useless results.

The proper solution

The proper solution would require:

  1. Make tests declare which resources they utilize
  2. Export this data to the testrunner
  3. Make the testrunner use that information when running tests

Happily, the first part is mostly done for LTP: we have switched to the new test library I wrote a few years ago, which defines a “driver model” for a testcase. At this point about half of the tests have been converted to the new library, and the progress is steady.

In the new library the resources a test needs are listed declaratively, as a C structure, so in most cases we already have this information in place. If you want to know more, have a look at our documentation.

At this point I'm working on the second part, making the data available to the testrunner, which mostly works already. Once this is finished, the missing piece will be a scheduler that takes the information into account.

Unfortunately the way LTP runs tests is also dated: there is an old runltp script, which I would call an ugly hack at best, and adding a feature such as parallel test runs to that code is close to impossible. That is why I've started working on a new LTP testrunner, which I guess could be the subject of a second blog post.