Modeling Condition Variables using Formal Methods

Condition variables can seldom be used in isolation and depend on proper usage from the users of it. This article is a gentle introduction to condition variable usage using TLA+ / PlusCal to formally model it: A parent and child process waiting on each other. During this, I actually discovered a bug in my understanding, causing a deadlock which I will share in the end.

Note that you may need to refer to PlusCal and TLA+ documentation. There's a ton of it written by people who know more than I. My interest is just as a practical user and this article does not explain PlusCal much. To be honest it is quite readable any way.

First we start with a simple C program which we'll model later, a parent waiting on its child after spawning it.

 volatile int done = 0;
 void *child(void *arg) {
     done = 1;
     return NULL;
 }

 int main(int argc, char *argv[]) {
     printf("parent: begin\n");
     pthread_t c;
     pthread_create(&c, NULL, child, NULL); // create child
     while (done == 0); // spin
     printf("parent: end\n");
     return 0;
}

This trivial program works but wastes a lot of CPU due to the spin loop, especially, say if the child runs for a long time. But it is useful to write a formal model which we'll use as the basis for more advanced design.

Following is the PlusCal program to verify this:

(*--fair algorithm ThreadJoinSpin

variables
    \* Spin variable used by child to signal to parent that it finished running.
    done = 0,
    
    \* Simulate the parent forking the child.
    childwait = 1,

    \* The below variables are for checking invariants.
    ChildAboutToExit = 0,
    ParentDone = 0;

define
    \* The invariant that has to hold true always.
    ExitChildBeforeParent == (ParentDone = 0) \/ ((ParentDone = 1) /\ (ChildAboutToExit = 1))
end define;

procedure thr_exit() begin
texit: done := 1;
ret2: return;
end procedure;

procedure thr_join() begin
check: await done = 1;
ret3: return;
end procedure;

process childproc \in {1} begin
c0:   await childwait = 0;
    \* child does something for a long time.
c1: ChildAboutToExit := 1;
c2: call thr_exit();
end process;

process parent \in {2} begin
c3: childwait := 0;
c4: call thr_join();
c5: ParentDone := 1;
end process;

There are 2 things that I make the model verify, 1. That the ExitChildBeforeParent invariant is always satisified. This invariant confirms that under no circumstance will the parent process terminate before the child does. More precisely, the parent's join returns only after the child calls thr_exit(). 2. That the program terminates without going into a deadlock.

Both of these are true with the trivial program. This gives us a foundation to replace thr_exit() and thr_join() with condition variables.

Following is a first attempt, I am only adding the “new changes” to the above program model to keep the article short.

Here is the first attempt at C program using CVs, that we will try to model check:

void thr_exit() {
    done = 1;
    Pthread_cond_signal(&cv);
 }

 void thr_join() {
    if (done == 0)
        Pthread_cond_wait(&cv);
 }

And the PlusCal model is as follows. We add a new set waitSet to keep track of all the waiters on the CV. Along with 2 new functions cvwait and cvsignal to emulate pthread_cond_wait() and pthread_cond_signal() APIs:

variables
    waitSet = {};

procedure cvwait(p) begin
    c1: waitSet := waitSet \cup {p};
    c2: await p \notin waitSet;
    c4: return;
end procedure;

procedure cvsignal() begin
    \* This if cond is needed because otherwise
    \* the with statement waits forever if waitset is empty.
    c5:
    if waitSet = {} then
        c7: return;
    end if;
    
    \* Non deterministically pick something to wake up.
    c8: with x \in waitSet do
        waitSet := waitSet \ {x};
        end with;
    c9: return;  
end procedure;

procedure thr_exit() begin
    c11:  call cvsignal();
    c13:  return;
end procedure;

procedure thr_join(p) begin
    c20: if done = 0 then
    c21:    call cvwait(p);
         end if;
    c23: return;
end procedure;

It is notable how we model wait/wake by just adding and removing the process number from a set. The TLC model checker tells us this program deadlocks, why? Because thr_exit() can be run first and signal the CV. But there's nothing waiting for it. The parent then calls thr_wait() and waits forever. Replacing the if with while also does not help. The problem is we need the child to signal only once there's someone waiting.

Seems like we need some synchronization to make sure the signal/wait don't race with each other, let us add a lock.

void thr_exit() {
    pthread_mutex_lock(&m);
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
}

 void thr_join() {
    pthread_mutex_lock(&m);
    pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
 }

To implement the lock in PlusCal, we can just do a simple test-and-set lock:

procedure lock() begin
  cas:  if mutex = 0 then
        mutex := 1; return;
        else
        goto cas;
        end if;
end procedure;    

procedure unlock() begin
  unlock_it:  mutex := 0;
  ret:        return;
end procedure;

Anything under a label in PlusCal is atomic and executed as one unit. This helps us to model a Compare-And-Swap operation pretty nicely.

We also need to modify our CV signal/wait functions to be callable under a lock. In particular, we cannot wait on the CV with the lock held as we would then deadlock.

procedure cvwait(p) begin
    c0: call unlock();
    c1: waitSet := waitSet \cup {p};
    c2: await p \notin waitSet;
    c3: call lock();
    c4: return;
end procedure;

procedure cvsignal() begin
    \* This if cond is needed because otherwise
    \* the with statement waits forever if waitset is empty.
    c5:
    if waitSet = {} then
        c7: return;
    end if;
    
    \* Non deterministically pick something to wake up.
    c8: with x \in waitSet do
        waitSet := waitSet \ {x};
        end with;
    c9: return;  
end procedure;

Now armed with these modeled primitives, let us use them in our join/exit functions:

procedure thr_exit() begin
    c10:  call lock();
    c11:  call cvsignal();
    c12:  call unlock();
    c13:  return;
end procedure;

procedure thr_join(p) begin
    c20: call lock();
    c21: call cvwait(p);
    c22: call unlock();
    c23: return;
end procedure;

The model checker again complaints of a deadlock! Though the signaling and wait cannot be happen at the same time, there is still the original problem of the parent running much later than the child even though exclusively, and then waiting on the CV forever. The parent should not wait if no waiting is needed. Looks like we need both the done state variable and the locking.

Lets try to model this, the C program we will model uses both locking and CVs.

 void thr_exit() {
        pthread_mutex_lock(&m);
        done = 1;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
 }

 void thr_join() {
    pthread_mutex_lock(&m);
    while (done == 0) {
           pthread_cond_wait(&c, &m);
     }
    pthread_mutex_unlock(&m);
 }

The PlusCal program now becomes:

procedure thr_exit() begin
    c9:   call lock();
    c10:  done := 1;
    c11:  call cvsignal();
    c12:  call unlock();
    c13:  return;
end procedure;

procedure thr_join(p) begin
    c19: call lock();
    c20: while done = 0 do
    c21:    call cvwait(p);
         end while;
    c22: call unlock();
    c23: return;
end procedure;

However, turns out the model checker fails even for this! After some staring, I discovered there is a bug in my condition variable implementation itself. The cvwait procedures needs a subtle change.

The change is that, the process waiting (in this case the parent) has to be added to the waitqueue while the lock is being held. A slight reordering of c0 and c1 lines fix the cvwait(). With this the model now passes, by satisfying the invariant while not locking up:

procedure cvwait(p) begin
    c1: waitSet := waitSet \cup {p};
    c0: call unlock();
    c2: await p \notin waitSet;
    c3: call lock();
    c4: return;
end procedure;

Model checking can be a powerful tool to clarify understanding of simple primitives, that can have subtle bugs. While it has its limitations, quick model checking can identify and rule out bugs before they show up in the wild and clarify your understanding.

As an exercise for the reader, replace the while in c20 with an if statement. You should see it fail as well.

Model checking also shows that in thread_exit(), we only need to hold the lock when setting the done variable, though care must be taken to make sure that the waitSet can be concurrently queued and dequeued into. Otherwise holding the lock may still be needed. Assuming that queue and dequeue are atomic, thread_exit() can rewritten as:

procedure thr_exit() begin
    c9:   call lock();
    c10:  done := 1;
    c12:  call unlock();
    c11:  call cvsignal();
    c13:  return;
end procedure;

Note: A few examples of the C code were borrowed from the Operating Systems: Three Easy Pieces text book. I am grateful to them.