Required reading: Fast mutual exclusion for uniprocessors
We revisit the topic of mutual-exclusion coordination: the techniques to protect variables that are shared among multiple threads. These techniques allow programmers to implement atomic sections so that one thread can safely update the shared variables without having to worry about another thread intervening.
We focus today on uniprocessors. Even on a uniprocessor one has to provide mutual-exclusion coordination, because the scheduler might schedule another thread in response to a hardware interrupt (e.g., end of a time slice or a page fault).
The technique for mutual-exclusion coordination used in v6 is disabling/enabling interrupts in the kernel, through spl() calls. In v6, the only program with multiple threads is the kernel, so it is the only program that needs a mutual-exclusion mechanism. User-level processes don't share data directly.
As we discussed in the last lecture, microkernels force server programs to deal with many of the same issues as monolithic kernels, and they therefore require concurrent handling of events. This organizational requirement raises the question "How do we handle multiple events concurrently at user level?"
Solutions 1 and 3 aren't good enough for multiprocessors.
To implement any of these options for a user-level thread package, we need to avoid blocking system calls. The kernel must support asynchronous system calls or scheduler activations.
Can we use a thread package that uses the v6 kernel approach? That is, can we allow a user-level thread to disable and reenable interrupts? (Answer: no, not if we care about fault isolation.)
List and insert example:
    struct List {
        int data;
        struct List *next;
    };

    List *list = 0;

    insert(int data) {
        List *l = new List;
        l->data = data;
        l->next = list;   // A
        list = l;         // B
    }

    fn() {
        insert(100);
    }

    main() {
        thread_create(..., fn);
        thread_create(..., fn);
        thread_schedule();
    }
What needs to be atomic? The two statements labeled A and B should always be executed together, as an indivisible fragment of code. If two threads execute A and B interleaved, then we end up with an incorrect list. To see that this is the case, draw out the list after the sequence A1 (statement A executed by thread 1), A2 (statement A executed by thread 2), B2, and B1.
How could this erroneous sequence happen? If threads 1 and 2 are running on a multiprocessor, the variable list lives in physical memory shared among multiple processors, connected by a bus. The accesses to the shared memory will be ordered in some total order by the bus/memory system. If the programmer doesn't coordinate the execution of statements A and B, any order can happen, including the erroneous one.
How else could it happen? Thread 1's time slice runs out after executing A, and the kernel switches to thread 2, which then executes A and B. Then thread 2 is done, and the kernel may schedule thread 1 again, which executes B. This sequence of events produces the erroneous order.
We need to extend the thread package so that the programmer can express that A and B should be executed as a single atomic unit. We generally use a concept like locks to mark an atomic region, acquiring the lock at the beginning of the section and releasing it at the end:
    void acquire(int *lock) {
        while (TSL(lock) != 0)
            ;
    }

    void release(int *lock) {
        *lock = 0;
    }
Acquire and release, of course, need to be atomic too, which can, for example, be done with a hardware atomic TSL (test-and-set-lock) instruction:
The semantics of TSL are:
    R <- [mem]   // load the contents of mem into register R
    [mem] <- 1   // store 1 in mem
In a hardware implementation, the bus arbiter guarantees that both the load and the store are executed without any other loads/stores coming in between.
We can use locks to implement an atomic insert, or we can use TSL directly:
    int insert_lock = 0;

    insert(int data) {
        /* acquire the lock: */
        while (TSL(&insert_lock) != 0)
            ;
        /* critical section: */
        List *l = new List;
        l->data = data;
        l->next = list;
        list = l;
        /* release the lock: */
        insert_lock = 0;
    }
The paper addresses how to provide mutual-exclusion primitives that are completely implemented in software. The motivation is that some uniprocessors don't provide hardware-level atomic instructions. It turns out, however, that the approach advocated in the paper is in general a good one, because software implementations are often more efficient than hardware implementations.
The paper describes several ways to implement test-and-set in software:
    int flag[N];

    int TSL(int *L) {
        while (true) {
            flag[me] = true;
            if (!is_flagged(me)) {
                R = *L;
                *L = 1;
                flag[me] = false;
                return R;
            } else {
                flag[me] = false;
            }
        }
    }

    boolean is_flagged(me) {
        for (i = 0; i < N; i++) {
            if ((i != me) && flag[i])
                return true;
        }
        return false;
    }
This version doesn't guarantee progress, though. (That's why the version in the paper is more complex.)
Using RASs (restartable atomic sequences), we can implement an atomic TSL, and from that we can implement locks, acquire, and release, and make any sequence of instructions indivisible. A RAS is a short code sequence registered with the kernel; if a thread is preempted while executing inside the sequence, the kernel restarts the thread at the beginning of the sequence rather than where it left off.
    insert(int data) {
        List *l = new List;
        l->data = data;
        BEGIN_RAS
        l->next = list;
        list = l;
        END_RAS
    }

In fact, this implementation of insert is strictly better than the one using TSL. The TSL insert is blocking: if T1 is pre-empted while executing insert and control is passed to T2, T2 cannot execute insert without waiting for T1. The RAS version doesn't block T2. Such versions are called wait free.
    void xchg_RAS(int *p1, int *p2) {
        BEGIN_RAS
        int tmp = *p1;
        *p1 = *p2;
        *p2 = tmp;
        END_RAS
    }

Find a sequence of events such that xchg_RAS doesn't work. Assume:
a = 1; b = 2; xchg_RAS (&a, &b);
    int cmpxchg(addr, v1, v2) {
        int ret = 0;
        // stop all memory activity and ignore interrupts
        if (*addr == v1) {
            *addr = v2;
            ret = 1;
        }
        // resume other memory activity and take interrupts
        return ret;
    }

    insert(int data) {
        List *n = new List;
        n->data = data;
        do {
            n->next = list;
        } while (cmpxchg(&list, n->next, n) == 0);
    }