Required reading: Fast mutual exclusion for uniprocessors
We revisit the topic of mutual-exclusion coordination: the techniques to protect variables that are shared among multiple threads. These techniques allow programmers to implement atomic sections so that one thread can safely update the shared variables without having to worry about another thread intervening.
We focus today on uniprocessors. Even on a uniprocessor one has to provide mutual-exclusion coordination, because the scheduler might schedule another thread in response to a hardware interrupt (e.g., end of a time slice or a page fault).
The technique for mutual-exclusion coordination used in v6 is disabling/enabling interrupts in the kernel, through spl() calls. In v6, the only program with multiple threads is the kernel, so it is the only program that needs a mutual-exclusion mechanism. User-level processes don't share data directly.
As we discussed in the last lecture, microkernels force server programs to deal with many of the same issues as monolithic kernels, and they therefore require concurrent handling of events. This organizational requirement raises the question "How do we handle multiple events concurrently at user level?"
Solutions 1 and 3 aren't good enough for multiprocessors.
To implement any of these options for a user-level thread package, we need to avoid blocking system calls. The kernel must support asynchronous system calls or scheduler activations.
Can we use a thread package that uses the v6 kernel approach? That is, can we allow a user-level thread to disable and reenable interrupts? (Answer: no, not if we care about fault isolation.)
List and insert example:
    struct List {
        int data;
        struct List *next;
    };

    List *list = 0;

    insert(int data) {
        List *l = new List;
        l->data = data;
        l->next = list;   // A
        list = l;         // B
    }

    fn() {
        insert(100);
    }

    main() {
        thread_create(..., fn);
        thread_create(..., fn);
        thread_schedule();
    }
What needs to be atomic? The two statements labeled A and B should always be executed together, as an indivisible fragment of code. If two threads execute A and B interleaved, then we end up with an incorrect list. To see that this is the case, draw out the list after the sequence A1 (statement A executed by thread 1), A2 (statement A executed by thread 2), B2, and B1.
How could this erroneous sequence happen? If threads 1 and 2 are running on a multiprocessor, the variable list lives in physical memory shared among multiple processors, connected by a bus. The accesses to the shared memory will be ordered in some total order by the bus/memory system. If the programmer doesn't coordinate the execution of statements A and B, any order can happen, including the erroneous one.
How else could it happen? Thread 1's time slice runs out after executing A, and the kernel switches to thread 2, which then executes A and B. Then thread 2 is done, and the kernel may schedule thread 1 again, which executes B. This sequence of events produces the erroneous order.
We need to extend the thread package so that the programmer can express that A and B should be executed as a single atomic unit. We generally use a concept like locks to mark an atomic region, acquiring the lock at the beginning of the section and releasing it at the end:
    void acquire(int *lock) {
        while (TSL(lock) != 0)
            ;
    }

    void release(int *lock) {
        *lock = 0;
    }
Acquire and release, of course, need to be atomic too, which can, for example, be done with a hardware atomic TSL (test-and-set-lock) instruction:
The semantics of TSL are:
    R <- [mem]   // load the contents of mem into register R
    [mem] <- 1   // store 1 in mem
In a hardware implementation, the bus arbiter guarantees that both the load and the store are executed without any other loads/stores coming in between.
We can use locks to implement an atomic insert, or we can use TSL directly:
    int insert_lock = 0;

    insert(int data) {
        /* acquire the lock: */
        while (TSL(&insert_lock) != 0)
            ;
        /* critical section: */
        List *l = new List;
        l->data = data;
        l->next = list;
        list = l;
        /* release the lock: */
        insert_lock = 0;
    }
The paper addresses how to provide mutual-exclusion primitives that are completely implemented in software. The motivation is that some uniprocessors don't provide hardware-level atomic instructions. It turns out, however, that the approach advocated in the paper is in general a good one, because software implementations are often more efficient than hardware implementations.
The paper describes several ways to implement test-and-set in software:
    int flag[N];

    int TSL(int *L) {
        while (true) {
            flag[me] = true;
            if (!is_flagged(me)) {
                R = *L;
                *L = 1;
                flag[me] = false;
                return R;
            } else {
                flag[me] = false;
            }
        }
    }

    boolean is_flagged(me) {
        for (i = 0; i < N; i++) {
            if ((i != me) && flag[i])
                return true;
        }
        return false;
    }
This version doesn't guarantee progress, though. (That's why the version in the paper is more complex.)
Using RASs (restartable atomic sequences), we can implement an atomic TSL, and from that we can implement locks, acquire, and release, and make any sequence of instructions indivisible. A RAS is a short code sequence registered with the kernel; if a thread is preempted while executing inside the sequence, the kernel restarts the thread at the beginning of the sequence rather than where it left off.
    insert(int data) {
        List *l = new List;
        l->data = data;
        BEGIN_RAS
        l->next = list;
        list = l;
        END_RAS
    }

In fact, this implementation of insert is strictly better than the one using TSL. The TSL insert is blocking: if T1 is pre-empted while executing insert and control is passed to T2, T2 cannot execute insert without waiting for T1. The RAS version doesn't block T2. Such versions are called wait free.
    void xchg_RAS(int *p1, int *p2) {
        BEGIN_RAS
        int tmp = *p1;
        *p1 = *p2;
        *p2 = tmp;
        END_RAS
    }

Find a sequence of events such that xchg_RAS doesn't work. Assume:
a = 1; b = 2; xchg_RAS (&a, &b);
    int cmpxchg(addr, v1, v2) {
        int ret = 0;
        // stop all memory activity and ignore interrupts
        if (*addr == v1) {
            *addr = v2;
            ret = 1;
        }
        // resume other memory activity and take interrupts
        return ret;
    }

    insert(int data) {
        List *n = new List;
        n->data = data;
        do {
            n->next = list;
        } while (cmpxchg(&list, n->next, n) == 0);
    }