Locking

Required reading: spinlock.c

Why coordinate?

Mutual-exclusion coordination is an important topic in operating systems, because many operating systems run on multiprocessors. Coordination techniques protect variables that are shared among multiple threads and updated concurrently. These techniques allow programmers to implement atomic sections so that one thread can safely update the shared variables without having to worry that another thread might intervene.

Diagram: CPUs, bus, memory.

List and insert example:

typedef struct List {
  int data;
  struct List *next;
} List;

List *list = 0;

void
insert(int data)
{
  List *l = malloc(sizeof(List));
  l->data = data;
  l->next = list;  // A
  list = l;        // B
}

What's required for insert() to be correct if executed on multiple CPUs? The two statements labeled A and B should always be executed together, as an indivisible fragment of code. If two processors execute A and B interleaved, then we end up with an incorrect list. To see that this is the case, draw out the list after the sequence A1 (statement A executed by processor 1), A2 (statement A executed by processor 2), B2, and B1: both new nodes point to the old head of the list, and whichever assignment to list runs last wins, so one of the two insertions is lost.

How could this erroneous sequence happen? The variable list lives in physical memory shared among multiple processors, connected by a bus. The accesses to the shared memory will be interleaved in some total order by the bus/memory system. If the programmer doesn't coordinate the execution of the statements A and B, any interleaving can happen, including erroneous ones.

This kind of error is called a race (the name probably originates in circuit design). The problem with races is that they are difficult to reproduce. For example, if you put print statements in to debug the incorrect behavior, you might change the timing and the race might not happen anymore.

The lock abstraction

The programmer must be able to express that the A and B from any given call to insert() should not be interleaved with the A and B from any other call to insert(). In fact that's probably not general enough. On the one hand, if there were a function delete() for lists, the programmer probably wants A and B not to be interleaved with lines from a delete() on the same list. On the other hand, if there are multiple distinct lists, it's probably OK for their A and B lines to be interleaved. So we're looking for an abstraction that gives the programmer this kind of control.

While there are many coordination abstractions, locking is among the most common. The idea: when you have a data structure on which operations should proceed one at a time (i.e. not be interleaved), associate a lock object L with the data structure and bracket the operations with acquire(L) and release(L).

Lock list_lock;

void
insert(int data)
{
  List *l = malloc(sizeof(List));
  l->data = data;

  acquire(&list_lock);

  l->next = list;  // A
  list = l;        // B

  release(&list_lock);
}
How can we implement acquire() and release()? Here is a version that DOES NOT WORK:
 
typedef struct Lock {
  int locked;
} Lock;

void broken_acquire(Lock *lock) {
  while(1){
    if(lock->locked == 0){ // C
      lock->locked = 1;    // D
      break;
    }
  }
}

void release(Lock *lock) {
  lock->locked = 0;
}
What's the problem? Two acquire()s on the same lock on different CPUs might both execute line C, and then both execute D. Then both will think they have acquired the lock. This is the same kind of race we were trying to eliminate in insert(). But we have made a little progress: now we only need a way to prevent interleaving in one place (acquire()), not in many arbitrary, complex sequences of code (e.g. A and B in insert()).

Atomic instructions

We're in luck: most CPUs provide "atomic instructions" that are equivalent to the combination of lines C and D and prevent interleaving from other CPUs. For example, some CPUs have an atomic TSL (test-and-set-lock) instruction.

The semantics of TSL are:

   R <- [mem]   // load mem into register R
   [mem] <- 1   // store 1 in mem

The CPU cache controller and system bus arbiter guarantee that no other operations on this memory location occur between when TSL reads and when it writes.

We can use TSL to make a correct acquire() and release():

void acquire(Lock *lock){
  while(1){
    if(TSL(&lock->locked) == 0)
      break;
  }
}

void release(Lock *lock){
  lock->locked = 0;
}

Why does this work? If two CPUs execute TSL at the same time, the hardware ensures that one TSL does both its steps before the other one starts. So we can say "the first TSL" and "the second TSL". The first TSL will read a 0 and set lock->locked to 1 and the acquire() will return. The second TSL will then see lock->locked as 1, and the acquire will loop.

This style of lock is called a spin-lock, since acquire() stays in a loop while waiting for the lock.

insert() with these locks is only correct if each CPU carries out memory reads and writes in program order. For example, if the CPU were to execute insert() out of order so that it did the read at A before the acquire(), then insert() would be incorrect even with locks. Modern x86 processors can execute reads out of order! So we may have to use special instructions to tell the CPU not to re-order memory operations past acquire()s and release()s. The compiler may also generate instructions in surprising orders, so we have to worry about that too.
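
One way to address this (a sketch using GCC's __sync_synchronize() builtin; spinlock.c may handle ordering differently) is to put a full memory barrier inside acquire() and release():

void acquire(Lock *lock){
  while(TSL(&lock->locked) != 0)
    ;
  // Full barrier: keep the critical section's loads and stores from
  // being moved before the lock is taken. (GCC builtin; illustrative,
  // not necessarily the xv6 code.)
  __sync_synchronize();
}

void release(Lock *lock){
  // Full barrier: keep the critical section's loads and stores from
  // being moved past the point where the lock is dropped.
  __sync_synchronize();
  lock->locked = 0;
}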

Example: Locking on x86

The x86 doesn't exactly have a TSL instruction, but it turns out that xchg is pretty close.

xchgl %eax, addr does the following:

  1. freeze other CPUs' memory activity for address addr
  2. temp := *addr
  3. *addr := %eax
  4. %eax := temp
  5. un-freeze other memory activity
Here's TSL with xchg:
int
TSL(int *addr)
{
  int content = 1;

  // Atomically exchange content (1) with *addr; afterwards content
  // holds the old value of *addr, and *addr holds 1.
  asm volatile ("xchgl %0, %1" :
                "=r" (content),
                "=m" (*addr) :
                "0" (content),
                "m" (*addr));

  return content;
}
Here's what the compiler produces for the asm statement:
        movl  8(%ebp), %edx  // %edx = addr
        movl  $1, %eax       // %eax = 1
        xchgl %eax, (%edx)

This implementation doesn't scale to a large number of processors; in a later lecture we will see how we could do better.

Lock granularity

This lock abstraction gives the programmer lots of flexibility, since the programmer can associate lock objects with code or data in whatever way seems best. For example, suppose you are implementing a b-tree package. You can have one lock guarding the whole package; or one lock per distinct tree; or a lock per tree element. All of these approaches can lead to correct code. How to decide?

Perhaps the easiest approach is a "coarse-grained" (or "big") lock around as much code and data as possible (in the example, one lock for the whole b-tree package). A big lock requires relatively little thought to achieve correctness, and relatively few code changes. It's probably best to start with big locks and only think about reducing lock granularity after you understand your program's performance.

The potential performance problem with big locks is that they do not allow parallel execution by multiple CPUs inside the locked region. If you use big locks, the one-at-a-time regions may be large. If your program spends a lot of time manipulating b-trees, you may find you get only one CPU's worth of performance with a big lock even if you have multiple CPUs. Perhaps locking at smaller granularity would get higher performance through more concurrency. How to best reduce lock granularity is a bit of an art. In the case of b-trees, it matters whether the CPUs would be operating on different b-trees or on different parts of the same b-tree. In the former case, a lock per b-tree will get the performance you want and is relatively simple. In the latter case you might need a lock per tree element to get concurrency, but you may find that a difficult programming challenge.
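
For instance, here is a sketch of the two simplest choices (the struct fields are illustrative, not from any particular b-tree package):

// Coarse-grained: one lock serializes every operation in the package.
Lock btree_big_lock;

// Finer-grained: each tree carries its own lock, so operations on
// different trees can proceed in parallel.
struct btree {
  Lock lock;               // protects root and everything reachable from it
  struct btree_node *root;
};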

Spinning vs blocking

acquire()/release() is good for short atomic sections: increment a counter, search in i-node cache, allocate a free buffer. In these cases acquire() won't waste much CPU time spinning even if another CPU holds the lock.

Spin locks are not so great for code that has to wait for events that might take a long time. For example, it would not be good for the disk reading code to get a spin-lock on the disk hardware, start a read, wait for the disk read to complete, and then release the lock. The problem is that another process that wanted to read the disk would have to spin waiting for access to the hardware; disk reads take so long (millions of instructions) that it would be better to release the CPU for other work.

xv6 has sleep() and wakeup() primitives that are better for processes that need to wait for I/O.
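
The pattern looks roughly like this (a sketch only; ide_lock, ide_queue, append_request(), and b->done are illustrative names, and the exact xv6 signatures may differ):

// Sketch: wait for disk request b to complete without spinning.
void
ide_rw_sketch(struct buf *b)
{
  acquire(&ide_lock);
  append_request(&ide_queue, b);   // hypothetical helper: queue the request
  while(!b->done)
    sleep(b, &ide_lock);           // give up the CPU; wakeup(b) resumes us
  release(&ide_lock);
}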

Locks and modularity

When designing a system we desire clean abstractions and good modularity. We'd like a caller not to have to know how a callee implements a particular function. We'd like to hide locking behavior in this way -- for example, it would be nice if the b-tree package's locking scheme was its own private business.

It turns out it's hard to use locks in a modular way. Two problems come up that require non-modular global knowledge: deadlock and multi-module invariants.

First, deadlocks. Suppose you are implementing banking software and you want to transfer money from account X to account Y. You need to maintain the invariant that there always appears to be the same amount of total money in the accounts. So you use locks to prevent other threads from seeing an inconsistent situation:

void
transfer(Account *x, Account *y, float amount)
{
  acquire(&x->lock);
  acquire(&y->lock);
  x->balance = x->balance - amount;
  y->balance = y->balance + amount;
  release(&y->lock);
  release(&x->lock);
}
The problem arises if there are simultaneous transfers from account A to B and B to A. One might lock A, the other might then lock B, and now they must both wait for the other to release(). But neither can proceed. This is called a deadlock.

In an operating system the usual solution is to avoid deadlocks by establishing an order on all locks and making sure that every thread acquires its locks in that order. In our example we might order A before B and thus require that the B->A transfer acquire A first, not B. Picking and obeying the order on all locks requires that modules make public their locking behavior, and requires them to know about other modules' locking. This can be painful and error-prone.
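
For example, here is a sketch of an ordered version of transfer() (ordering by address is just one illustrative choice of global order):

void
transfer(Account *x, Account *y, float amount)
{
  // Acquire the two account locks in a fixed global order (here, by
  // address) so concurrent A->B and B->A transfers cannot deadlock.
  Account *first  = (x < y) ? x : y;
  Account *second = (x < y) ? y : x;
  acquire(&first->lock);
  acquire(&second->lock);
  x->balance = x->balance - amount;
  y->balance = y->balance + amount;
  release(&second->lock);
  release(&first->lock);
}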

Second, multi-module invariants. Suppose you'd like to move something from one b-tree to another, but maintain the invariant that the thing always appears to be on exactly one of the trees. You'd like to write code like:

  acquire(&tree1.lock);
  acquire(&tree2.lock);
  btree_delete(&tree1, obj);   // delete from one tree...
  btree_insert(&tree2, obj);   // ...and insert into the other
  release(&tree1.lock);
  release(&tree2.lock);
If the b-tree software hides its locks beneath a module barrier, you can't do this. If the b-tree module exposes its locks, then there has to be a contract between the caller and the b-tree module about who is supposed to acquire what locks. This can get complex and error-prone. xv6 has such contracts, usually documented in comments. For example, if you want to call sched() to release the CPU to another process, you must already hold proc_table_lock.

Fine-grained locks usually make both of these problems worse.

Locking in xv6

xv6 runs on a multiprocessor and allows multiple threads of computation to run concurrently. At any given time there may be multiple processes actively executing in the kernel, each on its own CPU. There might also be interrupt handlers running at the same time as other kernel code. xv6 uses locks to coordinate how these concurrent activities share data structures.

Let's look at the IDE disk driver as an example of xv6's use of locks. The file system kernel code calls ide_rw() to read or write a block of data from/to the disk. ide_rw() should return when the read or write has completed. There may be multiple calls to ide_rw() outstanding because multiple processes may be reading files at the same time. The disk can only perform one operation at a time, so the IDE driver keeps a queue of requests that are waiting their turn.

The IDE driver has just one lock (ide_lock). This one lock helps enforce multiple invariants required for correct operation: for example, that the request queue always has a consistent structure, and that the IDE hardware is handed only one request at a time.

The software (and hardware) for these activities was all designed for one-at-a-time use, which is why it needs locking.

What happens if two processes call ide_rw() at the same time from different CPUs?

ide_intr() deletes a request from ide_queue. What happens if an IDE interrupt occurs when a process is adding a block to the queue?

Why does ide_start_request() want the caller to hold the lock? Why doesn't it just acquire the lock itself?

How do xv6 locks work? Let's look at acquire() in spinlock.c.
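
In outline (a sketch of the key steps, not a verbatim copy of spinlock.c), acquire() first disables interrupts on this CPU and then spins on an atomic exchange:

void
acquire(struct spinlock *lk)
{
  pushcli();                        // disable interrupts to avoid deadlock
                                    // with an interrupt handler on this CPU
  if(holding(lk))
    panic("acquire");               // this CPU already holds the lock

  while(xchg(&lk->locked, 1) != 0)  // atomic test-and-set, like TSL above
    ;
}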

And release():
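
In outline again (a sketch rather than the exact source), it clears the lock word and undoes the interrupt disable:

void
release(struct spinlock *lk)
{
  if(!holding(lk))
    panic("release");
  lk->locked = 0;                   // let some other CPU's xchg succeed
  popcli();                         // undo the matching pushcli()
}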

Locking in JOS

JOS is meant to run on single-CPU machines. It doesn't have to worry about concurrent operations from other CPUs, but it still has to worry about interrupts. JOS takes a simple approach: it turns interrupts off for the entire time it is executing in the kernel. For the most part this means JOS kernel code doesn't have to do anything special in situations where xv6 would use locks.

JOS runs environments in user-mode with interrupts enabled, so at any point a timer interrupt may take the CPU away from an environment and switch to a different environment. This interleaves the two environments' instructions a bit like running them on two CPUs. The library operating system has some data structures in memory that's shared among multiple environments (e.g. pipes), so it needs a way to coordinate access to that data. In JOS we will use special-case solutions, as you will find out in lab 6. For example, to implement pipes we will assume there is one reader and one writer. The reader and writer never update each other's variables; they only read each other's variables. By programming carefully under this rule, we can avoid races.
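
A sketch of the idea (the struct and field names are illustrative, not the lab 6 code):

// One-reader/one-writer buffer: each index has exactly one writer.
struct pipe {
  char buf[512];
  unsigned read;    // advanced only by the reader
  unsigned write;   // advanced only by the writer
};
// The writer checks write - read < 512, stores into buf[write % 512],
// then increments write. The reader checks read != write, loads
// buf[read % 512], then increments read. Since neither side ever
// writes the other's index, the two environments never race on a store.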