"Fast Mutual Exclusion for Uniprocessors", Bershad, Redell, Ellis. Would Example 1 work in Frans' simple threading package? What could go wrong in a more involved environment? The problem: lines E and F must execute atomically. Interrupt or signal. Pre-emption. Multiple CPUs. The program might want to yield during an atomic section. E.G. for memory allocation. How can we provide atomic sequences? Does Example 2 work? Race between the while() and insert_lock = 1. No good against interrupts, pre-emption, multiple CPUs. Turn off interrupts. Side effect is to prevent pre-emption via timer interrupts. Not good in user mode, or multi-processor. Kernel emulation: System call to turn of interrupts around test-and-set. What does it do on a multi-processor? test-and-set-locked Works with interrupts, pre-emption, multi-CPU. Test-and-set-locked instruction: TSL mem, R: R <- [mem] /* load value at address mem */ [mem] <- 1 /* set value at mem to 1 */ Hardware locks the memory system around the two sub-operations. [Picture of CPUs, bus, arbitrator, memory system] Operates at motherboard clock rates, not CPU. Much slower than cached load/store. Prevents other uses of the bus. We can wrap an interface around TSL as in Example 3. Example 3 works badly on a uniprocessor or with long critical sections. So put a thread_yield() in the acquire() while loop. Table of techniques: Technique Time Space Uni? Multi? mask intrs ? O(1) yes no Kern Emul v. slow O(1) yes no (?) TSL fast O(1) yes yes Are locks enough? No: producer/consumer requires some kind of sleep / wakeup. Don't want to busy wait. Look at Example 4. What's wrong with Example 4? Need to keep data structures consistent. Need to avoid race between if(full == 0) sleep(); and full = 1; wakeup(); Want each pair to execute atomically; how about: lock(l); if(full == 0) sleep(); unlock(l); Can't hold the lock while sleep()ing! We want sleep() to hold the lock long enough to mark us as sleeping, then release the lock to the other process. So sleep/wakeup is not orthogonal to locking. Can we come up with a good abstraction? Look at Example 5 and its use in Example 6. Condition variable rules: You must be holding the lock when you call wait(). Otherwise if(full == 0) wait(c, l); has a race. You must be holding the lock when you call signal(). Otherwise if(full == 0) signal(c); has a race. wait() returns with the lock held. Since it's always used in that way. PAPER DISCUSSION What is the problem? Some hardware doesn't have TSL. TSL is often slow. Goal: user-level software emulation of TSL. What's their general approach? Know when the atomic sequence has been interrupted. By pre-emptive context switch. Re-start the atomic sequence. Why wouldn't it work on a multi-processor? You don't generally know the code has been interrupted. Look at the paper's Figure 3. It could be used in place of our TSL() (but result inverted?). What if context switch between lines 5 and 6? What if context switch between lines 6 and 7? (better not restart) What if context switch between lines 4 and 5? (failure?) There's something wrong here: Kernel has to be able to decide precisely if sequence is done. Look at the paper's Figure 4. Is line 4 every executed? (delay slot) What if interrupted before 3? (restart, re-do reads, no writes yet) What if interrupted between 3 and 4? (not possible) What if interrupted after 3/4? (already done) So the kernel can tell precisely if sequence has completed: Has line 3 executed? How general is RAS? They use it for one particular atomic sequence. 
PAPER DISCUSSION

What is the problem?
  Some hardware doesn't have TSL.
  TSL is often slow.
  Goal: user-level software emulation of TSL.

What's their general approach?
  Know when the atomic sequence has been interrupted
    by a pre-emptive context switch.
  Re-start the atomic sequence.

Why wouldn't it work on a multi-processor?
  You don't generally know the code has been interfered with:
  a thread on another CPU can run the sequence concurrently without any
  interrupt or context switch, so there's nothing to detect and restart.

Look at the paper's Figure 3.
  It could be used in place of our TSL() (but result inverted?).
  What if context switch between lines 5 and 6?
  What if context switch between lines 6 and 7? (better not restart)
  What if context switch between lines 4 and 5? (failure?)
  There's something wrong here:
    the kernel has to be able to decide precisely whether the sequence is done.

Look at the paper's Figure 4.
  Is line 4 ever executed? (delay slot)
  What if interrupted before 3? (restart, re-do reads, no writes yet)
  What if interrupted between 3 and 4? (not possible)
  What if interrupted after 3/4? (already done)
  So the kernel can tell precisely whether the sequence has completed:
    has line 3 executed?

How general is RAS?
  They use it for one particular atomic sequence.
  Could we use it directly in insert()? (yes!)
  Can we use it directly in any atomic sequence? (no -- what if two writes?)

How can the kernel tell if it interrupted a program in a RAS?
  The Mach implementation has a single registered routine per program.
  Mach checks the saved PC when the thread is suspended.
  If the thread has started the sequence but not finished it,
    Mach modifies the saved PC to point to the first instruction
    (a rough sketch appears at the end of these notes).
  Why does the TAOS implementation work differently?
    To allow inlining of RASes and avoid a function call to the single RAS routine.
  What's the cost of the TAOS approach?
    It's harder for the kernel to decide whether a program is in a RAS.

What are the paper's claims?
  1. Their locking operation is cheaper than kernel emulation.
     Can they stop here? Why not?
  2. Threads are rarely interrupted.
     Justifies the optimistic approach.
  3. Thread suspensions are rarer than atomic operations.
     So it's OK to make the suspension code more expensive
     while keeping the locking code cheap.
  4. RAS improves overall performance, relative to kernel emulation.
     The increased costs (a check on every suspension) don't outweigh
     the savings of faster atomic sequences.
     This is the only claim that really matters.
  5. RAS is often as fast as hardware TSL.

How do they back up their claims?
  Claim 1: micro-benchmarks of atomic sequence code in 5.1;
           micro-benchmarks of multi-thread locking in 5.2.
  Claim 4: whole-application benchmarks in 5.3.
  Claim 5: micro-benchmarks of TSL and RAS in Section 6.
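To make the Mach-style suspension check concrete, here is a rough sketch of the
idea described above; the struct, names, and bounds test are hypothetical and
only illustrate the mechanism, not Mach's actual code.

  /* Hypothetical sketch of the check a kernel could make when suspending
     a thread: if the thread was pre-empted inside its one registered
     restartable atomic sequence, restart the sequence by resetting the
     saved PC to the sequence's first instruction. */
  #include <stdint.h>

  struct ras_region {
      uintptr_t begin;   /* address of the sequence's first instruction */
      uintptr_t end;     /* address just past the committing store */
  };

  /* Called at thread suspension time with the thread's saved PC. */
  void ras_fixup(const struct ras_region *ras, uintptr_t *saved_pc)
  {
      /* Started but not finished: the committing store is the last
         instruction, so any PC still inside the region means the
         sequence has not taken effect and is safe to re-run. */
      if (*saved_pc >= ras->begin && *saved_pc < ras->end)
          *saved_pc = ras->begin;
  }

Because the registered sequence does all its reads first and writes shared
memory only in its final (committing) instruction, rolling the PC back never
undoes a partially visible update.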