"Fast Mutual Exclusion for Uniprocessors", Bershad, Redell, Ellis.

What is the problem?
  Some hardware doesn't have TSL.
  TSL is often slow.
  Goal: user-level software emulation of TSL.

What's their general approach?
  Know when the atomic sequence has been interrupted.
    By pre-emptive context switch.
  Re-start the atomic sequence.

Would their scheme make sense on a multi-processor?
  No: you don't generally know the code has been interrupted.

Would their scheme make sense w/ non-pre-emptive threads?
  No: critical sections are never interrupted.

Look at the paper's Figure 3.
  It could be used in place of our TSL() (but result inverted?).
  What if context switch between lines 5 and 6?
  What if context switch between lines 6 and 7? (better not restart)
  What if context switch between lines 4 and 5? (failure?)
  There's something wrong here:
  Kernel has to be able to decide precisely if sequence is done.

Look at the paper's Figure 4.
  Is line 4 ever executed? (delay slot)
  What if interrupted before 3? (restart, re-do reads, no writes yet)
  What if interrupted between 3 and 4? (not possible)
  What if interrupted after 3/4? (already done)
  So the kernel can tell precisely if sequence has completed:
    Has line 3 executed?

How general is RAS?
  They use it for one particular atomic sequence.
  Could we use it directly in insert()? (yes!)
  Can we use it directly in any atomic sequence? (no -- what if two writes?)

How can kernel tell if it interrupted a program in a RAS?
  Mach implementation has a single registered routine per program.
  Mach checks the saved PC when the thread is suspended.
  If it has started the sequence, but not finished,
    Modifies the saved PC to point to first instruction.

Why does the TAOS implementation work differently?
  Allow inlining of RAS; avoid function call to single RAS routine.

What's the cost of the TAOS approach?
  Hard for kernel to decide a program is in an RAS.

What are the paper's claims?
  1. Their locking operation is cheaper than kernel emulation.
     Can they stop here?
     Why not?
  2. Threads are rarely interrupted.
     Justifies optimistic approach.
  3. Thread suspensions are rarer than atomic operations.
     So it's OK to make suspension code more expensive while
     keeping locking code cheap.
  4. RAS improves overall performance, relative to kernel emulation.
     Increased costs (check on every suspension) don't outweigh
     savings of faster atomic sequences.
     This is the only one that really matters.
  5. RAS often as fast as hardware TSL.

How do they back up their claims?
  1. Micro-benchmarks of atomic sequence code in 5.1.
     Micro-benchmarks of multi-thread locking in 5.2
  4. Whole-application benchmarks in 5.3.
  5. Micro-benchmarks of TSL and RAS in Section 6.

Should you hurry to implement this for the Linux kernel on your PC?
  Linux uniprocessor kernel probably isn't pre-emptive.
  x86 has reasonable efficient TSL.