6.824 2004 Lecture 14: Memory Consistency (2)

Review from previous lecture:
  We want to make it possible to write correct parallel/distributed programs.
  We assume different CPUs interact only through a storage system.
    Memory, distributed shared memory, or a file system.
  So we need a "memory consistency model"
    That tells us what to expect when we read/write memory.
  We want a model that:
    Is easy to understand, so programmers can easily write correct programs.
    Is possible to implement efficienctly.

One reasonable model: sequential consistency
  There must be some total order of operations such that
  1. all CPUs see results consistent with that total order
     i.e. reads see most recent write in the total order
  2. each CPU's instructions appear in-order in the total order

Intuitive justification:
  The single total order means it's easy for one CPU to predict what
    other CPUs will see
  The "consistent with" and lack of real time may make it easy to implement
  The system appears free to interleave instruction streams however it likes
    to form the total order
  However! When executing in real time, once the system reveals
    a written value to a read operation, the system has committed to a
    little bit of partial order. this may have transitive effects.
  So in real life the system only has freedom in ordering more or
    less concurrent operations -- ones that haven't been observed yet

Remember our mutual exclusion example:
  CPU0: x = 1; if(y == 0) { critical section; }
  CPU1: y = 1; if(x == 0) { critical section; }
  We want this to work.

Lay out style of argument
  there is more than one legal result, depending on interleaving! (not like uniprocessor)
  typical question: is xxx a correct result under sequential consistency?
  "yes" if you can demonstrate an interleaving that gets that result
  "no" if you can show no interleaving could get that result
  that's the definition of correctness
  main example:
  CPU0: w(x)0      w(x)1 r(y)?
  CPU1: w(y)0      w(y)1 r(x)?
  we can evaluate all legal seq consistency interleavings manually:
    1/1? 1/0? 0/1? 0/0? [only 0/0 is illegal]

Good: sequential consistency gets the intuitive result we expected

How can we implement sequential consistency?

straw man 1:
  internet cloud, hosts
  assume each host has a big cache
    and that all data is cached on every host
  reads are local, so they are very fast
  send write msg to each other host
    (but don't wait)
  what goes wrong?
    CPU0 starts the x=1 write, and CPU1 starts y=1 write.
      I.e. they send a packet on the network.
    Both read before the write message reaches them
    So they both read 0 and enter the critical section.
    We know this isn't legal for sequential consistency.
  Lesson: a CPU must wait for each operation to complete.
    [to give the system time to order concurrent operations]

straw man 2:
  we can fix this by changing write:
    write local cache
    send write msgs to other CPUs
    wait for ACKs from all other CPUs
    only then proceed to instruction after the write
  this fixes our mutex example
    if CPU0's r(y) = 0, then CPU0 has not sent write ACK for
      CPU1's w(y)1, so CPU1 has not executed r(x).
    BUT if CPU0 executed r(y), then it had gotten CPU1's write ACK,
      so when CPU1 executes r(x), it will see 1.
  what goes wrong?
    turns out we need a new example
    CPU0: w(x)1  r(x)?
    CPU1: w(x)2  r(x)?
    legal seq consistency results? 1/1 2/2 1/2 BUT NOT 2/1
  can we get 2/1 w/ straw man 2 implementation?
    if both write local cache, then wait for remote.
  how about wait, then write local?
    then CPU0's x=1, CPU1's x=2
    as if this total order:
      w(x)1 r(x)1
                  w(x)2 r(x)2
    we've lost replica equality, but not (yet) violated seq consistency
    but system cannot now let CPU0 execute r(x)!
  Lesson: for each memory location, execute operations one at a time

These two rules are sufficient to implement sequential consistency:
  1. Operations on each CPU must execute one at a time
  2. Operations on each memory location must execute one at a time.
  XXX why is this the case?

What kind of implementation would fit well with these rules?
  Single entity in charge of ordering each CPU's operations (i.e. the CPU).
  Single entity in charge of ordering each location's operations.
  You don't need a central entity to choose the single total order!
  Example: partition memory over multiple modules on a network.
    Send all memory ops to home module.
    Divides up memory load nicely for good parallelism.
    Like Stanford FLASH.

Does your lab 5 enforce sequential consistency for i-node blocks?
  Each machine's operations are ordered (due to lock client code...)
  Lock server serializes read and write ops by different machines.
  Lots of details: e.g. wait for block server reply before releasing lock.

So you don't *need* to send all memory operations to home module.
  Home module can grant ownership using tokens or locks.

To what extent can you optimize sequential consistency?
  Delegate ownership via tokens
    So home module is serializing token grants, not memory operations
    Memory operations execute (mostly) in local caches
    This makes single-writer workloads fast
  Shared read caching also works
    CPUs cannot tell there was no global total order for reads
    Still need to serialize writes through home module
  But you can't make both reads and writes fast, in general
    Because memory system has to serialize operations for each location
    Which requires communication

Example of a faster consistency model?
  We're willing to accept more work for the programmer.
    Though we still want a well-defined model.
  And in return we expect faster execution.

Release Consistency
  You rarely see programs like the a=1; if(b==0) example.
    Because it's so hard to reason about them.
  Instead, parallel programs typically lock data that is shared and mutable.
    To create atomic multi-step sequences.
    (Not the same as cache ownership tokens...)
  Example: bank account transfer:
    acquire(l);
    b1 = b1 + x;
    b2 = b2 - x;
    release(l);
  Other CPUs aren't allowed to look at b1 or b2 while l is locked.
  So CPU could do the operations in any order within the critical section.
    I.e. load b2 before storing b1.
  Rules:
    1. CPU can't re-order any LD/ST before the acquire().
       (otherwise you might read b1 while someone else has the lock)
    2. Writes must finish before release() completes.
       (otherwise other CPUs might not see the writes)
  Can re-order, cache, &c within release/acquire, so fast.
  But: memory system must understand locks, acquire(), and release().

Tomorrow's paper (Treadmarks) is all about implementing release consistency.