6.824 2005 Lecture 11: Memory Consistency (1)

Replicated data
  a huge theme in distributed systems
  for performance and fault tolerance
  often easier to replicate data than computation (Hypervisor...)
  Examples:
    Replicated mailboxes and user info in Porcupine.
    Caches in NFS and Echo.
    File server in labs 4 and 5.
    Shared-memory multiprocessors.
  All these examples involve sophisticated optimizations for performance.
  How do we know if an optimization is correct?

We need to know how to think about correct execution of distributed
programs. Most of these ideas come from work on multiprocessors 20-30
years ago, so I'll talk about memory, loads, and stores; but the ideas
are similar for e.g. lab 5. For now, just correctness and efficiency,
not fault-tolerance.

What is "correct" for uniprocessor programs?
  First, define "correct" for each instruction independently:
    it takes the machine from one state to another,
    e.g. ADD R1, R2, R3, or a LD that gets the value of the last ST
    to the same address.
  Rule: a result is "correct" if it's the same as a result obtainable by
    executing the instructions one at a time, waiting for each to complete.

Uniprocessor correctness is a useful definition because
  the programmer can use it to predict what a program will do,
  and thus write correct programs.
  You can tell this is a *definition* because modern CPUs don't work
  like this; they break this rule, with lots of logic to ensure they
  *look* like they are enforcing it.

Example of why a CPU doesn't want to *implement* the uniprocessor rule:
  MUL R1, R2, R3
  ST  x, R1
  LD  y, R4
  MUL is pretty slow, and the ST has to wait for it (dependency via R1).
  But the LD does not need to wait: it will get the same result if we
  execute it early. So generally the LD executes before the ST, for speed.
    But the CPU h/w checks whether x == y, and stalls the LD if they
    are equal.
  Point: this optimization is only possible w/ a definition!

What about correctness for distributed computations?
  multiple hosts, shared memory
  memory could be files, DSM (next paper), or a DHT

Naive distributed memory:
  internet cloud, hosts CPU0, CPU1, CPU2
  assume each host has a local copy of all of memory
  reads are local, so they are very fast
  send write msgs to each other host (but don't wait)

Example 1: a simple mutual exclusion algorithm, for locking.
  x and y start as zero on both CPUs.
  CPU0: x = 1; if(y == 0) critical section;
  CPU1: y = 1; if(x == 0) critical section;
  Intuitive explanation for why this should "work":
    if CPU0 sees y == 0, CPU1 can't have reached "y = 1",
    so CPU1 must see x == 1, so it won't execute the critical section.

Example 1 won't work w/ naive distributed memory
(or indeed on most multiprocessors).
  Problem A: both reads come from the local replica of memory,
    before the write messages are delivered.

Example 2:
  CPU0: v0 = f0(); done0 = true;
  CPU1: while(done0 == false) ; v1 = f1(v0); done1 = true;
  CPU2: while(done1 == false) ; v2 = f2(v0, v1);
  Intuitive intent:
    CPU2 should execute f2() with the results from CPU0 and CPU1;
    waiting for CPU1 implies waiting for CPU0.

Example 2 won't work with naive distributed memory.
  [not Problem A, since no reads occur after a write]
  Problem B: CPU0's writes of v0 and done0 may be interchanged by
    the network, leaving v0 unset but done0 = true.
  But assume each CPU sees every other CPU's writes in issue order.
  Problem C: CPU2 may see CPU1's writes before CPU0's writes,
    i.e. CPU2 and CPU1 disagree on the order of CPU0's and CPU1's writes.
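Aside: these failures aren't unique to naive distributed memory; real
multiprocessors reorder too. Below is Example 1 as a litmus test in C
with POSIX threads -- a minimal sketch of mine, not from the lecture;
the harness, round count, and variable names are all assumptions. On
common x86 hardware, each thread's store can sit in a store buffer
while its load reads the stale 0, so some fraction of rounds ends with
both "CPUs" in the critical section.

  /* Example 1 as a litmus test (sketch; assumes POSIX threads and
     barriers, e.g. on Linux: cc -O2 -pthread sb.c). */
  #include <pthread.h>
  #include <stdio.h>

  volatile int x, y;        /* the shared flags from Example 1 */
  int r0, r1;               /* what each "CPU" read */
  pthread_barrier_t start;  /* line the two threads up each round */

  void *cpu0(void *arg) {
      pthread_barrier_wait(&start);
      x = 1;
      r0 = y;               /* 0 means CPU0 would enter the critical section */
      return 0;
  }

  void *cpu1(void *arg) {
      pthread_barrier_wait(&start);
      y = 1;
      r1 = x;               /* 0 means CPU1 would enter the critical section */
      return 0;
  }

  int main(void) {
      int both = 0, rounds = 100000;    /* round count is arbitrary */
      for (int i = 0; i < rounds; i++) {
          x = y = 0;
          pthread_barrier_init(&start, 0, 2);
          pthread_t t0, t1;
          pthread_create(&t0, 0, cpu0, 0);
          pthread_create(&t1, 0, cpu1, 0);
          pthread_join(t0, 0);
          pthread_join(t1, 0);
          pthread_barrier_destroy(&start);
          if (r0 == 0 && r1 == 0)
              both++;                   /* both read 0: mutual exclusion broken */
      }
      printf("both entered in %d of %d rounds\n", both, rounds);
  }

Note that volatile only stops the *compiler* from reordering the two
accesses; nothing stops the hardware, which is Problem A in miniature.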
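Example 2 can be sketched the same way (again mine, with trivial
arithmetic standing in for f0/f1/f2). On x86 stores become visible in
issue order, so this tends to work there; under naive distributed
memory (Problems B and C), or on more weakly ordered machines such as
ARM or POWER, a done flag could be seen while v0 or v1 is still stale.

  /* Example 2 with POSIX threads (sketch). */
  #include <pthread.h>
  #include <stdio.h>

  volatile int v0, v1, v2;
  volatile int done0, done1;

  void *cpu0(void *arg) {
      v0 = 1;                /* stands in for v0 = f0() */
      done0 = 1;
      return 0;
  }

  void *cpu1(void *arg) {
      while (done0 == 0) ;   /* wait for CPU0 */
      v1 = v0 + 1;           /* stands in for v1 = f1(v0) */
      done1 = 1;
      return 0;
  }

  void *cpu2(void *arg) {
      while (done1 == 0) ;   /* wait for CPU1 (and, we hope, CPU0) */
      v2 = v0 + v1;          /* stands in for v2 = f2(v0, v1) */
      return 0;
  }

  int main(void) {
      pthread_t t0, t1, t2;
      pthread_create(&t2, 0, cpu2, 0);
      pthread_create(&t1, 0, cpu1, 0);
      pthread_create(&t0, 0, cpu0, 0);
      pthread_join(t0, 0);
      pthread_join(t1, 0);
      pthread_join(t2, 0);
      printf("v2 = %d (intent: 3)\n", v2);  /* Problems B/C could make this wrong */
  }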
Lesson:
  either naive distributed memory isn't "correct",
  or we should not have expected the Examples to work.

How can we write correct distributed programs w/ shared storage?
  The memory system promises to behave according to certain rules.
  We write programs assuming those rules.
  The rules are a "consistency model",
  a contract between the memory system and the programmer.

What makes a good consistency model?
  There are no "right" or "wrong" models.
  A model may make it harder or easier to program,
    i.e. lead to more or less intuitive results.
  A model may be harder or easier to implement efficiently.

How about "strict consistency":
  each instruction is stamped with the wall-clock time at which it
  started, across all CPUs.
  Rule 1: a LD gets the value of the most recent previous ST to the
    same address.
  Rule 2: each CPU's instructions have time-stamps in execution order.
  Essentially the same as on a uniprocessor.

Would strict consistency execute Example 1 intuitively?
  Could both CPUs be in the critical section, i.e. could both CPUs
  read 0? That is, can we show a time-stamp ordering of the
  operations, consistent with the rules, that leads to both CPUs in
  the critical section?
  Rule 2 says each CPU's operations have time-stamps in execution
  order, so:
    CPU0: w(x)1 r(y)0
    CPU1: w(y)1 r(x)0
    but we're not sure of the interleaving.
  CPU0's r(y)0 means w(y)1 hadn't executed yet, by Rule 1, so:
    CPU0: w(x)1 r(y)0
    CPU1:              w(y)1 r(x)0
  But now we've violated Rule 1, since w(x)1 is followed by r(x)0.
  So both CPUs cannot be in the critical section.
  (A brute-force check of this argument is sketched at the end of
  these notes.)
  In general strict consistency produces intuitive behavior.

How do you implement strict consistency?
  Time:  1    2    3    4
  CPU0:  ST             ST
  CPU1:       LD   LD
  The time between instructions is << the speed-of-light delay
  between the CPUs!
  How is LD@2 even aware of ST@1?
  How does ST@4 know to pause until LD@3 has finished?
    How does ST@4 know how long to wait?
  Too hard to implement!
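The brute-force check promised above: a tiny C program (mine, not the
lecture's) that enumerates all six interleavings of CPU0's
{w(x)1, r(y)} and CPU1's {w(y)1, r(x)} that respect Rule 2, applies
Rule 1 to each read, and counts how many leave both CPUs reading 0.
It prints 0, matching the argument.

  /* Exhaustively check Example 1 under strict consistency (sketch).
     An interleaving assigns each of 4 time slots to CPU0 or CPU1;
     Rule 2 fixes each CPU's per-CPU order, and Rule 1 makes each LD
     return the latest earlier ST. */
  #include <stdio.h>

  int main(void) {
      int total = 0, bad = 0;
      for (int mask = 0; mask < 16; mask++) {  /* bit i set: slot i is CPU0's */
          int c0 = 0;
          for (int i = 0; i < 4; i++)
              if (mask & (1 << i))
                  c0++;
          if (c0 != 2)
              continue;                /* each CPU gets exactly 2 slots */
          int x = 0, y = 0, r0 = -1, r1 = -1, n0 = 0, n1 = 0;
          for (int i = 0; i < 4; i++) {
              if (mask & (1 << i)) {   /* CPU0's next op, in program order */
                  if (n0++ == 0) x = 1;        /* w(x)1 */
                  else           r0 = y;       /* r(y), by Rule 1 */
              } else {                 /* CPU1's next op */
                  if (n1++ == 0) y = 1;        /* w(y)1 */
                  else           r1 = x;       /* r(x), by Rule 1 */
              }
          }
          total++;
          if (r0 == 0 && r1 == 0)
              bad++;                   /* both would enter the critical section */
      }
      printf("%d interleavings, %d with both reads 0\n", total, bad);
      /* prints: 6 interleavings, 0 with both reads 0 */
  }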