6.824 Lecture 6: Consistency

Outline
  Consistency
  Consistency models
  Strict consistency
  Sequential consistency
  IVY

Consistency = a constraint on the memory model of a storage system
  Interesting when data is replicated and accessed concurrently
  Replicated data is a huge theme in distributed systems
    For performance and fault tolerance
    Often easier to replicate data than computation
  Examples:
    Caching of web pages (Web browser)
    Caches in YFS (labs 5 and 6)
    Caches in the Linux kernel
  All these examples involve sophisticated optimizations for performance
  How do we know if an optimization is correct?
    We need to know how to think about correct execution of distributed programs.
    Most of these ideas are from multiprocessors and databases of 20/30 years ago.
    For now, just correctness and efficiency, not fault tolerance.

Replicating content that isn't modified (or has a single writer) is "simple"
  See, for example, the Web browser
    The browser obeys the HTTP expiration directive and last-modified
  The topic gets much more interesting with concurrent reads and writes.

Let's assume we are implementing a traditional memory (i.e. with LD/ST)
  Matches many read/write abstract interfaces (e.g., file systems)

Let's distribute it naively
  Internet cloud, hosts CPU0, CPU1, CPU2
  Assume each host has a local copy of all of memory
  Reads are local, so they are very fast
  Write: send an update msg to each other host (but don't wait)

Does this memory work well?

Example 1:
  CPU0:
    v0 = f0();
    done0 = true;
  CPU1:
    while(done0 == false)
      ;
    v1 = f1(v0);
    done1 = true;
  CPU2:
    while(done1 == false)
      ;
    v2 = f2(v0, v1);
  Intuitive intent:
    CPU2 should execute f2() with results from CPU0 and CPU1
    Waiting for CPU1 implies waiting for CPU0

Example 1 won't work with the naive distributed memory
(a simulation sketch appears after the strict-consistency discussion below):
  Problem A:
    CPU0's writes of v0 and done0 may be interchanged by the network,
    leaving v0 unset but done0=true
  But now assume each CPU sees every other CPU's writes in issue order
    (thus scenario A could not happen with the YFS RPC system)
  Problem B:
    CPU2 sees CPU1's writes before CPU0's writes
    i.e. CPU2 and CPU1 disagree on the order of CPU0's and CPU1's writes
    (this could happen even when using YFS RPC)
  Lesson: either the naive distributed memory isn't "correct",
    or we should not have expected these examples to work.

How can we write correct distributed programs w/ shared storage?
  The memory system promises to behave according to certain rules.
  We write programs assuming those rules.
  The rules are a "consistency model",
    a contract between the memory system and the programmer.

What makes a good consistency model?
  There are no "right" or "wrong" models
  A model may make it harder or easier to program
    i.e. lead to more or less intuitive results
  A model may be harder or easier to implement efficiently
  It is also application dependent
    A consistency model for Web pages is different from one for memory

How about "strict consistency":
  Each instruction is stamped with the wall-clock time at which it started,
    across all CPUs
  Rule 1: LD gets the value of the most recent previous ST to the same address
  Rule 2: each CPU's instructions have time-stamps in execution order
  Essentially the same as on a uniprocessor
  Would strict consistency avoid problems A and B?

How do you implement strict consistency?
  Time:  1    2    3    4
  CPU0:  ST             ST
  CPU1:       LD   LD
  Time between instructions << speed-of-light delay between CPUs!
  How is LD@2 even aware of ST@1?
  How does ST@4 know to pause until LD@3 has finished?
    How does ST@4 know how long to wait?
  Too hard to implement!
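To make problems A and B concrete, here is a minimal simulation sketch (in Go) of the
naive distributed memory running the CPU0/CPU1 part of Example 1. It is not code from
the lecture or the labs; the host and update types and the artificial delay are made up
for illustration. The simulated network deliberately delivers CPU0's update of v0 after
its update of done0, so CPU1 spins until done0 is set and then reads a stale v0 -- problem A.

  // Simulation sketch of the naive distributed memory (made up for illustration).
  // Each host keeps a full local copy of memory; reads are local; a write updates
  // the local copy and sends an update message to every other host without waiting.
  package main

  import (
      "fmt"
      "sync"
      "time"
  )

  type update struct {
      addr  string
      value int
  }

  type host struct {
      mu  sync.Mutex
      mem map[string]int
      in  chan update // incoming update messages from other hosts
  }

  func newHost() *host {
      h := &host{mem: map[string]int{}, in: make(chan update, 16)}
      go func() { // apply remote updates as they arrive
          for u := range h.in {
              h.mu.Lock()
              h.mem[u.addr] = u.value
              h.mu.Unlock()
          }
      }()
      return h
  }

  func (h *host) read(addr string) int {
      h.mu.Lock()
      defer h.mu.Unlock()
      return h.mem[addr] // reads are purely local
  }

  // write updates the local copy, then (asynchronously) tells the other hosts.
  func (h *host) write(addr string, value int, others []*host) {
      h.mu.Lock()
      h.mem[addr] = value
      h.mu.Unlock()
      for _, o := range others {
          go func(o *host) { // don't wait; the "network" may reorder
              if addr == "v0" {
                  time.Sleep(10 * time.Millisecond) // delay v0 so done0 arrives first
              }
              o.in <- update{addr, value}
          }(o)
      }
  }

  func main() {
      cpu0, cpu1 := newHost(), newHost()

      // CPU0: v0 = f0(); done0 = true;
      cpu0.write("v0", 42, []*host{cpu1})
      cpu0.write("done0", 1, []*host{cpu1})

      // CPU1: while(done0 == false) ; then read v0
      for cpu1.read("done0") == 0 {
      }
      fmt.Println("CPU1 sees done0=1, v0 =", cpu1.read("v0")) // likely 0, not 42
  }

With per-sender in-order delivery (as assumed above for the YFS RPC system) this particular
interleaving goes away, but nothing in the naive design prevents problem B, where CPU1 and
CPU2 disagree about the relative order of writes from different senders.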
One reasonable model: sequential consistency
  Is an execution (a set of operations) correct?
  There must be some total order of the operations such that
    1. all CPUs see results consistent with that total order
       i.e. reads see the most recent write in the total order
    2. each CPU's instructions appear in-order in the total order

Intuitive justification:
  The single total order means it's easy for one CPU to predict what other CPUs will see
  The "consistent with" and the lack of real time may make it easy to implement
  The system appears free to interleave instruction streams however it likes
    to form the total order
  However! When executing in real time, once the system reveals a written value
    to a read operation, it has committed to a little bit of partial order;
    this may have transitive effects.
  So in real life the system only has freedom in ordering more or less concurrent
    operations -- ones that haven't been observed yet.

Problems A and B above are examples of a memory system that is not sequentially consistent:
  It is not possible to construct a total order that captures executions A and B.
  Consider execution A, with op1 = CPU0's write of v0 and op2 = CPU0's write of done0.
    In the total order op1 must precede op2, because CPU0 executes them in that order;
    but to explain what CPU1 observes (done0 set, v0 not), op2 must precede op1.
    We cannot have a single total order in which both are true.
  A similar argument holds for execution B.
  Therefore a memory system that allows executions A and B cannot be sequentially consistent.
To prove that a memory system is sequentially consistent, we have to merge the
  per-processor total orders of ops into a global total order.
  Some ops create a partial order between different CPUs, because an op causes a
    value to become visible at another CPU, which then uses it.
  We have to show that every partial order the memory system's protocol can create
    can be embedded in a global total order.

Let's look at IVY. What consistency does it provide, and how?

Why is IVY cool?
  All the advantages of *very* expensive parallel hardware.
  On a cheap network of workstations.
  No h/w modifications required!

Do we want a single address space?
  Or have the programmer understand remote references?

Shall we make a fixed partition of the address space?
  I.e. first megabyte on host 0, second megabyte on host 1, &c?
  And send all reads/writes to the correct host?
  We can detect reads and writes using the VM hardware,
    i.e. read- or write-protect remote pages.
  What if we didn't do a good job of placing the pages on hosts?
    Maybe we cannot predict which hosts will use which pages?

Could move the page each time it is used:
  When I read or write, I find the current owner, and I take the page.
  So we need a more dynamic way to find the current location of a page.

What if lots of people read a page?
  Move the page for writing, but allow read-only copies.
  When I write a page, I invalidate r/o cached copies.
  When I read a non-cached page, I find the most recent writer.
  Works if pages are r/o and shared, or r/w by one host.
  The only bad case is write sharing.
    When might this arise? False sharing...

How does their scheme work? (Section 3.1)
  Three CPUs, one MGR; draw a table per entity.
  ptable: one per CPU, one entry per page
    lock
    access (read or write or nil)
    i_am_owner (same as write?)
  info: just on the MGR, one entry per page
    lock
    copy_set
    owner
  Message types:
    RQ read query (reader to MGR)
    RF read forward (MGR to owner)
    RD read data (owner to reader)
    RC read confirm (reader to MGR)
    WQ write query (writer to MGR)
    IV invalidate (MGR to copy_set members)
    IC invalidate confirm (copy_set member to MGR)
    WF write forward (MGR to owner)
    WD write data (owner to writer)
    WC write confirm (writer to MGR)

scenario 1: owned by CPU0, CPU1 wants to read
  (a sketch of this path in code follows the list)
  0. page fault on CPU1, since the page must have been marked invalid
  1. CPU1 sends RQ to MGR
  2. MGR sends RF to CPU0, MGR adds CPU1 to copy_set
  3. CPU0 sends RD to CPU1, CPU0 marks the page as access=read
  4. CPU1 sends RC to MGR
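For concreteness, here is a minimal sketch (in Go) of the scenario 1 read-fault path,
written as ordinary function calls rather than real RQ/RF/RD/RC messages. The type and
field names (manager, copySet, ptable entries, ...) are made up for illustration, and the
per-page locks and the actual network are omitted; the sketch only shows how the MGR's
info table and the CPUs' ptables change.

  // Sketch of IVY's read-fault handling (scenario 1); names made up for illustration.
  package main

  import "fmt"

  type pageAccess int

  const (
      accessNil pageAccess = iota
      accessRead
      accessWrite
  )

  type pageEntry struct { // one ptable entry on a CPU
      access pageAccess
      owner  bool
      data   []byte
  }

  type cpu struct {
      id     int
      ptable map[int]*pageEntry
  }

  type managerEntry struct { // one info entry on the MGR
      copySet map[int]bool // CPUs holding read-only copies
      owner   int
  }

  type manager struct {
      info map[int]*managerEntry
      cpus map[int]*cpu
  }

  // readFault plays out scenario 1: RQ to MGR, MGR forwards RF to the owner,
  // owner replies RD with the page, reader confirms with RC.
  func (m *manager) readFault(reader *cpu, page int) {
      // RQ: reader -> MGR
      e := m.info[page]
      e.copySet[reader.id] = true // MGR adds the reader to copy_set

      // RF: MGR -> owner; owner downgrades its mapping to access=read
      owner := m.cpus[e.owner]
      ownerEntry := owner.ptable[page]
      ownerEntry.access = accessRead

      // RD: owner -> reader (ships the page contents)
      reader.ptable[page] = &pageEntry{access: accessRead, data: ownerEntry.data}

      // RC: reader -> MGR (read confirm)
      fmt.Printf("CPU%d now has a read-only copy of page %d; copy_set=%v\n",
          reader.id, page, e.copySet)
  }

  func main() {
      cpu0 := &cpu{id: 0, ptable: map[int]*pageEntry{
          7: {access: accessWrite, owner: true, data: []byte("hello")},
      }}
      cpu1 := &cpu{id: 1, ptable: map[int]*pageEntry{}}
      mgr := &manager{
          info: map[int]*managerEntry{7: {copySet: map[int]bool{}, owner: 0}},
          cpus: map[int]*cpu{0: cpu0, 1: cpu1},
      }
      mgr.readFault(cpu1, 7) // CPU1 takes a read fault on page 7, owned by CPU0
      fmt.Println("CPU1's copy of page 7:", string(cpu1.ptable[7].data))
  }

Scenario 2 (a write fault) follows the analogous WQ/IV/IC/WF/WD/WC path, with the extra
step of invalidating the read-only copies in the copy_set.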
scenario 2: owned by CPU0, CPU2 wants to write
  0. page fault on CPU2
  1. CPU2 sends WQ to MGR
  2. MGR sends IV to the copy_set (i.e. CPU1)
  3. CPU1 sends an IC msg to MGR (or does it? does the MGR wait?)
  4. MGR sends WF to CPU0
  5. CPU0 sends WD to CPU2, clears access
  6. CPU2 sends WC to MGR

What if two CPUs want to write the same page at the same time?
  Problem: a write has many steps and modifies multiple tables.
  The tables have invariants:
    MGR must agree w/ the CPUs about the single owner
    MGR must agree w/ the CPUs about the copy_set
    copy_set != {} must agree with the owner's writeability
  Thus the write implementation should be atomic.
  What enforces the atomicity?
  What are the RC and WC messages for?
  What if an RF is overtaken by a WF?
    (I'm not sure about this. Perhaps they assume FIFO message order anyway?)
  (per-CPU ptable locks fix many potential races; they seem unnecessary!)

Does IVY provide strict consistency?
  No: a ST may take a long time to revoke read access on other CPUs,
  so LDs may get old data long after the ST issues.

Does IVY provide sequential consistency?
  Does it work on our examples?
    v = fn();
    done = true;
  Can a CPU see done=true but still see the old v?

Invariants:
  1. every page has exactly one current owner
  2. the current owner has a copy of the page
  3. if mapped r/w by the owner, there are no other copies
  4. if mapped r/o by the owner, there may be identical other r/o copies
  5. the manager knows about all copies

IVY does seem to use our two sequential-consistency implementation rules.
You can always construct a total order:
  1. Each CPU executes its reads/writes in program order, one at a time.
  2. All writes are in a total order (the manager orders them).
  3. Once a read observes the effect of a write, it is partially ordered after that write.
     Order the reads in any total order that obeys the partial order.
If you study the protocol carefully, you can argue that there is no partial order
  created by the protocol that cannot be embedded in a total order.
  All CPUs observe their local ops in a local total order (1).
  All CPUs observe other CPUs' operations in an order consistent with a total order.
    For writes this is easy to see: the writes form a total order, because there is
    only a single writer at a time (2).
    For reads the argument is more complex, because reads can happen truly
    concurrently, but it is never the case that a read on one processor observes a
    result that is inconsistent, in a total order, with an observation on another
    processor. This could only happen if a scenario like A or B above were possible,
    and the confirmation messages + locks ensure that never happens (3).

What's a block odd-even based merge-split algorithm?
  (a sketch appears after the DSM-vs-RPC comparison below)
  Why is it appropriate to this kind of parallel architecture?
  Partition the data over the N CPUs.
  Local sort at each CPU.
  View the CPUs as a logical chain.
  Repeat N times:
    Even CPUs send to the (higher) odd CPUs.
      Odd CPUs merge, send the low half back to the even CPU.
    Odd CPUs send to the (higher) even CPUs.
      Even CPUs merge, send the low half back to the odd CPU.
  Note that "send to" means look at the right place in shared memory.
  Probably everything is in a single huge array.

What's the best we could hope for in terms of performance?
  Linear speedup...
When are you likely to get linear speedup?
  When there's not much sharing -- a strict partition.
How well do they do?
  Linear for PDE and matrix multiplication; not so good for sorting.

In what sense does DSM subsume the notion of RPC?
  When would DSM be better than RPC?
    More transparent. Easier to program.
  When would RPC be better?
    Isolation. Control over communication. Tolerate latency. Portability.
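For reference, here is a small sequential sketch (in Go) of the block odd-even merge-split
algorithm described above. In the paper the blocks live in one big shared array and each
CPU runs its steps in parallel over DSM; this sketch just simulates the rounds in a single
process, and all names are made up. Each round does an even phase and an odd phase of
merge-splits between neighboring blocks; after N rounds (N = number of blocks) the
concatenation of the blocks is sorted.

  // Sequential sketch of block odd-even merge-split sort (made up for illustration).
  package main

  import (
      "fmt"
      "sort"
  )

  // mergeSplit merges two sorted blocks and splits the result: the lower block
  // keeps the small half, the higher block keeps the large half.
  func mergeSplit(lo, hi []int) {
      merged := append(append([]int{}, lo...), hi...)
      sort.Ints(merged) // a real implementation would do a linear merge
      copy(lo, merged[:len(lo)])
      copy(hi, merged[len(lo):])
  }

  func oddEvenBlockSort(blocks [][]int) {
      n := len(blocks)
      for i := range blocks { // local sort at each "CPU"
          sort.Ints(blocks[i])
      }
      for round := 0; round < n; round++ {
          // Even phase: even blocks exchange with the next (odd) block.
          for i := 0; i+1 < n; i += 2 {
              mergeSplit(blocks[i], blocks[i+1])
          }
          // Odd phase: odd blocks exchange with the next (even) block.
          for i := 1; i+1 < n; i += 2 {
              mergeSplit(blocks[i], blocks[i+1])
          }
      }
  }

  func main() {
      blocks := [][]int{{9, 3, 7}, {1, 8, 2}, {6, 0, 5}, {4, 11, 10}}
      oddEvenBlockSort(blocks)
      fmt.Println(blocks) // [[0 1 2] [3 4 5] [6 7 8] [9 10 11]]
  }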
Might you still want RPC in your DSM system?
  For efficient sleep/wakeup?

Has their idea been successful?
  Using workstations for parallel processing: yes!
    Beowulf, MapReduce, Dryad
  Shared memory? Hard to say.
    Lack of control over communication details?