6.824 Lecture 6: Consistency

Outline
  Consistency
  Consistency models
  Strict consistency
  Sequential consistency
  IVY

Replicated data: a huge theme in distributed systems
  For performance and fault tolerance
  Often easier to replicate data than computation
  Examples:
    Caching of web pages (Web browser)
    Caches in YFS (labs 4 and 5)
    memcached for web servers
  All these examples involve optimizations for performance

How do we know if an optimization is correct?
  We need to know how to think about correct execution of distributed programs.
  Most of these ideas are from multiprocessors and databases of 20/30 years ago.
  For now, just correctness and efficiency, not fault tolerance.

Replicating content that isn't modified (or has a single writer) is "simple"
  See, for example, the Web browser:
    it obeys the HTTP expiration directive and last-modified header
  More interesting with multiple writers!

Naive distributed memory
  [diagram: CPU0, CPU1, CPU2, connected by a LAN]
  Each host has a local copy of all of memory
  Read: from local memory
  Write: send update msg to each other host (but don't wait)
  Fast: never waits for communication

Does this memory work well?
  Example 1:
    CPU0:
      v0 = f0();
      done0 = true;
    CPU1:
      while(done0 == false)
        ;
      v1 = f1(v0);
      done1 = true;
    CPU2:
      while(done1 == false)
        ;
      v2 = f2(v0, v1);
  Intuitive intent:
    CPU2 should execute f2() with the results from CPU0 and CPU1
    waiting for CPU1 implies waiting for CPU0

Example 1 won't work with naive distributed memory:
  Problem A:
    [time diagram]
    CPU0's writes of v0 and done0 may be interchanged by the network,
      leaving v0 unset but done0=true
    How to fix? Would the lab RPC fix it?
  Problem B:
    [time diagram]
    CPU2 sees CPU1's writes before CPU0's writes,
      i.e. CPU2 and CPU1 disagree on the order of CPU0's and CPU1's writes
    How to fix?

Naive distributed memory is fast but has unexpected behavior
  Maybe it isn't "correct"
  Maybe we should never have expected Example 1 to work

How can we write correct distributed programs w/ shared storage?
  The memory system promises to behave according to certain rules
  We write programs assuming those rules
  The rules are a "consistency model":
    a contract between the memory system and the programmer

What makes a good consistency model?
  There are no "right" or "wrong" models
  A model may make it harder or easier to program,
    i.e. lead to more or less intuitive results
  A model may be harder or easier to implement efficiently
  Also application dependent:
    a consistency model for Web pages differs from one for memory

How about "strict consistency"?
  Each instruction is stamped with its start time (global time)
  Rule 1: a LD gets the value of the most recent previous ST to the same address
  Rule 2: each CPU executes instructions one at a time, in order
  Essentially the same as on a uniprocessor
  Very intuitive consistency model
  Would strict consistency avoid Problems A and B?

How do you implement strict consistency?
  Time:  1   2   3   4
  CPU0:  ST          ST
  CPU1:      LD  LD
  The time between instructions << the speed of light between CPUs!
  How is the LD at time 2 even aware of the ST at time 1?
  How does the ST at time 4 know to pause until the LD at time 3 has finished?
    How does it know how long to wait?
  Too hard to implement!

A reasonable model: sequential consistency
  When is an execution (a set of operations) correct?
  There must be some total order of operations such that
    1. each CPU's instructions appear in-order in the total order
    2. all CPUs see results consistent with that total order,
       i.e. reads see the most recent write in the total order
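To make the definition concrete, here's a small checker (my own sketch,
not from the lecture or paper; the Op encoding is made up for
illustration). It brute-forces the definition on Problem A's observed
result: CPU0 did v0=1 then done0=1, yet CPU1 read done0=1 and then read
v0 as still unset (0). It tries every total order that preserves both
CPUs' program orders and replays memory:

  /* sc_check.c -- brute-force the sequential consistency definition */
  #include <stdio.h>

  typedef struct { char kind, addr; int val; } Op;  /* kind: 'R' or 'W' */

  /* Per-CPU program order for Problem A's anomalous result:
     CPU0:  v = 1; done = 1;
     CPU1:  read done -> 1; read v -> 0  (v looked unset) */
  static Op cpu0[2] = { {'W','v',1}, {'W','d',1} };
  static Op cpu1[2] = { {'R','d',1}, {'R','v',0} };

  /* Replay a candidate total order; every read must return the most
     recent write to its address (all locations start at 0). */
  static int replay_ok(Op *order[], int n) {
      int mem[256] = {0};
      for (int i = 0; i < n; i++) {
          if (order[i]->kind == 'W')
              mem[(int)order[i]->addr] = order[i]->val;
          else if (mem[(int)order[i]->addr] != order[i]->val)
              return 0;       /* read didn't see most recent write */
      }
      return 1;
  }

  /* Enumerate every interleaving that preserves program order. */
  static int search(int i0, int i1, Op *order[], int n) {
      if (i0 == 2 && i1 == 2)
          return replay_ok(order, n);
      int ok = 0;
      if (i0 < 2) { order[n] = &cpu0[i0]; ok |= search(i0+1, i1, order, n+1); }
      if (i1 < 2) { order[n] = &cpu1[i1]; ok |= search(i0, i1+1, order, n+1); }
      return ok;
  }

  int main(void) {
      Op *order[4];
      printf("sequentially consistent? %s\n",
             search(0, 0, order, 0) ? "yes" : "no");  /* prints "no" */
      return 0;
  }

No interleaving passes the replay, so it prints "no": that observed
result is not sequentially consistent, which is exactly the argument
made next.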
A sequentially consistent system would not have Problems A/B
  Problem A:
    CPU0's execution order was v0= done0=
    CPU1 saw done0= v0=
    Each CPU's operations must appear in execution order,
      so this cannot happen w/ sequential consistency
  Problem B:
    CPU1 saw v0= done0= done1=
    CPU2 saw done1= v0=
    This cannot occur given a single total order,
      so it cannot happen w/ sequential consistency

Sequential consistency allows better performance than strict consistency
  The system has some freedom in how it interleaves different CPUs' ops
    it is not forced to order by op start time, as in strict consistency
    the system can delay a read or write while it finds current values
  Performance is still not great
    Once the system reveals a written value to a read operation,
      it has committed to a little bit of partial order
    This may have transitive effects. Example:
      the system can delay revealing CPU1's done1=true to CPU2,
      but once revealed, it must also reveal CPU0's writes
    So the system only has freedom in ordering concurrent operations,
      i.e. ones that haven't been observed yet

A simple implementation of sequential consistency:
  each CPU sends R/W operations to a single memory server
  the memory server chooses an order (interleaves requests from different CPUs)
  the server executes operations one at a time, sending R replies

The simple implementation will be slow
  the single server will get overloaded
  there's no local cache, so all operations are slow
  each CPU must usually block waiting for each read reply

Idea 1: partition memory across multiple servers
  eliminates the single-server bottleneck
  can serve many CPUs in parallel if they don't use the same memory
  A Lamport paper from 1979 shows the system is seq consistent if:
    1. each CPU executes one op at a time, waiting for it to complete
    2. ops on each memory location execute one at a time
  i.e. you can have lots of independent CPUs and memory systems

Idea 2: if a memory location is not written, you can replicate it
  i.e. cache it on each CPU, so reads are fast
  but reads and writes must still be ordered:
    once a write modifies the location, no read should return the old value
    thus cached copies must be revoked before writing
    this delays writes to improve read performance
  Which brings us to IVY, which uses both ideas, and more

Why is IVY cool?
  It acts like an expensive shared-memory multiprocessor
    on a network of cheap machines
    [diagram: LAN, CPUs w/ RAM, MGR]
  It runs threaded code w/o modification
    e.g. matrix multiply, physical simulation, sort
  Labs 5 and 6 are closely related to IVY, though much simpler

Where does IVY store each memory location?
  In the memories of the machines
  It moves each location adaptively, in response to use:
    one copy on the last machine to write it
    one copy on each recent machine to read it
  With luck, most memory accesses will be local -- and fast

IVY's strategy for sequential consistency
  For a given location, there are reads and writes from different machines
  Sequential consistency demands a total order,
    thus, after a write, everyone must see the new data
    CPU0:  wx1  rx1       rx2
    CPU1:       rx1  wx2
    After the wx2, no CPU should see 1 in x

What happens during an IVY write?
  [diagram: CPUs, RAM divided into a few pages marked r/o, r/w, or x]
  IVY uses VM hardware to detect reads and writes
    it tells the VM h/w to mark each page r/w, r/o, or invalid
  Before a write, all copies are marked r/o
  CPU1's write generates a page fault
    CPU1 sends "invalidate" msgs to CPU0 &c
    CPU1 marks the page r/w
    the program resumes and modifies the page
  CPU0's next read generates a page fault
    CPU0 asks CPU1 for the current page content
    CPU0 and CPU1 mark the page r/o
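Here's a sketch of that page-fault machinery (mine, not the paper's
code). The helpers send_invalidates, fetch_page, and fault_was_write
are hypothetical placeholders; deciding read-vs-write from a fault is
platform-specific, and error handling is elided:

  /* dsm_fault.c -- trap DSM reads/writes with VM protection */
  #include <signal.h>
  #include <sys/mman.h>
  #include <stdint.h>

  #define PAGE 4096

  /* Placeholders standing in for the DSM protocol messages. */
  extern void send_invalidates(void *page);  /* revoke all r/o copies */
  extern void fetch_page(void *page);        /* get a copy from the owner */
  extern int  fault_was_write(siginfo_t *si, void *ctx);

  static void on_fault(int sig, siginfo_t *si, void *ctx) {
      void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1));
      if (fault_was_write(si, ctx)) {
          /* write to an r/o or invalid page: revoke copies, become owner */
          send_invalidates(page);
          mprotect(page, PAGE, PROT_READ | PROT_WRITE);
      } else {
          /* read of an invalid page: fetch a copy, keep it read-only */
          fetch_page(page);
          mprotect(page, PAGE, PROT_READ);
      }
      /* returning from the handler restarts the faulting instruction */
  }

  int main(void) {
      struct sigaction sa = {0};
      sa.sa_sigaction = on_fault;
      sa.sa_flags = SA_SIGINFO;
      sigaction(SIGSEGV, &sa, 0);
      /* ... map the shared region PROT_NONE and run the program ... */
  }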
Will a page size of e.g. 4096 bytes be good or bad?
  Good if there's spatial locality, i.e. the program looks at large blocks of data
  Bad if the program writes just a few bytes in a page:
    subsequent readers copy the whole page just to get a few new bytes
  Bad if there's false sharing,
    i.e. two unrelated variables on the same page:
    only one computer can have the writeable page,
    so the page will bounce among the writing computers
    even though they never modify the same location

How does IVY work? (Section 3.1)
  [diagram: LAN, three CPUs, one MGR; draw a table per entity]
  ptable: one per CPU, one entry per page
    page #
    lock
    access (read or write or nil)
    i_am_owner (not the same as write: might be the owner of an r/o page)
  info: just on the MGR, one entry per page
    page #
    lock
    copy_set
    owner
  Message types: [don't list these on the board, just for reference]
    RQ  read query (reader to MGR)
    RF  read forward (MGR to owner)
    RD  read data (owner to reader)
    RC  read confirm (reader to MGR)
    WQ  write query (writer to MGR)
    IV  invalidate (MGR to each CPU in the copy_set)
    IC  invalidate confirm (copy holder to MGR)
    WF  write forward (MGR to owner)
    WD  write data (owner to writer)
    WC  write confirm (writer to MGR)

Scenario 1: page owned by CPU0, CPU1 wants to read
  [time diagram: M 0 1]
  0. page fault on CPU1, since the page must have been marked invalid
  1. CPU1 sends RQ to the MGR
  2. MGR sends RF to CPU0, and adds CPU1 to the copy_set
  3. CPU0 sends RD to CPU1, and marks the page as access=read
  4. CPU1 sends RC to the MGR

Scenario 2: page owned by CPU0, CPU2 wants to write
  [time diagram: M 0 1 2]
  0. page fault on CPU2
  1. CPU2 sends WQ to the MGR
  2. MGR sends IV to the copy_set (i.e. CPU1)
  3. CPU1 sends an IC msg to the MGR (or does it? does the MGR wait?)
  4. MGR sends WF to CPU0
  5. CPU0 sends WD to CPU2, and clears access
  6. CPU2 sends WC to the MGR

Note that some per-page invariants hold across computers:
  only one CPU has i_am_owner
  the MGR's owner must be that CPU
  the MGR's copy_set is consistent with the CPUs' ptable access fields
  all copies of the page are the same
  writeable on the owner -> the copy_set is empty
Does the IVY protocol enforce these invariants?
  What if two CPUs want to write the same page at the same time?
  What if one CPU reads just as ownership is changing hands?
  What if there were no RC message, i.e. the MGR unlocked after sending RF?
  No WC, i.e. the MGR unlocked after sending WF?
  No IC, i.e. the MGR didn't wait for the holders of copies to ack?

Big point: the MGR's locks and confirm messages force sequential processing
  each operation completely finishes before the next starts, for each page
    (operations on different pages can overlap)
  similarly, each CPU performs operations one at a time
  result: sequential consistency
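That serialization is easy to see in code. Here's a sketch (mine, not
from the paper) of the MGR's write-query path; the message helpers are
hypothetical, and lock initialization and error handling are elided:

  /* mgr.c -- MGR holds a per-page lock for the whole WQ..WC exchange */
  #include <pthread.h>
  #include <stdbool.h>

  #define NPAGES 1024
  #define NCPUS  8

  struct info_entry {            /* MGR state, one entry per page */
      pthread_mutex_t lock;      /* held from WQ until WC arrives */
      bool copy_set[NCPUS];      /* CPUs holding r/o copies */
      int  owner;                /* CPU with the current copy */
  } info[NPAGES];                /* mutexes assumed initialized elsewhere */

  /* Hypothetical message helpers. */
  extern void send_IV_and_wait_IC(int cpu, int page);
  extern void send_WF(int owner, int page, int writer);
  extern void wait_for_WC(int page, int writer);

  /* Handle a write query (WQ) from `writer` for `page`. */
  void mgr_handle_WQ(int page, int writer) {
      struct info_entry *e = &info[page];
      pthread_mutex_lock(&e->lock);         /* one op per page at a time */
      for (int c = 0; c < NCPUS; c++)
          if (e->copy_set[c] && c != writer) {
              send_IV_and_wait_IC(c, page); /* no stale r/o copy survives */
              e->copy_set[c] = false;
          }
      send_WF(e->owner, page, writer);      /* old owner ships WD to writer */
      wait_for_WC(page, writer);            /* unlock only after confirm */
      e->owner = writer;
      pthread_mutex_unlock(&e->lock);       /* next op on this page may run */
  }

Dropping wait_for_WC (or the IC waits) is exactly the "what if there
were no WC/IC?" question above: the next operation on the page could
start before this one had finished.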
Does IVY provide sequential consistency for our example?
  v = fn();
  done = true;
  Can a CPU see done=true but still see the old v?

Does IVY provide strict consistency?
  No: the MGR might process two STs in the order opposite to their issue times
  No: a ST may take a long time to revoke read access on other CPUs,
    so LDs may get old data long after the ST issues

What about IVY's performance?
  After all, the real point was to run programs faster
  What's the best we could hope for in terms of performance?
    Nx faster on N machines
  What might prevent us from getting an Nx speedup?
    network traffic (moving lots of pages)
    locks
    many CPUs writing the same page
  How well do they do?
    Figure 4: near-linear for PDE
    Figure 6: very sub-linear for sort
    Figure 7: near-linear for matrix multiply
  Why did sort do poorly? Here's my guess:
    sort partitions its data over the CPUs
      Phase 1: local sort
      Phase 2: multiple rounds of exchanging all the data
    Phase 1 probably gets a linear speedup
    Phase 2 probably does not: it's limited by LAN speed,
      and more CPUs may mean more rounds
    So for a small # of CPUs the local sort dominates and more CPUs help;
      for a large # of CPUs communication dominates and more CPUs don't help

How could one speed up IVY?
  the paper suggests splitting up the MGR,
    or eliminating the MGR and using broadcast to find pages
  next week: relax the consistency model,
    allowing multiple writers to the same page!

The paper's intro says DSM subsumes RPC -- is that true?
  When would DSM be better than RPC?
    more transparent; easier to program
  When would RPC be better?
    isolation; control over communication; tolerating latency;
    portability; defining your own semantics
  Might you still want RPC in your DSM system?
    for efficient sleep/wakeup?