6.824 2017 Lecture 13: Munin distributed shared memory

Today: distributed computing
  The big idea: run your huge computation on a room full of cheap computers!
    Instead of on one big, expensive shared-memory multiprocessor
    Nowadays: data-center clusters of many inexpensive shared-memory multiprocessors
  An old and difficult goal; many approaches; much progress; still hot
  Other cluster-computing papers in 6.824: MapReduce, Spark

Why the Munin paper?
  Distributed shared memory is an interesting idea
  One of the most popular topics for 6.824 projects
  Large literature on distributed shared memory (DSM) systems;
    this paper is one of the earlier ones
  Software release consistency (introduced by the paper)
  The software artifact itself was not influential, but the DSM ideas are

Distributed shared memory (DSM)
  DSM provides the illusion of shared memory on a distributed system
  You all know how to write parallel (threaded) Go programs
  Let's farm the threads out to a big cluster of machines!
  What's missing? Shared memory!

DSM plan:
  Programmer writes a parallel program: threads, shared variables, locks, &c.
  DSM system farms the threads out to a cluster of machines
  DSM system creates the illusion of a single shared memory
  [diagram: LAN, machines w/ RAM]

DSM advantages:
  Familiar model -- shared variables, locks, &c.
  General purpose (compared to, e.g., MapReduce and Spark)
  Can use existing apps and libraries written for multiprocessors
  Lots of machines on a LAN are much cheaper than a huge multiprocessor
  But: machines on a LAN don't actually share memory

General DSM approach (but slow):
  This is the "conventional" pattern in the Munin paper
  Use the hardware's virtual memory protection (r/w vs r/o vs invalid)
  General idea, illustrated with two machines:
    Part of the address space starts out on M0; on M1 it is marked invalid
    Part of the address space starts out on M1; on M0 it is marked invalid
    A thread of the application on M1 may refer to an address that lives on M0
    If the thread LDs/STs to that "shared" address, M1's hardware takes a page fault
      (because the page is marked invalid)
    The OS propagates the page fault to the DSM runtime
    The DSM runtime fetches the page from M0:
      DSM on M0 marks the page invalid and sends the page to M1
      DSM on M1 receives it and copies it into the underlying physical memory
      DSM on M1 marks the page valid
      DSM on M1 returns from the page fault handler
      The hardware retries the LD/ST
  Runs threaded code w/o modification
    e.g., matrix multiply, physical simulation (e.g., SOR), sort

Challenges:
  Memory consistency model (does memory act the way programmers expect it to?)
    What result does a read observe? The last write? Stale data?
  Performance (is it fast?)

Example: x and y start out = 0
  thread 0:
    x = 1
    if y == 0:
      print yes
  thread 1:
    y = 1
    if x == 0:
      print yes
  Would it be OK if both threads printed "yes"?
  What is the outcome under the general DSM approach above?
    Is it even possible for both threads to print "yes"?
  Could they both print "yes" if this were a Go program? (see the sketch below)
    How could that happen? Why is that allowed?
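Here is the example above as a plain Go program -- a rough sketch for discussion,
not from the paper; the data race is intentional:

  package main

  import (
      "fmt"
      "sync"
  )

  // x and y play the roles of the shared variables in the example.
  // There is deliberately no synchronization, so this program has a data race.
  var x, y int

  func main() {
      var wg sync.WaitGroup
      wg.Add(2)
      go func() { // thread 0
          defer wg.Done()
          x = 1
          if y == 0 {
              fmt.Println("thread 0: yes")
          }
      }()
      go func() { // thread 1
          defer wg.Done()
          y = 1
          if x == 0 {
              fmt.Println("thread 1: yes")
          }
      }()
      wg.Wait()
  }

Both goroutines may print "yes": the accesses to x and y are unsynchronized, so
Go's memory model lets the compiler and CPU reorder or delay them (the race
detector, go run -race, will complain). Under a strongly consistent
(sequentially consistent) memory, at most one "yes" is possible.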
The conventional approach provides strong consistency
  We don't care much about the specific definition (often very technical)
  We do care about the general notion of strong consistency,
    both for the papers that deliver it (or don't) and for Lab 3
  What are we looking for in strong consistency?
    We want the replicated system to behave like an unreplicated one
    If client 1 calls ST and the ST completes, then a later LD by client 2 must see the ST
      Makes sense if each call is sequential (one at a time):
        each operation in a one-at-a-time sequence sees the updates of its predecessors
    One interesting case left: concurrent operations
      LDs and STs that start at about the same time, so their executions overlap
      Strong consistency in that case:
        the return values and updated state after a set of concurrent operations
        must be the same as if the operations had executed one at a time in some order
  You have seen lots of refinements of this basic idea of strong consistency:
    linearizable and sequentially consistent (for simple reads and writes)
    serializable and strictly serializable (for multi-step transactions)
    In this class we will not worry much about the detailed differences

Main trade-off in consistency models:
  Lax model => greater freedom to optimize
    i.e., send fewer messages, wait less for responses, etc.
  Strict model => matches programmer intuition (e.g., a read sees the latest write)
  This trade-off is a huge factor in many designs

Munin: a case study of DSM
  Software release consistency
    Not unlike Go's memory model
  Optimized implementations for particular sharing patterns:
    read-only, write-shared, producer-consumer

Munin's high-level goals?
  Better DSM performance than the conventional approach
  Run existing parallel code

What specific problems with previous DSM are they trying to fix?
  False sharing: two machines r/w different variables on the same page
    M1 writes x, M2 writes y
    M1 writes x, M2 just reads y
    Q: what does the conventional approach do in this situation?
  Write amplification: a one-byte write turns into a whole-page transfer

First goal: eliminate write amplification
  Don't send the whole page, just the written bytes
  Big idea: write diffs (see the sketch below)
    On M1's write fault:
      tell the other hosts to invalidate the page but keep a hidden copy
      M1 makes a hidden copy as well
    On M2's later fault:
      M2 asks M1 for the recent modifications
      M1 "diffs" the current page against its hidden copy
      M1 sends the diffs to M2 (and to all machines w/ a copy of this page)
      M2 applies the diffs to its hidden copy
      M1 marks its page r/o
  Q: do write diffs change the consistency model?
    At most one writeable copy, so writes are ordered
    No writing while any copy is readable, and readable copies are up to date,
      so there are no stale reads
    Still sequentially consistent
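A minimal sketch of write diffs in Go, under simplifying assumptions: a page is
just a byte slice, the fault handling and invalidation messages are omitted,
diffs are per-byte rather than run-length encoded, and the names (delta,
makeTwin, computeDiff, applyDiff) are illustrative, not Munin's:

  package main

  import "fmt"

  const pageSize = 4096

  // A delta records one modified byte within a page.
  type delta struct {
      offset int
      value  byte
  }

  // makeTwin saves a hidden copy of the page at the first write fault,
  // so that later writes can be found by comparison.
  func makeTwin(page []byte) []byte {
      twin := make([]byte, len(page))
      copy(twin, page)
      return twin
  }

  // computeDiff compares the current page against its hidden copy and
  // returns just the bytes that were written locally.
  func computeDiff(page, twin []byte) []delta {
      var diffs []delta
      for i := range page {
          if page[i] != twin[i] {
              diffs = append(diffs, delta{i, page[i]})
          }
      }
      return diffs
  }

  // applyDiff installs another machine's writes into the local copy.
  func applyDiff(page []byte, diffs []delta) {
      for _, d := range diffs {
          page[d.offset] = d.value
      }
  }

  func main() {
      // M1's copy of the page, and the hidden copy made at its write fault.
      m1 := make([]byte, pageSize)
      twin := makeTwin(m1)
      m1[10] = 42 // M1's one-byte write (e.g., to x)

      // M2's copy started out identical; it receives and applies only the diff.
      m2 := make([]byte, pageSize)
      applyDiff(m2, computeDiff(m1, twin))
      fmt.Println(m2[10]) // prints 42
  }

The point: M1's one-byte write crosses the network as one small delta rather
than as a whole 4-KB page, which is what eliminates write amplification.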
Q: do write diffs fix false sharing?

Next goal: allow multiple readers + writers, to cope with false sharing
  => don't invalidate others when a machine writes
  => don't demote writers to r/o when another machine reads
  => multiple *different* copies of a page!
    Which copy should a reader look at?
    Diffs help: we can merge writes to the same page
    But when should the diffs be sent?
      No invalidations -> no page faults -> what triggers sending diffs?
  Big idea: release consistency (RC)
    No one should read data w/o holding a lock!
      So let's assume a lock server
    Send out write diffs on release, to *all* machines with a copy of the written page(s)

Example 1 (RC and false sharing):
  x and y are on the same page
  (aN = acquire lock N, rN = release lock N)
    M0: a1 for(...) x++ r1
    M1: a2 for(...) y++ r2
        a1 print x, y r1
  What does RC do?
    M0 and M1 both get a cached writeable copy of the page
    During each release, the releaser computes diffs against the original page
      and sends them to all copies
    M1's a1 causes it to wait until M0's release, so M1 will see M0's writes
  Q: what is the performance benefit of RC?
    What does the conventional approach do with Example 1?
    With RC, multiple machines can have copies of a page, even while one or more write it
      => no bouncing of pages due to false sharing
      => read copies can co-exist with writers
  Q: does RC change the consistency model?
    Yes! M1 won't see M0's writes until M0 releases a lock
    I.e., M1 can see a stale copy of x; that's not possible w/ the general approach
    If you always lock: the locks force an order -> no stale reads
  Q: does RC make sense without write diffs?
    Probably not: diffs are needed to reconcile concurrent writes to the same page
  Q: how to decide which machines to send the diffs to?
    Any machine that has a copy of the page must see the diffs
    The producer-consumer optimization limits the page's copyset to the producer and the consumer

Performance?
  Tables 3 and 4: DSM performance is similar to hand-coded message-passing versions
  How important are the optimizations?
    Table 6 suggests they are important
    But it's not 100% clear: might a better copyset have made
      the producer-consumer optimization insignificant for SOR?

What happened to DSM?
  The cluster approach was a great idea
  Targeting *existing* threaded code was not a long-term win
  Overtaken by MapReduce and its successors
    MR tolerates faults
    MR guides the programmer to a good split of data and computation
    BUT people have found MR too rigid for many parallel tasks
  The last word has not been spoken here
    Much recent work on flexible memory-like cluster programming models
    For example, Spark (see Thursday's paper)

References
  http://www.bailis.org/blog/linearizability-versus-serializability/
  http://www.bailis.org/blog/understanding-weak-isolation-is-a-serious-problem/
  https://aphyr.com/posts/313-strong-consistency-models
  https://www.microsoft.com/en-us/research/publication/replicated-data-consistency-explained-through-baseball/