6.824 Lecture 7: Release consistency

Today:
  Lazy release consistency
  TreadMarks
  Contrast to IVY and sequential consistency
  Lab 5 will implement simplified lazy release consistency on extents.
  Lab 4 lays the groundwork by doing lazy releasing of locks.

Review: what makes a good consistency model?
  Model is a contract between memory system and programmer
    Programmer follows some rules about reads and writes
    Model provides guarantees
  Model embodies a tradeoff
    Intuitive for programmer vs. can be implemented efficiently

TreadMarks high-level goals?
  Better DSM performance.
  Run existing parallel code.

What specific problems with previous DSM are they trying to fix?
  false sharing of different variables on the same page
    only one writer of a page at a time
    a write of irrelevant data may invalidate my page
  send less data over the network

Big idea: write diffs
  don't send the whole page, just the modified locations
  benefit: less network traffic
  can be made to work by themselves:
    on first write: tell other hosts to invalidate but keep a hidden copy
    owner makes a hidden copy as well
    when another CPU reads: owner "diffs" current page against hidden copy
      sends diffs over the network
      other CPU applies diffs to its hidden copy
  but most useful when combined with...

Big idea: release consistency
  only send out write diffs on release of a lock, to all readers
  this is a new consistency model!
    CPU0 won't see CPU1's writes until CPU1 does a release
    so machines can temporarily disagree on memory contents
  benefit?
    multiple machines can have writeable copies of a page
    i.e. no bouncing of pages due to false sharing
    i.e. no invalidates
    relies on write diffs
      otherwise can't reconcile concurrent writes to the same page

Big idea: lazy release consistency
  only fetch write diffs on acquire of a lock
  only fetch from the previous holder of that lock
  i.e., nothing happens at the time of a write or release
  LRC is yet another new consistency model!
    LRC hides some writes that RC reveals
  benefit?
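The diff machinery above can be sketched in a few lines of Python. This is a toy model, assuming byte-granularity diffs and invented names (`make_twin`, `compute_diff`, `apply_diff`), not TreadMarks' actual interface, which uses VM write-protection traps to create twins:

```python
# Toy model of TreadMarks-style write diffs (all names invented for
# illustration; real TreadMarks creates twins via VM protection faults).

PAGE_SIZE = 16  # tiny page for illustration

def make_twin(page):
    # on first write, save a hidden copy ("twin") of the clean page
    return bytes(page)

def compute_diff(twin, page):
    # at release/read time: list the modified locations as (offset, byte)
    return [(i, page[i]) for i in range(len(page)) if page[i] != twin[i]]

def apply_diff(page, diff):
    # receiver patches its stale copy with just the modified locations
    for offset, value in diff:
        page[offset] = value

# x lives at offset 0, y at offset 8, both on the same page (false sharing)
clean = bytearray(PAGE_SIZE)

cpu0_page = bytearray(clean)          # CPU0's writeable copy
twin0 = make_twin(cpu0_page)
cpu0_page[0] = 1                      # CPU0: x = 1

cpu1_page = bytearray(clean)          # CPU1's concurrent writeable copy
twin1 = make_twin(cpu1_page)
cpu1_page[8] = 9                      # CPU1: y = 9

# the two diffs touch different offsets, so they merge cleanly --
# this is why multiple-writer RC "relies on write diffs"
merged = bytearray(clean)
apply_diff(merged, compute_diff(twin0, cpu0_page))
apply_diff(merged, compute_diff(twin1, cpu1_page))
assert merged[0] == 1 and merged[8] == 9
```

If both CPUs had written the same offset, the diffs would conflict; the rule that writes be protected by locks is what makes that a program bug rather than the memory system's problem.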
    less network traffic
    some machines don't see updates at all, even to pages they cache
    if you don't acquire the lock on an object, you don't see updates to it

Example 1 (false sharing)
  x and y are on the same page.
  CPU0: a1 for(...) x++ r1
  CPU1: a2 for(...) y++ r2
                            a1 print x, y r1
  What does IVY do?
  What does TreadMarks do?
    CPU0 and CPU1 both get a cached writeable copy of the page
    when they release, each computes diffs against the original page
    CPU1's a1 causes it to pull write diffs from the last holder of lock 1
      so CPU1 updates x in its page
  How does RC improve performance?
    semantics allow two CPUs to write w/o notifying each other
      thus they can write the same page at the same time
    the page does not bounce back and forth, as in IVY
  How do write diffs improve performance over IVY?
    allow merging of writes to different parts of the same page
    less network traffic

Example 2 (LRC)
  x and y on the same page (otherwise IVY avoids the copy too)
  CPU0: a1 x=1 r1
  CPU1: a2 y++ r2
  CPU2:            a1 print x r1
  What does IVY do?
  What does TreadMarks do?
    CPU2 only asks the previous holder of lock 1 for write diffs
    CPU2 does not see CPU1's modification to y, even though it's on the same page
  Q: is LRC a win over IVY if all variables are on different pages?
    (no)
  Q: why is LRC a reasonably intuitive model for programmers?

Example 3 (motivating vector timestamps)
  CPU0: a1 x=1 r1
  CPU1:            a1 y=x r1
  CPU2:                       a1 print x, y r1
  What's the "right" answer?
    we need to define what LRC guarantees!
    answer: when you acquire a lock, you see
      all modifications by the previous holder
      and all modifications the previous holder saw
  What does TreadMarks do?
    CPU2 and CPU1 need to decide what CPU2 needs and doesn't already have
    uses "vector timestamps"
    each CPU numbers its releases (i.e. write diffs)
    CPU1 tells CPU2: at release, had seen CPU0's writes through #20, &c:
      0: 20
      1: 25
      2: 19
      3: 36
      ...
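The release-count bookkeeping above can be sketched as a small Python comparison; a minimal model with invented names (`missing_intervals` is not TreadMarks' real interface), using the numbers from the example:

```python
# Toy vector timestamps: a VT maps cpu_id -> highest release number whose
# write diffs this CPU has seen (names invented for illustration).

def missing_intervals(mine, theirs):
    # on acquire: which (cpu, release#) diffs must I fetch to catch up
    # with the releasing CPU's view of memory?
    need = []
    for cpu, their_count in sorted(theirs.items()):
        for release in range(mine.get(cpu, 0) + 1, their_count + 1):
            need.append((cpu, release))
    return need

cpu1_vt = {0: 20, 1: 25, 2: 19, 3: 36}   # CPU1's VT at its release
cpu2_vt = {0: 20, 1: 23, 2: 19, 3: 36}   # CPU2's VT at its acquire

# CPU2 is missing only CPU1's releases 24 and 25
print(missing_intervals(cpu2_vt, cpu1_vt))   # -> [(1, 24), (1, 25)]
```

In real TreadMarks a release interval carries write notices (which pages changed), and the diffs themselves are fetched lazily per page; this sketch shows only the VT comparison.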
    a per-CPU list like this is a "vector timestamp" (VT)
  CPU2 remembers a vector timestamp of the writes it has seen
  CPU2 compares with CPU1's VT to see what writes it needs from other CPUs
  Q: this scheme implies that if CPU1 has seen CPU0's 20th write,
     CPU1 must also have seen all previous CPU0 writes.
     is that required by the LRC model?
     yes, for writes that might have affected the computation of the 20th write.
     maybe not for writes that didn't.
  More on VTs next lecture...

VTs order writes to the same variable by different CPUs:
  CPU0: a1 x=1 r1  a2 y=9 r2
  CPU1:    a1 x=2 r1
  CPU2:                       a1 a2 z = x + y r2 r1
  CPU2 is going to hear "x=1" from CPU0, and "x=2" from CPU1.
  How does CPU2 know what to do?

Could the VTs for two values of the same variable not be ordered?
  CPU0: a1 x=1 r1
  CPU1:    a2 x=2 r2
  CPU2:              a1 a2 print x r2 r1

What rules does the programmer have to follow?
What can the programmer expect from the memory system?
  1. must lock around all writes (to order writes to the same var)
  2. to read the latest value, must lock
  3. if no lock for a read, guaranteed to see the values that
     contributed to the variables you did lock

Example of when LRC might work too hard.
  CPU0: a2 z=99 r2  a1 x=1 r1
  CPU1:                        a1 y=x r1
  TreadMarks will send z to CPU1, because it comes before x=1 in VT order.
    Assuming x and z are on the same page.
    Even if on different pages, CPU1 must invalidate z's page.
  But CPU1 doesn't use z.
  How could a system understand that z isn't needed?
    Compiler figures out what locations the program reads?
      CPU1 only asks about variables it will use.
    Relax the causal part of the LRC model?
      So you must lock all data you read.

Q: could TreadMarks work without using VM page protection?
  1. mark all pages r/o to help find out what pages to generate diffs for
  2. mark pages invalid to defer having to fetch diffs
  neither is really crucial
  so TM doesn't depend on VM as much as IVY does

Performance?
  Figure 3 shows mostly good scaling
    is that the same as "good"?
    though apparently Water does lots of locking / sharing
  How close are they to best possible performance?
    maybe Figure 5 implies there is only about 20% fat to be cut
  Does LRC beat previous DSM schemes?
    they only compare against their own straw-man ERC,
      not against the best known prior work
    Figure 9 suggests not much win, even for Water

Has DSM been successful?
  clusters of cooperating machines are hugely successful
  DSM, not so much
  main justification is transparency for existing threaded code
    that's not interesting for new apps
    and transparency makes it hard to get high performance
  MapReduce, DryadLINQ, or storage+messages more common than DSM