6.824 Lecture 7: Release consistency

Today:
  Lazy release consistency
  TreadMarks
  Contrast to IVY and sequential consistency
  Lab 5 will implement simplified lazy release consistency on extents.
  Lab 4 lays the groundwork by doing lazy releasing of locks.

Review: what makes a good consistency model?
  Model is a contract between memory system and programmer
    Programmer follows some rules about reads and writes
    Model provides guarantees
  Model embodies a tradeoff
    Intuitive for programmer vs. can be implemented efficiently

TreadMarks high-level goals?
  Better DSM performance.
  Run existing parallel code.

What specific problems with previous DSM are they trying to fix?
  false sharing of different variables on the same page
    only one writer of a page at a time
    a write of irrelevant data may invalidate my page
  send less data over the network

Big idea: write diffs
  don't send the whole page, just the modified locations
  benefit: less network traffic
  can be made to work by themselves:
    on first write: tell other hosts to invalidate but keep a hidden copy
    owner makes a hidden copy as well
    when another CPU reads: owner "diffs" current page against hidden copy
      sends diffs over the network
      other CPU applies diffs to its hidden copy
  but most useful when combined with...

Big idea: release consistency
  only send out write diffs on release of a lock, to all readers
  this is a new consistency model!
    CPU0 won't see CPU1's writes until CPU1 does a release
    so machines can temporarily disagree on memory contents
  benefit?
    multiple machines can have writeable copies of a page
    i.e. no bouncing of pages due to false sharing
    i.e. no invalidates
    relies on write diffs
      otherwise can't reconcile concurrent writes to the same page

Big idea: lazy release consistency
  only fetch write diffs on acquire of a lock
  only fetch from the previous holder of that lock
  i.e., nothing happens at the time of a write or release
  LRC is yet another new consistency model!
    LRC hides some writes that RC reveals
  benefit?
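The diff machinery above can be sketched in a few lines of Python. This is a toy model, assuming byte-granularity diffs and invented names (`make_twin`, `compute_diff`, `apply_diff`), not TreadMarks' actual interface, which uses VM write-protection traps to create twins:

```python
# Toy model of TreadMarks-style write diffs (all names invented for
# illustration; real TreadMarks creates twins via VM protection faults).

PAGE_SIZE = 16  # tiny page for illustration

def make_twin(page):
    # on first write, save a hidden copy ("twin") of the clean page
    return bytes(page)

def compute_diff(twin, page):
    # at release/read time: list the modified locations as (offset, byte)
    return [(i, page[i]) for i in range(len(page)) if page[i] != twin[i]]

def apply_diff(page, diff):
    # receiver patches its stale copy with just the modified locations
    for offset, value in diff:
        page[offset] = value

# x lives at offset 0, y at offset 8, both on the same page (false sharing)
clean = bytearray(PAGE_SIZE)

cpu0_page = bytearray(clean)          # CPU0's writeable copy
twin0 = make_twin(cpu0_page)
cpu0_page[0] = 1                      # CPU0: x = 1

cpu1_page = bytearray(clean)          # CPU1's concurrent writeable copy
twin1 = make_twin(cpu1_page)
cpu1_page[8] = 9                      # CPU1: y = 9

# the two diffs touch different offsets, so they merge cleanly --
# this is why multiple-writer RC "relies on write diffs"
merged = bytearray(clean)
apply_diff(merged, compute_diff(twin0, cpu0_page))
apply_diff(merged, compute_diff(twin1, cpu1_page))
assert merged[0] == 1 and merged[8] == 9
```

If both CPUs had written the same offset, the diffs would conflict; the rule that writes be protected by locks is what makes that a program bug rather than the memory system's problem.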
    less network traffic
    some machines don't see updates at all, even to pages they cache
    if you don't acquire the lock on an object, you don't see updates to it

Example 1 (false sharing)
  x and y are on the same page.
  CPU0: a1 for(...) x++ r1
  CPU1: a2 for(...) y++ r2
                            a1 print x, y r1
  What does IVY do?
  What does TreadMarks do?
    CPU0 and CPU1 both get a cached writeable copy of the page
    when they release, each computes diffs against the original page
    CPU1's a1 causes it to pull write diffs from the last holder of lock 1
      so CPU1 updates x in its page
  How does RC improve performance?
    semantics allow two CPUs to write w/o notifying each other
      thus they can write the same page at the same time
    the page does not bounce back and forth, as in IVY
  How do write diffs improve performance over IVY?
    allow merging of writes to different parts of the same page
    less network traffic

Example 2 (LRC)
  x and y on the same page (otherwise IVY avoids the copy too)
  CPU0: a1 x=1 r1
  CPU1: a2 y++ r2
  CPU2:            a1 print x r1
  What does IVY do?
  What does TreadMarks do?
    CPU2 only asks the previous holder of lock 1 for write diffs
    CPU2 does not see CPU1's modification to y, even though it's on the same page
  Q: is LRC a win over IVY if all variables are on different pages?
    (no)
  Q: why is LRC a reasonably intuitive model for programmers?

Example 3 (motivating vector timestamps)
  CPU0: a1 x=1 r1
  CPU1:            a1 y=x r1
  CPU2:                       a1 print x, y r1
  What's the "right" answer?
    we need to define what LRC guarantees!
    answer: when you acquire a lock, you see
      all modifications by the previous holder
      and all modifications the previous holder saw
  What does TreadMarks do?
    CPU2 and CPU1 need to decide what CPU2 needs and doesn't already have
    uses "vector timestamps"
    each CPU numbers its releases (i.e. write diffs)
    CPU1 tells CPU2: at release, had seen CPU0's writes through #20, &c:
      0: 20
      1: 25
      2: 19
      3: 36
      ...
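The release-count bookkeeping above can be sketched as a small Python comparison; a minimal model with invented names (`missing_intervals` is not TreadMarks' real interface), using the numbers from the example:

```python
# Toy vector timestamps: a VT maps cpu_id -> highest release number whose
# write diffs this CPU has seen (names invented for illustration).

def missing_intervals(mine, theirs):
    # on acquire: which (cpu, release#) diffs must I fetch to catch up
    # with the releasing CPU's view of memory?
    need = []
    for cpu, their_count in sorted(theirs.items()):
        for release in range(mine.get(cpu, 0) + 1, their_count + 1):
            need.append((cpu, release))
    return need

cpu1_vt = {0: 20, 1: 25, 2: 19, 3: 36}   # CPU1's VT at its release
cpu2_vt = {0: 20, 1: 23, 2: 19, 3: 36}   # CPU2's VT at its acquire

# CPU2 is missing only CPU1's releases 24 and 25
print(missing_intervals(cpu2_vt, cpu1_vt))   # -> [(1, 24), (1, 25)]
```

In real TreadMarks a release interval carries write notices (which pages changed), and the diffs themselves are fetched lazily per page; this sketch shows only the VT comparison.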
    a per-CPU list like this is a "vector timestamp" (VT)
  CPU2 remembers a vector timestamp of the writes it has seen
  CPU2 compares with CPU1's VT to see what writes it needs from other CPUs
  Q: this scheme implies that if CPU1 has seen CPU0's 20th write,
     CPU1 must also have seen all previous CPU0 writes.
     is that required by the LRC model?
     yes, for writes that might have affected the computation of the 20th write.
     maybe not for writes that didn't.
  More on VTs next lecture...

VTs order writes to the same variable by different CPUs:
  CPU0: a1 x=1 r1  a2 y=9 r2
  CPU1:    a1 x=2 r1
  CPU2:                       a1 a2 z = x + y r2 r1
  CPU2 is going to hear "x=1" from CPU0, and "x=2" from CPU1.
  How does CPU2 know what to do?

Could the VTs for two values of the same variable not be ordered?
  CPU0: a1 x=1 r1
  CPU1:    a2 x=2 r2
  CPU2:              a1 a2 print x r2 r1

What rules does the programmer have to follow?
What can the programmer expect from the memory system?
  1. must lock around all writes (to order writes to the same var)
  2. to read the latest value, must lock
  3. if no lock for a read, guaranteed to see the values that
     contributed to the variables you did lock

Example of when LRC might work too hard.
  CPU0: a2 z=99 r2  a1 x=1 r1
  CPU1:                        a1 y=x r1
  TreadMarks will send z to CPU1, because it comes before x=1 in VT order.
    Assuming x and z are on the same page.
    Even if on different pages, CPU1 must invalidate z's page.
  But CPU1 doesn't use z.
  How could a system understand that z isn't needed?
    Compiler figures out what locations the program reads?
      CPU1 only asks about variables it will use.
    Relax the causal part of the LRC model?
      So you must lock all data you read.

Q: could TreadMarks work without using VM page protection?
  1. mark all pages r/o to help find out what pages to generate diffs for
  2. mark pages invalid to defer having to fetch diffs
  neither is really crucial
  so TM doesn't depend on VM as much as IVY does

Performance?
  Figure 3 shows mostly good scaling
    is that the same as "good"?
    though apparently Water does lots of locking / sharing
  How close are they to best possible performance?
    maybe Figure 5 implies there is only about 20% fat to be cut
  Does LRC beat previous DSM schemes?
    they only compare against their own straw-man ERC,
      not against the best known prior work
    Figure 9 suggests not much win, even for Water

Has DSM been successful?
  clusters of cooperating machines are hugely successful
  DSM, not so much
  main justification is transparency for existing threaded code
    that's not interesting for new apps
    and transparency makes it hard to get high performance
  MapReduce, DryadLINQ, or storage+messages more common than DSM