6.824 2005 Lecture 12: Memory Consistency (2)

Review from previous lecture:
  We want to make it possible to write correct parallel/distributed programs.
  We assume different CPUs interact only through a storage system:
    memory, distributed shared memory, or a file system.
  So we need a "memory consistency model"
    that tells us what to expect when we read/write memory.
  We want a model that:
    is easy to understand, so programmers can easily write correct programs;
    is possible to implement efficiently.

One reasonable model: sequential consistency
  Is an execution (a set of operations) correct?
  There must be some total order of the operations such that
    1. all CPUs see results consistent with that total order
       (i.e. reads see the most recent write in the total order)
    2. each CPU's instructions appear in-order in the total order
  Intuitive justification:
    The single total order means it's easy for one CPU to predict
      what other CPUs will see.
    The "consistent with" and the lack of real time may make it easy to
      implement: the system appears free to interleave the instruction
      streams however it likes to form the total order.
    However! When executing in real time, once the system reveals a
      written value to a read operation, it has committed to a little
      bit of partial order, and this may have transitive effects.
    So in real life the system only has freedom in ordering more or less
      concurrent operations -- ones that haven't been observed yet.

Remember our mutual-exclusion example:
  CPU0: x = 1; if(y == 0) { critical section; }
  CPU1: y = 1; if(x == 0) { critical section; }
  We want this to work.

The style of argument:
  There is more than one legal result, depending on the interleaving!
    (Not like a uniprocessor.)
  Typical question: is such-and-such a correct result under sequential
  consistency?
    "Yes" if you can demonstrate an interleaving that gets that result.
    "No" if you can show that no interleaving could get that result.
  Main example:
    CPU0: w(x)0 w(x)1 r(y)?
    CPU1: w(y)0 w(y)1 r(x)?
    We can evaluate all legal sequentially consistent interleavings
    manually: 1/1? 1/0? 0/1? 0/0?
    [only 0/0 is illegal]
  Good: sequential consistency gives our example intuitive results.

How can we implement sequential consistency?

Straw man 1: hosts connected by an internet cloud.
  Assume each host has a big cache, and that all data is cached on
    every host.
  Reads are local, so they are very fast.
  Send a write msg to each other host (but don't wait).
  What goes wrong?
    CPU0 starts the x=1 write and CPU1 starts the y=1 write,
      i.e. each sends a packet on the network.
    Both read before their write is visible,
      so they both read 0 and enter the critical section.
    I.e. the read is before the write in the total order;
      this violates Rule 2.
  Lesson: each CPU must wait for each operation to complete.

Straw man 2: we can achieve per-CPU order by changing write
(sketched below):
  write the local cache
  send write msgs to the other CPUs
  wait for ACKs from all other CPUs
  only then proceed to the instruction after the write
  This fixes our mutex example:
    if CPU0's r(y) = 0, then CPU0 has not sent the ACK for CPU1's
    w(y)1, so CPU1 has not yet executed r(x).
  What goes wrong? It turns out we need a new example:
    CPU0: w(x)1 r(x)?
    CPU1: w(x)2 r(x)?
    Legal sequentially consistent results? 1/1 2/2 1/2 BUT NOT 2/1.
      2/1 would violate Rule 1.
    Can we get 2/1 with the straw man 2 implementation?
      Yes: if both write their local caches, then wait for the remote
      write ACKs. More generally, whenever writes arrive in different
      orders at different CPUs.
  Lesson: for each memory location, execute operations one at a time.
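Here is straw man 2's write path made concrete, as a minimal sketch in
C. All of the names (sc_write, send_write_msg, wait_for_acks, NCPUS,
cache, my_cpu) are made up for illustration, not part of any real
system:

  enum { NCPUS = 4, MEMSIZE = 1024 };  /* assumed configuration        */
  int cache[MEMSIZE];                  /* this CPU's local cache       */
  int my_cpu;                          /* this CPU's id                */
  /* hypothetical messaging helpers: */
  void send_write_msg(int cpu, int addr, int val); /* async write msg  */
  void wait_for_acks(int n);           /* block until n ACKs arrive    */

  void sc_write(int addr, int val)
  {
    cache[addr] = val;                 /* 1. update the local cache    */
    for (int c = 0; c < NCPUS; c++)    /* 2. tell every other CPU      */
      if (c != my_cpu)
        send_write_msg(c, addr, val);
    wait_for_acks(NCPUS - 1);          /* 3. wait for all the ACKs and */
                                       /*    only then return, so the  */
                                       /*    next instruction can't    */
                                       /*    start early               */
  }

The wait in step 3 keeps each CPU's own operations in program order,
but nothing orders two CPUs' write msgs for the same location, which is
how the 2/1 result sneaks in.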
These two rules are sufficient to implement sequential consistency:
  1. Each CPU executes reads/writes in program order, one at a time.
  2. Each memory location executes reads/writes in arrival order,
     one at a time.
  (Proof in Lamport 1979.)

What kind of implementation would fit well with these rules?
  A single entity in charge of ordering each CPU's operations
    (i.e. the CPU itself).
  A single entity in charge of ordering each location's operations.
  You don't need a central entity to choose the single total order!
  Example: partition memory over multiple modules on a network.
    Send all memory ops to the relevant module.
    Divides up the memory load nicely, for good parallelism.

Does your lab 5 enforce sequential consistency for i-node blocks?
  Each machine's operations on an i-node are ordered
    (due to the lock client code...).
  The lock server serializes read and write ops on an i-node by
    different machines.
  Lots of details: e.g. wait for the block server's reply before
    releasing the lock.
  But across i-nodes: no! ccfs issues concurrent operations.
  So you don't *need* to send all memory operations to a home module;
    the home module can instead grant ownership using tokens or locks.

To what extent can you optimize sequential consistency?
  Delegate ownership via tokens,
    so the home module serializes token grants, not memory operations.
  Memory operations then execute (mostly) in local caches.
    This makes single-writer workloads fast.
  Shared read caching also works:
    CPUs cannot tell that there was no global total order for the reads.
  You still need to serialize writes through the home module.
  But you can't make both reads and writes fast, in general,
    because the memory system has to serialize the operations on each
    location, and that requires communication.

In what sense is sequential consistency looser than strict consistency?
  I.e. what are we giving up? What programs will break?
  Answer: sequential consistency doesn't let you reason about timing.
  In general it doesn't let you reason based on real time:
    CPU0: w(x)0 w(x)1
    CPU1: w(y)0 w(y)2
    CPU2: r(y)? r(x)?
    Suppose an observer knows the operations occurred in this temporal
    order.
    Strict consistency requires r(y)2 r(x)1,
      but sequential consistency allows either or both reads to
      return zero.
  You *can* reason based on per-CPU instruction order and observed
  values:
    e.g. if CPU1 instead executes if(x==1) y=2, then r(y)2 => r(x)1,
    because w(x)1 must precede CPU1's read of x, which precedes w(y)2,
    which precedes CPU2's r(y)2, which precedes CPU2's r(x) in the
    total order.

Example of a faster consistency model?
  We're willing to accept more work for the programmer,
    though we still want a well-defined model,
    and in return we expect faster execution.

Release consistency
  You rarely see programs like the a=1; if(b==0) example,
    because they're so hard to reason about.
  Instead, parallel programs typically lock data that is shared and
    mutable, to create atomic multi-step sequences.
    (Not the same as cache-ownership tokens...)
  Example: a bank account transfer:
    acquire(l);
    b1 = b1 + x;
    b2 = b2 - x;
    release(l);
  Other CPUs aren't allowed to look at b1 or b2 while l is locked,
    so the CPU could do the operations in any order within the
    critical section, i.e. load b2 before storing b1.
  Rules (sketched in code at the end of these notes):
    1. The CPU can't re-order any LD/ST before the acquire().
       (Otherwise you might read b1 while someone else holds the lock.)
    2. Writes must finish before the release() completes.
       (Otherwise other CPUs might not see the writes.)
  The system can re-order, cache, &c within the acquire/release window,
    so it's fast.
  But: the memory system must understand locks, acquire(), and release().
  Tomorrow's paper (TreadMarks) is all about implementing release
    consistency.
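To make the two rules concrete, here is a minimal sketch of what
acquire() and release() might do in a release-consistent DSM. This is
not TreadMarks' actual implementation; rpc_acquire, rpc_release,
flush_local_writes, and memory_barrier are assumed helper names:

  typedef struct lock lock_t;
  /* hypothetical helpers: */
  void rpc_acquire(lock_t *l);     /* get the lock from a lock server  */
  void rpc_release(lock_t *l);     /* give the lock back               */
  void flush_local_writes(void);   /* send buffered writes to the      */
                                   /* other CPUs and wait for ACKs     */
  void memory_barrier(void);       /* compiler/CPU fence               */

  void acquire(lock_t *l)
  {
    rpc_acquire(l);
    memory_barrier();      /* Rule 1: no later LD/ST may move above
                              this point, so we can't read b1 or b2
                              while someone else holds the lock.       */
  }

  void release(lock_t *l)
  {
    flush_local_writes();  /* Rule 2: all writes from the critical
                              section must finish before the release
                              completes...                             */
    memory_barrier();
    rpc_release(l);        /* ...so the next holder of l sees them.    */
  }

Between the acquire() and the release(), the memory system is free to
buffer and re-order loads and stores, which is where the speed comes
from.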