6.824 2004 Lecture 14: Memory Consistency (2) Review from previous lecture: We want to make it possible to write correct parallel/distributed programs. We assume different CPUs interact only through a storage system. Memory, distributed shared memory, or a file system. So we need a "memory consistency model" That tells us what to expect when we read/write memory. We want a model that: Is easy to understand, so programmers can easily write correct programs. Is possible to implement efficienctly. One reasonable model: sequential consistency There must be some total order of operations such that 1. all CPUs see results consistent with that total order i.e. reads see most recent write in the total order 2. each CPU's instructions appear in-order in the total order Intuitive justification: The single total order means it's easy for one CPU to predict what other CPUs will see The "consistent with" and lack of real time may make it easy to implement The system appears free to interleave instruction streams however it likes to form the total order However! When executing in real time, once the system reveals a written value to a read operation, the system has committed to a little bit of partial order. this may have transitive effects. So in real life the system only has freedom in ordering more or less concurrent operations -- ones that haven't been observed yet Remember our mutual exclusion example: CPU0: x = 1; if(y == 0) { critical section; } CPU1: y = 1; if(x == 0) { critical section; } We want this to work. Lay out style of argument there is more than one legal result, depending on interleaving! (not like uniprocessor) typical question: is xxx a correct result under sequential consistency? "yes" if you can demonstrate an interleaving that gets that result "no" if you can show no interleaving could get that result that's the definition of correctness main example: CPU0: w(x)0 w(x)1 r(y)? CPU1: w(y)0 w(y)1 r(x)? we can evaluate all legal seq consistency interleavings manually: 1/1? 1/0? 0/1? 0/0? [only 0/0 is illegal] Good: sequential consistency gets the intuitive result we expected How can we implement sequential consistency? straw man 1: internet cloud, hosts assume each host has a big cache and that all data is cached on every host reads are local, so they are very fast send write msg to each other host (but don't wait) what goes wrong? CPU0 starts the x=1 write, and CPU1 starts y=1 write. I.e. they send a packet on the network. Both read before the write message reaches them So they both read 0 and enter the critical section. We know this isn't legal for sequential consistency. Lesson: a CPU must wait for each operation to complete. [to give the system time to order concurrent operations] straw man 2: we can fix this by changing write: write local cache send write msgs to other CPUs wait for ACKs from all other CPUs only then proceed to instruction after the write this fixes our mutex example if CPU0's r(y) = 0, then CPU0 has not sent write ACK for CPU1's w(y)1, so CPU1 has not executed r(x). BUT if CPU0 executed r(y), then it had gotten CPU1's write ACK, so when CPU1 executes r(x), it will see 1. what goes wrong? turns out we need a new example CPU0: w(x)1 r(x)? CPU1: w(x)2 r(x)? legal seq consistency results? 1/1 2/2 1/2 BUT NOT 2/1 can we get 2/1 w/ straw man 2 implementation? if both write local cache, then wait for remote. how about wait, then write local? then CPU0's x=1, CPU1's x=2 as if this total order: w(x)1 r(x)1 w(x)2 r(x)2 we've lost replica equality, but not (yet) violated seq consistency but system cannot now let CPU0 execute r(x)! Lesson: for each memory location, execute operations one at a time These two rules are sufficient to implement sequential consistency: 1. Operations on each CPU must execute one at a time 2. Operations on each memory location must execute one at a time. XXX why is this the case? What kind of implementation would fit well with these rules? Single entity in charge of ordering each CPU's operations (i.e. the CPU). Single entity in charge of ordering each location's operations. You don't need a central entity to choose the single total order! Example: partition memory over multiple modules on a network. Send all memory ops to home module. Divides up memory load nicely for good parallelism. Like Stanford FLASH. Does your lab 5 enforce sequential consistency for i-node blocks? Each machine's operations are ordered (due to lock client code...) Lock server serializes read and write ops by different machines. Lots of details: e.g. wait for block server reply before releasing lock. So you don't *need* to send all memory operations to home module. Home module can grant ownership using tokens or locks. To what extent can you optimize sequential consistency? Delegate ownership via tokens So home module is serializing token grants, not memory operations Memory operations execute (mostly) in local caches This makes single-writer workloads fast Shared read caching also works CPUs cannot tell there was no global total order for reads Still need to serialize writes through home module But you can't make both reads and writes fast, in general Because memory system has to serialize operations for each location Which requires communication Example of a faster consistency model? We're willing to accept more work for the programmer. Though we still want a well-defined model. And in return we expect faster execution. Release Consistency You rarely see programs like the a=1; if(b==0) example. Because it's so hard to reason about them. Instead, parallel programs typically lock data that is shared and mutable. To create atomic multi-step sequences. (Not the same as cache ownership tokens...) Example: bank account transfer: acquire(l); b1 = b1 + x; b2 = b2 - x; release(l); Other CPUs aren't allowed to look at b1 or b2 while l is locked. So CPU could do the operations in any order within the critical section. I.e. load b2 before storing b1. Rules: 1. CPU can't re-order any LD/ST before the acquire(). (otherwise you might read b1 while someone else has the lock) 2. Writes must finish before release() completes. (otherwise other CPUs might not see the writes) Can re-order, cache, &c within release/acquire, so fast. But: memory system must understand locks, acquire(), and release(). Tomorrow's paper (Treadmarks) is all about implementing release consistency.