6.824 2002 Lecture 17: Memory Consistency (2)

Review from previous lecture:
  We want to make it possible to write correct parallel/distributed programs.
  We assume different CPUs interact only through a storage system:
    memory, distributed shared memory, or a file system.
  So we need a "memory consistency model"
    that tells us what to expect when we read/write memory.
  We want a model that:
    is easy to understand, so programmers can easily write correct programs;
    is possible to implement efficiently.
  One reasonable model: sequential consistency.
    Rule 1: all CPUs see all operations in the same total order.
    Rule 2: each CPU executes its own instructions in order.
  Remember our mutual exclusion example:
    CPU0: x = 1; if(y == 0) { critical section; }
    CPU1: y = 1; if(x == 0) { critical section; }
  We want this to work.

Lay out the style of argument:
  There is more than one legal result, depending on the interleaving!
    (not like a uniprocessor)
  Typical question: is such-and-such a result correct under sequential consistency?
    "Yes" if you can demonstrate an interleaving that gets that result.
    "No" if you can show no interleaving could get that result.
    That's the definition of correctness.
  Main example:
    CPU0: w(x)0 w(x)1 r(y)?
    CPU1: w(y)0 w(y)1 r(x)?
  There are interleavings for 1/1, 1/0, and 0/1.
  Show there's no interleaving that yields 0/0.
    (A brute-force check of this appears after the Ivy discussion below.)

How can we implement sequential consistency?

Straw man 1: internet cloud, hosts.
  Assume each host has a big cache and that all data is cached on every host.
  Reads are local, so they are very fast.
  Send a write msg to each other host (but don't wait).
  What goes wrong?
    CPU0 starts the x=1 write, and CPU1 starts the y=1 write.
      I.e. each sends a packet on the network.
    Both read before the other's write message reaches them.
    So they both read 0 and enter the critical section.
    We know this isn't legal for sequential consistency.
  Lesson: a CPU must wait for each operation to complete.

Straw man 2: we can fix this:
  wait for write acks; don't let a write proceed until all acks are collected.
  What goes wrong? (need a different example here)
    CPU0: w(x)1 r(x)?
    CPU1: w(x)2 r(x)?
    Both might write locally, then wait for the remote acks. Or the other way around.
    Either way CPU0 and CPU1 don't agree on the final value of x.
  Lesson: order the operations on each memory location.

It turns out these two rules are sufficient to implement sequential consistency:
  1. Each CPU must complete each operation before starting the next one.
  2. Operations on each memory location must execute one at a time.

These are nice implementation rules:
  They are completely local.
  Assume many CPUs, and memory partitioned across many modules.
  A CPU can enforce instruction order w/o interacting with other CPUs.
  A memory module can enforce order w/o interacting with other modules.

Show that Ivy enforces the sequential consistency implementation rules:
  Each CPU issues instructions one at a time, so per-CPU operations are serial.
  Order for each page (and thus each memory location):
    Reads don't need to be ordered with each other, since they all see the same thing.
      So the copy-set is legal.
    Writes are ordered with each other because the manager ensures
      there's only one owner at a time.
    Writes are ordered w/ reads:
      the manager delays a write until all CPUs ACK invalidation of their copies.
      Otherwise the mutual exclusion example would break: reads would see stale data.
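Aside: mechanically checking the "exhibit an interleaving" argument.
  The yes/no style of argument above can be checked by brute force for small
  examples. The following is a minimal C sketch (not from the lecture; it
  assumes both locations start at 0) that enumerates every interleaving of
  the main example, simulates each one against a single shared memory, and
  records which (r(y), r(x)) result pairs occur. It reports 1/1, 1/0, and
  0/1 as possible and 0/0 as impossible.

  /* Brute-force check of the main example under sequential consistency:
   *   CPU0: w(x)0  w(x)1  r(y)
   *   CPU1: w(y)0  w(y)1  r(x)
   * Enumerate every interleaving that preserves each CPU's program order,
   * apply it to one shared memory, and record the (r(y), r(x)) results.
   */
  #include <stdio.h>

  struct op { char type; char var; int val; };   /* type is 'w' or 'r' */

  static struct op cpu0[] = { {'w','x',0}, {'w','x',1}, {'r','y',0} };
  static struct op cpu1[] = { {'w','y',0}, {'w','y',1}, {'r','x',0} };
  #define NOPS 3

  static int seen[2][2];    /* seen[ry][rx] = 1 if that result pair occurred */

  /* i0/i1: next op index per CPU; x/y: memory; ry/rx: values read so far */
  static void interleave(int i0, int i1, int x, int y, int ry, int rx)
  {
      if (i0 == NOPS && i1 == NOPS) {
          seen[ry][rx] = 1;
          return;
      }
      if (i0 < NOPS) {                      /* CPU0 executes its next op */
          struct op o = cpu0[i0];
          int nx = x, ny = y, nry = ry;
          if (o.type == 'w') { if (o.var == 'x') nx = o.val; else ny = o.val; }
          else nry = (o.var == 'y') ? y : x;
          interleave(i0 + 1, i1, nx, ny, nry, rx);
      }
      if (i1 < NOPS) {                      /* CPU1 executes its next op */
          struct op o = cpu1[i1];
          int nx = x, ny = y, nrx = rx;
          if (o.type == 'w') { if (o.var == 'x') nx = o.val; else ny = o.val; }
          else nrx = (o.var == 'x') ? x : y;
          interleave(i0, i1 + 1, nx, ny, ry, nrx);
      }
  }

  int main(void)
  {
      interleave(0, 0, 0, 0, 0, 0);         /* x = y = 0 initially */
      for (int ry = 0; ry < 2; ry++)
          for (int rx = 0; rx < 2; rx++)
              printf("r(y)=%d r(x)=%d : %s\n", ry, rx,
                     seen[ry][rx] ? "possible" : "impossible");
      return 0;
  }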
In what sense is sequential consistency looser than strict consistency?
  I.e. what are we giving up? I.e. what programs will break?
  Answer: sequential consistency doesn't let you reason about timing.

In general sequential consistency doesn't let you reason about timing.
Example:
  CPU0: w(x)0 w(x)1
  CPU1: w(y)0 w(y)1
  CPU2: r(x)1 r(y)0
  CPU3: r(x)0 r(y)1
  Suppose the reads (of each variable) were issued at the same time.
  Sequential consistency demands that all CPUs agree on the order of w(x)1 and w(y)1.
  But the example seems to suggest that CPU2 and CPU3 must have seen the two
    writes in different orders?
  No: it only suggests that if you think they read *at the same time*,
    and if you think consistency guarantees you read the most recent write.
  Here's an interleaving that shows the result satisfies sequential consistency:
    w(x)0  w(y)0  CPU3:r(x)0  w(x)1  CPU2:r(x)1  CPU2:r(y)0  w(y)1  CPU3:r(y)1
  Why did this happen?
    Maybe the LDs were *started* at the same instant of time.
    But the memory system for x is required to execute them one at a time.
    So they certainly don't actually read the value at the same time.
    In fact the memory system (e.g. Ivy) may be forced to delay an LD,
      due to the page being owned elsewhere, or not being in the local cache.
    That's why you can't reason about the time at which an LD occurred.

To what extent can you optimize sequential consistency?
  You can speed up reads:
    e.g. each CPU keeps a read-cache of all of memory.
    But then you have to slow down writes:
      e.g. all writes have to be broadcast and waited for.
  You can speed up writes:
    e.g. Ivy-style exclusive ownership.
    But then you have to slow down reads: reads require a round trip.
  But you can't make both reads and writes fast, in general,
    because the memory system has to serialize the operations on each location,
    which requires communication.

Example of a faster consistency model?
  We're willing to accept more work for the programmer,
    though we still want a well-defined model,
  and in return we expect faster execution.

Release Consistency
  You rarely see programs like the x = 1; if(y == 0) example,
    because it's so hard to reason about them.
  Instead, parallel programs typically lock data that is shared and mutable.
  Example: bank account transfer:
    acquire(l);
    b1 = b1 + x;
    b2 = b2 - x;
    release(l);
  Other CPUs aren't allowed to look at b1 or b2 while l is locked.
  So the CPU could do the operations in any order within the critical section,
    i.e. load b2 before storing b1.
  Rules:
    1. The CPU can't re-order any LD/ST to before the acquire().
       (Otherwise you might read b1 while someone else holds the lock.)
    2. All writes must finish before the release() completes.
       (Otherwise other CPUs might not see the writes.)
  The point: make the memory system understand locks, i.e. acquire() and release().
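Sketch: the transfer with POSIX threads.
  As a concrete rendering of the two rules (a sketch, not from the lecture),
  here is the bank-account transfer written with pthreads, where
  pthread_mutex_lock()/unlock() play the role of acquire(l)/release(l).
  The lock calls are the points the memory system knows about; everything
  between them may be reordered.

  #include <pthread.h>

  static pthread_mutex_t l = PTHREAD_MUTEX_INITIALIZER;
  static int b1 = 100, b2 = 100;    /* shared, mutable account balances */

  void transfer(int x)
  {
      pthread_mutex_lock(&l);    /* acquire(l): rule 1 -- no load or store of
                                    b1/b2 may be moved above this point */
      b1 = b1 + x;               /* inside the critical section the CPU and
                                    memory system may reorder freely, e.g.  */
      b2 = b2 - x;               /* load b2 before the store to b1 finishes */
      pthread_mutex_unlock(&l);  /* release(l): rule 2 -- both stores must be
                                    visible before the release completes */
  }

  Another CPU that acquires l before touching b1 or b2 is guaranteed to see
  both updates; a CPU that reads them without acquiring l gets no such
  guarantee.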