6.824 2002 Lecture 17: Memory Consistency (2)

Review from previous lecture:
  We want to make it possible to write correct parallel/distributed programs.
  We assume different CPUs interact only through a storage system:
    memory, distributed shared memory, or a file system.
  So we need a "memory consistency model"
    that tells us what to expect when we read/write memory.
  We want a model that:
    is easy to understand, so programmers can easily write correct programs;
    is possible to implement efficiently.
  One reasonable model: sequential consistency.
    Rule 1: all CPUs see all operations in the same total order.
    Rule 2: each CPU executes its own instructions in order.
  Remember our mutual exclusion example:
    CPU0: x = 1; if(y == 0) { critical section; }
    CPU1: y = 1; if(x == 0) { critical section; }
  We want this to work.

Lay out the style of argument:
  There is more than one legal result, depending on the interleaving!
    (not like a uniprocessor)
  Typical question: is such-and-such a result correct under sequential consistency?
    "Yes" if you can demonstrate an interleaving that gets that result.
    "No" if you can show no interleaving could get that result.
    That's the definition of correctness.
  Main example:
    CPU0: w(x)0 w(x)1 r(y)?
    CPU1: w(y)0 w(y)1 r(x)?
  There are interleavings for 1/1, 1/0, and 0/1.
  Show there's no interleaving that yields 0/0.
    (A brute-force check of this appears after the Ivy discussion below.)

How can we implement sequential consistency?

Straw man 1: internet cloud, hosts.
  Assume each host has a big cache and that all data is cached on every host.
  Reads are local, so they are very fast.
  Send a write msg to each other host (but don't wait).
  What goes wrong?
    CPU0 starts the x=1 write, and CPU1 starts the y=1 write.
      I.e. each sends a packet on the network.
    Both read before the other's write message reaches them.
    So they both read 0 and enter the critical section.
    We know this isn't legal for sequential consistency.
  Lesson: a CPU must wait for each operation to complete.

Straw man 2: we can fix this:
  wait for write acks; don't let a write proceed until all acks are collected.
  What goes wrong? (need a different example here)
    CPU0: w(x)1 r(x)?
    CPU1: w(x)2 r(x)?
    Both might write locally, then wait for the remote acks. Or the other way around.
    Either way CPU0 and CPU1 don't agree on the final value of x.
  Lesson: order the operations on each memory location.

It turns out these two rules are sufficient to implement sequential consistency:
  1. Each CPU must complete each operation before starting the next one.
  2. Operations on each memory location must execute one at a time.

These are nice implementation rules:
  They are completely local.
  Assume many CPUs, and memory partitioned across many modules.
  A CPU can enforce instruction order w/o interacting with other CPUs.
  A memory module can enforce order w/o interacting with other modules.

Show that Ivy enforces the sequential consistency implementation rules:
  Each CPU issues instructions one at a time, so per-CPU operations are serial.
  Order for each page (and thus each memory location):
    Reads don't need to be ordered with each other, since they all see the same thing.
      So the copy-set is legal.
    Writes are ordered with each other because the manager ensures
      there's only one owner at a time.
    Writes are ordered w/ reads:
      the manager delays a write until all CPUs ACK invalidation of their copies.
      Otherwise the mutual exclusion example would break: reads would see stale data.
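Aside: mechanically checking the "exhibit an interleaving" argument.
  The yes/no style of argument above can be checked by brute force for small
  examples. The following is a minimal C sketch (not from the lecture; it
  assumes both locations start at 0) that enumerates every interleaving of
  the main example, simulates each one against a single shared memory, and
  records which (r(y), r(x)) result pairs occur. It reports 1/1, 1/0, and
  0/1 as possible and 0/0 as impossible.

  /* Brute-force check of the main example under sequential consistency:
   *   CPU0: w(x)0  w(x)1  r(y)
   *   CPU1: w(y)0  w(y)1  r(x)
   * Enumerate every interleaving that preserves each CPU's program order,
   * apply it to one shared memory, and record the (r(y), r(x)) results.
   */
  #include <stdio.h>

  struct op { char type; char var; int val; };   /* type is 'w' or 'r' */

  static struct op cpu0[] = { {'w','x',0}, {'w','x',1}, {'r','y',0} };
  static struct op cpu1[] = { {'w','y',0}, {'w','y',1}, {'r','x',0} };
  #define NOPS 3

  static int seen[2][2];    /* seen[ry][rx] = 1 if that result pair occurred */

  /* i0/i1: next op index per CPU; x/y: memory; ry/rx: values read so far */
  static void interleave(int i0, int i1, int x, int y, int ry, int rx)
  {
      if (i0 == NOPS && i1 == NOPS) {
          seen[ry][rx] = 1;
          return;
      }
      if (i0 < NOPS) {                      /* CPU0 executes its next op */
          struct op o = cpu0[i0];
          int nx = x, ny = y, nry = ry;
          if (o.type == 'w') { if (o.var == 'x') nx = o.val; else ny = o.val; }
          else nry = (o.var == 'y') ? y : x;
          interleave(i0 + 1, i1, nx, ny, nry, rx);
      }
      if (i1 < NOPS) {                      /* CPU1 executes its next op */
          struct op o = cpu1[i1];
          int nx = x, ny = y, nrx = rx;
          if (o.type == 'w') { if (o.var == 'x') nx = o.val; else ny = o.val; }
          else nrx = (o.var == 'x') ? x : y;
          interleave(i0, i1 + 1, nx, ny, ry, nrx);
      }
  }

  int main(void)
  {
      interleave(0, 0, 0, 0, 0, 0);         /* x = y = 0 initially */
      for (int ry = 0; ry < 2; ry++)
          for (int rx = 0; rx < 2; rx++)
              printf("r(y)=%d r(x)=%d : %s\n", ry, rx,
                     seen[ry][rx] ? "possible" : "impossible");
      return 0;
  }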
In what sense is sequential consistency looser than strict consistency?
  I.e. what are we giving up? I.e. what programs will break?
  Answer: sequential consistency doesn't let you reason about timing.

In general sequential consistency doesn't let you reason about timing.
Example:
  CPU0: w(x)0 w(x)1
  CPU1: w(y)0 w(y)1
  CPU2: r(x)1 r(y)0
  CPU3: r(x)0 r(y)1
  Suppose the reads (of each variable) were issued at the same time.
  Sequential consistency demands that all CPUs agree on the order of w(x)1 and w(y)1.
  But the example seems to suggest that CPU2 and CPU3 must have seen the two
    writes in different orders?
  No: it only suggests that if you think they read *at the same time*,
    and if you think consistency guarantees you read the most recent write.
  Here's an interleaving that shows the result satisfies sequential consistency:
    w(x)0  w(y)0  CPU3:r(x)0  w(x)1  CPU2:r(x)1  CPU2:r(y)0  w(y)1  CPU3:r(y)1
  Why did this happen?
    Maybe the LDs were *started* at the same instant of time.
    But the memory system for x is required to execute them one at a time.
    So they certainly don't actually read the value at the same time.
    In fact the memory system (e.g. Ivy) may be forced to delay an LD,
      due to the page being owned elsewhere, or not being in the local cache.
    That's why you can't reason about the time at which an LD occurred.

To what extent can you optimize sequential consistency?
  You can speed up reads:
    e.g. each CPU keeps a read-cache of all of memory.
    But then you have to slow down writes:
      e.g. all writes have to be broadcast and waited for.
  You can speed up writes:
    e.g. Ivy-style exclusive ownership.
    But then you have to slow down reads: reads require a round trip.
  But you can't make both reads and writes fast, in general,
    because the memory system has to serialize the operations on each location,
    which requires communication.

Example of a faster consistency model?
  We're willing to accept more work for the programmer,
    though we still want a well-defined model,
  and in return we expect faster execution.

Release Consistency
  You rarely see programs like the x = 1; if(y == 0) example,
    because it's so hard to reason about them.
  Instead, parallel programs typically lock data that is shared and mutable.
  Example: bank account transfer:
    acquire(l);
    b1 = b1 + x;
    b2 = b2 - x;
    release(l);
  Other CPUs aren't allowed to look at b1 or b2 while l is locked.
  So the CPU could do the operations in any order within the critical section,
    i.e. load b2 before storing b1.
  Rules:
    1. The CPU can't re-order any LD/ST to before the acquire().
       (Otherwise you might read b1 while someone else holds the lock.)
    2. All writes must finish before the release() completes.
       (Otherwise other CPUs might not see the writes.)
  The point: make the memory system understand locks, i.e. acquire() and release().
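Sketch: the transfer with POSIX threads.
  As a concrete rendering of the two rules (a sketch, not from the lecture),
  here is the bank-account transfer written with pthreads, where
  pthread_mutex_lock()/unlock() play the role of acquire(l)/release(l).
  The lock calls are the points the memory system knows about; everything
  between them may be reordered.

  #include <pthread.h>

  static pthread_mutex_t l = PTHREAD_MUTEX_INITIALIZER;
  static int b1 = 100, b2 = 100;    /* shared, mutable account balances */

  void transfer(int x)
  {
      pthread_mutex_lock(&l);    /* acquire(l): rule 1 -- no load or store of
                                    b1/b2 may be moved above this point */
      b1 = b1 + x;               /* inside the critical section the CPU and
                                    memory system may reorder freely, e.g.  */
      b2 = b2 - x;               /* load b2 before the store to b1 finishes */
      pthread_mutex_unlock(&l);  /* release(l): rule 2 -- both stores must be
                                    visible before the release completes */
  }

  Another CPU that acquires l before touching b1 or b2 is guaranteed to see
  both updates; a CPU that reads them without acquiring l gets no such
  guarantee.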