6.824 2004 Lecture 16: Group Multicast

Group Multicast
  an excersize in thinking about ordering and implementing it
  has turned out to be a useful building block
  example: thursday's paper

Applications
  in general, a group of applications that want to hear each others messages
  location service or name service
  mailing list
  distributed chat room or bulletin board
  distributed game or simulation
  stock market ticker tape: companies make automated trading decisions
    based on the ticker tape, it might be bad if different companies
    saw different tapes.

Applications vary in semantic needs
  location service: just order my own updates to my own record
  mailing list: show replies after original message
  distributed game: everyone sees events in the same order
  ticker tape: maybe total
  chat room: causal or total?

The challenge is to provide a range of semantics
  so you only pay for what you need
  there are a bunch of group multicast toolkits around

Delivery model
  there is a "group" of processes on a network
  they all agree on the group membership list
  the multicast s/w for each process may "hold" a message internally
  then deliver the message when the time is right
  that is, there's a distinction between the host receiving
    a msg and the multicast library delivering the msg to the process

A typical group multicast menu:
  1. Basic: live hosts hear if sender doesn't crash
  2. Reliable: every live host hears, or no live hosts hear
     (note this is a slightly odd sense of reliability, not just
      "keep resending" which is Basic, but a notion of atomicity)
  3. FIFO order: if one process sends m1 then m2, any delivery
     of m2 will have been preceded by m1
  4. Total order: if any process sees m1 delivered then m2, then
     any process that sees m2 will have first seen m1
  5. Causal order: if one host sees (or sends) m1 then sends m2,
     then any process that sees m2 will see m1 first

Note:
  definition of ordered schemes does not imply "Reliable"
  total order allows some processes to proceed while others are blocked
  you can combine ordering with Reliable by layering
  but then everyone has to block if any one host is slow!
  causal implies FIFO
  but total does not imply causal (and may not imply FIFO)

Basic multicast
  sender sends a copy to each host, keeps sending until ACK
  per-sender sequence numbers to eliminate duplicates
  might use underlying multicast for efficiency
  might use NAKs to avoid ACK implosion
  not "Reliable": sender might fail after a few sends

Reliable multicast (1)
  orignal sender uses Basic multicast to each host
  on receive:
    ignore if we've already seen the sequence number
    Basic multicast (again!) to each host
    only then delivery to the process
  this works but it is not efficient

Reliable multicast (2)
  all we really need to know is that the others heard it
    so we can deliver the message
    we don't have to actually send it again unless they didn't hear it
  on receipt:
    just multicast the sequence number to each other host
      other host either says yes, or please send me the msg
  or piggyback highest seq you've heard from each host on each msg
    don't deliver a msg until you've heard its seq from all other hosts

FIFO order
  send with Basic to make sure everyone eventually gets it
  rcvr hold msgs until all previous msgs from same host have arrived
  easy to do with sequence numbers

Total order (1)
  send all msgs to a single sequencing host
  sequencer assigns each msg a sequence number
  sequencer Basic-multicasts to the group
  probably slow!

Total order (2)
  let's get rid of the sequencing host!
  distribute the next seq # state over the nodes
  "distributed Lamport clock"
  every host maintains a variable X holding highest seq seen so far
    from any source: its proposals or msgs it sends or receives
  form: S.H
    S is an integer, H is a host-id
    H used only to break ties if S is the same
  1. send message to every host
  2. each host sends back X + 1.ID, increments X
  3. sender picks highest sequence number, multicasts it
  receiver:
    keeps all msgs in a queue, sorted by seq #, proposed or agreed
    re-orders into the queue if it gets a different seq # in 3rd phase
    delivers if 3rd phase and at head of queue
  does this work?
    what if concurrent senders? does everyone agree on the order?
    if one got to all nodes first, no problem
    if mixture, then they might both see the same high X value
    but they cannot have seen it from the same host! due to X = X + 1.
    so they must differ in the ID part, and everyone will agree
  is it safe to deliver? can another lower seq# sneak in?
    if other msg arrived after, we gave it higher X, sender takes MAX
    if other msg arrived first, it will be earlier in the queue anyway,
      so our msg won't be first

Causal order
  goal: if one host sees (or sends) m1 then sends m2,
        then any process that sees m2 will see m1 first
  so each message needs to describe what the sender had seen
  the point: may be faster than total since some messages aren't ordered
    i.e. receivers wait less until they deliver
  each host maintains a vector timestamp
    for each other host, highest seq # delivered from each host
  sends increment host's entry in its VT, then carry copy of the VT
  on receive, can deliver immediately if VT shows it's the
    next msg from sender, and if all other elements of msg VT
    are <= host's VT
    then increment element of sender
  otherwise hold the message until host VT catches up

Retrospective
  Total order is a really good idea, replicated state machines,
    we will be seeing more of this style
  Causal order not so useful, typically either not really needed
    as a guarantee or else applications (e.g. msg-IDs in e-mail headers)
    had enough information anyway.

What did I leave out?
  What if a host crashes? Or joins?
  What about partition?