6.824 2004 Lecture 16: Group Multicast Group Multicast an excersize in thinking about ordering and implementing it has turned out to be a useful building block example: thursday's paper Applications in general, a group of applications that want to hear each others messages location service or name service mailing list distributed chat room or bulletin board distributed game or simulation stock market ticker tape: companies make automated trading decisions based on the ticker tape, it might be bad if different companies saw different tapes. Applications vary in semantic needs location service: just order my own updates to my own record mailing list: show replies after original message distributed game: everyone sees events in the same order ticker tape: maybe total chat room: causal or total? The challenge is to provide a range of semantics so you only pay for what you need there are a bunch of group multicast toolkits around Delivery model there is a "group" of processes on a network they all agree on the group membership list the multicast s/w for each process may "hold" a message internally then deliver the message when the time is right that is, there's a distinction between the host receiving a msg and the multicast library delivering the msg to the process A typical group multicast menu: 1. Basic: live hosts hear if sender doesn't crash 2. Reliable: every live host hears, or no live hosts hear (note this is a slightly odd sense of reliability, not just "keep resending" which is Basic, but a notion of atomicity) 3. FIFO order: if one process sends m1 then m2, any delivery of m2 will have been preceded by m1 4. Total order: if any process sees m1 delivered then m2, then any process that sees m2 will have first seen m1 5. Causal order: if one host sees (or sends) m1 then sends m2, then any process that sees m2 will see m1 first Note: definition of ordered schemes does not imply "Reliable" total order allows some processes to proceed while others are blocked you can combine ordering with Reliable by layering but then everyone has to block if any one host is slow! causal implies FIFO but total does not imply causal (and may not imply FIFO) Basic multicast sender sends a copy to each host, keeps sending until ACK per-sender sequence numbers to eliminate duplicates might use underlying multicast for efficiency might use NAKs to avoid ACK implosion not "Reliable": sender might fail after a few sends Reliable multicast (1) orignal sender uses Basic multicast to each host on receive: ignore if we've already seen the sequence number Basic multicast (again!) to each host only then delivery to the process this works but it is not efficient Reliable multicast (2) all we really need to know is that the others heard it so we can deliver the message we don't have to actually send it again unless they didn't hear it on receipt: just multicast the sequence number to each other host other host either says yes, or please send me the msg or piggyback highest seq you've heard from each host on each msg don't deliver a msg until you've heard its seq from all other hosts FIFO order send with Basic to make sure everyone eventually gets it rcvr hold msgs until all previous msgs from same host have arrived easy to do with sequence numbers Total order (1) send all msgs to a single sequencing host sequencer assigns each msg a sequence number sequencer Basic-multicasts to the group probably slow! Total order (2) let's get rid of the sequencing host! distribute the next seq # state over the nodes "distributed Lamport clock" every host maintains a variable X holding highest seq seen so far from any source: its proposals or msgs it sends or receives form: S.H S is an integer, H is a host-id H used only to break ties if S is the same 1. send message to every host 2. each host sends back X + 1.ID, increments X 3. sender picks highest sequence number, multicasts it receiver: keeps all msgs in a queue, sorted by seq #, proposed or agreed re-orders into the queue if it gets a different seq # in 3rd phase delivers if 3rd phase and at head of queue does this work? what if concurrent senders? does everyone agree on the order? if one got to all nodes first, no problem if mixture, then they might both see the same high X value but they cannot have seen it from the same host! due to X = X + 1. so they must differ in the ID part, and everyone will agree is it safe to deliver? can another lower seq# sneak in? if other msg arrived after, we gave it higher X, sender takes MAX if other msg arrived first, it will be earlier in the queue anyway, so our msg won't be first Causal order goal: if one host sees (or sends) m1 then sends m2, then any process that sees m2 will see m1 first so each message needs to describe what the sender had seen the point: may be faster than total since some messages aren't ordered i.e. receivers wait less until they deliver each host maintains a vector timestamp for each other host, highest seq # delivered from each host sends increment host's entry in its VT, then carry copy of the VT on receive, can deliver immediately if VT shows it's the next msg from sender, and if all other elements of msg VT are <= host's VT then increment element of sender otherwise hold the message until host VT catches up Retrospective Total order is a really good idea, replicated state machines, we will be seeing more of this style Causal order not so useful, typically either not really needed as a guarantee or else applications (e.g. msg-IDs in e-mail headers) had enough information anyway. What did I leave out? What if a host crashes? Or joins? What about partition?