6.824 2007 Lecture 15: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

introduction

2-phase commit is good if different nodes are doing different things
  but in general you have to wait for all sites and the TC to be up
  you have to know if each site voted yes or no
  and the TC must be up to decide
  not very fault-tolerant: has to wait for repair

can we get work done even if some nodes can't be contacted?
  yes: in the special case of replication

state machine replication
  works for any kind of replicated service: storage or lock server or whatever
  every replica must see the same operations in the same order
  if deterministic, replicas will end up with the same state

how to ensure all replicas see operations in the same order?

primary + backup(s)
  clients send all operations to the current primary
  the primary chooses the order, sends to backups, replies to the client

what if the primary fails?
  need to worry about that last operation, possibly not complete
  need to pick a new primary
  can't afford to have two primaries!

suppose the lowest-numbered live server is the primary
  so after a failure, everyone pings everyone
  then everyone knows who the new primary is?
  well, maybe not:
    pings may be lost => two primaries
    pings may be delayed => two primaries
    partition => two primaries

idea: a majority of nodes must agree on the primary
  at most one network partition can have a majority
  if there are two potential primaries, their majorities must overlap

technique: a "view change" algorithm
  the system goes through a sequence of views
  view: a view# and a set of participants
  ensure agreement on the unique successor of each view
  the participant set allows everyone to agree on the new primary

view change requires "fault-tolerant agreement"
  at most a single value is chosen
  agree despite lost messages and crashed nodes
  we can't really guarantee to agree
  but we can guarantee to *not* "agree" on different values!

Paxos: fault-tolerant agreement protocol
  eventually succeeds if a majority of participants are reachable
  best known algorithm

general Paxos approach
  one (or more) nodes decide to be the leader
  the leader chooses a proposed value to agree on (view# and participant set)
  the leader contacts participants, tries to assemble a majority
    participants are all the nodes in the old view (including unreachable ones)
    or a fixed set of configuration master nodes
  if a majority respond, we announce the result

why agreement is hard
  what if two nodes decide to be the leader?
  what if a network partition leads to two leaders?
  what if the leader crashes after persuading only some of the nodes?
  what if the leader got a majority, then failed, without announcing the result?
    or announced the result to only a few nodes?
    a new leader might choose a different value, even though we agreed

Paxos has three phases
  may have to start over if there are failures/timeouts
  see the handout with code (included below) or the lab 8 assignment

we run an instance of this protocol for each view
  the n's in the protocol are proposal numbers, not view numbers
  per view a node may make many proposals
  each of the proposals is numbered in increasing order
    (see the sketch below for one way to generate unique, increasing n's)
  done is set to false when a node suspects a problem with a node in the
    current view and invokes Paxos to create the next view

if at any time any node gets bored (times out)
  it declares itself a leader and starts a new Phase 1

if nothing goes wrong, Paxos clearly reaches agreement
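aside: a minimal Python sketch (names are illustrative, not the lab's code) of
the proposal-number rule from the handout below, "my_n = max(n_h, my_n)+1,
append node ID"; pairing a counter with the node ID keeps proposal numbers
unique and totally ordered:

from typing import NamedTuple

class ProposalNum(NamedTuple):
    count: int    # bumped past max(n_h, my_n) on every new proposal
    node_id: int  # appended so two nodes can never produce the same n

def next_proposal(n_h: ProposalNum, my_n: ProposalNum, node_id: int) -> ProposalNum:
    # Mirrors the handout's "my_n = max(n_h, my_n)+1, append node ID".
    return ProposalNum(max(n_h.count, my_n.count) + 1, node_id)

# NamedTuples compare lexicographically, so proposal numbers are totally
# ordered and a node's later proposals are always larger than its earlier ones:
assert ProposalNum(11, 3) > ProposalNum(10, 5) > ProposalNum(10, 2)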
how do we ensure a good probability that there is only one leader?
  every node has to be prepared to be leader, to cope w/ failure
  so delay a random amount of time after you realize a new view is required
  or delay your ID times some constant

key danger: nodes w/ different v_a receive a donereq
  goal: if a donereq *could* have been sent, future donereqs are
    guaranteed to have the same v_a

what if more than one leader?
  due to timeout or partition or lost packets
  suppose the two leaders used different n, say 10 and 11
  if 10 didn't get a majority to acceptres
    it never will, since no one will acceptres 10 after seeing 11's preparereq
    or perhaps 10 is in a network partition
  if 10 did get a majority to acceptres
    i.e. it might have sent a donereq
    10's majority saw 10's acceptreq before 11's preparereq
      otherwise they would have ignored 10's acceptreq, so no majority
    so 11 will get a prepareres from at least one node that saw 10's acceptreq
    so 11 will be aware of 10's value
    so 11 will use 10's value, rather than making a new one
    so we agreed on a v after all

what if the leader fails before sending acceptreqs?
  some node will time out and become a leader
  the old leader didn't send any donereq, so we don't care what it did
  it's good, but not necessary, that the new leader chooses a higher n
    if it doesn't, timeout, and some other leader will try
    eventually we'll get a leader that knew the old n and will use a higher n

what if the leader fails after sending a minority of acceptreqs?
  same as two leaders...

what if the leader fails after sending a majority of acceptreqs?
  i.e. potentially after reaching agreement!
  same as two leaders...

example scenario:
  say the last view's v = {1,2,3}
  2 fails
  1 becomes leader, receives prepareres from 1 and 3, chooses v = {1,3}
  1 sends acceptreq to 1 and 3, and fails
  3 will start running Paxos, but cannot assemble a majority; it keeps trying
  2 reboots, reloads state from disk (nothing related to this view change)
  2 becomes a leader and sends a preparereq with n higher than 1 used
  3's prepareres will include v_a = {1,3}, so 2 must propose v = {1,3}
  2 and 3 will switch to view {1,3}
  2 will run Paxos again to add itself
    no majority of {1,3} is alive, however, to add in 2
    must wait until 1 is back up
  (note that if we change the protocol, we could continue, because 3
   knows that 1 cannot have a majority---because that would be 1 and
   2---but 2 is talking to 3 to make a view)

what if a node fails after receiving an acceptreq?
  if it doesn't restart, possible timeout in Phase 3, new leader
  if it does restart, it must remember v_a/n_a! (on disk)
    the leader might have failed after sending a few donereqs
    the new leader must choose the same value
    our node might be the intersecting node of the two majorities

what if a node reboots after sending a prepareres?
  does it have to remember n_h on disk?
  it uses n_h to reject preparereq/acceptreq with smaller n
  scenario:
    leader1 sends preparereq(n=10), a bare majority sends prepareres
      so node X's n_h = 10
    leader2 sends preparereq(n=11), a majority intersecting only at node X
      sends prepareres
      node X's n_h = 11
    leader2 got no prepareres with a value, so it chooses v=200
    node X crashes and reboots, loses n_h
    leader1 sends acceptreq(n=10, v=100), its bare majority gets it
      including node X (which should have rejected it...)
      so we have agreement w/ v=100
    leader2 sends acceptreq(n=11, v=200)
      its bare majority all accept the message
      including node X, since 11 > n_h
      so we have agreement w/ v=200.  oops.
  so: each node must remember n_h on disk (see the sketch below)
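aside: a minimal Python sketch (hypothetical names; a plain JSON file stands in
for the on-disk log, and a real implementation would also fsync) of an acceptor
that writes n_h, n_a, and v_a to disk before replying, so a reboot cannot make
it forget a promise or an accepted value:

import json
import os

class Acceptor:
    """Handles preparereq/acceptreq for a single Paxos instance."""

    def __init__(self, path="paxos_state.json"):
        self.path = path
        self.n_h = 0                  # highest proposal # seen in a preparereq
        self.n_a, self.v_a = 0, None  # highest proposal # accepted, and its value
        if os.path.exists(path):      # after a reboot, reload what we promised/accepted
            with open(path) as f:
                s = json.load(f)
            self.n_h, self.n_a, self.v_a = s["n_h"], s["n_a"], s["v_a"]

    def _save(self):
        # Persist *before* replying; if we replied first and then crashed, we
        # could later accept an acceptreq with a smaller n (the n_h scenario
        # above) or forget the v_a/n_a that a new leader is counting on.
        with open(self.path, "w") as f:
            json.dump({"n_h": self.n_h, "n_a": self.n_a, "v_a": self.v_a}, f)

    def preparereq(self, n):
        if n > self.n_h:
            self.n_h = n
            self._save()
            return ("prepareres", self.n_a, self.v_a)
        return ("reject",)

    def acceptreq(self, n, v):
        if n >= self.n_h:
            self.n_a, self.v_a = n, v
            self._save()
            return ("acceptres",)
        return ("reject",)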
conclusion
  what have we achieved?
  remember the original goal was replicated state machines
    and we want to continue even if some nodes are not available
  after each failure we can perform a view change using Paxos agreement
    that is, we can agree on exactly which nodes are in the new view
    so, for example, everyone can agree on a single new primary
  but we haven't talked at all about how to manage the data

------------------ paxos.txt --------------------

state:
  n_a, v_a: highest proposal # which this node has accepted, and its value
  n_h: highest proposal # seen in a prepare
  my_n: the last proposal # this node has used in this round of Paxos
  vid_h: highest view number we have accepted
  views: map of past view numbers to values
  done: leader says agreement was reached, we can start the new view

on each view change, initialize state:
  n_a = 0
  n_h = 0
  my_n = 0
  v_a = ()   // empty list

Paxos Phase 1
  a node (maybe more than one...) decides to be leader
  (it may not be in the current view):
    my_n = max(n_h, my_n)+1, append node ID   // unique proposal number
    done = false
    sends prepare(vid_h+1, my_n) to all nodes in
      {views[vid_h], initial contact node, itself}
  if node receives prepare(vid, n):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n > n_h:
      n_h = n
      done = false
      return prepareres(n_a, v_a)
    else:
      return reject()

Paxos Phase 2
  if leader gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart paxos
  else if leader gets reject():
    delay and restart paxos
  else if leader gets prepareres from majority of nodes in views[vid_h]:
    if any prepareres(n_i, v_i) exists such that v_i is not empty:
      v = non-empty value v_i corresponding to highest n_i received
    else leader gets to choose a value:
      v = set of pingable nodes (including self)
    send accept(vid_h+1, my_n, v) to all responders
  else:
    delay and restart paxos
  if node gets accept(vid, n, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n >= n_h:
      n_a = n
      v_a = v
      return acceptres()
    else:
      return reject()

Paxos Phase 3
  if leader gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart paxos
  else if leader gets acceptres from a majority of nodes in views[vid_h]:
    send decide(vid_h+1, v_a) to all (including self)
  else:
    delay and restart paxos
  if node gets decide(vid, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else:
      done = true
      primary is lowest-numbered node in v
      views[vid] = v
      vid_h = vid
      view change
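aside: the critical rule in Phase 2 above is that the leader may pick its own
value only if no prepareres carried one; a minimal Python sketch of that choice
(function and variable names are illustrative, not from the handout):

def choose_value(prepareres_list, my_choice):
    """prepareres_list: (n_a, v_a) pairs from a majority of views[vid_h].
    my_choice: the value this leader would propose if it were free to,
    e.g. its current set of pingable nodes."""
    accepted = [(n_a, v_a) for (n_a, v_a) in prepareres_list if v_a]
    if accepted:
        # Some node already accepted a value, so a donereq *could* have been
        # sent; we must re-propose the value with the highest n_a.
        return max(accepted)[1]
    return my_choice

# In the earlier scenario, node 3 already accepted {1,3} under 1's proposal,
# so the new leader (2) must propose {1,3} again instead of its own choice:
assert choose_value([(0, None), (10, [1, 3])], [2, 3]) == [1, 3]
assert choose_value([(0, None), (0, None)], [2, 3]) == [2, 3]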