6.824 - Fall 2007

6.824 Lab 8: Paxos

Last change: $Date: 2007/11/16 16:15:59 $

Due: Wednesday, November 21st, 11:59pm.


Introduction

For the replicated state machine approach to be useful, it needs to handle nodes failing and (re-)joining (perhaps after rebooting); in this lab, you will implement Paxos, an agreement protocol, to accomplish that goal. Whenever a node discovers a failure (e.g., the heartbeater thread in Lab 7 times out, or the master attempts to invoke a command on a failed slave), it signals the manager thread in paxos.cc. This halts all processing of incoming requests to rsm, and the manager runs Paxos to agree on which nodes are in a particular view of the system. A view is an iteration of the configuration of the system, and each view has a monotonically increasing view number associated with it. In this lab, you will use Paxos to agree on the list of nodes for a particular view number.

After Paxos runs to completion successfully, the system will end up in a new view that consists of a majority of the nodes from the previous view. Once we have a new view, we can declare the node with the lowest ID the current master, and rsm can start processing client requests again.

In this lab you will ensure that the replicated state machine works correctly while members are joining and leaving the system. The primary task is to implement Paxos, and use it to allow nodes to join the RSM and survive failures of RSM nodes.

Paxos is a more complicated algorithm than the ones you have implemented in previous labs (it has three phases). To help you, read the paper Paxos made simple, and attend the corresponding lecture.

Getting Started

Since you are building on the past labs, ensure the code in your Lab 7 directory passes all tests for Labs 1, 2, 3, 4, 5, 6, and 7 before starting in on this lab.

Begin by initializing your Lab 8 directory with your implementation from Lab 7 and then updating it with some new infrastructure code in http://pdos.csail.mit.edu/6.824/labs/yfs-lab8.tgz for Lab 8.

% wget -nc http://pdos.csail.mit.edu/6.824/labs/yfs-lab8.tgz
% rsync -av l7/* l8
% tar xzvf yfs-lab8.tgz

This will add the new infrastructure files for Lab 8 to your l8/ directory.

The new code will overwrite the Makefile, view.{cc,h}, and rsm_tester.pl files from your previous labs, so you may have to redo any changes in those files.

Testing

The measure of success for this lab is to be able to survive failures of replicas (including the master) as well as joins (and re-joins) of replicas. rsm_tester.pl now defines nine different test cases your code must pass (including the three tests from Lab 7). By the end of the lab, your code must be able to pass the full set of tests:
% ./rsm_tester.pl
test1...
...
test2...
...
test3...
...
test4...
...
test5...
...
test6...
...
test7...
...
test8...
...
test9...
...
tests done OK

You should also ensure that the code passes all the file system tests for Lab 4 once Paxos is implemented.

Paxos-izing the Replicated State Machine

Each instance of the replicated state machine module goes through a sequence of view numbers, starting with 0 (no nodes are part of the system). For this lab we have only one instance of the replicated state machine library, namely the one that you used in Lab 7 to replicate the caching lock server from Lab 5.

Corresponding with each view number is a value, which for our Paxos implementation is the list of nodes that are part of that view (that is, currently participating as replicas in the RSM). When the system starts from scratch, the first node creates view 1 using as a value a list containing only its own identifier.

As nodes join and leave, each node uses Paxos to move from one view number to the next view number, and all nodes go through all view numbers. If node 2 joins after the first node, it starts out in view 0. It will propose view 1 with a list of nodes containing just itself as the value. It will learn from node 1 that view 1 already exists, and it will learn the nodes in view 1. Now that node 2 has caught up, it can propose view 2 containing the nodes 1 and 2. And so on.

Typically nodes re-join the system after they recover from failure, and start from the latest view they were part of. To be able to do so correctly, Paxos logs view changes and other Paxos-related state (see the protocol below).

To be able to bring other nodes up to date with the latest formed view, each node keeps a complete history of all view numbers and their values that it knows about. At any time a node can reboot, and when it re-joins, it may be many views behind; by remembering all views, the other nodes can bring this re-joined node up to date.
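
One natural in-memory representation of this history (the member name here is only illustrative; use whatever the provided skeleton declares) is a map from view number to the list of member identifiers:

#include <map>
#include <string>
#include <vector>

// Illustrative only: complete history of agreed views, keyed by view number.
// For example, views[1] might be ("node1") and views[2] ("node1", "node2").
std::map<unsigned int, std::vector<std::string> > views;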

One node in each view is the master for the replicated state machine. This can be any node as long as all nodes agree on the same node. We use the convention that the node with the lowest node identifier in the view is the master.

When the master is stable (i.e., not running Paxos), the replicated state machine can process client requests. (The Paxos class provides a stable() method to test whether a node is currently stable.) If a node is running Paxos to create a new view, the replicated state machine should tell the RSM client to back off and retry later. Hopefully by that time, Paxos has been able to create a new view and the system can proceed.

A node in the replicated state machine invokes Paxos when it discovers that some node in the current view is not responding. For example, when a slave's heartbeat to the master times out, or an invoke request from the master to a slave times out, the node should run Paxos to propose a new view. In such a case, the replicated state machine calls Paxos's change() method. This method should wake up a manager thread, which will then start running the Paxos protocol. When Paxos is done, the node can start processing requests again. Since each node may independently decide to run Paxos, several instances of Paxos may be running in the system with different leaders. The protocol sorts that out correctly.

More concretely, there is a thread inside paxos that runs the Paxos protocol. In simple pseudocode:

manager() { 
   while (1) { 
     if (done) pthread_cond_wait(manager) 
     paxos();   // if paxos succeeds, it will set done to true
   } 
}

This thread is kicked into action from the outside using the change() method:

change() {
   if (done) {
     done = false; 
     pthread_cond_signal(manager);
   }
}
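
For reference, here is a minimal self-contained sketch of that manager/change pattern using pthreads. The member names (pxs_mutex, pxs_cond, done) and the run_paxos() helper are illustrative, not the names used by the provided paxos.cc skeleton; run_paxos() stands for one full attempt at the three phases described below.

#include <pthread.h>

// Illustrative names; the provided skeleton's members may differ.
static pthread_mutex_t pxs_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pxs_cond  = PTHREAD_COND_INITIALIZER;
static bool done = true;           // true when no view change is in progress

static void run_paxos();           // one attempt at Phases 1-3; sets done on success

// Manager thread: sleeps while done, re-runs Paxos until a new view is agreed on.
void *
manager(void *)
{
  pthread_mutex_lock(&pxs_mutex);
  while (1) {
    while (done)
      pthread_cond_wait(&pxs_cond, &pxs_mutex);
    pthread_mutex_unlock(&pxs_mutex);
    run_paxos();                   // should set done = true (under the mutex) on success
    pthread_mutex_lock(&pxs_mutex);
  }
  return 0;
}

// Called by the RSM when it detects a failed (or joining) node.
void
change()
{
  pthread_mutex_lock(&pxs_mutex);
  if (done) {
    done = false;
    pthread_cond_signal(&pxs_cond);
  }
  pthread_mutex_unlock(&pxs_mutex);
}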

When a joining node becomes part of a new view, it may discover while running Paxos that it missed some views and state changes. Paxos will send a transferstate request to the node that advertises the most up-to-date state during the Paxos protocol. That node invokes an application-specific method to marshall the application state and sends the result back to the requesting node, which then calls its own application-specific method to unmarshall that state and install it. For this lab, the application-specific state is the state of the caching lock server.

The Paxos Protocol

Below is the pseudocode for the basic protocol you should implement. The provided paxos skeleton class and protocol contain member variables, RPCs, and RPC handlers corresponding to this code.

state:
  n_a, v_a: highest proposal # and corresponding value this node has accepted
  n_h: highest proposal # seen in a prepare
  my_n: the last proposal # the node has used in this round of Paxos
  vid_h: highest view number we have accepted
  views: map of past view numbers to values
  done: leader says agreement was reached, we can start new view

on each view change, initialize state:
  n_a = 0
  n_h = 0
  my_n = 0
  v_a = () // empty list

Paxos Phase 1
  a node (maybe more than one...) decides to be leader (need not be in current view):
    my_n = max(n_h, my_n)+1, append node ID  // unique proposal number
    done = false
    sends prepare(vid_h+1, my_n) to all nodes in {views[vid_h], initial contact node, itself}
  if node receives prepare(vid, n):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n > n_h:
      n_h = n
      done = false
      return prepareres(n_a, v_a)
    else:
      return reject()

Paxos Phase 2
  if leader gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart paxos
  else if leader gets reject():
    delay and restart paxos
  else if leader gets prepareres from majority of nodes in views[vid_h]:
    if any prepareres(n_i, v_i) exists such that v_i is not empty:
      v = non-empty value v_i corresponding to highest n_i received
    else leader gets to choose a value:
      v = set of pingable nodes (including self)
    send accept(vid_h+1, my_n, v) to all responders
  else:
    delay and restart paxos
  if node gets accept(vid, n, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n >= n_h:
      n_a = n
      v_a = v
      return acceptres()
    else:
      return reject()

Paxos Phase 3
  if leader gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart paxos
  else if leader gets acceptres from a majority of nodes in views[vid_h]:
    send decide(vid_h+1, v_a) to all (including self)
  else:
    delay and restart paxos
  if node gets decide(vid, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else:
      done = true
      primary is lowest-numbered node in v
      views[vid] = v
      vid_h = vid
      view change

Note that there are two types of numbers in this system: view numbers and proposal numbers. For a given view, potentially many nodes can make potentially many proposals for a new view, and each of these proposals has a unique proposal number. When comparing different proposals, the highest proposal number wins. To ensure that each proposal number is unique, each proposal consists of a number and the node's identifier. We provide you with a struct prop_t in paxos_protocol.h that you should use for proposal numbers; we also provide the > and >= operators for it.
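
To make the "append node ID" idea concrete, here is a hypothetical sketch of such a pair type and its comparison; it is not the provided prop_t, whose actual field names and operators are defined in paxos_protocol.h.

#include <string>

// Hypothetical proposal number: a round number plus the proposing node's identifier.
// The provided prop_t in paxos_protocol.h plays this role; use it instead of this.
struct proposal_num {
  unsigned int n;    // monotonically increasing round number
  std::string id;    // node identifier; breaks ties between concurrent leaders

  bool operator>(const proposal_num &o) const {
    return n > o.n || (n == o.n && id > o.id);
  }
  bool operator>=(const proposal_num &o) const {
    return *this > o || (n == o.n && id == o.id);
  }
};

// A new leader picks a number strictly larger than anything it has seen:
//   my_n.n = max(n_h.n, my_n.n) + 1;  my_n.id = my own identifier;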

At any time a node can decide it wants to start a view change, and start Paxos off in Phase 1. If nothing goes wrong, and there are no concurrent proposals for the next view, Paxos clearly reaches agreement. However, since many nodes can become leaders at the same time, how do we ensure with good probability that there is only one leader? Each leader should delay a random amount of time before beginning phase 1; furthermore if a leader learns of another instance of Paxos started with a higher proposal number for the same view, it will delay for a random amount of time and then attempt another proposal. In this way the system will eventually converge with high probability. Note that paxos.cc already provides a dodelay() method for this purpose.

Each Paxos node must log every change to its state (in particular the n_a, v_a, and n_h fields), as well as log every view change. The provided log class does this for you; please use it without modification, as the test program depends on its output in a particular format.

To implement the complete protocol perform the following steps, which allow you to build and test the protocol incrementally.

Step One: Implement Paxos

Implement the Paxos protocol listed above, but don't worry about logging and failures yet. At the end of this step, you need only be able to pass 'rsm_tester.pl 1' again.

To begin, make a small change to your rsm object in order to use Paxos. It should initialize a paxos object in the constructor, using the same view and rpcs objects as the rsm. (It no longer needs to set the master on the view object.)

Then of course you must also fill in the paxos skeleton class with your Paxos implementation. Try to follow the pseudocode provided above, and use the RPC protocol we provide in paxos_protocol.h. Note that though the pseudocode shows different types of responses to each kind of RPC, our protocol combines these responses into one type of return structure. For example, the prepareres struct can act as a prepareres, an oldview, or a reject message, depending on the situation. Be aware that you may not need all fields in all the messages for this step, but you will probably find them useful in later steps.
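
One way to picture this combined reply (purely illustrative; the real prepareres and its field names are defined in paxos_protocol.h) is a single structure whose flags tell the proposer which of the three logical answers it received:

#include <string>
#include <vector>

// Hypothetical shape of a combined Phase 1 reply; consult paxos_protocol.h
// for the actual prepareres definition.
struct prepare_reply {
  bool oldview;                  // set: the proposed view number already exists
  bool accept;                   // set: a genuine prepareres (promise)
  unsigned int vid;              // the existing view number, when oldview is set
  std::vector<std::string> v;    // view members (oldview) or the acceptor's v_a
  // ...plus the acceptor's n_a when accept is set
};

// Proposer side, for each reply r:
//   if (r.oldview)     adopt views[r.vid] = r.v, do a view change, restart Paxos
//   else if (r.accept) count r toward the majority, track the highest n_a / v_a
//   else               treat it as a reject: delay and restart Paxos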

Now, when the first node starts, it will call the paxos constructor, and it should initialize view 1 to itself. When the second node starts, it will also call the paxos constructor, and it should try to invoke Paxos to make view 1, sending node 1 its proposed view. Node 1 will inform node 2 that view 1 already exists (and provide it with the value). Node 2 should switch to view 1. Then, it should try to create view 2 in order to get a view that includes itself. A similar sequence will occur for subsequent joins.

Once this works, you should be able to pass 'rsm_tester.pl 1'.

Step Two: Simple failures

Next you should handle the simple failure cases of a single slave or master failing, as you did in Lab 7.

When dealing with failed nodes, rsm no longer needs to remove members from the view or call set_master -- these can be replaced by calling change on the paxos object. (You should probably remove the break statement in the heartbeater thread, as it allows only one change to a new master.)

RSM objects should not process any invoked requests while Paxos is unstable. To accomplish this, you should add a BUSY status message to the RSM client protocol, and return it whenever stable() returns false for the paxos object. The rsm_client should wait for a small amount of time after receiving a BUSY, and then try again. Furthermore, the rsm_client should periodically fetch a list of the current replicas (perhaps whenever it receives a NOTMASTER status).
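
A minimal sketch of that client-side retry loop follows; the status values and the two helper functions are hypothetical stand-ins for whatever your rsm client protocol and rsm_client actually provide.

#include <unistd.h>

// Hypothetical status codes and helpers; use the real ones from your protocol headers.
enum { OK = 0, BUSY, NOTMASTER };
int send_invoke_to_master();   // hypothetical: send the invoke RPC to the presumed master
void refresh_members();        // hypothetical: re-fetch the current list of replicas

int
invoke_with_retry()
{
  while (1) {
    int ret = send_invoke_to_master();
    switch (ret) {
    case OK:
      return OK;
    case BUSY:
      usleep(100 * 1000);      // Paxos is forming a new view; back off briefly, then retry
      break;
    case NOTMASTER:
      refresh_members();       // learn the new view, then retry against its master
      break;
    default:
      return ret;              // give up on unexpected errors
    }
  }
}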

Once this works, you should be able to start 3 nodes, kill any one of them, and have the system continue to process requests. At this point, you should be able to run 'rsm_tester.pl 1 2 3' again.

Step Three: Implement state transfer

In this step you will add a node to the system after it has been running for a while. This node must do a state transfer to become up to date, so that its lock_server_cache is in sync with the other replicas.

To begin, alter your lock_server_cache (or possibly its parent class, if necessary) to be a subclass of rsm_state_transfer. You must then implement the marshal_state and unmarshal_state methods for lock_server_cache. These are fairly easy to implement using the YFS RPC marshalling code. For example, if the state of your lock server consists of an integer sequence number seqno and a std::map called locks that maps std::string lock names to std::vector<int> values (any element type would do; int is just for the example), the code might look roughly as follows:


std::string 
lock_server_cache::marshal_state() {

  // lock any needed mutexes
  marshall rep;
  rep << seqno;
  rep << (unsigned int) locks.size();
  std::map<std::string, std::vector<int> >::iterator iter_lock;
  for (iter_lock = locks.begin(); iter_lock != locks.end(); iter_lock++) {
    std::string name = iter_lock->first;
    std::vector<int> vec = iter_lock->second;
    rep << name;
    rep << vec;
  }
  // unlock any mutexes
  return rep.str();

}

void 
lock_server_cache::unmarshal_state(std::string state) {

  // lock any needed mutexes
  unmarshall rep(state);
  rep >> seqno;
  unsigned int locks_size;
  rep >> locks_size;
  for (unsigned int i = 0; i < locks_size; i++) {
    std::string name;
    rep >> name;
    std::vector<int> vec;
    rep >> vec;
    locks[name] = vec;
  }
  // unlock any mutexes
}


In the lock_server_cache constructor, call the rsm's set_state_transfer method with this as the argument, to set the lock server up as the application.
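
For example, the wiring might look roughly like this (the constructor arguments and member name are assumptions; adapt them to your own class):

// Hypothetical constructor; only the set_state_transfer(this) call is the point here.
lock_server_cache::lock_server_cache(class rsm *r /* , ... */)
  : my_rsm(r)
{
  // Register this lock server as the rsm_state_transfer application, so that
  // state transfer uses our marshal_state/unmarshal_state methods.
  my_rsm->set_state_transfer(this);
}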

Hopefully in Lab 7 you already used the grp variable in rsm to assign and track sequence numbers for RSM invocations. Whenever Paxos does a view change, you should have it initialize the sequence number back to zero using initseqno, so that sequence numbers become unique on a per-view basis.

Now modify the paxos protocol to keep track of the highest sequence number and which node has it. You can use the appropriate fields in the prepareres and proposearg in paxos_protocol.h. When a node in Phase 2 accepts a view, it should see if it is behind (i.e., its highest sequence number is smaller than the highest sequence number seen in the system in the current view). If so, the node invokes the provided statetransfer method to get state from the node that has the highest sequence number.
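
In outline, the decision looks like the sketch below; the function and variable names are placeholders, and the provided statetransfer method (whose exact signature is in the skeleton) is what actually fetches the state.

#include <string>

// Placeholder declaration; the skeleton provides the real statetransfer method.
void statetransfer(const std::string &node);

// Called when a node accepts a new view in Phase 2 and learns the highest
// sequence number seen in the system and which node holds it.
void
maybe_catch_up(unsigned int my_seqno, unsigned int highest_seqno,
               const std::string &owner_of_highest)
{
  // If this replica executed fewer RSM operations than some other member of
  // the view, pull the full application state from that node before joining.
  if (my_seqno < highest_seqno)
    statetransfer(owner_of_highest);
}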

Now you should be able to pass test 4 of rsm_tester.pl. This test starts two nodes, then performs lock operations for a while, and then starts a third node, which must get its state up-to-date. Finally it kills the master and ensures the lock tester still completes successfully.

Step Four: Logging Paxos state

Modify your Paxos implementation to use the log class to log new view changes, as well as changes to n_h, and to n_a and v_a, when they are updated. We have provided the code to write and read logs in log.cc (see log::logview(), log::loghigh(), and log::logprop()), so you just have to make sure to call the appropriate methods at the right times.

log writes its messages to a file in the current directory called log_paxos-[port]. Note that rsm_tester.pl will remove these logs when a test finishes successfully, unless you comment out the second line of the cleanup() subroutine in the script.

Now you can run tests that involve restarting a node after it fails. In particular, you should be able to pass tests 5 and 6 of rsm_tester.pl.

Step Five: Complicated failures

Finally, you need to verify that your code handles some of the tricky corner cases that Paxos is supposed to deal with. Our test scripts do not test all possible corner cases, so you could still have a buggy Paxos implementation after this step, but you will have a good feel for the protocol.

In paxos.cc, we provide two methods: breakpoint1() and breakpoint2(). Your Paxos code must call breakpoint1() just after completing Phase 1, but before starting Phase 2. Similarly, it must call breakpoint2() in between Phases 2 and 3. Then, if you send a SIGUSR1 or SIGUSR2 to a lock_server process, it will exit at the respective breakpoint. You can try this manually on the command line with a command like 'kill -USR1 [pid]', but rsm_tester.pl also exercises these failure points automatically in its remaining tests.
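
The placement of the two calls in the proposer path looks roughly like this; run_phase1/run_phase2/run_phase3 are illustrative names for your own code, and breakpoint1/breakpoint2 are the provided paxos.cc methods (declared here as free functions only to keep the sketch self-contained).

bool run_phase1();   // illustrative: returns true on a majority of prepareres
bool run_phase2();   // illustrative: returns true on a majority of acceptres
void run_phase3();   // illustrative: sends decide to all nodes
void breakpoint1();  // provided in paxos.cc
void breakpoint2();  // provided in paxos.cc

void
propose_once()
{
  if (!run_phase1())
    return;          // delay and restart Paxos elsewhere
  breakpoint1();     // test hook: a pending SIGUSR1 makes the process exit here
  if (!run_phase2())
    return;
  breakpoint2();     // test hook: a pending SIGUSR2 makes the process exit here
  run_phase3();
}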

It's been a long journey, but your code should now reliably pass the full gamut of tests in rsm_tester.pl.

Hints

Challenges

Here are a few things you can do if you finish the lab early and feel like improving your code. These are not required, and there are no associated bonus points, but some of you may find them interesting.

Collaboration policy

You must write all the code you hand in for the programming assignments, except for code that we give you as part of the assignment. You are not allowed to look at anyone else's solution (and you're not allowed to look at solutions from previous years). You may discuss the assignments with other students, but you may not look at or copy each others' code.

Handin procedure

You will need to email your completed code as a gzipped tar file to 6.824-submit@pdos.csail.mit.edu by the deadline at the top of the page. To do this, execute these commands:
% cd ~/yfs/l8
% make clean
% rm core*
% rm *log
% cd ..
% tar czvf `whoami`-lab8.tgz l8/
That should produce a file called [your_user_name]-lab8.tgz in your yfs/ directory. Attach that file to an email and send it to the 6.824 submit address.
For questions or comments, email 6.824-staff@pdos.csail.mit.edu.