6.824 - Fall 2007

6.824 Lab 5: Caching locks

Last change: $Date: 2007/10/05 20:11:49 $

Due: Friday, October 19th, 5:00pm.


Introduction

In this lab you will build a server and client that cache locks at the client, reducing the load on the server and improving client performance. For example, when client 1 asks for lock "a" repeatedly and no other client does, then all acquire and releases can be performed on client 1 without having to contact the server.

The challenge in the lab is the protocol between the clients and server. For example, when client 2 acquires a lock that client 1 has cached, the server must revoke that lock from client 1 by sending a revoke RPC to client 1. The server can give client 2 the lock only after client 1 has released the lock, which may a long time after sending the revoke (e.g., if a thread on client 1 holds the lock for a long period). The protocol is is further complicated by the fact that requests may be lost, duplicated, and delivered out of order. We will test your lock server code with RPC_LOSSY set to 5, like in Lab 1.

The reason that the lock server must work in these lossy conditions is that in later labs we will make the lock service fault tolerant to failures of individual lock servers, and then these conditions can show up. Of course, in practice, similar conditions could happen with the extent server, but we won't make the extent server fault tolerant in the labs and will assume the network works well; thus, we will ignore these problems for the extent server and will not test for it with the extent server.

To make the lock service fault tolerant, we will want to put a constraint on the lock server operations in later labs. This constraint will have implications for the protocol between the lock clients and lock server. The constraint is that handlers on the server should run to completion without blocking. That is, a server thread should not block on condition variables or remote RPCs. Of course, the thread can wait to take out locks as long as it can be sure that the lock will never be held by another thread across an RPC, and once it has acquired the lock it should run to completion.

To avoid having to re-implement the lock server again for these later labs, your implementation of the lock server for this lab should adhere to this non-blocking constraint. We won't test in this lab whether your implementation adheres to the constraints, but Lab 7 will.

Your server will be a success if it manages to operate out of its local lock cache when reading/writing files and directories that other hosts aren't looking at, but maintains correctness when the same files and directories are concurrently read and updated on multiple hosts. We will test with RPC_LOSSY set to 0 and RPC_LOSSY set to 5.

Getting Started

Since you are building on the past labs, ensure the code in your Lab 4 directory passes all tests for Labs 1, 2, 3 and 4 before starting in on this lab.

Begin by initializing your Lab 5 directory with your implementation from Lab 4 and then updating it with some new infrastructure code in http://pdos.csail.mit.edu/6.824/labs/yfs-lab5.tgz for Lab 5.

$ wget -nc http://pdos.csail.mit.edu/6.824/labs/yfs-lab5.tgz
$ rsync -av l4/* l5
$ tar xzvf yfs-lab5.tgz

This will add these new files to your l5/ directory:

This will overwrite the following existing files (any changes you made to them in previous labs will be lost, so you may have to redo them):

In order to test the caching performance of your caching lock server and clients, we extended the RPC library with the ability to count unique RPCs arriving at the server. If you'd like to use this, you'll need to update your RPC package as follows:

% cd ~/yfs  # or wherever your root YFS directory is
% wget -nc http://pdos.csail.mit.edu/6.824/labs/yfs-rpc.tgz
% tar xzvf yfs-rpc.tgz
With this new RPC package, you can set the environment variable RPC_COUNT before you launch a server process, in order to print out RPC statistics every N RPCs. For example, in the bash shell you could do:
% export RPC_COUNT=25
% ./lock_server 3772
RPC STATS: 7001:23 7002:2 
...
This means that the RPC with the procedure number 0x7001 (acquire in the original lock_protocol.h file) has been called 23 times, while RPC 0x7002 (release) has been called twice.

Testing Performance

Our measure of performance is the number of acquires that your lock clients send to the lock server. You can use RPC_COUNT, as mentioned above, to print out these stats every N calls.

The workload on which we'll evaluate your server's performance is generated by test-lab-4-c. It creates two sub-directories and creates/deletes 100 files in each directory, using each directory through only one of the two servers.

Using the lock server of lab 4 you will see that the number of acquires is at least a few thousand. With the caching lock server you will see that the number of acquires is only a few hundreds (i.e., less than a thousand). We are a bit vague in this performance goal because the exact numbers depend a bit on how you use locks in yfs_server. Suffice it to say, the drop in acquires should be significant.

Of course your server must also remain correct, so we will require the server you hand in to pass all lab 4 testers as well as getting good performance on test-lab-4-c tester.

Protocol and implementation hints

We strongly recommend you implement a protocol in the style suggested below. This protocol has most of the complexity on the client. All the handlers on the server run till completion and threads wait on condition variables on the client when a lock is taken out by another thread (on this client or another client). This allows the server to be replicated using the replicated state machine approach in labs 7 and 8. If you change the protocol, make sure that handlers on the server run to completion.

On the client a lock can be in several states:

In many of the states there may be several threads waiting for the lock, but only one thread per client ever needs to be interacting with the server; once that thread has acquired and released the lock it can wake up other threads, one of which can acquire the lock (unless it has been revoked and released back to the server). (If you need a way to identify a thread, you can use its thread id (tid), which you can get using pthread_self().)

When a client asks for a lock with an ACQUIRE RPC, the server responds with the lock if the lock is not owned by another client (i.e., the lock is free). If the lock is owned by another client, the client will receive a RETRY reply. At some point later (after another client has released the lock using an RELEASE RPC), the server sends the client a GRANT RPC with the lock.

Once a client has acquired ownership of a lock, the client caches the lock (i.e., it doesn't send an RPC to the server when a thread releases the lock on the client). The client can grant the lock to other threads using the same client without interacting with the server. The server will inform the client when it wants a lock back.

The server sends the client a REVOKE RPC to get the lock back. This request tells the client that it should send the lock back to the server when it releases the lock or right now if no thread on the client is holding the lock. You may find it useful to define a new RPC protocol (similar to the one in lock_protocol.h) to use when sending RPCs from the server to the client. This protocol contains definitions for the GRANT and REVOKE RPCs.

A good way to implement releasing locks on the client is using a separate releaser thread (as mentioned above, the skeleton code already launches one for you). When receiving a revoke request, the client adds the revoke request to a list and wakes up the releaser thread. The releaser thread will release the lock (i.e., send a RELEASE RPC to the server) when the lock becomes free on the client. Using a separate a thread is good because it avoids potential distributed deadlocks and ensures that REVOKE RPCs from the server to the client run to completion on the client.

On the server, handlers shouldn't block either. A good way to implement this on the server is to have revoker and granter threads that are in charge of sending REVOKE and GRANT RPCs, respectively. When a client asks for a lock that is taken, the acquire handler adds a REVOKE request to a queue and wakes up the revoker thread. When a client releases a lock, the release handler adds a GRANT request to a list for the granter (if there are clients who want the lock) and wakes up the granter thread. This design ensures that the handlers run to completion. Blocking operations are performed by the granter and revoker threads, and those blocking operations are just RPCs to the client, whose handlers also should run to completion without blocking.

A challenge in the implementation is that GRANT and REVOKE RPCs can be out of order with the acquire and release requests. That is, a client may receive a GRANT request before it has received the response on its acquire request. Similarly, a client may receive a REVOKE before it has received a response on its initial acquire request. Lost, duplicated, and delayed requests raises yet another set of challenges, similar to the ones in Lab 1. For example, the server may receive a delayed duplicate of an acquire request for a lock after the client already released the lock; this duplicate should be ignored; otherwise, your system will deadlock.

A good way to handle these cases is to assign sequence numbers to all requests. That is each request should have a unique client ID (e.g., a random number or the hostname string) and a sequence number. For an acquire the client picks the first unused sequence number and supplies that sequence number as an argument to the acquire RPC, along with the client ID. You probably want to send no additional acquires for the same lock to the server until the oustanding one has been completed (i.e., the client has released the lock). The corresponding release (which may be much later because the lock is cached) should probably carry the same sequence number as the last acquire, in case the server needs to tell which acquire goes along with this release. This approach helps handling lost, duplicate, and delayed requests, and requires the server to remember only one sequence number per client.

Unless your server in Lab 1 has non-blocking handlers and uses GRANT and REVOKE RPCs, you are probably best off starting from scratch for this lab. Lab 1 didn't require you to write much code, and morphing it into something that is suitable for lab 5 is likely to be more work than doing it right from scratch.

Step One: design protocol

You should design the protocol and basic system structure on paper (after playing perhaps a little bit around with the code). In particular, carefully think through the different scenarios due to lost, duplicated, and delayed messages. Changing the basic system structure and tracking down errors in your implemented protocol is painful. If you have thought through all scenarios before you start implementing and have the right system structure, you can save yourself much time.

The following questions might help you with your design (they are in no particular order):

Step Two: lock client and server, and test with RPC_LOSSY=0

A reasonable first step would be to implement the basic design of your ACQUIRE protocol on both the client and the server, including having the server send REVOKE messages to the holder of a lock if another client requests it. This will involve registering RPC handlers on the client, and devising a way for the server to receive and remember each client's location address (i.e., the hostname variable in lock_client_cache) and using it to send the client RPCs.

Next you'll probably want to implement the RELEASE code path on both the client and the server. Of course, the client should only inform the server of the release if the lock has been revoked. This will also involve having the server send a GRANT RPC to the next client in line waiting for the lock.

Also make sure you instantiate a lock_server_cache object in lock_smain.cc, and correctly register the RPC handlers.

Once you have your full protocol implemented, you can run it using the lock tester, just as in Lab 1. For now, don't bother testing with loss:

% export RPC_LOSSY=0
% ./lock_server 3772

Then, in another terminal:

% ./lock_tester 3772

Run lock_tester. You should pass all tests and see no timeouts. You can hit Ctrl-C in the server's window to stop it.

Step Three: test lock client and server with RPC_LOSSY=5

Now that it works without loss, you should try testing with RPC_LOSSY=5. Here you may discover problems with duplicate, delayed, and/or dropped RPCs. Just as before, except with lossy turned on:

% export RPC_LOSSY=5
% ./lock_server 3772

Then, in another terminal:

% ./lock_tester 3772

Step Four: run file system tests

In the constructor for your yfs_server, you should now instantiate a lock_client_cache object, rather than a lock_client object. You will also have to include lock_client_cache.h. Once you do that, your YFS should just work under all the Lab 4 tests. We will run your code against all 3 tests (a, b, and c) from Lab 4.

You should also compare running your YFS code with the two different lock clients and servers, with RPC count enabled at the lock server. For this reason, it would be helpful to keep your Lab 4 code around and intact, the way it was when you submitted it. As mentioned before, you can turn on RPC statistics using the RPC_COUNT environment variable. Look for a dramatic drop in the number of ACQUIRE (0x7001) RPCs between your Lab 4 and Lab 5 code during the test-lab-4-c test.

Also, be aware that if lock acquire operations block for a long time in yfs_server, the RPC client in yfs_client could end up retrying its operations. This could happen, for example, if you have blocking RPC handlers in your lock_server_cache, or if you're running your lock server process using RPC_LOSSY=5. Some of these operations are not idempotent, or re-orderable, so this could cause problems in your code. We are not requiring your code to pass the file system tests under lossy conditions, just the lock tests. A real file system would need to be able to handle these cases of course, and so we invite you to think through how you might handle these cases.

Challenges

Here are a few things you can do if you finish the lab early and feel like improving your code. These are not required, and there are no associated bonus points, but some of you may find them interesting.

Collaboration policy

You must write all the code you hand in for the programming assignments, except for code that we give you as part of the assignment. You are not allowed to look at anyone else's solution (and you're not allowed to look at solutions from previous years). You may discuss the assignments with other students, but you may not look at or copy each others' code.

Handin procedure

You will need to email your completed code as a gzipped tar file to 6.824-submit@pdos.csail.mit.edu by the deadline at the top of the page. To do this, execute these commands:
% cd ~/yfs/l5
% make clean
% rm core*
% rm *log
% cd ..
% tar czvf `whoami`-lab5.tgz l5/
That should produce a file called [your_user_name]-lab5.tgz in your yfs/ directory. Attach that file to an email and send it to the 6.824 submit address.
For questions or comments, email 6.824-staff@pdos.csail.mit.edu.
Back to 6.824 home.