6.824 Lab 3: A Web Proxy

Due date: Thursday, October 4th.

Introduction

Now that you have created an asynchronous TCP proxy, you will create an asynchronous caching web proxy. Unlike the TCP proxy, the web proxy is optimized for a specific task: serving web pages. For instance, your proxy will cache web pages to reduce network traffic. The web proxy will involve interactions between different connections (through the shared cache) as well as more involved treatment of individual connections.

In this handout, we use client to mean an application program that establishes connections for the purpose of sending requests[3]. Typically the client is a web browser (e.g., lynx or Netscape). We use server to mean an application program that accepts connections in order to service requests by sending back responses (e.g., the Apache web server)[1]. Note that a proxy can act as both a client and server. Moreover, a proxy could communicate with other proxies (e.g., a cache hierarchy).

Design Criteria

The HTTP/1.0 spec, RFC 1945, defines a web proxy as a transparent, trusted intermediary between web clients and web servers for the purpose of making requests on behalf of clients. Requests are serviced internally or by passing them, with possible translation, on to other servers. A proxy must interpret and, if necessary, rewrite a request message before forwarding it. In particular, your proxy must address:

Caching. The proxy must cache web files when RFC 1945 allows it to and not contact servers unless it is necessary. Web caches decrease latency and total network load at the cost of sometimes serving stale data[5]. RFC 1945 specifies headers that clients and servers may provide to help control caching: Expires, If-Modified-Since, Last-Modified, and Pragma: no-cache. You should make sure that your software obeys these headers. However, you'll find you have a certain amount of freedom in exactly how you decide whether you can serve a cached page to a client, or whether you must re-fetch it from the server.
Non-blocking. Your web proxy must operate asynchronously, so that it can talk to multiple clients and servers concurrently. You'll use the same C++ async library you used in the first two labs. A related requirement is that your webproxy handle streaming web pages (pages which produce an unbounded amount of data).
Transparency. The proxy should be transparent to the client and server. Your design cannot rely on any special modification to the client or server (i.e. your proxy should work with the standard proxy support of web browsers).

Your proxy should function correctly for any HTTP/1.0 GET, POST, or HEAD request. However, you may ignore any references to cookies and authentication, except that you should not cache the results of requests that carry client authorization.

Requirements

In particular, your proxy should statisfy the following minimum criteria:

Basic

GET, conditional GET, POST, HEAD request works
Error response works (e.g. Error 404)
Images/Binary files are transferred correctly
Client sees some data from body before proxy sees all of the body (i.e. streaming web pages work).

Asynchronous I/O

handles multiple requests simultaneously using libasync
deals with hanging(non-responding) client/server

Caching

Has memory cache or disk cache (both shoud have a reasonable size restriction)
Obeys expires header
does not cache the non-cachable and correctly handles "Pragma: no cache" header

The HTTP Protocol

The Hypertext Transfer Protocol (HTTP) is the most commonly used protocol on the web today. For this lab, you will use the somewhat out-of-date version 1.0 of HTTP.

The HTTP protocol assumes a reliable connection and, in current practice, uses TCP to provide this connection. Thus, we can use libasync with TCP sockets just as in the previous labs.

The HTTP protocol is a request/response protocol. When a client opens a connection, it immediately sends its request for a file. A web server then responds with the file or an error message. You can try out the protocol yourself. For example, try:

% telnet web.mit.edu 80

Then type

GET / HTTP/1.0

followed by two carriage returns. See what you get.

To form the path to the file to be retrieved on a server, the client takes everything after the machine name. For example, http://web.mit.edu/resources.html means we should ask for the file /resources.html. If you see a URL with nothing after the machine name and port, then / is assumed (The server determines what page to return when just given /. Typically this default page is index.html or home.html).

On most servers, the HTTP protocol lives on port 80. However, one can specify a different port number in the URL. For example, entering http://hera.lcs.mit.edu:11977/ into your favorite web browser connects to the machine hera.lcs.mit.edu on port 11977 using the HTTP protocol and shows you a picture of your TA's office. Your web proxies will listen on ports other than port 80.

The format of the request for HTTP is quite simple. A request consists of a method followed by arguments, each separated by a space and terminated by a carriage return/linefeed pair. Your web proxy should support three methods: GET, POST, and HEAD[3]. Methods take two arguments: the file to be retrieved and the HTTP version. Additional headers can follow the request. The web proxy will especially care about the following headers: Allow, Date, Expires, From, If-Modified-Since, Pragma: no-cache, Server. However, your proxy must handle the other HTTP/1.0 headers[3]. Fortunately, the web proxy can forward most headers verbatim to the appropriate server. Only a handful of headers require proxy intervention.

Once the request line is received, the web proxy should continue reading the input from the client until it encounters a blank line. The proxy should then fetch the appropriate file and send back a response (usually the file contents) and close the connection. Note that the proxy should not attempt to buffer the entire response before returning any data to the client.

Using a Web Proxy

To use a web proxy, you must configure your web browser. In Netscape, find the Network Preferences and manually setup a proxy. The process is similiar for IE, Mozilla, Konquerer or other browswers. For instance, you can set the HTTP proxy to hera.lcs.mit.edu and the port to 3128 and experiment with the Squid proxy. When testing your proxy set your browswer to use your proxy (i.e. blood.lcs.mit.edu on 6666 or whatever port you choose).

HTTP in Action!

How does one watch an HTTP request in action? To make a simple HTTP request emulating a browser, you can use telnet. However, telnet does not let you watch incoming TCP connections. For this, you need a more sophisticated tool, such as nc (Net cat). nc lets you read and write data across network connections using UDP or TCP[10]. The class machines already have nc installed.

First we'll examine the requests that clients will make on the proxy. To do this we'll use nc to listen on a port and direct our web client (lynx) to use that host and port as a proxy. To use nc to listen to the network on port 8888, run:

% nc -lp 8888

Now try to retrieve a URL using this port as a proxy:

% env http_proxy=http://localhost:8888/ lynx -source http://www.yahoo.com

You will see netcat print out the request headers:

% nc -l -p 8888
GET http://www.yahoo.com/ HTTP/1.0
Host: www.yahoo.com
Accept: text/html, text/plain, text/sgml, video/mpeg, image/jpeg, image/tiff, image/x-rgb, image/png, image/x-xbitmap, image/x-xbm, image/gif, application/postscript, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.4rel.1 libwww-FM/2.14

The first line asks for a file called http://www.yahoo.com using HTTP version 1.0. Look in RFC 1945 for details on the remaining lines.

The above shows what a web browser sends to a web proxy. Now we'll try to obtain sample data from a real web proxy (hera.lcs.mit.edu port 3128) and see what the proxy sends to the server. In this example nc pretends to be a web server. Run the following command on one of the class machines where port is a port of your choosing:

% nc -l -p port

This causes nc to listen on port port and print out any requests it receives. Now we'll try to retrieve a web page from nc using the web proxy. We'll use the web browser lynx to connect to the proxy running on hera.lcs.mit.edu which will connect to netcat. Run the following where port is the port you chose above and blood.lcs.mit.edu is replaced by the machine you ran nc on:

% env http_proxy=http://hera.lcs.mit.edu:3128/ lynx -source http://blood.lcs.mit.edu:port

nc running on blood will show the following request:

% nc -lp 8888
GET / HTTP/1.0
Accept: text/html, text/plain, text/sgml, video/mpeg, image/jpeg, image/tiff, image/x-rgb, image/png, image/x-xbitmap, image/x-xbm, image/gif, application/postscript, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.4rel.1 libwww-FM/2.14
Via: 1.0 hera.lcs.mit.edu:3128 (Squid/2.4.STABLE1)
X-Forwarded-For: 18.26.4.74
Host: blood.lcs.mit.edu:5678
Cache-Control: max-age=259200
Connection: keep-alive

Look for differences between the web browser's request and the corresponding proxy request. Your proxy will have to translate the request the client makes (the one that starts with GET http://) into a request (like the one directly above) that the server understands.

Getting Started

Skeleton Directory

We have provided a skeleton webproxy directory as we did in the previous two labs. It is available in /home/to2/labs/webproxy.tar.gz. Use it as you used the skeleton directories for the TCP proxy and multifinger:

% tar xzvf /home/to2/labs/webproxy.tar.gz
% cd webproxy
% ./setup
% ./configure --with-sfs=/home/to2/sfs-debug
%  gmake

HTTP parser

We have provided a parser for the HTTP language. It is implemented in the files http.C and http.h that are included in the skeleton tarball.

http.h defines two useful classes, httpreq and httpresp. Both of these classes inherit from the class httpparse (if you are unfamiliar with C++ inheritance consult the Stroustrup C++ language guide referenced in the course information page.) To parse a request or a response, first create a httpreq or httpresp object accordingly. Both of these classes are fed data for parsing by the method int parse (suio *) that removes lines of HTTP headers from a suio structure and parses them. parse returns 1 on completion (after which any data following the headers will still be in the suio structure), 0 if it needs more data to see the complete headers, and -1 on a parse error. The parse method removes the data from the suio as it parses and copies (possibly mutating) headers to the headers field of the object. The possible mutation includes modifying a request (parsed by the httpreq) object to a form suitable to forward to a server. The parse method only operates on headers; it will not read, for instance, any part of the body of a response. When parsing is complete, member variables of the httpparse class contain useful information about the request or response.

The request object contains:

str method The 'type' of request (POST, GET, HEAD)
str host The destination host
short port The destination port
str url The requestied URL
str urlhash A cryptographic hash of the url (useful for caching)
bool nocache Set to true if the Pragma: no-cache option was set
time_t if_modified_since A unix time representation of the if_modified_since header.
ssize_t content_length The size of the following body
str authentication Any authentication information
str headers headers suitably modified for passing to the server

The response object contains the nocache and content_length fields as well as:

int status The "return value" that the request produced (i.e. 404)
time_t (date|expires|last_modified) unix time representations of the creation date, expiration date, and last modified date for this object.

The following routines also defined in http.C may also be useful:

void suio_print (suio *, str)
Adds the contents of the string onto the end of the suio structure in such a way that the string will not be garbage collected until the bytes have been removed from the suio structure.
str httperror (int status, str statmsg, str url, str description);
Returns a fully formatted HTTP message (status, headers and all) containing an error message. You can use this to report an error back to the user of the proxy. For example, if tcpconnect fails, you might want to creat an error message and stick it in suio *buf:
```
    suio_print (buf, httperror (503, "Service Unavailable",
                                url, strerror (errno)));
```
void suio::take (suio *uio)
Appends the contents of uio on to the end of this, removing the bytes from uio. As an example, if you wanted to append the body of an HTTP reply (in suio *buf) onto the headers built up in an httpresp structure, you might do the following:
```
    resp->headers << "\r\n";    // httpresp does not add blank line
    resp->headers.tosuio ()->take (buf);
```

Hash tables

In addition, you may be interested in using one of the hash tables provided by libasync. These are template based C++ data structures and this description will assume a knowledge of C++ templates. The hash structure described here is declared as:

template < class K, class V, K V::*key, ihash_entry < V > V::*field,
  class H = hashfn < K >, class E = equals < K > > class ihash;

ihash provides a hash table abstraction: values are placed into the hash table and associated with a key; they may be later retreived by presenting the key. ihash is an 'intrusive' hash table: the key associated with a value is stored as part of the value in a location known to the data structure. ihash is a template based class with six template arguments. We will only deal with the first four:

class K
This is the type of the hash key. Items of this type will be used to query the table.
class V
This is the type of the value. Items of this type will be returned by queries to the table.
K V::*key
V::*key is the location (the field) in the object of type V where the hash key (of type K) is stored.
ihash_entry < V >::*field
This field (also part of the objects you will store in ihash) is used internally by ihash.

An example best illustrates the use of ihash. If we wished to create a hash table of students indexed by their ID numbers we might declare the following structure:

#include "ihash.h"
struct student {
   str name;
   ... // grades, etc.
   
   long ID;

   ihash_entry < student > hash_link;
   
   student (str name, char *grades, ... );
};

We could then declare a hash table as: ihash < long, student, &student::ID, &student::hash_link > students;

We can now use students to store and retrieve objects of type student. The following operations are supported by ihash:

void insert (V *item)
Insert item into the table. Note that we do not specify the key under which item is to be inserted explicitly (it is already present in the item itself). ihash does not make a copy of item so do not delete it after inserting. Inserting a item into an ihash table might be done as follows:
```
student *S = New student ("Ben Bitdiddle", "F D F C");
students.insert (S);
```
V *operator[] (K key)
Retreive the value associated with the key key. Returns NULL if key is not stored in the table. In our example fetching a document would be done as follows:
```
long ID = 12345;
student *S = students[ID];
if (S) {
  grade_lab3 (S, lab);
} else {
 warn << "not in class\n";
}
```
void remove (V *item) Remove the item from the table.

Note that this discussion assumes that K is one of the "standard" types for which libAsync defines a hash function. These types include most of the types you might consider using as hash keys in this assignment (notably str). If you choose to use a different type you will need to define a hash function for that type. For futher information on how ihash works, consult /home/to2/sfs/async/ihash.h libAsync provides other, similiarly constructed, data structures including lists, queues, and vectors which are can be used analagously to ihash. Consult the source code for more information.

Running and testing the proxy

The proxy should take exactly one argument, a port number on which to listen. For example, to run the proxy on port 2000:

% ./webproxy 2000

As a first test of the proxy you should attempt to use it to browse the web. Set up your web browswer to use one of the class machines running your proxy as a proxy and experiment with a variety of different pages.

When your proxy seems ready, you can run it against the test program test-webproxy (a binary of which will soon be found in /home/to2/labs/). Run test-webproxy with your proxy as an argument:

% /home/to2/labs/test-webproxy ./webproxy
trying to launch webproxy: ./webproxy listen port 1697
Starting server on port 1530....
Test Phase 1: test if cachable pages are cached...Succeeded...
Test Phase 2: test if non-cachable pages are NOT cached...Succeeded...
Test Phase 3: test timeout behavior...Succeeded...
%

The test program runs three tests:

Test Phase 1: test if cachable pages are cached...
This test is to check if you have implemented any caching scheme at all. In the test-webproxy program, the server keeps track of which of the cachable pages have been accessed before and denies access to them the second time. Thus if you have caching in your proxy, all the clients will be able to succeed in getting the cachable pages (first time from the server, second time onwards from your proxy).
Warning: Since the http responses from the server are randomly generated, you need to restart your webproxy every time you run this test if you have running webproxy manually in gdb.
Test Phase 2: test if non-cachable pages are NOT cached...
This test is to check if your cache manager caches anything that's not allowed to be cached. In the test-webproxy program, the server randomly changes the responses to a particular URL (if the type of response associated with the URL is not cachable). The client always checks the response it got from the proxy with the actual response data in the server. Therefore, if your proxy caches anything non-cachable, the response the client got will be different from the server and you will get the following error message:
```
http response (#num) different from the server offset...
```
Specially, if (#num) is:
- 5: Your proxy is caching a http response that has "Pragma: no-cache" header
- 6: Your proxy is caching a http response that is expired
- 7: Your proxy is caching "HTTP/1.0 404 Not Found" response
- 8: Your proxy is caching a response whose corresponding request has "Pragma: no-cache" header
- 9: Your proxy is caching a response whose corresponding request has "Authentication" header
- 10: Your proxy is caching a response which is the result of a "POST" request
Test Phase 3: test timeout behavior...
In the test-webproxy program, the server refuses to read any data from the client has accepted the connection. Your proxy should be able to timeout and disconnect the corresponding client and server.

You can run only one of the test phases by supplying it as an additional argument on the command line:

% test-webproxy ./webproxy 2
trying to launch webproxy: ./webproxy listen port 1652
Starting server on port 1588....
Test Phase 2: test if non-cachable pages are NOT cached...Succeeded...
%

You may wish to run your proxy under the debugger while testing it. You can do so by supplying a port number to test-webproxy instead of a program name. For example:

% ./webproxy 2000 &
[1] 7013
% test-webproxy -d 2000 1
Starting server on port 1625....
Test Phase 1: test if cachable pages are cached...Succeeded...
shutting down proxy..
Killed
%

Handin Procedure

You should hand in a gzipped tarball produced by gmake dist as in the previous labs. This lab should be handed in in ~/handin/lab3/webproxy-0.0.tar.gz The proxy is due on October 4th before class

FAQ

Q: my proxy crashes when I try to access the authorization field (or any str that is NULL) like this:

if (authorization == NULL) {
 ...
} else {
 warn << authorization << "\n";
}

A: This code will actually crash in the comparison, the way to fix this is to instead test:

if (!authorization) {
	warn << authorization << "\n";
} else {
 ...
}

The original code crashes because C++ attempts to cast NULL to an str object so that it can call the binary == operator.

Q: What does "http response incomplete at # #" mean?

A: The tester prints this message when the response it receives from your proxy is not the same size as the response it expected. The first number is how much data it read, the second is how much it expected to read. A similiar messages is "http response %d different from the server offset %d(total len %d)" which indicates that your proxy somehow returned different data than the server expected. A good way to debug these problems is to run the tester under the debugger and examine exactly what is in the tester's buffers when it prints these messages. Setting a break point around line 140 in client.C of the tester might be helpful. Don't forget that source for the tester is in /home/to2/labs/test-webproxy-0.0.tar.gz.

One more note on this question: because the tester runs over the loopback interface it often exposes timing bugs that don't appear when the proxy is used with slower sights located on the wide area network. It is possible to write a proxy that works despite bugs on ccn.com but fails the tester.

Q: How do I copy a strbuf?

A: the strbuf copy constructor is private and can't be called. To copy the data from one strbuf to another try this:

strbuf copy;
strbuf orig;

copy << orig;

<< is the append operator for strbufs.

References

1

Apache Web Proxy, http://www.apache.org/docs/mod/mod_proxy.html.

2

T. Berners-Lee. Propagation, Replication and Caching on the Web,
http://www.w3.org/Propagation/.

3

T. Berners-Lee, et al. RFC 1945: Hypertext Transfer Protocol - HTTP/1.0, May 1996.

4

CERN Web Proxy, http://www.w3.org/Daemon/User/Proxies/Proxies.html.

5

A. Dingle, T. Partl. Web Cache Coherence,
http://sun3.ms.mff.cuni.cz/~dingle/webcoherence.html, May 1996.

6

SquidCache, http://squid.nlanr.net/Squid/.

7

R. Fielding, et al. RFC 2616: Hypertext Transfer Protocol - HTTP/1.1, June 1999.

8

J. Franks, et al. RFC 2069: An Extension to HTTP : Digest Access Authentication, January 1997.

9

J. C. Mogul, et al. RFC 2145: Use and Interpretation of HTTP Version Numbers, May 1997.

10

Netcat. http://c0re.l0pht.com/~weld/netcat/.

11

D. Wessels. Web Caching Reading List, http://ircache.nlanr.net/Cache/reading.html.