6.894 Lab 3: A Web Proxy

Due date: Thursday, Oct 12.

News

The webproxy tester covers HTTP/1.0 without persistent connections ONLY! If you are implementing HTTP/1.0 with persistent connections, I'll test your webproxy manually rather than with the automated tester.
There was a bug in the old webproxy tester relating to the phase 3 testing timeout. The modified tar file is here. Alternatively, you can simply replace the client.C file. (Thanks to Steve Bauer, Bodhi, and Wei Shi for pointing it out.)
The webproxy tester is here. Untar the files and edit the Makefile to point to your asynchronous library. Run `gmake' to build the executable test-webproxy.

To test your webproxy in normal mode (i.e. having the test-webproxy program automatically start your webproxy), type

%./test-webproxy (path of your webproxy program) (test#)
There are 3 available tests, the default is to run all of them.

To test your webproxy in debug mode, edit the test-webproxy.C file, change the line
#define DEBUG_WEBPROXY 0
to
#define DEBUG_WEBPROXY 1
You can now run your webproxy manually in a debugger (such as gdb). When you run the test-webproxy program this time, you need to tell it the port the proxy is running on:

%./test-webproxy (path of your webproxy) (proxy port) (test#)

As in the previous lab, we will maintain an FAQ list to help you debug your webproxy.


The HTTP parser had a bug and did not parse HTTP responses such as "HTTP/1.0 404 Not Found" correctly. The new http.C fixes this bug. (Thanks to Kalpak.)

Introduction

Students who have already done the 6.033 lab should refer to this document for what to do in this lab.

Now that you have programmed an asynchronous TCP proxy, you will enjoy creating an asynchronous caching web proxy. Unlike the TCP proxy, your web proxy will involve interactions between different connections (through the shared cache) as well as more involved treatment of individual connections.

In this handout, we use client to mean an application program that establishes connections for the purpose of sending requests[3]. Typically the client is a web browser (e.g., lynx or Netscape). We use server to mean an application program that accepts connections in order to service requests by sending back responses (e.g., the Apache web server)[1]. Note that a proxy can act as both a client and server. Moreover, a proxy could communicate with other proxies (e.g., a cache hierarchy).


Design Criteria

The HTTP/1.0 spec, RFC 1945, defines a web proxy as a transparent, trusted intermediary between web clients and web servers for the purpose of making requests on behalf of clients. Requests are serviced internally or by passing them, with possible translation, on to other servers. A proxy must interpret and, if necessary, rewrite a request message before forwarding it. In particular, your proxy must address:

Your proxy should function correctly for any HTTP/1.0 GET, POST, or HEAD request. However, you may ignore any references to cookies and authentication.


Desirable Properties of Your Web Proxy

Your web proxy should tolerate many simultaneous requests. The web proxy will accept connections from multiple clients and forward them using multiple connections to the appropriate servers. No client or server should be able to hang the web proxy by refusing to read or write data on its connection.
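One common way to keep a stalled peer from hanging the whole proxy is to make every descriptor non-blocking, so that a read returns immediately when no data is ready. The sketch below uses plain POSIX calls rather than the course's asynchronous library, whose interface is not shown in this handout; make_nonblocking and read_available are hypothetical helper names.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Put a descriptor into non-blocking mode; returns false on failure.
bool make_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return false;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Read whatever is available right now. Returns the byte count, 0 on EOF,
// or -1 on error. would_block is set when -1 only means "no data yet".
ssize_t read_available(int fd, char* buf, size_t len, bool& would_block) {
    would_block = false;
    ssize_t n = read(fd, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        would_block = true;
    }
    return n;
}
```

When would_block is set, the proxy can go back to servicing other connections and retry this one once the descriptor becomes readable.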

You should ensure that your proxy serves cached pages to clients when RFC 1945 allows, and only contacts a server when it has to. RFC 1945 specifies headers that clients and servers may provide to help control caching: Expires, If-Modified-Since, Last-Modified, and Pragma: no-cache. You should make sure that your software obeys these headers. However, you'll find you have a certain amount of freedom in exactly how you decide whether you can serve a cached page to a client, or whether you must re-fetch it from the server.
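As one illustration of that freedom, the sketch below shows a conservative decision rule. It is only one possible policy, not the required one. CacheEntry, CacheAction, and decide are hypothetical names, and the sketch assumes the Expires header has already been parsed into a time_t.

```cpp
#include <cassert>
#include <ctime>
#include <string>

// Hypothetical cache entry; the field names are illustrative only.
struct CacheEntry {
    std::string body;
    bool has_expires;     // did the response carry an Expires header?
    std::time_t expires;  // parsed Expires value, meaningful if has_expires
};

enum class CacheAction {
    ServeCached,  // answer from the cache without contacting the server
    Revalidate,   // send a conditional GET with If-Modified-Since
    Refetch       // forward the request and replace any cached copy
};

// Assumed policy: honor Pragma: no-cache, trust an unexpired Expires date,
// and otherwise ask the server whether the cached copy is still current.
CacheAction decide(const CacheEntry* entry, bool pragma_no_cache, std::time_t now) {
    if (entry == nullptr || pragma_no_cache) return CacheAction::Refetch;
    if (entry->has_expires && now < entry->expires) return CacheAction::ServeCached;
    return CacheAction::Revalidate;
}
```

A less conservative proxy might also serve entries that lack an Expires header for some period; whether that is acceptable depends on your reading of RFC 1945.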

You'll want to search RFC 1945 for any warnings about "proxy" behavior. The lab TAs will test that your proxy handles requests as stated in RFC 1945.


The HTTP Protocol

The Hypertext Transfer Protocol (HTTP) is the most commonly used protocol on the web today. For this lab, you will use the somewhat out-of-date version 1.0 of HTTP.

The HTTP protocol assumes a reliable connection and, in current practice, uses the TCP protocol to provide this reliable connection. The TCP protocol provides the reliable transport of bytes between programs on two separate machines, even over an unreliable network. Luckily for us, the TCP protocol is built into the UNIX operating system.

The HTTP protocol is a request/response protocol. When a client opens a connection, it immediately sends its request for a file. A web server then responds with the file or an error message. You can try out the protocol yourself. For example, try:

(~/)% telnet web.mit.edu 80
Then type
GET /6.033/www/ HTTP/1.0
followed by two carriage returns. See what you get.

To form the path to the file to be retrieved on a server, the client takes everything after the machine name and port number. For example, http://www.mit.edu/original/ means we should ask for the file /original/. If you see a URL with nothing after the machine name and port, then / is assumed. (The server determines what page to return when given just /; typically this default page is index.html or home.html.)

On most servers, HTTP uses port 80. However, ports below 1024 are reserved for privileged processes on most UNIX systems, so we will have to run our web proxy on a higher port (> 1023). To use other ports, we need to modify our URLs a bit, adding the port number after the machine name. For example, entering http://www.mit.edu:8008/ into your favorite web browser connects to the machine www.mit.edu on port 8008 using the HTTP protocol.
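The URL rules above (machine name, optional port, and a path that defaults to /) can be sketched as a small splitting function. UrlParts and split_http_url are hypothetical names, and the sketch makes simplifying assumptions: it only accepts a lowercase http:// prefix and does not handle user information or escaping.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type for the pieces of an http URL.
struct UrlParts {
    std::string host;
    int port;
    std::string path;
};

// Splits "http://host[:port][/path]". Returns false on a missing
// "http://" prefix, an empty host, or a malformed port.
bool split_http_url(const std::string& url, UrlParts& out) {
    const std::string scheme = "http://";
    const std::string::size_type npos = std::string::npos;
    if (url.compare(0, scheme.size(), scheme) != 0) return false;

    std::string::size_type host_start = scheme.size();
    std::string::size_type path_start = url.find('/', host_start);
    std::string authority = (path_start == npos)
        ? url.substr(host_start)
        : url.substr(host_start, path_start - host_start);
    // Nothing after the machine name and port means "/".
    out.path = (path_start == npos) ? std::string("/") : url.substr(path_start);

    std::string::size_type colon = authority.find(':');
    if (colon == npos) {
        out.host = authority;
        out.port = 80;  // default HTTP port
    } else {
        out.host = authority.substr(0, colon);
        std::string port_str = authority.substr(colon + 1);
        if (port_str.empty() || port_str.size() > 5) return false;
        int port = 0;
        for (char c : port_str) {
            if (c < '0' || c > '9') return false;
            port = port * 10 + (c - '0');
        }
        if (port < 1 || port > 65535) return false;
        out.port = port;
    }
    return !out.host.empty();
}
```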

The format of the request for HTTP is quite simple. A request consists of a method followed by arguments, each separated by a space and terminated by a carriage return/linefeed pair. Your web proxy should support three methods: GET, POST, and HEAD[3]. Methods take two arguments: the file to be retrieved and the HTTP version. Additional headers can follow the request. The web proxy will especially care about the following headers: Allow, Date, Expires, From, If-Modified-Since, Pragma: no-cache, Server. However, your proxy must handle the other HTTP/1.0 headers[3]. Fortunately, the web proxy can forward most headers verbatim to the appropriate server. Only a handful of headers require proxy intervention.
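To make the request-line format concrete, here is a minimal sketch that splits a line into method, target, and version. The parser we provide for this lab already does this work; RequestLine and parse_request_line are hypothetical names used only for illustration, and the sketch deliberately rejects requests that omit the HTTP version.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical holder for the three fields of a request line.
struct RequestLine {
    std::string method;
    std::string target;
    std::string version;
};

// Accepts lines such as "GET /index.html HTTP/1.0\r\n".
// Returns false for unsupported methods or a wrong number of fields.
bool parse_request_line(const std::string& line, RequestLine& out) {
    std::istringstream in(line);
    std::string extra;
    // Stream extraction skips spaces and the trailing CR/LF.
    if (!(in >> out.method >> out.target >> out.version)) return false;
    if (in >> extra) return false;  // more than three fields
    if (out.method != "GET" && out.method != "POST" && out.method != "HEAD") {
        return false;
    }
    return out.version.compare(0, 5, "HTTP/") == 0;
}
```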

Once the request line is received, the web proxy should continue reading the input from the client until it encounters a blank line. The proxy should then fetch the appropriate file and send back a response (usually the file contents) and close the connection.
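Because data arrives in arbitrary chunks, the proxy usually has to accumulate bytes in a buffer and check after each read whether the blank line has arrived yet. A minimal sketch of that check follows; header_end is a hypothetical helper, and it assumes lines end with a carriage return/linefeed pair.

```cpp
#include <cassert>
#include <string>

// Returns the length of the request line plus headers, including the
// terminating blank line, or 0 if the blank line has not arrived yet.
std::string::size_type header_end(const std::string& buf) {
    std::string::size_type pos = buf.find("\r\n\r\n");
    return pos == std::string::npos ? 0 : pos + 4;
}
```

Any bytes past the returned offset belong to the request body, which matters for POST requests.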


Using a Web Proxy

To use a web proxy, you must configure your web browser. For Lynx, wget, or Mosaic, you must set an environment variable. The following sets your proxy to squid.lcs.mit.edu in csh.

(~/)% setenv http_proxy http://squid.lcs.mit.edu:3128/

In Netscape, find the Network Preferences and manually set up a proxy. For instance, you can set the HTTP proxy to squid.lcs.mit.edu and the port to 3128. Remember to revert your changes. Not all requests will work transparently through the squid.lcs.mit.edu proxy.


HTTP in Action!

How does one watch an HTTP request in action? To make a simple HTTP request, most people will use telnet. However, telnet does not let you watch incoming HTTP requests. For a more sophisticated connection, use nc (NetCat). nc lets you read and write data across network connections using UDP or TCP[10].

If you use Athena, type the following to get nc:

(~/)% add sipb
A standard Linux installation usually comes with nc. If you don't have nc on your machine, go to this site to download and install it.

For instance, the following makes nc listen on port 8000:

(~/)% add sipb
(~/)% nc -p 8000 -l -v
listening on [any] 8000 ...

Now point your favorite web browser to http://localhost:8000/ (no proxy). My version of Netscape generates:

connect to [127.0.0.1] from localhost [127.0.0.1] 5854
GET / HTTP/1.0
Connection: Keep-Alive
User-Agent: Mozilla/3.01 (X11; U; Linux 2.0.30 i586)
Host: localhost:8000
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*

The first line asks for a file called / using HTTP version 1.0. Look in RFC 1945 for details on the remaining lines. Lynx produces a similar request:

connect to [127.0.0.1] from localhost [127.0.0.1] 5917
GET / HTTP/1.0
Host: localhost:8000
Accept: application/postscript, image/gif, application/postscript, */*;q=0.001
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.6  libwww-FM/2.14

Set your browser to use port 8000 of localhost as a proxy, and retrieve http://c0re.l0pht.com/~weld/netcat/readme.html; this will produce something like:

connect to [127.0.0.1] from localhost.mit.edu [127.0.0.1] 2328
GET http://c0re.l0pht.com/~weld/netcat/readme.html HTTP/1.0
If-Modified-Since: Thursday, 12-Sep-96 02:25:13 GMT; length=63340
Referer: http://c0re.l0pht.com/~weld/netcat/
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/3.01Gold (X11; U; OpenBSD 2.2 i386)
Pragma: no-cache
Host: c0re.l0pht.com
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*

The above shows what a web browser sends to a web proxy. Now we'll try to obtain sample data from a real web proxy (squid.lcs.mit.edu, port 3128). Set your browser's proxy to http://squid.lcs.mit.edu:3128/ and run the following command on your local machine, say abc.mit.edu (you can obtain the name of the local machine using the "hostname" command):

nc -p 8000 -v -l
listening on [127.0.0.1] 8000 ...

When I ask my web browser for http://abc.mit.edu:8000/, nc reports:

connect to [18.26.4.118] from xyz.lcs.mit.edu [18.24.10.20] 3037
GET / HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: en
User-Agent: ANONYM/0.0 (ITS; KL-10)
Host: abc.mit.edu:8000
Cache-Control: max-age=259200
Connection: keep-alive

Try this on your machine. Look for differences between the web browser's request and the corresponding proxy request.
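One difference visible in the captures above is the request line: the browser sends the proxy a full URL, while the proxy sends the server only the path. The sketch below shows one way to perform that rewrite. to_origin_form is a hypothetical helper; it assumes a lowercase http:// prefix and leaves the headers untouched, which a real proxy may also need to adjust.

```cpp
#include <cassert>
#include <string>

// Rewrites "METHOD http://host[:port]/path VERSION" as "METHOD /path VERSION".
// Lines whose target is not an absolute http URL are returned unchanged.
std::string to_origin_form(const std::string& request_line) {
    const std::string scheme = "http://";
    const std::string::size_type npos = std::string::npos;

    std::string::size_type sp = request_line.find(' ');
    if (sp == npos) return request_line;
    std::string::size_type target = sp + 1;
    if (request_line.compare(target, scheme.size(), scheme) != 0) {
        return request_line;
    }

    // End of the URL and the first slash after the host name.
    std::string::size_type end = request_line.find(' ', target);
    std::string::size_type slash = request_line.find('/', target + scheme.size());

    std::string path = "/";  // default when the URL has no path
    if (slash != npos && (end == npos || slash < end)) {
        path = request_line.substr(slash, end == npos ? npos : end - slash);
    }
    std::string rest = (end == npos) ? std::string() : request_line.substr(end);
    return request_line.substr(0, target) + path + rest;
}
```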


Administrivia

Where to Start?

Read over some of the suggested literature at the end of this document. (If you use Athena, there is no need for you to download the RFCs from this page. All RFCs are in the rfc locker.)

After you have a general understanding of the problem, play with nc and begin your web proxy design. Once you have convinced yourself of the correctness of your design, you should implement the web proxy. Likely you will discover new, fascinating problems and will need to modify your design appropriately.

For this lab, we will provide you with a simple HTTP/1.0 parser to save you the pain of parsing. Download it from here. The tar file also contains a sample Makefile, which you can modify to suit your needs for this assignment.

Handin Procedure

You should hand in your source and a Makefile EXCLUDING the md5 files we gave you but INCLUDING the http parser files. The Makefile should create a proxy executable named webproxy. We must be able to run your proxy simply by giving it a port number (on which it should listen for HTTP requests) as a command-line argument. That is, we should be able to type the following to cause your proxy to listen on port 8088:
% make
% ./webproxy 8088
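A minimal sketch of validating that command-line argument is shown below. parse_port is a hypothetical helper; how you pass the resulting port to the asynchronous library is up to your design.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Returns the port number, or -1 if text is not an integer in 1..65535.
int parse_port(const char* text) {
    if (text == nullptr || *text == '\0') return -1;
    errno = 0;
    char* end = nullptr;
    long value = std::strtol(text, &end, 10);
    // Reject overflow, trailing garbage, and out-of-range values.
    if (errno != 0 || *end != '\0' || value < 1 || value > 65535) return -1;
    return static_cast<int>(value);
}
```

In main, the proxy would call parse_port(argv[1]) and print a usage message if the result is -1.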

Hand in your lab by creating a tar file with all your source files (and Makefile), uuencoding it, and e-mailing it to 6.894-submit@pdos.lcs.mit.edu. For example:

% tar cf lab3.tar Makefile http.h http.il http.C webproxy.C (and other files you created)
% uuencode < lab3.tar lab3.tar | Mail -s '6.894 lab3.tar' 6.894-submit@pdos.lcs.mit.edu
For students working on Athena, type the following instead:
% tar cf lab3.tar Makefile http.h http.il http.C webproxy.C (and other files you created)
% uuencode < lab3.tar lab3.tar | mhmail -subject '6.894 lab3.tar' 6.894-submit@pdos.lcs.mit.edu
Please don't send attachments in your email submission. If you cannot submit, email jinyang@lcs.mit.edu for help. :-)

We must be able to compile your software with our standard async library, so don't modify the async library.

The lab is due by the beginning of class on Thursday, October 12th.

Collaboration

Like the TCP Proxy, this is an individual project. But you are otherwise free (and encouraged) to discuss the design and implementation details with other 6.894 lab students. Include a README file in the tar archive you submit giving credit where it is due.

Requirements

Your proxy should satisfy the following minimum criteria:


References

[1] Apache Web Proxy, http://www.apache.org/docs/mod/mod_proxy.html.

[2] T. Berners-Lee. Propagation, Replication and Caching on the Web, http://www.w3.org/Propagation/.

[3] T. Berners-Lee, et al. RFC 1945: Hypertext Transfer Protocol - HTTP/1.0, May 1996.

[4] CERN Web Proxy, http://www.w3.org/Daemon/User/Proxies/Proxies.html.

[5] A. Dingle, T. Partl. Web Cache Coherence, http://sun3.ms.mff.cuni.cz/~dingle/webcoherence.html, May 1996.

[6] SquidCache, http://squid.nlanr.net/Squid/.

[7] R. Fielding, et al. RFC 2616: Hypertext Transfer Protocol - HTTP/1.1, June 1999.

[8] J. Franks, et al. RFC 2069: An Extension to HTTP: Digest Access Authentication, January 1997.

[9] J. C. Mogul, et al. RFC 2145: Use and Interpretation of HTTP Version Numbers, May 1997.

[10] Netcat, http://c0re.l0pht.com/~weld/netcat/.

[11] D. Wessels. Web Caching Reading List, http://ircache.nlanr.net/Cache/reading.html.