Buffer cache

Required reading: IO-Lite.


The design of a buffer management system impacts server performance. For example, consider the design of a Web server on V6 (assuming that V6 had networking support). An HTTP GET request goes through the following subsystems:

  1. The network system demultiplexes the message and delivers it to user space, copying the message from kernel to user space (perhaps requiring a copy from network card to kernel).
  2. The Web server looks in its cache for the requested URL. If it is there, it sends the data back to the client, which involves copying the data from user space into kernel buffers.
  3. The network system in the kernel hands the buffers to the network card, which may involve a copy. The network system doesn't release the buffers until the data has been acknowledged by the client.
  4. If the data is not in the Web server's cache, the server reads the file from the file system.
  5. The file system calls into the file system buffer cache, which copies the data from the disk into the buffer cache. The file system copies the data into the Web server's address space.
  6. The Web server then copies the data into kernel buffers so that the network system can send the data to the client.

If we look at this picture we see that the data lives in 3 places: the file system cache, the Web server's cache, and the network buffers. We also see that the data is copied 3-4 times: from disk into the kernel, from the kernel to the server's address space, back into the kernel, and onto the network (which may involve a copy to the network card). The only two necessary copies are from disk to kernel and from kernel to network.


IO-Lite is a unified buffer and cache system. Its design allows applications to avoid redundant copies and multiple buffering, and enables cross-subsystem optimizations (e.g., caching the Internet checksum of a buffer).

The key ideas:

  - Buffers are immutable once filled, so they can be mapped read-only into many address spaces.
  - Applications pass buffer aggregates: mutable lists of <pointer, length> pairs that name ranges of buffers, so passing data means passing an aggregate, not copying bytes.
  - Buffers are allocated from per-subsystem pools, and access is enforced by the VM system.

Together these ideas allow a buffer to be mapped safely, using the VM system, into multiple address spaces and shared efficiently among multiple subsystems.

Pseudocode for read:

read(fd, Agg **agg) {  // assume 8K read from file system
   (fileid, offset) = fd->(fileid, offset);
   cache_agg = lookup_cache(fileid, offset);
   if (!cache_agg) {
       allocate cache_agg in the cache;
       allocate buffer(s) from the appropriate pool;
       read data from the disk into the buffer(s);
       insert cache_agg in the cache;
   }
   if (!buffers mapped into process) map buffers into process's space;
   copy cache_agg into *agg;  // copies the aggregate, not the data
}

Pseudocode for write:

write(fd, Agg **agg) {  // assume 8K write to file system
   (fileid, offset) = fd->(fileid, offset);
   cache_agg = lookup_cache(fileid, offset);
   update(cache_agg, *agg);
}

update(cache_agg, *agg) {
   There are three cases:
   (1) The write modifies every word in the file.
   (2) Only a subset of the file is modified by this write.
   (3) Enough words are modified that making a redundant copy of the
       file cache buffer is less expensive than the fragmentation
       overhead of splicing aggregates together.

   In (1) and (3), allocate new buffer(s) and write the changes to
   them; in (3), also copy over the unchanged data.

   For (2), store the new values into a newly allocated buffer and then
   combine the new and old values by creating a new buffer aggregate
   that reflects the logical layout of the file.

   decrease the refcnt of buffers that were freed up by the update;
}

Paper discussion

How is the Web server implemented using IO-lite?
  1. The network system demultiplexes the message into an aggregate for the Web server. (How does the network subsystem know from which pool to allocate the buffers? Answer: early demultiplexing, which is implemented using packet filters.)
  2. The Web server looks in its cache for the requested URL. If it is present, the server passes the aggregate to the network subsystem; the pointers to the data are passed by reference. (What happens if the server modifies these buffers?)
  3. If the network system has sent this data earlier, it might have kept a copy of the checksum of the data. It can safely do that because buffers are immutable. (Are buffers uniquely named?)
  4. The network system prepends a header to the aggregate and DMAs the aggregate onto the network. (Why is this important? What does this assume about the network card?)
  5. If the data is not in the Web server's cache, the server reads the file from the file system, which issues an IOL_read.
  6. IO-Lite copies the data from the disk into a buffer in the appropriate pool. The file system passes the aggregate to the server, which may involve mapping the buffer into the server's address space (if no already-mapped free buffer was available).
  7. The server then hands the aggregate to the network system, which proceeds as before.

How many copies? How many bufferings? What cross-subsystem optimizations?

How do IO-Lite and paging interact?

What is the page replacement policy?

Is IO-Lite worth it? (See figures 3 and 4.) Is it worth it for small pages? What limits the performance of IO-Lite? How about Flash and Apache?

How are CGI-bin apps implemented? Is it worth it? (See figures 5 and 6).

Is double buffering a problem? (See figure 7).

Is there any other app than a Web server that benefits? (See figure 8.)