Required reading: I/O-lite.
The design of a buffer management system impacts the performance of servers. For example, consider the design of a Web server on V6 (assuming that V6 had networking support). An HTTP GET request passes through the following subsystems: the file system (and its cache), the Web server application (and its cache), and the network subsystem (and its buffers).
Looking at this data path, we see that the data lives in 3 places: the file system cache, the Web server's cache, and the network buffers. We also see that the data is copied 3-4 times: from disk into the kernel, from the kernel into the server's address space, back into the kernel, and onto the network (which may involve a copy to the network card). Only two of these copies are necessary: disk to kernel and kernel to network.
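To make the copies concrete, here is a minimal sketch of the conventional (non-IO-Lite) serving path, assuming a POSIX-style server that keeps the data in its own user-level buffer; the avoidable copies happen inside read() and write().

    /* Conventional path: data is copied kernel->user on read() and
       user->kernel on write(), and is buffered in both the file system
       cache and the server's own buffer. (Illustrative sketch; serve_file
       and its arguments are hypothetical.) */
    #include <fcntl.h>
    #include <unistd.h>

    static void serve_file(int sockfd, const char *path)
    {
        char buf[8192];                  /* server's private copy of the data */
        int fd = open(path, O_RDONLY);
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)   /* copy: kernel -> user */
            write(sockfd, buf, n);                     /* copy: user -> kernel */
        close(fd);
        /* The disk -> kernel and kernel -> NIC copies happen inside the kernel. */
    }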
IO-Lite is a unified buffer and cache system. Its design allows applications to avoid redundant copies and multiple buffering, and enables cross-subsystem optimizations (e.g., caching the Internet checksum of a buffer so it need not be recomputed).
The key idea is to provide an IPC interface that scatter/gathers pages of memory and transfers them between address spaces by reference, and to show how that interface can be used to avoid redundant copies, avoid multiple buffering, and enable cross-subsystem optimizations. More specifically:
- Data is stored in immutable buffers, allocated from the appropriate pool and shared read-only among subsystems.
- Processes refer to data through mutable buffer aggregates: ordered lists of <pointer, length> pairs into those buffers. To "modify" data, a process writes new values into a new buffer and builds a new aggregate that combines old and new pieces.
Why do you need aggregates? Couldn't you just send a page mapping from the source to the destination address space and make the page COW? That is not general enough. For example, the destination may want to add a header (e.g., the HTTP header) and then forward the mappings to the network subsystem.
Together these ideas allow a buffer to be mapped safely, using the VM system, into multiple address spaces and shared efficiently among multiple subsystems.
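As a rough sketch (the names are illustrative, not the paper's exact declarations), the two central data structures might look like this in C:

    #include <stddef.h>

    /* Immutable buffer: a region of pages filled once and then only read. */
    struct iol_buffer {
        void   *pages;      /* start of the buffer's pages */
        size_t  size;       /* size in bytes */
        int     refcnt;     /* number of aggregates referring to this buffer */
    };

    /* A slice of an immutable buffer. */
    struct iol_slice {
        struct iol_buffer *buf;
        size_t             offset;  /* start of the slice within buf */
        size_t             len;     /* length of the slice */
    };

    /* Mutable buffer aggregate: an ordered list of slices; this is what is
       passed across address spaces by reference. */
    struct iol_agg {
        int                nslices;  /* number of slices in the list */
        struct iol_slice  *slices;   /* the slices, in logical order */
    };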
Pseudocode for read:
    read(fd, Agg **agg) {              // assume an 8K read from the file system
        (fileid, offset) = fd->(fileid, offset);
        cache_agg = lookup_cache(fileid, offset);
        if (!cache_agg) {
            allocate a cache_agg in the cache;
            allocate buffer(s) from the appropriate pool;
            read the data from disk into the buffer(s);
            insert cache_agg into the cache;
        }
        if (buffers not yet mapped into the process)
            map the buffers into the process's address space;
        copy cache_agg into *agg;
    }
Pseudocode for write:
    write(fd, Agg **agg) {             // assume an 8K write to the file system
        (fileid, offset) = fd->(fileid, offset);
        cache_agg = lookup_cache(fileid, offset);
        update(cache_agg, *agg);
    }

    update(cache_agg, agg) {
        There are three cases:
        (1) The write modifies every word in the file.
        (2) Only a subset of the file is modified by this write.
        (3) Enough words are modified that making a redundant copy of the
            file cache buffer is less expensive than the fragmentation
            overhead IO-Lite would otherwise incur.
        In (1) and (3), allocate new buffer(s) and write the changes into
        them; in (3) also copy over the unchanged data.
        In (2), store the new values in a newly allocated buffer and combine
        the new and old values by creating a new buffer aggregate that
        reflects the logical layout of the file.
        Decrease the refcnt of any buffers freed up by the update of
        cache_agg.
    }
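A minimal sketch of case (2), using the illustrative structures above: the unchanged prefix and suffix of the cached data are reused by reference, and only the newly written bytes live in a fresh buffer. (update_subrange is a hypothetical helper; it assumes the old data sits in a single buffer.)

    #include <stdlib.h>

    /* Build a new aggregate that overwrites bytes [off, off+len) of old_buf
       with the contents of new_buf, without copying the unchanged parts. */
    struct iol_agg *update_subrange(struct iol_buffer *old_buf,
                                    struct iol_buffer *new_buf,
                                    size_t off, size_t len)
    {
        struct iol_agg *agg = malloc(sizeof(*agg) + 3 * sizeof(struct iol_slice));
        agg->slices = (struct iol_slice *)(agg + 1);
        int n = 0;
        if (off > 0)                                  /* unchanged prefix, by reference */
            agg->slices[n++] = (struct iol_slice){ old_buf, 0, off };
        agg->slices[n++] = (struct iol_slice){ new_buf, 0, len };   /* new data */
        if (off + len < old_buf->size)                /* unchanged suffix, by reference */
            agg->slices[n++] = (struct iol_slice){ old_buf, off + len,
                                                   old_buf->size - (off + len) };
        agg->nslices = n;
        old_buf->refcnt++;      /* old buffer is now also referenced by this aggregate */
        new_buf->refcnt++;
        return agg;
    }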
How does this apply to the 6.828 kernel and operating system? The file server already has the buffer cache mapped in its memory. When a client opens a file, the server already sends a mapping to the client, which maps a page from the buffer cache into the client's address space. (If we had a network server, the client could send a mapping to the network server.) Reference counting works out, because the kernel maintains a ref count for each physical page; if the ref count is 1 for a page in the buffer cache, then only the file server has it mapped.
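For instance, the file server can hand a client a buffer-cache page by reference using the existing page-mapping IPC; here is a minimal sketch, assuming the JOS lab interface (diskaddr, ipc_send, ipc_recv, PTE_SHARE) and a hypothetical client-side address DATA_VA:

    /* File server side: reply to a request by mapping the buffer-cache page
       into the client instead of copying its bytes. */
    void serve_block(envid_t client, uint32_t blockno)
    {
        void *blk = diskaddr(blockno);   /* buffer-cache page for this block */
        ipc_send(client, 0, blk, PTE_P | PTE_U | PTE_SHARE);
    }

    /* Client side: receive the mapping at DATA_VA (a hypothetical address the
       client reserves for this). The page is now mapped in both address
       spaces, and the kernel's pp_ref for the physical page counts both. */
    void fetch_block(void)
    {
        ipc_recv(NULL, (void *)DATA_VA, NULL);
    }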
If we want IO-Lite-style functionality, we need to generalize the IPC interface so that we can send I/O vectors (i.e., scatter/gather structures) rather than a single page. This extension would allow intermediate environments to add headers, etc., without having to copy.
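A purely hypothetical sketch of such a generalization (ipc_send_vec and struct ipc_iov are invented names, not part of the JOS labs): the sender names a list of spans of its address space, the kernel maps the underlying pages into the receiver, and an intermediate environment can prepend an HTTP header by reference:

    /* Hypothetical scatter/gather IPC: each entry names a span of the
       sender's address space to be mapped (not copied) into the receiver. */
    struct ipc_iov {
        void   *va;     /* start of span in sender's address space */
        size_t  len;    /* length in bytes */
    };

    /* Hypothetical generalization of ipc_send; not part of the JOS labs. */
    int ipc_send_vec(envid_t to, struct ipc_iov *iov, int niov, int perm);

    /* Web server environment: forward a file to the network server with an
       HTTP header prepended, without copying the file data. */
    void send_response(envid_t net_server, char *hdr, size_t hdr_len,
                       void *file_pages, size_t file_len)
    {
        struct ipc_iov iov[2] = {
            { hdr,        hdr_len  },   /* header built by the web server */
            { file_pages, file_len },   /* file data mapped from the file server */
        };
        ipc_send_vec(net_server, iov, 2, PTE_P | PTE_U | PTE_SHARE);
    }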
With IO-Lite, how many copies remain? In how many places is the data buffered? What cross-subsystem optimizations become possible?
How do IO-Lite and paging interact?
What is the page replacement policy?
Is IO-Lite worth it? (See figures 3 and 4.) Is it worth it for small pages? What limits the performance of IO-Lite? How about Flash and Apache?
How are CGI-bin apps implemented? Is it worth it? (See figures 5 and 6).
Is double buffering a problem? (See figure 7).
Is there any app other than a Web server that benefits? (See figure 8.)