"Application Performance and Flexibility on Exokernel Systems"
Kaashoek, Engler, Ganger, Briceno, Hunt, Mazieres, Pinckney,
Grimm, Jannotti, Mackenzie

What's the overall point?
  O/S abstractions get in the way of aggressive applications.
  Not about performance of individual operations (e.g. system call or IPC).
  The problem is application structure:
    you often just can't do what you want in an ordinary O/S.

How will we be able to tell if they are right?
  They need to demonstrate an app with structure impossible in UNIX.
  And they need to show the app is desirable,
    i.e. *much* higher performance, or *much* more functionality.

What's an abstraction?
  Typically a virtualization of some hardware resource. Examples:
    disk blocks vs file systems
    physical memory vs address space / process
    CPU vs time slicing or scheduler activations
    TLB entries vs address spaces
    frame buffer vs windows
    Ethernet frames vs TCP/IP

Why would you want abstractions?
  More convenient API.
  Allows sharing (files, TCP ports).
  Helps with composable, re-usable, general-purpose applications,
    e.g. "standard output" and UNIX pipes.
  Helps make applications portable to different hardware.
  Mediates/protects shared resources:
    apps don't get direct hardware access;
    the O/S mediates all accesses to enforce protection.
    E.g. files have owners; apps can't directly read the disk.
    This is the only deep reason!

OK, let's design a high-performance application
and see whether we run into trouble with UNIX abstractions.
  We'd like to DMA data directly from the disk buffer cache to the net,
  or stream data at full speed from disk to net.

What actually happens on UNIX in a simple web server:
  SYN, SYN/ACK, ACK -- now tell the process.
  Request arrives, is ACKed.
  Copy request data -> process.
  open() may block;
    UNIX directory structure has some O(N) problems.
  read() from the file:
    disk -> buffer cache (maybe)
    buffer cache -> application (always)
  write() to the TCP connection:
    application -> mbufs in the TCP retransmit queue
    (TCP must keep a copy for possible retransmission.)
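The copy-heavy UNIX data path above can be sketched as a naive server loop. This is an illustrative sketch, not code from the paper; `serve_file` is a hypothetical name, and the comments mark where the copies listed above happen.

```python
import os

def serve_file(conn, path):
    """Naive UNIX-style data path for one request (a sketch).
    Every chunk of the file crosses the user/kernel boundary twice."""
    fd = os.open(path, os.O_RDONLY)      # open() may block on directory lookup
    try:
        while True:
            buf = os.read(fd, 65536)     # disk -> buffer cache (maybe),
                                         # then buffer cache -> application
            if not buf:
                break
            conn.sendall(buf)            # application -> kernel mbufs;
                                         # TCP keeps this copy for retransmit
    finally:
        os.close(fd)
```

The point is structural: even a perfect implementation of this loop copies each byte at least twice, because the abstractions (file descriptors, sockets) hide the buffer cache and the retransmit queue from the application.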
  Packetization may be different from disk block-ization.
  TCP segments the data, computes checksums, sends -> net.
  As ACKs arrive, TCP sends more.
  May decide to retransmit, and then re-computes the checksum.
  Figure 3 shows about 700 requests per second (from cache).

What Cheetah does (Section 7.3):
  Avoids copies: just disk -> cache -> net.
    Retransmits out of the disk cache.
  Stores a precomputed TCP checksum per block in the file,
    i.e. the file format looks a bit like the packet format.
    Avoids checksum costs;
    in particular, avoids *re*-computing checksums on retransmit.
  Intelligent ACK merging:
    ACKs the request with the first data packet.
  Intelligent clustering on disk:
    GIFs near the pages that use them; inodes near data.
  Pre-fetches intelligently.

How do we know if Cheetah is a good idea?
  Performance data in Figure 3.
  Result: 8000 requests per second -- a factor of 10 faster.
  From cache or disk? (Must be cache.)
  Same document over and over, or a distribution?
  Why such a speedup for 0-byte documents?
    Not due to HTML-based file grouping.
    Probably not due to copy avoidance.
    Probably not due to checksum avoidance (no data...).
    Maybe due to eliminating one ACK...
  Can we explain the performance increase for 100-kbyte documents
    in terms of memory copies avoided?

What facilities does Cheetah need from the O/S?
  User-level TCP, and thus low-level access to packet I/O.
  Control over memory,
    at least to avoid the copies disk -> kbuf -> user -> mbuf -> net.
  Direct access to the disk cache.
  Async reads, including of meta-data.
  Control over disk layout.

Why are these facilities hard to provide in UNIX?
  User-level TCP: protection/sharing means no raw access to incoming packets.
  Disk layout: can't let apps have direct disk access.
  Typically the problem is protection of shared resources.

What's the exokernel's general approach?
  Move as much as possible into O/S libraries:
    libraries are easy to customize, and may be faster than system calls.
  Separate protection from management:
    the kernel just protects; apps manage.
    Expose allocation, physical names, revocation, information.

Let's design an exokernel network system.
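One way to picture "the kernel just protects; apps manage" for the network: the kernel's entire job could be a first-come-first-served port ownership table, with TCP itself living in application libraries. A minimal sketch (class and method names are hypothetical, not the exokernel's actual interface):

```python
class PortTable:
    """Sketch of protection-only network demultiplexing:
    the kernel tracks who owns each dst port and nothing else."""

    def __init__(self):
        self.owner = {}                 # dst port -> owning process id

    def claim(self, pid, port):
        # First-come-first-served: accept unless another app holds the port.
        if self.owner.get(port, pid) != pid:
            return False
        self.owner[port] = pid
        return True

    def demux(self, port):
        # On packet arrival: which process gets packets for this dst port?
        return self.owner.get(port)
```

Note what's absent: no sequence numbers, no retransmission, no congestion control. All of that management is left to a library TCP in each application; the kernel only arbitrates the shared resource (port numbers).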
  Goal: support user-level TCP.
  Can we just hand all incoming packets to any program that wants them?
    I.e. just expose the raw hardware.
    No: I might see your packets.
    (Actually this is probably OK; any secure protocol encrypts...)
  Instead: I tell the kernel what dst port I want.
    The kernel accepts if no other app wants that port;
    rejects if some other app does.
    So the kernel implements just the port abstraction, not TCP &c.
    This gets us first-come-first-served port access.
  Can generalize to patterns, not just ports: the exokernel's DPF does this,
    by downloading pattern-matching code into the kernel.
  Where to put incoming packet data?
    We don't know which process will get a packet until it has arrived,
    so we must expose kernel network buffers to applications.
  Have we separated protection from management?

Do we believe their story?
  I.e. should we bag current O/S's and use exokernels and lib O/S's?
  Are exokernels easy to program?
  Are exokernel programs likely to be portable?
  Chaos if every program does its own abstractions?
  Are we likely to always be able to separate management from protection?
    Look at the XN file system; it's pretty complex.
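The "patterns, not just ports" generalization mentioned above can be caricatured as a declarative packet predicate that an application hands to the kernel. Real DPF compiles such predicates to machine code; this sketch just interprets a list of (offset, length, value) tests, and the header offsets used in the example assume an un-optioned IPv4 header (illustrative, not DPF's actual encoding):

```python
def make_filter(tests):
    """Caricature of a DPF-style downloaded filter: the app describes
    which raw packet bytes must match which values, and the kernel
    evaluates the predicate to demux incoming packets."""
    def match(pkt):
        return all(pkt[off:off + ln] == val for off, ln, val in tests)
    return match

# Example: "TCP packets to dst port 80", assuming no IP options:
# IP protocol byte at offset 9, TCP dst port at offsets 22-23.
http = make_filter([(9, 1, b"\x06"),
                    (22, 2, (80).to_bytes(2, "big"))])
```

Because the filter is data (or verified/compiled code) rather than an opaque callback, the kernel can check that filters don't overlap and can still enforce protection while applications define arbitrarily protocol-specific demultiplexing.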