Q: Is there any feasible way that an existing large system (say on the order of Google Search, Facebook, etc.) can switch over to using a system based on principles of IX?

A: One might not need to convert an entire system. For example, Facebook makes heavy use of memcached. Memcached has a very simple interface and (as the paper mentions) can be converted to use IX pretty easily, yielding much greater performance. Client code would not have to be modified, so Facebook could continue using all its other software with an IX-based memcached. It may be that only very simple services would profit from using IX anyway (and these are likely to be relatively easy to convert), because a complex service would likely not be network-limited, and thus might not see much benefit from using IX.

Q: What does it mean that middlebox dataplanes and IX run packets to completion? Doesn't this mean that they will waste time waiting to sync with the network controller?

A: In a traditional network stack, incoming data is buffered in queues at many points during input and output. The NIC's DMA ring is one such queue, and the TCP implementation in the kernel puts data in a buffer waiting for the application to read() it. The application usually reads just one request at a time, and write()s the reply to a socket; the kernel buffers the reply in its TCP output buffers. At some later point the TCP code in the kernel creates a TCP/IP packet with the content and puts that packet on an outgoing IP queue. Later still, the NIC device driver moves the packet to the outgoing NIC DMA ring. This is a convenient arrangement, since it allows different modules to process data independently. But it's not efficient: there are lots of queues that have to be locked, and lots of chances for data to sit around for a while and be evicted from the CPU's data cache.

In contrast, IX takes a small batch of packets from the NIC incoming DMA queue and causes them all to be completely processed (including application processing and generation of replies) before moving to the next batch. Fewer queues are involved (just the NIC DMA rings), which eliminates some costs, and packet data is more likely to stay in the CPU data cache. More subtly, process-each-batch-to-completion helps ensure that IX polls NIC DMA queues for input at a rate that minimizes latency and precisely matches the system's capacity to serve requests, which turns out to be hard to achieve otherwise.

Run-to-completion requires a very different software structure from the traditional UNIX structure of independent protocol-processing modules connected by queues or buffers. The traditional queues/buffers scheme allows the modules to execute asynchronously -- each can have its own policy for when it reads, processes, and replies to requests. For example, an application could read input in one thread, process it (for example, read from the disk) in another thread, and send the reply in a third thread. This kind of flexibility is not possible in a run-to-completion scheme: there can really only be one thread, which steps through the different processing modules one at a time in a fixed order. That's fine for simple situations, but not very flexible.
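To make the run-to-completion structure concrete, here is a minimal sketch of such a loop. Everything in it (rx_ring_next, tcp_input, app_handle, tcp_output, tx_ring_push, the pkt struct) is a made-up stand-in, not the real IX code; the point is only the shape of the control flow -- drain a batch, run every stage over it, then touch the ring again.

  /* Minimal sketch of a run-to-completion loop.  All types and
   * functions are hypothetical stand-ins, not the real IX code. */
  #include <stddef.h>

  #define BATCH_MAX 64

  struct pkt { char data[1514]; size_t len; };

  /* Stubs: a real system would talk to the NIC's RX/TX DMA rings.
   * This stub always reports an empty ring. */
  static int  rx_ring_next(struct pkt **p) { (void)p; return 0; }
  static void tx_ring_push(struct pkt *p)  { (void)p; }

  static struct pkt *tcp_input(struct pkt *p)    { return p; }  /* strip headers, update TCP state */
  static struct pkt *app_handle(struct pkt *req) { return req; }/* e.g. look up a key, build a reply */
  static struct pkt *tcp_output(struct pkt *r)   { return r; }  /* add TCP/IP headers */

  void elastic_thread_loop(void)
  {
      struct pkt *batch[BATCH_MAX];

      for (;;) {
          /* Take whatever is waiting in the RX ring, up to BATCH_MAX. */
          int n = 0;
          while (n < BATCH_MAX && rx_ring_next(&batch[n]))
              n++;

          /* Each stage runs over the whole batch before the next stage
           * starts, and the whole batch (replies included) finishes
           * before the RX ring is polled again.  No intermediate queues. */
          for (int i = 0; i < n; i++) batch[i] = tcp_input(batch[i]);
          for (int i = 0; i < n; i++) batch[i] = app_handle(batch[i]);
          for (int i = 0; i < n; i++) tx_ring_push(tcp_output(batch[i]));
      }
  }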
Q: According to the paper, the reason that the networking stacks of regular operating systems like Linux can't compete with specialized systems like IX is that they weren't designed for that kind of high-intensity usage -- huge numbers of small packets on relatively small networks with many-gigabit connections. But Linux is used ubiquitously in data centers and similar environments, and is a constantly-changing and improving system. Would it not be more effective in the long run to integrate these changes into Linux directly, rather than implementing them as some separate system (one that isn't likely to be used in production)?

A: Yes -- the ideal situation would be for a general-purpose operating system to also be very efficient. Over the years Linux has adopted many techniques to improve its network stack's efficiency. It has some support for polling, RSS NICs, and multi-core execution of the network stack, for example. However, some techniques in IX appear to be an awkward fit for a general-purpose kernel's network stack. Run-to-completion would require big changes to every inter-layer interface. Serious polling would require dedicated CPUs, which are a problem if there might be lots of programs running (window systems, text editors, compilers, games, &c). Lock-free parallel processing on a multi-core processor requires cooperation at many levels, to ensure that all layers involved are running on the same core, and to ensure that the NIC correctly predicts which core that is. Maybe all this could be integrated into the Linux stack with as great efficiency as IX, but no one has figured out how to do it yet.

Q: While IX decreases latency and increases throughput, I'm unclear what tradeoff is made. That is, is there something that is lost?

A: IX is specialized to dedicated servers, using TCP, with many clients, processing short requests/responses, and not requiring much work per request. It is separate from the rest of Linux; for example, IX sockets don't appear as Linux file descriptors. It requires dedicated cores and NIC queues, which might be difficult to manage in a more general-purpose environment, e.g. on a workstation running lots of different programs.

Q: The paper stresses that they use adaptive batching, choosing batching only when there's congestion. Do commodity OSes commonly try to apply batching all the time?

A: It's hard for most existing kernels to do batching throughout the kernel/application code, because the system call API (and the inter-module interfaces within the kernel) don't have much support for batching. For example, if the application calls read() on a network socket, the definition of read() says that the kernel can only give it data that has arrived on that socket, not a big batch of data from many different sockets. On the other hand, there are smaller opportunities for batching even in traditional kernels. For example, when the packet arrival rate is high, the NIC will typically interrupt much less often than packets arrive (i.e., batching multiple packets per interrupt).

Q: How does zero-copy communication work? How is a computer able to directly access memory of another computer without security consequences? What is the zero-copy API referring to?

A: In ordinary UNIX, the read() system call copies incoming data from a kernel buffer into user memory, and write() copies data from user memory into kernel memory. If you're sending lots of data at high speed, these copies can decrease throughput. What the paper means by "zero-copy" is that network messages aren't copied between kernel and user space. IX achieves zero-copy by using the MMU to map the physical pages holding packet buffers into both the application's address space and IX's address space. The NIC DMAs into that memory, and then IX tells the application (via the return from run_io()) where to look in the shared memory for the incoming data. So the only copies involved are the ones performed by DMA.
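A rough user-space analogy of the zero-copy idea (not IX's actual mechanism, and error handling is omitted): one side deposits a "packet" directly into a mapping shared by two processes (standing in for the NIC's DMA), and then passes only a small descriptor -- offset and length -- rather than the payload. The struct desc and the specific layout below are invented for illustration.

  /* Zero-copy analogy: share one buffer between two processes and
   * pass only a descriptor, never the payload.  Illustration only. */
  #define _DEFAULT_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  struct desc { size_t off; size_t len; };  /* roughly what run_io() conceptually hands back */

  int main(void)
  {
      /* One region mapped into both address spaces (parent and child). */
      char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      int fds[2];
      pipe(fds);

      if (fork() == 0) {
          /* "Application": learn where the data is, read it in place. */
          struct desc d;
          read(fds[0], &d, sizeof d);
          printf("got %zu bytes at offset %zu: %s\n", d.len, d.off, buf + d.off);
          return 0;
      }

      /* "IX/NIC side": deposit the payload straight into the shared
       * buffer (standing in for DMA), then hand over just a descriptor. */
      const char *msg = "GET foo";
      size_t off = 128;
      memcpy(buf + off, msg, strlen(msg) + 1);
      struct desc d = { off, strlen(msg) + 1 };
      write(fds[1], &d, sizeof d);
      wait(NULL);
      return 0;
  }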
Q: Why did they have to use a different API than the POSIX API? Is it just for the zero-copy benefit?

A: You are right that one reason for a different API is to get zero copy. Another reason is batching -- the run_io() system call returns lots of incoming packets from many clients, and the application thread's next run_io() call can give IX lots of reply packets to send to clients.

Q: What does it mean for an API to be commutative, and why is that important for making the syscall implementations synchronization-free?

A: Two system calls commute if their specifications show that they don't interact -- that neither affects the result of the other. The fact that two calls commute means they can be implemented without sharing any data, without waiting for each other's locks, &c. And that in turn means that simultaneous calls on different cores will execute in parallel, which is good for performance.

Q: The paper starts by defining the control plane as kernel management and scheduling, and a dataplane as network processing. Later on the paper says that the dataplane specifically runs the network stack and application logic. However, when describing the isolation the IX OS provides, it says that it separates the control plane, the dataplane, and untrusted user code. What is the untrusted user code if not the application code handled by the dataplane, and why is it accounted for separately?

A: The application runs at CPL=3, with a page table that gives it access only to its own memory and some packet buffers shared with IX. IX runs at CPL=0, and sets up the page table so that the application cannot read or write any of IX's memory (except the shared packet buffers). It's true that the paper sometimes uses "dataplane" to refer to the combination of IX and the application, and sometimes to refer to IX alone; that's confusing.

Q: In IX, user applications can't send raw packets at all? They can only use the IX TCP stack?

A: Correct. In terms of Figure 1(a), the application code (httpd or memcached) performs system calls into IX to ask it to send and receive TCP data. Really the only system call is run_io(), which carries a batch of output data and returns a batch of input. IX (running at CPL=0) can directly send and receive packets by manipulating the DMA ring that the NIC looks at.

Q: If that's the case -- according to "4.3 Dataplane API and Operation", IX directly exposes flow control conditions to the application. If the application only has the ability to control the data, and not the internals of the protocol, why is this information useful?

A: If you call write() on a socket on ordinary Linux/POSIX, the write() copies the data to a buffer in the kernel and returns. Or, if the buffer is full, the write() blocks in the kernel until TCP has sent some data and freed up the corresponding space in the buffer. In IX, the sendv() either sends the data (in TCP packets) immediately or returns an error; it doesn't buffer the data for sending later. Maybe the most crucial reason for this behavior is that the run_io() system call has to return pretty much immediately with the next batch of incoming events, so it can't afford to block waiting for data to be sent or for buffer space to open up. Keep in mind that the real system call is run_io(), and that it both hands a batch of data to IX for it to send on many different connections, and returns a batch of new incoming events from many connections.
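Here is a rough sketch of what an application's main loop around such a batched call might look like. The event and command structs, and the run_io() signature shown, are invented for illustration; the real IX API differs in its details.

  /* Sketch of an application event loop built around one batched
   * system call.  Types and the run_io() signature are hypothetical. */
  #include <stddef.h>

  #define MAX_EVENTS 64

  enum ev_type { EV_RECV, EV_SENT, EV_CLOSED };

  struct event {                /* one incoming event from IX */
      enum ev_type type;
      int          flow;        /* which TCP connection */
      void        *buf;         /* points into the shared packet buffers */
      size_t       len;
  };

  struct cmd {                  /* one outgoing request to IX, e.g. "send this" */
      int          flow;
      const void  *buf;
      size_t       len;
  };

  /* Hypothetical batched call: submit a batch of commands, get back a
   * batch of new events.  Stubbed out here so the sketch compiles. */
  static int run_io(struct cmd *cmds, int ncmds, struct event *evs, int maxevs)
  {
      (void)cmds; (void)ncmds; (void)evs; (void)maxevs;
      return 0;
  }

  /* Application logic: build a reply for one request, in place. */
  static struct cmd handle_request(struct event *ev)
  {
      struct cmd c = { ev->flow, ev->buf, ev->len };  /* e.g. echo the request */
      return c;
  }

  void app_main_loop(void)
  {
      struct event evs[MAX_EVENTS];
      struct cmd   cmds[MAX_EVENTS];
      int ncmds = 0;

      for (;;) {
          /* Hand IX the replies generated last iteration; collect new
           * input.  The call returns as soon as anything is waiting, so
           * light load naturally means small batches and low latency. */
          int nev = run_io(cmds, ncmds, evs, MAX_EVENTS);

          ncmds = 0;
          for (int i = 0; i < nev; i++)
              if (evs[i].type == EV_RECV)
                  cmds[ncmds++] = handle_request(&evs[i]);
      }
  }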
Q: Since there's one dataplane per application, each with memory isolation, how does IX handle multiple different applications sending and receiving packets at the same time? Do all the dataplanes receive the same packets (all of them)? Is there any conflict between the applications when sending packets out? Is each dataplane equivalent in the capability of its API (e.g. do some applications have dataplanes that lack some capabilities, depending on the app)?

A: IX uses a feature of modern NICs called RSS to tell the NIC to deliver each application's packets to just the IX instance responsible for that application. The RSS feature in the NIC would use the port number to decide which application a packet belongs to. The NIC supports multiple DMA queues, one per application. Each IX dataplane is identical, with the same API.

Q: Why is the common case that most timers are canceled before they expire?

A: The timers are there to detect when a transmitted TCP packet has been lost -- the sender expects an ACK packet from the receiver within a short time, and if a timeout expires without the ACK arriving, the sender assumes the original packet was lost and re-sends it. But loss rates are usually very low, so most ACKs will arrive quickly, and the TCP code will cancel the timer for the corresponding sent packet.

Q: In what cases would you want to use a background thread versus an elastic thread?

A: I think elastic threads can only be used to process network input and (for requests that can be served quickly) send replies. Everything else should be in a background thread so it doesn't hold up input processing.

Q: Suppose I was running a MySQL application that needs to perform a disk read for a particular query. Isn't this bad to do in an elastic thread, since it would completely freeze reading from the rx queue? In other words, since elastic threads are pinned to cores as well as to rx/tx queues (I think), doesn't that mean that high-latency operations within them severely impact performance?

A: I agree: blocking operations such as reading the disk should be done in a background thread. Presumably the request would arrive in an elastic thread, which would hand off the request to a background thread for the disk read; when the background thread finished, it would send a reply. IX seems aimed at services that handle millions of requests per second. If you have a server that only handles a few thousand requests per second, then IX probably won't solve any problems for you. Throughputs that high mean that each request has to be very cheap to process. If more than about 0.01% (100 out of each million) of the requests require a disk read, that alone will probably prevent you from achieving the level of performance that would make IX attractive.
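One plausible shape for that hand-off, sketched with ordinary pthreads and invented names (this is not IX's actual elastic/background thread machinery), is a small work queue: the fast thread enqueues the blocking job and goes back to polling, while a background thread does the slow disk read and later produces the reply. The one-slot queue is deliberately oversimplified.

  /* Sketch: hand blocking work from a fast "elastic" thread to a slow
   * "background" thread.  Hypothetical names; pthreads, not IX. */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  struct job { int flow; const char *query; };

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
  static struct job pending;          /* one-slot queue; empty when query == NULL */

  static void *background_thread(void *arg)
  {
      (void)arg;
      for (;;) {
          pthread_mutex_lock(&lock);
          while (pending.query == NULL)
              pthread_cond_wait(&cond, &lock);
          struct job j = pending;
          pending.query = NULL;
          pthread_mutex_unlock(&lock);

          sleep(1);                                  /* stand-in for the disk read */
          printf("flow %d: reply for '%s'\n", j.flow, j.query);
          /* in IX the reply would go back out via the next run_io() batch */
      }
      return NULL;
  }

  /* Called from the elastic thread: queues the job and returns quickly. */
  static void defer_to_background(int flow, const char *query)
  {
      pthread_mutex_lock(&lock);
      pending.flow = flow;
      pending.query = query;
      pthread_cond_signal(&cond);
      pthread_mutex_unlock(&lock);
  }

  int main(void)
  {
      pthread_t bg;
      pthread_create(&bg, NULL, background_thread, NULL);
      defer_to_background(7, "SELECT * FROM users"); /* elastic thread hands off, keeps polling */
      sleep(2);                                      /* let the background thread finish */
      return 0;
  }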
Q: Are there any disadvantages to having larger pages? Why do contemporary OSes use page sizes that are < 2MB?

A: Large pages can waste a lot of memory. If a process needs to allocate only a few thousand bytes of memory, it's a shame for the O/S to be forced to allocate it a minimum of 2 MB. Modern x86 kernels use both page sizes, and try to adapt which they allocate based on process behavior. If a process allocates huge amounts of contiguous memory, the kernel will give it 2 MB pages; if the process allocates little bits of memory here and there in its address space, the kernel will give it 4K pages. It turns out to be hard to get all the details right; here's a paper: http://static.usenix.org/event/osdi02/tech/full_papers/navarro/navarro.pdf

Q: How does batching improve instruction cache locality and branch prediction accuracy?

A: Let's suppose the alternative is to process each incoming packet to completion before moving on to the next received packet. Then the system would do NIC driver processing for one packet, then TCP processing for one packet, then application processing for one packet. If the total code to do all the processing doesn't fit in the instruction cache, then each packet would encounter instruction cache misses for each kind of processing. Since branch prediction also uses a limited-size cache indexed by instruction address, you'd also get branch prediction misses. If a system uses batching, then it does each kind of processing on a batch of packets before moving on to the next kind of processing. So, for example, the first packet in a batch would get instruction cache misses during TCP processing, but the rest of the packets in the batch would not.

Q: By batching system calls, won't that affect latency?

A: IX doesn't wait when batching. When the application calls run_io() and there is even a single packet waiting in the NIC receive DMA queue, run_io() will return immediately with the waiting packet(s). So IX won't increase latency under light load. Under heavy load (lots of packets waiting to be processed), IX reduces latency because it is more efficient than Linux -- IX can complete the batch faster than Linux would be able to process the same set of packets.

Q: What is the convergence time of this adaptive batching? While the OS is still figuring out optimal batch sizes, since there are no intermediate buffers, a lot of packets would be dropped, so this is a relevant question.

A: My understanding is that IX doesn't explicitly adjust the batch size. Instead, when the application calls run_io(), IX returns whatever packets are waiting in the NIC DMA queue (up to some fixed maximum limit). Under low network load, only a few packets will have arrived since the application's last call to run_io(), so the batch will naturally be small. Under high network load, lots of packets will have arrived since the last run_io(), so the batch size will automatically be large.

Q: Can the network server in JOS also take advantage of batching system calls like IX?

A: I suspect JOS could be modified to support batching. For example, you could imagine sending and receiving multiple IPC messages per system call, or sending and receiving multiple packets to/from the NIC. This would reduce the CPU time spent in user/kernel crossings, which might improve performance.
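For instance, a JOS-style batched packet-receive system call might look roughly like the sketch below. The syscall name, descriptor layout, and ring helper are all invented for illustration (and the user-pointer checks a real kernel would do are omitted); JOS's real network interface is different. Note how the batch size "adapts" for free: the call simply returns however many packets had already arrived.

  /* Sketch of a hypothetical batched receive system call for a
   * JOS-like kernel.  All names and layouts are invented. */
  #include <stddef.h>
  #include <stdint.h>

  #define RECV_BATCH_MAX 32

  struct pkt_desc {
      uint32_t len;
      char     data[1518];
  };

  /* Kernel-side stub for the NIC's receive DMA ring (0 = ring empty). */
  static int nic_rx_ring_pop(struct pkt_desc *d) { (void)d; return 0; }

  /*
   * sys_net_recv_batch: fill in up to 'max' received packets in one
   * user/kernel crossing, returning however many were already waiting.
   */
  int sys_net_recv_batch(struct pkt_desc *user_descs, int max)
  {
      if (max > RECV_BATCH_MAX)
          max = RECV_BATCH_MAX;

      int n = 0;
      while (n < max && nic_rx_ring_pop(&user_descs[n]))
          n++;
      return n;   /* 0 means nothing was waiting; the caller can retry or yield */
  }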