Q: Is there any feasible way that an existing large system (say on the order of Google Search, Facebook, etc.) can switch over to using a system based on principles of IX?

A: One might not need to convert an entire system. For example, Facebook makes heavy use of memcached. Memcached has a very simple interface and (as the paper mentions) can be converted to use IX pretty easily, yielding much greater performance. Client code would not have to be modified, so Facebook could continue using all its other software with an IX-based memcached. It may be that only very simple services would profit from using IX anyway (and these are likely to be relatively easy to convert), because a complex service would likely not be network-limited, and thus might not see much benefit from using IX.

Q: What does it mean that middlebox dataplanes and IX run packets to completion? Doesn't this mean that they will waste time waiting to sync with the network controller?

A: In a traditional network stack, incoming data is buffered in queues at many points during input and output. The NIC's DMA ring is one such queue, and the TCP implementation in the kernel puts data in a buffer waiting for the application to read() it. The application usually reads just one request at a time, and write()s the reply to a socket; the kernel buffers the reply in its TCP output buffers. At some later point the TCP code in the kernel creates a TCP/IP packet with the content and puts that packet on an outgoing IP queue. Later still, the NIC device driver moves the packet to the outgoing NIC DMA ring. This is a convenient arrangement, since it allows different modules to process data independently. But it's not efficient: there are lots of queues that have to be locked, and lots of chances for data to sit around for a while and be evicted from the CPU's data cache.

In contrast, IX takes a small batch of packets from the NIC incoming DMA queue and causes them all to be completely processed (including application processing and generation of replies) before moving to the next batch. Fewer queues are involved (just the NIC DMA rings), which eliminates some costs, and packet data is more likely to stay in the CPU data cache. More subtly, process-each-batch-to-completion helps ensure that IX polls NIC DMA queues for input at a rate that minimizes latency and precisely matches the system's capacity to serve requests, which turns out to be hard to achieve otherwise.

Run-to-completion requires a very different software structure from the traditional UNIX structure of independent protocol-processing modules connected by queues or buffers. The traditional queues/buffers scheme allows the modules to execute asynchronously -- each can have its own policy for when it reads, processes, and replies to requests. For example, an application could read input in one thread, process it (for example, read from the disk) in another thread, and send the reply in a third thread. This kind of flexibility is not possible in a run-to-completion scheme: there can really only be one thread, which steps through the different processing modules one at a time in a fixed order. That's fine for simple situations, but not very flexible.
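To make the run-to-completion structure concrete, here is a minimal sketch of such a loop. Everything in it (rx_ring_next, tcp_input, app_handle, tcp_output, tx_ring_push, the pkt struct) is a made-up stand-in, not the real IX code; the point is only the shape of the control flow -- drain a batch, run every stage over it, then touch the ring again.

  /* Minimal sketch of a run-to-completion loop.  All types and
   * functions are hypothetical stand-ins, not the real IX code. */
  #include <stddef.h>

  #define BATCH_MAX 64

  struct pkt { char data[1514]; size_t len; };

  /* Stubs: a real system would talk to the NIC's RX/TX DMA rings.
   * This stub always reports an empty ring. */
  static int  rx_ring_next(struct pkt **p) { (void)p; return 0; }
  static void tx_ring_push(struct pkt *p)  { (void)p; }

  static struct pkt *tcp_input(struct pkt *p)    { return p; }  /* strip headers, update TCP state */
  static struct pkt *app_handle(struct pkt *req) { return req; }/* e.g. look up a key, build a reply */
  static struct pkt *tcp_output(struct pkt *r)   { return r; }  /* add TCP/IP headers */

  void elastic_thread_loop(void)
  {
      struct pkt *batch[BATCH_MAX];

      for (;;) {
          /* Take whatever is waiting in the RX ring, up to BATCH_MAX. */
          int n = 0;
          while (n < BATCH_MAX && rx_ring_next(&batch[n]))
              n++;

          /* Each stage runs over the whole batch before the next stage
           * starts, and the whole batch (replies included) finishes
           * before the RX ring is polled again.  No intermediate queues. */
          for (int i = 0; i < n; i++) batch[i] = tcp_input(batch[i]);
          for (int i = 0; i < n; i++) batch[i] = app_handle(batch[i]);
          for (int i = 0; i < n; i++) tx_ring_push(tcp_output(batch[i]));
      }
  }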
Q: According to the paper, the reason that the networking stacks of regular operating systems like Linux can't compete with specialized systems like IX is that they weren't designed for that kind of high-intensity usage -- huge numbers of small packets on relatively small networks with many-gigabit connections. But Linux is used ubiquitously in data centers and similar environments, and is a constantly-changing and improving system. Would it not be more effective in the long run to integrate these changes into Linux directly, rather than implementing them as some separate system (one that isn't likely to be used in production)?

A: Yes -- the ideal situation would be for a general-purpose operating system to also be very efficient. Over the years Linux has adopted many techniques to improve its network stack's efficiency. It has some support for polling, RSS NICs, and multi-core execution of the network stack, for example. However, some techniques in IX appear to be an awkward fit for a general-purpose kernel's network stack. Run-to-completion would require big changes to every inter-layer interface. Serious polling would require dedicated CPUs, which are a problem if there might be lots of programs running (window systems, text editors, compilers, games, &c). Lock-free parallel processing on a multi-core processor requires cooperation at many levels, to ensure that all layers involved are running on the same core, and to ensure that the NIC correctly predicts which core that is. Maybe all this could be integrated into the Linux stack with as great efficiency as IX, but no one has figured out how to do it yet.

Q: While IX decreases latency and increases throughput, I'm unclear what tradeoff is made. That is, is there something that is lost?

A: IX is specialized to dedicated servers, using TCP, with many clients, processing short requests/responses, and not requiring much work per request. It is separate from the rest of Linux; for example, IX sockets don't appear as Linux file descriptors. It requires dedicated cores and NIC queues, which might be difficult to manage in a more general-purpose environment, e.g. on a workstation running lots of different programs.

Q: The paper stresses that they use adaptive batching, choosing batching only when there's congestion. Do commodity OSes commonly try to apply batching all the time?

A: It's hard for most existing kernels to do batching throughout the kernel/application code, because the system call API (and the inter-module interfaces within the kernel) don't have much support for batching. For example, if the application calls read() on a network socket, the definition of read() says that the kernel can only give it data that has arrived on that socket, not a big batch of data from many different sockets. On the other hand, there are smaller opportunities for batching even in traditional kernels. For example, when the packet arrival rate is high, the NIC will typically interrupt much less often than packets arrive (i.e., batching multiple packets per interrupt).

Q: How does zero-copy communication work? How is a computer able to directly access memory of another computer without security consequences? What is the zero-copy API referring to?

A: In ordinary UNIX, the read() system call copies incoming data from a kernel buffer into user memory, and write() copies data from user memory into kernel memory. If you're sending lots of data at high speed, these copies can decrease throughput. What the paper means by "zero-copy" is that network messages aren't copied between kernel and user space. IX achieves zero-copy by using the MMU to map the physical pages holding packet buffers into both the application's address space and IX's address space. The NIC DMAs into that memory, and then IX tells the application (via the return from run_io()) where to look in the shared memory for the incoming data. So the only copies involved are the ones performed by DMA.
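A rough user-space analogy of the zero-copy idea (not IX's actual mechanism, and error handling is omitted): one side deposits a "packet" directly into a mapping shared by two processes (standing in for the NIC's DMA), and then passes only a small descriptor -- offset and length -- rather than the payload. The struct desc and the specific layout below are invented for illustration.

  /* Zero-copy analogy: share one buffer between two processes and
   * pass only a descriptor, never the payload.  Illustration only. */
  #define _DEFAULT_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  struct desc { size_t off; size_t len; };  /* roughly what run_io() conceptually hands back */

  int main(void)
  {
      /* One region mapped into both address spaces (parent and child). */
      char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      int fds[2];
      pipe(fds);

      if (fork() == 0) {
          /* "Application": learn where the data is, read it in place. */
          struct desc d;
          read(fds[0], &d, sizeof d);
          printf("got %zu bytes at offset %zu: %s\n", d.len, d.off, buf + d.off);
          return 0;
      }

      /* "IX/NIC side": deposit the payload straight into the shared
       * buffer (standing in for DMA), then hand over just a descriptor. */
      const char *msg = "GET foo";
      size_t off = 128;
      memcpy(buf + off, msg, strlen(msg) + 1);
      struct desc d = { off, strlen(msg) + 1 };
      write(fds[1], &d, sizeof d);
      wait(NULL);
      return 0;
  }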
Q: Why did they have to use a different API than the POSIX API? Is it just for the zero-copy benefit?

A: You are right that one reason for a different API is to get zero copy. Another reason is batching -- the run_io() system call returns lots of incoming packets from many clients, and the application thread's next run_io() call can give IX lots of reply packets to send to clients.

Q: What does it mean for an API to be commutative, and why is that important for making the syscall implementations synchronization-free?

A: Two system calls commute if their specifications show that they don't interact -- that neither affects the result of the other. The fact that two calls commute means they can be implemented without sharing any data, without waiting for each other's locks, &c. And that in turn means that simultaneous calls on different cores will execute in parallel, which is good for performance.

Q: The paper starts by defining the control plane as kernel management and scheduling, and a dataplane as network processing. Later on the paper says that the dataplane specifically runs the network stack and application logic. However, when describing the isolation the IX OS provides, it says that it separates the control plane, the dataplane, and untrusted user code. What is the untrusted user code if not the application code handled by the dataplane, and why is it accounted for separately?

A: The application runs at CPL=3, with a page table that gives it access only to its own memory and some packet buffers shared with IX. IX runs at CPL=0, and sets up the page table so that the application cannot read or write any of IX's memory (except the shared packet buffers). It's true that the paper sometimes uses "dataplane" to refer to the combination of IX and the application, and sometimes to refer to IX alone; that's confusing.

Q: In IX, user applications can't send raw packets at all? They can only use the IX TCP stack?

A: Correct. In terms of Figure 1(a), the application code (httpd or memcached) performs system calls into IX to ask it to send and receive TCP data. Really the only system call is run_io(), which carries a batch of output data and returns a batch of input. IX (running at CPL=0) can directly send and receive packets by manipulating the DMA ring that the NIC looks at.

Q: If that's the case -- according to "4.3 Dataplane API and Operation", IX directly exposes flow control conditions to the application. If the application only has the ability to control the data, and not the internals of the protocol, why is this information useful?

A: If you call write() on a socket on ordinary Linux/POSIX, the write() copies the data to a buffer in the kernel and returns. Or, if the buffer is full, the write() blocks in the kernel until TCP has sent some data and freed up the corresponding space in the buffer. In IX, the sendv() either sends the data (in TCP packets) immediately or returns an error; it doesn't buffer the data for sending later. Maybe the most crucial reason for this behavior is that the run_io() system call has to return pretty much immediately with the next batch of incoming events, so it can't afford to block waiting for data to be sent or for buffer space to open up. Keep in mind that the real system call is run_io(), and that it both hands a batch of data to IX for it to send on many different connections, and returns a batch of new incoming events from many connections.
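Here is a rough sketch of what an application's main loop around such a batched call might look like. The event and command structs, and the run_io() signature shown, are invented for illustration; the real IX API differs in its details.

  /* Sketch of an application event loop built around one batched
   * system call.  Types and the run_io() signature are hypothetical. */
  #include <stddef.h>

  #define MAX_EVENTS 64

  enum ev_type { EV_RECV, EV_SENT, EV_CLOSED };

  struct event {                /* one incoming event from IX */
      enum ev_type type;
      int          flow;        /* which TCP connection */
      void        *buf;         /* points into the shared packet buffers */
      size_t       len;
  };

  struct cmd {                  /* one outgoing request to IX, e.g. "send this" */
      int          flow;
      const void  *buf;
      size_t       len;
  };

  /* Hypothetical batched call: submit a batch of commands, get back a
   * batch of new events.  Stubbed out here so the sketch compiles. */
  static int run_io(struct cmd *cmds, int ncmds, struct event *evs, int maxevs)
  {
      (void)cmds; (void)ncmds; (void)evs; (void)maxevs;
      return 0;
  }

  /* Application logic: build a reply for one request, in place. */
  static struct cmd handle_request(struct event *ev)
  {
      struct cmd c = { ev->flow, ev->buf, ev->len };  /* e.g. echo the request */
      return c;
  }

  void app_main_loop(void)
  {
      struct event evs[MAX_EVENTS];
      struct cmd   cmds[MAX_EVENTS];
      int ncmds = 0;

      for (;;) {
          /* Hand IX the replies generated last iteration; collect new
           * input.  The call returns as soon as anything is waiting, so
           * light load naturally means small batches and low latency. */
          int nev = run_io(cmds, ncmds, evs, MAX_EVENTS);

          ncmds = 0;
          for (int i = 0; i < nev; i++)
              if (evs[i].type == EV_RECV)
                  cmds[ncmds++] = handle_request(&evs[i]);
      }
  }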
Q: Since there's one dataplane per application, each with memory isolation, how does IX handle multiple different applications sending and receiving packets at the same time? Do all the dataplanes receive the same packets (all of them)? Is there any conflict between the applications when sending packets out? Is each dataplane equivalent in the capability of its API (e.g. do some applications have dataplanes that lack some capabilities, depending on the app)?

A: IX uses a feature of modern NICs called RSS to tell the NIC to deliver each application's packets to just the IX instance responsible for that application. The RSS feature in the NIC would use the port number to decide which application a packet belongs to. The NIC supports multiple DMA queues, one per application. Each IX dataplane is identical, with the same API.

Q: Why is the common case that most timers are canceled before they expire?

A: The timers are there to detect when a transmitted TCP packet has been lost -- the sender expects an ACK packet from the receiver within a short time, and if a timeout expires without the ACK arriving, the sender assumes the original packet was lost and re-sends it. But loss rates are usually very low, so most ACKs will arrive quickly, and the TCP code will cancel the timer for the corresponding sent packet.

Q: In what cases would you want to use a background thread versus an elastic thread?

A: I think elastic threads can only be used to process network input and (for requests that can be served quickly) send replies. Everything else should be in a background thread so it doesn't hold up input processing.

Q: Suppose I was running a MySQL application that needs to perform a disk read for a particular query. Isn't this bad to do in an elastic thread, since it would completely freeze reading from the rx queue? In other words, since elastic threads are pinned to cores as well as to rx/tx queues (I think), doesn't that mean that high-latency operations within them severely impact performance?

A: I agree: blocking operations such as reading the disk should be done in a background thread. Presumably the request would arrive in an elastic thread, which would hand off the request to a background thread for the disk read; when the background thread finished, it would send a reply. IX seems aimed at services that handle millions of requests per second. If you have a server that only handles a few thousand requests per second, then IX probably won't solve any problems for you. Throughputs that high mean that each request has to be very cheap to process. If more than about 0.01% (100 out of each million) of the requests require a disk read, that alone will probably prevent you from achieving the level of performance that would make IX attractive.
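One plausible shape for that hand-off, sketched with ordinary pthreads and invented names (this is not IX's actual elastic/background thread machinery), is a small work queue: the fast thread enqueues the blocking job and goes back to polling, while a background thread does the slow disk read and later produces the reply. The one-slot queue is deliberately oversimplified.

  /* Sketch: hand blocking work from a fast "elastic" thread to a slow
   * "background" thread.  Hypothetical names; pthreads, not IX. */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  struct job { int flow; const char *query; };

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
  static struct job pending;          /* one-slot queue; empty when query == NULL */

  static void *background_thread(void *arg)
  {
      (void)arg;
      for (;;) {
          pthread_mutex_lock(&lock);
          while (pending.query == NULL)
              pthread_cond_wait(&cond, &lock);
          struct job j = pending;
          pending.query = NULL;
          pthread_mutex_unlock(&lock);

          sleep(1);                                  /* stand-in for the disk read */
          printf("flow %d: reply for '%s'\n", j.flow, j.query);
          /* in IX the reply would go back out via the next run_io() batch */
      }
      return NULL;
  }

  /* Called from the elastic thread: queues the job and returns quickly. */
  static void defer_to_background(int flow, const char *query)
  {
      pthread_mutex_lock(&lock);
      pending.flow = flow;
      pending.query = query;
      pthread_cond_signal(&cond);
      pthread_mutex_unlock(&lock);
  }

  int main(void)
  {
      pthread_t bg;
      pthread_create(&bg, NULL, background_thread, NULL);
      defer_to_background(7, "SELECT * FROM users"); /* elastic thread hands off, keeps polling */
      sleep(2);                                      /* let the background thread finish */
      return 0;
  }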
Q: Are there any disadvantages to having larger pages? Why do contemporary OSes use page sizes that are < 2MB?

A: Large pages can waste a lot of memory. If a process needs to allocate only a few thousand bytes of memory, it's a shame for the O/S to be forced to allocate it a minimum of 2 MB. Modern x86 kernels use both page sizes, and try to adapt which they allocate based on process behavior. If a process allocates huge amounts of contiguous memory, the kernel will give it 2 MB pages; if the process allocates little bits of memory here and there in its address space, the kernel will give it 4K pages. It turns out to be hard to get all the details right; here's a paper: http://static.usenix.org/event/osdi02/tech/full_papers/navarro/navarro.pdf

Q: How does batching improve instruction cache locality and branch prediction accuracy?

A: Let's suppose the alternative is to process each incoming packet to completion before moving on to the next received packet. Then the system would do NIC driver processing for one packet, then TCP processing for one packet, then application processing for one packet. If the total code to do all the processing doesn't fit in the instruction cache, then each packet would encounter instruction cache misses for each kind of processing. Since branch prediction also uses a limited-size cache indexed by instruction address, you'd also get branch prediction misses. If a system uses batching, then it does each kind of processing on a batch of packets before moving on to the next kind of processing. So, for example, the first packet in a batch would get instruction cache misses during TCP processing, but the rest of the packets in the batch would not.

Q: By batching system calls, won't that affect latency?

A: IX doesn't wait when batching. When the application calls run_io() and there is even a single packet waiting in the NIC receive DMA queue, run_io() will return immediately with the waiting packet(s). So IX won't increase latency under light load. Under heavy load (lots of packets waiting to be processed), IX reduces latency because it is more efficient than Linux -- IX can complete the batch faster than Linux would be able to process the same set of packets.

Q: What is the convergence time of this adaptive batching? While the OS is still figuring out optimal batch sizes, since there are no intermediate buffers, a lot of packets would be dropped, so this is a relevant question.

A: My understanding is that IX doesn't explicitly adjust the batch size. Instead, when the application calls run_io(), IX returns whatever packets are waiting in the NIC DMA queue (up to some fixed maximum limit). Under low network load, only a few packets will have arrived since the application's last call to run_io(), so the batch will naturally be small. Under high network load, lots of packets will have arrived since the last run_io(), so the batch size will automatically be large.

Q: Can the network server in JOS also take advantage of batching system calls like IX?

A: I suspect JOS could be modified to support batching. For example, you could imagine sending and receiving multiple IPC messages per system call, or sending and receiving multiple packets to/from the NIC. This would reduce the CPU time spent in user/kernel crossings, which might improve performance.
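For instance, a JOS-style batched packet-receive system call might look roughly like the sketch below. The syscall name, descriptor layout, and ring helper are all invented for illustration (and the user-pointer checks a real kernel would do are omitted); JOS's real network interface is different. Note how the batch size "adapts" for free: the call simply returns however many packets had already arrived.

  /* Sketch of a hypothetical batched receive system call for a
   * JOS-like kernel.  All names and layouts are invented. */
  #include <stddef.h>
  #include <stdint.h>

  #define RECV_BATCH_MAX 32

  struct pkt_desc {
      uint32_t len;
      char     data[1518];
  };

  /* Kernel-side stub for the NIC's receive DMA ring (0 = ring empty). */
  static int nic_rx_ring_pop(struct pkt_desc *d) { (void)d; return 0; }

  /*
   * sys_net_recv_batch: fill in up to 'max' received packets in one
   * user/kernel crossing, returning however many were already waiting.
   */
  int sys_net_recv_batch(struct pkt_desc *user_descs, int max)
  {
      if (max > RECV_BATCH_MAX)
          max = RECV_BATCH_MAX;

      int n = 0;
      while (n < max && nic_rx_ring_pop(&user_descs[n]))
          n++;
      return n;   /* 0 means nothing was waiting; the caller can retry or yield */
  }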