# Parallel & Distributed Operating Systems Group

## 2018

**Noria: dynamic, partially-stateful data-flow for high-performance web applications.** Jon Gjengset, Malte Schwarzkopf, Jonathan Behrens, Lara Timbó Araújo, Martin Ek, Eddie Kohler, M. Frans Kaashoek, and Robert Morris. OSDI 2018.

We introduce partially-stateful data-flow, a new streaming data-flow model that supports eviction and reconstruction of data-flow state on demand. By avoiding state explosion and supporting live changes to the data-flow graph, this model makes data-flow viable for building long-lived, low-latency applications, such as web applications. Our implementation, Noria, simplifies the back-end infrastructure for read-heavy web applications while improving their performance.

A Noria application supplies a relational schema and a set of parameterized queries, which Noria compiles into a data-flow program that pre-computes results for reads and incrementally applies writes. Noria makes it easy to write high-performance applications without manual performance tuning or complex-to-maintain caching layers. Partial statefulness helps Noria limit its in-memory state without prior data-flow systems' restriction to windowed state, and helps Noria adapt its data-flow to schema and query changes while on-line. Unlike prior data-flow systems, Noria also shares state and computation across related queries, eliminating duplicate work.

On a real web application's queries, our prototype scales to 5x higher load than a hand-optimized MySQL baseline. Noria also outperforms a typical MySQL/memcached stack and the materialized views of a commercial database. It scales to tens of millions of reads and millions of writes per second over multiple servers, outperforming a state-of-the-art streaming data-flow system.
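The partially-stateful model can be sketched in a few lines. The following toy illustration (not Noria's actual API; all names are hypothetical) shows the core idea: cached query results are maintained incrementally on writes, may be evicted to bound memory, and are reconstructed on demand from base data on a miss.

```python
# Toy sketch of a partially-stateful materialized view: results are
# cached, may be evicted to bound memory, and are recomputed on
# demand from the base table on a miss (an "upquery" in Noria terms).

class PartialView:
    def __init__(self, base_rows, key_fn):
        self.base = base_rows        # base-table rows (source of truth)
        self.key_fn = key_fn         # parameterized-query key
        self.state = {}              # partially materialized results

    def write(self, row):
        """Incrementally apply a write to the base and any cached entry."""
        self.base.append(row)
        k = self.key_fn(row)
        if k in self.state:          # only maintain keys we hold
            self.state[k].append(row)

    def evict(self, key):
        self.state.pop(key, None)    # bound memory by dropping entries

    def read(self, key):
        if key not in self.state:    # miss: reconstruct from base data
            self.state[key] = [r for r in self.base
                               if self.key_fn(r) == key]
        return self.state[key]

view = PartialView([], key_fn=lambda row: row["story"])
view.write({"story": 1, "vote": "alice"})
view.write({"story": 1, "vote": "bob"})
assert len(view.read(1)) == 2
view.evict(1)                        # evicted state...
assert len(view.read(1)) == 2        # ...is rebuilt on demand
```

Unlike windowed state in prior streaming systems, nothing is permanently discarded: eviction only trades memory for a recomputation on the next read.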

@inproceedings{noria:osdi18,
title        = {Noria: dynamic, partially-stateful data-flow for
high-performance web applications},
author       = {Jon Gjengset and Malte Schwarzkopf and Jonathan
Behrens and Lara Timb{\'o} Ara{\'u}jo and Martin Ek and Eddie
Kohler and M. Frans Kaashoek and Robert Morris},
year         = 2018,
pages        = {213--231},
booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)},
month        = oct,
}
**Karaoke: Distributed private messaging immune to passive traffic analysis.** David Lazar, Yossi Gilad, and Nickolai Zeldovich. OSDI 2018.

Karaoke is a system for low-latency metadata-private communication. Karaoke provides differential privacy guarantees, and scales better with the number of users than prior such systems (Vuvuzela and Stadium). Karaoke achieves high performance by addressing two challenges faced by prior systems. The first is that differential privacy requires continuously adding noise messages, which leads to high overheads. Karaoke avoids this using optimistic indistinguishability: in the common case, Karaoke reveals no information to the adversary, and Karaoke clients can detect precisely when information may be revealed (thus requiring less noise). The second challenge lies in generating sufficient noise in a distributed system where some nodes may be malicious. Prior work either required each server to generate enough noise on its own, or used expensive verifiable shuffles to prevent any message loss. Karaoke achieves high performance using efficient noise verification, generating noise across many servers and using Bloom filters to efficiently check if any noise messages have been discarded. These techniques allow our prototype of Karaoke to achieve a latency of 6.8 seconds for 2M users. Overall, Karaoke’s latency is 5x to 10x better than Vuvuzela and Stadium.
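The noise-verification idea relies on a standard Bloom-filter property: membership tests never produce false negatives, so a discarded noise message is always detected. A minimal sketch (a generic Bloom filter, not Karaoke's implementation):

```python
import hashlib

# A minimal Bloom filter. The property Karaoke-style noise
# verification relies on: no false negatives, so if a server drops a
# noise message, the message's generator will notice it is missing.

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def contains(self, item):
        # No false negatives; rare false positives are possible.
        return all(self.bits[p] for p in self._positions(item))

# A server advertises the messages it forwarded; each noise generator
# checks that its own noise messages were not silently discarded.
forwarded = BloomFilter()
for msg in ["noise-1", "noise-2", "real-7"]:
    forwarded.add(msg)
assert forwarded.contains("noise-1")
assert forwarded.contains("noise-2")
```

The filter is cheap to transmit and check, which is what lets noise generation be spread across many servers instead of each server generating all the noise itself.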

@inproceedings{karaoke:osdi18,
title        = {Karaoke: Distributed Private Messaging Immune to
Passive Traffic Analysis},
author       = {David Lazar and Yossi Gilad and Nickolai Zeldovich},
booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)},
year         = 2018,
month        = oct,
}
**Proving confidentiality in a file system using DiskSec.** Atalay Ileri, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. OSDI 2018.

SFSCQ is the first file system with a machine-checked proof of security. To develop, specify, and prove SFSCQ, this paper introduces DiskSec, a novel approach for reasoning about confidentiality of storage systems, such as a file system. DiskSec addresses the challenge of specifying confidentiality using the notion of data noninterference to find a middle ground between strong and precise information-flow-control guarantees and the weaker but more practical discretionary access control. DiskSec factors out reasoning about confidentiality from other properties (such as functional correctness) using a notion of sealed blocks. Sealed blocks enforce that the file system treats confidential file blocks as opaque in the bulk of the code, greatly reducing the effort of proving data noninterference. An evaluation of SFSCQ shows that its theorems preclude security bugs that have been found in real file systems, that DiskSec imposes little performance overhead, and that the incremental effort of developing SFSCQ on top of DiskSec and DFSCQ (on which SFSCQ is based) is moderate.

@inproceedings{disksec:osdi18,
title        = {Proving confidentiality in a file system using
{DiskSec}},
author       = {Atalay Ileri and Tej Chajed and Adam Chlipala and M.
Frans Kaashoek and Nickolai Zeldovich},
booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)},
year         = 2018,
month        = oct,
}
**Verifying concurrent software using movers in CSPEC.** Tej Chajed, M. Frans Kaashoek, Butler Lampson, and Nickolai Zeldovich. OSDI 2018.

Writing concurrent systems software is error-prone, because multiple processes or threads can interleave in many ways, and it is easy to forget about a subtle corner case. This paper introduces CSPEC, a framework for formal verification of concurrent software, which ensures that no corner cases are missed. The key challenge is to reduce the number of interleavings that developers must consider. CSPEC uses mover types to re-order commutative operations so that usually it's enough to reason about only sequential executions rather than all possible interleavings. CSPEC also makes proofs easier by making them modular using layers, and by providing a library of reusable proof patterns. To evaluate CSPEC, we implemented and proved the correctness of CMAIL, a simple concurrent Maildir-like mail server that speaks SMTP and POP3. The results demonstrate that CSPEC's movers and patterns allow reasoning about sophisticated concurrency styles in CMAIL.
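The property mover types exploit is commutativity: when two operations commute, their interleavings collapse to a single sequential order. A toy illustration (CSPEC itself is a Coq framework; these mail operations are hypothetical):

```python
# Toy check of the commutativity property behind mover types: if two
# operations yield the same final state in either order, a proof may
# reorder them and reason about one sequential execution.

def deliver(state, user, msg):
    """Append a message to a per-user mailbox."""
    state.setdefault(user, []).append(msg)

def commutes(op_a, op_b):
    s1 = {}; op_a(s1); op_b(s1)
    s2 = {}; op_b(s2); op_a(s2)
    return s1 == s2

# Deliveries to different users commute, so a mover-based proof can
# serialize them; deliveries to the same mailbox do not commute.
assert commutes(lambda s: deliver(s, "alice", "m1"),
                lambda s: deliver(s, "bob", "m2"))
assert not commutes(lambda s: deliver(s, "alice", "m1"),
                    lambda s: deliver(s, "alice", "m2"))
```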

@inproceedings{cspec:osdi18,
title        = {Verifying concurrent software using movers in
{CSPEC}},
author       = {Tej Chajed and M. Frans Kaashoek and Butler Lampson
and Nickolai Zeldovich},
booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)},
year         = 2018,
month        = oct,
}
**The benefits and costs of writing a POSIX kernel in a high-level language.** Cody Cutler, M. Frans Kaashoek, and Robert T. Morris. OSDI 2018.

This paper presents an evaluation of the use of a high-level language (HLL) with garbage collection to implement a monolithic POSIX-style kernel. The goal is to explore if it is reasonable to use an HLL instead of C for such kernels, by examining performance costs, implementation challenges, and programmability and safety benefits.

The paper contributes Biscuit, a kernel written in Go that implements enough of POSIX (virtual memory, mmap, TCP/IP sockets, a logging file system, poll, etc.) to execute significant applications. Biscuit makes liberal use of Go's HLL features (closures, channels, maps, interfaces, garbage collected heap allocation), which subjectively made programming easier. The most challenging puzzle was handling the possibility of running out of kernel heap memory; Biscuit benefited from the analyzability of Go source to address this challenge.

On a set of kernel-intensive benchmarks (including NGINX and Redis) the fraction of kernel CPU time Biscuit spends on HLL features (primarily garbage collection and thread stack expansion checks) ranges up to 13%. The longest single GC-related pause suffered by NGINX was 115 microseconds; the longest observed sum of GC delays to a complete NGINX client request was 600 microseconds. In experiments comparing nearly identical system call, page fault, and context switch code paths written in Go and C, the Go version was 5% to 15% slower.

@inproceedings{biscuit:osdi18,
title        = {The benefits and costs of writing a {POSIX} kernel
in a high-level language},
author       = {Cody Cutler and M. Frans Kaashoek and Robert T.
Morris},
booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)},
year         = 2018,
month        = oct,
}
**A revised and verified proof of the scalable commutativity rule.** Lillian Tsai, Eddie Kohler, M. Frans Kaashoek, and Nickolai Zeldovich. arXiv:1809.09550, September 2018.

This paper explains a flaw in the published proof of the Scalable Commutativity Rule (SCR), presents a revised and formally verified proof of the SCR in the Coq proof assistant, and discusses the insights and open questions raised from our experience proving the SCR.

@misc{tsai-scr-proof,
title        = {A Revised and Verified Proof of the Scalable
Commutativity Rule},
author       = {Lillian Tsai and Eddie Kohler and M. Frans Kaashoek
and Nickolai Zeldovich},
year         = 2018,
month        = sep,
institution  = {{MIT} Computer Science and Artificial Intelligence
Laboratory},
howpublished = {arXiv:1809.09550},
}
**Multiverse databases for secure web applications.** Lara Timbó Araújo. Master's thesis, Massachusetts Institute of Technology, February 2018.

Most modern web applications authenticate users and enforce security policies in the application logic. Therefore, buggy applications can easily leak sensitive data. MultiverseDB addresses this problem with a new database architecture, where each user has her own private view of the database and declarative security policies restrict data-flow into a user's private universe.

To support multi-user universes, MultiverseDB builds on ideas of streaming data-flow systems and low-overhead materialized views. When a new user session starts, the system creates data-flow nodes to support the user queries and automatically inserts special nodes to enforce security policies at universe boundaries. MultiverseDB provides fast reads by storing the pre-computed results of user queries with policies already applied in incrementally maintained materialized views. To reduce space overheads created by these views and avoid redundant processing, MultiverseDB reuses views and allows system administrators to specify security groups for users subjected to the same security policies.

@mastersthesis{larat-meng,
title        = {Multiverse Databases for Secure Web Applications},
author       = {Lara Timb\'{o} Ara\'{u}jo},
school       = {Massachusetts Institute of Technology},
year         = 2018,
month        = feb,
}
**Veil: Private browsing semantics without browser-side assistance.** Frank Wang, James Mickens, and Nickolai Zeldovich. NDSS 2018.

All popular web browsers offer a “private browsing mode.” After a private session terminates, the browser is supposed to remove client-side evidence that the session occurred. Unfortunately, browsers still leak information through the file system, the browser cache, the DNS cache, and on-disk reflections of RAM such as the swap file.

Veil is a new deployment framework that allows web developers to prevent these information leaks, or at least reduce their likelihood. Veil leverages the fact that, even though developers do not control the client-side browser implementation, developers do control 1) the content that is sent to those browsers, and 2) the servers which deliver that content. Veil web sites collectively store their content on Veil's blinding servers instead of on individual, site-specific servers. To publish a new page, developers pass their HTML, CSS, and JavaScript files to Veil's compiler; the compiler transforms the URLs in the content so that, when the page loads on a user's browser, URLs are derived from a secret user key.

The blinding service and the Veil page exchange encrypted data that is also protected by the user's key. The result is that Veil pages can safely store encrypted content in the browser cache; furthermore, the URLs exposed to system interfaces like the DNS cache are unintelligible to attackers who do not possess the user's key. To protect against post-session inspection of swap file artifacts, Veil uses heap walking (which minimizes the likelihood that secret data is paged out), content mutation (which garbles in-memory artifacts if they do get swapped out), and DOM hiding (which prevents the browser from learning site-specific HTML, CSS, and JavaScript content in the first place).

Veil pages load on unmodified commodity browsers, allowing developers to provide stronger semantics for private browsing without forcing users to install or reconfigure their machines. Veil provides these guarantees even if the user does not visit a page using a browser's native privacy mode; indeed, Veil's protections are stronger than what the browser alone can provide.
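The key-derived URL idea can be sketched with a keyed hash. This is a hypothetical illustration (not Veil's actual scheme; the server name is made up): without the user's key, the names exposed to the browser cache and DNS cache are unlinkable to the underlying pages.

```python
import hashlib
import hmac

# Hypothetical sketch of per-user blinded URLs: resource names exposed
# to browser and DNS caches are keyed hashes, meaningless to an
# attacker who does not hold the user's secret key.

def blind_url(user_key: bytes, url: str) -> str:
    tag = hmac.new(user_key, url.encode(), hashlib.sha256).hexdigest()
    return f"https://blinding-server.example/{tag[:32]}"

alice = blind_url(b"alice-secret", "https://site.example/page.html")
bob = blind_url(b"bob-secret", "https://site.example/page.html")
assert alice != bob      # same page, but the cached names are unlinkable
assert alice == blind_url(b"alice-secret",
                          "https://site.example/page.html")
```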

@inproceedings{veil:ndss18,
title        = {Veil: Private Browsing Semantics Without
Browser-side Assistance},
author       = {Frank Wang and James Mickens and Nickolai Zeldovich},
booktitle    = {Proceedings of the {N}etwork and {D}istributed
{S}ystem {S}ecurity {S}ymposium ({NDSS} '18)},
year         = 2018,
month        = feb,
}

## 2017

**Algorand: Scaling byzantine agreements for cryptocurrencies.** Yossi Gilad, Rotem Hemo, Silvio Micali, Georgios Vlachos, and Nickolai Zeldovich. SOSP 2017.

Algorand is a new cryptocurrency that confirms transactions with latency on the order of a minute while scaling to many users. Algorand ensures that users never have divergent views of confirmed transactions, even if some of the users are malicious and the network is temporarily partitioned. In contrast, existing cryptocurrencies allow for temporary forks and therefore require a long time, on the order of an hour, to confirm transactions with high confidence.

Algorand uses a new Byzantine Agreement (BA) protocol to reach consensus among users on the next set of transactions. To scale the consensus to many users, Algorand uses a novel mechanism based on Verifiable Random Functions that allows users to privately check whether they are selected to participate in the BA to agree on the next set of transactions, and to include a proof of their selection in their network messages. In Algorand's BA protocol, users do not keep any private state except for their private keys, which allows Algorand to replace participants immediately after they send a message. This mitigates targeted attacks on chosen participants after their identity is revealed.
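The private selection step ("cryptographic sortition") can be sketched as a keyed lottery. Algorand uses Verifiable Random Functions so that selection comes with a publicly checkable proof; in this toy sketch a hash of the secret key and the round seed stands in for the VRF output, so only the selection logic is illustrated.

```python
import hashlib

# Toy sketch of cryptographic sortition. A real implementation uses a
# VRF, whose output is both unpredictable and provable to others; a
# plain hash here stands in for the VRF output (no proof is produced).

def sortition(secret_key: bytes, round_seed: bytes,
              threshold: float) -> bool:
    h = hashlib.sha256(secret_key + round_seed).digest()
    draw = int.from_bytes(h[:8], "big") / 2**64   # uniform in [0, 1)
    return draw < threshold                       # checked privately

# Each user checks selection locally; nobody learns who is selected
# until the selected users broadcast their messages.
committee = [u for u in range(1000)
             if sortition(u.to_bytes(4, "big"), b"round-1",
                          threshold=0.01)]
assert sortition(b"k", b"seed", 1.0)       # threshold 1: always in
assert not sortition(b"k", b"seed", 0.0)   # threshold 0: never in
```

Because selection is recomputed per round from fresh seeds, participants can be replaced immediately after speaking, which is what mitigates targeted attacks.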

We implement Algorand and evaluate its performance on 1,000 EC2 virtual machines, simulating up to 500,000 users. Experimental results show that Algorand confirms transactions in under a minute, achieves 125× Bitcoin's throughput, and incurs almost no penalty for scaling to more users.

@inproceedings{algorand:sosp17,
title        = {Algorand: Scaling Byzantine Agreements for
Cryptocurrencies},
author       = {Yossi Gilad and Rotem Hemo and Silvio Micali and
Georgios Vlachos and Nickolai Zeldovich},
pages        = {51--68},
booktitle    = {Proceedings of the 26th ACM Symposium on Operating
Systems Principles (SOSP 2017)},
year         = 2017,
month        = oct,
}
**Stadium: A distributed metadata-private messaging system.** Nirvan Tyagi, Yossi Gilad, Derek Leung, Matei Zaharia, and Nickolai Zeldovich. SOSP 2017.

Private communication over the Internet remains a challenging problem. Even if messages are encrypted, it is hard to deliver them without revealing metadata about which pairs of users are communicating. Scalable anonymity systems, such as Tor, are susceptible to traffic analysis attacks that leak metadata. In contrast, the largest-scale systems with metadata privacy require passing all messages through a small number of providers, requiring a high operational cost for each provider and limiting their deployability in practice.

This paper presents Stadium, a point-to-point messaging system that provides metadata and data privacy while scaling its work efficiently across hundreds of low-cost providers operated by different organizations. Much like Vuvuzela, the current largest-scale metadata-private system, Stadium achieves its provable guarantees through differential privacy and the addition of noisy cover traffic. The key challenge in Stadium is limiting the information revealed from the many observable traffic links of a highly distributed system, without requiring an overwhelming amount of noise. To solve this challenge, Stadium introduces techniques for distributed noise generation and differentially private routing as well as a verifiable parallel mixnet design where the servers collaboratively check that others follow the protocol. We show that Stadium can scale to support 4× more users than Vuvuzela using servers that cost an order of magnitude less to operate than Vuvuzela nodes.

@inproceedings{stadium:sosp17,
title        = {Stadium: A Distributed Metadata-Private Messaging
System},
author       = {Nirvan Tyagi and Yossi Gilad and Derek Leung and
Matei Zaharia and Nickolai Zeldovich},
pages        = {423--440},
booktitle    = {Proceedings of the 26th ACM Symposium on Operating
Systems Principles (SOSP 2017)},
year         = 2017,
month        = oct,
}
**Verifying a high-performance crash-safe file system using a tree specification.** Haogang Chen, Tej Chajed, Alex Konradi, Stephanie Wang, Atalay Ileri, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. SOSP 2017.

DFSCQ is the first file system that (1) provides a precise specification for fsync and fdatasync, which allow applications to achieve high performance and crash safety, and (2) provides a machine-checked proof that its implementation meets this specification. DFSCQ's specification captures the behavior of sophisticated optimizations, including log-bypass writes, and DFSCQ's proof rules out some of the common bugs in file-system implementations despite the complex optimizations.

The key challenge in building DFSCQ is to write a specification for the file system and its internal implementation without exposing internal file-system details. DFSCQ introduces a metadata-prefix specification that captures the properties of fsync and fdatasync, which roughly follows the behavior of Linux ext4. This specification uses a notion of tree sequences—logical sequences of file-system tree states—for succinct description of the possible states after a crash and to describe how data writes can be reordered with respect to metadata updates. This helps application developers prove the crash safety of their own applications, avoiding application-level bugs such as forgetting to invoke fsync on both the file and the containing directory.
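A tree sequence can be modeled as a logical list of file-system tree states: an update appends a possible state, a crash may expose any state in the sequence, and fsync collapses the sequence to the latest state. A toy model (not DFSCQ's Coq specification; names are illustrative):

```python
# Toy model of a tree sequence: the spec tracks the logical sequence
# of file-system trees; a crash may leave any of them on disk, and
# fsync makes only the newest durable.

class TreeSeq:
    def __init__(self, initial):
        self.states = [initial]      # oldest ... newest

    def update(self, tree):
        """A metadata update appends a new possible post-crash state."""
        self.states.append(tree)

    def fsync(self):
        """Durability point: only the newest tree can survive a crash."""
        self.states = [self.states[-1]]

    def crash_states(self):
        return list(self.states)     # what a crash may expose

fs = TreeSeq({"f": ""})
fs.update({"f": "v1"})
fs.update({"f": "v1", "g": "x"})
assert len(fs.crash_states()) == 3   # crash can expose any of 3 trees
fs.fsync()
assert fs.crash_states() == [{"f": "v1", "g": "x"}]
```

This is the shape of reasoning that lets an application prove, for example, that fsync must be invoked on both a file and its containing directory before the file is durably reachable.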

An evaluation shows that DFSCQ achieves 103 MB/s on large file writes to an SSD and durably creates small files at a rate of 1,618 files per second. This is slower than Linux ext4 (which achieves 295 MB/s for large file writes and 4,977 files/s for small file creation) but much faster than two recent verified file systems, Yggdrasil and FSCQ. Evaluation results from application-level benchmarks, including TPC-C on SQLite, mirror these microbenchmarks.

@inproceedings{dfscq:sosp17,
title        = {Verifying a high-performance crash-safe file system
using a tree specification},
author       = {Haogang Chen and Tej Chajed and Alex Konradi and
Stephanie Wang and Atalay Ileri and Adam Chlipala and M. Frans
Kaashoek and Nickolai Zeldovich},
booktitle    = {Proceedings of the 26th ACM Symposium on Operating
Systems Principles (SOSP 2017)},
year         = 2017,
month        = oct,
}
**Scaling a file system to many cores using an operation log.** Srivatsa S. Bhat, Rasha Eqbal, Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. SOSP 2017.

It is challenging to simultaneously achieve multicore scalability and high disk throughput in a file system. For example, even for commutative operations like creating different files in the same directory, current file systems introduce cache-line conflicts when updating an in-memory copy of the on-disk directory block, which limits scalability.

ScaleFS is a novel file system design that decouples the in-memory file system from the on-disk file system using per-core operation logs. This design facilitates the use of highly concurrent data structures for the in-memory representation, which allows commutative operations to proceed without cache conflicts and hence scale perfectly. ScaleFS logs operations in a per-core log so that it can delay propagating updates to the disk representation (and the cache-line conflicts involved in doing so) until an fsync. The fsync call merges the per-core logs and applies the operations to disk. ScaleFS uses several techniques to perform the merge correctly while achieving good performance: timestamped linearization points to order updates without introducing cache-line conflicts, absorption of logged operations, and dependency tracking across operations.
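The per-core logging and fsync-time merge can be sketched as follows. This is a toy illustration (a global counter stands in for ScaleFS's timestamped linearization points; absorption and dependency tracking are omitted):

```python
import itertools

# Toy sketch of per-core operation logs: each core appends
# (timestamp, operation) to its own log with no shared cache lines;
# fsync merges the logs in timestamp order and applies them to disk.

ts = itertools.count()               # stand-in for linearization points
per_core_logs = [[], []]             # one private log per core

def log_create(core, name):
    per_core_logs[core].append((next(ts), ("create", name)))

def fsync(disk):
    merged = sorted(itertools.chain(*per_core_logs))  # order by timestamp
    for _, (op, name) in merged:
        if op == "create":
            disk.add(name)
    for log in per_core_logs:        # logged operations are now durable
        log.clear()
    return disk

log_create(0, "a")                   # commutative creates touch no
log_create(1, "b")                   # shared cache lines until fsync
log_create(0, "c")
disk = fsync(set())
assert disk == {"a", "b", "c"}
```

The point of the design is that the expensive cross-core communication happens once, at fsync, instead of on every commutative operation.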

Experiments with a prototype of ScaleFS show that its implementation has no cache conflicts for 99% of test cases of commutative operations generated by Commuter, scales well on an 80-core machine, and provides on-disk performance that is comparable to that of Linux ext4.

@inproceedings{scalefs:sosp17,
title        = {Scaling a file system to many cores using an
operation log},
author       = {Srivatsa S. Bhat and Rasha Eqbal and Austin T.
Clements and M. Frans Kaashoek and Nickolai Zeldovich},
booktitle    = {Proceedings of the 26th ACM Symposium on Operating
Systems Principles (SOSP 2017)},
year         = 2017,
month        = oct,
}
**Quboid: A workstation for safer web interaction.** Amol M. Bhave. Master's thesis, Massachusetts Institute of Technology, September 2017.
@mastersthesis{amol-meng,
title        = {{Quboid}: A workstation for safer Web interaction},
author       = {Amol M. Bhave},
school       = {Massachusetts Institute of Technology},
year         = 2017,
month        = sep,
}
**The scalable commutativity rule: Designing scalable software for multicore processors.** Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler. CACM 60(8), August 2017.

Developing software that scales on multicore processors is an inexact science dominated by guesswork, measurement, and expensive cycles of redesign and reimplementation. Current approaches are workload-driven and, hence, can reveal scalability bottlenecks only for known workloads and available software and hardware. This paper introduces an interface-driven approach to building scalable software. This approach is based on the scalable commutativity rule, which, informally stated, says that whenever interface operations commute, they can be implemented in a way that scales. We formalize this rule and prove it correct for any machine on which conflict-free operations scale, such as current cache-coherent multicore machines. The rule also enables a better design process for scalable software: programmers can now reason about scalability from the earliest stages of interface definition through software design, implementation, and evaluation.

@article{commutativity:cacm,
title        = {The Scalable Commutativity Rule: Designing Scalable
Software for Multicore Processors},
author       = {Austin T. Clements and M. Frans Kaashoek and
Nickolai Zeldovich and Robert T. Morris and Eddie Kohler},
journal      = {CACM},
volume       = 60,
number       = 8,
month        = aug,
year         = 2017,
}
**Performance optimization of the VDFS verified file system.** Alex Konradi. Master's thesis, Massachusetts Institute of Technology, June 2017.
@mastersthesis{akonradi-meng,
title        = {Performance {Optimization} of the {VDFS} {Verified}
{File} {System}},
author       = {Alex Konradi},
school       = {Massachusetts Institute of Technology},
year         = 2017,
month        = jun,
}
**Compiling Gallina to Go for the FSCQ file system.** Daniel Ziegler. Master's thesis, Massachusetts Institute of Technology, June 2017.

Over the last decade, systems software verification has become increasingly practical. Many verified systems have been written in the language of a proof assistant, proved correct, and then made runnable using code extraction. However, due to the rigidity of extraction and the overhead of the target languages, the resulting code's CPU performance can suffer, with limited opportunity for optimization. This thesis contributes CoqGo, a proof-producing compiler from Coq's Gallina language to Go. We created Go′, a stylized semantics of Go which enforces linearity, and implemented proof-producing compilation tactics from Gallina to Go′ plus a straightforward translation from Go′ to Go. Applying a prototype of CoqGo, we compiled a system call in the FSCQ file system, with minimal changes to FSCQ's source code. Taking advantage of the increased control given by CoqGo, we implemented three optimizations, bringing the system call's CPU performance to 19% faster than the extracted version.

@mastersthesis{ziegler-meng,
title        = {Compiling {Gallina} to {Go} for the {FSCQ} File
System},
author       = {Daniel Ziegler},
school       = {Massachusetts Institute of Technology},
year         = 2017,
month        = jun,
}
**CoqIOA: A formalization of IO Automata in the Coq Proof Assistant.** Anish Athalye. Master's thesis, Massachusetts Institute of Technology, June 2017.

Implementing distributed systems correctly is difficult. Designing correct distributed systems protocols is challenging because designs must account for concurrent operation and handle network and machine failures. Implementing these protocols is challenging as well: it is difficult to avoid subtle bugs in implementations of complex protocols. Formal verification is a promising approach to ensuring distributed systems are free of bugs, but verification is challenging and time-consuming. Unfortunately, current approaches to mechanically verifying distributed systems in proof assistants using deductive verification do not allow for modular reasoning, which could greatly reduce the effort required to implement verified distributed systems by enabling reuse of code and proofs.

This thesis presents CoqIOA, a framework for reasoning about distributed systems in a compositional way. CoqIOA builds on the theory of input/output automata to support specification, proof, and composition of systems within the proof assistant. The framework's implementation of the theory of IO automata, including refinement, simulation relations, and composition, are all machine-checked in the Coq proof assistant. An evaluation of CoqIOA demonstrates that the framework enables compositional reasoning about distributed systems within the proof assistant.

@mastersthesis{aathalye-meng,
title        = {{CoqIOA}: A Formalization of {IO} {Automata} in the
{Coq} {Proof} {Assistant}},
author       = {Anish Athalye},
school       = {Massachusetts Institute of Technology},
year         = 2017,
month        = jun,
}
**Designing multicore scalable filesystems with durability and crash consistency.** Srivatsa S. Bhat. Master's thesis, Massachusetts Institute of Technology, June 2017.

It is challenging to simultaneously achieve multicore scalability and high disk throughput in a file system. For example, data structures that are on separate cache lines in memory (e.g., directory entries) are grouped together in a transaction log when the file system writes them to disk. This grouping results in cache line conflicts, thereby limiting scalability.

McoreFS is a novel file system design that decouples the in-memory file system from the on-disk file system using per-core operation logs. This design facilitates the use of highly concurrent data structures for the in-memory representation, which allows commutative operations to proceed without conflicts and hence scale perfectly. McoreFS logs operations in a per-core log so that it can delay propagating updates to the disk representation until an fsync. The fsync call merges the per-core logs and applies the operations to disk. McoreFS uses several techniques to perform the merge correctly while achieving good performance: timestamped linearization points to order updates without introducing cache line conflicts, absorption of logged operations, and dependency tracking across operations.

Experiments with a prototype of McoreFS show that its implementation is conflict-free for 99% of test cases involving commutative operations generated by Commuter, scales well on an 80-core machine, and provides disk performance that matches or exceeds that of Linux ext4.

@mastersthesis{srivatsa-sm,
title        = {Designing multicore scalable filesystems with
durability and crash consistency},
author       = {Srivatsa S. Bhat},
school       = {Massachusetts Institute of Technology},
year         = 2017,
month        = jun,
}
**Certifying a file system using Crash Hoare Logic: Correctness in the presence of crashes.** Tej Chajed, Haogang Chen, Adam Chlipala, M. Frans Kaashoek, Nickolai Zeldovich, and Daniel Ziegler. CACM 60(4), April 2017.

FSCQ is the first file system with a machine-checkable proof that its implementation meets a specification, even in the presence of fail-stop crashes. FSCQ provably avoids bugs that have plagued previous file systems, such as performing disk writes without sufficient barriers or forgetting to zero out directory blocks. If a crash happens at an inopportune time, these bugs can lead to data loss. FSCQ’s theorems prove that, under any sequence of crashes followed by reboots, FSCQ will recover its state correctly without losing data.

To state FSCQ’s theorems, this paper introduces the Crash Hoare logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers through proof automation. Using CHL, we developed, specified, and proved the correctness of the FSCQ file system. Although FSCQ’s design is relatively simple, experiments with FSCQ as a user-level file system show that it is sufficient to run Unix applications with usable performance. FSCQ’s specifications and proofs required significantly more work than the implementation, but the work was manageable even for a small team of a few researchers.

@article{fscq:cacm,
title        = {Certifying a File System Using {Crash Hoare Logic}:
Correctness in the Presence of Crashes},
author       = {Tej Chajed and Haogang Chen and Adam Chlipala and M.
Frans Kaashoek and Nickolai Zeldovich and Daniel Ziegler},
journal      = {CACM},
volume       = 60,
number       = 4,
month        = apr,
year         = 2017,
}
**Splinter: Practical private queries on public data.** Frank Wang, Catherine Yun, Shafi Goldwasser, Vinod Vaikuntanathan, and Matei Zaharia. NSDI 2017.

Many online services let users query public datasets such as maps, flight prices, or restaurant reviews. Unfortunately, the queries to these services reveal highly sensitive information that can compromise users’ privacy. This paper presents Splinter, a system that protects users’ queries on public data and scales to realistic applications. A user splits her query into multiple parts and sends each part to a different provider that holds a copy of the data. As long as any one of the providers is honest and does not collude with the others, the providers cannot determine the query. Splinter uses and extends a new cryptographic primitive called Function Secret Sharing (FSS) that makes it up to an order of magnitude more efficient than prior systems based on Private Information Retrieval and garbled circuits. We develop protocols extending FSS to new types of queries, such as MAX and TOPK queries. We also provide an optimized implementation of FSS using AES-NI instructions and multicores. Splinter achieves end-to-end latencies below 1.6 seconds for realistic workloads including a Yelp clone, flight search, and map routing.
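The query-splitting structure can be illustrated with a greatly simplified stand-in for Function Secret Sharing: additively secret-share a point function across two providers, each of which evaluates its share against the public data. (Real FSS keys are compact; these shares are as large as the domain, so only the privacy structure is shown, and the data values are made up.)

```python
import secrets

# Simplified sketch of the idea behind FSS-style private queries:
# the user splits f(x) = [x == secret_index] into two shares that
# each look uniformly random; each provider computes an inner product
# of its share with the public data, and only the sum of the two
# answers reveals the result to the user.

P = 2**61 - 1  # prime modulus for additive shares

def share_point_query(domain_size, secret_index):
    share_a = [secrets.randbelow(P) for _ in range(domain_size)]
    share_b = [(-v) % P for v in share_a]    # shares sum to zero...
    share_b[secret_index] = (share_b[secret_index] + 1) % P  # ...except here
    return share_a, share_b

def provider_answer(share, column):
    """Each provider evaluates its share over the public data."""
    return sum(s * c for s, c in zip(share, column)) % P

prices = [500, 320, 410, 275]                # public data, held by both
a, b = share_point_query(len(prices), secret_index=2)
result = (provider_answer(a, prices) + provider_answer(b, prices)) % P
assert result == 410                          # prices[2], learned privately
```

Neither provider alone learns which index was queried, since each share is uniformly random on its own; this matches Splinter's honest-provider, no-collusion assumption.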

@inproceedings{splinter:nsdi17,
title        = {Splinter: Practical Private Queries on Public Data},
author       = {Frank Wang and Catherine Yun and Shafi Goldwasser
and Vinod Vaikuntanathan and Matei Zaharia},
booktitle    = {Proceedings of the 14th {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '17)},
year         = 2017,
month        = mar,
}
Verifying an I/O-concurrent file system Tej Chajed. Master's thesis, Massachusetts Institute of Technology, February 2017.

Systems software is a good target for verification due to its prevalent usage and its complexity, which can lead to tricky bugs that are hard to test for. One source of complexity in systems software is concurrency, but thus far verification techniques have struggled to enable large-scale verification of concurrent systems. This thesis contributes a verified file system, CIO-FSCQ, with I/O concurrency: if a file system call experiences a miss in the buffer cache and starts a disk I/O, the file system overlaps the I/O with the execution of another file system call.

CIO-FSCQ re-uses the implementation, specifications, and proofs of an existing verified sequential file system, FSCQ, and turns it into an I/O-concurrent file system. This re-use is enabled by CIO-FSCQ's optimistic system calls. An optimistic system call runs sequentially if all the data it needs is in the buffer cache. If some data is not in the cache, CIO-FSCQ issues I/Os to retrieve the data from disk and returns an error code. In the miss case, a system call wrapper reverts any partial changes and yields the processor so that another system call can run in parallel with the I/O. CIO-FSCQ retries the system call later, at which point the data is likely in the buffer cache. A directory-isolation protocol guarantees that FSCQ's specifications and proofs can be re-used even if optimistic system calls are retried. An evaluation of CIO-FSCQ shows that it speeds up a simple file-system workload by overlapping disk I/O with computation, and that the effort of building and verifying CIO-FSCQ is small compared to the effort of verifying FSCQ.
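The optimistic-system-call pattern can be sketched in a few lines (a minimal sketch with hypothetical names; in CIO-FSCQ the miss case yields the processor so the I/O overlaps another system call, rather than blocking as this toy loop does):

```python
class CacheMiss(Exception):
    """Raised when a system call needs a block that is not cached."""

def optimistic_call(syscall, cache, disk):
    """Run `syscall` against the buffer cache. On a miss, fetch the
    block (in the real system this I/O overlaps another call) and
    retry; the toy syscall below makes no partial changes to revert."""
    while True:
        try:
            return syscall(cache)
        except CacheMiss as miss:
            block = miss.args[0]
            cache[block] = disk[block]   # issue the I/O, then retry

disk = {"inode:7": b"file metadata"}
cache = {}

def read_inode(cache):
    if "inode:7" not in cache:
        raise CacheMiss("inode:7")
    return cache["inode:7"]

# First attempt misses and triggers the fetch; the retry hits.
data = optimistic_call(read_inode, cache, disk)
```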

@mastersthesis{tchajed-sm,
title        = {Verifying an {I/O}-Concurrent File System},
author       = {Tej Chajed},
school       = {Massachusetts Institute of Technology},
year         = 2017,
month        = feb,
}

## 2016

Oort: User-centric cloud storage with global queries Tej Chajed, Jon Gjengset, M. Frans Kaashoek, James Mickens, Robert Morris, and Nickolai Zeldovich. MIT CSAIL technical report, December 2016.
@techreport{oort:tr16,
title        = {Oort: User-Centric Cloud Storage with Global Queries},
author       = {Tej Chajed and Jon Gjengset and M. Frans
Kaashoek and James Mickens and Robert Morris and Nickolai
Zeldovich},
institution  = {{MIT} Computer Science and Artificial Intelligence
Laboratory},
number       = {MIT-CSAIL-TR-2016-015},
year         = 2016,
month        = dec,
}
Firmament: fast, centralized cluster scheduling at scale Ionel Gog, Malte Schwarzkopf, Adam Gleave, Robert N. M. Watson, and Steven Hand. OSDI 2016.

Centralized datacenter schedulers can make high-quality placement decisions when scheduling tasks in a cluster. Today, however, high-quality placements come at the cost of high latency at scale, which degrades response time for interactive tasks and reduces cluster utilization.

This paper describes Firmament, a centralized scheduler that scales to over ten thousand machines at sub-second placement latency even though it continuously reschedules all tasks via a min-cost max-flow (MCMF) optimization. Firmament achieves low latency by using multiple MCMF algorithms, by solving the problem incrementally, and via problem-specific optimizations.

Experiments with a Google workload trace from a 12,500-machine cluster show that Firmament improves placement latency by 20× over Quincy [22], a prior centralized scheduler using the same MCMF optimization. Moreover, even though Firmament is centralized, it matches the placement latency of distributed schedulers for workloads of short tasks. Finally, Firmament exceeds the placement quality of four widely-used centralized and distributed schedulers on a real-world cluster, and hence improves batch task response time by 6×.

@inproceedings{firmament:osdi16,
title        = {Firmament: fast, centralized cluster scheduling at
scale},
author       = {Ionel Gog and Malte Schwarzkopf and Adam Gleave and
Robert N. M. Watson and Steven Hand},
pages        = {99--115},
booktitle    = {Proceedings of the 12th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '16)},
year         = 2016,
month        = nov,
}
Alpenhorn: Bootstrapping secure communication without leaking metadata David Lazar and Nickolai Zeldovich. OSDI 2016.

Alpenhorn is the first system for initiating an encrypted connection between two users that provides strong privacy and forward secrecy guarantees for metadata (i.e., information about which users connected to each other) and that does not require out-of-band communication other than knowing the other user's Alpenhorn username (email address). This resolves a significant shortcoming in all prior works on private messaging, which assume an out-of-band key distribution mechanism.

Alpenhorn's design builds on three ideas. First, Alpenhorn provides each user with an address book of friends that the user can call to establish a connection. Second, when a user adds a friend for the first time, Alpenhorn ensures the adversary does not learn the friend's identity, by using identity-based encryption in a novel way to privately determine the friend's public key. Finally, when calling a friend, Alpenhorn ensures forward secrecy of metadata by storing pairwise shared secrets in friends' address books, and evolving them over time, using a new keywheel construction. Alpenhorn relies on a number of servers, but operates in an anytrust model, requiring just one of the servers to be honest.
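The keywheel idea, a pairwise secret that both friends ratchet forward with a one-way function while discarding old states, can be sketched as follows (a minimal sketch using SHA-256 with illustrative domain-separation labels; Alpenhorn's actual construction differs):

```python
import hashlib

def advance(state: bytes) -> bytes:
    """Turn the keywheel one round; the old state is discarded, so a
    later compromise cannot recover earlier rounds (forward secrecy)."""
    return hashlib.sha256(b"advance|" + state).digest()

def round_key(state: bytes) -> bytes:
    """Derive this round's key without exposing the wheel state itself."""
    return hashlib.sha256(b"roundkey|" + state).digest()

# Both friends start from the shared secret established at add-friend time.
alice = bob = hashlib.sha256(b"pairwise secret from add-friend").digest()
keys = []
for week in range(3):
    k = round_key(alice)
    assert k == round_key(bob)        # both sides stay in sync
    keys.append(k)
    alice, bob = advance(alice), advance(bob)
```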

We implemented a prototype of Alpenhorn, and integrated it into the Vuvuzela private messaging system (which did not previously provide privacy or forward secrecy of metadata when initiating conversations). Experimental results show that Alpenhorn can scale to many users, supporting 10 million users on three Alpenhorn servers with an average call latency of 150 seconds and a client bandwidth overhead of 3.7 KB/sec.

@inproceedings{alpenhorn:osdi16,
title        = {Alpenhorn: Bootstrapping Secure Communication
Without Leaking Metadata},
author       = {David Lazar and Nickolai Zeldovich},
pages        = {571--586},
booktitle    = {Proceedings of the 12th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '16)},
year         = 2016,
month        = nov,
}
Certifying a crash-safe file system Haogang Chen. Ph.D. thesis, Massachusetts Institute of Technology, September 2016.

File systems are a cornerstone for storing and retrieving permanent data, yet they are complex enough to have bugs that might cause data loss, especially in the face of system crashes.

FSCQ is the first file system that (1) provides a precise specification for a core subset of the POSIX file-system APIs, including fsync and fdatasync, which allow applications to achieve high I/O performance and crash safety; and (2) provides a machine-checked proof that its I/O-efficient implementation meets this precise specification. FSCQ’s proofs avoid crash-safety bugs that have plagued file systems, such as forgetting to insert a disk-write barrier between writing the data from the log and writing the log’s commit block. FSCQ’s specification also allows applications to prove their own crash safety, avoiding application-level bugs such as forgetting to invoke fsync on both the file and the containing directory. As a result, applications on FSCQ can provide strong guarantees: they will not lose data under any sequence of crashes.

To state FSCQ’s theorems, FSCQ introduces the Crash Hoare Logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers through proof automation. Using CHL, the thesis developed, specified, and proved the correctness of the FSCQ file system. FSCQ introduces a metadata-prefix specification that captures the properties of fsync and fdatasync, based on Linux ext4’s behavior. FSCQ also introduces disk sequences and disk relations to help formalize the metadata-prefix specification. The evaluation shows that FSCQ enables end-to-end verification of application crash safety, and that FSCQ’s optimizations achieve I/O performance on par with that of Linux ext4.

@phdthesis{hchen-phd,
title        = {Certifying a Crash-safe File System},
author       = {Haogang Chen},
school       = {Massachusetts Institute of Technology},
year         = 2016,
month        = sep,
}
Certifying checksum-based logging in the RapidFSCQ crash-safe filesystem Stephanie Wang. Master's thesis, Massachusetts Institute of Technology, June 2016.

As more and more software is written every day, so too are bugs. Formal verification is a way of using mathematical methods to prove that a program has no bugs. However, if formal verification is to see widespread use, it must be able to compete with unverified software in performance. Unfortunately, many of the optimizations that we take for granted in unverified software depend on assumptions that are difficult to verify. One such optimization is data checksums in logging systems, used to improve I/O efficiency while still ensuring data integrity after a crash.

This thesis explores a novel method of modeling the probabilistic guarantees of a hash function. This method is then applied to the logging system underlying RapidFSCQ, a certified crash-safe filesystem, to support formally verified checksums. An evaluation of RapidFSCQ shows that it enables end-to-end verification of application and filesystem crash safety, and that RapidFSCQ's optimizations, including checksumming, achieve I/O performance on par with Linux ext4. Thus, this thesis contributes a formal model of hash function behavior with practical application to certified computer systems.

@mastersthesis{stephanie-meng,
title        = {Certifying Checksum-Based Logging in the {RapidFSCQ}
Crash-Safe Filesystem},
author       = {Stephanie Wang},
school       = {Massachusetts Institute of Technology},
year         = 2016,
month        = jun,
}
A differential approach to undefined behavior detection Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, and Armando Solar-Lezama. CACM 60(3), March 2016.

This paper studies undefined behavior arising in systems programming languages such as C/C++. Undefined behavior bugs lead to unpredictable and subtle systems behavior, and their effects can be further amplified by compiler optimizations. Undefined behavior bugs are present in many systems, including the Linux kernel and the Postgres database. The consequences range from incorrect functionality to missing security checks.

This paper proposes a formal and practical approach, which finds undefined behavior bugs by finding “unstable code” in terms of optimizations that leverage undefined behavior. Using this approach, we introduce a new static checker called Stack that precisely identifies undefined behavior bugs. Applying Stack to widely used systems has uncovered 161 new bugs that have been confirmed and fixed by developers.

@article{kstack:cacm,
title        = {A Differential Approach to Undefined Behavior
Detection},
author       = {Xi Wang and Nickolai Zeldovich and M. Frans Kaashoek
and Armando Solar-Lezama},
journal      = {CACM},
volume       = 60,
number       = 3,
month        = mar,
year         = 2016,
}
Sieve: Cryptographically enforced access control for user data in untrusted clouds Frank Wang, James Mickens, Nickolai Zeldovich, and Vinod Vaikuntanathan. NSDI 2016.

Modern web services rob users of low-level control over cloud storage — a user's single logical data set is scattered across multiple storage silos whose access controls are set by web services, not users. The consequence is that users lack the ultimate authority to determine how their data is shared with other web services.

In this paper, we introduce Sieve, a new platform which selectively (and securely) exposes user data to web services. Sieve has a user-centric storage model: each user uploads encrypted data to a single cloud store, and by default, only the user knows the decryption keys. Given this storage model, Sieve defines an infrastructure to support rich, legacy web applications. Using attribute-based encryption, Sieve allows users to define intuitively understandable access policies that are cryptographically enforceable. Using key homomorphism, Sieve can reencrypt user data on storage providers in situ, revoking decryption keys from web services without revealing new keys to the storage provider. Using secret sharing and two-factor authentication, Sieve protects cryptographic secrets against the loss of user devices like smartphones and laptops. The result is that users can enjoy rich, legacy web applications, while benefiting from cryptographically strong controls over which data a web service can access.

@inproceedings{sieve:nsdi16,
title        = {Sieve: Cryptographically Enforced Access Control for
User Data in Untrusted Clouds},
author       = {Frank Wang and James Mickens and Nickolai Zeldovich
and Vinod Vaikuntanathan},
booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '16)},
year         = 2016,
month        = mar,
}

## 2015

Vuvuzela: Scalable private messaging resistant to traffic analysis Jelle van den Hooff, David Lazar, Matei Zaharia, and Nickolai Zeldovich. SOSP 2015.

Private messaging over the Internet has proven challenging to implement, because even if message data is encrypted, it is difficult to hide metadata about who is communicating in the face of traffic analysis. Systems that offer strong privacy guarantees, such as Dissent, scale to only several thousand clients, because they use techniques with superlinear cost in the number of clients (e.g., each client broadcasts their message to all other clients). On the other hand, scalable systems, such as Tor, do not protect against traffic analysis, making them ineffective in an era of pervasive network monitoring.

Vuvuzela is a new scalable messaging system that offers strong privacy guarantees, hiding both message data and metadata. Vuvuzela is secure against adversaries that observe and tamper with all network traffic, and that control all nodes except for one server. Vuvuzela's key insight is to minimize the number of variables observable by an attacker, and to use differential privacy techniques to add noise to all observable variables in a way that provably hides information about which users are communicating. Vuvuzela has a linear cost in the number of clients, and experiments show that it can achieve a throughput of 68,000 messages per second for 1 million users with a 37-second end-to-end latency on commodity servers.
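The noise idea can be sketched directly: each server inflates every observable variable (e.g., the number of accesses to a dead drop) with dummy traffic drawn from a Laplace distribution, so the counts are differentially private (a toy sketch; the parameters here are illustrative, not Vuvuzela's calibrated mechanism):

```python
import math
import random

def laplace(mu, b, rng):
    """Sample Laplace(mu, b) via the inverse-CDF method."""
    u = rng.random() - 0.5
    return mu - b * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def noisy_count(true_count, rng, mu=300, b=13.8):
    """Add a non-negative number of dummy messages to a true count.
    The mean mu shifts the noise so clamping at zero is rare; the
    scale b determines the privacy guarantee. Values are illustrative."""
    dummies = max(0, round(laplace(mu, b, rng)))
    return true_count + dummies

rng = random.Random(0)
observed = noisy_count(42, rng)
# The adversary sees only `observed`, never the true count of 42.
```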

@inproceedings{vuvuzela:sosp15,
title        = {Vuvuzela: Scalable Private Messaging Resistant to
Traffic Analysis},
author       = {Jelle van den Hooff and David Lazar and Matei
Zaharia and Nickolai Zeldovich},
booktitle    = {Proceedings of the 25th ACM Symposium on Operating
Systems Principles (SOSP 2015)},
year         = 2015,
month        = oct,
}
Using Crash Hoare Logic for certifying the FSCQ file system Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. SOSP 2015.

FSCQ is the first file system with a machine-checkable proof (using the Coq proof assistant) that its implementation meets its specification and whose specification includes crashes. FSCQ provably avoids bugs that have plagued previous file systems, such as performing disk writes without sufficient barriers or forgetting to zero out directory blocks. If a crash happens at an inopportune time, these bugs can lead to data loss. FSCQ's theorems prove that, under any sequence of crashes followed by reboots, FSCQ will recover the file system correctly without losing data.

To state FSCQ's theorems, this paper introduces the Crash Hoare logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers through proof automation. Using CHL, we developed, specified, and proved the correctness of the FSCQ file system. Although FSCQ's design is relatively simple, experiments with FSCQ running as a user-level file system show that it is sufficient to run Unix applications with usable performance. FSCQ's specifications and proofs required significantly more work than the implementation, but the work was manageable even for a small team of a few researchers.

@inproceedings{fscq:sosp15,
title        = {Using {Crash} {Hoare} {Logic} for certifying the
{FSCQ} file system},
author       = {Haogang Chen and Daniel Ziegler and Tej Chajed and
Adam Chlipala and M. Frans Kaashoek and Nickolai Zeldovich},
booktitle    = {Proceedings of the 25th ACM Symposium on Operating
Systems Principles (SOSP 2015)},
year         = 2015,
month        = oct,
}
Reducing pause times with clustered collection Cody Cutler and Robert Morris. ISMM 2015.

Each full garbage collection in a program with millions of objects can pause the program for multiple seconds. Much of this work is typically repeated, as the collector re-traces parts of the object graph that have not changed since the last collection. Clustered Collection reduces full collection pause times by eliminating much of this repeated work.

Clustered Collection identifies clusters: regions of the object graph that are reachable from a single “head” object, so that reachability of the head implies reachability of the whole cluster. As long as it is not written, a cluster need not be re-traced by successive full collections. The main design challenge is coping with program writes to clusters while ensuring safe, complete, and fast collections. In some cases program writes require clusters to be dissolved, but in most cases Clustered Collection can handle writes without having to re-trace the affected cluster. Clustered Collection chooses clusters likely to suffer few writes and to yield high savings from re-trace avoidance.

Clustered Collection is implemented as modifications to the Racket collector. Measurements of the code and data from the Hacker News web site (which suffers from significant garbage collection pauses) and a Twitter-like application show that Clustered Collection decreases full collection pause times by a factor of three and six respectively. This improvement is possible because both applications have gigabytes of live data, modify only a small fraction of it, and usually write in ways that do not result in cluster dissolution. Identifying clusters takes more time than a full collection, but happens much less frequently than full collection.

@inproceedings{cc:ismm15,
title        = {Reducing Pause Times with Clustered Collection},
author       = {Cody Cutler and Robert Morris},
booktitle    = {Proceedings of the 15th {I}nternational {S}ymposium
on {M}emory {M}anagement ({ISMM} '15)},
year         = 2015,
month        = jun,
}
Parallel execution for conflicting transactions Neha Narula. Ph.D. thesis, Massachusetts Institute of Technology, June 2015.

Multi-core in-memory databases only obtain parallel performance when transactions do not conflict. Conflicting transactions are executed one at a time in order to ensure that they have serializable effects. Sequential execution on contended data leaves cores idle and reduces throughput. In other parallel programming contexts—not serializable transactions—techniques have been developed that can reduce contention on shared variables using per-core state. This thesis asks the question, can these techniques apply to a general serializable database?

This work introduces a new concurrency control technique, phase reconciliation, that uses per-core state to greatly reduce contention on popular database records for many important workloads. Phase reconciliation uses the idea of synchronized phases to amortize the cost of combining per-core data and to extract parallelism.

Doppel, our phase reconciliation database, repeatedly cycles through joined and split phases. Joined phases use traditional concurrency control and allow any transaction to execute. When workload contention causes unnecessary sequential execution, Doppel switches to a split phase. During a split phase, commutative operations on popular records act on per-core state, and thus proceed in parallel on different cores. By explicitly using phases, phase reconciliation realizes two important performance benefits: First, it amortizes the potentially high costs of aggregating per-core state over many transactions. Second, it can dynamically split data or not based on observed contention, handling challenging, varying workloads. Doppel achieves higher performance because it parallelizes transactions on popular data that would be run sequentially by conventional concurrency control.
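The core mechanism, commutative operations applied to per-core state during a split phase and merged back at the phase boundary, can be sketched for a counter (a toy single-threaded sketch; Doppel handles general commutative operations under real concurrency control):

```python
class SplitCounter:
    """During a split phase, each worker increments its own slot, so a
    popular record sees no cross-core contention; reconciliation merges
    the per-worker slots into the global value at the phase boundary."""

    def __init__(self, nworkers):
        self.slots = [0] * nworkers   # per-worker (per-core) state
        self.value = 0                # the record's global value

    def add(self, worker, n):         # split phase: commutative update
        self.slots[worker] += n

    def reconcile(self):              # joined phase: merge per-core state
        self.value += sum(self.slots)
        self.slots = [0] * len(self.slots)
        return self.value

counter = SplitCounter(nworkers=4)
for worker in range(4):
    counter.add(worker, 10)           # would proceed in parallel on real cores
assert counter.reconcile() == 40
```

Amortizing one `reconcile` over many `add` calls is what makes the per-core split pay off.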

Phase reconciliation helps most when there are many updates to a few popular database records. On an 80-core machine, its throughput is up to 38x higher than conventional concurrency control protocols on microbenchmarks, and up to 3x on a larger application, at the cost of increased latency for some transactions.

@phdthesis{neha-phd,
title        = {Parallel Execution for Conflicting Transactions},
author       = {Neha Narula},
school       = {Massachusetts Institute of Technology},
year         = 2015,
month        = jun,
}
Amber: Decoupling user data from web applications Tej Chajed, Jon Gjengset, Jelle van den Hooff, M. Frans Kaashoek, James Mickens, Robert Morris, and Nickolai Zeldovich. HotOS XV 2015.
User-generated content is becoming increasingly common on the Web, but current web applications isolate their users' data, enabling only restricted sharing and cross-service integration. We believe users should be able to share their data seamlessly between their applications and with other users. To that end, we propose Amber, an architecture that decouples users' data from applications, while providing applications with powerful global queries to find user data. We demonstrate how multi-user applications, such as e-mail, can use these global queries to efficiently collect and monitor relevant data created by other users. Amber puts users in control of which applications they use with their data and with whom it is shared, and enables a new class of applications by removing the artificial partitioning of users' data by application.
@inproceedings{amber:hotos15,
title        = {Amber: Decoupling User Data from Web Applications},
author       = {Tej Chajed and Jon Gjengset and Jelle van den Hooff
and M. Frans Kaashoek and James Mickens and Robert Morris and
Nickolai Zeldovich},
booktitle    = {Proceedings of the 15th {W}orkshop on {H}ot {T}opics
in {O}perating {S}ystems ({HotOS XV})},
year         = 2015,
month        = may,
}
Specifying crash safety for storage systems Haogang Chen, Daniel Ziegler, Adam Chlipala, M. Frans Kaashoek, Eddie Kohler, and Nickolai Zeldovich. HotOS XV 2015.

@inproceedings{fscq:hotos15,
title        = {Specifying Crash Safety for Storage Systems},
author       = {Haogang Chen and Daniel Ziegler and Adam Chlipala
and M. Frans Kaashoek and Eddie Kohler and Nickolai Zeldovich},
booktitle    = {Proceedings of the 15th {W}orkshop on {H}ot {T}opics
in {O}perating {S}ystems ({HotOS XV})},
year         = 2015,
month        = may,
}
Hare: a file system for non-cache-coherent multicores Charles Gruenwald III, Filippo Sironi, M. Frans Kaashoek, and Nickolai Zeldovich. EuroSys 2015.

Hare is a new file system that provides a POSIX-like interface on multicore processors without cache coherence. Hare allows applications on different cores to share files, directories, and file descriptors. The challenge in designing Hare is to support the shared abstractions faithfully enough to run applications that run on traditional shared-memory operating systems, with few modifications, and to do so while scaling with an increasing number of cores.

To achieve this goal, Hare must support features (such as shared file descriptors) that traditional network file systems don't support, as well as implement them in a way that scales (e.g., shard a directory across servers to allow concurrent operations in that directory). Hare achieves this goal through a combination of new protocols (including a 3-phase commit protocol to implement directory operations correctly and scalably) and leveraging properties of non-cache-coherent multiprocessors (e.g., atomic low-latency message delivery and shared DRAM).

An evaluation on a 40-core machine demonstrates that Hare can run many challenging Linux applications (including a mail server and a Linux kernel build) with minimal or no modifications. The results also show these applications achieve good scalability on Hare, and that Hare's techniques are important to achieving scalability.

@inproceedings{gruenwald:hare,
author       = {Charles {Gruenwald III} and Filippo Sironi and M.
Frans Kaashoek and Nickolai Zeldovich},
title        = {Hare: a file system for non-cache-coherent
multicores},
booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
2015)},
year         = 2015,
month        = apr,
}
The scalable commutativity rule: Designing scalable software for multicore processors Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler. ACM Trans. Comput. Syst. 32(4), January 2015.

What opportunities for multicore scalability are latent in software interfaces, such as system call APIs? Can scalability challenges and opportunities be identified even before any implementation exists, simply by considering interface specifications? To answer these questions, we introduce the scalable commutativity rule: whenever interface operations commute, they can be implemented in a way that scales. This rule is useful throughout the development process for scalable multicore software, from the interface design through implementation, testing, and evaluation.

This article formalizes the scalable commutativity rule. This requires defining a novel form of commutativity, SIM commutativity, that lets the rule apply even to complex and highly stateful software interfaces.

We also introduce a suite of software development tools based on the rule. Our Commuter tool accepts high-level interface models, generates tests of interface operations that commute and hence could scale, and uses these tests to systematically evaluate the scalability of implementations. We apply Commuter to a model of 18 POSIX file and virtual memory system operations. Using the resulting 26,238 scalability tests, Commuter highlights Linux kernel problems previously observed to limit application scalability and identifies previously unknown bottlenecks that may be triggered by future workloads or hardware.
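The rule's test criterion can be sketched directly: two operations commute on a state if running them in either order yields the same results and the same final state (a toy dictionary model with hypothetical file-system operations, far simpler than Commuter's symbolic models):

```python
import copy

def commutes(state, op_a, op_b):
    """A crude commutativity check: both orders must agree on every
    operation's result and on the final state."""
    s1 = copy.deepcopy(state)
    ra1, rb1 = op_a(s1), op_b(s1)     # order: A then B
    s2 = copy.deepcopy(state)
    rb2, ra2 = op_b(s2), op_a(s2)     # order: B then A
    return ra1 == ra2 and rb1 == rb2 and s1 == s2

def open_file(name):                  # POSIX-style lowest-free-fd open
    def op(fs):
        fd = min(set(range(16)) - set(fs["fds"]))
        fs["fds"][fd] = name
        return fd
    return op

def write_file(name, data):
    def op(fs):
        fs["files"][name] = data
        return len(data)
    return op

empty = {"fds": {}, "files": {}}
# Lowest-free-fd allocation makes open() results order-dependent:
assert not commutes(empty, open_file("a"), open_file("b"))
# Writes to different files commute, so they admit a scalable implementation:
assert commutes(empty, write_file("x", b"1"), write_file("y", b"2"))
```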

Finally, we apply the scalable commutativity rule and Commuter to the design and implementation of sv6, a new POSIX-like operating system. sv6’s novel file and virtual memory system designs enable it to scale for 99% of the tests generated by Commuter. These results translate to linear scalability on an 80-core x86 machine for applications built on sv6’s commutative operations.

@article{commutativity:tocs,
title        = {The Scalable Commutativity Rule: Designing Scalable
Software for Multicore Processors},
author       = {Austin T. Clements and M. Frans Kaashoek and
Nickolai Zeldovich and Robert T. Morris and Eddie Kohler},
journal      = {ACM Trans. Comput. Syst.},
volume       = 32,
number       = 4,
month        = jan,
year         = 2015,
pages        = {10:1--10:47},
}

## 2014

VerSum: Verifiable computations over large public logs Jelle van den Hooff, M. Frans Kaashoek, and Nickolai Zeldovich. CCS 2014.

VerSum allows lightweight clients to outsource expensive computations over large and frequently changing data structures, such as the Bitcoin or Namecoin blockchains, or a Certificate Transparency log. VerSum clients ensure that the output is correct by comparing the outputs from multiple servers. VerSum assumes that at least one server is honest, and crucially, when servers disagree, VerSum uses an efficient conflict resolution protocol to determine which server(s) made a mistake and thus obtain the correct output.

VerSum's contribution lies in achieving low server-side overhead for both incremental re-computation and conflict resolution, using three key ideas: (1) representing the computation as a functional program, which allows memoization of previous results; (2) recording the evaluation trace of the functional program in a carefully designed computation history to help clients determine which server made a mistake; and (3) introducing a new authenticated data structure for sequences, called SeqHash, that makes it efficient for servers to construct summaries of computation histories in the presence of incremental re-computation. Experimental results with an implementation of VerSum show that VerSum can be used for a variety of computations, that it can support many clients, and that it can easily keep up with Bitcoin's rate of new blocks with transactions.

@inproceedings{versum:ccs14,
title        = {{VerSum}: Verifiable computations over large public
logs},
author       = {Jelle van den Hooff and M. Frans Kaashoek and
Nickolai Zeldovich},
booktitle    = {Proceedings of the 21st {ACM} Conference on Computer
and Communications Security ({CCS 2014})},
year         = 2014,
month        = nov,
}
Nail: A practical tool for parsing and generating data formats Julian Bangert and Nickolai Zeldovich. OSDI 2014.

Nail is a tool that greatly reduces the programmer effort for safely parsing and generating data formats defined by a grammar. Nail introduces several key ideas to achieve its goal. First, Nail uses a protocol grammar to define not just the data format, but also the internal object model of the data. Second, Nail eliminates the notion of semantic actions, used by existing parser generators, which reduces the expressive power but allows Nail to both parse data formats and generate them from the internal object model, by establishing a semantic bijection between the data format and the object model. Third, Nail introduces dependent fields and stream transforms to capture protocol features such as size and offset fields, checksums, and compressed data, which are impractical to express in existing protocol languages. Using Nail, we implement an authoritative DNS server in C in under 300 lines of code and grammar, and an unzip program in C in 220 lines of code and grammar, demonstrating that Nail makes it easy to parse complex real-world data formats. Performance experiments show that a Nail-based DNS server can outperform the widely used BIND DNS server on an authoritative workload, demonstrating that systems built with Nail can achieve good performance.

@inproceedings{nail:osdi14,
title        = {Nail: A Practical Tool for Parsing and Generating
Data Formats},
author       = {Julian Bangert and Nickolai Zeldovich},
booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '14)},
year         = 2014,
month        = oct,
}
Identifying information disclosure in web applications with retroactive auditing Haogang Chen, Taesoo Kim, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. OSDI 2014.

Rail is a framework for building web applications that can precisely identify inappropriately disclosed data after a vulnerability is discovered. To do so, Rail introduces retroactive disclosure auditing: re-running the application with previous inputs once the vulnerability is fixed to determine what data should have been disclosed. A key challenge for Rail is to reconcile state divergence between the original and replay executions, so that the differences between executions precisely correspond to inappropriately disclosed data. Rail provides application developers with APIs to address this challenge, by identifying sensitive data, assigning semantic names to non-deterministic inputs, and tracking dependencies.

Results from a prototype of Rail built on top of the Meteor framework show that Rail can quickly and precisely identify data disclosure from complex attacks, including programming bugs, administrative mistakes, and stolen passwords. Rail incurs up to 22% throughput overhead and 0.5 KB storage overhead per request. Porting three existing web applications required fewer than 25 lines of code changes per application.

@inproceedings{rail:osdi14,
title        = {Identifying information disclosure in web
applications with retroactive auditing},
author       = {Haogang Chen and Taesoo Kim and Xi Wang and Nickolai
Zeldovich and M. Frans Kaashoek},
booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '14)},
year         = 2014,
month        = oct,
}
Jitk: A trustworthy in-kernel interpreter infrastructure Xi Wang, David Lazar, Nickolai Zeldovich, Adam Chlipala, and Zachary Tatlock. OSDI 2014.

Modern operating systems run multiple interpreters in the kernel, which enable user-space applications to add new functionality or specialize system policies. The correctness of such interpreters is critical to the overall system security: bugs in interpreters could allow adversaries to compromise user-space applications and even the kernel.

Jitk is a new infrastructure for building in-kernel interpreters that guarantee functional correctness as they compile user-space policies down to native instructions for execution in the kernel. To demonstrate Jitk, we implement two interpreters in the Linux kernel, BPF and INET-DIAG, which are used for network and system call filtering and socket monitoring, respectively. To help application developers write correct filters, we introduce a high-level rule language, along with a proof that Jitk correctly translates high-level rules all the way to native machine code, and demonstrate that this language can be integrated into OpenSSH with tens of lines of code. We built a prototype of Jitk on top of the CompCert verified compiler and integrated it into the Linux kernel. Experimental results show that Jitk is practical, fast, and trustworthy.

@inproceedings{jitk:osdi14,
title        = {Jitk: A Trustworthy In-Kernel Interpreter
Infrastructure},
author       = {Xi Wang and David Lazar and Nickolai Zeldovich and
Adam Chlipala and Zachary Tatlock},
booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '14)},
year         = 2014,
month        = oct,
}
Phase reconciliation for contended in-memory transactions Neha Narula, Cody Cutler, Eddie Kohler, and Robert Morris. OSDI 2014.

Multicore main-memory database performance can collapse when many transactions contend on the same data. Contending transactions are executed serially—either by locks or by optimistic concurrency control aborts—in order to ensure that they have serializable effects. This leaves many cores idle and performance poor. We introduce a new concurrency control technique, phase reconciliation, that solves this problem for many important workloads. Doppel, our phase reconciliation database, repeatedly cycles through joined, split, and reconciliation phases.

Joined phases use traditional concurrency control and allow any transaction to execute. When workload contention causes unnecessary serial execution, Doppel switches to a split phase. There, updates to contended items modify per-core state, and thus proceed in parallel on different cores. Not all transactions can execute in a split phase; for example, all modifications to a contended item must commute. A reconciliation phase merges these per-core states into the global store, producing a complete database ready for joined-phase transactions. A key aspect of this design is determining which items to split, and which operations to allow on split items.

Phase reconciliation helps most when there are many updates to a few popular database records. Its throughput is up to 38x higher than conventional concurrency control protocols on microbenchmarks, and up to 3x higher on a larger application, at the cost of increased latency for some transactions.
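The split/reconcile cycle for a commutative increment can be sketched in a few lines (illustrative names, not Doppel's API): split-phase updates touch only per-core state, and the reconciliation phase merges the deltas into the global store.

```python
# Toy phase-reconciled counter: increments commute, so during a split phase
# each "core" accumulates a private delta; reconciliation folds the deltas
# into the global value, after which joined-phase transactions can read it.

class SplitCounter:
    def __init__(self, ncores: int):
        self.global_value = 0
        self.per_core = [0] * ncores   # split-phase state, one slot per core

    def add(self, core: int, n: int) -> None:
        # Split phase: a commutative update touches only this core's slot,
        # so cores proceed in parallel with no shared-data contention.
        self.per_core[core] += n

    def reconcile(self) -> int:
        # Reconciliation phase: merge per-core deltas into the global store.
        self.global_value += sum(self.per_core)
        self.per_core = [0] * len(self.per_core)
        return self.global_value
```

Only commutative operations (here, addition) may run on split data, which is why the paper emphasizes deciding which items to split and which operations to allow on them.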

@inproceedings{doppel:osdi14,
title        = {Phase Reconciliation for Contended In-Memory
Transactions},
author       = {Neha Narula and Cody Cutler and Eddie Kohler and
Robert Morris},
booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '14)},
year         = 2014,
month        = oct,
}
A differential approach to undefined behavior detection Xi Wang. Ph.D. thesis, Massachusetts Institute of Technology, September 2014.

This thesis studies undefined behavior arising in systems programming languages such as C/C++. Undefined behavior bugs lead to unpredictable and subtle systems behavior, and their effects can be further amplified by compiler optimizations. Undefined behavior bugs are present in many systems, including the Linux kernel and the Postgres database. The consequences range from incorrect functionality to missing security checks.

This thesis proposes a formal and practical approach, which finds undefined behavior bugs by finding “unstable code” in terms of optimizations that leverage undefined behavior. Using this approach, we introduce a new static checker called Stack that precisely identifies undefined behavior bugs. Applying Stack to widely used systems has uncovered 161 new bugs that have been confirmed and fixed by developers.

@phdthesis{xi-phd,
title        = {A Differential Approach to Undefined Behavior
Detection},
author       = {Xi Wang},
school       = {Massachusetts Institute of Technology},
year         = 2014,
month        = sep,
}
ScaleFS: A multicore-scalable file system Rasha Eqbal. Master's thesis, Massachusetts Institute of Technology, August 2014.

It is difficult to achieve durability and crash consistency in file systems along with multicore scalability. Commutative file system operations, which should scale according to the Scalable Commutativity Property, conflict on shared resources like coarse-grained locks and pages present in the page cache or buffer cache. Furthermore, data structures that are on separate cache lines in memory (e.g., directory entries) are grouped together when the file system writes them to disk for durability. This grouping results in additional conflicts.

This thesis introduces a new design approach that decouples the in-memory file system from the on-disk file system using per-core operation logs. This facilitates the use of highly concurrent data structures for the in-memory representation, which is essential for commutative operations to proceed conflict-free and hence scale perfectly. The in-memory representation does not propagate updates to the disk representation immediately; instead, it simply logs the operation in a per-core logical log. A sync or an fsync call processes these operations and applies them to the disk. Techniques based on timestamping the linearization points of file system operations ensure crash consistency, and dependency tracking ensures good disk performance.

A prototype file system, ScaleFS, implements this new approach and techniques. Experiments using Commuter and ScaleFS show that the implementation is conflict free for 99% of test cases involving commutative operations.
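The decoupled design can be sketched as follows (hypothetical names, not ScaleFS's code): each core appends timestamped operations to its own log, and a sync merges all logs in timestamp order, standing in for the linearization order, before applying them to the disk image.

```python
# Toy per-core logical log: in-memory operations are recorded locally and
# conflict-free; sync merges the logs by timestamp and replays them on disk.
import heapq
import itertools

class PerCoreLog:
    def __init__(self, ncores: int):
        self.logs = [[] for _ in range(ncores)]
        self.clock = itertools.count()   # stand-in for linearization timestamps

    def record(self, core: int, op) -> None:
        # In-memory path: append to this core's log only; no shared writes.
        self.logs[core].append((next(self.clock), op))

    def sync(self, disk: dict) -> None:
        # Each per-core log is already sorted, so a k-way merge yields the
        # global timestamp order; replay the operations against the disk.
        for _, op in heapq.merge(*self.logs):
            op(disk)
        self.logs = [[] for _ in self.logs]
```

Replaying in timestamp order preserves the order in which operations took effect in memory, which is the property the thesis relies on for crash consistency.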

@mastersthesis{rashae-sm-thesis,
title        = {{ScaleFS}: A Multicore-Scalable File System},
author       = {Rasha Eqbal},
school       = {Massachusetts Institute of Technology},
year         = 2014,
month        = aug,
}
Why does cryptographic software fail? A case study and open problems David Lazar, Haogang Chen, Xi Wang, and Nickolai Zeldovich. APSys 2014.

Mistakes in cryptographic software implementations often undermine the strong security guarantees offered by cryptography. This paper presents a systematic study of cryptographic vulnerabilities in practice, an examination of state-of-the-art techniques to prevent such vulnerabilities, and a discussion of open problems and possible future research directions. Our study covers 269 cryptographic vulnerabilities reported in the CVE database from January 2011 to May 2014. The results show that just 17% of the bugs are in cryptographic libraries (which often have devastating consequences), and the remaining 83% are misuses of cryptographic libraries by individual applications. We observe that preventing bugs in different parts of a system requires different techniques, and that no effective techniques exist to deal with certain classes of mistakes, such as weak key generation.

@inproceedings{cryptobugs:apsys14,
title        = {Why does cryptographic software fail? {A} case study
and open problems},
author       = {David Lazar and Haogang Chen and Xi Wang and
Nickolai Zeldovich},
booktitle    = {Proceedings of 5th ACM SIGOPS Asia-Pacific Workshop
on Systems (APSys 2014)},
year         = 2014,
month        = jun,
}
Providing a shared file system in the Hare POSIX multikernel Charles Gruenwald III. Ph.D. thesis, Massachusetts Institute of Technology, June 2014.

Hare is a new multikernel operating system that provides a single system image for multicore processors without cache coherence. Hare allows applications on different cores to share files, directories, file descriptors, sockets, and processes. The main challenge in designing Hare is to support shared abstractions faithfully enough to run applications that run on traditional shared-memory operating systems with few modifications, and to do so while scaling with an increasing number of cores.

To achieve this goal, Hare must support shared abstractions (e.g., file descriptors shared between processes) that appear consistent to processes running on any core, but without relying on hardware cache coherence between cores. Moreover, Hare must implement these abstractions in a way that scales (e.g., directories sharded across servers to allow concurrent operations in that directory). Hare achieves this goal through a combination of new protocols (e.g., a 3-phase commit protocol to implement directory operations correctly and scalably) and leveraging properties of non-cache-coherent multiprocessors (e.g., atomic low-latency message delivery and shared DRAM).

An evaluation on a 40-core machine demonstrates that Hare can run many challenging Linux applications (including a mail server and a Linux kernel build) with minimal or no modifications. The results also show that these applications achieve good scalability on Hare, and that Hare's techniques are important to achieving scalability.

@phdthesis{charlesg3-phd,
title        = {Providing a Shared File System in the {Hare} {POSIX}
Multikernel},
author       = {Charles Gruenwald III},
school       = {Massachusetts Institute of Technology},
year         = 2014,
month        = jun,
}
The scalable commutativity rule: Designing scalable software for multicore processors Austin T. Clements. Ph.D. thesis, Massachusetts Institute of Technology, June 2014.

What fundamental opportunities for multicore scalability are latent in software interfaces, such as system call APIs? Can scalability opportunities be identified even before any implementation exists, simply by considering interface specifications? To answer these questions, this dissertation introduces the scalable commutativity rule: Whenever interface operations commute, they can be implemented in a way that scales. This rule aids developers in building scalable multicore software starting with interface design and carrying on through implementation, testing, and evaluation.

This dissertation formalizes the scalable commutativity rule and defines a novel form of commutativity named SIM commutativity that makes it possible to fruitfully apply the rule to complex and highly stateful software interfaces.

To help developers apply the rule, this dissertation introduces an automated method embodied in a new tool named Commuter, which accepts high-level interface models, generates tests of operations that commute and hence could scale, and uses these tests to systematically evaluate the scalability of implementations. We apply Commuter to a model of 18 POSIX file and virtual memory system operations. Using the resulting 26,238 scalability tests, Commuter systematically pinpoints many problems in the Linux kernel that past work has observed to limit application scalability and identifies previously unknown bottlenecks that may be triggered by future hardware or workloads.

Finally, this dissertation applies the scalable commutativity rule and Commuter to the design and implementation of a new POSIX-like operating system named sv6. sv6's novel file and virtual memory system designs enable it to scale for 99% of the tests generated by Commuter. These results translate to linear scalability on an 80-core x86 machine for applications built on sv6's commutative operations.
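The commutativity test at the heart of this approach can be sketched as a toy model (not Commuter's implementation): run a pair of operations in both orders on copies of a model state, and report commutativity only if both orders produce identical states and identical per-operation return values.

```python
# Toy commutativity check over a model state. An example model: creat()
# returns whether it created the name, so two creats of the same name do
# not commute (the return values depend on order), while creats of
# different names do.
import copy

def commutes(state, op_a, op_b) -> bool:
    s_ab = copy.deepcopy(state)
    ra1, rb1 = op_a(s_ab), op_b(s_ab)     # order: a then b
    s_ba = copy.deepcopy(state)
    rb2, ra2 = op_b(s_ba), op_a(s_ba)     # order: b then a
    return s_ab == s_ba and ra1 == ra2 and rb1 == rb2

def creat(d: str, name: str):
    def op(fs: dict) -> bool:
        created = name not in fs.setdefault(d, set())
        fs[d].add(name)
        return created                     # observable result matters too
    return op
```

Requiring equal return values, not just equal final states, mirrors the dissertation's point that commutativity must account for what each operation observes, which is what SIM commutativity formalizes.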

@phdthesis{aclements-phd,
title        = {The scalable commutativity rule: Designing scalable
software for multicore processors},
author       = {Austin T. Clements},
school       = {Massachusetts Institute of Technology},
year         = 2014,
month        = jun,
}
Automatic intrusion recovery with system-wide history Taesoo Kim. Ph.D. thesis, Massachusetts Institute of Technology, June 2014.

Compromises of our computer systems are inevitable. New software vulnerabilities are discovered and exploited daily, but even if the software is bug-free, administrators may inadvertently make mistakes in configuring permissions, or unaware users may click on buttons in application installers with little understanding of their consequences. Unfortunately, recovering from those inevitable compromises leads to days and weeks of wasted effort by users or system administrators, with no conclusive guarantee that all traces of the attack have been cleaned up.

This dissertation presents an automatic recovery system that repairs a computer after an adversary compromises it, by undoing the adversary's changes while preserving legitimate user actions, with minimal user involvement. During normal operation, the system records an action history graph to describe the system's execution, enabling it to trace the adversary's changes and their effects. During repair, the system uses the action history graph to undo an unwanted action and its indirect effects by first rolling back its direct effects, and then re-executing legitimate actions that were influenced by that change. To minimize re-execution and user involvement, the system uses predicates to selectively re-execute only actions that were semantically affected by the adversary's changes, uses refinement to represent high-level semantics in the action history graph, and uses compensating actions to handle external effects.

An evaluation of a Linux prototype with two real-world attacks, two synthesized challenge attacks, and six attacks from previous work shows that the system can handle a wide range of real attacks with minimal user involvement, and can preserve users' changes by efficiently re-executing parts of an action history graph. These benefits come at the cost of 35–127% in execution time overhead and of 4–150 GB of log space per day, depending on the workload. For example, a HotCRP paper submission web site incurs a 35% slowdown and generates 4 GB of logs per day under the workload from 30 minutes prior to the SOSP 2007 deadline. We believe these overheads are acceptable in systems whose integrity is critical to their operations.

@phdthesis{taesoo-phd,
title        = {Automatic intrusion recovery with system-wide
history},
author       = {Taesoo Kim},
school       = {Massachusetts Institute of Technology},
year         = 2014,
month        = jun,
}
Reducing pause times with Clustered Collection Cody Cutler. Master's thesis, Massachusetts Institute of Technology, June 2014.

Clustered Collection reduces garbage collection pauses in programs with large amounts of live data. A full collection of millions of live objects can pause the program for multiple seconds. Much of this work, however, is repeated from one collection to the next, particularly for programs that modify only a small fraction of their object graphs between collections.

Clustered Collection reduces redundant work by identifying regions of the object graph which, once traced, need not be traced by subsequent collections. Each of these regions, or “clusters,” consists of objects reachable from a single head object. If the collector can reach a cluster's head object, it skips over the cluster, and resumes tracing at the pointers that leave the cluster. If a cluster's head object is not reachable, or an object within a cluster has been written, the cluster collector may have to trace within the cluster. Clustered Collection is complete despite not tracing within clusters: it frees all unreachable objects.

Clustered Collection is implemented as modifications to the Racket collector. Measurements of the code and data from the Hacker News web site show that Clustered Collection decreases full collection pause times by a factor of three. Hacker News works well with Clustered Collection because it keeps gigabytes of data in memory but modifies only a small fraction of that data. Other experiments demonstrate the ability of Clustered Collection to tolerate certain kinds of writes, and quantify the cost of finding clusters.
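The cluster-skipping idea can be sketched with a toy heap (an illustrative data model, not the Racket collector's): a cluster maps its head object to its member set and its precomputed exit pointers; if the head is reachable and the cluster unmodified, tracing skips the interior and resumes at the exits.

```python
# Toy tracing collector with clusters. The heap maps each object to its
# outgoing pointers; clusters maps a head object to (members, exit pointers).
# A "dirty" cluster (written since the last collection) is traced normally.

def trace(heap, roots, clusters, dirty=frozenset()):
    reachable, stack = set(), list(roots)
    while stack:
        obj = stack.pop()
        if obj in reachable:
            continue
        reachable.add(obj)
        if obj in clusters and obj not in dirty:
            members, exits = clusters[obj]
            reachable |= members      # skip interior tracing entirely
            stack.extend(exits)       # resume at pointers leaving the cluster
        else:
            stack.extend(heap.get(obj, []))
    return reachable
```

Skipping an intact cluster yields the same reachable set as tracing through it, which is why the collection remains complete: unreachable objects outside every live cluster are still freed.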

@mastersthesis{ccutler-sm-thesis,
title        = {Reducing pause times with {Clustered Collection}},
author       = {Cody Cutler},
school       = {Massachusetts Institute of Technology},
year         = 2014,
month        = jun,
}
Scaling address-space operations on Linux with TSX Christopher R. Johnson. Master's thesis, Massachusetts Institute of Technology, June 2014.

Concurrent programming is important due to increasing core counts, but scalable concurrency control is difficult and error-prone to implement. Hardware Transactional Memory (HTM) addresses this problem by providing hardware support for concurrently executing arbitrary read-modify-write memory transactions. Intel released Transactional Synchronization eXtensions (TSX), an HTM implementation, in select processors to support scalable concurrency control.

This thesis contributes a case study in applying TSX to the Linux virtual memory system, which currently serializes address-space operations with a lock. TSX should provide scalability by supporting concurrent address-space operations. Achieving scalability with TSX, however, turned out to be difficult due to transactional aborts. This thesis details how to identify and resolve abort problems, and it describes the necessary modifications to make address-space operations scale in Linux.

This thesis also describes a new TLB shootdown algorithm, TxShootDown, which removes TLB shootdown from a transactional critical section while avoiding races due to concurrent address-space operations.

@mastersthesis{txvm:crjohns-thesis,
title        = {Scaling Address-Space Operations on {Linux} with
{TSX}},
author       = {Christopher R. Johnson},
school       = {Massachusetts Institute of Technology},
year         = 2014,
month        = jun,
}
Fast bug finding in lock-free data structures with CB-DPOR Jelle van den Hooff. Master's thesis, Massachusetts Institute of Technology, May 2014.

This thesis describes CB-DPOR, an algorithm for quickly finding bugs in lock-free data structures. CB-DPOR is a combination of the CHESS and DPOR model checking algorithms. CB-DPOR performs similarly to the concurrently developed preemption-bounded BPOR algorithm.

Codex is a tool for finding bugs in lock-free data structures. Codex implements CB-DPOR, and this thesis demonstrates how to use Codex to find bugs. This thesis describes new bugs in open-source lock-free data structures, and compares the performance of CB-DPOR with the earlier model checking algorithms CHESS, DPOR, and PCT. CB-DPOR finds bugs one to two orders of magnitude faster than the earlier algorithms.
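The underlying stateless model checking idea, minus DPOR's partial-order pruning and CHESS's preemption bounding, can be sketched as exhaustive interleaving exploration (a toy, not Codex): enumerate every interleaving of two threads' atomic steps and check an invariant on each final state.

```python
# Toy exhaustive schedule explorer. The example models the classic lost
# update: each thread reads x into a private temp, then writes temp+1 back;
# four of the six interleavings lose an increment.

def interleavings(a, b):
    """Yield every order-preserving merge of step lists a and b."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def count_violations(thread_a, thread_b, init, invariant):
    bad = 0
    for sched in interleavings(thread_a, thread_b):
        state = dict(init)
        for step in sched:
            step(state)
        if not invariant(state):
            bad += 1
    return bad

def read(tmp):
    return lambda s: s.__setitem__(tmp, s["x"])

def write(tmp):
    return lambda s: s.__setitem__("x", s[tmp] + 1)
```

Algorithms like DPOR and CB-DPOR get their speedups by pruning interleavings that are equivalent up to reordering of independent steps, rather than enumerating all of them as this sketch does.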

@mastersthesis{codex:jelle-meng,
title        = {Fast Bug Finding in Lock-Free Data Structures with
{CB-DPOR}},
author       = {Jelle van den Hooff},
school       = {Massachusetts Institute of Technology},
year         = 2014,
month        = may,
}
Building web applications on top of encrypted data using Mylar Raluca Ada Popa, Emily Stark, Jonas Helfer, Steven Valdez, Nickolai Zeldovich, M. Frans Kaashoek, and Hari Balakrishnan. NSDI 2014.

Web applications rely on servers to store and process confidential information. However, anyone who gains access to the server (e.g., an attacker, a curious administrator, or a government) can obtain all of the data stored there. This paper presents Mylar, a platform for building web applications, which protects data confidentiality against attackers with full access to servers. Mylar stores sensitive data encrypted on the server, and decrypts that data only in users' browsers. Mylar addresses three challenges in making this approach work. First, Mylar allows the server to perform keyword search over encrypted documents, even if the documents are encrypted with different keys. Second, Mylar allows users to share keys and encrypted data securely in the presence of an active adversary. Finally, Mylar ensures that client-side application code is authentic, even if the server is malicious. Results with a prototype of Mylar built on top of the Meteor framework are promising: porting 6 applications required changing just 36 lines of code on average, and the performance overheads are modest, amounting to a 17% throughput loss and a 50 ms latency increase for sending a message in a chat application.

@inproceedings{mylar:nsdi14,
title        = {Building web applications on top of encrypted data
using {Mylar}},
author       = {Raluca Ada Popa and Emily Stark and Jonas Helfer and
Steven Valdez and Nickolai Zeldovich and M. Frans Kaashoek and
Hari Balakrishnan},
booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '14)},
year         = 2014,
month        = apr,
}
Easy freshness with Pequod cache joins Bryan Kate, Eddie Kohler, Michael Kester, Neha Narula, Yandong Mao, and Robert Morris. NSDI 2014.

Pequod is a distributed application-level key-value cache that supports declaratively defined, incrementally maintained, dynamic, partially-materialized views. These views, which we call cache joins, can simplify application development by shifting the burden of view maintenance onto the cache. Cache joins define relationships among key ranges; using cache joins, Pequod calculates views on demand, incrementally updates them as required, and in many cases improves performance by reducing client communication. To build Pequod, we had to design a view abstraction for volatile, relationless key-value caches and make it work across servers in a distributed system. Pequod performs as well as other in-memory key-value caches and, like those caches, outperforms databases with view support.

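A cache join can be sketched with a toy subscription/timeline rule (hypothetical key naming, not Pequod's join syntax): a declarative relationship among key ranges lets the cache push each new post into the materialized timelines of subscribed users.

```python
# Toy incrementally maintained view in a key-value cache: writes to
# "post|<author>|<n>" are joined against subscriptions to populate
# "timeline|<user>|<author>|<n>" keys, so timeline reads are cheap lookups.

class Cache(dict):
    def __init__(self):
        super().__init__()
        self.subs = {}                 # user -> set of authors followed

    def subscribe(self, user: str, author: str) -> None:
        self.subs.setdefault(user, set()).add(author)

    def put_post(self, author: str, n: int, text: str) -> None:
        self[f"post|{author}|{n}"] = text
        # Incremental view maintenance: update only the affected timelines.
        for user, authors in self.subs.items():
            if author in authors:
                self[f"timeline|{user}|{author}|{n}"] = text

    def timeline(self, user: str):
        prefix = f"timeline|{user}|"
        return sorted(v for k, v in self.items() if k.startswith(prefix))
```

As in the paper, the view maintenance burden moves into the cache: the application writes posts and reads timelines without recomputing the join itself.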
@inproceedings{pequod:nsdi2014,
title        = {Easy Freshness with {Pequod} Cache Joins},
author       = {Bryan Kate and Eddie Kohler and Michael Kester and
Neha Narula and Yandong Mao and Robert Morris},
booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '14)},
year         = 2014,
month        = apr,
}

## 2013

Towards optimization-safe systems: Analyzing the impact of undefined behavior Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, and Armando Solar-Lezama. SOSP 2013.

This paper studies an emerging class of software bugs called optimization-unstable code: code that is unexpectedly discarded by compiler optimizations due to undefined behavior in the program. Unstable code is present in many systems, including the Linux kernel and the Postgres database. The consequences of unstable code range from incorrect functionality to missing security checks.

To reason about unstable code, this paper proposes a novel model, which views unstable code in terms of optimizations that leverage undefined behavior. Using this model, we introduce a new static checker called Stack that precisely identifies unstable code. Applying Stack to widely used systems has uncovered 160 new bugs that have been confirmed and fixed by developers.

@inproceedings{stack:sosp13,
title        = {Towards Optimization-Safe Systems: Analyzing the
Impact of Undefined Behavior},
author       = {Xi Wang and Nickolai Zeldovich and M. Frans Kaashoek
and Armando Solar-Lezama},
booktitle    = {Proceedings of the 24th ACM Symposium on Operating
Systems Principles (SOSP 2013)},
year         = 2013,
month        = nov,
}
Asynchronous intrusion recovery for interconnected web services Ramesh Chandra, Taesoo Kim, and Nickolai Zeldovich. SOSP 2013.

Recovering from attacks in an interconnected system is difficult, because an adversary that gains access to one part of the system may propagate to many others, and tracking down and recovering from such an attack requires significant manual effort. Web services are an important example of an interconnected system, as they are increasingly using protocols such as OAuth and REST APIs to integrate with one another. This paper presents Aire, an intrusion recovery system for such web services. Aire addresses several challenges, such as propagating repair across services when some servers may be unavailable, and providing appropriate consistency guarantees when not all servers have been repaired yet. Experimental results show that Aire can recover from four realistic attacks, including one modeled after a recent Facebook OAuth vulnerability; that porting existing applications to Aire requires little effort; and that Aire imposes a 19–30% CPU overhead and 6–9 KB/request storage cost for Askbot, an existing web application.

@inproceedings{aire:sosp13,
title        = {Asynchronous Intrusion Recovery for Interconnected
Web Services},
author       = {Ramesh Chandra and Taesoo Kim and Nickolai Zeldovich},
booktitle    = {Proceedings of the 24th ACM Symposium on Operating
Systems Principles (SOSP 2013)},
year         = 2013,
month        = nov,
}
The scalable commutativity rule: Designing scalable software for multicore processors Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler. SOSP 2013.

What fundamental opportunities for scalability are latent in interfaces, such as system call APIs? Can scalability opportunities be identified even before any implementation exists, simply by considering interface specifications? To answer these questions, this paper introduces the following rule: Whenever interface operations commute, they can be implemented in a way that scales. This rule aids developers in building more scalable software starting from interface design and carrying on through implementation, testing, and evaluation.

To help developers apply the rule, a new tool named Commuter accepts high-level interface models and generates tests of operations that commute and hence could scale. Using these tests, Commuter can evaluate the scalability of an implementation. We apply Commuter to 18 POSIX calls and use the results to guide the implementation of a new research operating system kernel called sv6. Linux scales for 68% of the 13,664 tests generated by Commuter for these calls, and Commuter finds many problems that have been observed to limit application scalability. sv6 scales for 99% of the tests.

@inproceedings{commutativity:sosp13,
title        = {The Scalable Commutativity Rule: Designing Scalable
Software for Multicore Processors},
author       = {Austin T. Clements and M. Frans Kaashoek and
Nickolai Zeldovich and Robert T. Morris and Eddie Kohler},
booktitle    = {Proceedings of the 24th ACM Symposium on Operating
Systems Principles (SOSP 2013)},
year         = 2013,
month        = nov,
}
Optimizing communication bottlenecks in multiprocessor operating system kernels Silas Boyd-Wickizer. Ph.D. thesis, Massachusetts Institute of Technology, September 2013.

One difficulty of programming multicore processors is achieving performance that scales with the number of cores in the system. A common performance optimization is to increase inter-core parallelism. If the application is sufficiently parallelized, developers might hope that performance would scale as core count increases. Unfortunately for some applications, such as operating system kernels, parallelization reveals inter-core communication as a performance bottleneck. When data is updated on one core and read or written on other cores, the cache coherence protocol serializes accesses to the data. The result is that each access to the shared data can take hundreds to thousands of cycles, depending on how many cores are contending for the data.

This dissertation focuses on optimizing communication bottlenecks caused by update-heavy workloads, where a data structure is frequently updated but rarely read. Such data structures are commonly used for operating system kernel bookkeeping, such as LRU lists, reverse maps in virtual memory, and file system notification queues. This dissertation identifies bottlenecks in the Linux kernel caused by update-heavy data structures, presents a general approach for optimizing communication in update-heavy data structures, and presents a library called OpLog that embodies this approach and helps developers achieve good scalability for update-heavy data structures. OpLog achieves scalability by logging update operations in per-core logs, and combining the logs only when required by a read to the data structure. Measurements on a 48-core AMD server show that applying OpLog to update-heavy data structures in the Linux kernel significantly improves application performance under certain workloads.
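The OpLog approach can be sketched for an LRU-style structure (hypothetical API, not the actual library): updates append to a per-core log and touch no shared state; a read first merges all logs in timestamp order into the shared structure, so communication happens only as often as reads do.

```python
# Toy OpLog-style LRU: touch() is the frequent update (logged locally);
# lru() is the rare read, which absorbs all logged touches before answering.
import heapq
import itertools

class OpLogLRU:
    def __init__(self, ncores: int):
        self.logs = [[] for _ in range(ncores)]
        self.ts = itertools.count()    # stand-in for a global timestamp
        self.order = []                # shared LRU order, oldest first

    def touch(self, core: int, key: str) -> None:
        # Update path: per-core append only; no shared cache lines touched.
        self.logs[core].append((next(self.ts), key))

    def lru(self):
        # Read path: apply logged touches in timestamp order, then answer.
        for _, key in heapq.merge(*self.logs):
            if key in self.order:
                self.order.remove(key)
            self.order.append(key)
        self.logs = [[] for _ in self.logs]
        return self.order[0] if self.order else None
```

For update-heavy, rarely-read structures (LRU lists, reverse maps, notification queues), deferring the merge to read time eliminates nearly all cross-core communication on the update path.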

@phdthesis{sbw-phd,
title        = {Optimizing Communication Bottlenecks in
Multiprocessor Operating System Kernels},
author       = {Silas Boyd-Wickizer},
school       = {Massachusetts Institute of Technology},
year         = 2013,
month        = sep,
}
Processing analytical queries over encrypted data Stephen Tu, M. Frans Kaashoek, Sam Madden, and Nickolai Zeldovich. VLDB 2013.

Monomi is a system for securely executing analytical workloads over sensitive data on an untrusted database server. Monomi works by encrypting the entire database and running queries over the encrypted data. Monomi introduces split client/server query execution, which can execute arbitrarily complex queries over encrypted data, as well as several techniques that improve performance for such workloads, including per-row precomputation, space-efficient encryption, grouped homomorphic addition, and pre-filtering. Since these optimizations are good for some queries but not others, Monomi introduces a designer for choosing an efficient physical design at the server for a given workload, and a planner to choose an efficient execution plan for a given query at runtime. A prototype of Monomi running on top of Postgres can execute most of the queries from the TPC-H benchmark with a median overhead of only 1.24× (ranging from 1.03× to 2.33×) compared to an unencrypted Postgres database where a compromised server would reveal all data.

@inproceedings{monomi:vldb13,
title        = {Processing Analytical Queries over Encrypted Data},
author       = {Stephen Tu and M. Frans Kaashoek and Sam Madden and
Nickolai Zeldovich},
booktitle    = {Proceedings of the 39th International Conference on
Very Large Data Bases (VLDB 2013)},
year         = 2013,
month        = aug,
}
Security bugs in embedded interpreters Haogang Chen, Cody Cutler, Taesoo Kim, Yandong Mao, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. APSys 2013.

Because embedded interpreters offer flexibility and performance, they are becoming more prevalent, and can be found at nearly every level of the software stack. As one example, the Linux kernel defines languages to describe packet filtering rules and uses embedded interpreters to filter packets at run time. As another example, the RAR archive format allows embedding bytecode in compressed files to describe reversible transformations for decompression. This paper presents an analysis of common pitfalls in embedded interpreter implementations, which can lead to security vulnerabilities, and their impact. We hope that these results are useful both in augmenting existing embedded interpreters and in aiding developers in building new, more secure embedded interpreters.
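One recurring pitfall class, missing validation of register indices and jump targets, can be illustrated with a toy register VM (purely illustrative, not the Linux BPF code): the loader rejects out-of-range operands before execution, and a fuel counter bounds runtime.

```python
# Toy bytecode VM: ("imm", reg, value), ("add", reg, reg), ("jmp", target, 0).
# validate() performs the load-time checks whose absence causes the
# out-of-bounds and non-termination bugs this class of interpreter suffers.

NREGS = 4

def validate(prog):
    for pc, (op, a, b) in enumerate(prog):
        if op == "imm" and not 0 <= a < NREGS:
            raise ValueError(f"bad register at {pc}")
        if op == "add" and not (0 <= a < NREGS and 0 <= b < NREGS):
            raise ValueError(f"bad register at {pc}")
        if op == "jmp" and not 0 <= a < len(prog):
            raise ValueError(f"jump out of bounds at {pc}")

def run(prog, fuel=1000):
    validate(prog)
    regs, pc = [0] * NREGS, 0
    while pc < len(prog) and fuel > 0:
        op, a, b = prog[pc]
        fuel -= 1                      # bound execution: no infinite loops
        if op == "imm":
            regs[a] = b; pc += 1
        elif op == "add":
            regs[a] += regs[b]; pc += 1
        elif op == "jmp":
            pc = a
    return regs
```

Real in-kernel filters (e.g., BPF) perform analogous static verification at load time, since a single unchecked index or jump in kernel context is a security bug rather than a crash.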

@inproceedings{vm:apsys13,
title        = {Security Bugs in Embedded Interpreters},
author       = {Haogang Chen and Cody Cutler and Taesoo Kim and
Yandong Mao and Xi Wang and Nickolai Zeldovich and M. Frans
Kaashoek},
booktitle    = {Proceedings of 4th ACM SIGOPS Asia-Pacific Workshop
on Systems (APSys 2013)},
year         = 2013,
month        = jul,
}
Optimizing RAM-latency dominated applications Yandong Mao, Cody Cutler, and Robert Morris. APSys 2013.

Many apparently CPU-limited programs are actually bottlenecked by RAM fetch latency, often because they follow pointer chains in working sets that are much bigger than the CPU’s on-chip cache. For example, garbage collectors that identify live objects by tracing inter-object pointers can spend much of their time stalling due to RAM fetches.

We observe that for such workloads, programmers should view RAM much as they view disk. The two situations share not just high access latency, but also a common set of approaches to coping with that latency. Relatively general-purpose techniques such as batching, sorting, and "I/O" concurrency work to hide RAM latency much as they do for disk.

This paper studies several RAM-latency dominated programs and shows how we apply general-purpose approaches to hide RAM latency. The evaluation shows that these optimizations improve performance by up to 1.4×. Counter-intuitively, even though these programs are not limited by CPU cycles, we found that adding more cores can yield better performance.
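The batching idea can be illustrated schematically (in Python only for exposition; the actual benefit requires prefetch-friendly machine code and real RAM): rather than walking one pointer chain to completion, which serializes every fetch, a program can advance many independent chains in lockstep so that each round's loads are independent and their latencies can overlap.

```python
# Schematic only: the access-pattern restructuring, not the hardware win.
def walk_chains_batched(next_node, starts, steps):
    """Advance all chains one hop per round instead of one chain at a time."""
    cursors = list(starts)
    for _ in range(steps):
        # All loads in this round are independent of each other, so a
        # prefetching implementation could overlap their memory latency.
        cursors = [next_node[c] for c in cursors]
    return cursors

# A toy "heap": node i points at node (i * 7 + 1) % 16.
next_node = [(i * 7 + 1) % 16 for i in range(16)]
ends = walk_chains_batched(next_node, starts=[0, 1, 2, 3], steps=2)
```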

@inproceedings{ram-latency:apsys13,
title        = {Optimizing {RAM}-latency Dominated Applications},
author       = {Yandong Mao and Cody Cutler and Robert Morris},
booktitle    = {Proceedings of 4th ACM SIGOPS Asia-Pacific Workshop
on Systems (APSys 2013)},
year         = 2013,
month        = jul,
}
Optimizing unit test execution in large software programs using dependency analysis Taesoo Kim, Ramesh Chandra, and Nickolai Zeldovich. APSys 2013.

Tao is a system that optimizes the execution of unit tests in large software programs and reduces the programmer wait time from minutes to seconds. Tao is based on two key ideas: First, Tao focuses on efficiency, unlike past work that focused on avoiding false negatives. Tao implements simple and fast function-level dependency tracking that identifies tests to run on a code change; any false negatives missed by this dependency tracking are caught by running the entire test suite on a test server once the code change is committed. Second, to make it easy for programmers to adopt Tao, it incorporates the dependency information into the source code repository. This paper describes an early prototype of Tao and demonstrates that Tao can reduce unit test execution time in two large Python software projects by over 96% while incurring few false negatives.

@inproceedings{tao:apsys13,
title        = {Optimizing Unit Test Execution in Large Software
Programs using Dependency Analysis},
author       = {Taesoo Kim and Ramesh Chandra and Nickolai Zeldovich},
booktitle    = {Proceedings of 4th ACM SIGOPS Asia-Pacific Workshop
on Systems (APSys 2013)},
year         = 2013,
month        = jul,
}
Practical and effective sandboxing for non-root users Taesoo Kim and Nickolai Zeldovich. USENIX 2013.

Mbox is a lightweight sandboxing mechanism for non-root users in commodity OSes. Mbox's sandbox usage model executes a program in the sandbox and prevents the program from modifying the host filesystem by layering the sandbox filesystem on top of the host filesystem. At the end of program execution, the user can examine changes in the sandbox filesystem and selectively commit them back to the host filesystem. Mbox implements this by interposing on system calls and provides a variety of useful applications: installing system packages as a non-root user, running unknown binaries safely without network accesses, checkpointing the host filesystem instantly, and setting up a virtual development environment without special tools. Our performance evaluation shows that Mbox imposes CPU overheads of 0.1–45.2% for various workloads. In this paper, we present Mbox's design, efficient techniques for interposing on system calls, our experience avoiding common system call interposition pitfalls, and Mbox's performance evaluation.
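The layered-filesystem model can be sketched with dictionaries standing in for filesystems; the paths and helper names are illustrative, not Mbox's interface. Writes land in a private sandbox layer, reads fall through to the host, and at exit the user selectively commits changes back.

```python
host_fs = {"/etc/config": "v1", "/home/u/notes": "draft"}
sandbox_layer = {}

def fs_write(path, data):
    # All writes are diverted to the sandbox layer; the host is untouched.
    sandbox_layer[path] = data

def fs_read(path):
    # Reads prefer the sandbox layer, then fall through to the host.
    return sandbox_layer.get(path, host_fs.get(path))

def commit(paths):
    # User-approved changes are copied back to the host filesystem.
    for p in paths:
        if p in sandbox_layer:
            host_fs[p] = sandbox_layer[p]

fs_write("/etc/config", "v2")
assert host_fs["/etc/config"] == "v1"      # host unchanged until commit
commit(["/etc/config"])                    # selectively commit one change
```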

@inproceedings{mbox:usenix13,
title        = {Practical and Effective Sandboxing for Non-root
Users},
author       = {Taesoo Kim and Nickolai Zeldovich},
booktitle    = {Proceedings of the 2013 USENIX Annual Technical
Conference (USENIX '13)},
year         = 2013,
month        = jun,
}
Graduate admissions at MIT & comparison-based rank aggregation: A case study Katherine Szeto. Master's thesis, Massachusetts Institute of Technology, June 2013.

Admission to the Graduate program in EECS at MIT is done via an all-electronic system. Applicants submit materials through a web interface and faculty "reviewers" read the applications and make notes via a different interface. Among other comments, reviewers provide a numerical score between 1 and 4 to each application. Admissions decisions are reached via a lengthy process involving discussion among faculty, which is often guided by the numerical scores of the applications. Past data show the scores of the applicants still under consideration during the final stages of the process are almost all 4's. Because of this uniformity, the scores do not provide much resolution into the differences in quality of the applications, and are therefore not as helpful as they could be in guiding admissions decisions.

In this work, we present the use of an additional scoring system that is based on pairwise comparisons between applicants made by faculty reviewers. We present the design we created for this scheme and the code written to implement it, and analyze the data we obtained when the code went live during the 2012–2013 admissions cycle.
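The aggregation method itself is not described above, so as a purely illustrative sketch, the simplest comparison-based aggregator ranks applicants by the fraction of pairwise comparisons they win:

```python
from collections import defaultdict

def rank_by_win_fraction(comparisons):
    """comparisons: list of (winner, loser) pairs from reviewers."""
    wins, total = defaultdict(int), defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    # Rank by the fraction of comparisons each applicant won.
    return sorted(total, key=lambda a: wins[a] / total[a], reverse=True)

ranking = rank_by_win_fraction([("A", "B"), ("A", "C"), ("B", "C"), ("C", "B")])
```

More principled aggregators (e.g. Bradley-Terry-style models) fit a latent quality score instead of a raw win fraction, which handles unequal numbers of comparisons per applicant better.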

@mastersthesis{szeto-meng,
title        = {Graduate Admissions at {MIT} \& Comparison-based
Rank Aggregation: A Case Study},
author       = {Katherine Szeto},
school       = {Massachusetts Institute of Technology},
year         = 2013,
month        = jun,
}
Finding linearization violations in lock-free concurrent data structures Sebastien Dabdoub. Master's thesis, Massachusetts Institute of Technology, May 2013.

Finding bugs in lock-free concurrent programs is hard. This is due in part to the difficulty of reasoning about the correctness of concurrent algorithms and the timing-sensitive nature of concurrent programs. One of the most widely used tools for reasoning about the correctness of concurrent algorithms is the linearization property. This thesis presents a tool for automatic dynamic checking of concurrent programs under the Total-Store-Order (TSO) memory model and a methodology for finding linearization violations automatically with the tool.

@mastersthesis{codex:sdabdoub-meng,
title        = {Finding Linearization Violations in Lock-Free
Concurrent Data Structures},
author       = {Sebastien Dabdoub},
school       = {Massachusetts Institute of Technology},
year         = 2013,
month        = may,
}
RadixVM: Scalable address spaces for multithreaded applications (revised 2014-08-05) Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. EuroSys 2013.

@inproceedings{radixvm:eurosys13,
title        = {{RadixVM}: Scalable Address Spaces for Multithreaded
Applications (revised 2014-08-05)},
author       = {Austin T. Clements and M. Frans Kaashoek and
Nickolai Zeldovich},
booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
2013)},
year         = 2013,
month        = apr,
}

## 2012

Improving integer security for systems with KINT Xi Wang, Haogang Chen, Zhihao Jia, Nickolai Zeldovich, and M. Frans Kaashoek. OSDI 2012.
@inproceedings{kint:osdi12,
title        = {Improving Integer Security for Systems with {KINT}},
author       = {Xi Wang and Haogang Chen and Zhihao Jia and Nickolai
Zeldovich and M. Frans Kaashoek},
booktitle    = {Proceedings of the 10th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '12)},
year         = 2012,
month        = oct,
}
Efficient patch-based auditing for web application vulnerabilities Taesoo Kim, Ramesh Chandra, and Nickolai Zeldovich. OSDI 2012.

Poirot is a system that, given a patch for a newly discovered security vulnerability in a web application, helps administrators detect past intrusions that exploited the vulnerability. Poirot records all requests to the server during normal operation, and given a patch, re-executes requests using both patched and unpatched software, and reports to the administrator any request that executes differently in the two cases. A key challenge with this approach is the cost of re-executing all requests, and Poirot introduces several techniques to reduce the time required to audit past requests, including filtering requests based on their control flow and memoization of intermediate results across different requests.

A prototype of Poirot for PHP accurately detects attacks on older versions of MediaWiki and HotCRP, given subsequently released patches. Poirot's techniques allow it to audit past requests 12–51× faster than the time it took to originally execute the same requests, for patches to code executed by every request, under a realistic MediaWiki workload.
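The audit loop can be sketched as follows, with hypothetical request handlers standing in for the unpatched and patched application code; Poirot's actual re-execution, control-flow filtering, and memoization machinery is far more involved.

```python
def unpatched_handler(req):
    # Vulnerable version: echoes raw input into the page.
    return "page: " + req["q"]

def patched_handler(req):
    # Patched version: escapes script tags.
    return "page: " + req["q"].replace("<script>", "&lt;script&gt;")

logged_requests = [{"q": "hello"}, {"q": "<script>steal()</script>"}]

def audit(requests):
    # Re-execute each logged request both ways; a difference in output
    # means the patch changed this request's behavior, i.e. a possible
    # past exploit of the vulnerability.
    return [r for r in requests
            if unpatched_handler(r) != patched_handler(r)]

flagged = audit(logged_requests)
```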

@inproceedings{poirot:osdi12,
title        = {Efficient Patch-based Auditing for Web Application
Vulnerabilities},
author       = {Taesoo Kim and Ramesh Chandra and Nickolai Zeldovich},
booktitle    = {Proceedings of the 10th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '12)},
year         = 2012,
month        = oct,
}
Undefined behavior: What happened to my code? Xi Wang, Haogang Chen, Alvin Cheung, Zhihao Jia, Nickolai Zeldovich, and M. Frans Kaashoek. APSys 2012.
@inproceedings{ub:apsys12,
title        = {Undefined Behavior: What Happened to My Code?},
author       = {Xi Wang and Haogang Chen and Alvin Cheung and Zhihao
Jia and Nickolai Zeldovich and M. Frans Kaashoek},
booktitle    = {Proceedings of 3rd ACM SIGOPS Asia-Pacific Workshop
on Systems (APSys 2012)},
year         = 2012,
month        = jul,
}
Recovering from intrusions in distributed systems with Dare Taesoo Kim, Ramesh Chandra, and Nickolai Zeldovich. APSys 2012.

Dare is a system that recovers system integrity after intrusions that spread between machines in a distributed system. Dare extends the rollback-and-reexecute recovery model of Retro to distributed system recovery by solving the following challenges: tracking dependencies across machines, repairing network connections, minimizing distributed repair, and dealing with long-running daemon processes. This paper describes an early prototype of Dare, presents some preliminary results, and discusses open problems.

@inproceedings{dare:apsys12,
title        = {Recovering from Intrusions in Distributed Systems
with {Dare}},
author       = {Taesoo Kim and Ramesh Chandra and Nickolai Zeldovich},
booktitle    = {Proceedings of 3rd ACM SIGOPS Asia-Pacific Workshop
on Systems (APSys 2012)},
year         = 2012,
month        = jul,
}
Non-scalable locks are dangerous Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. Linux Symposium 2012.
@inproceedings{locks:linuxsymp,
title        = {Non-scalable locks are dangerous},
author       = {Silas Boyd-Wickizer and M. Frans Kaashoek and Robert
Morris and Nickolai Zeldovich},
booktitle    = {Proceedings of the Linux Symposium},
year         = 2012,
month        = jul,
}
Executing web application queries on a partitioned database Neha Narula and Robert Morris. USENIX Webapps 2012.

Partitioning data over multiple storage servers is an attractive way to increase throughput for web-like workloads. However, there is often no one partitioning that yields good performance for all queries, and it can be challenging for the web developer to determine how best to execute queries over partitioned data.

This paper presents Dixie, a SQL query planner, optimizer, and executor for databases horizontally partitioned over multiple servers. Dixie focuses on increasing inter-query parallel speedup by involving as few servers as possible in each query. One way it does this is by supporting tables with multiple copies partitioned on different columns, in order to expand the set of queries that can be satisfied from a single server. Dixie automatically transforms SQL queries to execute over a partitioned database, using a cost model and plan generator that exploit multiple table copies.

We evaluate Dixie on a database and query stream taken from Wikipedia, partitioned across ten MySQL servers. By adding one copy of a 13 MB table and using Dixie's query optimizer, we achieve a throughput improvement of 3.2X over a single optimized partitioning of each table and 8.5X over the same data on a single server. On specific queries Dixie with table copies increases throughput linearly with the number of servers, while the best single-table-copy partitioning achieves little scaling. For a large class of joins, which traditional wisdom suggests requires tables partitioned on the join keys, Dixie can find higher-performance plans using other partitionings.
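Dixie's use of multiple table copies can be sketched as follows; the schema, hash partitioning, and two-copy layout are simplified illustrations, not Dixie's cost model. With a copy partitioned on each frequently queried column, an equality lookup on either column touches exactly one server.

```python
NSERVERS = 10

def partition(value):
    return hash(value) % NSERVERS

# Two copies of the same table, partitioned on different columns.
copies = {"user_id": {}, "email": {}}     # column -> {server -> rows}

def insert(row):
    # Every copy receives the row, placed by its own partitioning column.
    for col, servers in copies.items():
        servers.setdefault(partition(row[col]), []).append(row)

def lookup(col, value):
    # Returns (rows, number of servers contacted).
    if col in copies:                     # a one-server plan exists
        rows = copies[col].get(partition(value), [])
        return [r for r in rows if r[col] == value], 1
    # No copy is partitioned on this column: broadcast to every server.
    rows = [r for part in copies["user_id"].values()
            for r in part if r[col] == value]
    return rows, NSERVERS

insert({"user_id": 7, "email": "ann@example.com", "name": "Ann"})
rows, servers_contacted = lookup("email", "ann@example.com")
```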

@inproceedings{dixie:usenixwebapps2012,
title        = {Executing Web Application Queries on a Partitioned
Database},
author       = {Neha Narula and Robert Morris},
booktitle    = {Proceedings of the 3rd USENIX Conference on Web
Application Development (USENIX Webapps '12)},
year         = 2012,
month        = jun,
}
Improving network connection locality on multicore systems Aleksey Pesterev, Jacob Strauss, Nickolai Zeldovich, and Robert T. Morris. EuroSys 2012.

Incoming and outgoing processing for a given TCP connection often execute on different cores: an incoming packet is typically processed on the core that receives the interrupt, while outgoing data processing occurs on the core running the relevant user code. As a result, accesses to read/write connection state (such as TCP control blocks) often involve cache invalidations and data movement between cores' caches. These can take hundreds of processor cycles, enough to significantly reduce performance.

We present a new design, called Affinity-Accept, that causes all processing for a given TCP connection to occur on the same core. Affinity-Accept arranges for the network interface to determine the core on which application processing for each new connection occurs, in a lightweight way; it adjusts the card's choices only in response to imbalances in CPU scheduling. Measurements show that for the Apache web server serving static files on a 48-core AMD system, Affinity-Accept reduces time spent in the TCP stack by 30% and improves overall throughput by 24%.

@inproceedings{affinity-accept:eurosys12,
title        = {Improving Network Connection Locality on Multicore
Systems},
author       = {Aleksey Pesterev and Jacob Strauss and Nickolai
Zeldovich and Robert T. Morris},
booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
2012)},
year         = 2012,
month        = apr,
}
Cache craftiness for fast multicore key-value storage Yandong Mao, Eddie Kohler, and Robert Morris. EuroSys 2012.
@inproceedings{masstree:eurosys12,
title        = {Cache Craftiness for Fast Multicore Key-Value
Storage},
author       = {Yandong Mao and Eddie Kohler and Robert Morris},
booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
2012)},
year         = 2012,
month        = apr,
}
UFlood: High-throughput flooding over wireless mesh networks Jayashree Subramanian, Robert Morris, and Hari Balakrishnan. Infocom 2012.

This paper proposes UFlood, a flooding protocol for wireless mesh networks. UFlood targets situations such as software updates where all nodes need to receive the same large file of data, and where limited radio range requires forwarding. UFlood's goals are high throughput and low airtime, defined respectively as rate of completion of a flood to the slowest receiving node and total time spent transmitting. The key to achieving these goals is good choice of sender for each transmission opportunity. The best choice evolves as a flood proceeds in ways that are difficult to predict.

UFlood's core new idea is a distributed heuristic to dynamically choose the senders likely to lead to all nodes receiving the flooded data in the least time. The mechanism takes into account which data nearby receivers already have as well as inter-node channel quality. The mechanism includes a novel bit-rate selection algorithm that trades off the speed of high bit-rates against the larger number of nodes likely to receive low bit-rates. Unusually, UFlood uses both random network coding to increase the usefulness of each transmission and detailed feedback about what data each receiver already has; the feedback is critical in deciding which node's coded transmission will have the most benefit to receivers. The required feedback is potentially voluminous, but UFlood includes novel techniques to reduce its cost.

The paper presents an evaluation on a 25-node 802.11 test-bed. UFlood achieves 150% higher throughput than MORE, a high-throughput flooding protocol, using 65% less airtime. UFlood uses 54% less airtime than MNP, an existing efficient protocol, and achieves 300% higher throughput.

@inproceedings{uflood:infocom12,
title        = {{UFlood}: High-Throughput Flooding over Wireless
Mesh Networks},
author       = {Jayashree Subramanian and Robert Morris and Hari
Balakrishnan},
booktitle    = {Proceedings of the 31st Annual {IEEE} International
Conference on Computer Communications ({INFOCOM} 2012)},
year         = 2012,
month        = mar,
}
Scalable address spaces using RCU balanced trees Austin T. Clements, Frans Kaashoek, and Nickolai Zeldovich. ASPLOS 2012.

@inproceedings{rcuvm:asplos12,
title        = {Scalable Address Spaces Using {RCU} Balanced Trees},
author       = {Austin T. Clements and Frans Kaashoek and Nickolai
Zeldovich},
booktitle    = {Proceedings of the 17th International Conference on
Architectural Support for Programming Languages and Operating
Systems ({ASPLOS})},
year         = 2012,
month        = mar,
}

## 2011

CPHash: a cache-partitioned hash table Zviad Metreveli, Nickolai Zeldovich, and Frans Kaashoek. MIT CSAIL technical report, November 2011.
@techreport{cphash:tr,
title        = {{CPHash}: a cache-partitioned hash table},
author       = {Zviad Metreveli and Nickolai Zeldovich and Frans
Kaashoek},
number       = {MIT-CSAIL-TR-2011-051},
year         = 2011,
month        = nov,
institution  = {{MIT} Computer Science and Artificial Intelligence
Laboratory},
}
Software fault isolation with API integrity and multi-principal modules Yandong Mao, Haogang Chen, Dong Zhou, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. SOSP 2011.
@inproceedings{lxfi:sosp11,
title        = {Software fault isolation with {API} integrity and
multi-principal modules},
author       = {Yandong Mao and Haogang Chen and Dong Zhou and Xi
Wang and Nickolai Zeldovich and M. Frans Kaashoek},
booktitle    = {Proceedings of the 23rd ACM Symposium on Operating
Systems Principles (SOSP 2011)},
year         = 2011,
month        = oct,
}
Intrusion recovery for database-backed web applications Ramesh Chandra, Taesoo Kim, Meelap Shah, Neha Narula, and Nickolai Zeldovich. SOSP 2011.

Warp is a system that helps users and administrators of web applications recover from intrusions such as SQL injection, cross-site scripting, and clickjacking attacks, while preserving legitimate user changes. Warp repairs from an intrusion by rolling back parts of the database to a version before the attack, and replaying subsequent legitimate actions. Warp allows administrators to retroactively patch security vulnerabilities – i.e., apply new security patches to past executions – to recover from intrusions without requiring the administrator to track down or even detect attacks. Warp's time-travel database allows fine-grained rollback of database rows, and enables repair to proceed concurrently with normal operation of a web application. Finally, Warp captures and replays user input at the level of a browser's DOM, to recover from attacks that involve a user's browser. For a web server running MediaWiki, Warp requires no application source code changes to recover from a range of common web application vulnerabilities with minimal user input at a cost of 24–27% in throughput and 2–3.2 GB/day in storage.

@inproceedings{warp:sosp11,
title        = {Intrusion Recovery for Database-backed Web
Applications},
author       = {Ramesh Chandra and Taesoo Kim and Meelap Shah and
Neha Narula and Nickolai Zeldovich},
booktitle    = {Proceedings of the 23rd ACM Symposium on Operating
Systems Principles (SOSP 2011)},
year         = 2011,
month        = oct,
}
Linux kernel vulnerabilities: State-of-the-art defenses and open problems Haogang Chen, Yandong Mao, Xi Wang, Dong Zhou, Nickolai Zeldovich, and M. Frans Kaashoek. APSys 2011.
@inproceedings{vulnerabilities:chen,
title        = {Linux kernel vulnerabilities: State-of-the-art
defenses and open problems},
author       = {Haogang Chen and Yandong Mao and Xi Wang and Dong
Zhou and Nickolai Zeldovich and M. Frans Kaashoek},
booktitle    = {Proceedings of 2nd ACM SIGOPS Asia-Pacific Workshop
on Systems (APSys 2011)},
year         = 2011,
month        = jul,
}
Retroactive auditing Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. APSys 2011.
@inproceedings{rad:apsys11,
title        = {Retroactive Auditing},
author       = {Xi Wang and Nickolai Zeldovich and M. Frans Kaashoek},
booktitle    = {Proceedings of 2nd ACM SIGOPS Asia-Pacific Workshop
on Systems (APSys 2011)},
year         = 2011,
month        = jul,
}
Eyo: Device-transparent personal storage Jacob Strauss, Justin Mazzola Paluska, Chris Lesniewski-Laas, Bryan Ford, Robert Morris, and Frans Kaashoek. USENIX 2011.

Users increasingly store data collections such as digital photographs on multiple personal devices, each of which typically offers a storage management interface oblivious to the contents of the user's other devices. As a result, collections become disorganized and drift out of sync. This paper presents Eyo, a novel personal storage system that provides device transparency: a user can think in terms of "file X", rather than "file X on device Y", and will see the same set of files on all personal devices. Eyo allows a user to view and manage the entire collection of objects from any of their devices, even from disconnected devices and devices with too little storage to hold all the object content. Eyo synchronizes these collections across any network topology, including direct peer-to-peer links. Eyo provides applications with a storage API with first-class access to object version history in order to resolve update conflicts automatically. Experiments with several applications using Eyo (media players, a photo editor, a podcast manager, and an email interface) show that device transparency requires only minor application changes, and matches the storage and bandwidth capabilities of typical portable devices.

@inproceedings{eyo:usenix11,
title        = {Eyo: Device-Transparent Personal Storage},
author       = {Jacob Strauss and Justin Mazzola Paluska and Chris
Lesniewski-Laas and Bryan Ford and Robert Morris and Frans
Kaashoek},
booktitle    = {Proceedings of the 2011 USENIX Annual Technical
Conference (USENIX '11)},
year         = 2011,
month        = jun,
}
A software approach to unifying multicore caches Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. MIT CSAIL technical report, June 2011.
@techreport{mfc:tr11,
title        = {A Software Approach to Unifying Multicore Caches},
author       = {Silas Boyd-Wickizer and M. Frans Kaashoek and Robert
Morris and Nickolai Zeldovich},
institution  = {{MIT} Computer Science and Artificial Intelligence
Laboratory},
number       = {MIT-CSAIL-TR-2011-032},
year         = 2011,
month        = jun,
}
CPHash: A cache-partitioned hash table with LRU eviction Zviad Metreveli. Master's thesis, Massachusetts Institute of Technology, June 2011.
In this thesis we introduce CPHash, a scalable fixed-size hash table that supports eviction using an LRU list, and CPServer, a scalable in-memory key/value cache server that uses CPHash to implement its hash table. CPHash uses computation migration to avoid transferring data between cores. Experiments on a 48-core machine show that CPHash has 2 to 3 times higher throughput than a hash table implemented using scalable fine-grained locks. CPServer achieves 1.2 to 1.7 times higher throughput than a key/value cache server that uses a hash table with scalable fine-grained locks, and 1.5 to 2.6 times higher throughput than Memcached.
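The cache-partitioned layout can be sketched as a toy model; the real system's per-core message passing and concurrency machinery are elided here. The table is split into partitions, each logically owned by one core, so operations on a key are routed to the owning partition (migrating the computation) instead of moving the data between cores, and each partition maintains its own LRU order.

```python
from collections import OrderedDict

NPART, CAPACITY = 4, 2          # partitions and per-partition LRU capacity

partitions = [OrderedDict() for _ in range(NPART)]

def owner(key):
    # Which core's partition serves this key.
    return hash(key) % NPART

def put(key, val):
    p = partitions[owner(key)]  # conceptually, a message to the owning core
    p[key] = val
    p.move_to_end(key)
    if len(p) > CAPACITY:
        p.popitem(last=False)   # evict that partition's least-recently-used

def get(key):
    p = partitions[owner(key)]
    if key in p:
        p.move_to_end(key)      # refresh LRU position
        return p[key]
    return None

put(0, "a"); put(4, "b"); put(8, "c")   # all three map to partition 0
```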
@mastersthesis{cphash:zviad-meng-thesis,
title        = {{CPHash}: A Cache-Partitioned Hash Table with {LRU}
Eviction},
author       = {Zviad Metreveli},
school       = {Massachusetts Institute of Technology},
year         = 2011,
month        = jun,
}

## 2010

Intrusion recovery using selective re-execution Taesoo Kim, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. OSDI 2010.

Retro repairs a desktop or server after an adversary compromises it, by undoing the adversary's changes while preserving legitimate user actions, with minimal user involvement. During normal operation, Retro records an action history graph, which is a detailed dependency graph describing the system's execution. Retro uses refinement to describe graph objects and actions at multiple levels of abstraction, which allows for precise dependencies. During repair, Retro uses the action history graph to undo an unwanted action and its indirect effects by first rolling back its direct effects, and then re-executing legitimate actions that were influenced by that change. To minimize user involvement and re-execution, Retro uses predicates to selectively re-execute only actions that were semantically affected by the adversary's changes, and uses compensating actions to handle external effects.

An evaluation of a prototype of Retro for Linux with 2 real-world attacks, 2 synthesized challenge attacks, and 6 attacks from previous work, shows that Retro can often repair the system without user involvement, and avoids false positives and negatives from previous solutions. These benefits come at the cost of 35–127% in execution time overhead and of 4–150 GB of log space per day, depending on the workload. For example, a HotCRP paper submission web site incurs 35% slowdown and generates 4 GB of logs per day under the workload from 30 minutes prior to the SOSP 2007 deadline.
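The selective re-execution idea can be sketched with a toy dependency graph; the graph and action names here are illustrative, whereas Retro's real action history graph tracks OS-level objects at multiple levels of abstraction. Undoing a tainted action invalidates exactly the actions that transitively depended on it, and only those are re-executed.

```python
deps = {                 # action -> set of actions it depends on
    "attacker_edit": set(),
    "user_edit":     set(),
    "rebuild_index": {"attacker_edit", "user_edit"},
    "send_mail":     {"user_edit"},
}

def affected_by(bad_action):
    # Transitive closure of actions influenced by the unwanted action.
    tainted, frontier = set(), {bad_action}
    while frontier:
        tainted |= frontier
        frontier = {a for a, d in deps.items() if d & frontier} - tainted
    return tainted

# Only the attack and the actions it influenced are rolled back and
# re-executed; "send_mail" depended only on legitimate input.
to_repair = affected_by("attacker_edit")
```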

@inproceedings{retro:osdi10,
title        = {Intrusion recovery using selective re-execution},
author       = {Taesoo Kim and Xi Wang and Nickolai Zeldovich and M.
Frans Kaashoek},
booktitle    = {Proceedings of the 9th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '10)},
year         = 2010,
month        = oct,
}
An analysis of Linux scalability to many cores Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. OSDI 2010.

This paper analyzes the scalability of seven system applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce) running on Linux on a 48-core computer. Except for gmake, all applications trigger scalability bottlenecks inside a recent Linux kernel. Using mostly standard parallel programming techniques—this paper introduces one new technique, sloppy counters—these bottlenecks can be removed from the kernel or avoided by changing the applications slightly. Modifying the kernel required in total 3002 lines of code changes. A speculative conclusion from this analysis is that there is no scalability reason to give up on traditional operating system organizations just yet.
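The sloppy-counter technique mentioned above can be sketched as follows; the class and method names are illustrative, and the kernel version additionally lets cores hand spare references back to a central counter when needed. Each core increments a private counter so the common path never touches a shared cache line; reading the true value is the rare, slower path that sums the per-core pieces.

```python
class SloppyCounter:
    def __init__(self, ncores):
        # One slot per core; in the kernel each slot sits in its own
        # cache line to avoid false sharing.
        self.per_core = [0] * ncores

    def inc(self, core, n=1):
        # Common path: purely core-local, no cross-core communication.
        self.per_core[core] += n

    def value(self):
        # Rare path: sum all the per-core pieces for the true total.
        return sum(self.per_core)

c = SloppyCounter(ncores=4)
for core in range(4):
    for _ in range(100):
        c.inc(core)
total = c.value()
```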

@inproceedings{linux:osdi10,
title        = {An Analysis of {Linux} Scalability to Many Cores},
author       = {Silas Boyd-Wickizer and Austin T. Clements and
Yandong Mao and Aleksey Pesterev and M. Frans Kaashoek and Robert
Morris and Nickolai Zeldovich},
booktitle    = {Proceedings of the 9th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '10)},
year         = 2010,
month        = oct,
}
Design and applications of a secure and decentralized distributed hash table Christopher T. Lesniewski-Laas. Ph.D. thesis, Massachusetts Institute of Technology, September 2010.
@phdthesis{ctl-phd,
title        = {Design and Applications of a Secure and
Decentralized Distributed Hash Table},
author       = {Christopher T. Lesniewski-Laas},
school       = {Massachusetts Institute of Technology},
year         = 2010,
month        = sep,
}
Device-transparent personal storage Jacob Strauss. Ph.D. thesis, Massachusetts Institute of Technology, September 2010.
@phdthesis{jastr-phd,
title        = {Device-Transparent Personal Storage},
author       = {Jacob Strauss},
school       = {Massachusetts Institute of Technology},
year         = 2010,
month        = sep,
}
Making Linux protection mechanisms egalitarian with UserFS Taesoo Kim and Nickolai Zeldovich. USENIX Security 2010.

UserFS provides egalitarian OS protection mechanisms in Linux. UserFS allows any user—not just the system administrator—to allocate Unix user IDs, to use chroot, and to set up firewall rules in order to confine untrusted code. One key idea in UserFS is representing user IDs as files in a /proc-like file system, thus allowing applications to manage user IDs like any other files, by setting permissions and passing file descriptors over Unix domain sockets. UserFS addresses several challenges in making user IDs egalitarian, including accountability, resource allocation, persistence, and UID reuse. We have ported several applications to take advantage of UserFS; by changing just tens to hundreds of lines of code, we prevented attackers from exploiting application-level vulnerabilities, such as code injection or missing ACL checks in a PHP-based wiki application. Implementing UserFS requires minimal changes to the Linux kernel (a single 3,000-line kernel module) and incurs no performance overhead for most operations, making it practical to deploy on real systems.

@inproceedings{userfs:sec19,
title        = {Making {Linux} Protection Mechanisms Egalitarian
with {UserFS}},
author       = {Taesoo Kim and Nickolai Zeldovich},
booktitle    = {Proceedings of the 19th {USENIX} {S}ecurity
{S}ymposium},
year         = 2010,
month        = aug,
}
Separating web applications from user data storage with BStore Ramesh Chandra, Priya Gupta, and Nickolai Zeldovich. USENIX WebApps 2010.
This paper presents BStore, a framework that allows developers to separate their web application code from user data storage. With BStore, storage providers implement a standard file system API, and applications access user data through that same API without having to worry about where the data might be stored. A file system manager allows the user and applications to combine multiple file systems into a single namespace, and to control what data each application can access. One key idea in BStore's design is the use of tags on files, which allows applications both to organize data in different ways, and to delegate fine-grained access to other applications. We have implemented a prototype of BStore in Javascript that runs in unmodified Firefox and Chrome browsers. We also implemented three file systems and ported three different applications to BStore. Our prototype incurs an acceptable performance overhead of less than 5% on a 10Mbps network connection, and porting existing client-side applications to BStore required small amounts of source code changes.
@inproceedings{bstore:webapps10,
author       = {Ramesh Chandra and Priya Gupta and Nickolai
Zeldovich},
title        = {Separating Web Applications from User Data Storage
with {BS}tore},
booktitle    = {Proceedings of the 2010 USENIX Conference on Web
Application Development (USENIX WebApps '10)},
year         = 2010,
month        = jun,
}
Tolerating malicious device drivers in Linux Silas Boyd-Wickizer and Nickolai Zeldovich. USENIX 2010.
@inproceedings{sud:usenix10,
title        = {Tolerating Malicious Device Drivers in {Linux}},
author       = {Silas Boyd-Wickizer and Nickolai Zeldovich},
booktitle    = {Proceedings of the 2010 USENIX Annual Technical
Conference (USENIX '10)},
year         = 2010,
month        = jun,
}
WhanauSIP: A secure peer-to-peer communication platform Raymond Chen. Master's thesis, Massachusetts Institute of Technology, June 2010.
This thesis presents a novel mechanism for achieving secure and reliable peer-to-peer communications on the Internet. WhanauSIP merges a Sybil-proof distributed hash table with mature SIP technology to enable instant messaging, audio chat, and video conferencing that is resilient to censoring, eavesdropping, and forgery. Performance and security evaluations performed on the PlanetLab network demonstrate that the majority of resource lookups return within 5 seconds. These results indicate that WhanauSIP delivers practical performance with respect to call session initialization latency for VoIP telephony. Furthermore, the tests demonstrated that lookup performance was minimally affected during a Sybil cluster ID attack, illustrating the network's resilience to malicious adversaries. The thesis delivers three software packages for public use: a general Whanau distributed hash table implementation, a WhanauSIP gateway, and a desktop IM/VoIP client.
@mastersthesis{whanuasip:raymond-meng-thesis,
title        = {{WhanauSIP}: A Secure Peer-to-Peer Communication
Platform},
author       = {Raymond Chen},
school       = {Massachusetts Institute of Technology},
year         = 2010,
month        = jun,
}
Optimizing MapReduce for multicore architectures Yandong Mao, Robert Morris, and Frans Kaashoek. MIT CSAIL technical report, May 2010.
@techreport{metis:tr,
title        = {Optimizing {MapReduce} for Multicore Architectures},
author       = {Yandong Mao and Robert Morris and Frans Kaashoek},
number       = {MIT-CSAIL-TR-2010-020},
year         = 2010,
month        = may,
institution  = {{MIT} Computer Science and Artificial Intelligence
Laboratory},
}
Whānau: A Sybil-proof distributed hash table Chris Lesniewski-Laas, and M. Frans Kaashoek. NSDI 2010.

Whānau is a novel routing protocol for distributed hash tables (DHTs) that is efficient and strongly resistant to the Sybil attack. Whānau uses the social connections between users to build routing tables that enable Sybil-resistant lookups. The number of Sybils in the social network does not affect the protocol's performance, but links between honest users and Sybils do. When there are n well-connected honest nodes, Whānau can tolerate up to O(n/log n) such "attack edges". This means that an adversary must convince a large fraction of the honest users to make a social connection with the adversary's Sybils before any lookups will fail.

Whānau uses ideas from structured DHTs to build routing tables that contain O(√n log n) entries per node. It introduces the idea of layered identifiers to counter clustering attacks, a class of Sybil attacks that previous DHTs found hard to handle. Using the constructed tables, lookups provably take constant time. Simulation results, using social network graphs from LiveJournal, Flickr, YouTube, and DBLP, confirm the analytic results. Experimental results on PlanetLab confirm that the protocol can handle modest churn.

@inproceedings{whanau:nsdi10,
title        = {Wh\=anau: A {Sybil}-Proof Distributed Hash Table},
author       = {Chris Lesniewski-Laas and M. Frans Kaashoek},
booktitle    = {Proceedings of the 7th {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '10)},
year         = 2010,
month        = apr,
}
Locating cache performance bottlenecks using data profiling Aleksey Pesterev, Nickolai Zeldovich, and Robert T. Morris. EuroSys 2010.

Effective use of CPU data caches is critical to good performance, but poor cache use patterns are often hard to spot using existing execution profiling tools. Typical profilers attribute costs to specific code locations. The costs due to frequent cache misses on a given piece of data, however, may be spread over instructions throughout the application. The resulting individually small costs at a large number of instructions can easily appear insignificant in a code profiler's output.

DProf helps programmers understand cache miss costs by attributing misses to data types instead of code. Associating cache misses with data helps programmers locate data structures that experience misses in many places in the application's code. DProf introduces a number of new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime, and which accesses incur expensive cross-CPU cache loads. We present two case studies of using DProf to find and fix cache performance bottlenecks in Linux. The improvements provide a 16--57% throughput improvement on a range of memcached and Apache workloads.
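The shift from a code profile to a data profile can be illustrated with a toy sketch. The sample data below is invented for illustration; DProf itself derives its samples from hardware cache-miss monitoring and maps addresses to the types of the objects that occupy them.

```python
from collections import Counter

# Toy cache-miss samples: (program counter, data type of the missed
# address). Both columns are made up for illustration.
samples = [
    (0x400a10, "skb"), (0x400b34, "skb"), (0x401c08, "skb"),
    (0x400a10, "inode"), (0x402d44, "skb"), (0x403e90, "timer"),
]

by_pc = Counter(pc for pc, _ in samples)    # classic code profile
by_type = Counter(t for _, t in samples)    # DProf-style data profile

# Spread over many PCs, the skb misses look minor in the code profile
# (no single PC exceeds 2 samples), but the data profile makes the
# skb type stand out with 4 of the 6 misses.
assert max(by_pc.values()) == 2
assert by_type.most_common(1)[0] == ("skb", 4)
```

Grouping by data type is what lets a miss cost that is diffuse in code space become concentrated, and therefore visible, in type space.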

@inproceedings{dprof:eurosys10,
title        = {Locating Cache Performance Bottlenecks Using Data
Profiling},
author       = {Aleksey Pesterev and Nickolai Zeldovich and Robert
T. Morris},
booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
2010)},
year         = 2010,
month        = apr,
}

## 2009

Device transparency: a new model for mobile storage Jacob Strauss, Chris Lesniewski-Laas, Justin Mazzola Paluska, Bryan Ford, Robert Morris, and Frans Kaashoek. HotStorage 2009.

This paper proposes a new storage model, device transparency, in which users view and manage their entire data collection from any of their devices, even from disconnected storage-limited devices holding only a subset of the entire collection.

@inproceedings{eyo:hotstorage09,
title        = {Device Transparency: a New Model for Mobile Storage},
author       = {Jacob Strauss and Chris Lesniewski-Laas and Justin
Mazzola Paluska and Bryan Ford and Robert Morris and Frans
Kaashoek},
booktitle    = {Proceedings of the {W}orkshop on {H}ot {T}opics in
{S}torage and {F}ile {S}ystems ({HotStorage} '09)},
year         = 2009,
month        = oct,
}
Improving application security with data flow assertions Alexander Yip, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. SOSP 2009.

Resin is a new language runtime that helps prevent security vulnerabilities, by allowing programmers to specify application-level data flow assertions. Resin provides policy objects, which programmers use to specify assertion code and metadata; data tracking, which allows programmers to associate assertions with application data, and to keep track of assertions as the data flows through the application; and filter objects, which programmers use to define data flow boundaries at which assertions are checked. Resin's runtime checks data flow assertions by propagating policy objects along with data, as that data moves through the application, and then invoking filter objects when data crosses a data flow boundary, such as when writing data to the network or a file.

Using Resin, Web application programmers can prevent a range of problems, from SQL injection and cross-site scripting, to inadvertent password disclosure and missing access control checks. Adding a Resin assertion to an application requires few changes to the existing application code, and an assertion can reuse existing code and data structures. For instance, 23 lines of code detect and prevent three previously-unknown missing access control vulnerabilities in phpBB, a popular Web forum application. Other assertions comprising tens of lines of code prevent a range of vulnerabilities in Python and PHP applications. A prototype of Resin incurs a 33% CPU overhead running the HotCRP conference management application.
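The interplay of policy objects, data tracking, and filter objects can be sketched in a few lines of Python. All names below are hypothetical; Resin itself performs the tracking inside the language runtime, so application data structures need no such wrapper types.

```python
# Illustrative sketch only: a value carries policy objects, and a
# filter object at a data-flow boundary invokes each policy's check.

class PasswordPolicy:
    """Assert that a password only flows to its owner's session."""
    def __init__(self, owner):
        self.owner = owner

    def check(self, boundary):
        if boundary.user != self.owner:
            raise PermissionError("password leak blocked")

class Tainted(str):
    """A string that carries policy objects along with its value."""
    def __new__(cls, value, policies):
        s = super().__new__(cls, value)
        s.policies = list(policies)
        return s

class Boundary:
    """A filter object guarding a data-flow exit (network, file)."""
    def __init__(self, user):
        self.user = user

    def emit(self, data):
        for p in getattr(data, "policies", []):
            p.check(self)           # assertions run at the boundary
        return str(data)

pw = Tainted("hunter2", [PasswordPolicy(owner="alice")])
assert Boundary(user="alice").emit(pw) == "hunter2"
try:
    Boundary(user="mallory").emit(pw)   # flow to another user
except PermissionError:
    pass                                # leak is blocked
```

The point of the design is visible even in this toy: the assertion is written once, travels with the data, and is enforced wherever the data tries to leave, rather than at every call site.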

@inproceedings{resin:sosp09,
title        = {Improving Application Security with Data Flow
Assertions},
author       = {Alexander Yip and Xi Wang and Nickolai Zeldovich and
M. Frans Kaashoek},
booktitle    = {Proceedings of the 22nd {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '09)},
year         = 2009,
month        = oct,
}
Whanaungatanga: Sybil-proof routing with social networks Chris Lesniewski-Laas, and M. Frans Kaashoek. MIT CSAIL technical report, September 2009.
@techreport{whanau:tr09,
title        = {Whanaungatanga: {Sybil}-proof routing with social
networks},
author       = {Chris Lesniewski-Laas and M. Frans Kaashoek},
institution  = {{MIT} Computer Science and Artificial Intelligence
Laboratory},
number       = {MIT-CSAIL-TR-2009-045},
year         = 2009,
month        = sep,
}
Flexible, wide-area storage for distributed systems using semantic cues Jeremy Stribling. Ph.D. thesis, Massachusetts Institute of Technology, September 2009.
@phdthesis{strib-phd,
title        = {Flexible, Wide-Area Storage for Distributed Systems
Using Semantic Cues},
author       = {Jeremy Stribling},
school       = {Massachusetts Institute of Technology},
year         = 2009,
month        = sep,
}
Improving web site security with data flow management Alexander Yip. Ph.D. thesis, Massachusetts Institute of Technology, September 2009.
@phdthesis{yipal-phd,
title        = {Improving Web Site Security with Data Flow
Management},
author       = {Alexander Yip},
school       = {Massachusetts Institute of Technology},
year         = 2009,
month        = sep,
}
In defense of wireless carrier sense Micah Brodsky, and Robert Morris. SIGCOMM 2009.

Carrier sense is often used to regulate concurrency in wireless medium access control (MAC) protocols, balancing interference protection and spatial reuse. Carrier sense is known to be imperfect, and many improved techniques have been proposed. Is the search for a replacement justified? This paper presents a theoretical model for average case two-sender carrier sense based on radio propagation theory and Shannon capacity. Analysis using the model shows that carrier sense performance is surprisingly close to optimal for radios with adaptive bitrate. The model suggests that hidden and exposed terminals usually cause modest reductions in throughput rather than dramatic decreases. Finally, it is possible to choose a fixed sense threshold which performs well across a wide range of scenarios, in large part due to the role of the noise floor. Experimental results from an indoor 802.11 testbed support these claims.

@inproceedings{cs:sigcomm09,
title        = {In Defense of Wireless Carrier Sense},
author       = {Micah Brodsky and Robert Morris},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} '09 Conference},
year         = 2009,
month        = aug,
}
Reinventing scheduling for multicore systems Silas Boyd-Wickizer, Robert Morris, and M. Frans Kaashoek. HotOS 2009.
@inproceedings{o2:hotos09,
title        = {Reinventing Scheduling for Multicore Systems},
author       = {Silas Boyd-Wickizer and Robert Morris and M. Frans
Kaashoek},
booktitle    = {Proceedings of the 12th {W}orkshop on {H}ot {T}opics
in {O}perating {S}ystems ({HotOS-XII})},
year         = 2009,
month        = may,
}
Efficient file distribution in a flexible, wide-area file system Irene Zhang. Master's thesis, Massachusetts Institute of Technology, May 2009.

WheelFS is a wide-area distributed file system designed to help applications cope with the challenges of sharing data over the wide-area network. A wide range of applications can use WheelFS as a storage layer because applications can control various trade-offs in WheelFS, such as consistency versus availability, using semantic cues. One key feature that many applications require from any storage system is efficient file distribution. The storage system needs to be able to serve files quickly, even large or popular ones, and allow users and applications to quickly browse files. Wide-area links with high latency and low throughput make achieving these goals difficult for most distributed storage systems.

This thesis explores using prefetching, a traditional file system optimization technique, in wide-area file systems for more efficient file distribution. This thesis focuses on Tread, a prefetcher for WheelFS. Tread includes several types of prefetching to improve the performance of reading files and directories in WheelFS: read-ahead prefetching, whole file prefetching, directory prefetching and a prefetching optimization for WheelFS's built-in cooperative caching. To make the best use of scarce wide-area resources, Tread adaptively rate-limits prefetching and gives applications control over what and how prefetching is done using WheelFS's semantic cues.
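Read-ahead prefetching, the simplest of these techniques, can be sketched as follows. The cache, window size, and fetch interface here are illustrative, not Tread's actual code, which must also rate-limit to conserve wide-area bandwidth.

```python
from concurrent.futures import ThreadPoolExecutor

WINDOW = 4  # read-ahead depth; a real prefetcher would adapt this

class ReadAhead:
    """Toy read-ahead prefetcher: a read of block N also kicks off
    background fetches of blocks N+1..N+WINDOW, so that sequential
    readers find later blocks already cached or in flight."""

    def __init__(self, fetch_block):
        self.fetch = fetch_block          # e.g. a slow wide-area RPC
        self.cache = {}                   # blockno -> Future
        self.pool = ThreadPoolExecutor(max_workers=WINDOW)

    def read(self, blockno):
        # Submit the demanded block plus its read-ahead window;
        # blocks already requested are not fetched again.
        for b in range(blockno, blockno + WINDOW + 1):
            if b not in self.cache:
                self.cache[b] = self.pool.submit(self.fetch, b)
        return self.cache[blockno].result()

ra = ReadAhead(lambda b: b * 2)   # stand-in for fetching block data
assert ra.read(0) == 0            # demand fetch of block 0
assert 4 in ra.cache              # block 4 requested in the background
```

With a high-latency fetch function, the later sequential reads overlap their network round trips with the application's processing of earlier blocks, which is where the latency savings come from.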

Experiments show that Tread can reduce the time to read a 10MB file in WheelFS by 40% and the time to list a directory with 100 entries by more than 80%. In addition, experiments on PlanetLab show that using prefetching with cooperative caching to distribute a 10MB file to 270 clients reduces the average latency for each client to read the file by almost 45%.

@mastersthesis{wheelfs:irene-meng,
title        = {Efficient File Distribution in a Flexible, Wide-area
File System},
author       = {Irene Zhang},
school       = {Massachusetts Institute of Technology},
year         = 2009,
month        = may,
}
Securing wide-area storage in WheelFS Xavid Pretzer. Master's thesis, Massachusetts Institute of Technology, May 2009.
@mastersthesis{wheelfs:xavid-thesis,
title        = {Securing Wide-area Storage in {WheelFS}},
author       = {Xavid Pretzer},
school       = {Massachusetts Institute of Technology},
year         = 2009,
month        = may,
}
Flexible, wide-area storage for distributed systems with WheelFS Jeremy Stribling, Yair Sovran, Irene Zhang, Xavid Pretzer, Jinyang Li, M. Frans Kaashoek, and Robert Morris. NSDI 2009.

WheelFS is a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance. WheelFS takes the form of a distributed file system with a familiar POSIX interface. Its design allows applications to adjust the tradeoff between prompt visibility of updates from other sites and the ability for sites to operate independently despite failures and long delays. WheelFS allows these adjustments via semantic cues, which provide application control over consistency, failure handling, and file and replica placement.

WheelFS is implemented as a user-level file system and is deployed on PlanetLab and Emulab. Three applications (a distributed Web cache, an email service and large file distribution) demonstrate that WheelFS's file system interface simplifies construction of distributed applications by allowing reuse of existing software. These applications would perform poorly with the strict semantics implied by a traditional file system interface, but by providing cues to WheelFS they are able to achieve good performance. Measurements show that applications built on WheelFS deliver comparable performance to services such as CoralCDN and BitTorrent that use specialized wide-area storage systems.
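WheelFS expresses semantic cues as special components embedded in path names, so unmodified POSIX applications can pass them through. A toy parser in that spirit might look like the following; the leading-dot convention and cue names here are illustrative, not WheelFS's exact syntax.

```python
def parse_cues(path):
    """Split a path into plain components and semantic cues.
    Cue components start with '.' and may carry a value, e.g.
    '.MaxTime=200'. (A real implementation would distinguish
    cues from ordinary dotfiles.)"""
    cues, parts = {}, []
    for comp in path.split("/"):
        if comp.startswith("."):
            name, _, val = comp[1:].partition("=")
            cues[name] = val if val else True
        elif comp:
            parts.append(comp)
    return "/" + "/".join(parts), cues

path, cues = parse_cues(
    "/wfs/cache/.EventualConsistency/.MaxTime=200/img.jpg")
assert path == "/wfs/cache/img.jpg"
assert cues == {"EventualConsistency": True, "MaxTime": "200"}
```

Because the cues ride inside ordinary path strings, existing software such as a web cache or mail server can select relaxed consistency simply by being configured with a cue-bearing path.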

@inproceedings{wheelfs:nsdi09,
title        = {Flexible, Wide-Area Storage for Distributed Systems
with {WheelFS}},
author       = {Jeremy Stribling and Yair Sovran and Irene Zhang and
Xavid Pretzer and Jinyang Li and M. Frans Kaashoek and Robert
Morris},
booktitle    = {Proceedings of the 6th {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '09)},
year         = 2009,
month        = apr,
}
Privacy-preserving browser-side scripting with BFlow Alexander Yip, Neha Narula, Maxwell Krohn, and Robert Morris. EuroSys 2009.

Some web sites provide interactive extensions using browser scripts, often without inspecting the scripts to verify that they are benign and bug-free. Others handle users' confidential data and display it via the browser. Such new features contribute to the power of online services, but their combination would allow attackers to steal confidential data. This paper presents BFlow, a security system that uses information flow control to allow the combination while preventing attacks on data confidentiality.

BFlow allows untrusted JavaScript to compute with, render, and store confidential data, while preventing leaks of that data. BFlow tracks confidential data as it flows within the browser, between scripts on a page and between scripts and web servers. Using these observations and assistance from participating web servers, BFlow prevents scripts that have seen confidential data from leaking it, all without disrupting the JavaScript communication techniques used in complex web pages. To achieve these ends, BFlow augments browsers with a new "protection zone" abstraction.

We have implemented a BFlow browser reference monitor and server support. To evaluate BFlow's confidentiality protection and flexibility, we have built a BFlow-protected blog that supports Blogger's third party JavaScript extensions. BFlow is compatible with every legitimate Blogger extension that we have found, yet it prevents malicious extensions from leaking confidential data.

@inproceedings{bflow:eurosys09,
title        = {Privacy-Preserving Browser-Side Scripting With
{BFlow}},
author       = {Alexander Yip and Neha Narula and Maxwell Krohn and
Robert Morris},
booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
2009)},
year         = 2009,
month        = mar,
}
Ksplice: Automatic rebootless kernel updates Jeff Arnold, and M. Frans Kaashoek. EuroSys 2009.
@inproceedings{ksplice:eurosys,
title        = {Ksplice: Automatic rebootless kernel updates},
author       = {Jeff Arnold and M. Frans Kaashoek},
booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
2009)},
year         = 2009,
month        = mar,
}

## 2008

Corey: An operating system for many cores Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, and Zheng Zhang. OSDI 2008.
@inproceedings{corey:osdi08,
title        = {Corey: An Operating System for Many Cores},
author       = {Silas Boyd-Wickizer and Haibo Chen and Rong Chen and
Yandong Mao and Frans Kaashoek and Robert Morris and Aleksey
Pesterev and Lex Stein and Ming Wu and Yuehua Dai and Yang Zhang
and Zheng Zhang},
pages        = {43--57},
booktitle    = {Proceedings of the 8th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '08)},
year         = 2008,
month        = dec,
}
An extension-oriented compiler Russ Cox. Ph.D. thesis, Massachusetts Institute of Technology, September 2008.
@phdthesis{rsc-phd,
title        = {An Extension-Oriented Compiler},
author       = {Russ Cox},
school       = {Massachusetts Institute of Technology},
year         = 2008,
month        = sep,
}
Information flow control for secure web sites Maxwell Krohn. Ph.D. thesis, Massachusetts Institute of Technology, September 2008.
@phdthesis{krohn-phd,
title        = {Information Flow Control for Secure Web Sites},
author       = {Maxwell Krohn},
school       = {Massachusetts Institute of Technology},
year         = 2008,
month        = sep,
}
UIA: A global connectivity architecture for mobile personal devices Bryan Alexander Ford. Ph.D. thesis, Massachusetts Institute of Technology, September 2008.
@phdthesis{ford-phd,
title        = {{UIA}: A Global Connectivity Architecture for Mobile
Personal Devices},
author       = {Bryan Alexander Ford},
school       = {Massachusetts Institute of Technology},
year         = 2008,
month        = sep,
}
Vx32: Lightweight user-level sandboxing on the x86 Bryan Ford, and Russ Cox. USENIX 2008.

Code sandboxing is useful for many purposes, but most sandboxing techniques require kernel modifications, do not completely isolate guest code, or incur substantial performance costs. Vx32 is a multipurpose user-level sandbox that enables any application to load and safely execute one or more guest plug-ins, confining each guest to a system call API controlled by the host application and to a restricted memory region within the host’s address space. Vx32 runs guest code efficiently on several widespread operating systems without kernel extensions or special privileges; it protects the host program from both reads and writes by its guests; and it allows the host to restrict the instruction set available to guests. The key to vx32’s combination of portability, flexibility, and efficiency is its use of x86 segmentation hardware to sandbox the guest’s data accesses, along with a lightweight instruction translator to sandbox guest instructions.

We evaluate vx32 using microbenchmarks and whole system benchmarks, and we examine four applications based on vx32: an archival storage system, an extensible public-key infrastructure, an experimental user-level operating system running atop another host OS, and a Linux system call jail. The first three applications export custom APIs independent of the host OS to their guests, making their plug-ins binary-portable across host systems. Compute-intensive workloads for the first two applications exhibit between a 30% slowdown and a 30% speedup on vx32 relative to native execution; speedups result from vx32’s instruction translator improving the cache locality of guest code. The experimental user-level operating system allows the use of the guest OS’s applications alongside the host’s native applications and runs faster than whole-system virtual machine monitors such as VMware and QEMU. The Linux system call jail incurs up to 80% overhead but requires no kernel modifications and is delegation-based, avoiding concurrency vulnerabilities present in other interposition mechanisms.

@inproceedings{vx32:usenix08,
title        = {Vx32: Lightweight User-level Sandboxing on the x86},
author       = {Bryan Ford and Russ Cox},
booktitle    = {Proceedings of the 2008 USENIX Annual Technical
Conference (USENIX '08)},
year         = 2008,
month        = jun,
}
Storing and managing data in a distributed hash table Emil Sit. Ph.D. thesis, Massachusetts Institute of Technology, June 2008.
@phdthesis{pt:sit,
title        = {Storing and Managing Data in a Distributed Hash
Table},
author       = {Emil Sit},
school       = {Massachusetts Institute of Technology},
year         = 2008,
month        = jun,
}
Ksplice: An automatic system for rebootless kernel security updates Jeffrey Brian Arnold. Master's thesis, Massachusetts Institute of Technology, May 2008.

Ksplice allows system administrators to apply security patches to their operating system kernels without having to reboot. Based on a source code patch and the kernel source code to be patched, Ksplice applies the patch to the corresponding running kernel, without requiring work from a programmer. To be fully automatic, Ksplice's design is limited to patches that do not introduce semantic changes to data structures, but a study of all significant x86-32 Linux security patches from May 2005 to December 2007 finds that only eight patches of 50 make semantic changes. An evaluation with Debian and kernel.org Linux kernels shows that Ksplice can automatically apply the remaining 42 patches, which means that 84% of the Linux kernel vulnerabilities from this interval can be corrected by Ksplice without the need for rebooting.

@mastersthesis{ksplice:jbarnold-meng,
title        = {{Ksplice}: An Automatic System for Rebootless Kernel
Security Updates},
author       = {Jeffrey Brian Arnold},
school       = {Massachusetts Institute of Technology},
year         = 2008,
month        = may,
}
UsenetDHT: A low-overhead design for Usenet Emil Sit, Robert Morris, and M. Frans Kaashoek. NSDI 2008.

Usenet is a popular distributed messaging and file sharing service: servers in Usenet flood articles over an overlay network to fully replicate articles across all servers. However, replication of Usenet's full content requires that each server pay the cost of receiving (and storing) over 1 Tbyte/day. This paper presents the design and implementation of UsenetDHT, a Usenet system that allows a set of cooperating sites to keep a shared, distributed copy of Usenet articles. UsenetDHT consists of client-facing Usenet NNTP front-ends and a distributed hash table (DHT) that provides shared storage of articles across the wide area. This design allows participating sites to partition the storage burden, rather than replicating all Usenet articles at all sites.

UsenetDHT requires a DHT that maintains durability despite transient and permanent failures, and provides high storage performance. These goals can be difficult to provide simultaneously: even in the absence of failures, verifying adequate replication levels of large numbers of objects can be resource intensive, and interfere with normal operations. This paper introduces Passing Tone, a new replica maintenance algorithm for DHash that minimizes the impact of monitoring replication levels on memory and disk resources by operating with only pairwise communication. Passing Tone's implementation provides performance by using data structures that avoid disk accesses and enable batch operations.

Microbenchmarks over a local gigabit network demonstrate that the total system throughput scales linearly as servers are added, providing 5.7 Mbyte/s of write bandwidth and 7 Mbyte/s of read bandwidth per server. UsenetDHT is currently deployed on a 12-server network at 7 sites running Passing Tone over the wide-area: this network supports our research laboratory's live 2.5 Mbyte/s Usenet feed and 30.6 Mbyte/s of synthetic read traffic. These results suggest a DHT-based design may be a viable way to redesign Usenet and globally reduce costs.

@inproceedings{usenetdht:nsdi08,
title        = {{UsenetDHT}: A Low-Overhead Design for {Usenet}},
author       = {Emil Sit and Robert Morris and M. Frans Kaashoek},
booktitle    = {Proceedings of the 5th {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '08)},
year         = 2008,
month        = apr,
}
An offline foundation for online accountable pseudonyms Bryan Ford, and Jacob Strauss. SocialNets 2008.

Online anonymity often appears to undermine accountability, offering little incentive for civil behavior, but accountability failures usually result not from anonymity itself but from the disposability of virtual identities. A user banned for misbehavior---e.g., spamming from a free E-mail account or stuffing an online ballot box---can simply open other accounts or connect from other IP addresses. Instead of curtailing freedom of expression by giving up anonymity, online services and communities should support accountable pseudonyms: virtual personas that can provide both anonymity and accountability. We propose pseudonym parties, a scheme for creating accountable pseudonyms, which combines in-person social occasions (parties) with technical infrastructure (a pseudonymous sign-on service) to enforce the rule that one real person gets one virtual persona on any participating online service. Pseudonym parties enable the user to adopt different personas in different online spaces without revealing the connection between them, while ensuring that each user has only one accountable pseudonym in each space. Pseudonym parties can be started incrementally in a fully decentralized fashion, can run on volunteer labor with minimal funds, and may even be fun.

@inproceedings{accountable:socialnets08,
title        = {An Offline Foundation for Online Accountable
Pseudonyms},
author       = {Bryan Ford and Jacob Strauss},
booktitle    = {Proceedings of the First International Workshop on
Social Network Systems (SocialNets 2008)},
year         = 2008,
month        = apr,
}
A Sybil-proof one-hop DHT Chris Lesniewski-Laas. SocialNets 2008.

Decentralized systems, such as structured overlays, are subject to the Sybil attack, in which an adversary creates many false identities to increase its influence. This paper describes a one-hop distributed hash table which uses the social links between users to strongly resist the Sybil attack. The social network is assumed to be fast mixing, meaning that a random walk in the honest part of the network quickly approaches the uniform distribution. As in the related SybilLimit system, with a social network of n honest nodes and m honest edges, the protocol can tolerate up to o(n/log n) attack edges (social links from honest nodes to compromised nodes). The routing tables contain O(√m log m) entries per node and are constructed efficiently by a distributed protocol. This is the first sublinear solution to this problem. Preliminary simulation results are presented to demonstrate the approach's effectiveness.
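The fast-mixing assumption at the heart of this family of protocols can be illustrated with a small sketch. The graph and walk length below are invented; the point is only that short random walks from any start node end up distributed close to the walk's stationary distribution (degree-proportional for a simple walk), which is what lets nodes sample routing-table entries that a Sybil region reachable through only a few attack edges cannot bias much.

```python
import random

# Toy undirected social graph (invented), honest edges only.
graph = {
    "a": ["b", "c"], "b": ["a", "c", "d"],
    "c": ["a", "b", "d"], "d": ["b", "c"],
}

def walk_endpoint(start, length, rng):
    """Endpoint of a fixed-length simple random walk."""
    node = start
    for _ in range(length):
        node = rng.choice(graph[node])
    return node

# After a handful of steps on a fast-mixing graph, endpoints no
# longer depend much on the start node: every node is sampled with
# probability roughly proportional to its degree.
rng = random.Random(0)
counts = {v: 0 for v in graph}
for _ in range(10_000):
    counts[walk_endpoint("a", 10, rng)] += 1
assert all(c > 1_000 for c in counts.values())
```

An adversary with o(n/log n) attack edges is rarely entered by such walks, so only a small fraction of sampled table entries can be Sybil identities.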

@inproceedings{sybil:socialnets08,
title        = {A {Sybil}-proof one-hop {DHT}},
author       = {Chris Lesniewski-Laas},
booktitle    = {Proceedings of the First International Workshop on
Social Network Systems (SocialNets 2008)},
year         = 2008,
month        = apr,
}
Xoc, an extension-oriented compiler for systems programming Russ Cox, Tom Bergan, Austin T. Clements, Frans Kaashoek, and Eddie Kohler. ASPLOS 2008.

Today's system programmers go to great lengths to extend the languages in which they program. For instance, system-specific compilers find errors in Linux and other systems, and add support for specialized control flow to Qt and event-based programs. These compilers are difficult to build and cannot always understand each other's language changes. However, they can greatly improve code understandability and correctness, advantages that should be accessible to all programmers.

We describe an extension-oriented compiler for C called xoc. An extension-oriented compiler, unlike a conventional extensible compiler, implements new features via many small extensions that are loaded together as needed. Xoc gives extension writers full control over program syntax and semantics while hiding many compiler internals. Xoc programmers concisely define powerful compiler extensions that, by construction, can be combined; even some parts of the base compiler, such as GNU C compatibility, are structured as extensions.

Xoc is based on two key interfaces. Syntax patterns allow extension writers to manipulate language fragments using concrete syntax. Lazy computation of attributes allows extension writers to use the results of analyses by other extensions or the core without needing to worry about pass scheduling.

Extensions built using xoc include xsparse, a 345-line extension that mimics Sparse, Linux's C front end, and xlambda, a 170-line extension that adds function expressions to C. An evaluation of xoc using these and 13 other extensions shows that xoc extensions are typically more concise than equivalent extensions written for conventional extensible compilers and that it is possible to compose extensions.

@inproceedings{xoc:asplos08,
author       = {Russ Cox and Tom Bergan and Austin T. Clements and
Frans Kaashoek and Eddie Kohler},
title        = {Xoc, an Extension-Oriented Compiler for Systems
Programming},
pages        = {244--254},
booktitle    = {Proceedings of the 13th International Conference on
Architectural Support for Programming Languages and Operating
Systems ({ASPLOS})},
year         = 2008,
month        = mar,
}
A comparison of designs for extensible and extension-oriented compilers Austin T. Clements. Master's thesis, Massachusetts Institute of Technology, February 2008.

Today's system programmers go to great lengths to extend the languages in which they program. For instance, system-specific compilers find errors in Linux and other systems, and add support for specialized control flow to Qt and event-based programs. These compilers are difficult to build and cannot always understand each other's language changes. However, they can greatly improve code understandability and correctness, advantages that should be accessible to all programmers.

This thesis considers four extensible and extension-oriented compilers: CIL, Polyglot, xtc, and Xoc. These four compilers represent four distinctly different approaches to the problem of bridging the gap between language design and system implementation. Taking an extension author's point of view, this thesis compares the design of each compiler's extension interface in terms of extension structure, syntactic analysis, semantic analysis, and rewriting.

To perform the comparison, this thesis uses three extensions implemented variously in the four compilers: a bitwise rotation operator, function expressions, and lock checking. These extensions are designed to span a representative space of analysis and rewriting needs.

Based on this comparison, this thesis identifies the following implications of the design decisions of each extension interface: the expressiveness, understandability, and correctness of extension implementations can benefit from domain specific languages and language features tailored to the extension interface; compiler-managed scheduling trades loss of control for automatic extension composability; unifying internal and external program representation improves ease of use and extension composability, but gives up potentially useful control over the internal representation; concrete syntax patterns provide a natural interface to internal program representation, but must be more powerful than simple tree matching to be practical; grammars, types, and syntax interfaces have a natural correspondence; and accounting for semantic information in the final output enables hygienic rewriting, which can simplify extensions.

@mastersthesis{xoc:aclements-meng,
title        = {A Comparison of Designs for Extensible and
Extension-Oriented Compilers},
author       = {Austin T. Clements},
school       = {Massachusetts Institute of Technology},
year         = 2008,
month        = feb,
}

## 2007

A World Wide Web without walls Maxwell Krohn, Alex Yip, Micah Brodsky, Robert Morris, and Michael Walfish. HotNets 2007.
@inproceedings{w5:hotnets07,
author       = {Maxwell Krohn and Alex Yip and Micah Brodsky and
Robert Morris and Michael Walfish},
title        = {A {World Wide Web} Without Walls},
booktitle    = {Proceedings of the Sixth {W}orkshop on {H}ot
{T}opics in {N}etworks ({HotNets-VI})},
year         = 2007,
month        = nov,
organization = {{ACM SIGCOMM}},
}
Alpaca: Extensible authorization for distributed services Chris Lesniewski-Laas, Bryan Ford, Jacob Strauss, Robert Morris, and M. Frans Kaashoek. CCS 2007.

Traditional Public Key Infrastructures (PKI) have not lived up to their promise because there are too many ways to define PKIs, too many cryptographic primitives to build them with, and too many administrative domains with incompatible roots of trust. Alpaca is an authentication and authorization framework that embraces PKI diversity by enabling one PKI to "plug in" another PKI's credentials and cryptographic algorithms, allowing users of the latter to authenticate themselves to services using the former with their existing, unmodified certificates. Alpaca builds on Proof-Carrying Authorization (PCA), expressing a credential as an explicit proof of a logical claim. Alpaca generalizes PCA to express not only delegation policies but also the cryptographic primitives, credential formats, and namespace structures needed to use foreign credentials directly. To achieve this goal, Alpaca introduces a method of creating and naming new principals which behave according to arbitrary rules, a modular approach to logical axioms, and a domain-specific language specialized for reasoning about authentication. We have implemented Alpaca as a Python module that assists applications in generating proofs (e.g., in a client requesting access to a resource), and in verifying those proofs via a compact 800-line TCB (e.g., in a server providing that resource). We present examples demonstrating Alpaca's extensibility in scenarios involving inter-organization PKI interoperability and secure remote PKI upgrade.

@inproceedings{alpaca:ccs07,
title        = {Alpaca: Extensible Authorization for Distributed
Services},
author       = {Chris Lesniewski-Laas and Bryan Ford and Jacob
Strauss and Robert Morris and M. Frans Kaashoek},
booktitle    = {Proceedings of the 14th {ACM} Conference on Computer
and Communications Security ({CCS-2007})},
year         = 2007,
month        = oct,
}
Information flow control for standard OS abstractions Maxwell Krohn, Alexander Yip, Micah Brodsky, Natan Cliffer, M. Frans Kaashoek, Eddie Kohler, and Robert Morris. SOSP 2007.
@inproceedings{KYBCKMM07,
author       = {Maxwell Krohn and Alexander Yip and Micah Brodsky
and Natan Cliffer and M. Frans Kaashoek and Eddie Kohler and
Robert Morris},
title        = {Information Flow Control for Standard {OS}
Abstractions},
booktitle    = {Proceedings of the 21st {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '07)},
year         = 2007,
month        = oct,
}
Structured streams: a new transport abstraction Bryan Ford. SIGCOMM 2007.

Internet applications currently have a choice between stream and datagram transport abstractions. Datagrams efficiently support small transactions and streams are suited for long-running conversations, but neither abstraction adequately supports applications like HTTP that exhibit a mixture of transaction sizes, or applications like FTP and SIP that use multiple transport instances. Structured Stream Transport (SST) enhances the traditional stream abstraction with a hierarchical hereditary structure, allowing applications to create lightweight child streams from any existing stream. Unlike TCP streams, these lightweight streams incur neither 3-way handshaking delays on startup nor TIME-WAIT periods on close. Each stream offers independent data transfer and flow control, allowing different transactions to proceed in parallel without head-of-line blocking, but all streams share one congestion control context. SST supports both reliable and best-effort delivery in a way that semantically unifies datagrams with streams and solves the classic “large datagram” problem, where a datagram's loss probability increases exponentially with fragment count. Finally, an application can prioritize its streams relative to each other and adjust priorities dynamically through out-of-band signaling. A user-space prototype shows that SST is TCP-friendly to within 2%, and performs comparably to a user-space TCP and to within 10% of kernel TCP on a WiFi network.
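The stream hierarchy can be sketched in a few lines. This is a toy model, not SST's real API: the names `Stream`, `CongestionContext`, and `create_child` are invented here to illustrate that child streams are cheap to create and that every stream in a session shares one congestion-control context.

```python
# Toy model of SST's hierarchical streams (hypothetical names, not the
# real protocol): children attach to an existing stream and share the
# session-wide congestion context instead of handshaking from scratch.
class CongestionContext:
    """Shared pacing state for all streams between one pair of hosts."""
    def __init__(self):
        self.cwnd = 10  # congestion window in packets (illustrative)

class Stream:
    def __init__(self, session_ctx, parent=None):
        self.ctx = session_ctx   # shared, not per-stream
        self.parent = parent
        self.children = []

    def create_child(self):
        # Lightweight: no 3-way handshake or TIME-WAIT is modeled; a
        # child simply joins the parent's shared congestion context.
        child = Stream(self.ctx, parent=self)
        self.children.append(child)
        return child

# One root stream per session; each transaction gets its own child, so
# a loss in one transfer never head-of-line blocks another.
root = Stream(CongestionContext())
rpc = root.create_child()
upload = root.create_child()
assert rpc.ctx is upload.ctx is root.ctx   # single congestion context
```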

@inproceedings{sst:sigcomm07,
title        = {Structured Streams: a New Transport Abstraction},
author       = {Bryan Ford},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} '07 Conference},
year         = 2007,
month        = aug,
}
Events can make sense Maxwell Krohn, Eddie Kohler, and M. Frans Kaashoek. USENIX 2007.
@inproceedings{krohn:tame,
author       = {Maxwell Krohn and Eddie Kohler and M. Frans Kaashoek},
title        = {Events Can Make Sense},
booktitle    = {Proceedings of the 2007 USENIX Annual Technical
Conference (USENIX '07)},
year         = 2007,
month        = jun,
}
Don't give up on distributed file systems Jeremy Stribling, Emil Sit, M. Frans Kaashoek, Jinyang Li, and Robert Morris. IPTPS 2007.

Wide-area distributed applications often reinvent the wheel for their storage needs, each incorporating its own special-purpose storage manager to cope with distribution, intermittent failures, limited bandwidth, and high latencies. This paper argues that a distributed file system could provide a reusable solution to these problems by coupling a standard interface with a design suited to wide-area distribution. For concreteness, this paper presents such a file system, called WheelFS, which allows applications to control consistency through the use of semantic cues, and minimizes communication costs by adhering to the slogan read globally, write locally. WheelFS could simplify distributed experiments, CDNs, and Grid applications.

@inproceedings{wheelfs:iptps07,
title        = {Don't Give Up on Distributed File Systems},
author       = {Jeremy Stribling and Emil Sit and M. Frans Kaashoek
and Jinyang Li and Robert Morris},
booktitle    = {Proceedings of the 6th International Workshop on
Peer-to-Peer Systems (IPTPS07)},
year         = 2007,
month        = feb,
}

## 2006

Persistent personal names for globally connected mobile devices Bryan Ford, Jacob Strauss, Chris Lesniewski-Laas, Sean Rhea, Frans Kaashoek, and Robert Morris. OSDI 2006.

The Unmanaged Internet Architecture (UIA) provides zero-configuration connectivity among mobile devices through personal names. Users assign personal names through an ad hoc device introduction process requiring no central allocation. Once assigned, names bind securely to the global identities of their target devices independent of network location. Each user manages one namespace, shared among all the user's devices and always available on each device. Users can also name other users to share resources with trusted acquaintances. Devices with naming relationships automatically arrange connectivity when possible, both in ad hoc networks and using global infrastructure when available. A UIA prototype demonstrates these capabilities using optimistic replication for name resolution and group management and a routing algorithm exploiting the user's social network for connectivity.

Demo Video:

• AVI: 20MB, H.264 codec; playable with VLC or MPlayer for example.
• DVD: 262MB, MPEG-2; burn this ISO image onto a DVD and pop it into a DVD player.
@inproceedings{uia:osdi06,
title        = {Persistent Personal Names for Globally Connected
Mobile Devices},
author       = {Bryan Ford and Jacob Strauss and Chris
Lesniewski-Laas and Sean Rhea and Frans Kaashoek and Robert Morris},
booktitle    = {Proceedings of the 7th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '06)},
year         = 2006,
month        = nov,
}
Efficient replica maintenance for distributed storage systems Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, Frans Kaashoek, John Kubiatowicz, and Robert Morris. NSDI 2006.

This paper considers replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet. Maintaining replication in such systems can be prohibitively expensive, since every transient network or host failure could potentially lead to copying a server's worth of data over the Internet to maintain replication levels.

The following insights in designing an efficient replication algorithm emerge from the paper's analysis. First, durability can be provided separately from availability; the former is less expensive to ensure and a more useful goal for many wide-area applications. Second, the focus of a durability algorithm must be to create new copies of data objects faster than permanent disk failures destroy the objects; careful choice of policies for what nodes should hold what data can decrease repair time. Third, increasing the number of replicas of each data object does not help a system tolerate a higher disk failure probability, but does help tolerate bursts of failures. Finally, ensuring that the system makes use of replicas that recover after temporary failure is critical to efficiency.

Based on these insights, the paper proposes the Carbonite replication algorithm for keeping data durable at a low cost. A simulation of Carbonite storing 1 TB of data over a 365 day trace of PlanetLab activity shows that Carbonite is able to keep all data durable and uses 44% more network traffic than a hypothetical system that only responds to permanent failures. In comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.
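A back-of-the-envelope simulation of the second and fourth insights, under assumptions of my own (uniform per-step failure probabilities, a made-up `maintain` loop; this is not the Carbonite implementation): repair only when the count of *reachable* replicas falls below the target, and let replicas that return from transient failures count again rather than copying anew.

```python
# Toy replica-maintenance loop (illustrative, not Carbonite itself):
# transient failures hide a replica but do not destroy it, so a
# recovered node rejoins with its data and avoids a repair copy.
import random

def maintain(replicas, target, p_transient, p_permanent, steps, seed=0):
    """replicas: set of node ids holding a copy. Returns copies made."""
    rng = random.Random(seed)
    offline, copies_made = set(), 0
    next_id = max(replicas) + 1
    for _ in range(steps):
        for n in list(replicas):
            r = rng.random()
            if r < p_permanent:                  # disk dies: copy lost
                replicas.discard(n); offline.discard(n)
            elif r < p_permanent + p_transient:  # unreachable, intact
                offline.add(n)
        for n in list(offline):                  # some nodes recover
            if rng.random() < 0.5:
                offline.discard(n)
        if len(replicas - offline) < target:     # repair real deficit
            replicas.add(next_id); next_id += 1; copies_made += 1
    return copies_made

# With no failures at all, no repair traffic is ever generated.
assert maintain({0, 1, 2}, target=3, p_transient=0.0,
                p_permanent=0.0, steps=100) == 0
```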

@inproceedings{carbonite:nsdi06,
title        = {Efficient Replica Maintenance for Distributed
Storage Systems},
author       = {Byung-Gon Chun and Frank Dabek and Andreas Haeberlen
and Emil Sit and Hakim Weatherspoon and Frans Kaashoek and John
Kubiatowicz and Robert Morris},
booktitle    = {Proceedings of the 3rd {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '06)},
year         = 2006,
month        = may,
}
OverCite: A distributed, cooperative CiteSeer Jeremy Stribling, Jinyang Li, Isaac G. Councill, M. Frans Kaashoek, and Robert Morris. NSDI 2006.

CiteSeer is a popular online resource for the computer science research community, allowing users to search and browse a large archive of research papers. CiteSeer is expensive: it generates 35 GB of network traffic per day, requires nearly one terabyte of disk storage, and needs significant human maintenance.

OverCite is a new digital research library system that aggregates donated resources at multiple sites to provide CiteSeer-like document search and retrieval. OverCite enables members of the community to share the costs of running CiteSeer. The challenge facing OverCite is how to provide scalable and load-balanced storage and query processing with automatic data management. OverCite uses a three-tier design: presentation servers provide an identical user interface to CiteSeer's; application servers partition and replicate a search index to spread the work of answering each query among several nodes; and a distributed hash table stores documents and meta-data, and coordinates the activities of the servers.

Evaluation of a prototype shows that OverCite increases its query throughput by a factor of seven with a nine-fold increase in the number of servers. OverCite requires more total storage and network bandwidth than centralized CiteSeer, but spreads these costs over all the sites. OverCite can exploit the resources of these sites to support new features such as document alerts and to scale to larger data sets.

@inproceedings{overcite:nsdi06,
title        = {OverCite: A Distributed, Cooperative CiteSeer},
author       = {Jeremy Stribling and Jinyang Li and Isaac G.
Councill and M. Frans Kaashoek and Robert Morris},
booktitle    = {Proceedings of the 3rd {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '06)},
year         = 2006,
month        = may,
}
Pastwatch: A distributed version control system Alexander Yip, Benjie Chen, and Robert Morris. NSDI 2006.
@inproceedings{pastwatch:nsdi06,
title        = {Pastwatch: A Distributed Version Control System},
author       = {Alexander Yip and Benjie Chen and Robert Morris},
booktitle    = {Proceedings of the 3rd {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '06)},
year         = 2006,
month        = may,
}
Asbestos: Operating system security for mobile devices Martijn Stevenson. Master's thesis, Massachusetts Institute of Technology, May 2006.
@mastersthesis{armasb:martijn-meng,
title        = {Asbestos: Operating System Security for Mobile
Devices},
author       = {Martijn Stevenson},
school       = {Massachusetts Institute of Technology},
year         = 2006,
month        = may,
}
Proactive replication for data durability Emil Sit, Andreas Haeberlen, Frank Dabek, Byung-Gon Chun, Hakim Weatherspoon, Frans Kaashoek, Robert Morris, and John Kubiatowicz. IPTPS 2006.
@inproceedings{tempo:iptps06,
title        = {Proactive replication for data durability},
author       = {Emil Sit and Andreas Haeberlen and Frank Dabek and
Byung-Gon Chun and Hakim Weatherspoon and Frans Kaashoek and
Robert Morris and John Kubiatowicz},
booktitle    = {Proceedings of the 5th International Workshop on
Peer-to-Peer Systems (IPTPS06)},
year         = 2006,
month        = feb,
}
User-relative names for globally connected personal devices Bryan Ford, Jacob Strauss, Chris Lesniewski-Laas, Sean Rhea, Frans Kaashoek, and Robert Morris. IPTPS 2006.

Nontechnical users who own increasingly ubiquitous network-enabled personal devices such as laptops, digital cameras, and smart phones need a simple, intuitive, and secure way to share information and services between their devices. The Unmanaged Internet Architecture, or UIA, is a novel naming and peer-to-peer connectivity architecture addressing this need. Users assign UIA names by "introducing" devices to each other on a common local-area network, but these names remain securely bound to their target as devices migrate. Multiple devices owned by the same user, once introduced, automatically merge their namespaces to form a distributed personal cluster that the owner can access or modify from any of his devices. Instead of requiring users to allocate globally unique names from a central authority, UIA enables users to assign their own user-relative names both to their own devices and to other users. With UIA, for example, Alice can always access her iPod from any of her own personal devices at any location via the name ipod, and her friend Bob can access her iPod via a relative name like ipod.Alice.
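The ipod.Alice example can be sketched as a lookup over linked per-user namespaces. Everything here is hypothetical (the dict-based namespaces and `eid:` labels are invented for illustration; UIA's real names bind to cryptographic endpoint identities): a label maps either to a device identity or to another user's namespace, and a dotted name is resolved right to left.

```python
# Hypothetical sketch of user-relative name resolution: Bob resolves
# "ipod.Alice" by following his "Alice" link to Alice's namespace,
# then looking up "ipod" there.
def resolve(namespace, dotted_name):
    # "ipod.Alice" reads right-to-left: first Alice, then ipod
    for label in reversed(dotted_name.split('.')):
        entry = namespace[label]
        if isinstance(entry, dict):   # link to another user's namespace
            namespace = entry
        else:
            return entry              # a device's identity (made-up tag)
    raise KeyError(dotted_name)

alice_ns = {'ipod': 'eid:alice-ipod'}                 # Alice's devices
bob_ns = {'laptop': 'eid:bob-laptop', 'Alice': alice_ns}

assert resolve(bob_ns, 'ipod.Alice') == 'eid:alice-ipod'
assert resolve(bob_ns, 'laptop') == 'eid:bob-laptop'
```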

@inproceedings{uia:iptps06,
title        = {User-Relative Names for Globally Connected Personal
Devices},
author       = {Bryan Ford and Jacob Strauss and Chris
Lesniewski-Laas and Sean Rhea and Frans Kaashoek and Robert Morris},
booktitle    = {Proceedings of the 5th International Workshop on
Peer-to-Peer Systems (IPTPS06)},
year         = 2006,
month        = feb,
}

## 2005

VXA: A virtual architecture for durable compressed archives Bryan Ford. FAST 2005.

Data compression algorithms change frequently, and obsolete decoders do not always run on new hardware and operating systems, threatening the long-term usability of content archived using those algorithms. Re-encoding content into new formats is cumbersome, and highly undesirable when lossy compression is involved. Processor architectures, in contrast, have remained comparatively stable over recent decades. VXA, an archival storage system designed around this observation, archives executable decoders along with the encoded content it stores. VXA decoders run in a specialized virtual machine that implements an OS-independent execution environment based on the standard x86 architecture. The VXA virtual machine strictly limits access to host system services, making decoders safe to run even if an archive contains malicious code. VXA's adoption of a "native" processor architecture instead of type-safe language technology allows reuse of existing "hand-optimized" decoders in C and assembly language, and permits decoders access to performance-enhancing architecture features such as vector processing instructions. The performance cost of VXA's virtualization is typically less than 15% compared with the same decoders running natively. The storage cost of archived decoders, typically 30-130KB each, can be amortized across many archived files sharing the same compression method.

@inproceedings{vxa:fast05,
title        = {{VXA}: A Virtual Architecture for Durable Compressed
Archives},
author       = {Bryan Ford},
booktitle    = {Proceedings of the 4th USENIX Conference on File and
Storage Technologies (FAST '05)},
year         = 2005,
month        = dec,
}
Routing tradeoffs in dynamic peer-to-peer networks Jinyang Li. Ph.D. thesis, Massachusetts Institute of Technology, November 2005.
@phdthesis{accordion:jinyang,
title        = {Routing Tradeoffs in Dynamic Peer-to-peer Networks},
author       = {Jinyang Li},
school       = {Massachusetts Institute of Technology},
year         = 2005,
month        = nov,
}
A distributed hash table Frank Dabek. Ph.D. thesis, Massachusetts Institute of Technology, November 2005.
@phdthesis{dht:dabek,
title        = {A Distributed Hash Table},
author       = {Frank Dabek},
school       = {Massachusetts Institute of Technology},
year         = 2005,
month        = nov,
}
Labels and event processes in the Asbestos operating system Petros Efstathopoulos, Maxwell Krohn, Steve VanDeBogart, Cliff Frey, David Ziegler, Eddie Kohler, David Mazières, Frans Kaashoek, and Robert Morris. SOSP 2005.
@inproceedings{EKVFZKMKM05,
author       = {Petros Efstathopoulos and Maxwell Krohn and Steve
VanDeBogart and Cliff Frey and David Ziegler and Eddie Kohler and
David Mazi{\`e}res and Frans Kaashoek and Robert Morris},
title        = {Labels and Event Processes in the {Asbestos}
Operating System},
booktitle    = {Proceedings of the 20th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '05)},
year         = 2005,
month        = oct,
}
Sybil-resistant DHT routing George Danezis, Chris Lesniewski-Laas, M. Frans Kaashoek, and Ross Anderson. ESORICS 2005.

Distributed Hash Tables (DHTs) are very efficient distributed systems for routing, but at the same time vulnerable to disruptive nodes. Designers of such systems want them used in open networks, where an adversary can perform a sybil attack by introducing a large number of corrupt nodes in the network, considerably degrading its performance. We introduce a routing strategy that alleviates some of the effects of such an attack by making sure that lookups are performed using a diverse set of nodes. This ensures that at least some of the nodes queried are good, and hence the search makes forward progress. This strategy makes use of latent social information present in the introduction graph of the network.

@inproceedings{sybil:esorics05,
title        = {Sybil-resistant {DHT} routing},
author       = {George Danezis and Chris Lesniewski-Laas and M.
Frans Kaashoek and Ross Anderson},
booktitle    = {Proceedings of the 10th European Symposium On
Research In Computer Security},
year         = 2005,
month        = sep,
}
Integrity and access control in untrusted content distribution networks Kevin Fu. Ph.D. thesis, Massachusetts Institute of Technology, September 2005.
@phdthesis{fu-phd,
title        = {Integrity and access control in untrusted content
distribution networks},
author       = {Kevin Fu},
school       = {Massachusetts Institute of Technology},
year         = 2005,
month        = sep,
}
SoftECC: A system for software memory integrity checking Dave Dopson. Master's thesis, Massachusetts Institute of Technology, September 2005.
@mastersthesis{softecc:ddopson-meng,
title        = {{SoftECC}: A System for Software Memory Integrity
Checking},
author       = {Dave Dopson},
school       = {Massachusetts Institute of Technology},
year         = 2005,
month        = sep,
}
OverCite: A cooperative digital research library Jeremy Stribling. Master's thesis, Massachusetts Institute of Technology, September 2005.

CiteSeer is a well-known online resource for the computer science research community, allowing users to search and browse a large archive of research papers. Unfortunately, its current centralized incarnation is costly to run. Although members of the community would presumably be willing to donate hardware and bandwidth at their own sites to assist CiteSeer, the current architecture does not facilitate such distribution of resources.

OverCite is a design for a new architecture for a distributed and cooperative research library based on a distributed hash table (DHT). The new architecture harnesses donated resources at many sites to provide document search and retrieval service to researchers worldwide. A preliminary evaluation of an initial OverCite prototype shows that it can service more queries per second than a centralized system, and that it increases total storage capacity by a factor of n/4 in a system of n nodes. OverCite can exploit these additional resources by supporting new features such as document alerts, and by scaling to larger data sets.

@mastersthesis{overcite:strib-ms,
title        = {OverCite: A Cooperative Digital Research Library},
author       = {Jeremy Stribling},
school       = {Massachusetts Institute of Technology},
year         = 2005,
month        = sep,
}
Building fast and secure web services with OKWS Maxwell Krohn. Master's thesis, Massachusetts Institute of Technology, September 2005.
@mastersthesis{okws:krohn-ms,
author       = {Maxwell Krohn},
title        = {Building Fast and Secure Web Services With {OKWS}},
school       = {Massachusetts Institute of Technology},
year         = 2005,
month        = sep,
}
SSL splitting: securely serving data from untrusted caches Chris Lesniewski-Laas and M. Frans Kaashoek. Computer Networks 48(5), August 2005.

A popular technique for reducing the bandwidth load on Web servers is to serve the content from proxies. Typically these hosts are trusted by the clients and server not to modify the data that they proxy. SSL splitting is a new technique for guaranteeing the integrity of data served from proxies without requiring changes to Web clients. Instead of relaying an insecure HTTP connection, an SSL splitting proxy simulates a normal Secure Sockets Layer (SSL) connection with the client by merging authentication records from the server with data records from a cache. This technique reduces the bandwidth load on the server, while allowing an unmodified Web browser to verify that the data served from proxies is endorsed by the originating server. SSL splitting is implemented as a patch to the industry-standard OpenSSL library, with which the server is linked. In experiments replaying two-hour access.log traces taken from LCS Web sites over an ADSL link, SSL splitting reduces bandwidth consumption of the server by between 25% and 90% depending on the warmth of the cache and the redundancy of the trace. Uncached requests forwarded through the proxy exhibit latencies within approximately 5% of those of an unmodified SSL server.
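The record-merging idea above can be illustrated with a plain MAC, as a deliberately simplified sketch (real SSL splitting operates on TLS record-layer framing, which is not modeled here, and the function names are invented): the server ships only authenticators, the proxy supplies the cached payload, and the client's verification is unchanged.

```python
# Simplified model of SSL splitting (hypothetical helpers, not the
# TLS record format): server sends a MAC, proxy merges it with the
# cached payload, client verifies as if the server had sent both.
import hashlib
import hmac

KEY = b'session-mac-key'   # known to client and server, not the proxy

def server_side(payload):
    # server transmits only this small authenticator to the proxy
    return hmac.new(KEY, payload, hashlib.sha256).digest()

def proxy_merge(mac, cached_payload):
    # proxy contributes the bulky bytes from its cache
    return cached_payload, mac

def client_verify(payload, mac):
    # unmodified client: a standard MAC check over the merged record
    expected = hmac.new(KEY, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, mac)

data = b'<html>cached page</html>'
payload, mac = proxy_merge(server_side(data), data)
assert client_verify(payload, mac)                 # honest cache passes
assert not client_verify(b'tampered', mac)         # tampering detected
```

This is why the server's bandwidth drops by the size of the payloads while integrity is preserved: only the authenticators, not the data, cross the server's link for cache hits.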

@article{ssl-splitting:compnet05,
title        = {{SSL} splitting: securely serving data from
untrusted caches},
author       = {Chris Lesniewski-Laas and M. Frans Kaashoek},
journal      = {Computer Networks},
publisher    = {Elsevier},
volume       = 48,
number       = 5,
pages        = {763--779},
year         = 2005,
month        = aug,
}
Architecture and evaluation of an unplanned 802.11b mesh network John Bicket, Daniel Aguayo, Sanjit Biswas, and Robert Morris. MobiCom 2005.

This paper evaluates the ability of a wireless mesh architecture to provide high performance Internet access while demanding little deployment planning or operational management. The architecture considered in this paper has unplanned node placement (rather than planned topology), omni-directional antennas (rather than directional links), and multi-hop routing (rather than single-hop base stations). These design decisions contribute to ease of deployment, an important requirement for community wireless networks. However, this architecture carries the risk that lack of planning might render the network's performance unusably low.

For example, it might be necessary to place nodes carefully to ensure connectivity; the omni-directional antennas might provide uselessly short radio ranges; or the inefficiency of multi-hop forwarding might leave some users effectively disconnected.

The paper evaluates this unplanned mesh architecture with a case study of the Roofnet 802.11b mesh network. Roofnet consists of 37 nodes spread over four square kilometers of an urban area. The network provides users with usable performance despite lack of planning: the average inter-node throughput is 627 kbits/second, even though the average route has three hops.

The paper evaluates multiple aspects of the architecture: the effect of node density on connectivity and throughput; the characteristics of the links that the routing protocol elects to use; the usefulness of the highly connected mesh afforded by omni-directional antennas for robustness and throughput; and the potential performance of a single-hop network using the same nodes as Roofnet.

@inproceedings{roofnet:mobicom05,
title        = {Architecture and Evaluation of an Unplanned 802.11b
Mesh Network},
author       = {John Bicket and Daniel Aguayo and Sanjit Biswas and
Robert Morris},
booktitle    = {Proceedings of the 11th {ACM} International
Conference on Mobile Computing and Networking ({MobiCom} '05)},
year         = 2005,
month        = aug,
}
Opportunistic routing in multi-hop wireless networks Sanjit Biswas and Robert Morris. SIGCOMM 2005.

This paper describes ExOR, an integrated routing and MAC protocol that increases the throughput of large unicast transfers in multi-hop wireless networks. ExOR chooses each hop of a packet's route after the transmission for that hop, so that the choice can reflect which intermediate nodes actually received the transmission. This deferred choice gives each transmission multiple opportunities to make progress. As a result ExOR can use long radio links with high loss rates, which would be avoided by traditional routing. ExOR increases a connection's throughput while using no more network capacity than traditional routing.

ExOR's design faces the following challenges. The nodes that receive each packet must agree on their identities and choose one forwarder. The agreement protocol must have low overhead, but must also be robust enough that it rarely forwards a packet zero times or more than once. Finally, ExOR must choose the forwarder with the lowest remaining cost to the ultimate destination.

Measurements of an implementation on a 38-node 802.11b test-bed show that ExOR increases throughput for most node pairs when compared with traditional routing. For pairs between which traditional routing uses one or two hops, ExOR's robust acknowledgments prevent unnecessary retransmissions, increasing throughput by nearly 35%. For more distant pairs, ExOR takes advantage of the choice of forwarders to provide throughput gains of a factor of two to four.
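The deferred forwarder choice reduces, in the simplest case, to picking the actual receiver with the least remaining cost to the destination. The sketch below is a toy (it ignores ExOR's batching and agreement protocol, and the ETX-like costs are made up) but shows why long lossy links become useful: whoever happens to receive the packet furthest along gets to forward it.

```python
# Toy version of ExOR's per-packet forwarder selection: among the
# nodes that actually decoded the broadcast, the one with the lowest
# remaining cost to the destination forwards. Costs are hypothetical
# ETX-style estimates; the real agreement protocol is not modeled.
def pick_forwarder(receivers, cost_to_dst):
    """receivers: nodes that got the packet; returns the forwarder."""
    got = [n for n in receivers if n in cost_to_dst]
    return min(got, key=lambda n: cost_to_dst[n]) if got else None

cost = {'A': 3.0, 'B': 1.4, 'C': 2.2}   # remaining cost to destination

# If the lucky long-distance reception at B succeeds, B forwards;
# otherwise the best of the closer receivers makes progress instead.
assert pick_forwarder({'A', 'B', 'C'}, cost) == 'B'
assert pick_forwarder({'A', 'C'}, cost) == 'C'
```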

@inproceedings{roofnet:exor-sigcomm05,
title        = {Opportunistic Routing in Multi-Hop Wireless Networks},
author       = {Sanjit Biswas and Robert Morris},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} '05 Conference},
year         = 2005,
month        = aug,
}
Make least privilege a right (not a privilege) Maxwell Krohn, Petros Efstathopoulos, Cliff Frey, Frans Kaashoek, Eddie Kohler, David Mazières, Robert Morris, Michelle Osborne, Steve VanDeBogart, and David Ziegler. HotOS 2005.
@inproceedings{KEFKKMMOVZ05,
author       = {Maxwell Krohn and Petros Efstathopoulos and Cliff
Frey and Frans Kaashoek and Eddie Kohler and David Mazi\eres and
Robert Morris and Michelle Osborne and Steve VanDeBogart and David
Ziegler},
title        = {Make Least Privilege a Right (Not a Privilege)},
booktitle    = {Proceedings of the 10th {W}orkshop on {H}ot {T}opics
in {O}perating {S}ystems ({HotOS-X})},
year         = 2005,
month        = jun,
}
Bandwidth-efficient management of DHT routing tables Jinyang Li, Jeremy Stribling, Robert Morris, and M. Frans Kaashoek. NSDI 2005.

Today an application developer using a distributed hash table (DHT) with n nodes must choose a DHT protocol from the spectrum between O(1) lookup protocols and O(log n) protocols. O(1) protocols achieve low latency lookups on small or low-churn networks because lookups take only a few hops, but incur high maintenance traffic on large or high-churn networks. O(log n) protocols incur less maintenance traffic on large or high-churn networks but require more lookup hops in small networks. Accordion is a new routing protocol that does not force the developer to make this choice: Accordion adjusts itself to provide the best performance across a range of network sizes and churn rates while staying within a bounded bandwidth budget.

The key challenges in the design of Accordion are the algorithms that choose the routing table's size and content. Each Accordion node learns of new neighbors opportunistically, in a way that causes the density of its neighbors to be inversely proportional to their distance in ID space from the node. This distribution allows Accordion to vary the table size along a continuum while still guaranteeing at most O(log n) lookup hops. The user-specified bandwidth budget controls the rate at which a node learns about new neighbors. Each node limits its routing table size by evicting neighbors that it judges likely to have failed. High churn (i.e., short node lifetimes) leads to a high eviction rate. The equilibrium between the learning and eviction processes determines the table size.

Simulations show that Accordion maintains an efficient lookup latency versus bandwidth tradeoff over a wider range of operating conditions than existing DHTs.
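The neighbor-density rule has a neat consequence worth making concrete. Under a 1/x density in ID-space distance (a simplification of my own; Accordion's actual table management is the learning/eviction equilibrium described above), the expected number of neighbors between two distances grows with the log of their ratio, so every doubling of distance holds about the same number of neighbors, matching the structure of O(log n) protocols while letting the total table size float with the bandwidth budget.

```python
# Illustrative consequence of a 1/x neighbor density (simplified from
# the paper's design): integrating 1/x from lo to hi gives log(hi/lo),
# so each octave of ID distance contributes equally many neighbors.
import math

def ideal_neighbors_in(dist_lo, dist_hi):
    """Relative neighbor count between two ID distances under 1/x."""
    return math.log(dist_hi / dist_lo)

# Equal counts per doubling of distance, near or far:
near = ideal_neighbors_in(1, 2)
far = ideal_neighbors_in(1024, 2048)
assert abs(near - far) < 1e-9
```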

@inproceedings{accordion:nsdi05,
title        = {Bandwidth-efficient management of {DHT} routing
tables},
author       = {Jinyang Li and Jeremy Stribling and Robert Morris
and M. Frans Kaashoek},
booktitle    = {Proceedings of the 2nd {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '05)},
year         = 2005,
month        = may,
}
Improving web availability for clients with MONET David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Rohit Rao. NSDI 2005.
@inproceedings{monet:nsdi05,
title        = {Improving Web Availability for Clients with {MONET}},
author       = {David G. Andersen and Hari Balakrishnan and M. Frans
Kaashoek and Rohit Rao},
booktitle    = {Proceedings of the 2nd {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '05)},
year         = 2005,
month        = may,
}
Mandatory security and performance of services in Asbestos David Patrick Ziegler. Master's thesis, Massachusetts Institute of Technology, May 2005.
@mastersthesis{asbestos:ziegler-meng,
title        = {Mandatory Security and Performance of Services in
{Asbestos}},
author       = {David Patrick Ziegler},
school       = {Massachusetts Institute of Technology},
year         = 2005,
month        = may,
}
Acetone: A system call interface for Asbestos labels Clifford A. Frey. Master's thesis, Massachusetts Institute of Technology, May 2005.
@mastersthesis{acetone:frey-meng,
title        = {Acetone: A System Call Interface for {Asbestos}
Labels},
author       = {Clifford A. Frey},
school       = {Massachusetts Institute of Technology},
year         = 2005,
month        = may,
}
Peer-to-peer communication across network address translators Bryan Ford, Pyda Srisuresh, and Dan Kegel. USENIX 2005.
@inproceedings{p2pnat:usenix05,
title        = {Peer-to-Peer Communication Across Network Address
Translators},
author       = {Bryan Ford and Pyda Srisuresh and Dan Kegel},
booktitle    = {Proceedings of the 2005 USENIX Annual Technical
Conference (USENIX '05)},
year         = 2005,
month        = apr,
}
A performance vs. cost framework for evaluating DHT design tradeoffs under churn Jinyang Li, Jeremy Stribling, Robert Morris, M. Frans Kaashoek, and Thomer M. Gil. Infocom 2005.

Protocols for distributed hash tables (DHTs) incorporate features to achieve low latency for lookup requests in the face of churn, continuous changes in membership. These protocol features can include a directed identifier space, parallel lookups, pro-active flooding of membership changes, and stabilization protocols for maintaining accurate routing. In addition, DHT protocols have parameters that can be tuned to achieve different tradeoffs between lookup latency and communication cost due to maintenance traffic. The relative importance of the features and parameters is not well understood, because most previous work evaluates protocols on static networks.

This paper presents a performance versus cost framework (PVC) that allows designers to compare the effects of different protocol features and parameter values. PVC views a protocol as consuming a certain amount of network bandwidth in order to achieve a certain lookup latency, and helps reveal the efficiency with which protocols use additional network resources to improve latency. To demonstrate the value of PVC, this paper simulates Chord, Kademlia, Kelips, OneHop, and Tapestry under different workloads and uses PVC to understand which features are more important under churn. PVC analysis shows that the key to efficiently using additional bandwidth is for a protocol to adjust its routing table size. It also shows that routing table stabilization is wasteful and can be replaced with opportunistic learning through normal lookup traffic. These insights combined demonstrate that PVC is a valuable tool for DHT designers.
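PVC's core comparison — bytes of maintenance traffic spent versus lookup latency achieved — amounts to identifying the parameter settings that are not wasteful. A minimal sketch of that dominance filter, with illustrative data (the paper's full framework additionally compares the resulting tradeoff curves across protocols):

```python
def efficient_settings(points):
    """Keep each (cost_bytes, latency) point that is not dominated by
    another point that is at least as cheap AND at least as fast.
    Only the dominance step of a PVC-style analysis; input values
    here are hypothetical."""
    return sorted(p for p in points
                  if not any(q != p and q[0] <= p[0] and q[1] <= p[1]
                             for q in points))
```

A protocol whose surviving points drop latency steeply as cost grows is using extra bandwidth efficiently; a flat curve means added traffic buys little.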

@inproceedings{dhtcomparison:infocom05,
title        = {A performance vs. cost framework for evaluating
{DHT} design tradeoffs under churn},
author       = {Jinyang Li and Jeremy Stribling and Robert Morris
and M. Frans Kaashoek and Thomer M. Gil},
booktitle    = {Proceedings of the 24th Infocom},
year         = 2005,
month        = mar,
}
Opportunistic routing in multi-hop wireless networks Sanjit Zubin Biswas. Master's thesis, Massachusetts Institute of Technology, March 2005.

This thesis describes ExOR, an integrated routing and MAC protocol for bulk transfers in multi-hop wireless networks. ExOR exploits the broadcast nature of radios by making forwarding decisions based on which nodes receive each transmission. The spatial diversity among receivers provides each transmission multiple opportunities to make progress in the face of packet losses. As a result ExOR can use long links with high loss rates, which would be avoided by unicast routing.

ExOR operates on batches of packets. The source node includes a list of candidate forwarders in each packet, prioritized by closeness to the destination. Receiving nodes buffer successfully received packets and await the end of the batch. The highest priority forwarder then broadcasts the packets in its buffer, including its copy of the "batch map" in each packet. The batch map contains the sender's best guess of the highest priority node to have received each packet. The remaining forwarders then transmit in order, sending only packets which were not acknowledged in the batch maps of higher priority nodes. The forwarders continue to cycle through the priority list until the destination has enough packets to recover the original data using forward error correction.

An evaluation of an implementation on a 38-node 802.11b test-bed shows that ExOR improves bulk data transfer throughput for most node pairs when compared with unicast routing. For pairs between which unicast uses one or two hops, ExOR's robust batch maps prevent unnecessary retransmissions, increasing throughput by nearly 50\%. For longer unicast routes, ExOR takes advantage of spatial diversity, providing gains of a factor of two to four when using a batch size of 10 packets.
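The batch-map bookkeeping described above can be sketched in a few lines. This is only the data-structure logic — ExOR's slotted MAC and transmission scheduling are omitted, and the priority table, node names, and packet IDs are illustrative:

```python
def merge_batch_map(local, heard, rank):
    """Fold an overheard batch map into ours: for each packet, remember
    the highest-priority node known to hold it (lower rank = closer to
    the destination)."""
    for pkt, holder in heard.items():
        if pkt not in local or rank[holder] < rank[local[pkt]]:
            local[pkt] = holder
    return local

def packets_to_forward(me, buffered, batch_map, rank):
    """A forwarder retransmits only the buffered packets that no
    higher-priority node has acknowledged in the merged batch map."""
    return [p for p in buffered
            if p not in batch_map or rank[batch_map[p]] >= rank[me]]
```

In the sketch, once node B overhears that higher-priority node A already holds a packet, B drops it from its own send list.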

@mastersthesis{roofnet:biswas-ms,
title        = {Opportunistic Routing in Multi-Hop Wireless Networks},
author       = {Sanjit Zubin Biswas},
school       = {Massachusetts Institute of Technology},
year         = 2005,
month        = mar,
}
Arpeggio: Metadata searching and content sharing with Chord Austin T. Clements, Dan R. K. Ports, and David R. Karger. IPTPS 2005.

Arpeggio is a peer-to-peer file-sharing network based on the Chord lookup primitive. Queries for data whose metadata matches a certain criterion are performed efficiently by using a distributed keyword-set index, augmented with index-side filtering. We introduce index gateways, a technique for minimizing index maintenance overhead. Because file data is large, Arpeggio employs subrings to track live source peers without the cost of inserting the data itself into the network. Finally, we introduce postfetching, a technique that uses information in the index to improve the availability of rare files. The result is a system that provides efficient query operations with the scalability and reliability advantages of full decentralization, and a content distribution system tuned to the requirements and capabilities of a peer-to-peer network.
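One way to read the "distributed keyword-set index" above: a file's metadata is inserted under a key for each small subset of its keywords, so a multi-keyword query maps to a single lookup at the node owning that subset's key. A sketch under that reading — the subset-size cap and names are illustrative assumptions, not Arpeggio's exact scheme:

```python
from itertools import combinations

def keyword_set_keys(keywords, max_set_size=2):
    """Index keys for one file's metadata: every subset of its
    keywords up to max_set_size. Each frozenset would be hashed to a
    DHT key in a real system; index-side filtering then prunes
    non-matching entries at the indexing node."""
    kws = sorted(set(keywords))
    return [frozenset(c)
            for r in range(1, max_set_size + 1)
            for c in combinations(kws, r)]
```

The cap on subset size bounds the insertion cost per file at the price of filtering larger queries at the index.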

@inproceedings{arpeggio:iptps05,
title        = {Arpeggio: Metadata Searching and Content Sharing
with {C}hord},
author       = {Austin T. Clements and Dan R. K. Ports and David R.
Karger},
booktitle    = {Proceedings of the 4th International Workshop on
Peer-to-Peer Systems (IPTPS05)},
year         = 2005,
month        = feb,
}
OverCite: A cooperative digital research library Jeremy Stribling, Isaac G. Councill, Jinyang Li, M. Frans Kaashoek, David R. Karger, Robert Morris, and Scott Shenker. IPTPS 2005.

CiteSeer is a well-known online resource for the computer science research community, allowing users to search and browse a large archive of research papers. Unfortunately, its current centralized incarnation is costly to run. Although members of the community would presumably be willing to donate hardware and bandwidth at their own sites to assist CiteSeer, the current architecture does not facilitate such distribution of resources. OverCite is a proposal for a new architecture for a distributed and cooperative research library based on a distributed hash table (DHT). The new architecture will harness resources at many sites, and thereby be able to support new features such as document alerts and scale to larger data sets.

@inproceedings{overcite:iptps05,
title        = {OverCite: A Cooperative Digital Research Library},
author       = {Jeremy Stribling and Isaac G. Councill and Jinyang
Li and M. Frans Kaashoek and David R. Karger and Robert Morris and
Scott Shenker},
booktitle    = {Proceedings of the 4th International Workshop on
Peer-to-Peer Systems (IPTPS05)},
year         = 2005,
month        = feb,
}
Bit-rate selection in wireless networks John Bicket. Master's thesis, Massachusetts Institute of Technology, February 2005.
@mastersthesis{roofnet:jbicket-ms,
title        = {Bit-rate Selection in Wireless Networks},
author       = {John Bicket},
school       = {Massachusetts Institute of Technology},
year         = 2005,
month        = feb,
}

## 2004

Middleboxes no longer considered harmful Michael Walfish, Jeremy Stribling, Maxwell Krohn, Hari Balakrishnan, Robert Morris, and Scott Shenker. OSDI 2004.

Intermediate network elements, such as network address translators (NATs), firewalls, and transparent caches are now commonplace. The usual reaction in the network architecture community to these so-called middleboxes is a combination of scorn (because they violate important architectural principles) and dismay (because these violations make the Internet less flexible). While we acknowledge these concerns, we also recognize that middleboxes have become an Internet fact of life for important reasons. To retain their functions while eliminating their dangerous side-effects, we propose an extension to the Internet architecture, called the Delegation-Oriented Architecture (DOA), that not only allows, but also facilitates, the deployment of middleboxes. DOA involves two relatively modest changes to the current architecture: (a) a set of references that are carried in packets and serve as persistent host identifiers and (b) a way to resolve these references to delegates chosen by the referenced host.

@inproceedings{doa:osdi04,
title        = {Middleboxes No Longer Considered Harmful},
author       = {Michael Walfish and Jeremy Stribling and Maxwell
Krohn and Hari Balakrishnan and Robert Morris and Scott Shenker},
booktitle    = {Proceedings of the 6th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '04)},
year         = 2004,
month        = dec,
}
An empirical study of spam traffic and the use of DNS black lists Jaeyeon Jung and Emil Sit. SIGCOMM 2004.
@inproceedings{empircal:imc04,
author       = {Jaeyeon Jung and Emil Sit},
title        = {An Empirical Study of Spam Traffic and the Use of
{DNS} Black Lists},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} Internet
Measurement Conference '04},
year         = 2004,
month        = oct,
}
MultiQ: Automated detection of multiple bottleneck capacities along a path Sachin Katti, Dina Katabi, Charles Blake, Eddie Kohler, and Jacob Strauss. SIGCOMM 2004.
@inproceedings{multiq:imc04,
title        = {MultiQ: Automated Detection of Multiple Bottleneck
Capacities Along a Path},
author       = {Sachin Katti and Dina Katabi and Charles Blake and
Eddie Kohler and Jacob Strauss},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} Internet
Measurement Conference '04},
year         = 2004,
month        = oct,
}
User authentication and remote execution across administrative domains Michael Kaminsky. Ph.D. thesis, Massachusetts Institute of Technology, September 2004.
@phdthesis{kaminsky-phd,
title        = {User Authentication and Remote Execution Across
Administrative Domains},
author       = {Michael Kaminsky},
school       = {Massachusetts Institute of Technology},
year         = 2004,
month        = sep,
}
Link-level measurements from an 802.11b mesh network Daniel Aguayo, John Bicket, Sanjit Biswas, Glenn Judd, and Robert Morris. SIGCOMM 2004.

This paper analyzes the causes of packet loss in a 38-node urban multi-hop 802.11b network. The patterns and causes of loss are important in the design of routing and error-correction protocols, as well as in network planning.

This paper makes the following observations. The distribution of inter-node loss rates is relatively uniform over the whole range of loss rates; there is no clear threshold separating "in range" and "out of range." Most links have relatively stable loss rates from one second to the next, though a small minority have very bursty losses at that time scale. Signal-to-noise ratio and distance have little predictive value for loss rate. The large number of links with intermediate loss rates is probably due to multi-path fading rather than attenuation or interference.

The phenomena discussed here are all well-known. The contributions of this paper are an understanding of their relative importance, of how they interact, and of the implications for MAC and routing protocol design.

@inproceedings{roofnet:sigcomm04,
title        = {Link-level Measurements from an 802.11b Mesh Network},
author       = {Daniel Aguayo and John Bicket and Sanjit Biswas and
Glenn Judd and Robert Morris},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} '04 Conference},
year         = 2004,
month        = aug,
}
Vivaldi: A decentralized network coordinate system Frank Dabek, Russ Cox, Frans Kaashoek, and Robert Morris. SIGCOMM 2004.

This paper is posted as corrected on August 28, 2004. The original version incorrectly described how often nodes measure the RTT to another node: nodes keep one RTT probe RPC outstanding, rather than sending one probe per second.

Large-scale Internet applications can benefit from an ability to predict round-trip times to other hosts without having to contact them first. Explicit measurements are often unattractive because the cost of measurement can outweigh the benefits of exploiting proximity information. Vivaldi is a simple, light-weight algorithm that assigns synthetic coordinates to hosts such that the distance between the coordinates of two hosts accurately predicts the communication latency between the hosts.

Vivaldi is fully distributed, requiring no fixed network infrastructure and no distinguished hosts. It is also efficient: a new host can compute good coordinates for itself after collecting latency information from only a few other hosts. Because it requires little communication, Vivaldi can piggy-back on the communication patterns of the application using it and scale to a large number of hosts.

An evaluation of Vivaldi using a simulated network whose latencies are based on measurements among 1740 Internet hosts shows that a 2-dimensional Euclidean model with height vectors embeds these hosts with low error (the median relative error in round-trip time prediction is 11 percent).
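The simple update rule behind the coordinates can be sketched directly from the abstract: on each sample, a host moves its coordinate a fraction of the prediction error along the line to the measured neighbor. A minimal sketch only — the paper's full algorithm adapts the timestep from error estimates and adds the height term, and `delta=0.25` is an illustrative constant:

```python
import math
import random

def vivaldi_step(xi, xj, rtt, delta=0.25):
    """One update of the basic spring rule: move our coordinate xi
    toward (or away from) neighbor xj by delta times the prediction
    error, so predicted distance drifts toward the measured RTT."""
    dist = math.dist(xi, xj)
    if dist == 0:
        # Coincident coordinates: push in a random unit direction.
        v = [random.gauss(0, 1) for _ in xi]
        n = math.sqrt(sum(c * c for c in v))
        unit = [c / n for c in v]
    else:
        unit = [(a - b) / dist for a, b in zip(xi, xj)]
    error = rtt - dist          # positive: predicted distance too short
    return [a + delta * error * u for a, u in zip(xi, unit)]
```

Repeating this step as RTT samples arrive is what lets a new host converge after contacting only a few others.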

@inproceedings{vivaldi:sigcomm,
title        = {Vivaldi: A Decentralized Network Coordinate System},
author       = {Frank Dabek and Russ Cox and Frans Kaashoek and
Robert Morris},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} '04 Conference},
year         = 2004,
month        = aug,
}
Building secure high-performance web services with OKWS Maxwell Krohn. USENIX 2004.
@inproceedings{krohn:okws,
title        = {Building Secure High-Performance Web Services with
{OKWS}},
author       = {Maxwell Krohn},
month        = jun,
booktitle    = {Proceedings of the 2004 USENIX Annual Technical
Conference (USENIX '04)},
year         = 2004,
}
REX: Secure, Extensible Remote Execution Michael Kaminsky, Eric Peterson, Daniel B. Giffin, Kevin Fu, David Mazières, and M. Frans Kaashoek. USENIX 2004.

This version of the paper contains corrected values for the round-trip time reported in Section 6.2 and the latencies reported in Table 2. The previous values were only one-way instead of round-trip.

The ubiquitous SSH package has demonstrated the importance of secure remote login and execution. As remote execution tools grow in popularity, users require new features and extensions, which are difficult to add to existing systems. REX is a remote execution utility with a novel architecture specifically designed for extensibility as well as security and transparent connection persistence in the face of network complexities such as NAT and dynamic IP addresses. To achieve extensibility, REX bases much of its functionality on a single new abstraction--emulated file descriptor passing across machines. This abstraction is powerful enough for users to extend REX's functionality in many ways without changing the core software or protocol.

REX addresses security in two ways. First, the implementation internally leverages file descriptor passing to split the server into several smaller programs, reducing both privileged and remotely exploitable code. Second, REX selectively delegates authority to processes running on remote machines that need to access other resources. The delegation mechanism lets users incrementally construct trust policies for remote machines. Finally, REX provides mechanisms for accessing servers without globally routable IP addresses, and for resuming sessions when a TCP connection aborts or an endpoint's IP address changes. Measurements of the system demonstrate that REX's architecture does not come at the cost of performance.

@inproceedings{rex:usenix04,
title        = {{REX}: {S}ecure, {E}xtensible {R}emote {E}xecution},
author       = {Michael Kaminsky and Eric Peterson and Daniel B.
Giffin and Kevin Fu and David Mazi{\`e}res and M. Frans Kaashoek},
month        = jun,
pages        = {199--212},
booktitle    = {Proceedings of the 2004 USENIX Annual Technical
Conference (USENIX '04)},
year         = 2004,
}
High-throughput routing for multi-hop wireless networks Douglas S. J. De Couto. Ph.D. thesis, Massachusetts Institute of Technology, June 2004.
@phdthesis{decouto-phd,
title        = {High-Throughput Routing for Multi-Hop Wireless
Networks},
author       = {Douglas S. J. {De Couto}},
school       = {Massachusetts Institute of Technology},
year         = 2004,
month        = jun,
}
On-the-fly verification of rateless erasure codes for efficient content distribution Maxwell N. Krohn, Michael J. Freedman, and David Mazières. IEEE SP 2004.

The quality of peer-to-peer content distribution can suffer when malicious participants intentionally corrupt content. Some systems using simple block-by-block downloading can verify blocks with traditional cryptographic signatures and hashes, but these techniques do not apply well to more elegant systems that use rateless erasure codes for efficient multicast transfers. This paper presents a practical scheme, based on homomorphic hashing, that enables a downloader to perform on-the-fly verification of erasure-encoded blocks.
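The property such a scheme relies on is that the hash is homomorphic: the hash of a sum of blocks equals the product of their hashes, so a downloader can check an encoded block (a linear combination of file blocks) against published per-block hashes. A toy instance with tiny parameters — `P`, `Q`, and the bases are illustrative; real deployments use cryptographically large primes:

```python
# Toy multiplicative homomorphic hash over Z_p*. Blocks are vectors of
# residues mod Q, where Q divides P - 1 and each base in G has order Q,
# which is what makes h(b + c) == h(b) * h(c) (mod P) hold exactly.
P, Q = 1019, 509        # 1018 = 2 * 509, so Q | P - 1 (toy sizes)
G = [4, 9, 25]          # squares mod P, hence elements of order Q

def hhash(block):
    out = 1
    for g, b in zip(G, block):
        out = (out * pow(g, b % Q, P)) % P
    return out

def add_blocks(b, c):
    """Component-wise sum mod Q, as erasure coding combines blocks."""
    return [(x + y) % Q for x, y in zip(b, c)]
```

A mismatch between the computed hash and the product of the published hashes flags a corrupted encoded block on the fly, before the whole download finishes.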

@inproceedings{otfvec,
title        = {On-the-Fly Verification of Rateless Erasure Codes
for Efficient Content Distribution},
author       = {Maxwell N. Krohn and Michael J. Freedman and David
Mazi{\`e}res},
booktitle    = {Proceedings of the {IEEE} Symposium on Security and
Privacy},
month        = may,
year         = 2004,
}
Providing asynchronous file I/O for the Plan 9 operating system Jason Hickey. Master's thesis, Massachusetts Institute of Technology, May 2004.
@mastersthesis{plan9:jmhickey-meng,
title        = {Providing Asynchronous File {I/O} for the {Plan 9}
Operating System},
author       = {Jason Hickey},
school       = {Massachusetts Institute of Technology},
year         = 2004,
month        = may,
}
M&M: A passive toolkit for measuring, correlating, and tracking path characteristics Sachin Katti, Dina Katabi, Eddie Kohler, and Jacob Strauss. MIT CSAIL technical report, April 2004.
@techreport{mnm:katti04-tr,
author       = {Sachin Katti and Dina Katabi and Eddie Kohler and
Jacob Strauss},
title        = {{M\&M}: A Passive Toolkit for Measuring,
Correlating, and Tracking Path Characteristics},
institution  = {{MIT} Computer Science and Artificial Intelligence
Laboratory},
number       = {MIT-CSAIL-TR-945},
year         = 2004,
month        = apr,
}
Designing a DHT for low latency and high throughput Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans Kaashoek, and Robert Morris. NSDI 2004.

Designing a wide-area distributed hash table (DHT) that provides high-throughput and low-latency network storage is a challenge. Existing systems have explored a range of solutions, including iterative routing, recursive routing, proximity routing and neighbor selection, erasure coding, replication, and server selection.

This paper explores the design of these techniques and their interaction in a complete system, drawing on the measured performance of a new DHT implementation and results from a simulator with an accurate Internet latency model. New techniques that resulted from this exploration include use of latency predictions based on synthetic coordinates, efficient integration of lookup routing and data fetching, and a congestion control mechanism suitable for fetching data striped over large numbers of servers.

Measurements with 425 server instances running on 150 PlanetLab and RON hosts show that the latency optimizations reduce the time required to locate and fetch data by a factor of two. The throughput optimizations result in a sustainable bulk read throughput related to the number of DHT hosts times the capacity of the slowest access link; with 150 selected PlanetLab hosts, the peak aggregate throughput over multiple clients is 12.8 megabytes per second.

@inproceedings{dhash:nsdi,
title        = {Designing a {DHT} for low latency and high
throughput},
author       = {Frank Dabek and Jinyang Li and Emil Sit and James
Robertson and M. Frans Kaashoek and Robert Morris},
booktitle    = {Proceedings of the 1st {USENIX} {S}ymposium on
{N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '04)},
year         = 2004,
month        = mar,
}
Comparing the performance of distributed hash tables under churn Jinyang Li, Jeremy Stribling, Thomer M. Gil, Robert Morris, and M. Frans Kaashoek. IPTPS 2004.

A protocol for a distributed hash table (DHT) incurs communication costs to keep up with churn---changes in membership---in order to maintain its ability to route lookups efficiently. This paper formulates a unified framework for evaluating cost and performance. Communication costs are combined into a single cost measure (bytes), and performance benefits are reduced to a single latency measure. This approach correctly accounts for background maintenance traffic and timeouts during lookup due to stale routing data, and also correctly leaves open the possibility of different preferences in the tradeoff of lookup time versus communication cost. Using the unified framework, this paper analyzes the effects of DHT parameters on the performance of four protocols under churn.

@inproceedings{dhtcomparison:iptps04,
title        = {Comparing the performance of distributed hash tables
under churn},
author       = {Jinyang Li and Jeremy Stribling and Thomer M. Gil
and Robert Morris and M. Frans Kaashoek},
booktitle    = {Proceedings of the 3rd International Workshop on
Peer-to-Peer Systems (IPTPS04)},
year         = 2004,
month        = feb,
}
UsenetDHT: A low overhead Usenet server Emil Sit, Frank Dabek, and James Robertson. IPTPS 2004.
@inproceedings{usenetdht:iptps04,
title        = {{UsenetDHT}: A Low Overhead {Usenet} Server},
author       = {Emil Sit and Frank Dabek and James Robertson},
booktitle    = {Proceedings of the 3rd International Workshop on
Peer-to-Peer Systems (IPTPS04)},
year         = 2004,
month        = feb,
}
Parsing expression grammars: A recognition-based syntactic foundation Bryan Ford. POPL 2004.
@inproceedings{parsing:popl04,
author       = {Bryan Ford},
title        = {Parsing Expression Grammars: A Recognition-Based
Syntactic Foundation},
booktitle    = {Proceedings of the Symposium on Principles of
Programming Languages},
year         = 2004,
month        = jan,
}

## 2003

Unmanaged internet protocol: Taming the edge network management crisis Bryan Ford. HotNets 2003.
@inproceedings{uip:hotnets03,
author       = {Bryan Ford},
title        = {Unmanaged Internet Protocol: Taming the Edge Network
Management Crisis},
booktitle    = {Proceedings of the Second {W}orkshop on {H}ot
{T}opics in {N}etworks ({HotNets-II})},
year         = 2003,
month        = nov,
organization = {{ACM SIGCOMM}},
}
Practical, distributed network coordinates Russ Cox, Frank Dabek, Frans Kaashoek, Jinyang Li, and Robert Morris. HotNets 2003.

Vivaldi is a distributed algorithm that assigns synthetic coordinates to Internet hosts, so that the Euclidean distance between two hosts' coordinates predicts the network latency between them. Each node in Vivaldi computes its coordinates by simulating its position in a network of physical springs. Vivaldi is both distributed and efficient: no fixed infrastructure need be deployed and a new host can compute useful coordinates after collecting latency information from only a few other hosts. Vivaldi can rely on piggy-backing latency information on application traffic instead of generating extra traffic by sending its own probe packets.

This paper evaluates Vivaldi through simulations of 750 hosts, with a matrix of inter-host latencies derived from measurements between 750 real Internet hosts. Vivaldi finds synthetic coordinates that predict the measured latencies with a median relative error of 14 percent. The simulations show that a new host joining an existing Vivaldi system requires fewer than 10 probes to achieve this accuracy. Vivaldi is currently used by the Chord distributed hash table to perform proximity routing, replica selection, and retransmission timer estimation.

@inproceedings{hotnets:vivaldi,
title        = {Practical, Distributed Network Coordinates},
author       = {Russ Cox and Frank Dabek and Frans Kaashoek and
Jinyang Li and Robert Morris},
booktitle    = {Proceedings of the Second {W}orkshop on {H}ot
{T}opics in {N}etworks ({HotNets-II})},
year         = 2003,
month        = nov,
organization = {{ACM SIGCOMM}},
}
Opportunistic routing in multi-hop wireless networks Sanjit Biswas and Robert Morris. HotNets 2003.

This paper describes Extremely Opportunistic Routing (ExOR), a new unicast routing technique for multi-hop wireless networks. ExOR forwards each packet through a sequence of nodes, deferring the choice of each node in the sequence until after the previous node has transmitted the packet on its radio. ExOR then determines which node, of all the nodes that successfully received that transmission, is the node closest to the destination. That closest node transmits the packet. The result is that each hop moves the packet farther (on average) than the hops of the best possible pre-determined route.

The ExOR design addresses the challenge of choosing a forwarding node after transmission using a distributed algorithm. First, when a node transmits a packet, it includes in the packet a simple schedule describing the priority order in which the potential receivers should forward the packet. The node computes the schedule based on shared measurements of inter-node delivery rates. ExOR then uses a distributed slotted MAC protocol for acknowledgments to ensure that the receivers agree who the highest priority receiver was.

The efficacy of ExOR depends mainly on the rate at which the reception probability falls off with distance. Simulations based on measured radio characteristics suggest that ExOR reduces the total number of transmissions by nearly a factor of two over the best possible pre-determined route.

@inproceedings{roofnet:exor-hotnets03,
title        = {Opportunistic Routing in Multi-Hop Wireless Networks},
author       = {Sanjit Biswas and Robert Morris},
booktitle    = {Proceedings of the Second {W}orkshop on {H}ot
{T}opics in {N}etworks ({HotNets-II})},
year         = 2003,
month        = nov,
organization = {{ACM SIGCOMM}},
}
Measuring the effects of internet path faults on reactive routing Nick Feamster, David G. Andersen, Hari Balakrishnan, and M. Frans Kaashoek. SIGMETRICS 2003.
@inproceedings{sigmetrics:routing,
title        = {Measuring the effects of internet path faults on
reactive routing},
author       = {Nick Feamster and David G. Andersen and Hari
Balakrishnan and M. Frans Kaashoek},
pages        = {126--137},
booktitle    = {Proceedings of the 2003 Conference on Measurement
and Modelling of Computer Systems},
year         = 2003,
month        = nov,
organization = {{ACM SIGMETRICS}},
}
Scalable Internet routing on topology-independent node identities Bryan Ford. MIT LCS technical report, October 2003.
@techreport{uip:ford03-tr,
author       = {Bryan Ford},
title        = {Scalable {Internet} Routing on Topology-Independent
Node Identities},
institution  = {{MIT} Laboratory for Computer Science},
number       = {MIT-LCS-TR-926},
year         = 2003,
month        = oct,
}
A Measurement Study of Available Bandwidth Estimation Tools Jacob Strauss, Dina Katabi, and Frans Kaashoek. SIGCOMM 2003.
@inproceedings{spruce:imc03,
title        = {{A Measurement Study of Available Bandwidth
Estimation Tools}},
author       = {Jacob Strauss and Dina Katabi and Frans Kaashoek},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} Internet
Measurement Conference '03},
year         = 2003,
month        = oct,
}
Decentralized user authentication in a global file system Michael Kaminsky, George Savvides, David Mazières, and M. Frans Kaashoek. SOSP 2003.

The challenge for user authentication in a global file system is allowing people to grant access to specific users and groups in remote administrative domains, without assuming any kind of pre-existing administrative relationship. The traditional approach to user authentication across administrative domains is for users to prove their identities through a chain of certificates. Certificates allow for general forms of delegation, but they often require more infrastructure than is necessary to support a network file system.

This paper introduces an approach without certificates. Local authentication servers pre-fetch and cache remote user and group definitions from remote authentication servers. During a file access, an authentication server can establish identities for users based just on local information. This approach is particularly well-suited to file systems, and it provides a simple and intuitive interface that is similar to those found in local access control mechanisms. An implementation of the authentication server and a file server supporting access control lists demonstrate the viability of this design in the context of the Self-certifying File System (SFS). Experiments demonstrate that the authentication server can scale to groups with tens of thousands of members.

@inproceedings{sfs:sosp03,
title        = {{D}ecentralized User Authentication in a Global File
System},
author       = {Michael Kaminsky and George Savvides and David
Mazi{\`e}res and M. Frans Kaashoek},
pages        = {60--73},
booktitle    = {Proceedings of the 19th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '03)},
year         = 2003,
month        = oct,
address      = {Bolton Landing, New York},
}
A high-throughput path metric for multi-hop wireless routing Douglas S. J. De Couto, Daniel Aguayo, John Bicket, and Robert Morris. MobiCom 2003.

This paper presents the expected transmission count metric (ETX), which finds high-throughput paths on multi-hop wireless networks. ETX minimizes the expected total number of packet transmissions (including retransmissions) required to successfully deliver a packet to the ultimate destination. The ETX metric incorporates the effects of link loss ratios, asymmetry in the loss ratios between the two directions of each link, and interference among the successive links of a path. In contrast, the minimum hop-count metric chooses arbitrarily among the different paths of the same minimum length, regardless of the often large differences in throughput among those paths, and ignoring the possibility that a longer path might offer higher throughput.

This paper describes the design and implementation of ETX as a metric for the DSDV and DSR routing protocols, as well as modifications to DSDV and DSR which allow them to use ETX. Measurements taken from a 29-node 802.11b test-bed demonstrate the poor performance of minimum hop-count, illustrate the causes of that poor performance, and confirm that ETX improves performance. For long paths the throughput improvement is often a factor of two or more, suggesting that ETX will become more useful as networks grow larger and paths become longer.
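The metric described above is straightforward to compute. The sketch below is illustrative, not the paper's code: it assumes each link's ETX is 1/(df · dr), where df and dr are the measured forward and reverse delivery ratios, and a path's ETX is the sum of its links' ETX values.

```python
def link_etx(df: float, dr: float) -> float:
    """Expected transmissions for one link, given delivery ratios in (0, 1]."""
    if df <= 0 or dr <= 0:
        return float("inf")  # link never delivers; unusable
    return 1.0 / (df * dr)

def path_etx(links) -> float:
    """Sum of link ETX over a path given as [(df, dr), ...]."""
    return sum(link_etx(df, dr) for df, dr in links)

# A two-hop path over good links can beat a one-hop path over a lossy link:
one_hop = path_etx([(0.5, 0.5)])              # 4.0 expected transmissions
two_hop = path_etx([(0.9, 0.9), (0.9, 0.9)])  # about 2.47
assert two_hop < one_hop
```

This captures why minimum hop-count misleads: the single lossy hop looks shorter but costs more transmissions end-to-end.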

@inproceedings{grid:mobicom03,
title        = {A High-Throughput Path Metric for Multi-Hop Wireless
Routing},
author       = {Douglas S. J. {De Couto} and Daniel Aguayo and John
Bicket and Robert Morris},
booktitle    = {Proceedings of the 9th {ACM} International
Conference on Mobile Computing and Networking ({MobiCom} '03)},
year         = 2003,
month        = sep,
}
SSL splitting: securely serving data from untrusted caches Chris Lesniewski-Laas, and M. Frans Kaashoek. USENIX Security 2003.

A popular technique for reducing the bandwidth load on Web servers is to serve the content from proxies. Typically these hosts are trusted by the clients and server not to modify the data that they proxy. SSL splitting is a new technique for guaranteeing the integrity of data served from proxies without requiring changes to Web clients. Instead of relaying an insecure HTTP connection, an SSL splitting proxy simulates a normal Secure Sockets Layer (SSL) connection with the client by merging authentication records from the server with data records from a cache. This technique reduces the bandwidth load on the server, while allowing an unmodified Web browser to verify that the data served from proxies is endorsed by the originating server.
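The record-merging idea can be sketched in a few lines. This is a structural illustration with a hypothetical API, not the actual SSL splitting implementation: the server ships only a MAC per record, the proxy supplies the matching payload from its cache, and the client checks the merged stream as it would an ordinary MACed record.

```python
import hashlib
import hmac

KEY = b"session-key-shared-with-client"  # established by the real SSL handshake

def server_auth_record(payload: bytes) -> bytes:
    """Server side: send only the authentication record (bandwidth ~ MAC size)."""
    return hmac.new(KEY, payload, hashlib.sha256).digest()

def proxy_merge(auth: bytes, cached_payload: bytes) -> tuple:
    """Proxy side: combine the server's MAC with locally cached data."""
    return (cached_payload, auth)

def client_verify(record: tuple) -> bytes:
    """Client side: accept the payload only if the server endorsed it."""
    payload, auth = record
    expected = hmac.new(KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, auth):
        raise ValueError("proxy served data the server did not endorse")
    return payload

data = b"<html>cached page</html>"
record = proxy_merge(server_auth_record(data), data)
assert client_verify(record) == data  # tampered cache data would raise
```

The bandwidth saving follows directly: the server transmits a fixed-size MAC per record instead of the full payload.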

SSL splitting is implemented as a patch to the industry-standard OpenSSL library, with which the server is linked. In experiments replaying two-hour access.log traces taken from LCS Web sites over an ADSL link, SSL splitting reduces bandwidth consumption of the server by between 25% and 90% depending on the warmth of the cache and the redundancy of the trace. Uncached requests forwarded through the proxy exhibit latencies within approximately 5% of those of an unmodified SSL server.

@inproceedings{ssl-splitting:usenixsecurity03,
title        = {{SSL} splitting: securely serving data from
untrusted caches},
author       = {Chris Lesniewski-Laas and M. Frans Kaashoek},
pages        = {187--199},
booktitle    = {Proceedings of the 12th {USENIX} {S}ecurity
{S}ymposium},
year         = 2003,
month        = aug,
}
Experience with an evolving overlay network testbed David Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. SIGCOMM 33(3), July 2003.
@article{ron:ccr,
title        = {Experience with an evolving overlay network testbed},
author       = {David Andersen and Hari Balakrishnan and M. Frans
Kaashoek and Robert Morris},
month        = jul,
volume       = 33,
number       = 3,
pages        = {13--19},
journal      = {{ACM SIGCOMM} Computer Communication Review},
year         = 2003,
}
Multiprocessor support for event-driven programs Nickolai Zeldovich, Alexander Yip, Frank Dabek, Robert Morris, David Mazières, and Frans Kaashoek. USENIX 2003.

This paper presents a new asynchronous programming library (libasync-smp) that allows event-driven applications to take advantage of multiprocessors by running code for event handlers in parallel. To control the concurrency between events, the programmer can specify a color for each event: events with the same color (the default case) are handled serially; events with different colors can be handled in parallel. The programmer can incrementally expose parallelism in existing event-driven applications by assigning different colors to computationally-intensive events that do not share mutable state.
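The coloring rule can be sketched with a small scheduler. This is an assumed illustration of the rule, not libasync-smp itself: events sharing a color are serialized under that color's lock, while events with different colors may run on different CPUs concurrently.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

class ColoredScheduler:
    """Run same-color events serially; different colors may run in parallel."""
    def __init__(self, workers: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._locks = defaultdict(threading.Lock)  # one lock per color

    def submit(self, color, handler, *args):
        lock = self._locks[color]
        def run():
            with lock:  # serializes all events of this color
                return handler(*args)
        return self._pool.submit(run)

# Same-color events may share mutable state without their own locking:
counts = {"n": 0}
def bump():
    counts["n"] += 1

sched = ColoredScheduler()
futures = [sched.submit("counter", bump) for _ in range(100)]
for f in futures:
    f.result()
assert counts["n"] == 100
```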

An evaluation of libasync-smp demonstrates that applications achieve multiprocessor speedup with little programming effort. As an example, parallelizing the cryptography in the SFS file server required about 90 lines of changed code in two modules, out of a total of about 12,000 lines. Multiple clients were able to read large cached files from the libasync-smp SFS server running on a 4-CPU machine 2.5 times as fast as from an unmodified uniprocessor SFS server on one CPU. Applications without computationally intensive tasks also benefit: an event-driven Web server achieves a 1.5x speedup on four CPUs with multiple clients reading small cached files.

@inproceedings{asyncmp:usenix03,
title        = {Multiprocessor Support for Event-Driven Programs},
author       = {Nickolai Zeldovich and Alexander Yip and Frank Dabek
and Robert Morris and David Mazi{\`e}res and Frans Kaashoek},
booktitle    = {Proceedings of the 2003 USENIX Annual Technical
Conference (USENIX '03)},
year         = 2003,
month        = jun,
}
Certifying program execution with secure processors Benjie Chen, and Robert Morris. HotOS 2003.
@inproceedings{cerium:hotos03,
title        = {{C}ertifying Program Execution with Secure
Processors},
author       = {Benjie Chen and Robert Morris},
booktitle    = {Proceedings of the 9th {W}orkshop on {H}ot {T}opics
in {O}perating {S}ystems ({HotOS-IX})},
year         = 2003,
month        = may,
}
Robust and efficient data management for a distributed hash table Josh Cates. Master's thesis, Massachusetts Institute of Technology, May 2003.
@mastersthesis{chord:cates-meng,
title        = {Robust and Efficient Data Management for a
Distributed Hash Table},
author       = {Josh Cates},
school       = {Massachusetts Institute of Technology},
year         = 2003,
month        = may,
}
On the feasibility of peer-to-peer web indexing and search Jinyang Li, Boon Thau Loo, Joseph M. Hellerstein, M. Frans Kaashoek, David Karger, and Robert Morris. IPTPS 2003.
@inproceedings{p2psearch:iptps03,
title        = {On the feasibility of peer-to-peer web indexing and
search},
author       = {Jinyang Li and Boon Thau Loo and Joseph M.
Hellerstein and M. Frans Kaashoek and David Karger and Robert
Morris},
booktitle    = {Proceedings of the 2nd International Workshop on
Peer-to-Peer Systems (IPTPS03)},
year         = 2003,
month        = feb,
}
Towards a common API for structured peer-to-peer overlays Frank Dabek, Ben Zhao, Peter Druschel, John Kubiatowicz, and Ion Stoica. IPTPS 2003.

In this paper, we describe an ongoing effort to define common APIs for structured peer-to-peer overlays and the key abstractions that can be built on them. In doing so, we hope to facilitate independent innovation in overlay protocols, services, and applications, to allow direct experimental comparisons, and to encourage application development by third parties. We provide a snapshot of our efforts and discuss open problems in an effort to solicit feedback from the research community.

@inproceedings{iptps:apis,
title        = {Towards a Common {API} for Structured Peer-to-Peer
Overlays},
author       = {Frank Dabek and Ben Zhao and Peter Druschel and John
Kubiatowicz and Ion Stoica},
booktitle    = {Proceedings of the 2nd International Workshop on
Peer-to-Peer Systems (IPTPS03)},
year         = 2003,
month        = feb,
}
Chord: A scalable peer-to-peer lookup service for internet applications Ion Stoica, Robert Morris, David Liben-Nowell, David Karger, M. Frans Kaashoek, Frank Dabek, and Hari Balakrishnan. IEEE Transactions on Networking 11, February 2003.

A fundamental problem that confronts peer-to-peer applications is the efficient location of the node that stores a desired data item. This paper presents Chord, a distributed lookup protocol that addresses this problem. Chord provides support for just one operation: given a key, it maps the key onto a node. Data location can be easily implemented on top of Chord by associating a key with each data item, and storing the key/data pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. Results from theoretical analysis and simulations show that Chord is scalable: communication cost and the state maintained by each node scale logarithmically with the number of Chord nodes.
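The single operation Chord provides — key to node — is just "first node clockwise of the key on the identifier ring". The toy sketch below shows that mapping; real Chord resolves it in O(log N) hops using finger tables rather than a global sorted list.

```python
import hashlib
from bisect import bisect_left

M = 2 ** 16  # small identifier space for illustration (Chord uses 160 bits)

def ident(name: str) -> int:
    """Hash a name onto the identifier ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % M

def successor(node_ids, key_id):
    """First node ID clockwise of key_id on the ring (node_ids sorted)."""
    i = bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]  # wrap around past the largest ID

nodes = sorted(ident(f"node-{i}") for i in range(8))
owner = successor(nodes, ident("my-data-item"))
assert owner in nodes
```

Storing the key/data pair at `successor(key)` is exactly the data-location layer the abstract describes building on top of the lookup primitive.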

@article{chord:ton,
title        = {Chord: A Scalable Peer-to-peer Lookup Service for
Internet Applications},
author       = {Ion Stoica and Robert Morris and David Liben-Nowell
and David Karger and M. Frans Kaashoek and Frank Dabek and Hari
Balakrishnan},
journal      = {{IEEE} Transactions on Networking},
volume       = 11,
year         = 2003,
month        = feb,
}
Looking up data in P2P systems Hari Balakrishnan, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Communications of the ACM 46, February 2003.
@article{acm:lookup,
title        = {Looking up data in {P2P} systems},
author       = {Hari Balakrishnan and M. Frans Kaashoek and David
Karger and Robert Morris and Ion Stoica},
volume       = 46,
month        = feb,
year         = 2003,
pages        = {43--48},
journal      = {Communications of the ACM},
}
Melody: A distributed music-sharing system James Robertson. Master's thesis, Massachusetts Institute of Technology, February 2003.
@mastersthesis{chord:jsr-meng,
title        = {Melody: A Distributed Music-Sharing System},
author       = {James Robertson},
school       = {Massachusetts Institute of Technology},
year         = 2003,
month        = feb,
}
SSL splitting and barnraising: Cooperative caching with authenticity guarantees Chris Lesniewski-Laas. Master's thesis, Massachusetts Institute of Technology, February 2003.

SSL splitting is a cryptographic technique to guarantee that public data served by caching Web proxies is endorsed by the originating server. When a client makes a request, the trusted server generates a stream of authentication records and sends them to the untrusted proxy, which combines them with a stream of data records retrieved from its local cache. The combined stream is relayed to the client, a standard Web browser, which verifies the data's integrity. Since the combined stream simulates a normal Secure Sockets Layer (SSL) connection, SSL splitting works with unmodified browsers; however, since it does not provide confidentiality, it is appropriate for applications that require only authentication. The server must be linked to a patched version of the industry-standard OpenSSL library; no other server modifications are necessary. In experiments replaying two-hour access.log traces taken from LCS Web sites over a DSL link, SSL splitting reduces bandwidth consumption of the server by between 25% and 90% depending on the warmth of the cache and the redundancy of the trace. Uncached requests forwarded through the proxy exhibit latencies within approximately 5% of those of an unmodified SSL server.

@mastersthesis{ssl-splitting:ctl-meng,
title        = {{SSL} splitting and Barnraising: Cooperative caching
with authenticity guarantees},
author       = {Chris Lesniewski-Laas},
school       = {Massachusetts Institute of Technology},
year         = 2003,
month        = feb,
}
REX: Secure, modular remote execution through file descriptor passing Michael Kaminsky, Eric Peterson, Kevin Fu, David Mazières, and M. Frans Kaashoek. MIT LCS technical report, January 2003.

The ubiquitous SSH package has demonstrated the importance of secure remote login and execution. This paper presents a new system, REX, designed to provide remote login and execution in the context of the SFS secure distributed file system. REX departs from traditional remote login design and is built around two main mechanisms---file descriptor passing and a user agent process.

File descriptor passing allows REX to be split into several smaller pieces; privileged code can run as its own process to provide enhanced security guarantees. REX also emulates secure file descriptor passing over network connections, allowing users to build extensions to REX outside of the core REX software.
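The primitive this design builds on is standard Unix file-descriptor passing over a local socket (SCM_RIGHTS). The sketch below shows that mechanism via Python's socket API; it is not REX's own code, which is written against SFS's C++ libraries.

```python
import os
import socket

# Two processes would each hold one end of this AF_UNIX socket pair.
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

r, w = os.pipe()
socket.send_fds(parent, [b"x"], [r])  # hand the pipe's read end to the peer
msg, fds, flags, addr = socket.recv_fds(child, 1, 1)

# The receiver now owns a live descriptor for the same pipe:
os.write(w, b"hello over a passed fd")
os.close(w)
assert os.read(fds[0], 64) == b"hello over a passed fd"
```

Splitting a server into small processes connected this way is what lets the privileged code run isolated, as the paragraph above describes.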

REX uses and extends SFS's agent mechanism to provide a transparent distributed computing environment to users. The agent stores private keys, server nicknames, and other per-user configuration state; REX makes the SFS agent available to programs that it executes on remote machines.

We have an implementation of REX and demonstrate that its flexibility does not come at the cost of performance. Initial REX connections are comparable to those of SSH in speed, while subsequent connections are much faster because REX exploits the SFS agent to cache connection state to avoid costly public-key operations.

@techreport{sfs:rextr03,
title        = {{REX}: Secure, modular remote execution through file
descriptor passing},
author       = {Michael Kaminsky and Eric Peterson and Kevin Fu and
David Mazi{\`e}res and M. Frans Kaashoek},
institution  = {{MIT} Laboratory for Computer Science},
year         = 2003,
month        = jan,
number       = {MIT-LCS-TR-884},
note         = {http://www.pdos.lcs.mit.edu/papers/sfs:rextr03/},
}

## 2002

Ivy: A read/write peer-to-peer file system Athicha Muthitacharoen, Robert Morris, Thomer Gil, and Benjie Chen. OSDI 2002.
@inproceedings{ivy:osdi02,
title        = {Ivy: A Read/Write Peer-to-peer File System},
author       = {Athicha Muthitacharoen and Robert Morris and Thomer
Gil and Benjie Chen},
booktitle    = {Proceedings of the 5th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '02)},
year         = 2002,
month        = dec,
}
A session-based architecture for Internet mobility Alex C. Snoeren. Ph.D. thesis, Massachusetts Institute of Technology, December 2002.
@phdthesis{snoeren-phd,
title        = {A Session-Based Architecture for {Internet} Mobility},
author       = {Alex C. Snoeren},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = dec,
}
Tarzan: A peer-to-peer anonymizing network layer Michael J. Freedman, and Robert Morris. CCS 2002.

Tarzan is a peer-to-peer anonymous IP network overlay. Because it provides IP service, Tarzan is general-purpose and transparent to applications. Organized as a decentralized peer-to-peer overlay, Tarzan is fault-tolerant, highly scalable, and easy to manage.

Tarzan achieves its anonymity with layered encryption and multi-hop routing, much like a Chaumian mix. A message initiator chooses a path of peers pseudo-randomly through a restricted topology in a way that adversaries cannot easily influence. Cover traffic prevents a global observer from using traffic analysis to identify an initiator. Protocols toward unbiased peer-selection offer new directions for distributing trust among untrusted entities.
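The layering pattern works like this: the initiator encrypts innermost-first, so each relay along the path peels exactly one layer. The toy XOR "cipher" below is a stand-in for real symmetric encryption; only the onion structure is the point.

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from key (toy construction)."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def xor_cipher(key: bytes, data: bytes) -> bytes:
    """XOR stream cipher: applying it twice with the same key decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def wrap(message: bytes, hop_keys) -> bytes:
    """Initiator: add layers innermost-first, one per hop on the path."""
    for key in reversed(hop_keys):
        message = xor_cipher(key, message)
    return message

keys = [b"hop1", b"hop2", b"hop3"]
onion = wrap(b"payload", keys)
for key in keys:  # each relay removes only its own layer, in path order
    onion = xor_cipher(key, onion)
assert onion == b"payload"
```

No single relay sees both the initiator's identity and the plaintext, which is the property the mix-style design relies on.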

Tarzan provides anonymity to either clients or servers, without requiring that both participate. In both cases, Tarzan uses a network address translator (NAT) to bridge between Tarzan hosts and oblivious Internet hosts.

Measurements show that Tarzan imposes minimal overhead over a corresponding non-anonymous overlay route.

@inproceedings{tarzan:ccs9,
title        = {Tarzan: A Peer-to-Peer Anonymizing Network Layer},
author       = {Michael J. Freedman and Robert Morris},
booktitle    = {Proceedings of the 9th {ACM} Conference on Computer
and Communications Security ({CCS-9})},
year         = 2002,
month        = nov,
}
Programming language optimizations for modular router configurations Eddie Kohler, Robert Morris, and Benjie Chen. ASPLOS 2002.
@inproceedings{click:asplos02,
title        = {Programming Language Optimizations for Modular
Router Configurations},
author       = {Eddie Kohler and Robert Morris and Benjie Chen},
booktitle    = {Proceedings of the 10th Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS)},
month        = oct,
year         = 2002,
}
Performance of multihop wireless networks: Shortest path is not enough Douglas S. J. De Couto, Daniel Aguayo, Benjamin A. Chambers, and Robert Morris. HotNets 2002.

Existing wireless ad hoc routing protocols typically find routes with the minimum hop-count. This paper presents experimental evidence from two wireless test-beds which shows that there are usually multiple minimum hop-count paths, many of which have poor throughput. As a result, minimum-hop-count routing often chooses routes that have significantly less capacity than the best paths that exist in the network. Much of the reason for this is that many of the radio links between nodes have loss rates low enough that the routing protocol is willing to use them, but high enough that much of the capacity is consumed by retransmissions. These observations suggest that more attention be paid to link quality when choosing ad hoc routes; the paper presents measured link characteristics likely to be useful in devising a better path quality metric.

@inproceedings{grid:hotnets02,
title        = {Performance of Multihop Wireless Networks: Shortest
Path is Not Enough},
author       = {Douglas S. J. {De Couto} and Daniel Aguayo and
Benjamin A. Chambers and Robert Morris},
booktitle    = {Proceedings of the First {W}orkshop on {H}ot
{T}opics in {N}etworks ({HotNets-I})},
year         = 2002,
month        = oct,
organization = {{ACM SIGCOMM}},
}
Packrat parsing: Simple, powerful, lazy, linear time Bryan Ford. ICFP 2002.
@inproceedings{packrat-parsing:icfp02,
author       = {Bryan Ford},
title        = {Packrat Parsing: Simple, Powerful, Lazy, Linear Time},
booktitle    = {Proceedings of the 2002 International Conference on
Functional Programming},
year         = 2002,
month        = oct,
}
Event-driven programming for robust software Frank Dabek, Nickolai Zeldovich, Frans Kaashoek, David Mazières, and Robert Morris. SIGOPS European Workshop 2002.

Events are a better means of managing I/O concurrency in server software than threads: events help avoid bugs caused by the unnecessary CPU concurrency introduced by threads. Event-based programs also tend to have more stable performance under heavy load than threaded programs. We argue that our libasync-smp non-blocking I/O library makes event-based programming convenient and evaluate extensions to the library that allow event-based programs to take advantage of multi-processors. We conclude that events provide all the benefits of threads, with substantially less complexity; the result is more robust software.

@inproceedings{events:sigops,
title        = {Event-Driven Programming for Robust Software},
author       = {Frank Dabek and Nickolai Zeldovich and Frans
Kaashoek and David Mazi{\`e}res and Robert Morris},
booktitle    = {Proceedings of the 2002 SIGOPS European Workshop},
year         = 2002,
month        = sep,
}
Span: An energy-efficient coordination algorithm for topology maintenance in ad hoc wireless networks Benjie Chen, Kyle Jamieson, Hari Balakrishnan, and Robert Morris. ACM Wireless Networks 8(5), September 2002.

This paper presents Span, a power saving technique for multi-hop ad hoc wireless networks that reduces energy consumption without significantly diminishing the capacity or connectivity of the network. Span builds on the observation that when a region of a shared-channel wireless network has a sufficient density of nodes, only a small number of them need be on at any time to forward traffic for active connections.

Span is a distributed, randomized algorithm where nodes make local decisions on whether to sleep, or to join a forwarding backbone as a coordinator. Each node bases its decision on an estimate of how many of its neighbors will benefit from it being awake, and the amount of energy available to it. We give a randomized algorithm where coordinators rotate with time, demonstrating how localized node decisions lead to a connected, capacity-preserving global topology.
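A simplified version of the local eligibility test can be sketched as follows. This is an assumption-laden reduction of Span's rule (the real algorithm also weighs remaining energy and randomizes announcement delays): a node should volunteer as coordinator if some pair of its neighbors cannot reach each other directly or through an existing coordinator.

```python
from itertools import combinations

def linked(x, y, links) -> bool:
    """Undirected link test; links is a set of frozenset pairs."""
    return frozenset((x, y)) in links

def should_volunteer(neighbors, links, coordinators) -> bool:
    for a, b in combinations(sorted(neighbors), 2):
        if linked(a, b, links):
            continue  # neighbors hear each other directly
        if any(linked(a, c, links) and linked(c, b, links)
               for c in coordinators):
            continue  # reachable through an existing coordinator
        return True   # this pair needs a new coordinator
    return False
```

Because every node evaluates this rule on purely local information, the volunteered coordinators collectively form the connected backbone the paragraph describes.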

Improvement in system lifetime due to Span increases as the ratio of idle-to-sleep energy consumption increases. Our simulations show that with a practical energy model, system lifetime of an 802.11 network in power saving mode with Span is a factor of two better than without. Additionally, Span also improves communication latency and capacity.

@article{span:wireless01,
title        = {Span: An Energy-Efficient Coordination Algorithm for
Topology Maintenance in Ad Hoc Wireless Networks},
author       = {Benjie Chen and Kyle Jamieson and Hari Balakrishnan
and Robert Morris},
volume       = 8,
number       = 5,
year         = 2002,
month        = sep,
journal      = {{ACM} Wireless Networks},
}
Packrat parsing: a practical linear-time algorithm with backtracking Bryan Ford. Master's thesis, Massachusetts Institute of Technology, September 2002.
@mastersthesis{packrat-parsing:ford-ms,
author       = {Bryan Ford},
title        = {Packrat Parsing: a practical linear-time algorithm
with backtracking},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = sep,
}
Choosing Internet paths with high bulk transfer capacity Jacob Strauss. Master's thesis, Massachusetts Institute of Technology, September 2002.
@mastersthesis{chord:jastr-meng,
title        = {Choosing {Internet} Paths with High Bulk Transfer
Capacity},
author       = {Jacob Strauss},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = sep,
}
Network layer support for overlay networks John Jannotti. Ph.D. thesis, Massachusetts Institute of Technology, August 2002.
@phdthesis{jannotti-phd,
title        = {Network Layer Support for Overlay Networks},
author       = {John Jannotti},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = aug,
}
Natron: Overlay routing to oblivious destinations Alexander Yip. Master's thesis, Massachusetts Institute of Technology, August 2002.
@mastersthesis{natron:yip-meng,
title        = {NATRON: Overlay Routing to Oblivious Destinations},
author       = {Alexander Yip},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = aug,
}
Access control lists for the self-certifying filesystem George Savvides. Master's thesis, Massachusetts Institute of Technology, August 2002.
@mastersthesis{sfs:savvides-meng,
title        = {Access Control Lists for the Self-Certifying
Filesystem},
author       = {George Savvides},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = aug,
}
Self-certifying filesystem implementation for Windows David Euresti. Master's thesis, Massachusetts Institute of Technology, August 2002.
@mastersthesis{sfs:euresti-meng,
title        = {Self-Certifying Filesystem Implementation for
{Windows}},
author       = {David Euresti},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = aug,
}
Simple and general statistical profiling with PCT Charles Blake, and Steve Bauer. USENIX 2002.

The Profile Collection Toolkit (PCT) provides a novel generalized CPU profiling facility. PCT enables arbitrarily late profiling activation and arbitrarily early report generation. PCT usually requires no re-compilation, re-linking, or even re-starting of programs. Profiling reports gracefully degrade with available debugging data.

PCT uses its debugger controller, dbctl, to drive a debugger's control over a process. dbctl has a configuration language that allows users to specify context-specific debugger commands. These commands can sample general program state, such as call stacks and function parameters.

For systems or situations with poor debugger support, PCT provides several other portable and flexible collection methods. PCT can track most program code, including code in shared libraries and late-loaded shared objects. On Linux, PCT can seamlessly merge kernel CPU time profiles with user-level CPU profiles to create whole system reports.

@inproceedings{pct:usenix02,
title        = {Simple and General Statistical Profiling with {PCT}},
author       = {Charles Blake and Steve Bauer},
pages        = {333--346},
booktitle    = {Proceedings of the 2002 USENIX Annual Technical
Conference (USENIX '02)},
year         = 2002,
month        = jun,
}
A keyword set search system for peer-to-peer networks Omprakash D Gnawali. Master's thesis, Massachusetts Institute of Technology, June 2002.
@mastersthesis{chord:om_p-meng,
title        = {A Keyword Set Search System for Peer-to-Peer
Networks},
author       = {Omprakash D Gnawali},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = jun,
}
Concurrency control for multi-processor event-driven systems Nickolai Zeldovich. Master's thesis, Massachusetts Institute of Technology, May 2002.
@mastersthesis{sfs:zeldovich-meng,
title        = {Concurrency Control for Multi-Processor Event-Driven
Systems},
author       = {Nickolai Zeldovich},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = may,
}
The Grid Roofnet: a rooftop ad hoc wireless network Benjamin A. Chambers. Master's thesis, Massachusetts Institute of Technology, May 2002.
@mastersthesis{grid:bac-meng,
title        = {The {Grid} {Roofnet}: a Rooftop Ad Hoc Wireless
Network},
author       = {Benjamin A. Chambers},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = may,
}
A peer-to-peer anonymizing network layer Michael J. Freedman. Master's thesis, Massachusetts Institute of Technology, May 2002.

Existing Internet systems implement anonymity at the application layer or through centralized components. A robust, decentralized infrastructure that anonymizes any Internet traffic could benefit a wide array of existing protocols and systems. This anonymous network layer could seamlessly replace the current communications channel, and it could continue to offer anonymity and availability even while components fail maliciously.

This thesis proposes Tarzan, a peer-to-peer anonymous IP network overlay. Because it provides IP service, Tarzan is general-purpose and transparent to applications. Organized as a decentralized peer-to-peer overlay, Tarzan is fault-tolerant, highly scalable, and easy to manage.

Tarzan achieves its anonymity with layered encryption and multi-hop routing, much like a Chaumian mix. A message initiator chooses a path of peers pseudo-randomly in a way that adversaries cannot easily influence. Cover traffic prevents a global observer from drawing conclusions based on traffic analysis as to an initiator's identity.

Tarzan provides anonymity to either clients or servers, without requiring that both participate, presenting the abstraction of a one-way anonymous tunnel. In both cases, Tarzan uses a network address translator (NAT) to bridge between Tarzan hosts and oblivious Internet hosts.

We quantify Tarzan's anonymity properties and show that Tarzan imposes minimal performance overhead over a corresponding non-anonymous overlay route.

@mastersthesis{tarzan:freedman-meng,
title        = {A Peer-to-Peer Anonymizing Network Layer},
author       = {Michael J. Freedman},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = may,
}
Herodotus: A peer-to-peer web archival system Timo Burkard. Master's thesis, Massachusetts Institute of Technology, May 2002.
@mastersthesis{chord:tburkard-meng,
title        = {Herodotus: A Peer-to-Peer Web Archival System},
author       = {Timo Burkard},
school       = {Massachusetts Institute of Technology},
year         = 2002,
month        = may,
}
Efficient peer-to-peer lookup based on a distributed trie Michael J. Freedman, and Radek Vingralek. IPTPS 2002.

Two main approaches have been taken for distributed key/value lookup operations in peer-to-peer systems: broadcast searches and location-deterministic algorithms. We describe a third alternative based on a distributed trie. This algorithm functions well in a very dynamic, hostile environment, offering security benefits over prior proposals. Our approach takes advantage of working-set temporal locality and global key/value distribution skews due to content popularity. Peers gradually learn system state during lookups, receiving the sought values and/or internal information used by the trie. The distributed trie converges to an accurate network map over time. We describe several modes of information piggybacking, and conservative and liberal variants of the basic algorithm for adversarial settings. Simulations show efficient lookups and low failure rates.

@inproceedings{trie:iptps02,
title        = {Efficient Peer-To-Peer Lookup Based on a Distributed
Trie},
author       = {Michael J. Freedman and Radek Vingralek},
booktitle    = {Proceedings of the 1st International Workshop on
Peer-to-Peer Systems (IPTPS)},
year         = 2002,
month        = mar,
}
Serving DNS using Chord Russ Cox, Athicha Muthitacharoen, and Robert Morris. IPTPS 2002.

The current domain name system (DNS) couples ownership of domains with the responsibility of serving data for them. The DNS security extensions (DNSSEC) allow verification of records obtained by alternate means, opening exploration of alternative storage systems for DNS records. We explore one such alternative using DHash, a peer-to-peer distributed hash table built on top of Chord. Our system inherits Chord's fault-tolerance and load balance properties, at the same time eliminating many administrative problems with the current DNS. Still, our system has significantly higher latencies and other disadvantages in comparison with conventional DNS. We use this comparison to draw conclusions about general issues that still need to be addressed in peer-to-peer systems and distributed hash tables in particular.
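The storage scheme can be sketched in a few lines. The key convention below is an assumption for illustration (the paper's exact key format may differ): record sets are stored in the DHT under a hash of the query name and type, and would be DNSSEC-signed in practice.

```python
import hashlib

def dht_key(name: str, rtype: str) -> bytes:
    """Map a (name, type) query onto a DHT key; case-insensitive like DNS."""
    return hashlib.sha1(f"{name.lower()}/{rtype.upper()}".encode()).digest()

dht = {}  # stand-in for DHash's put/get interface

def publish(name, rtype, records):
    dht[dht_key(name, rtype)] = records  # signed record set in a real system

def resolve(name, rtype):
    return dht.get(dht_key(name, rtype))

publish("example.org", "A", ["203.0.113.7"])
assert resolve("EXAMPLE.ORG", "a") == ["203.0.113.7"]
```

Lookups then cost one Chord resolution of `dht_key(name, rtype)` rather than a walk down the DNS delegation hierarchy, which is where the latency comparison in the abstract comes from.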

@inproceedings{chord:dns02,
title        = {Serving {DNS} using {Chord}},
author       = {Russ Cox and Athicha Muthitacharoen and Robert
Morris},
booktitle    = {Proceedings of the 1st International Workshop on
Peer-to-Peer Systems (IPTPS)},
year         = 2002,
month        = mar,
}
Security considerations for peer-to-peer distributed hash tables Emil Sit, and Robert Morris. IPTPS 2002.

Recent peer-to-peer research has focused on providing efficient hash lookup systems that can be used to build more complex systems. These systems have good properties when their algorithms are executed correctly but have not generally considered how to handle misbehaving nodes. This paper looks at what sorts of security problems are inherent in large peer-to-peer systems based on distributed hash lookup systems. We examine the types of problems that such systems might face, drawing examples from existing systems, and propose some design principles for detecting and preventing these problems.

@inproceedings{chord:security02,
title        = {Security Considerations for Peer-to-Peer Distributed
Hash Tables},
author       = {Emil Sit and Robert Morris},
booktitle    = {Proceedings of the 1st International Workshop on
Peer-to-Peer Systems (IPTPS)},
year         = 2002,
month        = mar,
}
Effects of loss rate on ad hoc wireless routing Douglas S. J. De Couto, Daniel Aguayo, Benjamin A. Chambers, and Robert Morris. MIT LCS technical report, March 2002.

This paper uses measurements from two deployed wireless ad hoc networks to illustrate the effects of link loss rates on routing protocol performance. Measurements of these networks show that the radio links between the majority of nodes have substantial loss rates. These loss rates are high enough to decrease forwarding performance, but not high enough to prevent existing ad hoc routing protocols from using the links. Link-level retransmission can mask high loss rates, at the cost of substantial decreases in throughput. Simulations, driven by the observed loss rates, show that the shortest paths chosen by existing routing protocols tend to find routes with much less capacity than is available along the best route.

Based on these observations, we present a routing metric intended to allow routing protocols to find good routes in wireless ad hoc networks. The metric is the expected total number of transmissions required to deliver a packet along a route. This metric favors routes with high throughput and low total impact on spectrum. It is expected to perform better than existing techniques that eliminate links based on loss rate thresholds.

@techreport{grid:losstr01,
title        = {Effects of Loss Rate on Ad Hoc Wireless Routing},
author       = {Douglas S. J. {De Couto} and Daniel Aguayo and
Benjamin A. Chambers and Robert Morris},
institution  = {{MIT} Laboratory for Computer Science},
year         = 2002,
month        = mar,
number       = {MIT-LCS-TR-836},
}
Introducing Tarzan, a peer-to-peer anonymizing network layer Michael J. Freedman, Emil Sit, Josh Cates, and Robert Morris. IPTPS 2002.

We introduce Tarzan, a peer-to-peer anonymous network layer that provides generic IP forwarding. Unlike prior anonymizing layers, Tarzan is flexible, transparent, decentralized, and highly scalable.

Tarzan achieves these properties by building anonymous IP tunnels between an open-ended set of peers. Tarzan can provide anonymity to existing applications, such as web browsing and file sharing, without change to those applications. Performance tests show that Tarzan imposes minimal overhead over a corresponding non-anonymous overlay route.

@inproceedings{tarzan:iptps02,
title        = {Introducing {Tarzan}, a Peer-to-Peer Anonymizing
Network Layer},
author       = {Michael J. Freedman and Emil Sit and Josh Cates and
Robert Morris},
booktitle    = {Proceedings of the 1st International Workshop on
Peer-to-Peer Systems (IPTPS)},
year         = 2002,
month        = mar,
}
Fast and secure distributed read-only file system Kevin Fu, M. Frans Kaashoek, and David Mazières. ACM Transactions on Computer Systems 20(1), February 2002.
@article{sfsro:tocs2002,
title        = {{F}ast and secure distributed read-only file system},
author       = {Kevin Fu and M. Frans Kaashoek and David
Mazi{\`e}res},
volume       = 20,
number       = 1,
year         = 2002,
month        = feb,
pages        = {1--24},
journal      = {{ACM} Transactions on Computer Systems},
}
Fast and flexible application-level networking on exokernel systems Gregory R. Ganger, Dawson R. Engler, M. Frans Kaashoek, Hector M. Briceno, Russell Hunt, and Thomas Pinckney. ACM Transactions on Computer Systems 20(1), February 2002.

Application-level networking is a promising software organization for improving performance and functionality for important network services. The Xok/ExOS exokernel system includes application-level support for standard network services, while at the same time allowing application writers to specialize networking services. This paper describes how Xok/ExOS's kernel mechanisms and library operating system organization achieve this flexibility, and retrospectively shares our experiences and lessons learned (both positive and negative). It also describes how we used this flexibility to build and specialize three network data services: the Cheetah HTTP server, the webswamp Web benchmarking tool, and an application-level TCP forwarder. Overall measurements show large performance improvements relative to similar services built on conventional interfaces, in each case reaching the maximum possible end-to-end performance for the experimental platform. For example, Cheetah provides factor of 2-4 increases in throughput compared to highly tuned socket-based implementations and factor of 3-8 increases compared to conventional systems. Webswamp can offer loads that are two to eight times heavier. The TCP forwarder provides 50-300% higher throughput while also providing end-to-end TCP semantics that cannot be achieved with POSIX sockets. With more detailed measurements and profiling, these overall performance improvements are also broken down and attributed to the specific specializations described, providing server writers with insights into where to focus their optimization efforts.

@article{exo:tocs2002,
title        = {{F}ast and flexible Application-Level Networking on
Exokernel Systems},
author       = {Gregory R. Ganger and Dawson R. Engler and M. Frans
Kaashoek and Hector M. Briceno and Russell Hunt and Thomas
Pinckney},
volume       = 20,
number       = 1,
year         = 2002,
month        = feb,
pages        = {49--83},
journal      = {{ACM} Transactions on Computer Systems},
}
DNS performance and the effectiveness of caching Jaeyeon Jung, Emil Sit, Hari Balakrishnan, and Robert Morris. IEEE/ACM Transactions on Networking 10(5), 2002.
@article{dns:ton,
author       = {Jaeyeon Jung and Emil Sit and Hari Balakrishnan and
Robert Morris},
title        = {{DNS} performance and the effectiveness of caching},
volume       = 10,
number       = 5,
year         = 2002,
pages        = {589--603},
publisher    = {IEEE Press},
journal      = {{IEEE/ACM} Transactions on Networking},
}

## 2001

DNS performance and the effectiveness of caching Jaeyeon Jung, Emil Sit, Hari Balakrishnan, and Robert Morris. SIGCOMM 2001.
@inproceedings{dnscache:sigcommimw01,
title        = {{DNS} Performance and the Effectiveness of Caching},
author       = {Jaeyeon Jung and Emil Sit and Hari Balakrishnan and
Robert Morris},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} Internet
Measurement Workshop '01},
year         = 2001,
month        = nov,
}
Wide-area cooperative storage with CFS Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. SOSP 2001.

The Cooperative File System (CFS) is a new peer-to-peer read-only storage system that provides provable guarantees for the efficiency, robustness, and load-balance of file storage and retrieval. CFS does this with a completely decentralized architecture that can scale to large systems. CFS servers provide a distributed hash table (DHash) for block storage. CFS clients interpret DHash blocks as a file system. DHash distributes and caches blocks at a fine granularity to achieve load balance, uses replication for robustness, and decreases latency with server selection. DHash finds blocks using the Chord location protocol, which operates in time logarithmic in the number of servers.

CFS is implemented using the SFS file system toolkit and runs on Linux, OpenBSD, and FreeBSD. Experience on a globally deployed prototype shows that CFS delivers data to clients as fast as FTP. Controlled tests show that CFS is scalable: with 4,096 servers, looking up a block of data involves contacting only seven servers. The tests also demonstrate nearly perfect robustness and unimpaired performance even when as many as half the servers fail.

@inproceedings{cfs:sosp01,
title        = {Wide-area cooperative storage with {CFS}},
author       = {Frank Dabek and M. Frans Kaashoek and David Karger
and Robert Morris and Ion Stoica},
booktitle    = {Proceedings of the 18th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '01)},
year         = 2001,
month        = oct,
}
Resilient overlay networks David Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. SOSP 2001.
@inproceedings{ron:sosp01,
title        = {Resilient Overlay Networks},
author       = {David Andersen and Hari Balakrishnan and M. Frans
Kaashoek and Robert Morris},
booktitle    = {Proceedings of the 18th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '01)},
year         = 2001,
month        = oct,
}
A low-bandwidth network file system Athicha Muthitacharoen, Benjie Chen, and David Mazières. SOSP 2001.

Users rarely consider running network file systems over slow or wide-area networks, as the performance would be unacceptable and the bandwidth consumption too high. Nonetheless, efficient remote file access would often be desirable over such networks---particularly when high latency makes remote login sessions unresponsive. Rather than run interactive programs such as editors remotely, users could run the programs locally and manipulate remote files through the file system. To do so, however, would require a network file system that consumes less bandwidth than most current file systems.

This paper presents LBFS, a network file system designed for low-bandwidth networks. LBFS exploits similarities between files or versions of the same file to save bandwidth. It avoids sending data over the network when the same data can already be found in the server's file system or the client's cache. Using this technique in conjunction with conventional compression and caching, LBFS consumes over an order of magnitude less bandwidth than traditional network file systems on common workloads.
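The similarity detection above rests on content-defined chunking: a file is cut wherever a hash of a sliding window of its bytes matches a fixed bit pattern, so identical regions produce identical chunks no matter where they sit in a file. A minimal sketch, assuming a 48-byte window and a 13-bit pattern as in LBFS (which yields roughly 8 KB average chunks); LBFS proper uses Rabin fingerprints, for which the plain rolling polynomial hash below is only a stand-in:

```python
def chunk_boundaries(data, window=48, mask=0x1FFF):
    """Content-defined chunking in the LBFS style: declare a chunk
    boundary wherever the hash of the trailing `window` bytes matches
    the target bit pattern.  Because the hash covers only the trailing
    window, an insertion near the start of a file shifts later
    boundaries without changing them."""
    MOD, BASE = 1 << 32, 257
    pow_w = pow(BASE, window, MOD)   # weight of the byte leaving the window
    cuts, h = [], 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD     # slide the new byte in
        if i >= window:
            h = (h - data[i - window] * pow_w) % MOD  # drop the oldest byte
        if i + 1 >= window and (h & mask) == mask:
            cuts.append(i + 1)       # chunk boundary after byte i
    return cuts
```

Unchanged regions after an edit therefore hash into chunks the server or client cache already holds, which is what lets LBFS skip sending them.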

@inproceedings{lbfs:sosp01,
title        = {A Low-bandwidth Network File System},
author       = {Athicha Muthitacharoen and Benjie Chen and David
Mazi{\`e}res},
pages        = {174--187},
booktitle    = {Proceedings of the 18th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '01)},
year         = 2001,
month        = oct,
}
A cooperative file system Frank Dabek. Master's thesis, Massachusetts Institute of Technology, September 2001.
@mastersthesis{cfs:dabek-meng,
title        = {A Cooperative File System},
author       = {Frank Dabek},
school       = {Massachusetts Institute of Technology},
year         = 2001,
month        = sep,
}
A case study of server selection Tina Tyan. Master's thesis, Massachusetts Institute of Technology, September 2001.
@mastersthesis{chord:tyan-meng,
title        = {A Case Study of Server Selection},
author       = {Tina Tyan},
school       = {Massachusetts Institute of Technology},
year         = 2001,
month        = sep,
}
Chord: A scalable peer-to-peer lookup service for Internet applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. SIGCOMM 2001.

A fundamental problem that confronts peer-to-peer applications is to efficiently locate the node that stores a particular data item. This paper presents Chord, a distributed lookup protocol that addresses this problem. Chord provides support for just one operation: given a key, it maps the key onto a node. Data location can be easily implemented on top of Chord by associating a key with each data item, and storing the key/data item pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. Results from theoretical analysis, simulations, and experiments show that Chord is scalable, with communication cost and the state maintained by each node scaling logarithmically with the number of Chord nodes.
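The single operation the abstract describes, mapping a key onto a node, can be sketched with a centralized toy (illustrative names only; real Chord reaches the same successor in O(log N) hops via finger tables, without any node holding a global view):

```python
import hashlib
from bisect import bisect_left

def chord_id(name, bits=16):
    """Map a name onto the identifier circle (real Chord uses the full
    SHA-1 output; truncating keeps the toy readable)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % (1 << bits)

class Ring:
    """Toy global view of a Chord ring: a key is stored at its
    successor, the first node clockwise from the key's identifier."""
    def __init__(self, node_names):
        self.ids = sorted(chord_id(n) for n in node_names)

    def successor(self, key):
        i = bisect_left(self.ids, chord_id(key))
        return self.ids[i % len(self.ids)]  # wrap past the top of the circle
```

Because placement is determined entirely by hashing, a node join or leave only moves the keys between the affected node and its predecessor, which is what keeps the protocol cheap under churn.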

@inproceedings{chord:sigcomm01,
title        = {Chord: A Scalable Peer-to-peer Lookup Service for
{Internet} Applications},
author       = {Ion Stoica and Robert Morris and David Karger and M.
Frans Kaashoek and Hari Balakrishnan},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} '01 Conference},
year         = 2001,
month        = aug,
}
Dos and don'ts of client authentication on the web Kevin Fu, Emil Sit, Kendra Smith, and Nick Feamster. USENIX Security 2001.

Client authentication has been a continuous source of problems on the Web. Although many well-studied techniques exist for authentication, Web sites continue to use extremely weak authentication schemes, especially in non-enterprise environments such as store fronts. These weaknesses often result from careless use of authenticators within Web cookies. Of the twenty-seven sites we investigated, we weakened the client authentication on two systems, gained unauthorized access on eight, and extracted the secret key used to mint authenticators from one.

We provide a description of the limitations, requirements, and security models specific to Web client authentication. This includes the introduction of the interrogative adversary, a surprisingly powerful adversary that can adaptively query a Web site.

We propose a set of hints for designing a secure client authentication scheme. Using these hints, we present the design and analysis of a simple authentication scheme secure against forgeries by the interrogative adversary. In conjunction with SSL, our scheme is secure against forgeries by the active adversary.

@inproceedings{webauth:sec10,
title        = {{D}os and Don'ts of Client Authentication on the Web},
author       = {Kevin Fu and Emil Sit and Kendra Smith and Nick
Feamster},
note         = {An extended version is available as MIT-LCS-TR-818},
booktitle    = {Proceedings of the 10th {USENIX} {S}ecurity
{S}ymposium},
year         = 2001,
month        = aug,
}
Span: An energy-efficient coordination algorithm for topology maintenance in ad hoc wireless networks Benjie Chen, Kyle Jamieson, Hari Balakrishnan, and Robert Morris. MobiCom 2001.

This paper presents Span, a power saving technique for multi-hop ad hoc wireless networks that reduces energy consumption without significantly diminishing the capacity or connectivity of the network. Span builds on the observation that when a region of a shared-channel wireless network has a sufficient density of nodes, only a small number of them need be on at any time to forward traffic for active connections.

Span is a distributed, randomized algorithm where nodes make local decisions on whether to sleep, or to join a forwarding backbone as a coordinator. Each node bases its decision on an estimate of how many of its neighbors will benefit from it being awake, and the amount of energy available to it. We give a randomized algorithm where coordinators rotate with time, demonstrating how localized node decisions lead to a connected, capacity-preserving global topology.

Improvement in system lifetime due to Span increases as the ratio of idle-to-sleep energy consumption increases, and increases as the density of the network increases. For example, our simulations show that with a practical energy model, system lifetime of an 802.11 network in power saving mode with Span is a factor of two better than without. Span integrates nicely with 802.11---when run in conjunction with the 802.11 power saving mode, Span improves communication latency, capacity, and system lifetime.
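The local decision at the heart of Span is its eligibility rule: a node volunteers as coordinator if some pair of its neighbors cannot reach each other directly or through one or two existing coordinators. A sketch with a global adjacency map for readability (Span proper also randomizes the announcement delay by remaining energy and by how many pairs the node would connect, which is omitted here):

```python
from itertools import combinations

def should_volunteer(me, adj, coordinators):
    """Span's coordinator eligibility rule, sketched: volunteer if two
    neighbors of `me` cannot reach each other directly or via one or
    two coordinators.  `adj` maps each node to its radio neighbors."""
    coords = [c for c in coordinators if c != me]
    for a, b in combinations(adj[me], 2):
        if b in adj[a]:
            continue                       # reachable directly
        if any(c in adj[a] and b in adj[c] for c in coords):
            continue                       # reachable via one coordinator
        if any(c1 in adj[a] and c2 in adj[c1] and b in adj[c2]
               for c1 in coords for c2 in coords if c1 != c2):
            continue                       # reachable via two coordinators
        return True                        # some pair is stranded: volunteer
    return False
```

Nodes whose neighbors are already served sleep instead, which is how the backbone stays small while preserving connectivity.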

@inproceedings{span:mobicom01,
title        = {Span: An Energy-Efficient Coordination Algorithm for
Topology Maintenance in Ad Hoc Wireless Networks},
author       = {Benjie Chen and Kyle Jamieson and Hari Balakrishnan
and Robert Morris},
pages        = {85--96},
booktitle    = {Proceedings of the 7th {ACM} International
Conference on Mobile Computing and Networking ({MobiCom} '01)},
year         = 2001,
month        = jul,
}
Capacity of ad hoc wireless networks Jinyang Li, Charles Blake, Douglas S. J. De Couto, Hu Imm Lee, and Robert Morris. MobiCom 2001.

Early simulation experience with wireless ad hoc networks suggests that their capacity can be surprisingly low, due to the requirement that nodes forward each others' packets. The achievable capacity depends on network size, traffic patterns, and detailed local radio interactions. This paper examines these factors alone and in combination, using simulation and analysis from first principles. Our results include both specific constants and general scaling relationships helpful in understanding the limitations of wireless ad hoc networks.

We examine interactions of the 802.11 MAC and ad hoc forwarding and the effect on capacity for several simple configurations and traffic patterns. While 802.11 discovers reasonably good schedules, we nonetheless observe capacities markedly less than optimal for very simple chain and lattice networks with very regular traffic patterns. We validate some simulation results with experiments.

We also show that the traffic pattern determines whether an ad hoc network's per node capacity will scale to large networks. In particular, we show that for total capacity to scale up with network size the average distance between source and destination nodes must remain small as the network grows. Non-local traffic patterns in which this average distance grows with the network size result in a rapid decrease of per node capacity. Thus the question "Are large ad hoc networks feasible?" reduces to a question about the likely locality of communication in such networks.

@inproceedings{grid:mobicom01,
title        = {Capacity of Ad Hoc Wireless Networks},
author       = {Jinyang Li and Charles Blake and Douglas S. J. {De
Couto} and Hu Imm Lee and Robert Morris},
pages        = {61--69},
booktitle    = {Proceedings of the 7th {ACM} International
Conference on Mobile Computing and Networking ({MobiCom} '01)},
year         = 2001,
month        = jul,
}
Location proxies and intermediate node forwarding for practical geographic forwarding Douglas S. J. De Couto and Robert Morris. MIT LCS technical report, June 2001.

Two main problems prevent the deployment of geographic forwarding in real systems: geographic forwarding requires that all nodes know their locations, and it has trouble routing around local dead ends. This paper presents practical solutions to each problem.

The location proxy technique allows a node that does not know its location to find a nearby location aware node to use as a proxy for geographic forwarding. The technique works well over a large range of densities of location aware nodes, and allows a tradeoff between bandwidth used for routing information and expense of providing location information.

The intermediate node forwarding (INF) mechanism is a probabilistic solution for routing around bad geographic topologies via intermediate geographic locations. Existing solutions unrealistically assume that nodes have identical radio propagation; INF works on a restricted set of situations but makes assumptions that better match reality.

Experiments using the ns simulator show that location proxies and INF are effective enough to make geographic forwarding practical. We believe geographic forwarding will enable scalable ad hoc networking.

@techreport{grid:proxytr01,
title        = {Location Proxies and Intermediate Node Forwarding
for Practical Geographic Forwarding},
author       = {Douglas S. J. {De Couto} and Robert Morris},
institution  = {{MIT} Laboratory for Computer Science},
year         = 2001,
month        = jun,
number       = {MIT-LCS-TR-824},
}
Flexible control of parallelism in a multiprocessor PC router Benjie Chen and Robert Morris. USENIX 2001.

SMP Click is a software router that provides both flexibility and high performance on stock multiprocessor PC hardware. It achieves high performance using device, buffer, and queue management techniques optimized for multiprocessor routing. It allows vendors or network administrators to configure the router in a way that indicates parallelizable packet processing tasks, and adaptively load-balances those tasks across the available CPUs.

SMP Click's absolute performance is high: it can forward 494,000 64-byte IP packets per second on a 2-CPU 500 MHz Intel Xeon machine, compared to 302,000 packets per second for uniprocessor Click. SMP Click also scales well for CPU intensive tasks: 4-CPU SMP Click can encrypt and forward 87,000 64-byte packets per second using IPSec 3DES, compared to 23,000 packets per second for uniprocessor Click.

@inproceedings{click:usenix01,
title        = {Flexible Control of Parallelism in a Multiprocessor
{PC} Router},
author       = {Benjie Chen and Robert Morris},
pages        = {333--346},
booktitle    = {Proceedings of the 2001 USENIX Annual Technical
Conference (USENIX '01)},
year         = 2001,
month        = jun,
}
Building peer-to-peer systems with Chord, a distributed lookup service Frank Dabek, Emma Brunskill, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica, and Hari Balakrishnan. HotOS 2001.

We argue that the core problem facing peer-to-peer systems is locating documents in a decentralized network and propose Chord, a distributed lookup primitive. Chord provides an efficient method of locating documents while placing few constraints on the applications that use it. As proof that Chord's functionality is useful in the development of peer-to-peer applications, we outline the implementation of a peer-to-peer file sharing system based on Chord.

@inproceedings{chord:hotos,
title        = {Building Peer-to-Peer Systems With {Chord}, a
Distributed Lookup Service},
author       = {Frank Dabek and Emma Brunskill and M. Frans Kaashoek
and David Karger and Robert Morris and Ion Stoica and Hari
Balakrishnan},
booktitle    = {Proceedings of the 8th {W}orkshop on {H}ot {T}opics
in {O}perating {S}ystems ({HotOS-VIII})},
year         = 2001,
month        = may,
organization = {{IEEE} {C}omputer {S}ociety},
}
Resilient overlay networks David Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. HotOS 2001.
@inproceedings{ron:hotos8,
title        = {Resilient Overlay Networks},
author       = {David Andersen and Hari Balakrishnan and M. Frans
Kaashoek and Robert Morris},
booktitle    = {Proceedings of the 8th {W}orkshop on {H}ot {T}opics
in {O}perating {S}ystems ({HotOS-VIII})},
year         = 2001,
month        = may,
organization = {{IEEE} {C}omputer {S}ociety},
}
Dos and don'ts of client authentication on the web Kevin Fu, Emil Sit, Kendra Smith, and Nick Feamster. MIT LCS technical report, May 2001.
@techreport{webauth:tr,
title        = {{D}os and Don'ts of Client Authentication on the Web},
author       = {Kevin Fu and Emil Sit and Kendra Smith and Nick
Feamster},
institution  = {{MIT} Laboratory for Computer Science},
year         = 2001,
month        = may,
number       = {MIT-LCS-TR-818},
}
Reconsidering internet mobility Alex C. Snoeren, Hari Balakrishnan, and M. Frans Kaashoek. HotOS 2001.
@inproceedings{migrate:hotos8,
title        = {Reconsidering Internet Mobility},
author       = {Alex C. Snoeren and Hari Balakrishnan and M. Frans
Kaashoek},
booktitle    = {Proceedings of the 8th {W}orkshop on {H}ot {T}opics
in {O}perating {S}ystems ({HotOS-VIII})},
year         = 2001,
month        = may,
organization = {{IEEE} {C}omputer {S}ociety},
}

## 2000

Modular components for network address translation Eddie Kohler, Robert Morris, and Massimiliano Poletto. MIT LCS Click Project technical report, December 2000.

We present a general-purpose toolkit for network address translation in a modular, component-based networking system. Network address translation is a powerful, general technique for building network applications, such as allowing disparate address realms to communicate, load balancing among servers, and changing ordinary proxies into transparent proxies. The components of our toolkit can be combined in a variety of ways to implement these applications and others. The context of this work, the Click modular networking system, makes the NAT components simpler and more understandable. For example, the NAT components concern themselves solely with address translation; related functions, such as classification, are implemented by separate components. This design is more flexible than most existing NAT implementations. The user can choose where network address translation takes place in relation to other router functions, and can use multiple translators in a single configuration. These components have been in use in a production environment for several months.

We describe our design approach, demonstrate its flexibility by presenting a range of examples of its use, and evaluate its performance.

@techreport{click:rewritertr,
title        = {Modular components for network address translation},
author       = {Eddie Kohler and Robert Morris and Massimiliano
Poletto},
institution  = {{MIT} Laboratory for Computer Science Click Project},
year         = 2000,
month        = dec,
note         = {http://www.pdos.lcs.mit.edu/papers/click-rewriter/},
}
The Click modular router Eddie Kohler. Ph.D. thesis, Massachusetts Institute of Technology, November 2000.
@phdthesis{click:kohler-phd,
title        = {The {Click} modular router},
author       = {Eddie Kohler},
school       = {Massachusetts Institute of Technology},
year         = 2000,
month        = nov,
}
Fast and secure distributed read-only file system Kevin Fu, M. Frans Kaashoek, and David Mazières. OSDI 2000.

Internet users increasingly rely on publicly available data for everything from software installation to investment decisions. Unfortunately, the vast majority of public content on the Internet comes with no integrity or authenticity guarantees. This paper presents the self-certifying read-only file system, a content distribution system providing secure, scalable access to public, read-only data.

The read-only file system makes the security of published content independent from that of the distribution infrastructure. In a secure area (perhaps off-line), a publisher creates a digitally-signed database out of a file system's contents. The publisher then replicates the database on untrusted content-distribution servers, allowing for high availability. The read-only file system protocol furthermore pushes the cryptographic cost of content verification entirely onto clients, allowing servers to scale to a large number of clients. Measurements of an implementation show that an individual server running on a 550 Mhz Pentium III with FreeBSD can support 1,012 connections per second and 300 concurrent clients compiling a large software package.

@inproceedings{sfsro:osdi2000,
title        = {{F}ast and secure distributed read-only file system},
author       = {Kevin Fu and M. Frans Kaashoek and David
Mazi{\`e}res},
pages        = {181--196},
booktitle    = {Proceedings of the 4th {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} 2000)},
year         = 2000,
month        = oct,
}
CarNet: A scalable ad hoc wireless network system Robert Morris, John Jannotti, Frans Kaashoek, Jinyang Li, and Douglas S. J. De Couto. SIGOPS European Workshop 2000.

CarNet is an application for a large ad hoc mobile network system that scales well without requiring a fixed network infrastructure to route messages. CarNet places radio nodes in cars, which communicate using Grid, a novel scalable routing system. Grid uses geographic forwarding and a scalable distributed location service to route packets from car to car without flooding the network. CarNet will support IP connectivity as well as applications such as cooperative highway congestion monitoring, fleet tracking, and discovery of nearby points of interest.

@inproceedings{grid:sigops-euro9,
title        = {{C}ar{N}et: A Scalable Ad Hoc Wireless Network
System},
author       = {Robert Morris and John Jannotti and Frans Kaashoek
and Jinyang Li and Douglas S. J. {De Couto}},
note         = {The published version incorrectly lists Douglas De
Couto's name},
booktitle    = {Proceedings of the 9th {ACM} {SIGOPS} {E}uropean
workshop: Beyond the {PC}: New Challenges for the Operating System},
year         = 2000,
month        = sep,
}
A scalable location service for geographic ad hoc routing Jinyang Li, John Jannotti, Douglas S. J. De Couto, David R. Karger, and Robert Morris. MobiCom 2000.

GLS is a new distributed location service which tracks mobile node locations. GLS combined with geographic forwarding allows the construction of ad hoc mobile networks that scale to a larger number of nodes than possible with previous work. GLS is decentralized and runs on the mobile nodes themselves, requiring no fixed infrastructure. Each mobile node periodically updates a small set of other nodes (its location servers) with its current location. A node sends its position updates to its location servers without knowing their actual identities, assisted by a predefined ordering of node identifiers and a predefined geographic hierarchy. Queries for a mobile node's location also use the predefined identifier ordering and spatial hierarchy to find a location server for that node.

Experiments using the ns simulator for up to 600 mobile nodes show that the storage and bandwidth requirements of GLS grow slowly with the size of the network. Furthermore, GLS tolerates node failures well: each failure has only a limited effect and query performance degrades gracefully as nodes fail and restart. The query performance of GLS is also relatively insensitive to node speeds. Simple geographic forwarding combined with GLS compares favorably with Dynamic Source Routing (DSR): in larger networks (over 200 nodes) our approach delivers more packets, but consumes fewer network resources.
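The server-selection rule the abstract alludes to is small enough to sketch: within each square of the geographic hierarchy, a node recruits the candidate whose ID is the least ID greater than its own, wrapping around the circular ID space. A single-square toy (GLS applies this rule recursively at every level of the hierarchy):

```python
def location_server(target_id, candidate_ids):
    """GLS's choice within one square of the grid hierarchy: the node
    with the least ID greater than the target's, wrapping circularly,
    so any querier can recompute the choice from the target's ID alone."""
    greater = [i for i in candidate_ids if i > target_id]
    return min(greater) if greater else min(candidate_ids)
```

Since updater and querier apply the same deterministic rule to the same inputs, no directory of "who serves whom" is needed, which is what lets nodes send updates to servers whose identities they never learn in advance.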

@inproceedings{grid:mobicom00,
title        = {A Scalable Location Service for Geographic Ad Hoc
Routing},
author       = {Jinyang Li and John Jannotti and Douglas S. J. {De
Couto} and David R. Karger and Robert Morris},
pages        = {120--130},
booktitle    = {Proceedings of the 6th {ACM} International
Conference on Mobile Computing and Networking ({MobiCom} '00)},
year         = 2000,
month        = aug,
}
Programming language techniques for modular router configurations Eddie Kohler, Benjie Chen, M. Frans Kaashoek, Robert Morris, and Massimiliano Poletto. MIT LCS technical report, August 2000.

This paper applies programming language techniques to a high-level system description, both to optimize the system and to prove useful properties about it. The system in question is Click, a modular software router framework. Click routers are built from components called elements. Elements are written in C++, but the user creates a configuration using a simple, declarative data flow language. This language is amenable to data flow analysis and other conventional programming language techniques. Applied to a router configuration, these techniques have high-level results---for example, optimizing the router or verifying its high-level properties. This paper describes several programming language techniques that have been useful in practice, including optimization tools that remove virtual function calls from router definitions and remove redundant parts of adjacent routers. We also present performance results for an extensively optimized standards-compliant IP router. On conventional PC hardware, this router can forward up to 456,000 64-byte packets per second.

@techreport{click:lcstr00,
title        = {Programming language techniques for modular router
configurations},
author       = {Eddie Kohler and Benjie Chen and M. Frans Kaashoek
and Robert Morris and Massimiliano Poletto},
institution  = {{MIT} Laboratory for Computer Science},
year         = 2000,
month        = aug,
number       = {MIT-LCS-TR-812},
}
The Click modular router Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. ACM Transactions on Computer Systems 18(3), August 2000.

Click is a new software architecture for building flexible and configurable routers. A Click router is assembled from packet processing modules called elements. Individual elements implement simple router functions like packet classification, queueing, scheduling, and interfacing with network devices. A router configuration is a directed graph with elements at the vertices; packets flow along the edges of the graph. Several features make individual elements more powerful and complex configurations easier to write, including pull connections, which model packet flow driven by transmitting hardware devices, and flow-based router context, which helps an element locate other interesting elements.

Click configurations are modular and easy to extend. A standards-compliant Click IP router has sixteen elements on its forwarding path; some of its elements are also useful in Ethernet switches and IP tunneling configurations. Extending the IP router to support dropping policies, fairness among flows, or Differentiated Services simply requires adding a couple elements at the right place. On conventional PC hardware, the Click IP router achieves a maximum loss-free forwarding rate of 333,000 64-byte packets per second, demonstrating that Click's modular and flexible architecture is compatible with good performance.
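The push/pull distinction above can be illustrated with toy elements (a Python sketch of the idea, not Click's actual C++ element API): push processing is driven by arriving packets, pull processing by the transmitting device, and a queue element is where the two meet.

```python
class Queue:
    """Click-style queue element: packets are pushed in by upstream
    elements and pulled out by transmit-driven downstream elements."""
    def __init__(self):
        self._pkts = []
    def push(self, pkt):
        self._pkts.append(pkt)
    def pull(self):
        return self._pkts.pop(0) if self._pkts else None

class Classifier:
    """Push element with several outputs: forwards each packet along
    the edge chosen by `key`, like classification in the element graph."""
    def __init__(self, key, outputs):
        self.key, self.outputs = key, outputs
    def push(self, pkt):
        self.outputs[self.key(pkt)].push(pkt)

# A tiny configuration graph: classify by parity into two queues;
# a device-driven loop would then pull from each queue to transmit.
q_even, q_odd = Queue(), Queue()
cl = Classifier(lambda pkt: pkt % 2, [q_even, q_odd])
for pkt in range(6):
    cl.push(pkt)
```

Composing a router as such a graph is what makes extensions like new dropping policies a matter of inserting an element on the right edge.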

@article{click:tocs00,
title        = {The {Click} modular router},
author       = {Eddie Kohler and Robert Morris and Benjie Chen and
John Jannotti and M. Frans Kaashoek},
volume       = 18,
number       = 3,
year         = 2000,
month        = aug,
pages        = {263--297},
journal      = {{ACM} Transactions on Computer Systems},
}
Multops: a data structure for denial-of-service attack detection Thomer M. Gil. Master's thesis, Vrije Universiteit, August 2000.
@mastersthesis{click:gil-ms,
title        = {MULTOPS: a data structure for denial-of-service
attack detection},
author       = {Thomer M. Gil},
school       = {Vrije Universiteit},
year         = 2000,
month        = aug,
}
Self-certifying file system David Mazières. Ph.D. thesis, Massachusetts Institute of Technology, May 2000.
@phdthesis{sfs:mazieres-phd,
title        = {Self-certifying File System},
author       = {David Mazi{\`e}res},
school       = {Massachusetts Institute of Technology},
year         = 2000,
month        = may,
}
A study of caching in the Internet domain name system Emil Sit. Master's thesis, Massachusetts Institute of Technology, May 2000.
@mastersthesis{click:sit-ms,
title        = {A Study of Caching in the Internet Domain Name
System},
author       = {Emil Sit},
school       = {Massachusetts Institute of Technology},
year         = 2000,
month        = may,
}
Flexible key management with SFS agents Michael Kaminsky. Master's thesis, Massachusetts Institute of Technology, May 2000.
@mastersthesis{sfs:kaminsky-ms,
title        = {Flexible Key Management with {SFS} Agents},
author       = {Michael Kaminsky},
school       = {Massachusetts Institute of Technology},
year         = 2000,
month        = may,
}
Multiprocessing with the Exokernel operating system Benjie Chen. Master's thesis, Massachusetts Institute of Technology, February 2000.

Exokernel is a minimal operating system kernel that safely multiplexes hardware resources, while leaving all system abstractions to applications. An exokernel exhibits better performance and offers more functionality because applications can provide optimized system abstractions, at user level, based on their needs. The current design of the exokernel system, however, does not support multiprocessor architectures. This thesis presents a symmetric multiprocessing exokernel and demonstrates that unprivileged library implementation of operating system abstractions is viable on a multiprocessor system.

This thesis focuses on three issues. First, it presents the synchronization strategies used in the kernel. Second, it describes three new exokernel interfaces: message passing, kernel support for multithreading, and multiprocessor scheduling. Third, because exokernel applications do not trust each other, traditional synchronization primitives used to guard system abstractions, such as voluntary memory locks, do not function well. This thesis presents and evaluates a strategy for synchronization among untrusted processes. A multiprocessor exokernel and a synchronized library operating system result from this thesis. Performance analysis shows that the overheads of synchronization in both the kernel and the library operating system are small.

@mastersthesis{exo:chen-meng,
title        = {Multiprocessing with the {Exokernel} Operating
System},
author       = {Benjie Chen},
school       = {Massachusetts Institute of Technology},
year         = 2000,
month        = feb,
}

## 1999

The Click modular router Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek. SOSP 1999.

Please consider citing the journal version of this paper, which has improved explanations, more detail and examples, and significantly better performance results.

Click is a new software architecture for building flexible and configurable routers. A Click router is assembled from packet processing modules called elements. Individual elements implement simple router functions like packet classification, queueing, scheduling, and interfacing with network devices. Complete configurations are built by connecting elements into a graph; packets flow along the graph's edges. Several features make individual elements more powerful and complex configurations easier to write, including pull processing, which models packet flow driven by transmitting interfaces, and flow-based router context, which helps an element locate other interesting elements.

We demonstrate several working configurations, including an IP router and an Ethernet bridge. These configurations are modular---the IP router has 16 elements on the forwarding path---and easy to extend by adding additional elements, which we demonstrate with augmented configurations. On commodity PC hardware running Linux, the Click IP router can forward 64-byte packets at 73,000 packets per second, just 10% slower than Linux alone.

@inproceedings{click:sosp99,
title        = {The {Click} modular router},
author       = {Robert Morris and Eddie Kohler and John Jannotti and
M. Frans Kaashoek},
pages        = {217--231},
booktitle    = {Proceedings of the 17th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '99)},
year         = 1999,
month        = dec,
address      = {Kiawah Island, South Carolina},
}
Separating key management from file system security David Mazières, Michael Kaminsky, M. Frans Kaashoek, and Emmett Witchel. SOSP 1999.
@inproceedings{sfs:sosp99,
title        = {{S}eparating key management from file system
security},
author       = {David Mazi{\`e}res and Michael Kaminsky and M. Frans
Kaashoek and Emmett Witchel},
pages        = {124--139},
booktitle    = {Proceedings of the 17th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '99)},
year         = 1999,
month        = dec,
address      = {Kiawah Island, South Carolina},
}
Linear scan register allocation Massimiliano Poletto and Vivek Sarkar. ACM Transactions on Programming Languages and Systems 21(5), September 1999.
@article{linearscan,
title        = {Linear scan register allocation},
author       = {Massimiliano Poletto and Vivek Sarkar},
volume       = 21,
number       = 5,
year         = 1999,
month        = sep,
pages        = {895--913},
journal      = {{ACM} Transactions on Programming Languages and
Systems},
}
A readable TCP in the Prolac protocol language Eddie Kohler, M. Frans Kaashoek, and David R. Montgomery. SIGCOMM 1999.

Prolac is a new statically-typed, object-oriented language for network protocol implementation. It is designed for readability, extensibility, and real-world implementation; most previous protocol languages, in contrast, have been based on hard-to-implement theoretical models and have focused on verification. We present a working Prolac TCP implementation directly derived from 4.4BSD. Our implementation is modular---protocol processing is logically divided into minimally-interacting pieces; readable---Prolac encourages top-down structure and the naming of intermediate computations; and extensible---subclassing cleanly separates protocol extensions like delayed acknowledgements and slow start. The Prolac compiler uses simple global analysis to remove expensive language features like dynamic dispatch, resulting in end-to-end performance comparable to an unmodified Linux 2.0 TCP.

@inproceedings{prolac:sigcomm99,
title        = {A readable {TCP} in the {Prolac} protocol language},
author       = {Eddie Kohler and M. Frans Kaashoek and David R.
Montgomery},
pages        = {3--13},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} '99 Conference:
Applications, Technologies, Architectures, and Protocols for
Computer Communication},
year         = 1999,
month        = aug,
}
Language and compiler support for dynamic code generation Massimiliano Poletto. Ph.D. thesis, Massachusetts Institute of Technology, June 1999.
@phdthesis{tickc:poletto-phd,
title        = {Language and compiler support for dynamic code
generation},
author       = {Massimiliano Poletto},
school       = {Massachusetts Institute of Technology},
year         = 1999,
month        = jun,
}
The secure file system under Windows NT Matthew Rimer. Master's thesis, Massachusetts Institute of Technology, June 1999.
@mastersthesis{sfs:rimer-ms,
title        = {The Secure File System under {Windows NT}},
author       = {Matthew Rimer},
school       = {Massachusetts Institute of Technology},
year         = 1999,
month        = jun,
}
Evolving software with an application-specific language Eddie Kohler, Massimiliano Poletto, and David R. Montgomery. WCSSS 1999.

Software systems can be developed through evolution (gradual change) or revolution (reimplementation from scratch). Both approaches have advantages and disadvantages. An evolutionary approach keeps the system working throughout, allowing early problem detection, but tends to retain ingrained design flaws and can result in complex, ad hoc systems. A revolutionary approach is required to change the basic architecture of a system, but many more resources must be invested before the system can be evaluated. In this paper, we describe how we used a little application-specific language to combine these approaches' advantages.

The context of our work is CTAS, the next-generation air traffic control automation system developed originally by NASA. The overall goal was to redesign and reimplement one of the CTAS processes in Java, while retaining its ability to communicate with unmodified processes---a project complicated by CTAS's ad hoc message formats. To address this, we designed a language that combines C code copied from CTAS source, to express the message formats, with new Java code for message actions. A compiler then automatically generates code for marshalling and unmarshalling. The result is a system with both evolutionary and revolutionary properties, exemplified by the use of both old CTAS code and new Java code in the message language.

This paper discusses the language and compiler and evaluates some of the engineering tradeoffs inherent in their design.

@inproceedings{evolving-software:wcsss99,
title        = {Evolving software with an application-specific
language},
author       = {Eddie Kohler and Massimiliano Poletto and David R.
Montgomery},
pages        = {94--102},
booktitle    = {Workshop Record of {WCSSS} '99: The 2nd {ACM}
{SIGPLAN} Workshop on Compiler Support for Systems Software},
year         = 1999,
month        = may,
}
A fast Prolac TCP for the real world David Rogers Montgomery, Jr. Master's thesis, Massachusetts Institute of Technology, May 1999.
@mastersthesis{prolac:montgomery-meng,
title        = {A fast {Prolac} {TCP} for the real world},
author       = {Montgomery, Jr., David Rogers},
school       = {Massachusetts Institute of Technology},
year         = 1999,
month        = may,
}
An x86 protected mode virtual machine monitor for the MIT Exokernel Charles L. Coffing. Master's thesis, Massachusetts Institute of Technology, May 1999.
@mastersthesis{exo:coffing-meng,
title        = {An x86 Protected Mode Virtual Machine Monitor for
the {MIT} {Exokernel}},
author       = {Charles L. Coffing},
school       = {Massachusetts Institute of Technology},
year         = 1999,
month        = may,
}
PAN: a high-performance active network node supporting multiple mobile code systems Erik L. Nygren, Stephen J. Garland, and M. Frans Kaashoek. OpenArch 1999.

A capsule-based active network transports capsules containing code to be executed on network nodes through which they pass. Active networks facilitate the deployment of new protocols, which can be used without any changes to the underlying network infrastructure. This paper describes the design, implementation, and evaluation of a high-performance active network node which supports multiple mobile code systems. Experiments, using capsules executing unsafe native Intel ix86 object code, indicate that active networks may be able to provide significant flexibility relative to traditional networks with only a small performance overhead (as little as 13% for 1500 byte packets). However, capsules executing JavaVM code performed far worse (with over three times the performance overhead of native code for 128 byte packets), indicating that mobile code system performance is critical to overall node performance.

@inproceedings{pan:openarch99,
title        = {{PAN}: a high-performance active network node
supporting multiple mobile code systems},
author       = {Erik L. Nygren and Stephen J. Garland and M. Frans
Kaashoek},
pages        = {78--89},
booktitle    = {Proceedings of the 2nd {IEEE} Conference on Open
Architectures and Network Programming ({OpenArch} '99)},
year         = 1999,
month        = mar,
address      = {New York, New York},
}
`C and tcc: A language and compiler for dynamic code generation Massimiliano Poletto, Wilson C. Hsieh, Dawson R. Engler, and M. Frans Kaashoek. ACM Transactions on Programming Languages and Systems 21(2), March 1999.
@article{tickc:toplas,
title        = {{`C} and {tcc}: A language and compiler for dynamic
code generation},
author       = {Massimiliano Poletto and Wilson C. Hsieh and Dawson
R. Engler and M. Frans Kaashoek},
volume       = 21,
number       = 2,
year         = 1999,
month        = mar,
pages        = {324--369},
journal      = {{ACM} Transactions on Programming Languages and
Systems},
}

## 1998

The design, implementation and operation of an email pseudonym server David Mazières and M. Frans Kaashoek. CCS 1998.
@inproceedings{nymserver:ccs5,
title        = {The design, implementation and operation of an email
pseudonym server},
author       = {David Mazi{\`e}res and M. Frans Kaashoek},
pages        = {27--36},
booktitle    = {Proceedings of the 5th {ACM} Conference on Computer
and Communications Security ({CCS-5})},
year         = 1998,
month        = nov,
}
The exokernel operating system architecture Dawson R. Engler. Ph.D. thesis, Massachusetts Institute of Technology, October 1998.
@phdthesis{exo:engler-phd,
title        = {The exokernel operating system architecture},
author       = {Dawson R. Engler},
school       = {Massachusetts Institute of Technology},
year         = 1998,
month        = oct,
}
Escaping the evils of centralized control with self-certifying pathnames David Mazières and M. Frans Kaashoek. SIGOPS European Workshop 1998.
@inproceedings{sfs:sigops-euro8,
title        = {Escaping the evils of centralized control with
self-certifying pathnames},
author       = {David Mazi{\`e}res and M. Frans Kaashoek},
booktitle    = {Proceedings of the 8th {ACM} {SIGOPS} {E}uropean
workshop: Support for composing distributed applications},
year         = 1998,
month        = sep,
}
Framework for implementing file systems in Windows NT Danilo Almeida. Master's thesis, Massachusetts Institute of Technology, May 1998.
@mastersthesis{sfs:almeida-ms,
title        = {Framework for Implementing File Systems in {Windows
NT}},
author       = {Danilo Almeida},
school       = {Massachusetts Institute of Technology},
year         = 1998,
month        = may,
}
Flexible and efficient sharing of protected abstractions George M. Candea. Master's thesis, Massachusetts Institute of Technology, May 1998.

Traditional operating systems are overly restrictive and do not allow user-level applications to modify operating system abstractions. The exokernel operating system architecture safely gives untrusted applications efficient control over hardware and software resources by separating management from protection. Decentralized control, however, makes it very difficult for mutually distrustful applications to share system abstractions.

This thesis presents the design, implementation, and evaluation of the protected abstraction mechanism (PAM), a novel way to safely share user-level abstractions in an exokernel. PAM enables unprivileged, untrusted applications to define and securely share generic abstractions at run-time. PAM achieves a good flexibility-performance combination by eliminating the need for context switches and optimizing for the common case, in which the same abstraction is invoked repeatedly. PAM's design emphasizes simplicity and provable correctness, which makes it easy to understand and use: a couple of manual pages are sufficient for the average programmer to start using PAM.

We report measurements of PAM's performance on null method calls. In spite of the fact that such invocations do not take advantage of PAM's context switch-free operation, the PAM version of a simple abstraction outperforms the equivalent LRPC implementation by over 15% on null method calls. It is also considerably easier to write an abstraction using PAM. We therefore believe the protected abstraction mechanism is a viable solution to the problem of safely sharing user-level abstractions in the exokernel.

@mastersthesis{exo:candea-meng,
title        = {Flexible and efficient sharing of protected
abstractions},
author       = {George M. Candea},
school       = {Massachusetts Institute of Technology},
year         = 1998,
month        = may,
}
Performance evaluation of the Orca shared-object system Henri Bal, Raoul Bhoedjang, Rutger Hofman, Ceriel Jacobs, Koen Langendoen, Tim Ruhl, and M. Frans Kaashoek. ACM Transactions on Computer Systems 16(1), February 1998.
@article{orca:tocs,
title        = {Performance evaluation of the {Orca} shared-object
system},
author       = {Henri Bal and Raoul Bhoedjang and Rutger Hofman and
Ceriel Jacobs and Koen Langendoen and Tim Ruhl and M. Frans
Kaashoek},
pages        = {1--40},
month        = feb,
year         = 1998,
volume       = 16,
number       = 1,
journal      = {{ACM} Transactions on Computer Systems},
}
An efficient virtual network interface in the FUGU scalable workstation Kenneth M. Mackenzie. Ph.D. thesis, Massachusetts Institute of Technology, February 1998.
@phdthesis{fugu:mackenzie-phd,
title        = {An efficient virtual network interface in the {FUGU}
scalable workstation},
author       = {Kenneth M. Mackenzie},
school       = {Massachusetts Institute of Technology},
year         = 1998,
month        = feb,
}
Applying exokernel principles to conventional operating systems John Jannotti. Master's thesis, Massachusetts Institute of Technology, February 1998.
@mastersthesis{exo-os:jj-meng,
title        = {Applying Exokernel Principles to Conventional
Operating Systems},
author       = {John Jannotti},
school       = {Massachusetts Institute of Technology},
year         = 1998,
month        = feb,
}
The design and implementation of a high-performance active network node Erik L. Nygren. Master's thesis, Massachusetts Institute of Technology, February 1998.

A capsule-oriented active network transports capsules containing code to be executed on the network nodes that they pass through. This approach makes networks more extensible by allowing new networking protocols to be deployed and used without any changes to the underlying network infrastructure. This thesis project describes the design, implementation, and evaluation of a high-performance practical active network node that can serve as a testbed for research into active network performance and resource management issues. Nodes provide resources to executing capsules containing Intel ix86 object code. Although the current implementation does not yet provide safety or interoperability, the results of experiments performed on the system implemented for this thesis indicate that an active network architecture may be able to provide significant flexibility while only incurring a small performance overhead relative to traditional networks.

@mastersthesis{pan:nygren-meng,
title        = {The design and implementation of a high-performance
active network node},
author       = {Erik L. Nygren},
school       = {Massachusetts Institute of Technology},
year         = 1998,
month        = feb,
}

## 1997

Mobile computing with the Rover toolkit Anthony Joseph. Ph.D. thesis, Massachusetts Institute of Technology, November 1997.
@phdthesis{rover:adj-phd,
title        = {Mobile computing with the {Rover} toolkit},
author       = {Anthony Joseph},
school       = {Massachusetts Institute of Technology},
year         = 1997,
month        = nov,
}
Application performance and flexibility on exokernel systems M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Héctor M. Briceño, Russell Hunt, David Mazières, Thomas Pinckney, Robert Grimm, John Jannotti, and Kenneth Mackenzie. SOSP 1997.

The exokernel operating system architecture safely gives untrusted software efficient control over hardware and software resources by separating management from protection. This paper describes an exokernel system that allows specialized applications to achieve high performance without sacrificing the performance of unmodified UNIX programs. It evaluates the exokernel architecture by measuring end-to-end application performance on Xok, an exokernel for Intel x86-based computers, and by comparing Xok's performance to the performance of two widely-used 4.4BSD UNIX systems (FreeBSD and OpenBSD). The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). This paper also describes insights about the exokernel approach gained through building three different exokernel systems, and presents novel approaches to resource multiplexing.

@inproceedings{exo:sosp97,
title        = {Application performance and flexibility on exokernel
systems},
author       = {M. Frans Kaashoek and Dawson R. Engler and Gregory
R. Ganger and H{\'e}ctor M. Brice{\~n}o and Russell Hunt and David
Mazi{\`e}res and Thomas Pinckney and Robert Grimm and John
Jannotti and Kenneth Mackenzie},
pages        = {52--65},
booktitle    = {Proceedings of the 16th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '97)},
year         = 1997,
month        = oct,
}
Shared libraries in an exokernel operating system Douglas Karl Wyatt. Master's thesis, Massachusetts Institute of Technology, September 1997.
@mastersthesis{exo:wyatt-meng,
title        = {Shared libraries in an exokernel operating system},
author       = {Douglas Karl Wyatt},
school       = {Massachusetts Institute of Technology},
year         = 1997,
month        = sep,
}
Prolac: a language for protocol compilation Eddie Kohler. Master's thesis, Massachusetts Institute of Technology, September 1997.
@mastersthesis{prolac:kohler-ms,
title        = {Prolac: a language for protocol compilation},
author       = {Eddie Kohler},
school       = {Massachusetts Institute of Technology},
year         = 1997,
month        = sep,
}
Security and decentralized control in the SFS global file system David Mazières. Master's thesis, Massachusetts Institute of Technology, August 1997.
@mastersthesis{sfs:mazieres-ms,
title        = {Security and decentralized control in the {SFS}
global file system},
author       = {David Mazi{\`e}res},
school       = {Massachusetts Institute of Technology},
year         = 1997,
month        = aug,
}
tcc: a system for fast, flexible, and high-level dynamic code generation Massimiliano Poletto, Dawson R. Engler, and M. Frans Kaashoek. PLDI 1997.
@inproceedings{tickc:pldi97,
title        = {tcc: a system for fast, flexible, and high-level
dynamic code generation},
author       = {Massimiliano Poletto and Dawson R. Engler and M.
Frans Kaashoek},
pages        = {109--121},
booktitle    = {Proceedings of the {ACM} {SIGPLAN} '97 Conference on
Programming Language Design and Implementation ({PLDI} '97)},
year         = 1997,
month        = jun,
}
Secure applications need flexible operating systems David Mazières and M. Frans Kaashoek. HotOS 1997.
@inproceedings{secure-apps:hotos6,
title        = {Secure applications need flexible operating systems},
author       = {David Mazi{\`e}res and M. Frans Kaashoek},
pages        = {56--61},
booktitle    = {Proceedings of the 6th {W}orkshop on {H}ot {T}opics
in {O}perating {S}ystems ({HotOS-VI})},
year         = 1997,
month        = may,
organization = {{IEEE} {C}omputer {S}ociety},
address      = {Chatham, Cape Cod, Massachusetts},
}
Mobile computing with the Rover toolkit Anthony D. Joseph, Joshua A. Tauber, and M. Frans Kaashoek. IEEE Transactions on Computers 46(3), March 1997.
@article{rover:ieee-toc,
title        = {Mobile computing with the {Rover} toolkit},
author       = {Anthony D. Joseph and Joshua A. Tauber and M. Frans
Kaashoek},
volume       = 46,
number       = 3,
year         = 1997,
month        = mar,
pages        = {337--352},
journal      = {{IEEE} Transactions on Computers},
}
Operating system extensibility through event capture Thomas Pinckney III. Master's thesis, Massachusetts Institute of Technology, February 1997.
@mastersthesis{exo:pinckney-meng,
title        = {Operating system extensibility through event capture},
author       = {Thomas {Pinckney III}},
school       = {Massachusetts Institute of Technology},
year         = 1997,
month        = feb,
}
Decentralizing UNIX abstractions in the exokernel architecture Héctor Manuel Briceño Pulido. Master's thesis, Massachusetts Institute of Technology, February 1997.
@mastersthesis{exo:briceno-meng,
title        = {Decentralizing {UNIX} abstractions in the exokernel
architecture},
author       = {H{\'e}ctor Manuel {Brice{\~n}o Pulido}},
school       = {Massachusetts Institute of Technology},
year         = 1997,
month        = feb,
}
Embedded inodes and explicit grouping: exploiting disk bandwidth for small files Gregory R. Ganger and M. Frans Kaashoek. USENIX 1997.

Small file performance in most file systems is limited by slowly improving disk access times, even though current file systems improve on-disk locality by allocating related data objects in the same general region. The key insight for why current file systems perform poorly is that locality is insufficient---exploiting disk bandwidth for small data objects requires that they be placed adjacently. We describe C-FFS (Co-locating Fast File System), which introduces two techniques, embedded inodes and explicit grouping, for exploiting what disks do well (bulk data movement) to avoid what they do poorly (reposition to new locations). With embedded inodes, the inodes for most files are stored in the directory with the corresponding name, removing a physical level of indirection without sacrificing the logical level of indirection. With explicit grouping, the data blocks of multiple small files named by a given directory are allocated adjacently and moved to and from the disk as a unit in most cases. Measurements of our C-FFS implementation show that embedded inodes and explicit grouping have the potential to increase small file throughput (for both reads and writes) by a factor of 5-7 compared to the same file system without these techniques. The improvement comes directly from reducing the number of disk accesses required by an order of magnitude. Preliminary experience with software-development applications shows performance improvements ranging from 10-300 percent.

@inproceedings{cffs:usenix97,
title        = {Embedded inodes and explicit grouping: exploiting
disk bandwidth for small files},
author       = {Gregory R. Ganger and M. Frans Kaashoek},
pages        = {1--17},
booktitle    = {Proceedings of the 1997 {USENIX} Annual Technical
Conference (USENIX '97)},
year         = 1997,
month        = jan,
}
High-performance application-specific networking Deborah Anne Wallach. Ph.D. thesis, Massachusetts Institute of Technology, January 1997.
@phdthesis{app-specific-networking:wallach-phd,
title        = {High-performance application-specific networking},
author       = {Deborah Anne Wallach},
school       = {Massachusetts Institute of Technology},
year         = 1997,
month        = jan,
}
Building reliable mobile-aware applications using the Rover toolkit Anthony D. Joseph and M. Frans Kaashoek. ACM Wireless Networks 3(5), 1997.
@article{rover:winet,
title        = {Building reliable mobile-aware applications using
the {Rover} toolkit},
author       = {Anthony D. Joseph and M. Frans Kaashoek},
volume       = 3,
number       = 5,
year         = 1997,
pages        = {405--419},
journal      = {{ACM} Wireless Networks},
}

## 1996

Building reliable mobile-aware applications using the Rover toolkit Anthony D. Joseph, Joshua A. Tauber, and M. Frans Kaashoek. MobiCom 1996.
@inproceedings{rover:mobicom,
title        = {Building reliable mobile-aware applications using
the {Rover} toolkit},
author       = {Anthony D. Joseph and Joshua A. Tauber and M. Frans
Kaashoek},
booktitle    = {Proceedings of the 2nd {ACM} International
Conference on Mobile Computing and Networking ({MobiCom} '96)},
year         = 1996,
month        = nov,
}
Dynamic computation migration in distributed shared memory systems Wilson C. Hsieh, M. Frans Kaashoek, and William E. Weihl. HPCC 1996.
@inproceedings{dynamic-migration:supercomp96,
title        = {Dynamic computation migration in distributed shared
memory systems},
author       = {Wilson C. Hsieh and M. Frans Kaashoek and William E.
Weihl},
booktitle    = {Supercomputing '96 Conference Proceedings: The
international conference on high performance computing and
communications},
organization = {{ACM}},
year         = 1996,
month        = nov,
}
Server operating systems M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, and Deborah A. Wallach. SIGOPS European Workshop 1996.

We introduce server operating systems, which are sets of abstractions and runtime support for specialized, high-performance server applications. We have designed and are implementing a prototype server OS with support for aggressive specialization, direct device-to-device access, an event-driven organization, and dynamic compiler-assisted ILP. Using this server OS, we have constructed an HTTP server that outperforms servers running on a conventional OS by more than an order of magnitude and that can safely timeshare the hardware platform with other applications.

@inproceedings{server-os:sigops-euro,
title        = {Server operating systems},
author       = {M. Frans Kaashoek and Dawson R. Engler and Gregory
R. Ganger and Deborah A. Wallach},
pages        = {141--148},
booktitle    = {Proceedings of the 7th {ACM} {SIGOPS} {E}uropean
workshop: Systems support for worldwide applications},
year         = 1996,
month        = sep,
}
ASHs: application-specific handlers for high-performance messaging Deborah A. Wallach, Dawson R. Engler, and M. Frans Kaashoek. SIGCOMM 1996.
@inproceedings{ash:sigcomm96,
title        = {{ASHs}: application-specific handlers for
high-performance messaging},
author       = {Deborah A. Wallach and Dawson R. Engler and M. Frans
Kaashoek},
pages        = {40--52},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} '96 Conference:
Applications, Technologies, Architectures, and Protocols for
Computer Communication},
year         = 1996,
month        = aug,
}
DPF: fast, flexible message demultiplexing using dynamic code generation Dawson R. Engler and M. Frans Kaashoek. SIGCOMM 1996.
@inproceedings{dpf:sigcomm96,
title        = {{DPF}: fast, flexible message demultiplexing using
dynamic code generation},
author       = {Dawson R. Engler and M. Frans Kaashoek},
pages        = {53--59},
booktitle    = {Proceedings of the {ACM} {SIGCOMM} '96 Conference:
Applications, Technologies, Architectures, and Protocols for
Computer Communication},
year         = 1996,
month        = aug,
}
The Rover NNTP proxy Constantine Cristakos. Advanced Undergraduate Project, Massachusetts Institute of Technology, June 1996.
@mastersthesis{rover:nntp,
title        = {The {Rover} {NNTP} proxy},
author       = {Constantine Cristakos},
type         = {Advanced Undergraduate Project},
school       = {Massachusetts Institute of Technology},
year         = 1996,
month        = jun,
}
An evaluation of the Amoeba group communication system M. Frans Kaashoek and Andrew S. Tanenbaum. ICDCS 1996.
@inproceedings{amoeba-eval:dcs16,
title        = {An evaluation of the {Amoeba} group communication
system},
author       = {M. Frans Kaashoek and Andrew S. Tanenbaum},
pages        = {436--448},
booktitle    = {Proceedings of the 16th International Conference on
Distributed Computing Systems},
organization = {{IEEE} {C}omputer {S}ociety},
year         = 1996,
month        = may,
}
Atomic recovery units: failure atomicity for logical disks Robert Grimm, Wilson C. Hsieh, Wiebren de Jonge, and M. Frans Kaashoek. ICDCS 1996.
@inproceedings{arus:dcs16,
title        = {Atomic recovery units: failure atomicity for logical
disks},
author       = {Robert Grimm and Wilson C. Hsieh and Wiebren de
Jonge and M. Frans Kaashoek},
pages        = {26--37},
booktitle    = {Proceedings of the 16th International Conference on
Distributed Computing Systems},
organization = {{IEEE} {C}omputer {S}ociety},
year         = 1996,
month        = may,
}
VCODE: a retargetable, extensible, very fast dynamic code generation system Dawson R. Engler. PLDI 1996.
@inproceedings{vcode:pldi96,
title        = {{VCODE}: a retargetable, extensible, very fast
dynamic code generation system},
author       = {Dawson R. Engler},
pages        = {160--170},
booktitle    = {Proceedings of the {ACM} {SIGPLAN} '96 Conference on
Programming Language Design and Implementation ({PLDI} '96)},
year         = 1996,
month        = may,
}
Exodisk: maximizing application control over storage management Robert Grimm. Master's thesis, Massachusetts Institute of Technology, May 1996.
@mastersthesis{exo:grimm-ms,
title        = {Exodisk: maximizing application control over storage
management},
author       = {Robert Grimm},
school       = {Massachusetts Institute of Technology},
year         = 1996,
month        = may,
}
Issues in building mobile-aware applications with the Rover toolkit Joshua A. Tauber. Master's thesis, Massachusetts Institute of Technology, May 1996.
@mastersthesis{rover:tauber-ms,
title        = {Issues in building mobile-aware applications with
the {Rover} toolkit},
author       = {Joshua A. Tauber},
school       = {Massachusetts Institute of Technology},
year         = 1996,
month        = may,
}
tcc: a template-based compiler for `C Massimiliano Poletto, Dawson R. Engler, and M. Frans Kaashoek. WCSSS 1996.
@inproceedings{tickc:wcsss96,
title        = {tcc: a template-based compiler for {`C}},
author       = {Massimiliano Poletto and Dawson R. Engler and M.
Frans Kaashoek},
pages        = {1--7},
booktitle    = {Workshop Record of {WCSSS} '96: The Inaugural
Workshop on Compiler Support for Systems Software},
organization = {{ACM} {SIGPLAN}},
year         = 1996,
month        = feb,
}
`C: a language for efficient, machine-independent dynamic code generation Dawson R. Engler, Wilson C. Hsieh, and M. Frans Kaashoek. POPL 1996.
@inproceedings{tickc:popl96,
title        = {{`C}: a language for efficient, machine-independent
dynamic code generation},
author       = {Dawson R. Engler and Wilson C. Hsieh and M. Frans
Kaashoek},
pages        = {131--144},
note         = {An earlier version is available as MIT-LCS-TM-526},
booktitle    = {Proceedings of the 23rd {ACM} {SIGPLAN}-{SIGACT}
Symposium on Principles of Programming Languages ({POPL} '96)},
year         = 1996,
month        = jan,
address      = {St. Petersburg Beach, Florida},
}

## 1995

Rover: a toolkit for mobile information access Anthony D. Joseph, Alan F. deLespinasse, Joshua A. Tauber, David K. Gifford, and M. Frans Kaashoek. SOSP 1995.
@inproceedings{rover:sosp95,
title        = {{Rover}: a toolkit for mobile information access},
author       = {Anthony D. Joseph and {deLespinasse}, Alan F. and
Joshua A. Tauber and David K. Gifford and M. Frans Kaashoek},
pages        = {156--171},
booktitle    = {Proceedings of the 15th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '95)},
year         = 1995,
month        = dec,
}
Exokernel: an operating system architecture for application-level resource management Dawson R. Engler, M. Frans Kaashoek, and James O'Toole Jr. SOSP 1995.
@inproceedings{exo:sosp95,
title        = {{E}xokernel: an operating system architecture for
application-level resource management},
author       = {Dawson R. Engler and M. Frans Kaashoek and James
{O'Toole Jr.}},
pages        = {251--266},
booktitle    = {Proceedings of the 15th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '95)},
year         = 1995,
month        = dec,
}
CRL: high-performance all-software distributed shared memory Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach. SOSP 1995.
@inproceedings{crl:sosp95,
title        = {{CRL}: high-performance all-software distributed
shared memory},
author       = {Kirk L. Johnson and M. Frans Kaashoek and Deborah A.
Wallach},
pages        = {213--226},
note         = {An earlier version of this work appeared as
Technical Report MIT-LCS-TM-517, MIT Laboratory for Computer
Science, March 1995},
booktitle    = {Proceedings of the 15th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '95)},
year         = 1995,
month        = dec,
}
High-performance all-software distributed shared memory Kirk L. Johnson. Ph.D. thesis, Massachusetts Institute of Technology, December 1995.
@phdthesis{crl:johnson-phd,
title        = {High-performance all-software distributed shared
memory},
author       = {Kirk L. Johnson},
school       = {Massachusetts Institute of Technology},
year         = 1995,
month        = dec,
}
Dynamic computation migration in distributed shared memory systems Wilson C. Hsieh. Ph.D. thesis, Massachusetts Institute of Technology, September 1995.
@phdthesis{dyn-comp-migration:hsieh-phd,
title        = {Dynamic computation migration in distributed shared
memory systems},
author       = {Wilson C. Hsieh},
school       = {Massachusetts Institute of Technology},
year         = 1995,
month        = sep,
note         = {Also available as MIT LCS tech report
MIT-LCS-TR-665.},
}
Optimistic active messages: a mechanism for scheduling communication with computation Deborah A. Wallach, Wilson C. Hsieh, Kirk Johnson, M. Frans Kaashoek, and William E. Weihl. PPoPP 1995.
@inproceedings{oam:ppopp95,
title        = {Optimistic active messages: a mechanism for
scheduling communication with computation},
author       = {Deborah A. Wallach and Wilson C. Hsieh and Kirk
Johnson and M. Frans Kaashoek and William E. Weihl},
pages        = {217--226},
booktitle    = {Proceedings of the 5th {ACM} {SIGPLAN} Symposium on
Principles and Practice of Parallel Programming ({PPoPP} '95)},
year         = 1995,
month        = jul,
}
Implementing sequentially consistent shared objects using broadcast and point-to-point communication Alan Fekete, M. Frans Kaashoek, and Nancy Lynch. ICDCS 1995.
@inproceedings{formal-sequential-consistent:dcs15,
title        = {Implementing sequentially consistent shared objects
using broadcast and point-to-point communication},
author       = {Alan Fekete and M. Frans Kaashoek and Nancy Lynch},
pages        = {439--449},
booktitle    = {Proceedings of the 15th International Conference on
Distributed Computing Systems},
organization = {{IEEE} {C}omputer {S}ociety},
year         = 1995,
month        = jun,
}
Implementing sequentially consistent shared objects using broadcast and point-to-point communication Alan Fekete, M. Frans Kaashoek, and Nancy Lynch. MIT LCS technical report, June 1995.
@techreport{formal-sequential-consistent:tr,
title        = {Implementing sequentially consistent shared objects
using broadcast and point-to-point communication},
author       = {Alan Fekete and M. Frans Kaashoek and Nancy Lynch},
institution  = {{MIT} Laboratory for Computer Science},
year         = 1995,
month        = jun,
number       = {MIT-LCS-TR-518},
}
Rover Mosaic: e-mail communication for a full-function Web browser Alan F. deLespinasse. Master's thesis, Massachusetts Institute of Technology, June 1995.
@mastersthesis{rover-mosaic:delespinasse-thesis,
title        = {{Rover} {Mosaic}: e-mail communication for a
full-function {Web} browser},
author       = {Alan F. {deLespinasse}},
school       = {Massachusetts Institute of Technology},
year         = 1995,
month        = jun,
}
Exterminate all operating system abstractions Dawson R. Engler and M. Frans Kaashoek. HotOS 1995.
@inproceedings{exo:hotos5,
title        = {Exterminate all operating system abstractions},
author       = {Dawson R. Engler and M. Frans Kaashoek},
pages        = {78--83},
booktitle    = {Proceedings of the 5th {W}orkshop on {H}ot {T}opics
in {O}perating {S}ystems ({HotOS-V})},
year         = 1995,
month        = may,
organization = {{IEEE} {C}omputer {S}ociety},
}
AVM: application-level virtual memory Dawson R. Engler, Sandeep K. Gupta, and M. Frans Kaashoek. HotOS 1995.
@inproceedings{avm:hotos5,
title        = {{AVM}: application-level virtual memory},
author       = {Dawson R. Engler and Sandeep K. Gupta and M. Frans
Kaashoek},
pages        = {72--77},
booktitle    = {Proceedings of the 5th {W}orkshop on {H}ot {T}opics
in {O}perating {S}ystems ({HotOS-V})},
year         = 1995,
month        = may,
organization = {{IEEE} {C}omputer {S}ociety},
}
The operating system kernel as a secure programmable machine Dawson R. Engler, M. Frans Kaashoek, and James W. O'Toole Jr. Operating Systems Review 29(1), January 1995.
@article{exo:osr,
title        = {The operating system kernel as a secure programmable
machine},
author       = {Dawson R. Engler and M. Frans Kaashoek and {O'Toole
Jr.}, James W.},
year         = 1995,
month        = jan,
volume       = 29,
number       = 1,
pages        = {78--82},
journal      = {Operating Systems Review},
organization = {{ACM}},
}

## 1994

Dynamic documents: mobile wireless access to the WWW M. Frans Kaashoek, Tom Pinckney, and Joshua A. Tauber. WMCSA 1994.

We propose dynamic documents as an approach to extending and customizing the WWW/Mosaic for mobile computing platforms. Dynamic documents are programs executed on a mobile platform to generate a document; they are implemented as Tcl scripts. We have modified the NCSA Mosaic web client to run the dynamic documents it retrieves through a modified Tcl interpreter. The interpreter is designed to execute only commands that do not violate safety.

To hide the latencies of slow links we have modified the Mosaic client to perform caching and prefetching. The policies for caching and prefetching can be under control of dynamic documents, allowing the strategies to be document-specific. Using dynamic documents, we have built an adaptive email browser that employs application-specific caching and prefetching strategies. Both the browser and the displayed email messages are dynamically customized to the mobile computing environment in which they run.
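The abstract's key safety mechanism is an interpreter that executes only commands known not to violate safety. The paper's implementation is a modified Tcl interpreter; as a rough illustration of the same allow-list idea, here is a minimal Python sketch (all names hypothetical, not from the paper):

```python
# Illustrative sketch of an allow-list interpreter, in the spirit of
# the restricted Tcl interpreter described above. SAFE_COMMANDS and
# run_safe are hypothetical names, not the paper's API.

SAFE_COMMANDS = {
    "concat": lambda *args: " ".join(args),   # harmless text commands...
    "upper":  lambda s: s.upper(),
}

def run_safe(program):
    """Execute a list of (command, args) pairs, rejecting any command
    outside the allow-list -- e.g. file or network access."""
    results = []
    for cmd, args in program:
        if cmd not in SAFE_COMMANDS:
            raise PermissionError(f"command {cmd!r} is not allowed")
        results.append(SAFE_COMMANDS[cmd](*args))
    return results

# A "dynamic document" using only safe commands runs normally...
print(run_safe([("upper", ("hello",))]))
# ...while one that tries to escape the sandbox is rejected.
try:
    run_safe([("exec", ("rm -rf /",))])
except PermissionError as e:
    print("blocked:", e)
```

The real system restricts a full Tcl interpreter rather than a toy dispatcher, but the design choice is the same: safety comes from enumerating permitted operations, not from filtering dangerous ones.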

@inproceedings{dynamic-documents:wmcsa,
title        = {Dynamic documents: mobile wireless access to the
{WWW}},
author       = {M. Frans Kaashoek and Tom Pinckney and Joshua A.
Tauber},
pages        = {179--184},
booktitle    = {Proceedings of the Workshop on Mobile Computing
Systems and Applications ({WMCSA} '94)},
organization = {{IEEE} {C}omputer {S}ociety},
year         = 1994,
month        = dec,
}
Storage alternatives for mobile computers Fred Douglis, Ramón Cáceres, M. Frans Kaashoek, Kai Li, Brian Marsh, and Joshua A. Tauber. OSDI 1994.
@inproceedings{mobile-storage-alt:osdi1,
title        = {Storage alternatives for mobile computers},
author       = {Fred Douglis and Ram{\'o}n C{\'a}ceres and M. Frans
Kaashoek and Kai Li and Brian Marsh and Joshua A. Tauber},
pages        = {25--37},
booktitle    = {Proceedings of the 1st {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '94)},
year         = 1994,
month        = nov,
}
The exokernel approach to extensibility (panel statement) Dawson R. Engler, M. Frans Kaashoek, and James W. O'Toole Jr. OSDI 1994.
@inproceedings{exo:osdi1,
title        = {The exokernel approach to extensibility (panel
statement)},
author       = {Dawson R. Engler and M. Frans Kaashoek and {O'Toole
Jr.}, James W.},
pages        = 198,
booktitle    = {Proceedings of the 1st {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '94)},
year         = 1994,
month        = nov,
}
Software prefetching and caching for translation lookaside buffers Kavita Bala, M. Frans Kaashoek, and William Weihl. OSDI 1994.
@inproceedings{software-tlb-prefetch:osdi1,
title        = {Software prefetching and caching for translation
lookaside buffers},
author       = {Kavita Bala and M. Frans Kaashoek and William Weihl},
pages        = {243--253},
booktitle    = {Proceedings of the 1st {USENIX} {S}ymposium on
{O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '94)},
year         = 1994,
month        = nov,
}
Dynamic documents: extensibility and adaptability in the WWW M. Frans Kaashoek, Tom Pinckney, and Joshua A. Tauber. WWW 1994.
@inproceedings{dynamic-documents:www94,
title        = {Dynamic documents: extensibility and adaptability in
the {WWW}},
author       = {M. Frans Kaashoek and Tom Pinckney and Joshua A.
Tauber},
edition      = {developers' day track},
booktitle    = {Proceedings of the 2nd International {WWW}
Conference: Mosaic and the Web},
year         = 1994,
month        = oct,
}
DCG: an efficient, retargetable dynamic code generation system Dawson R. Engler and Todd A. Proebsting. ASPLOS 1994.
@inproceedings{dcg:asplos6,
title        = {{DCG}: an efficient, retargetable dynamic code
generation system},
author       = {Dawson R. Engler and Todd A. Proebsting},
pages        = {263--272},
booktitle    = {Proceedings of the 6th International Conference on
Architectural Support for Programming Languages and Operating
Systems ({ASPLOS-VI})},
year         = 1994,
month        = oct,
}
The operating system kernel as a secure programmable machine Dawson R. Engler, M. Frans Kaashoek, and James W. O'Toole Jr. SIGOPS European Workshop 1994.
@inproceedings{exo:sigops-euro,
title        = {The operating system kernel as a secure programmable
machine},
author       = {Dawson R. Engler and M. Frans Kaashoek and {O'Toole
Jr.}, James W.},
pages        = {62--67},
booktitle    = {Proceedings of the 6th {ACM} {SIGOPS} {E}uropean
workshop: Matching operating systems to application needs},
year         = 1994,
month        = sep,
}
Efficient implementation of high-level languages on user-level communication architectures Wilson C. Hsieh, Kirk L. Johnson, M. Frans Kaashoek, Deborah A. Wallach, and William E. Weihl. MIT LCS technical report, May 1994.
@techreport{user-level-comm:tr,
title        = {Efficient implementation of high-level languages on
user-level communication architectures},
author       = {Wilson C. Hsieh and Kirk L. Johnson and M. Frans
Kaashoek and Deborah A. Wallach and William E. Weihl},
institution  = {{MIT} Laboratory for Computer Science},
year         = 1994,
month        = may,
number       = {MIT-LCS-TR-616},
}

## 1993

The logical disk: a new approach to improving file systems Wiebren de Jonge, M. Frans Kaashoek, and Wilson C. Hsieh. SOSP 1993.
@inproceedings{logicaldisk:sosp14,
title        = {The logical disk: a new approach to improving file
systems},
author       = {Wiebren de Jonge and M. Frans Kaashoek and Wilson C.
Hsieh},
pages        = {15--28},
booktitle    = {Proceedings of the 14th {ACM} {S}ymposium on
{O}perating {S}ystems {P}rinciples ({SOSP} '93)},
year         = 1993,
month        = dec,
}
The persistent relevance of IPC performance Wilson C. Hsieh, M. Frans Kaashoek, and William E. Weihl. WWOS 1993.
@inproceedings{ipc-persistent-relevance:wwos4,
title        = {The persistent relevance of {IPC} performance},
author       = {Wilson C. Hsieh and M. Frans Kaashoek and William E.
Weihl},
pages        = {186--190},
booktitle    = {Proceedings of the 4th Workshop on Workstation
Operating Systems},
organization = {{IEEE} {C}omputer {S}ociety},
year         = 1993,
month        = oct,
}