MIT CSAIL Parallel and Distributed Operating Systems Group

2024

Unifying serverless and microservice tasks with SigmaOS Ariel Szekely, Adam Belay, Robert Morris, and M. Frans Kaashoek. SOSP 2024.

Many cloud applications use both serverless functions, for bursts of stateless parallel computation, and container orchestration, for long-running microservices and tasks that need to interact. Ideally a single platform would offer the union of these systems' capabilities, but neither is sufficient to act as that single platform: serverless functions are lightweight but cannot act as servers with long-term state, while container orchestration offers general-purpose computation but instance start-up takes too long to support burst parallelism. SigmaOS is a new multi-tenant cloud operating system that combines the best of container orchestration and serverless in one platform with one API. SigmaOS computations, called procs, can be long-running, stateful, and interact with each other, making them a good match for both serverless and microservice tasks. A key aspect of the SigmaOS design is its *cloud-centric* API, which provides flexible management of computation, a novel abstraction for communication endpoints, SigmaEPs ---which allow procs of a tenant to communicate efficiently but prohibits procs from sending packets to other tenants---and a flexible naming system to name, for example, SigmaEPs. Quick start-up is important for serverless uses. A key enabling observation is that both serverless and microservice applications rely on cloud services for much of the work traditionally done by the local OS (e.g., access to durable storage and additional compute resources). SigmaOS exploits this observation by providing only a small and generic local operating system image to each , which can be created much more quickly than a container orchestration instance since SigmaOS need not install application-specific filesystem content or (due to SigmaOS's SigmaEPs) configure an isolated overlay network. Microbenchmarks show that SigmaOS can cold start a in msec and can create procs per second, distributing them over a 24-machine cluster. An evaluation of SigmaOS with two microservice applications from DeathStarBench, a MapReduce application, and an image processing benchmark, shows that the SigmaOS API supports both microservices and lambda-style computations, and provides better performance than corresponding versions on AWS Lambda and Kubernetes.

@inproceedings{sigmaos:sosp24,
  title        = {Unifying serverless and microservice tasks with
    {SigmaOS}},
  author       = {Ariel Szekely and Adam Belay and Robert Morris and
    M. Frans Kaashoek},
  booktitle    = {Proceedings of the 30th ACM Symposium on Operating
    Systems Principles (SOSP 2024)},
  year         = 2024,
  month        = nov,
  address      = {Austin, Texas},
}

Modular verification of secure and leakage-free systems: From application specification to circuit-level implementation Anish Athalye, Henry Corrigan-Gibbs, M. Frans Kaashoek, Joseph Tassarotti, and Nickolai Zeldovich. SOSP 2024.

Parfait is a framework for proving that an implementation of a hardware security module (HSM) leaks nothing more than what is mandated by an application specification. Parfait proofs cover the software and the hardware of an HSM, which catches bugs above the cycle-level digital circuit abstraction, including timing side channels. Parfait's contribution is a scalable approach to proving security and non-leakage by using intermediate levels of abstraction and relating them with transitive information-preserving refinement. This enables Parfait to use different techniques to verify the implementation at different levels of abstraction, reuse existing verified components such as CompCert, and automate parts of the proof, while still providing end-to-end guarantees. We use Parfait to verify four HSMs, including an ECDSA certificate-signing HSM and a password-hashing HSM, on top of the OpenTitan Ibex and PicoRV32 processors. Parfait provides strong guarantees for these HSMs: for instance, it proves that the ECDSA-on-Ibex HSM implementation—2,300 lines of code and 13,500 lines of Verilog—leaks nothing more than what is allowed by a 40-line specification of its behavior.

@inproceedings{parfait:sosp24,
  title        = {Modular Verification of Secure and Leakage-Free
    Systems: From Application Specification to Circuit-Level
    Implementation},
  author       = {Anish Athalye and Henry Corrigan-Gibbs and M. Frans
    Kaashoek and Joseph Tassarotti and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 30th ACM Symposium on Operating
    Systems Principles (SOSP 2024)},
  year         = 2024,
  month        = nov,
  address      = {Austin, Texas},
}

Formally verifying secure and leakage-free systems: From application specification to circuit-level implementation Anish Athalye. Ph.D. thesis, Massachusetts Institute of Technology, August 2024.

Hardware and software systems are susceptible to bugs and timing side-channel vulnerabilities. Timing leakage is particularly hard to eliminate because leakage is an emergent property that can arise from subtle behaviors or interactions between hardware and software components in the entire system, with root causes such as non-constant-time code, compiler-generated timing variation, and microarchitectural side channels. This thesis contributes a new approach using formal verification to rule out such bugs and build systems that are correct, secure, and leakage-free.

This thesis introduces a new theory called information-preserving refinement (IPR) for capturing non-leakage in addition to correctness and security, implements a verification approach for IPR in the Parfait framework, and applies it to verifying hardware security modules (HSMs). Using Parfait, a developer can verify that an HSM implementation leaks no more information than is allowed by a succinct application-level specification of the device's intended behavior, with proofs covering the implementation's hardware and software down to its cycle-precise wire-I/O-level behavior.

This thesis uses Parfait to implement and verify several HSMs, including an ECDSA certificate-signing HSM and a password-hashing HSM, on top of Ibex and PicoRV32-based hardware platforms. Parfait provides strong guarantees for these HSMs: for example, it proves that the ECDSA-on-Ibex implementation—2,300 lines of code and 13,500 lines of Verilog—leaks nothing more than what is allowed by a 40-line specification of its behavior.

@phdthesis{aathalye:thesis,
  title        = {Formally Verifying Secure and Leakage-Free Systems:
    From Application Specification to Circuit-Level Implementation},
  author       = {Anish Athalye},
  school       = {Massachusetts Institute of Technology},
  year         = 2024,
  month        = aug,
}

Verifying correctness of the number theoretic transform and fast number theoretic transform in F* Ryuta R. Ono. Master's thesis, Massachusetts Institute of Technology, August 2024.

As engineers continue to develop more sophisticated algorithms to optimize cryptographic algorithms, their often simple mathematical specifications become convoluted in the algorithms, from which a class of correctness bugs arise. Because cryptographic algorithms often secure sensitive information, their correctness, and in turn their security is a top priority. The Number Theoretic Transform (NTT) is an algorithm that enables efficient polynomial multiplication and has recently gained importance in post-quantum cryptography. This thesis presents a proof of correctness of the NTT in F^★, a proof-oriented programming language that extracts to OCaml, and shows that we can use the NTT to perform polynomial multiplications. We provide an implementation of the Cooley-Tukey fast NTT algorithm and a proof that it matches the original NTT specification. This thesis also presents a representation of polynomials in the F^★ subset Low^★, which extracts to performant C code.

@mastersthesis{rrono-meng,
  title        = {Verifying Correctness of the Number Theoretic
    Transform and Fast Number Theoretic Transform in {F*}},
  author       = {Ryuta R. Ono},
  school       = {Massachusetts Institute of Technology},
  year         = 2024,
  month        = aug,
}

Verifying hardware security modules with true random number generators Katherine Zhao. Master's thesis, Massachusetts Institute of Technology, May 2024.

Hardware security modules (HSMs) are powerful tools in building secure computer systems, allowing developers to factor out security-critical code to separate devices. Because HSMs usually work with sensitive data, it is crucial that we are able to verify that they are secure. Many HSMs today also include true random number generators (TRNGs) as part of their architecture to seed cryptographic functions for generating keys, creating nonces, padding, and more. This thesis presents a definition of Information-Preserving Refinement with Randomness (IPRR) that captures the idea that a HSM with a TRNG is correct and is secure from timing side channel attacks. We additionally construct a strategy to prove IPRR, and develop Karatroc, a tool for verifying that a HSM satisfies IPRR. Through the creation and evaluation of Karatroc, we demonstrate the ability to verify HSMs with TRNGs without incurring significant added cost in performance and proof length as compared to existing proof methods.

@mastersthesis{zhao-meng,
  title        = {Verifying Hardware Security Modules With True Random
    Number Generators},
  author       = {Katherine Zhao},
  school       = {Massachusetts Institute of Technology},
  year         = 2024,
  month        = may,
}

Towards cycle-level verification of constant-time cryptography Jessica Y. Xu. Master's thesis, Massachusetts Institute of Technology, May 2024.

Cryptographic primitives—hash functions, symmetric key encryption algorithms, asymmetric key exchange algorithms, and more—are used everywhere to achieve security in modern computing. Since these algorithms have complicated, math-heavy implementations, they are typically used through cryptographic library functions. However, many timing side-channel attacks, which leak information when execution time depends on secrets, have been found in popular cryptographic libraries, such as OpenSSL. Formal verification aims to rule out timing side channels in cryptographic software.

This thesis presents Quake, a framework for verifying cryptographic library functions are constant-time for a specific hardware implementation, regardless of where the code is located in memory. Quake represents the location of code in memory using symbolic addresses and introduces a ROM model that gets concrete memory data from symbolic addresses. This thesis evaluates Quake and demonstrates that it can detect address-dependent timing behavior and does so in a reasonable amount of time.

@mastersthesis{jyxu-meng,
  title        = {Towards Cycle-Level Verification of Constant-Time
    Cryptography},
  author       = {Jessica Y. Xu},
  school       = {Massachusetts Institute of Technology},
  year         = 2024,
  month        = may,
}

Flexible privacy via disguise and revealing Lillian Tsai. Ph.D. thesis, Massachusetts Institute of Technology, May 2024.

@phdthesis{tslilyai:thesis,
  title        = {Flexible Privacy via Disguise and Revealing},
  author       = {Lillian Tsai},
  school       = {Massachusetts Institute of Technology},
  year         = 2024,
  month        = may,
}

Making kernel bypass practical for the cloud with Junction Joshua Fried, Gohar Irfan Chaudhry, Enrique Saurez, Esha Choukse, Inigo Goiri, Sameh Elnikety, Rodrigo Fonseca, and Adam Belay. NSDI 2024.

Kernel bypass systems have demonstrated order of magnitude improvements in throughput and tail latency for network-intensive applications relative to traditional operating systems (OSes). To achieve such excellent performance, however, they rely on dedicated resources (e.g., spinning cores, pinned memory) and require application rewriting. This is unattractive to cloud operators because they aim to densely pack applications, and rewriting cloud software requires a massive investment of valuable developer time. For both reasons, kernel bypass, as it exists, is impractical for the cloud.

In this paper, we show these compromises are not necessary to unlock the full benefits of kernel bypass. We present Junction, the first kernel bypass system that can pack thousands of instances on a machine while providing compatibility with unmodified Linux applications. Junction achieves high density through several advanced NIC features that reduce pinned memory and the overhead of monitoring large numbers of queues. It maintains compatibility with minimal overhead through optimizations that exploit a shared address space with the application. Junction scales to 19–62× more instances than existing kernel bypass systems and can achieve similar or better performance without code changes. Furthermore, Junction delivers significant performance benefits to applications previously unsupported by kernel bypass, including those that depend on runtime systems like Go, Java, Node, and Python. In a comparison to native Linux, Junction increases throughput by 1.6–7.0× while using 1.2–3.8× less cores across seven applications.

@inproceedings{junction:nsdi24,
  title        = {Making Kernel Bypass Practical for the Cloud with
    {Junction}},
  author       = {Joshua Fried and Gohar Irfan Chaudhry and Enrique
    Saurez and Esha Choukse and Inigo Goiri and Sameh Elnikety and
    Rodrigo Fonseca and Adam Belay},
  booktitle    = {Proceedings of the 21st {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '24)},
  year         = 2024,
  month        = apr,
  address      = {Santa Clara, CA},
}

2023

Private web search with Tiptoe Alexandra Henzinger, Emma Dauterman, Henry Corrigan-Gibbs, and Nickolai Zeldovich. SOSP 2023.

Tiptoe is a private web search engine that allows clients to search over hundreds of millions of documents, while revealing no information about their search query to the search engine’s servers. Tiptoe’s privacy guarantee is based on cryptography alone; it does not require hardware enclaves or non-colluding servers. Tiptoe uses semantic embeddings to reduce the problem of private full-text search to private nearest-neighbor search. Then, Tiptoe implements private nearest-neighbor search with a new, high-throughput protocol based on linearly homomorphic encryption. Running on a 45-server cluster, Tiptoe can privately search over 360 million web pages with 145 core-seconds of server compute, 56.9 MiB of client-server communication (74% of which occurs before the client enters its search query), and 2.7 seconds of end-to-end latency. Tiptoe’s search works best on conceptual queries (“knee pain”) and less well on exact string matches (“123 Main Street, New York”). On the MS MARCO search-quality benchmark, Tiptoe ranks the best-matching result in position 7.7 on average. This is worse than a state-of-the-art, non-private neural search algorithm (average rank: 2.3), but is close to the classical tf-idf algorithm (average rank: 6.7). Finally, Tiptoe is extensible: it also supports private text-to-image search and, with minor modifications, it can search over audio, code, and more.

@inproceedings{tiptoe:sosp23,
  author       = {Alexandra Henzinger and Emma Dauterman and Henry
    Corrigan-Gibbs and Nickolai Zeldovich},
  title        = {Private Web Search with {Tiptoe}},
  booktitle    = {Proceedings of the 29th ACM Symposium on Operating
    Systems Principles (SOSP 2023)},
  year         = 2023,
  month        = oct,
  address      = {Koblenz, Germany},
}

Edna: Disguising and revealing user data in web applications Lillian Tsai, Hannah Gross, Eddie Kohler, Frans Kaashoek, and Malte Schwarzkopf. SOSP 2023.

Edna is a system that helps web applications allow users to remove their data without permanently losing their accounts, anonymize their old data, and selectively dissociate personal data from public profiles. Edna helps developers support these features while maintaining application functionality and referential integrity via disguising and revealing transformations. Disguising selectively renders user data inaccessible via encryption, and revealing enables the user to restore their data to the application. Edna's techniques allow transformations to compose in any order, e.g., deleting a previously anonymized user's account, or restoring an account back to an anonymized state.Experiments with Edna that add disguising and revealing transformations to three real-world applications show that Edna enables new privacy features in existing applications with low developer effort, is simpler than alternative approaches, and adds limited overhead to applications.

@inproceedings{edna:sosp23,
  author       = {Tsai, Lillian and Gross, Hannah and Kohler, Eddie
    and Kaashoek, Frans and Schwarzkopf, Malte},
  title        = {Edna: Disguising and Revealing User Data in Web
    Applications},
  booktitle    = {Proceedings of the 29th ACM Symposium on Operating
    Systems Principles (SOSP 2023)},
  year         = 2023,
  month        = oct,
  address      = {Koblenz, Germany},
}

Grove: a Separation-Logic Library for Verifying Distributed Systems Upamanyu Sharma, Ralf Jung, Joseph Tassarotti, Frans Kaashoek, and Nickolai Zeldovich. SOSP 2023.

Grove is a concurrent separation logic library for verifying distributed systems. Grove is the first to handle time-based leases, including their interaction with reconfiguration, crash recovery, thread-level concurrency, and unreliable networks. This paper uses Grove to verify several distributed system components written in Go, including vKV, a realistic distributed multi-threaded key-value store. vKV supports reconfiguration, primary/backup replication, and crash recovery, and uses leases to execute read-only requests on any replica. vKV achieves high performance (67-73% of Redis on a single core), scales with more cores and more backup replicas (achieving about 2× the throughput when going from 1 to 3 servers), and can safely execute reads while reconfiguring.

@inproceedings{grove:sosp23,
  title        = {{G}rove: a {S}eparation-{L}ogic {L}ibrary for
    {V}erifying {D}istributed {S}ystems},
  author       = {Upamanyu Sharma and Ralf Jung and Joseph Tassarotti
    and Frans Kaashoek and Nickolai Zeldovich},
  pages        = {113--129},
  booktitle    = {Proceedings of the 29th ACM Symposium on Operating
    Systems Principles (SOSP 2023)},
  year         = 2023,
  month        = oct,
  address      = {Koblenz, Germany},
}

The K2 architecture for trustworthy hardware security modules Anish Athalye, M. Frans Kaashoek, Nickolai Zeldovich, and Joseph Tassarotti. KISV 2023.

K2 is a new architecture and verification approach for hardware security modules (HSMs). The K2 architecture's rigid separation between I/O, storage, and computation over secret state enables modular proofs and allows for software development and verification independent of hardware development and verification while still providing correctness and security guarantees about the composed system. For a key step of verification, K2 introduces a new tool called Chroniton that automatically proves timing properties of software running on a particular hardware implementation, ensuring the lack of timing side channels at a cycle-accurate level.

@inproceedings{k2:kisv23,
  title        = {The {K2} Architecture for Trustworthy Hardware
    Security Modules},
  author       = {Anish Athalye and M. Frans Kaashoek and Nickolai
    Zeldovich and Joseph Tassarotti},
  booktitle    = {1st {W}orkshop on {K}ernel {I}solation, {S}afety and
    {V}erification ({KISV} '23)},
  year         = 2023,
  month        = oct,
}

Evaluating sigmaos with kubernetes for orchestrating microservice and serverless applications Yizheng He. Master's thesis, Massachusetts Institute of Technology, September 2023.

SigmaOS is a new multi-tenant cloud operating system that simplifes distributed application development. Its design centers around the novel concepts of realms and procs. A realm presents a tenant with a shared global namespace that hides the machine boundaries. Tenants structure their applications as process-like procs interacting through the realm’s namespace. Procs are lightweight, stateful, and can communicate. SigmaOS manages the scheduling and execution of procs to achieve high resource utilization and performance isolation.

This thesis compares SigmaOS with Kubernetes, a mainstream cloud operating system, using a microservice-style social network website and a serverless image resizing program. It measures their performances on a small-scale cluster in CloudLab. The SigmaOS version of the social network is easier to build (30% fewer lines), and its image resizing starts faster (25% - 89%). SigmaOS performs comparably to Kubernetes regarding latency and resource consumption when running a single application but provides better performance isolation when running multiple applications in separate realms: latency increases by 4-11% with concurrent applications in SigmaOS versus over 150% in Kubernetes.

@mastersthesis{yizheng-he-ms-thesis,
  title        = {Evaluating SigmaOS with Kubernetes for Orchestrating
    Microservice and Serverless Applications},
  author       = {Yizheng He},
  school       = {Massachusetts Institute of Technology},
  year         = 2023,
  month        = sep,
}

One server for the price of two: Simple and fast single-server private information retrieval Alexandra Henzinger, Matthew M. Hong, Henry Corrigan-Gibbs, Sarah Meiklejohn, and Vinod Vaikuntanathan. USENIX Security 2023.

We present SimplePIR, the fastest single-server private information retrieval scheme known to date. SimplePIR’s security holds under the learning-with-errors assumption. To answer a client’s query, the SimplePIR server performs fewer than one 32-bit multiplication and one 32-bit addition per database byte. SimplePIR achieves 10 GB/s/core server throughput, which approaches the memory bandwidth of the machine and the performance of the fastest two-server private-information-retrieval schemes (which require non-colluding servers). SimplePIR has relatively large communication costs: to make queries to a 1 GB database, the client must download a 121 MB "hint" about the database contents; thereafter, the client may make an unbounded number of queries, each requiring 242 KB of communication. We present a second single-server scheme, DoublePIR, that shrinks the hint to 16 MB at the cost of slightly higher per-query communication (345 KB) and slightly lower throughput (7.4 GB/s/core). Finally, we apply our new private-information-retrieval schemes, together with a novel data structure for approximate set membership, to the task of private auditing in Certificate Transparency. We achieve a strictly stronger notion of privacy than Google Chrome’s current approach with modest communication overheads: 16 MB of download per month, along with 150 bytes per TLS connection.

@inproceedings{simplepir:usenixsec23,
  author       = {Alexandra Henzinger and Matthew M. Hong and Henry
    Corrigan-Gibbs and Sarah Meiklejohn and Vinod Vaikuntanathan},
  title        = {One Server for the Price of Two: Simple and Fast
    Single-Server Private Information Retrieval},
  booktitle    = {32nd USENIX Security Symposium (USENIX Security 23)},
  year         = 2023,
  address      = {Anaheim, CA},
  publisher    = {USENIX Association},
  month        = aug,
}

Verifying vMVCC, a high-performance transaction library using multi-version concurrency control Yun-Sheng Chang, Ralf Jung, Upamanyu Sharma, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zeldovich. OSDI 2023.

Multi-version concurrency control (MVCC) is a widely used, sophisticated approach for handling concurrent transactions. vMVCC is the first MVCC-based transaction library that comes with a machine-checked proof of correctness, providing clients with a guarantee that it will correctly handle all transactions despite a complicated design and implementation that might otherwise be error-prone. vMVCC is implemented in Go, stores data in memory, and uses several optimizations, such as RDTSC-based timestamps, to achieve high performance (25--96% the throughput of Silo, a state-of-the-art in-memory database, for YCSB and TPC-C workloads). Formally specifying and verifying vMVCC required adopting advanced proof techniques, such as logical atomicity and prophecy variables, owing to the fact that MVCC transactions can linearize at timestamp generation prior to transaction execution.

@inproceedings{vmvcc:osdi23,
  title        = {Verifying {vMVCC}, a high-performance transaction
    library using multi-version concurrency control},
  author       = {Yun-Sheng Chang and Ralf Jung and Upamanyu Sharma
    and Joseph Tassarotti and M. Frans Kaashoek and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 17th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '23)},
  year         = 2023,
  month        = jul,
}

Privilege-separating embedded applications using WebAssembly in the Plat FIDO2 security key Benjamin B. Kettle. Master's thesis, Massachusetts Institute of Technology, June 2023.

Plat is a FIDO2 security key that uses privilege separation to protect the application's private keys even if bugs are present in bug-prone parts of its codebase. Plat's design encapsulates drivers and parsers in sandboxes that are isolated from the secrets that are used to perform authentication.

To achieve privilege separation in the embedded context, Plat uses a new WebAssembly-based toolchain for ARM microcontrollers to implement and enforce isolation between individual components of an existing system without rewriting drivers and application code. This toolchain includes special support for device drivers, safely enabling isolated modules to access peripheral memory-mapped IO.

Plat's privilege-separation reduces the lines of code in the trusted code base by 60% from our 20,000-line reference implementation while adding only 319 new trusted lines. Plat's isolation strategy has acceptable performance overhead that does not prevent interactive use, with the slowest step of an authentication jumping from 277ms natively to 600ms when sandboxed.

Plat ensures the protection of its secret key, and thus the security of the accounts it authenticates, in the presence of several classes of bugs.

@mastersthesis{bkettle-meng,
  title        = {Privilege-separating embedded applications using
    {WebAssembly} in the {Plat} {FIDO2} security key},
  author       = {Benjamin B. Kettle},
  school       = {Massachusetts Institute of Technology},
  year         = 2023,
  month        = jun,
}

Leakage models are a leaky abstraction: the case for cycle-level verification of constant-time cryptography Anish Athalye, M. Frans Kaashoek, Nickolai Zeldovich, and Joseph Tassarotti. PLARCH 2023.

We propose abandoning leakage models for verifying timing properties of cryptographic software, instead directly verifying software with respect to a hardware implementation at the RTL level. Early experiments include verifying that an Ed25519 implementation running on a 6-stage pipelined processor executes in a constant number of cycles. Many challenges remain, including scaling up to modern out-of-order speculative cores and extending the approach to reason about library code outside the context of a whole application.

@inproceedings{chroniton:plarch23,
  title        = {Leakage models are a leaky abstraction: the case for
    cycle-level verification of constant-time cryptography},
  author       = {Anish Athalye and M. Frans Kaashoek and Nickolai
    Zeldovich and Joseph Tassarotti},
  booktitle    = {1st {W}orkshop on {P}rogramming {L}anguages and
    {C}omputer {A}rchitecture ({PLARCH} '23)},
  year         = 2023,
  month        = jun,
}

Verifying confidentiality under nondeterminism for storage systems Atalay Mert Ileri. Ph.D. thesis, Massachusetts Institute of Technology, February 2023.

Storage systems must often store confidential data for their users. It is important to ensure that confidentiality of the stored data is maintained in the presence of bugs and malicious adversaries. This thesis tackles this problem using formal verification, a technique that involves proving a software system always satisfy certain requirements.

There are numerous challenges in specifying what it means a system being confidential and proving that a system satisfies that specification: nondeterministic behavior, indirect leakage of the data, system complexity, and others. Nondeterminism in particular creates unique challenges by making probabilistic leakage possible. This dissertation introduces the following to address these challenges:

Two novel confidentiality specifications for storage systems with nondeterministic behavior: data nonleakage and relatively deterministic noninfluence. Both definitions accommodate discretionary access control and intentional disclosure of the system metadata.

Two techniques accompanying these specifications: sealed blocks and nondeterminism oracles. These techniques addressed the challenges encountered in proving the confidentiality of the systems, and also reduced the proof effort required for said proofs. These techniques are formalized and implemented in two frameworks: DiskSec and ConFrm. Both frameworks contain metatheory to help the developer to prove that their implementation satisfies the specification.

The first confidential, crash-safe, and formally verified file systems with machine-checkable proofs: SFSCQ and ConFs. SFSCQ uses data nonleakage and ConFs uses relatively deterministic noninfluence as their confidentiality specifications. Both are implemented and verified in Coq.

An evaluation shows relatively deterministic noninfluence has 9.2x proof overhead per line of implementation code. Experiments with multiple benchmarks show that our systems perform better compared to FSCQ verified file system but worse compared to ext4 file system.

@phdthesis{atalay:thesis,
  title        = {Verifying confidentiality under nondeterminism for
    storage systems},
  author       = {Atalay Mert Ileri},
  school       = {Massachusetts Institute of Technology},
  year         = 2023,
  month        = feb,
}

2022

Modular Verification of Distributed Systems with Grove Upamanyu Sharma. Master's thesis, Massachusetts Institute of Technology, September 2022.

Grove is a new framework for machine-checked verification of distributed systems. Grove focuses on modular verification. It enables developers to state and prove specifications for their components (e.g. an RPC library), and to use those specifications when proving the correctness of components that build on it (e.g. a key value service built on RPC).

To enable modular specification and verification in a distributed systems, Grove uses the idea of ownership from separation logic. Using Grove, we built a verified unreliable RPC library, where we captured unreliability in the formal specification by using duplicable ownership. We also built a verified exactly-once RPC library, where we reasoned about ownership transfer from the client to server (and back) over an unreliable network by using the escrow pattern.

Overall, we developed and verified an example system written in Go consisting of the RPC libraries, a sharded key-value store with support for dynamically adding new servers and rebalancing shards, a lock service, and a bank application that supports atomic transfers across accounts that live in different shards, built on top of these services. The key-value service scales well with the number of servers and the number of cores per server. The proofs are mechanized in the Coq proof assistant using the Iris library and Goose.

@mastersthesis{sharma-sm,
  title        = {{M}odular {V}erification of {D}istributed {S}ystems
    with {Grove}},
  author       = {Upamanyu Sharma},
  school       = {Massachusetts Institute of Technology},
  year         = 2022,
  month        = sep,
}

Verifying the DaisyNFS concurrent and crash-safe file system with sequential reasoning Tej Chajed, Joseph Tassarotti, Mark Theng, M. Frans Kaashoek, and Nickolai Zeldovich. OSDI 2022.

This paper develops a new approach to verifying a performant file system that isolates crash safety and concurrency reasoning to a transaction system that gives atomic access to the disk, so that the rest of the file system can be verified with sequential reasoning

We demonstrate this approach in DaisyNFS, a Network File System (NFS) server written in Go that runs on top of a disk. DaisyNFS uses GoTxn, a new verified, concurrent transaction system that extends GoJournal with two-phase locking and an allocator. The transaction system's specification formalizes under what conditions transactions can be verified with only sequential reasoning, and comes with a mechanized proof in Coq that connects the specification to the implementation.

As evidence that proofs enjoy sequential reasoning, DaisyNFS uses Dafny, a sequential verification language, to implement and verify all the NFS operations on top of GoTxn. The sequential proofs helped achieve a number of good properties in DaisyNFS: easy incremental development (for example, adding support for large files), a relatively short proof (only 2× as many lines of proof as code), and a performant implementation (at least 60% the throughput of the Linux NFS server exporting ext4 across a variety of benchmarks).

@inproceedings{daisy-nfs:osdi22,
  title        = {Verifying the {DaisyNFS} concurrent and crash-safe
    file system with sequential reasoning},
  author       = {Tej Chajed and Joseph Tassarotti and Mark Theng and
    M. Frans Kaashoek and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 16th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '22)},
  year         = 2022,
  month        = jul,
}

Verifying hardware security modules with information-preserving refinement Anish Athalye, M. Frans Kaashoek, and Nickolai Zeldovich. OSDI 2022.

Knox is a new framework that enables developers to build hardware security modules (HSMs) with high assurance through formal verification. The goal is to rule out all hardware bugs, software bugs, and timing side channels.

Knox’s approach is to relate an implementation’s wire-level behavior to a functional specification stated in terms of method calls and return values with a new definition called information-preserving refinement (IPR). This definition captures the notion that the HSM implements its functional specification, and that it leaks no additional information through its wire-level behavior. The Knox framework provides support for writing specifications, importing HSM implementations written in Verilog and C code, and proving IPR using a combination of lightweight annotations and interactive proofs.

To evaluate the IPR definition and the Knox framework, we verified three simple HSMs, including an RFC 6238-compliant TOTP token. The TOTP token is written in 2950 lines of Verilog and 360 lines of C and assembly. Its behavior is captured in a succinct specification: aside from the definition of the TOTP algorithm, the spec is only 10 lines of code. In all three case studies, verification covers entire hardware and software stacks and rules out hardware/software bugs and timing side channels.

@inproceedings{knox:osdi22,
  title        = {Verifying Hardware Security Modules with
    Information-Preserving Refinement},
  author       = {Anish Athalye and M. Frans Kaashoek and Nickolai
    Zeldovich},
  booktitle    = {Proceedings of the 16th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '22)},
  year         = 2022,
  month        = jul,
}

Verifying a concurrent, crash-safe file system with sequential reasoning Tej Chajed. Ph.D. thesis, Massachusetts Institute of Technology, May 2022.

Critical systems software such as the file system is challenging to make correct due to the combination of concurrency in the implementation for good performance and the requirement to preserve data even on crash, where the whole computer stops and reboots unexpectedly. To build reliable systems, this thesis extends formal verification — proving that a system always meets a mathematical specification of its behavior — to reason about the concurrency and crash safety.

The thesis applies the new verification techniques to the verification of DaisyNFS, a new concurrent file system. The file system is an important service of an operating system, because nearly all persistent data is ultimately stored in a file system and bugs can lead to permanent data loss in any application running on top. Another contribution of the thesis is a system design and verification techniques to scale verification to a system of the size and complexity of a file system. In particular, the file system is designed around a transaction system that addresses the core challenges of crashes and concurrency, so that the rest of the code can be verified with comparatively simpler sequential reasoning.

An evaluation of proof overhead finds that verification required 2× as many lines of proof as code for the sequential reasoning, compared to 20× for the crash safety and concurrency proofs. A performance evaluation finds that DaisyNFS achieves good performance compared to Linux NFS exporting ext4 over a range of benchmarks: at least 60% the throughput even on the most challenging concurrent benchmarks, and 90% on other workloads.

@phdthesis{tchajed:thesis,
  title        = {Verifying a concurrent, crash-safe file system with
    sequential reasoning},
  author       = {Tej Chajed},
  school       = {Massachusetts Institute of Technology},
  year         = 2022,
  month        = may,
}

𝜎OS: Elastic Realms for multi-tenant cloud computing Ariel Szekely. Master's thesis, Massachusetts Institute of Technology, May 2022.

Despite the enormous success of cloud computing, programming and deploying cloud applications remains challenging. Application developers are forced to either explicitly provision resources or limit the types of applications they write to fit a serverless framework such as AWS Lambda.

𝜎OS is a new multi-tenant cloud operating system that allows providers to manage resources for tenants while simplifying application development. A key contribution of 𝜎OS is its novel abstraction: realms. Realms present tenants with the illusion of a single-system image and abstract boundaries between physical machines. Developers structure their applications as processes, called procs in 𝜎OS. Much like a time-sharing OS multiplexes users’ processes across a machine’s cores, 𝜎OS multiplexes tenants’ procs across the cloud provider’s physical machines. Since each tenant tends to plan for peak load, realms can improve data center utilization by enabling providers to transparently reallocate partial machines to another tenant’s realm when load dips.

An evaluation of 𝜎OS demonstrates that a 𝜎OS-based MapReduce (𝜎OS-MR) implementation grows quickly from 1 core to 32 and scales near-perfectly achieving 15.26× speedup over the same implementation running on 2 cores. Similarly, an elastic Key-Value service built on 𝜎OS (𝜎OS-KV) cooperates with 𝜎OS to scale the number of kvd servers and balance shards across them, according to client load. 𝜎OS also achieves high resource utilization when multiple tenants’ realms compete for a shared group of machines. For example, when 𝜎OS multiplexes a long-running 𝜎OS-MR job in one realm and a 𝜎OS-KV service with varying numbers of clients in another realm, 𝜎OS keeps utilization above 90% and transparently moves partial machines between the realms as the 𝜎OS-KV client load changes

@mastersthesis{sigmaos:arielck-ms-thesis,
  title        = {{𝜎OS}: Elastic {Realms} for Multi-Tenant Cloud
    Computing},
  author       = {Ariel Szekely},
  school       = {Massachusetts Institute of Technology},
  year         = 2022,
  month        = may,
}

Vivid: An operating system kernel for radiation-tolerant flight control software Cel Skeggs. Master's thesis, Massachusetts Institute of Technology, May 2022.

This thesis considers the challenge of defending flight software from radiation errors without a radiation-hardened processor. A new real-time operating system, Vivid, explores the use of redundant multithreading to protect critical software components from radiation errors, and offers new abstractions to reduce the number of single points of vulnerability in the system. It introduces a static component initialization system for C, which eliminates most runtime initialization steps from the operating system and flight software. It introduces a partition scheduler based on execution clips, which ensures that software components always start from a safe state, and it protects the system’s safe state using a pair of memory scrubbers. Vivid introduces voting ducts, an inter-process communication primitive for redundant multithreading that eliminates single points of vulnerability from the voting process. Finally, it defines a sequence of repair that ultimately grounds the correct operation of all components in the system’s software in a hardware watchdog.

To demonstrate the applicability and effectiveness of Vivid, this thesis introduces Swivel, a testbench spacecraft, and describes SwivelFSW, which is the implementation of flight software that meets Swivel’s behavioral requirements, and SwivelSim, which is the simulation of Swivel’s avionics. Next, this thesis introduces Hailburst, a system for efficient processor emulation and radiation fault injection, and uses it to evaluate Vivid’s radiation tolerance through a series of accelerated radiation injection trials. In the tested configuration, Vivid tolerates approximately 149 out of every 150 injected radiation faults without any observed requirement failures, and recovers from the remaining 1 out of 150 radiation faults within at most 2.05 seconds of recovery time in the worst observed case. Because some of Vivid’s defenses appear to be more effective than others, and some may be counterproductive, this thesis discusses future work that would be required before Vivid’s abstractions could be applied to real-world flight software.

@mastersthesis{cela-meng,
  title        = {{Vivid}: An Operating System Kernel for
    Radiation-Tolerant Flight Control Software},
  author       = {Cel Skeggs},
  school       = {Massachusetts Institute of Technology},
  year         = 2022,
  month        = may,
}

GoTxn: Verifying a crash-safe, concurrent transaction system Mark Theng. Master's thesis, Massachusetts Institute of Technology, January 2022.

Bugs related to concurrency and crash safety are infamous for being subtle and hard to reproduce. Formal verification provides a way to combat such bugs through the use of machine-checked proofs about program behavior. However, reasoning about concurrency and crashes can be tricky, especially when scaling up to larger systems that must also have good performance.

This thesis discusses the verification of GoTxn, the concurrent, crash-safe transaction system underlying the verified Network File System (NFS) server DaisyNFS. It focuses on the specification and proof of the write-ahead log and the automatic two-phase locking interface used to enforce crash and concurrent atomicity in transactions, detailing how the verification framework Perennial can be used to manage assertions about crash behavior across multiple threads. By effectively harnessing concurrency to hide disk access latency, GoTxn enables performance in DaisyNFS similar to the unverified Linux NFS server.

@mastersthesis{mtheng-meng,
  title        = {{GoTxn}: Verifying a Crash-Safe, Concurrent
    Transaction System},
  author       = {Mark Theng},
  school       = {Massachusetts Institute of Technology},
  year         = 2022,
  month        = jan,
}

Understanding and improving the performance of mitigating transient execution attacks Jonathan Behrens. Ph.D. thesis, Massachusetts Institute of Technology, January 2022.

This thesis makes two contributions: (1) a measurement study of the performance evolution of mitigations against transient execution attacks over generations of processors, and (2) the Ward kernel design, which eliminates as much as half the overhead of mitigations on older processors.

The measurement study maps end-to-end overheads to the specific mitigations that cause them. It reveals that hardware fixes for several transient execution attacks have reduced overheads on OS heavy workloads by a factor of ten. However, overheads for JavaScript applications have remained roughly flat because they are caused by mitigations for attacks that even the most recent processors are still vulnerable to. Finally, the study shows that a few mitigations account for most performance costs.

Ward is a novel operating system architecture that is resilient to transient execution attacks, yet avoids expensive software mitigations that existing operating systems employ when running on pre-2018 processors. It leverages a new hardware/software contract termed the Unmapped Speculation Contract, which describes limits on the speculative behavior of processors.

@phdthesis{behrensj:thesis,
  title        = {Understanding and Improving the Performance of
    Mitigating Transient Execution Attacks},
  author       = {Jonathan Behrens},
  school       = {Massachusetts Institute of Technology},
  year         = 2022,
  month        = jan,
}

Efficient strong scaling through burst parallel training Seo Jin Park, Joshua Fried, Sunghyun Kim, Mohammad Alizadeh, and Adam Belay. MLSys 2022.

As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement to achieving acceptable training times. In this paper, we consider the case where future increases in cluster size will cause the global batch size that can be used to train models to reach a fundamental limit: beyond a certain point, larger global batch sizes cause sample efficiency to degrade, increasing overall time to accuracy. As a result, to achieve further improvements in training performance, we must instead consider "strong scaling" strategies that hold the global batch size constant and allocate smaller batches to each GPU. Unfortunately, this makes it significantly more difficult to use cluster resources efficiently. We present DeepPool, a system that addresses this efficiency challenge through two key ideas. First, burst parallelism allocates large numbers of GPUs to foreground jobs in bursts to exploit the unevenness in parallelism across layers. Second, GPU multiplexing prioritizes throughput for foreground training jobs, while packing in background training jobs to reclaim underutilized GPU resources, thereby improving cluster-wide utilization. Together, these two ideas enable DeepPool to deliver a 1.2-2.3× improvement in total cluster throughput over standard data parallelism with a single task when the cluster scale is large.

@inproceedings{deeppool:mlsys22,
  author       = {Seo Jin Park and Joshua Fried and Sunghyun Kim and
    Mohammad Alizadeh and Adam Belay},
  booktitle    = {Proceedings of {M}achine {L}earning and {S}ystems
    ({MLSys} '24)},
  pages        = {748--761},
  title        = {Efficient Strong Scaling Through Burst Parallel
    Training},
  volume       = 4,
  year         = 2022,
}

Single-server private information retrieval with sublinear amortized time Henry Corrigan-Gibbs, Alexandra Henzinger, and Dmitry Kogan. EUROCRYPT 2022.

We construct new private-information-retrieval protocols in the single-server setting. Our schemes allow a client to privately fetch a sequence of database records from a server, while the server answers each query in average time sublinear in the database size. Specifically, we introduce the first single-server private-information-retrieval schemes that have sublinear amortized server time, require sublinear additional storage, and allow the client to make her queries adaptively. Our protocols rely only on standard cryptographic assumptions (decision Diffie-Hellman, quadratic residuosity, learning with errors, etc.). They work by having the client first fetch a small "hint" about the database contents from the server. Generating this hint requires server time linear in the database size. Thereafter, the client can use the hint to make a bounded number of adaptive queries to the server, which the server answers in sub-linear time--yielding sublinear amortized cost. Finally, we give lower bounds proving that our most efficient scheme is optimal with respect to the trade-off it achieves between server online time and client storage.

@inproceedings{pir:eurocrypt22,
  title        = {Single-Server Private Information Retrieval with
    Sublinear Amortized Time},
  author       = {Corrigan-Gibbs, Henry and Henzinger, Alexandra and
    Kogan, Dmitry},
  booktitle    = {41st Annual International Conference on the Theory
    and Applications of Cryptographic Techniques ({EUROCRYPT} '22)},
  year         = 2022,
}

2021

GoJournal: a verified, concurrent, crash-safe journaling system Tej Chajed, Joseph Tassarotti, Mark Theng, Ralf Jung, M. Frans Kaashoek, and Nickolai Zeldovich. OSDI 2021.

The main contribution of this paper is GoJournal, a verified, concurrent journaling system that provides atomicity for storage applications, together with Perennial 2.0, a framework for formally specifying and verifying concurrent crash-safe systems. GoJournal’s goal is to bring the advantages of journaling for code to specs and proofs. Perennial 2.0 makes this possible by introducing several techniques to formalize GoJournal’s specification and to manage the complexity in the proof of GoJournal’s implementation. Lifting predicates and crash framing make the specification easy to use for developers, and logically atomic crash specifications allow for modular reasoning in GoJournal, making the proof tractable despite complex concurrency and crash interleavings.

GoJournal is implemented in Go, and Perennial is implemented in the Coq proof assistant. While verifying GoJournal, we found one serious concurrency bug, even though GoJournal has many unit tests. We built a functional NFSv3 server, called GoNFS, to use GoJournal. Performance experiments show that GoNFS provides similar performance (e.g., at least 90% throughput across several benchmarks on an NVMe disk) to Linux’s NFS server exporting an ext4 file system, suggesting that GoJournal is a competitive journaling system. We also verified a simple NFS server using GoJournal’s specs, which confirms that they are helpful for application verification: a significant part of the proof doesn’t have to consider concurrency and crashes.

@inproceedings{gojournal:osdi21,
  title        = {{GoJournal}: a verified, concurrent, crash-safe
    journaling system},
  author       = {Tej Chajed and Joseph Tassarotti and Mark Theng and
    Ralf Jung and M. Frans Kaashoek and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 15th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '21)},
  year         = 2021,
  month        = jul,
}

Privacy heroes need data disguises Lillian Tsai, Malte Schwarzkopf, and Eddie Kohler. HotOS XVIII 2021.

Providing privacy in complex, data-rich applications is hard. Deleting accounts, anonymizing an account's contributions, and other privacy-related actions may require the traversal and transformation of interwoven state in a relational database. Finding the affected data is already nontrivial, but privacy actions must additionally balance competing requirements, such as preserving data trails for legal reasons or allowing users to change their mind. We believe a systematic shared framework for specifying and implementing privacy transformations could simplify and empower applications. Our prototype, data disguising, supports fine-grained, nuanced, and useful policies that would be cumbersome to implement manually, including

@inproceedings{tslilyai:hotos21,
  author       = {Tsai, Lillian and Schwarzkopf, Malte and Kohler,
    Eddie},
  title        = {Privacy Heroes Need Data Disguises},
  month        = jun,
  year         = 2021,
  pages        = {112–118},
  booktitle    = {Proceedings of the 18th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS XVIII})},
  address      = {Ann Arbor, Michigan},
}

rtlv: push-button verification of software on hardware Noah Moroze, Anish Athalye, M. Frans Kaashoek, and Nickolai Zeldovich. CARRV 2021.

This paper presents rtlv, an approach for push-button formal verification of properties that involve software running on hardware for many cycles. For example, rtlv can be used to prove that executing boot code resets a processor's microarchitectural state to known deterministic values, while existing tools time out when attempting to verify such a property.

Two key ideas enable rtlv to handle reasoning about many cycles of circuit execution. First, rtlv uses hybrid symbolic execution to reason about a circuit with symbolic values while minimizing the complexity of symbolic expressions; this is achieved by compiling circuits to programs in the Rosette solver-aided language. Second, rtlv enables the development of reusable circuit-agnostic property checkers that have a performance hint interface, allowing developers to optimize verification performance while maintaining confidence that the proof is correct.

Using rtlv, formally verifying a state-clearing property for a small (1,300 flip-flop) RISC-V SoC takes only 1.3 seconds, while SymbiYosys, a popular open-source verification tool, is unable to finish within 12 hours. In another case study, rtlv scales to a larger 4,300 flip-flop RISC-V SoC where verifying this state-clearing property requires modeling over 20,000 cycles of software executing on hardware. Formal verification with rtlv helped us find and fix violations of the property in the baseline hardware, demonstrating that rtlv is useful for finding bugs.

@inproceedings{rtlv:carrv21,
  title        = {{rtlv}: push-button verification of software on
    hardware},
  author       = {Noah Moroze and Anish Athalye and M. Frans Kaashoek
    and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 5th {W}orkshop on {C}omputer
    {A}rchitecture {R}esearch with {RISC-V} ({CARRV} 2021)},
  year         = 2021,
  month        = jun,
}

Guarda: A web application firewall for WebAuthn transaction authentication Damian Barabonkov. Master's thesis, Massachusetts Institute of Technology, May 2021.

Transaction authentication is an attractive extension to two-factor authentication. It is proposed in the WebAuthn standard by the World-Wide-Web Consortium (W3C) as a mechanism to secure individual "high-risk" operations of a website via a hardware authenticator device. It defends against a stringent threat model where an adversary can modify or create HTTP requests between the user and the web service. Transaction authentication as defined by WebAuthn is not yet adopted in practice, partially because it requires intrusive web application changes.

This thesis presents Guarda, a firewall for integrating transaction authentication into a new or existing web service with relatively few code changes. The firewall intercepts all HTTP traffic sent to the web service, and based on the configuration, any requests deemed safe are proxied directly to the web service. All other requests are considered high-risk and are held back and validated using transaction authentication. Only if the validation passes are they also permitted to pass through to the web service.

This thesis uses the firewall approach to integrate transaction authentication into three web applications: a blogging site named Conduit, a WordPress admin panel named Calypso and a self-hosted Git service named Gogs. Compared to directly modifying them to support transaction authentication, the firewall approach is close to 8 times more concise. Under heavy load, there is an associated latency of at worst 1.5x slower when using Guarda to secure Gogs versus accessing the web service directly without WebAuthn.

@mastersthesis{barabonkov-meng,
  title        = {Guarda: A web application firewall for {WebAuthn}
    transaction authentication},
  author       = {Damian Barabonkov},
  school       = {Massachusetts Institute of Technology},
  year         = 2021,
  month        = may,
}

Kronos: Verifying leak-free reset for a system-on-chip with multiple clock domains Noah Moroze. Master's thesis, Massachusetts Institute of Technology, January 2021.

Notary uses formal verification to prove a hardware-level security property called deterministic start for a simple system-on-chip (SoC). Deterministic start requires that an SoC's state is fully reset by boot code to ensure that secrets cannot leak across reset boundaries. However, Notary's approach has several limitations. Its security property requires that all of the SoC's microarchitectural state can be reset to known values through software, and the property and proof technique apply only to SoCs with a single clock domain. These limitations prevent Notary's approach from being applied to more complex systems.

This thesis addresses these limitations through Kronos, a system consisting of a verified SoC that satisfies a new security property called output determinism. Output determinism provides the same security guarantees as Notary without requiring that all of an SoC's state be reset by software. The SoC used in Kronos, called MicroTitan, is based on the open-source OpenTitan and includes multiple clock domains. This thesis evaluates Kronos and demonstrates that existing open-source hardware can be modified to satisfy output determinism with minimal changes, and that the process of proving output determinism reveals hardware issues that violate desired security guarantees.

@mastersthesis{moroze-meng,
  title        = {Kronos: Verifying leak-free reset for a
    system-on-chip with multiple clock domains},
  author       = {Noah Moroze},
  school       = {Massachusetts Institute of Technology},
  year         = 2021,
  month        = jan,
}

2020

Overload control for us-scale RPCs with Breakwater Inho Cho, Ahmed Saeed, Joshua Fried, Seo Jin Park, Mohammad Alizadeh, and Adam Belay. OSDI 2020.

Modern datacenter applications are composed of hundreds of microservices with high degrees of fanout. As a result, they are sensitive to tail latency and require high request throughputs. Maintaining these characteristics under overload is difficult, especially for RPCs with short service times. In this paper, we consider the challenging case of microsecond-scale RPCs, where the cost of communicating information and dropping a request is similar to the cost of processing a request. We present Breakwater, an overload control scheme that can prevent overload in microsecond-scale services through a new, server-driven admission control scheme that issues credits based on server-side queueing delay. Breakwater contributes several techniques to amortize communication costs. It engages in demand speculation, where it assumes clients have unmet demand and issues additional credits when the server is not overloaded. Moreover, it piggybacks client-side demand information in RPC requests and credits in RPC responses. To cope with the occasional bursts in load caused by demand speculation, Breakwater drops requests when overloaded using active queue management. When clients’ demand spikes unexpectedly to 1.4x capacity, Breakwater converges to stable performance in less than 20 ms with no congestion collapse while DAGOR and SEDA take 500 ms and 1.58 s to recover from congestion collapse, respectively.

@inproceedings{breakwater:osdi20,
  title        = {Overload Control for us-scale {RPC}s with
    {Breakwater}},
  author       = {Inho Cho and Ahmed Saeed and Joshua Fried and Seo
    Jin Park and Mohammad Alizadeh and Adam Belay},
  booktitle    = {Proceedings of the 14th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '20)},
  year         = 2020,
  month        = nov,
}

Caladan: Mitigating interference at microsecond timescales Joshua Fried, Zhenyuan Ruan, Amy Ousterhout, and Adam Belay. OSDI 2020.

The conventional wisdom is that CPU resources such as cores, caches, and memory bandwidth must be partitioned to achieve performance isolation between tasks. Both the widespread availability of cache partitioning in modern CPUs and the recommended practice of pinning latency-sensitive applications to dedicated cores attest to this belief.

In this paper, we show that resource partitioning is neither necessary nor sufficient. Many applications experience bursty request patterns or phased behavior, drastically changing the amount and type of resources they need. Unfortunately, partitioning-based systems fail to react quickly enough to keep up with these changes, resulting in extreme spikes in latency and lost opportunities to increase CPU utilization.

Caladan is a new CPU scheduler that can achieve significantly better quality of service (tail latency, throughput, etc.) through a collection of control signals and policies that rely on fast core allocation instead of resource partitioning. Caladan consists of a centralized scheduler core that actively manages resource contention in the memory hierarchy and between hyperthreads, and a kernel module that bypasses the standard Linux Kernel scheduler to support microsecond-scale monitoring and placement of tasks. When colocating memcached with a best-effort, garbage-collected workload, Caladan outperforms Parties, a state-of-the-art resource partitioning system, by 11,000x, reducing tail latency from 580 ms to 52 µs during shifts in resource usage while maintaining high CPU utilization.

@inproceedings{caladan:osdi20,
  title        = {Caladan: Mitigating Interference at Microsecond
    Timescales},
  author       = {Joshua Fried and Zhenyuan Ruan and Amy Ousterhout
    and Adam Belay},
  booktitle    = {Proceedings of the 14th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '20)},
  year         = 2020,
  month        = nov,
}

Efficiently mitigating transient execution attacks using the unmapped speculation contract Jonathan Behrens, Anton Cao, Cel Skeggs, Adam Belay, M. Frans Kaashoek, and Nickolai Zeldovich. OSDI 2020.

Today’s kernels pay a performance penalty for mitigations—such as KPTI, retpoline, return stack stuffing, speculation barriers—to protect against transient execution side-channel attacks such as Meltdown and Spectre .

To address this performance penalty, this paper articulates the unmapped speculation contract, an observation that memory that isn’t mapped in a page table cannot be leaked through transient execution. To demonstrate the value of this contract, the paper presents Ward, a new kernel design that maintains a separate kernel page table for every process. This page table contains mappings for kernel memory that is safe to expose to that process. Because a process doesn’t map data of other processes, this design allows for many system calls to execute without any mitigation overhead. When a process needs access to sensitive data, Ward switches to a kernel page table that provides access to all of memory and executes with all mitigations.

An evaluation of the Ward design implemented in the sv6 research kernel shows that can execute many system calls without mitigations. For some hardware generations, this results in performance improvement ranging from a few percent (huge page fault) to several factors (getpid), compared to a standard design with mitigations.

@inproceedings{ward:osdi20,
  title        = {Efficiently Mitigating Transient Execution Attacks
    using the Unmapped Speculation Contract},
  author       = {Jonathan Behrens and Anton Cao and Cel Skeggs and
    Adam Belay and M. Frans Kaashoek and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 14th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '20)},
  year         = 2020,
  month        = nov,
}

AIFM: High-performance, application-integrated far memory Zhenyuan Ruan, Malte Schwarzkopf, Marcos K. Aguilera, and Adam Belay. OSDI 2020.

Memory is the most contended and least elastic resource in datacenter servers today. Applications can use only local memory---which may be scarce---even though memory might be readily available on another server. This leads to unnecessary killings of workloads under memory pressure and reduces effective server utilization.

We present application-integrated far memory (AIFM), which makes remote, "far" memory available to applications through a simple API and with high performance. AIFM achieves the same common-case access latency for far memory as for local RAM; it avoids read and write amplification that paging-based approaches suffer; it allows data structure engineers to build remoteable, hybrid near/far memory data structures; and it makes far memory transparent and easy to use for application developers.

Our key insight is that exposing application-level semantics to a high-performance runtime makes efficient remoteable memory possible. Developers use AIFM's APIs to make allocations remoteable, and AIFM's runtime handles swapping objects in and out, prefetching, and memory evacuation.

We evaluate AIFM with a prototypical web application frontend, a NYC taxi data analytics workload, a memcached-like key-value cache, and Snappy compression. Adding AIFM remoteable memory to these applications increases their available memory without performance penalty. AIFM outperforms Fastswap, a state-of-the-art kernel-integrated, paging-based far memory system by up to 61X.

@inproceedings{aifm:osdi20,
  title        = {{AIFM}: High-Performance, Application-Integrated Far
    Memory},
  author       = {Zhenyuan Ruan and Malte Schwarzkopf and Marcos K.
    Aguilera and Adam Belay},
  booktitle    = {Proceedings of the 14th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '20)},
  year         = 2020,
  month        = nov,
}

Partial state in dataflow-based materialized views Jon Gjengset. Ph.D. thesis, Massachusetts Institute of Technology, November 2020.

This thesis proposes a practical database system that lowers latency and increases supported load for read-heavy applications by using incrementally-maintained materialized views to cache query results. As opposed to state-of-the-art materialized view systems, the presented system builds the cache on demand, keeps it updated, and evicts cache entries in response to a shifting workload.

The enabling technique the thesis introduces is partially stateful materialization, which allows entries in materialized views to be missing. The thesis proposes upqueries as a mechanism to fill such missing state on demand using dataflow, and implements them in the materialized view system Noria. The thesis details issues that arise when dataflow updates and upqueries race with one another, and introduces mechanisms that uphold eventual consistency in the face of such races.

Partial materialization saves application developers from having to implement ad hoc caching mechanisms to speed up their database accesses. Instead, the database has transparent caching built in. Experimental results suggest that the presented system increases supported application load by up to 20x over MySQL and performs similarly to an optimized key-value store cache. Partial state also reduces memory use by up to 2/3 compared to traditional materialized views.

@phdthesis{jfrg:thesis,
  title        = {Partial State in Dataflow-Based Materialized Views},
  author       = {Jon Gjengset},
  school       = {Massachusetts Institute of Technology},
  year         = 2020,
  month        = nov,
}

Optimizations for performant multiverse databases Jacqueline M. Bredenberg. Master's thesis, Massachusetts Institute of Technology, May 2020.

Modern web applications store data in backend databases, and access it through a variety of frontend queries. User permissions are implemented by checks on those queries, but a maliciously injected (or simply buggy) query can easily leak private data. Multiverse databases attempt to prevent these data leaks by creating a separate view of the database contents (or "universe") for each user, and enforcing in the backend that this universe contains only data that the user is allowed to query. These views are precomputed and materialized using a streaming dataflow system so that queries return promptly.

This design is difficult to make efficient. A simple approach makes copies of data and operators for each universe, but state size that increases proportionally to the number of users quickly becomes impractical. In this work, we developed optimizations for multiverse dataflow graphs, which aim to reuse the same computations (i.e. dataflow subgraphs) in many different universes while maintaining security invariants.

We evaluate these optimizations in the context of the HotCRP and Piazza web applications. The resulting graphs are about 2x more space-efficient and 3x more computation-efficient than the naive ones. Graph size and processing time still scale linearly with the number of users, so our design may still not be efficient enough to be practical, but our optimizations make progress toward making multiverse databases a feasible solution to web application security.

@mastersthesis{bredenberg-meng,
  title        = {Optimizations for Performant Multiverse Databases},
  author       = {Jacqueline M. Bredenberg},
  school       = {Massachusetts Institute of Technology},
  year         = 2020,
  month        = may,
}

Waddle: A proven interpreter and test framework for a subset of the Go semantics Sydney Gibson. Master's thesis, Massachusetts Institute of Technology, May 2020.

Goose is a tool for importing programs written in a subset of Go into Coq. Proving properties in Coq about a translated Go program implies that these properties hold for the original Go program, assuming that Goose is correct. Testing the correctness of Goose is challenging, however, because the Coq translation, called GooseLang, is not an executable semantics.

This thesis presents Waddle, a testing framework for Goose. Waddle consists of an interpreter for GooseLang, which makes GooseLang programs executable, a proof that the interpreter implements GooseLang correctly, and a framework for running GooseLang programs through the interpreter and checking that they return the same result as their corresponding Goose program. This thesis evaluates Waddle as a test-driven development framework and as a bug finding tool, and describes several bugs that Waddle has found.

@mastersthesis{gibsons-meng,
  title        = {Waddle: A proven interpreter and test framework for
    a subset of the {Go} semantics},
  author       = {Sydney Gibson},
  school       = {Massachusetts Institute of Technology},
  year         = 2020,
  month        = may,
}

Notary: A device for secure transaction approval Anish Athalye, Adam Belay, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. ;login: The USENIX Magazine 45(1), March 2020.

Notary is a new design for a hardware wallet, a type of security token that is used to protect sensitive transactional operations like cryptocurrency transfers. Notary aims to be more secure than past hardware wallets by eliminating classes of bugs by design and by formally proving the correctness of the key operation used in its implementation. We built a physical prototype of Notary and showed that it achieves functionality similar to existing hardware wallets while avoiding many bugs that affect them.

@article{notary:login20,
  title        = {{Notary}: A Device for Secure Transaction Approval},
  author       = {Anish Athalye and Adam Belay and M. Frans Kaashoek
    and Robert Morris and Nickolai Zeldovich},
  year         = 2020,
  pages        = {12--16},
  volume       = 45,
  number       = 1,
  month        = mar,
  journal      = {{{;login: The USENIX Magazine}}},
}

Strong and scalable metadata security for voice calls David Lazar. Ph.D. thesis, Massachusetts Institute of Technology, February 2020.

This dissertation presents a scalable approach to protecting metadata (who is communicating with whom) in a communication system. The emphasis in this dissertation is on hiding metadata for voice calls, but the approach is applicable to any two-way communication between users.

Our approach is embodied in a new system named Yodel, the first system for voice calls that hides metadata from a powerful adversary that controls the network and compromises servers. Voice calls require sub-second message latency, but low latency has been difficult to achieve in prior work where processing each message requires an expensive public key operation at each hop in the network. Yodel avoids this expense with the idea of self-healing circuits, reusable paths through a mix network that use only fast symmetric cryptography. Once created, these circuits are resilient to passive and active attacks from global adversaries. Creating and connecting to these circuits without leaking metadata is another challenge that Yodel addresses with the idea of guarded circuit exchange, where each user creates a backup circuit in case an attacker tampers with their traffic. We evaluate Yodel across the internet and it achieves acceptable voice quality with 990 ms of latency for 5 million simulated users.

@phdthesis{lazard:thesis,
  title        = {Strong and Scalable Metadata Security for Voice
    Calls},
  author       = {David Lazar},
  school       = {Massachusetts Institute of Technology},
  year         = 2020,
  month        = feb,
}

2019

Yodel: Strong metadata security for voice calls David Lazar, Yossi Gilad, and Nickolai Zeldovich. SOSP 2019.

Yodel is the first system for voice calls that hides metadata (e.g. who is communicating with whom) from a powerful adversary that controls the network and compromises servers. Voice calls require sub-second message latency, but low latency has been difficult to achieve in prior work where processing each message requires an expensive public key operation at each hop in the network. Yodel avoids this expense with the idea of self-healing circuits, reusable paths through a mix network that use only fast symmetric cryptography. Once created, these circuits are resilient to passive and active attacks from global adversaries. Creating and connecting to these circuits without leaking metadata is another challenge that Yodel addresses with the idea of guarded circuit exchange, where each user creates a backup circuit in case an attacker tampers with their traffic. We evaluate Yodel across the internet and it achieves acceptable voice quality with 990ms of latency for 5 million simulated users.

@inproceedings{yodel:sosp19,
  title        = {{Yodel}: Strong Metadata Security for Voice Calls},
  author       = {David Lazar and Yossi Gilad and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 27th ACM Symposium on Operating
    Systems Principles (SOSP 2019)},
  year         = 2019,
  month        = oct,
  address      = {Hunstville, ON, Canada},
}

Notary: A device for secure transaction approval Anish Athalye, Adam Belay, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. SOSP 2019.

Notary is a new hardware and software architecture for running isolated approval agents in the form factor of a USB stick with a small display and buttons. Approval agents allow factoring out critical security decisions, such as getting the user’s approval to sign a Bitcoin transaction or to delete a backup, to a secure environment. The key challenge addressed by Notary is to securely switch between agents on the same device. Prior systems either avoid the problem by building single-function devices like a USB U2F key, or they provide weak isolation that is susceptible to kernel bugs, side channels, or Rowhammer-like attacks. Notary achieves strong isolation using reset-based switching, along with the use of physically separate systems-on-a-chip for agent code and for the kernel, and a machine-checked proof of both the hardware’s register-transfer-level design and software, showing that reset-based switching leaks no state. Notary also provides a trustworthy I/O path between the agent code and the user, which prevents an adversary from tampering with the user’s screen or buttons.

We built a hardware/software prototype of Notary, using a combination of ARM and RISC-V processors. The prototype demonstrates that it is feasible to verify Notary’s reset-based switching, and that Notary can support diverse agents, including cryptocurrencies and a transaction approval agent for traditional client-server applications such as websites. Measurements of reset-based switching show that it is fast enough for interactive use. We analyze security bugs in existing cryptocurrency hardware wallets, which aim to provide a similar form factor and feature set as Notary, and show that Notary’s design avoids many bugs that affect them.

@inproceedings{notary:sosp19,
  title        = {{Notary}: A Device for Secure Transaction Approval},
  author       = {Anish Athalye and Adam Belay and M. Frans Kaashoek
    and Robert Morris and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 27th ACM Symposium on Operating
    Systems Principles (SOSP 2019)},
  year         = 2019,
  month        = oct,
  address      = {Hunstville, ON, Canada},
}

Verifying concurrent, crash-safe systems with Perennial Tej Chajed, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zeldovich. SOSP 2019.

This paper introduces Perennial, a framework for verifying concurrent, crash-safe systems. Perennial extends the Iris concurrency framework with three techniques to enable crash-safety reasoning: recovery leases, recovery helping, and versioned memory. To ease development and deployment of applications, Perennial provides Goose, a subset of Go and a translator from that subset to a model in Perennial with support for reasoning about Go threads, data structures, and file-system primitives. We implemented and verified a crash-safe, concurrent mail server using Perennial and Goose that achieves speedup on multiple cores. Both Perennial and Iris use the Coq proof assistant, and the mail server and the framework’s proofs are machine checked.

@inproceedings{perennial:sosp19,
  title        = {Verifying Concurrent, Crash-safe Systems with
    {Perennial}},
  author       = {Tej Chajed and Joseph Tassarotti and M. Frans
    Kaashoek and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 27th ACM Symposium on Operating
    Systems Principles (SOSP 2019)},
  year         = 2019,
  month        = oct,
  address      = {Hunstville, ON, Canada},
}

Scalable fault tolerance for high-performance streaming dataflow Gina Yuan. Master's thesis, Massachusetts Institute of Technology, September 2019.

Streaming dataflow systems offer an appealing alternative to classic MySQL / memcached web backend stacks. But websites must not go down, and current fault tolerance techniques for dataflow systems either come with long downtimes during recovery, or fail to scale to large deployments due to the overhead of global coordination. For example, in the failure of a single dataflow node, existing lineage-based techniques take a long time to recompute all lost and downstream state, while checkpointing techniques require costly global coordination for rollback recovery.

This thesis presents a causal logging approach to fault tolerance that rolls back and replays the execution of only the failed node, without any global coordination. The key to knowing how to replay a valid execution while ensuring exactly-once semantics is a small, constant-size tree clock piggybacked onto each message, incurring runtime overheads that are low and scalable. After recovery, the state of the system is indistinguishable from one that never failed at all.

We implement and evaluate the protocol on Noria, a streaming dataflow backend for read-heavy web applications. Compared to Noria’s original protocol of lineage-based recovery, tree clock recovery time is constant in relation to state size and graph size. Experimental results show sub-second recovery times with 1.5ms runtime overheads, which translates to a 290x improvement in recovery time.

@mastersthesis{yuan-meng,
  title        = {Scalable Fault Tolerance for High-Performance
    Streaming Dataflow},
  author       = {Gina Yuan},
  school       = {Massachusetts Institute of Technology},
  year         = 2019,
  month        = sep,
}

GDPR compliance by construction Malte Schwarzkopf, Eddie Kohler, M. Frans Kaashoek, and Robert Morris. Poly 2019.

Web services today store and process ever-increasing amounts of our personal data on remote servers. This is true for large social networks like Facebook and Twitter, but also for smaller, special purpose web services run by small and medium-sized organizations. The organizations that operate these services are forced to spend significant resources to ensure that they comply with data protection laws, such as the European Union’s recent General Data Protection Regulation (GDPR). Compliance with this comprehensive legislation is costly and difficult, at least in part because the standard systems software that web-based applications rely on was never designed with user privacy in mind.

Our research seeks to understand this problem space – what widely-used abstractions are convenient for developers, and efficiently implemented, but hinder compliance? We are designing new, fundamentally privacy-centric computer systems abstractions that seek to achieve compliance with GDPR-like legislation by default.

@inproceedings{schwarzkopf2019gdprcbyc,
  author       = {Schwarzkopf, Malte and Kohler, Eddie and Kaashoek,
    M. Frans and Morris, Robert},
  title        = {{GDPR} Compliance by Construction},
  booktitle    = {Proceedings of the 2019 VLDB Workshop Towards
    Polystores that manage multiple Databases, Privacy, Security
    and/or Policy Issues for Heterogenous Data (Poly)},
  month        = aug,
  year         = 2019,
}

Argosy: Verifying layered storage systems with recovery refinement Tej Chajed, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zeldovich. PLDI 2019.

Storage systems make persistence guarantees even if the system crashes at any time, which they achieve using recovery procedures that run after a crash. We present Argosy, a framework for machine-checked proofs of storage systems that supports layered recovery implementations with modular proofs. Reasoning about layered recovery procedures is especially challenging because the system can crash in the middle of a more abstract layer's recovery procedure and must start over with the lowest-level recovery procedure.

This paper introduces recovery refinement, a set of conditions that ensure proper implementation of an interface with a recovery procedure. Argosy includes a proof that recovery refinements compose, using Kleene algebra for concise definitions and metatheory. We implemented Crash Hoare Logic, the program logic used by FSCQ, to prove recovery refinement, and demonstrated the whole system by verifying an example of layered recovery featuring a write-ahead log running on top of a disk replication system. The metatheory of the framework, the soundness of the program logic, and these examples are all verified in the Coq proof assistant.

@inproceedings{argosy:pldi19,
  title        = {Argosy: Verifying Layered Storage Systems with
    Recovery Refinement},
  author       = {Tej Chajed and Joseph Tassarotti and M. Frans
    Kaashoek and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 40th {ACM} {SIGPLAN} Conference
    on Programming Language Design and Implementation ({PLDI} 2019)},
  year         = 2019,
  month        = jun,
  address      = {Phoenix, Arizona},
}

The benefits and costs of writing a posix kernel in a high-level language Cody Cutler. Ph.D. thesis, Massachusetts Institute of Technology, June 2019.

This dissertation presents an evaluation of the use of a high-level language (HLL) with garbage collection to implement a monolithic POSIX-style kernel. The goal is to explore if it is reasonable to use an HLL instead of C for such kernels, by examining performance costs, implementation challenges, and programmability and safety benefits.

This dissertation contributes Biscuit, a kernel written in Go that implements enough of POSIX (virtual memory, mmap, TCP/IP sockets, a logging file system, poll, etc.) to execute significant applications. Biscuit makes liberal use of Go's HLL features (closures, channels, maps, interfaces, garbage collected heap allocation), which subjectively made programming easier. The most challenging puzzle was handling the possibility of running out of kernel heap memory; Biscuit benefited from the analyzability of Go source to address this challenge.

On a set of kernel-intensive benchmarks (including NGINX and Redis) the fraction of kernel CPU time Biscuit spends on HLL features (primarily garbage collection and thread stack expansion checks) ranges up to 13%. The longest single GC-related pause suffered by NGINX was 115 microseconds; the longest observed sum of GC delays to a complete NGINX client request was 582 microseconds. In experiments comparing nearly identical system call, page fault, and context switch code paths written in Go and C, the Go version was 5% to 15% slower.

@phdthesis{biscuit:thesis,
  title        = {The Benefits and Costs of Writing a POSIX Kernel in
    a High-Level Language},
  author       = {Cody Cutler},
  school       = {Massachusetts Institute of Technology},
  year         = 2019,
  month        = jun,
}

Towards multiverse databases Alana Marzoev, Lara Timbó Araújo, Malte Schwarzkopf, Samyukta Yagati, Eddie Kohler, Robert Morris, M. Frans Kaashoek, and Sam Madden. HotOS XVII 2019.

A multiverse database transparently presents each application user with a flexible, dynamic, and independent view of shared data. This transformed view of the entire database contains only information allowed by a centralized and easily-auditable privacy policy. By enforcing the privacy policy once, in the database, multiverse databases reduce programmer burden and eliminate many frontend bugs that expose sensitive data.

Multiverse databases' per-user transformations risk expensive queries if applied dynamically on reads, or impractical storage requirements if the database proactively materializes policy-compliant views. We propose an efficient design based on a joint dataflow across "universes" that combines global, shared computation and cached state with individual, per-user processing and state. This design, which supports arbitrary SQL queries and complex policies, imposes no performance overhead on read queries. Our early prototype supports thousands of parallel universes on a single server.

@inproceedings{multiversedb:hotos19,
  title        = {Towards Multiverse Databases},
  author       = {Alana Marzoev and Lara Timb\'{o} Ara\'{u}jo and
    Malte Schwarzkopf and Samyukta Yagati and Eddie Kohler and Robert
    Morris and M. Frans Kaashoek and Sam Madden},
  pages        = {88-95},
  month        = may,
  year         = 2019,
  booktitle    = {Proceedings of the 17th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS XVII})},
  address      = {Bertinoro, Italy},
}

MCQC: Extracting and optimizing formally verified code for systems programming Eleftherios Ioannidis, Frans Kaashoek, and Nickolai Zeldovich. NFM 2019.

MCQC is a compiler for extracting verified systems programs to low-level assembly, with no runtime or garbage collection requirements and an emphasis on performance. MCQC targets the Gallina functional language used in the Coq proof assistant. MCQC translates pure and recursive functions into C++17, while compiling monadic effectful functions to imperative C++ system calls. With a few memory and performance optimizations, MCQC combines verifiability with memory and runtime performance. By handling effectful and pure functions separately MCQC can generate executable verified code directly from Gallina, reducing the effort of implementing and executing verified systems.

@inproceedings{mcqc:nfm19,
  title        = {{MCQC}: Extracting and optimizing formally verified
    code for systems programming},
  author       = {Eleftherios Ioannidis and Frans Kaashoek and
    Nickolai Zeldovich},
  booktitle    = {Proceedings of the 11th {A}nnual {NASA} {F}ormal
    {M}ethods {S}ymposium ({NFM})},
  year         = 2019,
  month        = may,
  address      = {Houston, TX},
}

Conclave: secure multi-party computation on big data Nikolaj Volgushev, Malte Schwarzkopf, Ben Getchell, Mayank Varia, Andrei Lapets, and Azer Bestavros. EuroSys 2019.

Secure Multi-Party Computation (MPC) allows mutually distrusting parties to run joint computations without revealing private data. Current MPC algorithms scale poorly with data size, which makes MPC on "big data" prohibitively slow and inhibits its practical use.

Many relational analytics queries can maintain MPC's end-to-end security guarantee without using cryptographic MPC techniques for all operations. Conclave is a query compiler that accelerates such queries by transforming them into a combination of data-parallel, local cleartext processing and small MPC steps. When parties trust others with specific subsets of the data, Conclave applies new hybrid MPC-cleartext protocols to run additional steps outside of MPC and improve scalability further.

Our Conclave prototype generates code for cleartext processing in Python and Spark, and for secure MPC using the Sharemind and Obliv-C frameworks. Conclave scales to data sets between three and six orders of magnitude larger than state-of-the-art MPC frameworks support on their own. Thanks to its hybrid protocols and additional optimizations, Conclave also substantially outperforms SMCQL, the most similar existing system.

@inproceedings{conclave:eurosys19,
  title        = {{Conclave}: secure multi-party computation on big
    data},
  author       = {Nikolaj Volgushev and Malte Schwarzkopf and Ben
    Getchell and Mayank Varia and Andrei Lapets and Azer Bestavros},
  note         = {An extended version is available as arxiv:1902.06288
    [cs.CR]},
  booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
    2019)},
  year         = 2019,
  month        = apr,
  address      = {Dresden, Germany},
}

Noria: A new take on fast web application backends Jon Gjengset, Malte Schwarzkopf, Jonathan Behrens, Lara Timbó Araújo, Martin Ek, Eddie Kohler, M. Frans Kaashoek, and Robert Morris. ;login: The USENIX Magazine 44(1), March 2019.

Noria, first presented at OSDI’18, is a new web application back-end that delivers the same fast reads as an in-memory cache in front of the database, but without the application having to manage the cache. Even better, Noria still accepts SQL queries and allows changes to the queries without extra effort, just like a database. Noria performs well: it serves up to 14M requests per second on a single server, and supports a 5x higher load than carefully hand-tuned queries issued to MySQL.

@article{noria:login19,
  title        = {Noria: A New Take on Fast Web Application Backends},
  author       = {Jon Gjengset and Malte Schwarzkopf and Jonathan
    Behrens and Lara Timb{\'o} Ara{\'u}jo and Martin Ek and Eddie
    Kohler and M. Frans Kaashoek and Robert Morris},
  year         = 2019,
  pages        = {17--21},
  volume       = 44,
  number       = 1,
  month        = mar,
  journal      = {{{;login: The USENIX Magazine}}},
}

The benefits and costs of writing a POSIX kernel in a high-level language Cody Cutler, M. Frans Kaashoek, and Robert T. Morris. ;login: The USENIX Magazine 44(1), March 2019.

This article summarizes the main benefits and downsides of using a garbage-collected high-level language to build an operating system kernel with enough functionality to run Redis and Nginx with good performance.

@article{biscuit:login19,
  title        = {The benefits and costs of writing a {POSIX} kernel
    in a high-level language},
  author       = {Cody Cutler and M. Frans Kaashoek and Robert T.
    Morris},
  year         = 2019,
  pages        = {6--10},
  volume       = 44,
  number       = 1,
  month        = mar,
  journal      = {{{;login: The USENIX Magazine}}},
}

Shenango: Achieving high CPU efficiency for latency-sensitive datacenter workloads Amy Ousterhout, Joshua Fried, Jonathan Behrens, Adam Belay, and Hari Balakrishnan. NSDI 2019.

Datacenter applications demand microsecond-scale tail latencies and high request rates from operating systems, and most applications handle loads that have high variance over multiple timescales. Achieving these goals in a CPU-efficient way is an open problem. Because of the high overheads of today’s kernels, the best available solution to achieve microsecond-scale latencies is kernel-bypass networking, which dedicates CPU cores to applications for spin-polling the network card. But this approach wastes CPU: even at modest average loads, one must dedicate enough cores for the peak expected load.

Shenango achieves comparable latencies but at far greater CPU efficiency. It reallocates cores across applications at very fine granularity—every 5 µs—enabling cycles unused by latency-sensitive applications to be used productively by batch processing applications. It achieves such fast reallocation rates with (1) an efficient algorithm that detects when applications would benefit from more cores, and (2) a privileged component called the IOKernel that runs on a dedicated core, steering packets from the NIC and orchestrating core reallocations. When handling latency-sensitive applications, such as memcached, we found that Shenango achieves tail latency and throughput comparable to ZygOS, a state-of-the-art, kernel-bypass network stack, but can linearly trade latency-sensitive application throughput for batch processing application throughput, vastly increasing CPU efficiency.

@inproceedings{shenango:nsdi19,
  title        = {Shenango: Achieving High {CPU} Efficiency for
    Latency-sensitive Datacenter Workloads},
  author       = {Amy Ousterhout and Joshua Fried and Jonathan Behrens
    and Adam Belay and Hari Balakrishnan},
  booktitle    = {Proceedings of the 16th {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '19)},
  year         = 2019,
  month        = feb,
  address      = {Boston, MA},
}

Extracting and optimizing low-level bytecode from high-level verified Coq Eleftherios Ioannidis. Master's thesis, Massachusetts Institute of Technology, February 2019.

This document is an MEng thesis presenting MCQC, a compiler for extracting veri systems programs to low-level assembly, with no Runtime or Garbage Collection requirements and an emphasis on performance. MCQC targets the Gallina functional language used in the Coq proof assistant. MCQC translates pure and recursive functions into C++17, while compiling monadic effectful functions to imperative C++ system calls. With a series of memory and performance optimizations, MCQC combines verifiability with memory and runtime performance. By handling effectful and pure functions MCQC can generate executable code directly from Gallina and link it with trusted code, reducing the effort of implementing and executing verified systems.

@mastersthesis{elefthei-meng,
  title        = {Extracting and Optimizing Low-Level Bytecode from
    High-level Verified {Coq}},
  author       = {Eleftherios Ioannidis},
  school       = {Massachusetts Institute of Technology},
  year         = 2019,
  month        = feb,
}

2018

Noria: dynamic, partially-stateful data-flow for high-performance web applications Jon Gjengset, Malte Schwarzkopf, Jonathan Behrens, Lara Timbó Araújo, Martin Ek, Eddie Kohler, M. Frans Kaashoek, and Robert Morris. OSDI 2018.

We introduce partially-stateful data-flow, a new streaming data-flow model that supports eviction and reconstruction of data-flow state on demand. By avoiding state explosion and supporting live changes to the data-flow graph, this model makes data-flow viable for building long-lived, low-latency applications, such as web applications. Our implementation, Noria, simplifies the back-end infrastructure for read-heavy web applications while improving their performance.

A Noria application supplies a relational schema and a set of parameterized queries, which Noria compiles into a data-flow program that pre-computes results for reads and incrementally applies writes. Noria makes it easy to write high-performance applications without manual performance tuning or complex-to-maintain caching layers. Partial statefulness helps Noria limit its in-memory state without prior data-flow systems' restriction to windowed state, and helps Noria adapt its data-flow to schema and query changes while on-line. Unlike prior data-flow systems, Noria also shares state and computation across related queries, eliminating duplicate work.

On a real web application's queries, our prototype scales to 5x higher load than a hand-optimized MySQL baseline. Noria also outperforms a typical MySQL/memcached stack and the materialized views of a commercial database. It scales to tens of millions of reads and millions of writes per second over multiple servers, outperforming a state-of-the-art streaming data-flow system.

@inproceedings{noria:osdi18,
  title        = {Noria: dynamic, partially-stateful data-flow for
    high-performance web applications},
  author       = {Jon Gjengset and Malte Schwarzkopf and Jonathan
    Behrens and Lara Timb{\'o} Ara{\'u}jo and Martin Ek and Eddie
    Kohler and M. Frans Kaashoek and Robert Morris},
  year         = 2018,
  pages        = {213--231},
  booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)},
  month        = oct,
  address      = {Carlsbad, California},
}

Karaoke: Distributed private messaging immune to passive traffic analysis David Lazar, Yossi Gilad, and Nickolai Zeldovich. OSDI 2018.

Karaoke is a system for low-latency metadata-private communication. Karaoke provides differential privacy guarantees, and scales better with the number of users than prior such systems (Vuvuzela and Stadium). Karaoke achieves high performance by addressing two challenges faced by prior systems. The first is that differential privacy requires continuously adding noise messages, which leads to high overheads. Karaoke avoids this using optimistic indistinguishability: in the common case, Karaoke reveals no information to the adversary, and Karaoke clients can detect precisely when information may be revealed (thus requiring less noise). The second challenge lies in generating sufficient noise in a distributed system where some nodes may be malicious. Prior work either required each server to generate enough noise on its own, or used expensive verifiable shuffles to prevent any message loss. Karaoke achieves high performance using efficient noise verification, generating noise across many servers and using Bloom filters to efficiently check if any noise messages have been discarded. These techniques allow our prototype of Karaoke to achieve a latency of 6.8 seconds for 2M users. Overall, Karaoke’s latency is 5x to 10x better than Vuvuzela and Stadium.

@inproceedings{karaoke:osdi18,
  title        = {Karaoke: Distributed Private Messaging Immune to
    Passive Traffic Analysis},
  author       = {David Lazar and Yossi Gilad and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)},
  year         = 2018,
  month        = oct,
  address      = {Carlsbad, California},
}

Proving confidentiality in a file system using DiskSec Atalay Ileri, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. OSDI 2018.

SFSCQ is the first file system with a machine-checked proof of security. To develop, specify, and prove SFSCQ, this paper introduces DiskSec, a novel approach for reasoning about confidentiality of storage systems, such as a file system. DiskSec addresses the challenge of specifying confidentiality using the notion of data noninterference to find a middle ground between strong and precise information-flow-control guarantees and the weaker but more practical discretionary access control. DiskSec factors out reasoning about confidentiality from other properties (such as functional correctness) using a notion of sealed blocks. Sealed blocks enforce that the file system treats confidential file blocks as opaque in the bulk of the code, greatly reducing the effort of proving data noninterference. An evaluation of SFSCQ shows that its theorems preclude security bugs that have been found in real file systems, that DiskSec imposes little performance overhead, and that SFSCQ's incremental development effort, on top of DiskSec and DFSCQ, on which it is based, is moderate.

@inproceedings{disksec:osdi18,
  title        = {Proving confidentiality in a file system using
    {DiskSec}},
  author       = {Atalay Ileri and Tej Chajed and Adam Chlipala and M.
    Frans Kaashoek and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)},
  year         = 2018,
  month        = oct,
  address      = {Carlsbad, California},
}

Verifying concurrent software using movers in CSPEC Tej Chajed, M. Frans Kaashoek, Butler Lampson, and Nickolai Zeldovich. OSDI 2018.

Writing concurrent systems software is error-prone, because multiple processes or threads can interleave in many ways, and it is easy to forget about a subtle corner case. This paper introduces CSPEC, a framework for formal verification of concurrent software, which ensures that no corner cases are missed. The key challenge is to reduce the number of interleavings that developers must consider. CSPEC uses mover types to re-order commutative operations so that usually it's enough to reason about only sequential executions rather than all possible interleavings. CSPEC also makes proofs easier by making them modular using layers, and by providing a library of reusable proof patterns. To evaluate CSPEC, we implemented and proved the correctness of CMAIL, a simple concurrent Maildir-like mail server that speaks SMTP and POP3. The results demonstrate that CSPEC's movers and patterns allow reasoning about sophisticated concurrency styles in CMAIL.

@inproceedings{cspec:osdi18,
  title        = {Verifying concurrent software using movers in
    {CSPEC}},
  author       = {Tej Chajed and M. Frans Kaashoek and Butler Lampson
    and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)},
  year         = 2018,
  month        = oct,
  address      = {Carlsbad, California},
}

The benefits and costs of writing a POSIX kernel in a high-level language Cody Cutler, M. Frans Kaashoek, and Robert T. Morris. OSDI 2018.

This paper presents an evaluation of the use of a high-level language (HLL) with garbage collection to implement a monolithic POSIX-style kernel. The goal is to explore if it is reasonable to use an HLL instead of C for such kernels, by examining performance costs, implementation challenges, and programmability and safety benefits.

The paper contributes Biscuit, a kernel written in Go that implements enough of POSIX (virtual memory, mmap, TCP/IP sockets, a logging file system, poll, etc.) to execute significant applications. Biscuit makes liberal use of Go's HLL features (closures, channels, maps, interfaces, garbage collected heap allocation), which subjectively made programming easier. The most challenging puzzle was handling the possibility of running out of kernel heap memory; Biscuit benefited from the analyzability of Go source to address this challenge.

On a set of kernel-intensive benchmarks (including NGINX and Redis) the fraction of kernel CPU time Biscuit spends on HLL features (primarily garbage collection and thread stack expansion checks) ranges up to 13%. The longest single GC-related pause suffered by NGINX was 115 microseconds; the longest observed sum of GC delays to a complete NGINX client request was 600 microseconds. In experiments comparing nearly identical system call, page fault, and context switch code paths written in Go and C, the Go version was 5% to 15% slower.

@inproceedings{biscuit:osdi18,
  title        = {The benefits and costs of writing a {POSIX} kernel
    in a high-level language},
  author       = {Cody Cutler and M. Frans Kaashoek and Robert T.
    Morris},
  year         = 2018,
  pages        = {89--105},
  booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '18)},
  month        = oct,
  address      = {Carlsbad, California},
}

A revised and verified proof of the scalable commutativity rule Lillian Tsai, Eddie Kohler, M. Frans Kaashoek, and Nickolai Zeldovich. MIT CSAIL technical report, September 2018.

This paper explains a flaw in the published proof of the Scalable Commutativity Rule (SCR), presents a revised and formally verified proof of the SCR in the Coq proof assistant, and discusses the insights and open questions raised from our experience proving the SCR.

@techreport{tsai-scr-proof,
  title        = {A Revised and Verified Proof of the Scalable
    Commutativity Rule},
  author       = {Lillian Tsai and Eddie Kohler and M. Frans Kaashoek
    and Nickolai Zeldovich},
  year         = 2018,
  month        = sep,
  institution  = {{MIT} Computer Science and Artificial Intelligence
    Laboratory},
  howpublished = {arXiv:1809.09550},
}

Multiverse databases for secure web applications Lara Timbó Araújo. Master's thesis, Massachusetts Institute of Technology, February 2018.

Most modern web applications authenticate users and enforce security policies in the ap- plication logic. Therefore, buggy applications can easily leak sensitive data. MultiverseDB addresses this problem in a new database architecture, where each user has her own private view of the database and declarative security policies restrict data-flow into a user’s private universe.

To support multi-user universes, MultiverseDB builds on ideas of streaming data-flow systems and low-overhead materialized views. When a new user session starts, the system creates data-flow nodes to support the user queries and automatically inserts special nodes to enforce security policies at universe boundaries. MultiverseDB provides fast reads by storing the pre-computed results of user queries with policies already applied in incremen- tally maintained materialized views. To reduce space overheads created by these views and avoid redundant processing, MultiverseDB reuses views and allows system administrators to specify security groups for users subjected to the same security policies.

@mastersthesis{larat-meng,
  title        = {Multiverse Databases for Secure Web Applications},
  author       = {Lara Timb\'{o} Ara\'{u}jo},
  school       = {Massachusetts Institute of Technology},
  year         = 2018,
  month        = feb,
}

Veil: Private browsing semantics without browser-side assistance Frank Wang, James Mickens, and Nickolai Zeldovich. NDSS 2018.

All popular web browsers offer a “private browsing mode.” After a private session terminates, the browser is supposed to remove client-side evidence that the session occurred. Unfortunately, browsers still leak information through the file system, the browser cache, the DNS cache, and on-disk reflections of RAM such as the swap file.

Veil is a new deployment framework that allows web developers to prevent these information leaks, or at least reduce their likelihood. Veil leverages the fact that, even though developers do not control the client-side browser implementation, developers do control 1) the content that is sent to those browsers, and 2) the servers which deliver that content. Veil web sites collectively store their content on Veil’s blinding servers instead of on individual, site-specific servers. To publish a new page, developers pass their HTML, CSS, and JavaScript files to Veil’s compiler; the compiler transforms the URLs in the content so that, when the page loads on a user’s browser, URLs are derived from a secret user key. The blinding service and the Veil page exchange encrypted data that is also protected by the user’s key. The result is that Veil pages can safely store encrypted content in the browser cache; furthermore, the URLs exposed to system interfaces like the DNS cache are unintelligible to attackers who do not possess the user’s key. To protect against post-session inspection of swap file artifacts, Veil uses heap walking (which minimizes the likelihood that secret data is paged out), content mutation (which garbles in-memory artifacts if they do get swapped out), and DOM hiding (which prevents the browser from learning site-specific HTML, CSS, and JavaScript content in the first place). Veil pages load on unmodified commodity browsers, allowing developers to provide stronger semantics for private browsing without forcing users to install or reconfigure their machines. Veil provides these guarantees even if the user does not visit a page using a browser’s native privacy mode; indeed, Veil’s protections are stronger than what the browser alone can provide.

@inproceedings{veil:ndss18,
  title        = {Veil: Private Browsing Semantics Without
    Browser-side Assistance},
  author       = {Frank Wang and James Mickens and Nickolai Zeldovich},
  booktitle    = {Proceedings of the {N}etwork and {D}istributed
    {S}ystem {S}ecurity {S}ymposium ({NDSS} '18)},
  year         = 2018,
  month        = feb,
  address      = {San Diego, California},
}

2017

Algorand: Scaling byzantine agreements for cryptocurrencies Yossi Gilad, Rotem Hemo, Silvio Micali, Georgios Vlachos, and Nickolai Zeldovich. SOSP 2017.

Algorand is a new cryptocurrency that confirms transactions with latency on the order of a minute while scaling to many users. Algorand ensures that users never have divergent views of confirmed transactions, even if some of the users are malicious and the network is temporarily partitioned. In contrast, existing cryptocurrencies allow for temporary forks and therefore require a long time, on the order of an hour, to confirm transactions with high confidence.

Algorand uses a new Byzantine Agreement (BA) protocol to reach consensus among users on the next set of transactions. To scale the consensus to many users, Algorand uses a novel mechanism based on Verifiable Random Functions that allows users to privately check whether they are selected to participate in the BA to agree on the next set of transactions, and to include a proof of their selection in their network messages. In Algorand's BA protocol, users do not keep any private state except for their private keys, which allows Algorand to replace participants immediately after they send a message. This mitigates targeted attacks on chosen participants after their identity is revealed.

We implement Algorand and evaluate its performance on 1,000 EC2 virtual machines, simulating up to 500,000 users. Experimental results show that Algorand confirms transactions in under a minute, achieves 125× Bitcoin's throughput, and incurs almost no penalty for scaling to more users.

@inproceedings{algorand:sosp17,
  title        = {Algorand: Scaling Byzantine Agreements for
    Cryptocurrencies},
  author       = {Yossi Gilad and Rotem Hemo and Silvio Micali and
    Georgios Vlachos and Nickolai Zeldovich},
  pages        = {51--68},
  booktitle    = {Proceedings of the 26th ACM Symposium on Operating
    Systems Principles (SOSP 2017)},
  year         = 2017,
  month        = oct,
  address      = {Shanghai, China},
}

Stadium: A distributed metadata-private messaging system Nirvan Tyagi, Yossi Gilad, Derek Leung, Matei Zaharia, and Nickolai Zeldovich. SOSP 2017.

Private communication over the Internet remains a challenging problem. Even if messages are encrypted, it is hard to deliver them without revealing metadata about which pairs of users are communicating. Scalable anonymity systems, such as Tor, are susceptible to traffic analysis attacks that leak metadata. In contrast, the largest-scale systems with metadata privacy require passing all messages through a small number of providers, requiring a high operational cost for each provider and limiting their deployability in practice.

This paper presents Stadium, a point-to-point messaging system that provides metadata and data privacy while scaling its work efficiently across hundreds of low-cost providers operated by different organizations. Much like Vuvuzela, the current largest-scale metadata-private system, Stadium achieves its provable guarantees through differential privacy and the addition of noisy cover traffic. The key challenge in Stadium is limiting the information revealed from the many observable traffic links of a highly distributed system, without requiring an overwhelming amount of noise. To solve this challenge, Stadium introduces techniques for distributed noise generation and differentially private routing as well as a verifiable parallel mixnet design where the servers collaboratively check that others follow the protocol. We show that Stadium can scale to support 4× more users than Vuvuzela using servers that cost an order of magnitude less to operate than Vuvuzela nodes.

@inproceedings{stadium:sosp17,
  title        = {Stadium: A Distributed Metadata-Private Messaging
    System},
  author       = {Nirvan Tyagi and Yossi Gilad and Derek Leung and
    Matei Zaharia and Nickolai Zeldovich},
  pages        = {423--440},
  booktitle    = {Proceedings of the 26th ACM Symposium on Operating
    Systems Principles (SOSP 2017)},
  year         = 2017,
  month        = oct,
  address      = {Shanghai, China},
}

Verifying a high-performance crash-safe file system using a tree specification Haogang Chen, Tej Chajed, Alex Konradi, Stephanie Wang, Atalay Ileri, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. SOSP 2017.

DFSCQ is the first file system that (1) provides a precise specification for fsync and fdatasync, which allow applications to achieve high performance and crash safety, and (2) provides a machine-checked proof that its implementation meets this specification. DFSCQ's specification captures the behavior of sophisticated optimizations, including log-bypass writes, and DFSCQ's proof rules out some of the common bugs in file-system implementations despite the complex optimizations.

The key challenge in building DFSCQ is to write a specification for the file system and its internal implementation without exposing internal file-system details. DFSCQ introduces a metadata-prefix specification that captures the properties of fsync and fdatasync, which roughly follows the behavior of Linux ext4. This specification uses a notion of tree sequences—logical sequences of file-system tree states—for succinct description of the possible states after a crash and to describe how data writes can be reordered with respect to metadata updates. This helps application developers prove the crash safety of their own applications, avoiding application-level bugs such as forgetting to invoke fsync on both the file and the containing directory.

An evaluation shows that DFSCQ achieves 103 MB/s on large file writes to an SSD and durably creates small files at a rate of 1,618 files per second. This is slower than Linux ext4 (which achieves 295 MB/s for large file writes and 4,977 files/s for small file creation) but much faster than two recent verified file systems, Yggdrasil and FSCQ. Evaluation results from application-level benchmarks, including TPC-C on SQLite, mirror these microbenchmarks.

@inproceedings{dfscq:sosp17,
  title        = {Verifying a high-performance crash-safe file system
    using a tree specification},
  author       = {Haogang Chen and Tej Chajed and Alex Konradi and
    Stephanie Wang and Atalay Ileri and Adam Chlipala and M. Frans
    Kaashoek and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 26th ACM Symposium on Operating
    Systems Principles (SOSP 2017)},
  year         = 2017,
  month        = oct,
  address      = {Shanghai, China},
}

Scaling a file system to many cores using an operation log Srivatsa S. Bhat, Rasha Eqbal, Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. SOSP 2017.

It is challenging to simultaneously achieve multicore scalability and high disk throughput in a file system. For example, even for commutative operations like creating different files in the same directory, current file systems introduce cache-line conflicts when updating an in-memory copy of the on-disk directory block, which limits scalability.

ScaleFS is a novel file system design that decouples the in-memory file system from the on-disk file system using per-core operation logs. This design facilitates the use of highly concurrent data structures for the in-memory representation, which allows commutative operations to proceed without cache conflicts and hence scale perfectly. ScaleFS logs operations in a per-core log so that it can delay propagating updates to the disk representation (and the cache-line conflicts involved in doing so) until an fsync. The fsync call merges the per-core logs and applies the operations to disk. ScaleFS uses several techniques to perform the merge correctly while achieving good performance: timestamped linearization points to order updates without introducing cache-line conflicts, absorption of logged operations, and dependency tracking across operations.

Experiments with a prototype of ScaleFS show that its implementation has no cache conflicts for 99% of test cases of commutative operations generated by Commuter, scales well on an 80-core machine, and provides on-disk performance that is comparable to that of Linux ext4.

@inproceedings{scalefs:sosp17,
  title        = {Scaling a file system to many cores using an
    operation log},
  author       = {Srivatsa S. Bhat and Rasha Eqbal and Austin T.
    Clements and M. Frans Kaashoek and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 26th ACM Symposium on Operating
    Systems Principles (SOSP 2017)},
  year         = 2017,
  month        = oct,
  address      = {Shanghai, China},
}

Quboid: A workstation for safer web interaction Amol M. Bhave. Master's thesis, Massachusetts Institute of Technology, September 2017.

As more of the world moves towards online technologies, users are exposed to the increasing threat of cyberattacks. Studies show that most of these attacks begin with a phishing attack. Phishing emails and websites may compromise user credentials or download unsolicited and malicious software.

This thesis presents the design and implementation of Quboid, a workstation for safer web interaction. Quboid helps users better defend against phishing attacks by providing several security mechanisms. The design of Quboid is based on the principle of isolation and restricted communication. The system enforces isolation by using virtualization to restrict browser instances to show different websites in separate virtual machines. For example, Quboid isolates a user’s bank website and social networking website in separate VMs. It uses deep-packet inspection to implement a HTTP/HTTPS proxy filter to ensure virtual machines only communicate with specific web servers. It also provides users with a secure interface and provides cues to help them recognize phishing attacks.

@mastersthesis{amol-meng,
  title        = {{Quboid}: A workstation for safer Web interaction},
  author       = {Amol M. Bhave},
  school       = {Massachusetts Institute of Technology},
  year         = 2017,
  month        = sep,
}

The scalable commutativity rule: Designing scalable software for multicore processors Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler. CACM 60(8), August 2017.

Developing software that scales on multicore processors is an inexact science dominated by guesswork, measure- ment, and expensive cycles of redesign and reimplementation. Current approaches are workload-driven and, hence, can reveal scalability bottlenecks only for known workloads and available software and hardware. This paper introduces an interface-driven approach to building scalable software. This approach is based on the scalable commutativity rule, which, informally stated, says that whenever interface operations commute, they can be implemented in a way that scales. We formalize this rule and prove it correct for any machine on which conflict-free operations scale, such as current cache-coherent multicore machines. The rule also enables a better design process for scalable software: programmers can now reason about scalability from the earli- est stages of interface definition through software design, implementation, and evaluation.

@article{commutativity:cacm,
  title        = {The Scalable Commutativity Rule: Designing Scalable
    Software for Multicore Processors},
  author       = {Austin T. Clements and M. Frans Kaashoek and
    Nickolai Zeldovich and Robert T. Morris and Eddie Kohler},
  journal      = {CACM},
  volume       = 60,
  number       = 8,
  month        = aug,
  year         = 2017,
}

Performance Optimization of the VDFS Verified File System Alex Konradi. Master's thesis, Massachusetts Institute of Technology, June 2017.

Formal verification of software has become a powerful tool for creating software systems and proving their correctness. While such systems provide strong guarantees about their behavior, they frequently exhibit poor performance relative to their unverified counterparts. Verified file systems are not excepted, and their poor performance limits their utility. These limitations, however, are not intrinsic to verification techniques, but are the result of designing for proofs, not performance.

This thesis proposes a design for large files and in-memory caches that are amenable both to a high-performance implementation and proofs of correctness. It then describes their usage in VDFS, a verified high-performance file system with deferred durability guarantees. The evaluation of VDFS’ performance shows that these additions measurably improve performance over previous verified file systems, and make VDFS competitive with unverified file system implementations. This thesis contributes proven implementation techniques for large files and in-memory caches that can be applied to increase performance of verified systems.

@mastersthesis{akonradi-meng,
  title        = {Performance {Optimization} of the {VDFS} {Verified}
    {File} {System}},
  author       = {Alex Konradi},
  school       = {Massachusetts Institute of Technology},
  year         = 2017,
  month        = jun,
}

Compiling Gallina to Go for the FSCQ file system Daniel Ziegler. Master's thesis, Massachusetts Institute of Technology, June 2017.

Over the last decade, systems software verification has become increasingly practical. Many verified systems have been written in the language of a proof assistant, proved correct, and then made runnable using code extraction. However, due to the rigidity of extraction and the overhead of the target languages, the resulting code’s CPU performance can suffer, with limited opportunity for optimization. This thesis contributes CoqGo, a proof-producing compiler from Coq's Gallina language to Go. We created Go′, a stylized semantics of Go which enforce linearity, and implemented proofproducing compilation tactics from Gallina to Go′ plus a straightforward translation from Go′ to Go. Applying a prototype of CoqGo, we compiled a system call in the FSCQ file system, with minimal changes to FSCQ's source code. Taking advantage of the increased control given by CoqGo, we implemented three optimizations, bringing the system call’s CPU performance to 19% faster than the extracted version.

@mastersthesis{ziegler-meng,
  title        = {Compiling {Gallina} to {Go} for the {FSCQ} File
    System},
  author       = {Daniel Ziegler},
  school       = {Massachusetts Institute of Technology},
  year         = 2017,
  month        = jun,
}

CoqIOA: A formalization of IO Automata in the Coq Proof Assistant Anish Athalye. Master's thesis, Massachusetts Institute of Technology, June 2017.

Implementing distributed systems correctly is difficult. Designing correct distributed systems protocols is challenging because designs must account for concurrent operation and handle network and machine failures. Implementing these protocols is challenging as well: it is difficult to avoid subtle bugs in implementations of complex protocols. Formal verification is a promising approach to ensuring distributed systems are free of bugs, but verification is challenging and time-consuming. Unfortunately, current approaches to mechanically verifying distributed systems in proof assistants using deductive verification do not allow for modular reasoning, which could greatly reduce the effort required to implement verified distributed systems by enabling reuse of code and proofs.

This thesis presents CoqIOA, a framework for reasoning about distributed systems in a compositional way. CoqIOA builds on the theory of input/output automata to support specification, proof, and composition of systems within the proof assistant. The framework's implementation of the theory of IO automata, including refinement, simulation relations, and composition, are all machine-checked in the Coq proof assistant. An evaluation of CoqIOA demonstrates that the framework enables compositional reasoning about distributed systems within the proof assistant.

@mastersthesis{aathalye-meng,
  title        = {{CoqIOA}: A Formalization of {IO} {Automata} in the
    {Coq} {Proof} {Assistant}},
  author       = {Anish Athalye},
  school       = {Massachusetts Institute of Technology},
  year         = 2017,
  month        = jun,
}

Designing multicore scalable filesystems with durability and crash consistency Srivatsa S. Bhat. Master's thesis, Massachusetts Institute of Technology, June 2017.

It is challenging to simultaneously achieve multicore scalability and high disk throughput in a file system. For example, data structures that are on separate cache lines in memory (e.g., directory entries) are grouped together in a transaction log when the file system writes them to disk. This grouping results in cache line conflicts, thereby limiting scalability.

McoreFS is a novel file system design that decouples the in-memory file system from the on-disk file system using per-core operation logs. This design facilitates the use of highly concurrent data structures for the in-memory representation, which allows commutative operations to proceed without conflicts and hence scale perfectly. McoreFS logs operations in a per-core log so that it can delay propagating updates to the disk representation until an fsync. The fsync call merges the per-core logs and applies the operations to disk. McoreFS uses several techniques to perform the merge correctly while achieving good performance: timestamped linearization points to order updates without introducing cache line conflicts, absorption of logged operations, and dependency tracking across operations.

Experiments with a prototype of McoreFS show that its implementation is conflict-free for 99% of test cases involving commutative operations generated by Commuter, scales well on an 80-core machine, and provides disk performance that matches or exceeds that of Linux ext4.

@mastersthesis{srivatsa-sm,
  title        = {Designing multicore scalable filesystems with
    durability and crash consistency},
  author       = {Srivatsa S. Bhat},
  school       = {Massachusetts Institute of Technology},
  year         = 2017,
  month        = jun,
}

Certifying a file system using Crash Hoare Logic: Correctness in the presence of crashes Tej Chajed, Haogang Chen, Adam Chlipala, M. Frans Kaashoek, Nickolai Zeldovich, and Daniel Ziegler. CACM 60(4), April 2017.

FSCQ is the first file system with a machine-checkable proof that its implementation meets a specification, even in the presence of fail-stop crashes. FSCQ provably avoids bugs that have plagued previous file systems, such as performing disk writes without sufficient barriers or forgetting to zero out directory blocks. If a crash happens at an inopportune time, these bugs can lead to data loss. FSCQ’s theorems prove that, under any sequence of crashes followed by reboots, FSCQ will recover its state correctly without losing data.

To state FSCQ’s theorems, this paper introduces the Crash Hoare logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers through proof automation. Using CHL, we developed, specified, and proved the correctness of the FSCQ file system. Although FSCQ’s design is relatively simple, experiments with FSCQ as a user-level file system show that it is sufficient to run Unix applications with usable performance. FSCQ’s specifications and proofs required significantly more work than the implementation, but the work was manageable even for a small team of a few researchers.

@article{fscq:cacm,
  title        = {Certifying a File System Using {Crash Hoare Logic}:
    Correctness in the Presence of Crashes},
  author       = {Tej Chajed and Haogang Chen and Adam Chlipala and M.
    Frans Kaashoek and Nickolai Zeldovich and Daniel Ziegler},
  journal      = {CACM},
  volume       = 60,
  number       = 4,
  month        = apr,
  year         = 2017,
}

Splinter: Practical private queries on public data Frank Wang, Catherine Yun, Shafi Goldwasser, Vinod Vaikuntanathan, and Matei Zaharia. NSDI 2017.

Many online services let users query public datasets such as maps, flight prices, or restaurant reviews. Unfortunately, the queries to these services reveal highly sensitive information that can compromise users’ privacy. This paper presents Splinter, a system that protects users’ queries on public data and scales to realistic applications. A user splits her query into multiple parts and sends each part to a different provider that holds a copy of the data. As long as any one of the providers is honest and does not collude with the others, the providers cannot determine the query. Splinter uses and extends a new cryptographic primitive called Function Secret Sharing (FSS) that makes it up to an order of magnitude more efficient than prior systems based on Private Information Retrieval and garbled circuits. We develop protocols extending FSS to new types of queries, such as MAX and TOPK queries. We also provide an optimized implementation of FSS using AES-NI instructions and multicores. Splinter achieves end-to-end latencies below 1.6 seconds for realistic workloads including a Yelp clone, flight search, and map routing.

@inproceedings{splinter:nsdi17,
  title        = {Splinter: Practical Private Queries on Public Data},
  author       = {Frank Wang and Catherine Yun and Shafi Goldwasser
    and Vinod Vaikuntanathan and Matei Zaharia},
  booktitle    = {Proceedings of the 14th {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '17)},
  year         = 2017,
  month        = mar,
  address      = {Boston, MA},
}

Verifying an I/O-concurrent file system Tej Chajed. Master's thesis, Massachusetts Institute of Technology, February 2017.

Systems software is a good target for verification due to its prevalent usage and its complexity, which can lead to tricky bugs that are hard to test for. One source of complexity in systems software is concurrency, but thus far verification techniques have struggled to enable large-scale verification of concurrent systems. This thesis contributes a verified file system, CIO-FSCQ, with I/O concurrency: if a file system call experiences a miss in the buffer cache and starts a disk I/O, the file system overlaps the I/O with the execution of another file system call.

CIO-FSCQ re-uses the implementation, specifications, and proofs of an existing verified sequential file, FSCQ, and turns it into an I/O-concurrent file system. This re-use is enabled by CIO-FSCQ's optimistic system calls. An optimistic system call runs sequentially if all the data it needs is in the buffer cache. If some data is not in the cache, CIO-FSCQ issues I/Os to retrieve the data from disk and returns an error code. In the miss case, a system call wrapper reverts any partial changes and yields the processor so that another system call can run in parallel with the I/O. CIO-FSCQ retries the system call later, at which point the data is likely in the buffer cache. A directory-isolation protocol guarantees that FSCQ's specifications and proofs can be re-used even if optimistic system calls are retried. An evaluation of IO-FSCQ shows that it speeds up a simple file-system workload by overlapping disk I/O with computation, and that the effort of building and verifying CIO-FSCQ is small compared to the effort of verifying FSCQ.

@mastersthesis{tchajed-sm,
  title        = {Verifying an {I/O}-Concurrent File System},
  author       = {Tej Chajed},
  school       = {Massachusetts Institute of Technology},
  year         = 2017,
  month        = feb,
}

2016

Oort: User-centric cloud storage with global queries Tej Chajed, Jon Gjengset, M. Frans Kaashoek, James Mickens, Robert Morris, and Nickolai Zeldovich. MIT CSAIL technical report, December 2016.

In principle, the web should provide the perfect stage for user-generated content, allowing users to share their data seamlessly with other users across services and applications. In practice, the web fragments a user’s data over many sites, each exposing only limited APIs for sharing.

This paper describes Oort, a new cloud storage system that organizes data primarily by user rather than by application or web site. Oort allows users to choose which web software to use with their data and which other users to share it with, while giving applications powerful tools to query that data. Users rent space from providers that cooperate to provide a global, federated, general-purpose storage system. To support large-scale, multi-user applications such as Twitter and e-mail, Oort provides global queries that find and combine data from relevant users across all providers.

Oort makes global query execution efficient by recognizing and merging similar queries issued by many users’ application instances, largely eliminating the per-user factor in the global complexity of queries. Our evaluation predicts that an Oort implementation could handle traffic similar to that seen by Twitter using a hundred cooperating Oort servers, and that applications with other sharing patterns, like e-mail, can also be executed efficiently.

@techreport{oort:tr16,
  title        = {{Oort}: User-Centric Cloud Storage with Global
    Queries},
  author       = {Tej Chajed and Jon Gjengset and M. Frans Kaashoek
    and James Mickens and Robert Morris and Nickolai Zeldovich},
  institution  = {{MIT} Computer Science and Artificial Intelligence
    Laboratory},
  number       = {MIT-CSAIL-TR-2016-015},
  year         = 2016,
  month        = dec,
}

Firmament: fast, centralized cluster scheduling at scale Ionel Gog, Malte Schwarzkopf, Adam Gleave, Robert N. M. Watson, and Steven Hand. OSDI 2016.

Centralized datacenter schedulers can make high-quality placement decisions when scheduling tasks in a cluster. Today, however, high-quality placements come at the cost of high latency at scale, which degrades response time for interactive tasks and reduces cluster utilization.

This paper describes Firmament, a centralized scheduler that scales to over ten thousand machines at sub-second placement latency even though it continuously reschedules all tasks via a min-cost max-flow (MCMF) optimization. Firmament achieves low latency by using multiple MCMF algorithms, by solving the problem incrementally, and via problem-specific optimizations.

Experiments with a Google workload trace from a 12,500-machine cluster show that Firmament improves placement latency by 20× over Quincy [22], a prior centralized scheduler using the same MCMF optimization. Moreover, even though Firmament is centralized, it matches the placement latency of distributed schedulers for workloads of short tasks. Finally, Firmament exceeds the placement quality of four widely-used centralized and distributed schedulers on a real-world cluster, and hence improves batch task response time by 6x.

@inproceedings{firmament:osdi16,
  title        = {Firmament: fast, centralized cluster scheduling at
    scale},
  author       = {Ionel Gog and Malte Schwarzkopf and Adam Gleave and
    Robert N. M. Watson and Steven Hand},
  pages        = {99--115},
  booktitle    = {Proceedings of the 12th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '16)},
  year         = 2016,
  month        = nov,
  address      = {Savannah, Georgia},
}

Alpenhorn: Bootstrapping secure communication without leaking metadata David Lazar, and Nickolai Zeldovich. OSDI 2016.

Alpenhorn is the first system for initiating an encrypted connection between two users that provides strong privacy and forward secrecy guarantees for metadata (i.e., information about which users connected to each other) and that does not require out-of-band communication other than knowing the other user's Alpenhorn username (email address). This resolves a significant shortcoming in all prior works on private messaging, which assume an out-of-band key distribution mechanism.

Alpenhorn's design builds on three ideas. First, Alpenhorn provides each user with an address book of friends that the user can call to establish a connection. Second, when a user adds a friend for the first time, Alpenhorn ensures the adversary does not learn the friend's identity, by using identity-based encryption in a novel way to privately determine the friend's public key. Finally, when calling a friend, Alpenhorn ensures forward secrecy of metadata by storing pairwise shared secrets in friends' address books, and evolving them over time, using a new keywheel construction. Alpenhorn relies on a number of servers, but operates in an anytrust model, requiring just one of the servers to be honest.

We implemented a prototype of Alpenhorn, and integrated it into the Vuvuzela private messaging system (which did not previously provide privacy or forward secrecy of metadata when initiating conversations). Experimental results show that Alpenhorn can scale to many users, supporting 10 million users on three Alpenhorn servers with an average call latency of 150 seconds and a client bandwidth overhead of 3.7 KB/sec.

@inproceedings{alpenhorn:osdi16,
  title        = {Alpenhorn: Bootstrapping Secure Communication
    without Leaking Metadata},
  author       = {David Lazar and Nickolai Zeldovich},
  pages        = {571--586},
  booktitle    = {Proceedings of the 12th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '16)},
  year         = 2016,
  month        = nov,
  address      = {Savannah, Georgia},
}

Certifying a crash-safe file system Haogang Chen. Ph.D. thesis, Massachusetts Institute of Technology, September 2016.

File systems are a cornerstone for storing and retrieving permanent data, yet they are complex enough to have bugs that might cause data loss, especially in the face of system crashes.

FSCQ is the first file system that (1) provides a precise specification for the core subset of POSIX file-system APIs; and the APIs include fsync and fdatasync, which allow applications to achieve high I/O performance and crash safety, and that (2) provides a machine-checked proof that its I/O-efficient implementation meets this precise specification. FSCQ’s proofs avoid crash-safety bugs that have plagued file systems, such as forgetting to insert a disk-write barrier between writing the data from the log and writing the log’s commit block. FSCQ’s specification also allows applications to prove their own crash safety, avoiding application-level bugs such as forgetting to invoke fsync on both the file and the containing directory. As a result, applications on FSCQ can provide strong guarantees: they will not lose data under any sequence of crashes.

To state FSCQ’s theorems, FSCQ introduces the Crash Hoare Logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers through proof automation. Using CHL, the thesis developed, specified, and proved the correctness of the FSCQ file system. FSCQ introduces a metadata-prefix specification that captures the properties of fsync and fdatasync, based on Linux ext4’s behavior. FSCQ also introduces disk sequences and disk relations to help formalize the metadata-prefix specification. The evaluation shows that FSCQ enables end-to-end verification of application crash safety, and that FSCQ’s optimizations achieve I/O performance on par with that of Linux ext4.

@phdthesis{hchen-phd,
  title        = {Certifying a Crash-safe File System},
  author       = {Haogang Chen},
  school       = {Massachusetts Institute of Technology},
  year         = 2016,
  month        = sep,
}

Certifying checksum-based logging in the RapidFSCQ crash-safe filesystem Stephanie Wang. Master's thesis, Massachusetts Institute of Technology, June 2016.

As more and more software is written every day, so too are bugs. Formal verification is a way of using mathematical methods to prove that a program has no bugs. However, if formal verification is to see widespread use, it must be able to compete with unverified software in performance. Unfortunately, many of the optimizations that we take for granted in unverified software depend on assumptions that are difficult to verify. One such optimization is data checksums in logging systems, used to improve I/O efficiency while still ensuring data integrity after a crash.

This thesis explores a novel method of modeling the probabilistic guarantees of a hash function. This method is then applied to the logging system underlying RapidFSCQ, a certified crash-safe filesystem, to support formally verified checksums. An evaluation of RapidFSCQ shows that it enables end-to-end verification of application and filesystem crash safety, and that RapidFSCQ's optimizations, including checksumming, achieve I/O performance on par with Linux ext4. Thus, this thesis contributes a formal model of hash function behavior with practical application to certified computer systems.

@mastersthesis{stephanie-meng,
  title        = {Certifying Checksum-Based Logging in the {RapidFSCQ}
    Crash-Safe Filesystem},
  author       = {Stephanie Wang},
  school       = {Massachusetts Institute of Technology},
  year         = 2016,
  month        = jun,
}

A differential approach to undefined behavior detection Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, and Armando Solar-Lezama. CACM 60(3), March 2016.

This paper studies undefined behavior arising in systems programming languages such as C/C++. Undefined behavior bugs lead to unpredictable and subtle systems behavior, and their effects can be further amplified by compiler optimizations. Undefined behavior bugs are present in many systems, including the Linux kernel and the Postgres database. The consequences range from incorrect functionality to missing security checks.

This paper proposes a formal and practical approach, which finds undefined behavior bugs by finding “unstable code” in terms of optimizations that leverage undefined behavior. Using this approach, we introduce a new static checker called Stack that precisely identifies undefined behavior bugs. Applying Stack to widely used systems has uncovered 161 new bugs that have been confirmed and fixed by developers.

@article{kstack:cacm,
  title        = {A Differential Approach to Undefined Behavior
    Detection},
  author       = {Xi Wang and Nickolai Zeldovich and M. Frans Kaashoek
    and Armando Solar-Lezama},
  journal      = {CACM},
  volume       = 60,
  number       = 3,
  month        = mar,
  year         = 2016,
}

Sieve: Cryptographically enforced access control for user data in untrusted clouds Frank Wang, James Mickens, Nickolai Zeldovich, and Vinod Vaikuntanathan. NSDI 2016.

Modern web services rob users of low-level control over cloud storage — a user's single logical data set is scattered across multiple storage silos whose access controls are set by web services, not users. The consequence is that users lack the ultimate authority to determine how their data is shared with other web services.

In this paper, we introduce Sieve, a new platform which selectively (and securely) exposes user data to web services. Sieve has a user-centric storage model: each user uploads encrypted data to a single cloud store, and by default, only the user knows the decryption keys. Given this storage model, Sieve defines an infrastructure to support rich, legacy web applications. Using attribute-based encryption, Sieve allows users to define intuitively understandable access policies that are cryptographically enforceable. Using key homomorphism, Sieve can reencrypt user data on storage providers in situ, revoking decryption keys from web services without revealing new keys to the storage provider. Using secret sharing and two-factor authentication, Sieve protects cryptographic secrets against the loss of user devices like smartphones and laptops. The result is that users can enjoy rich, legacy web applications, while benefiting from cryptographically strong controls over which data a web service can access.

@inproceedings{sieve:nsdi16,
  title        = {Sieve: Cryptographically Enforced Access Control for
    User Data in Untrusted Clouds},
  author       = {Frank Wang and James Mickens and Nickolai Zeldovich
    and Vinod Vaikuntanathan},
  booktitle    = {Proceedings of the 13th {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '16)},
  year         = 2016,
  month        = mar,
  address      = {Santa Clara, CA},
}

2015

Vuvuzela: Scalable private messaging resistant to traffic analysis Jelle van den Hooff, David Lazar, Matei Zaharia, and Nickolai Zeldovich. SOSP 2015.

Private messaging over the Internet has proven challenging to implement, because even if message data is encrypted, it is difficult to hide metadata about who is communicating in the face of traffic analysis. Systems that offer strong privacy guarantees, such as Dissent, scale to only several thousand clients, because they use techniques with superlinear cost in the number of clients (e.g., each client broadcasts their message to all other clients). On the other hand, scalable systems, such as Tor, do not protect against traffic analysis, making them ineffective in an era of pervasive network monitoring.

Vuvuzela is a new scalable messaging system that offers strong privacy guarantees, hiding both message data and metadata. Vuvuzela is secure against adversaries that observe and tamper with all network traffic, and that control all nodes except for one server. Vuvuzela's key insight is to minimize the number of variables observable by an attacker, and to use differential privacy techniques to add noise to all observable variables in a way that provably hides information about which users are communicating. Vuvuzela has a linear cost in the number of clients, and experiments show that it can achieve a throughput of 68,000 messages per second for 1 million users with a 37-second end-to-end latency on commodity servers.

@inproceedings{vuvuzela:sosp15,
  title        = {Vuvuzela: Scalable Private Messaging Resistant to
    Traffic Analysis},
  author       = {Jelle van den Hooff and David Lazar and Matei
    Zaharia and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 25th ACM Symposium on Operating
    Systems Principles (SOSP 2015)},
  year         = 2015,
  month        = oct,
  address      = {Monterey, California},
}

Using Crash Hoare Logic for certifying the FSCQ file system Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. SOSP 2015.

FSCQ is the first file system with a machine-checkable proof (using the Coq proof assistant) that its implementation meets its specification and whose specification includes crashes. FSCQ provably avoids bugs that have plagued previous file systems, such as performing disk writes without sufficient barriers or forgetting to zero out directory blocks. If a crash happens at an inopportune time, these bugs can lead to data loss. FSCQ's theorems prove that, under any sequence of crashes followed by reboots, FSCQ will recover the file system correctly without losing data.

To state FSCQ's theorems, this paper introduces the Crash Hoare logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers through proof automation. Using CHL, we developed, specified, and proved the correctness of the FSCQ file system. Although FSCQ's design is relatively simple, experiments with FSCQ running as a user-level file system show that it is sufficient to run Unix applications with usable performance. FSCQ's specifications and proofs required significantly more work than the implementation, but the work was manageable even for a small team of a few researchers.

@inproceedings{fscq:sosp15,
  title        = {Using {Crash} {Hoare} {Logic} for certifying the
    {FSCQ} file system},
  author       = {Haogang Chen and Daniel Ziegler and Tej Chajed and
    Adam Chlipala and M. Frans Kaashoek and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 25th ACM Symposium on Operating
    Systems Principles (SOSP 2015)},
  year         = 2015,
  month        = oct,
  address      = {Monterey, California},
}

Reducing pause times with clustered collection Cody Cutler, and Robert Morris. ISMM 2015.

Each full garbage collection in a program with millions of objects can pause the program for multiple seconds. Much of this work is typically repeated, as the collector re-traces parts of the object graph that have not changed since the last collection. Clustered Collection reduces full collection pause times by eliminating much of this repeated work.

Clustered Collection identifies clusters: regions of the object graph that are reachable from a single “head” object, so that reachability of the head implies reachability of the whole cluster. As long as it is not written, a cluster need not be re-traced by successive full collections. The main design challenge is coping with program writes to clusters while ensuring safe, complete, and fast collections. In some cases program writes require clusters to be dissolved, but in most cases Clustered Collection can handle writes without having to re-trace the affected cluster. Clustered Collection chooses clusters likely to suffer few writes and to yield high savings from re-trace avoidance.

Clustered Collection is implemented as modifications to the Racket collector. Measurements of the code and data from the Hacker News web site (which suffers from significant garbage collection pauses) and a Twitter-like application show that Clustered Collection decreases full collection pause times by a factor of three and six respectively. This improvement is possible because both applications have gigabytes of live data, modify only a small fraction of it, and usually write in ways that do not result in cluster dissolution. Identifying clusters takes more time than a full collection, but happens much less frequently than full collection.

@inproceedings{cc:ismm15,
  title        = {Reducing Pause Times with Clustered Collection},
  author       = {Cody Cutler and Robert Morris},
  booktitle    = {Proceedings of the 15th {I}nternational {S}ymposium
    on {M}emory {M}anagement ({ISMM15})},
  year         = 2015,
  month        = jun,
  address      = {Portland, Oregon},
}

Parallel execution for conflicting transactions Neha Narula. Ph.D. thesis, Massachusetts Institute of Technology, June 2015.

Multi-core in-memory databases only obtain parallel performance when transactions do not conflict. Conflicting transactions are executed one at a time in order to ensure that they have serializable effects. Sequential execution on contended data leaves cores idle and reduces throughput. In other parallel programming contexts—not serializable transactions—techniques have been developed that can reduce contention on shared variables using per-core state. This thesis asks the question, can these techniques apply to a general serializable database?

This work introduces a new concurrency control technique, phase reconciliation, that uses per-core state to greatly reduce contention on popular database records for many important workloads. Phase reconciliation uses the idea of synchronized phases to amortize the cost of combining per-core data and to extract parallelism.

Doppel, our phase reconciliation database, repeatedly cycles through joined and partitioned phases. Joined phases use traditional concurrency control and allow any transaction to execute. When workload contention causes unnecessary sequential execution, Doppel switches to a split phase. During a split phase, commutative operations on popular records act on per-core state, and thus proceed in parallel on different cores. By explicitly using phases, phase reconciliation realizes two important performance benefits: First, it amortizes the potentially high costs of aggregating per-core state over many transactions. Second, it can dynamically split data or not based on observed contention, handling challenging, varying workloads. Doppel achieves higher performance because it parallelizes transactions on popular data that would be run sequentially by conventional concurrency control.

Phase reconciliation helps most when there are many updates to a few popular database records. On an 80-core machine, its throughput is up to 38x higher than conventional concurrency control protocols on microbenchmarks, and up to 3x on a larger application, at the cost of increased latency for some transactions.

@phdthesis{neha-phd,
  title        = {Parallel Execution for Conflicting Transactions},
  author       = {Neha Narula},
  school       = {Massachusetts Institute of Technology},
  year         = 2015,
  month        = jun,
}

Amber: Decoupling user data from web applications Tej Chajed, Jon Gjengset, Jelle van den Hooff, M. Frans Kaashoek, James Mickens, Robert Morris, and Nickolai Zeldovich. HotOS XV 2015.

User-generated content is becoming increasingly common on the Web, but current web applications isolate their users' data, enabling only restricted sharing and cross-service integration. We believe users should be able to share their data seamlessly between their applications and with other users. To that end, we propose Amber, an architecture that decouples users' data from applications, while providing applications with powerful global queries to find user data. We demonstrate how multi-user applications, such as e-mail, can use these global queries to efficiently collect and monitor relevant data created by other users. Amber puts users in control of which applications they use with their data and with whom it is shared, and enables a new class of applications by removing the artificial partitioning of users' data by application.

@inproceedings{amber:hotos15,
  title        = {Amber: Decoupling User Data from Web Applications},
  author       = {Tej Chajed and Jon Gjengset and Jelle van den Hooff
    and M. Frans Kaashoek and James Mickens and Robert Morris and
    Nickolai Zeldovich},
  booktitle    = {Proceedings of the 15th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS XV})},
  year         = 2015,
  month        = may,
  address      = {Kartause Ittingen, Switzerland},
}

Specifying crash safety for storage systems Haogang Chen, Daniel Ziegler, Adam Chlipala, M. Frans Kaashoek, Eddie Kohler, and Nickolai Zeldovich. HotOS XV 2015.

@inproceedings{fscq:hotos15,
  title        = {Specifying Crash Safety for Storage Systems},
  author       = {Haogang Chen and Daniel Ziegler and Adam Chlipala
    and M. Frans Kaashoek and Eddie Kohler and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 15th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS XV})},
  year         = 2015,
  month        = may,
  address      = {Kartause Ittingen, Switzerland},
}

Hare: a file system for non-cache-coherent multicores Charles Gruenwald III, Filippo Sironi, M. Frans Kaashoek, and Nickolai Zeldovich. EuroSys 2015.

Hare is a new file system that provides a POSIX-like interface on multicore processors without cache coherence. Hare allows applications on different cores to share files, directories, and file descriptors. The challenge in designing Hare is to support the shared abstractions faithfully enough to run applications that run on traditional shared-memory operating systems, with few modifications, and to do so while scaling with an increasing number of cores.

To achieve this goal, Hare must support features (such as shared file descriptors) that traditional network file systems don't support, as well as implement them in a way that scales (e.g., shard a directory across servers to allow concurrent operations in that directory). Hare achieves this goal through a combination of new protocols (including a 3-phase commit protocol to implement directory operations correctly and scalably) and leveraging properties of non-cache-coherent multiprocessors (e.g., atomic low-latency message delivery and shared DRAM).

An evaluation on a 40-core machine demonstrates that Hare can run many challenging Linux applications (including a mail server and a Linux kernel build) with minimal or no modifications. The results also show these applications achieve good scalability on Hare, and that Hare's techniques are important to achieving scalability.

@inproceedings{gruenwald:hare,
  author       = {Gruenwald III, Charles and Sironi, Filippo and
    Kaashoek, M. Frans and Zeldovich, Nickolai},
  title        = {Hare: a file system for non-cache-coherent
    multicores},
  booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
    2015)},
  year         = 2015,
  month        = apr,
  address      = {Bordeaux, France},
}

What opportunities for multicore scalability are latent in software interfaces, such as system call APIs? Can scalability challenges and opportunities be identified even before any implementation exists, simply by considering interface specifications? To answer these questions, we introduce the scalable commutativity rule: whenever interface operations commute, they can be implemented in a way that scales. This rule is useful throughout the development process for scalable multicore software, from the interface design through implementation, testing, and evaluation.

This article formalizes the scalable commutativity rule. This requires defining a novel form of commutativity, SIM commutativity, that lets the rule apply even to complex and highly stateful software interfaces.

We also introduce a suite of software development tools based on the rule. Our Commuter tool accepts high-level interface models, generates tests of interface operations that commute and hence could scale, and uses these tests to systematically evaluate the scalability of implementations. We apply Commuter to a model of 18 POSIX file and virtual memory system operations. Using the resulting 26,238 scalability tests, Commuter highlights Linux kernel problems previously observed to limit application scalability and identifies previously unknown bottlenecks that may be triggered by future workloads or hardware.

Finally, we apply the scalable commutativity rule and Commuter to the design and implementation sv6, a new POSIX-like operating system. sv6’s novel file and virtual memory system designs enable it to scale for 99% of the tests generated by Commuter. These results translate to linear scalability on an 80-core x86 machine for applications built on sv6’s commutative operations.

@article{commutativity:tocs,
  title        = {The Scalable Commutativity Rule: Designing Scalable
    Software for Multicore Processors},
  author       = {Austin T. Clements and M. Frans Kaashoek and
    Nickolai Zeldovich and Robert T. Morris and Eddie Kohler},
  journal      = {ACM Trans. Comput. Syst.},
  volume       = 32,
  number       = 4,
  month        = jan,
  year         = 2015,
  pages        = {10:1--10:47},
}

2014

VerSum: Verifiable computations over large public logs Jelle van den Hooff, M. Frans Kaashoek, and Nickolai Zeldovich. CCS 2014.

VerSum allows lightweight clients to outsource expensive computations over large and frequently changing data structures, such as the Bitcoin or Namecoin blockchains, or a Certificate Transparency log. VerSum clients ensure that the output is correct by comparing the outputs from multiple servers. VerSum assumes that at least one server is honest, and crucially, when servers disagree, VerSum uses an efficient conflict resolution protocol to determine which server(s) made a mistake and thus obtain the correct output.

VerSum's contribution lies in achieving low server-side overhead for both incremental re-computation and conflict resolution, using three key ideas: (1) representing the computation as a functional program, which allows memoization of previous results; (2) recording the evaluation trace of the functional program in a carefully designed computation history to help clients determine which server made a mistake; and (3) introducing a new authenticated data structure for sequences, called SeqHash, that makes it efficient for servers to construct summaries of computation histories in the presence of incremental re-computation. Experimental results with an implementation of VerSum show that VerSum can be used for a variety of computations, that it can support many clients, and that it can easily keep up with Bitcoin's rate of new blocks with transactions.

@inproceedings{versum:ccs14,
  title        = {{VerSum}: Verifiable computations over large public
    logs},
  author       = {Jelle van den Hooff and M. Frans Kaashoek and
    Nickolai Zeldovich},
  booktitle    = {Proceedings of the 21st {ACM} Conference on Computer
    and Communications Security ({CCS 2014})},
  year         = 2014,
  month        = nov,
  address      = {Scottsdale, AZ},
}

Nail: A practical tool for parsing and generating data formats Julian Bangert, and Nickolai Zeldovich. OSDI 2014.

Nail is a tool that greatly reduces the programmer effort for safely parsing and generating data formats defined by a grammar. Nail introduces several key ideas to achieve its goal. First, Nail uses a protocol grammar to define not just the data format, but also the internal object model of the data. Second, Nail eliminates the notion of semantic actions, used by existing parser generators, which reduces the expressive power but allows Nail to both parse data formats and generate them from the internal object model, by establishing a semantic bijection between the data format and the object model. Third, Nail introduces dependent fields and stream transforms to capture protocol features such as size and offset fields, checksums, and compressed data, which are impractical to express in existing protocol languages. Using Nail, we implement an authoritative DNS server in C in under 300 lines of code and grammar, and an unzip program in C in 220 lines of code and grammar, demonstrating that Nail makes it easy to parse complex real-world data formats. Performance experiments show that a Nail-based DNS server can outperform the widely used BIND DNS server on an authoritative workload, demonstrating that systems built with Nail can achieve good performance.

@inproceedings{nail:osdi14,
  title        = {Nail: A Practical Tool for Parsing and Generating
    Data Formats},
  author       = {Julian Bangert and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '14)},
  year         = 2014,
  month        = oct,
  address      = {Broomfield, Colorado},
}

Identifying information disclosure in web applications with retroactive auditing Haogang Chen, Taesoo Kim, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. OSDI 2014.

Rail is a framework for building web applications that can precisely identify inappropriately disclosed data after a vulnerability is discovered. To do so, Rail introduces retroactive disclosure auditing: re-running the application with previous inputs once the vulnerability is fixed to determine what data should have been disclosed. A key challenge for Rail is to reconcile state divergence between the original and replay executions, so that the differences between executions precisely correspond to inappropriately disclosed data. Rail provides application developers with APIs to address this challenge, by identifying sensitive data, assigning semantic names to non-deterministic inputs, and tracking dependencies.

Results from a prototype of Rail built on top of the Meteor framework show that Rail can quickly and precisely identify data disclosure from complex attacks, including programming bugs, administrative mistakes, and stolen passwords. Rail incurs up to 22% throughput overhead and 0.5 KB storage overhead per request. Porting three existing web applications required fewer than 25 lines of code changes per application.

@inproceedings{rail:osdi14,
  title        = {Identifying information disclosure in web
    applications with retroactive auditing},
  author       = {Haogang Chen and Taesoo Kim and Xi Wang and Nickolai
    Zeldovich and M. Frans Kaashoek},
  booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '14)},
  year         = 2014,
  month        = oct,
  address      = {Broomfield, Colorado},
}

Jitk: A trustworthy in-kernel interpreter infrastructure Xi Wang, David Lazar, Nickolai Zeldovich, Adam Chlipala, and Zachary Tatlock. OSDI 2014.

Modern operating systems run multiple interpreters in the kernel, which enable user-space applications to add new functionality or specialize system policies. The correctness of such interpreters is critical to the overall system security: bugs in interpreters could allow adversaries to compromise user-space applications and even the kernel.

Jitk is a new infrastructure for building in-kernel interpreters that guarantee functional correctness as they compile user-space policies down to native instructions for execution in the kernel. To demonstrate Jitk, we implement two interpreters in the Linux kernel, BPF and INET-DIAG, which are used for network and system call filtering and socket monitoring, respectively. To help application developers write correct filters, we introduce a high-level rule language, along with a proof that Jitk correctly translates high-level rules all the way to native machine code, and demonstrate that this language can be integrated into OpenSSH with tens of lines of code. We built a prototype of Jitk on top of the CompCert verified compiler and integrated it into the Linux kernel. Experimental results show that Jitk is practical, fast, and trustworthy.

@inproceedings{jitk:osdi14,
  title        = {Jitk: A Trustworthy In-Kernel Interpreter
    Infrastructure},
  author       = {Xi Wang and David Lazar and Nickolai Zeldovich and
    Adam Chlipala and Zachary Tatlock},
  booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '14)},
  year         = 2014,
  month        = oct,
  address      = {Broomfield, Colorado},
}

Phase reconciliation for contended in-memory transactions Neha Narula, Cody Cutler, Eddie Kohler, and Robert Morris. OSDI 2014.

Multicore main-memory database performance can collapse when many transactions contend on the same data. Contending transactions are executed serially—either by locks or by optimistic concurrency control aborts—in order to ensure that they have serializable effects. This leaves many cores idle and performance poor. We introduce a new concurrency control technique, phase reconciliation, that solves this problem for many important workloads. Doppel, our phase reconciliation database, repeatedly cycles through joined, split, and reconciliation phases.

Joined phases use traditional concurrency control and allow any transaction to execute. When workload contention causes unnecessary serial execution, Doppel switches to a split phase. There, updates to contended items modify per-core state, and thus proceed in parallel on different cores. Not all transactions can execute in a split phase; for example, all modifications to a contended item must commute. A reconciliation phase merges these per-core states into the global store, producing a complete database ready for joined-phase transactions. A key aspect of this design is determining which items to split, and which operations to allow on split items.

Phase reconciliation helps most when there are many updates to a few popular database records. Its throughput is up to 38x higher than conventional concurrency control protocols on microbenchmarks, and up to 3x higher on a larger application, at the cost of increased latency for some transactions.

@inproceedings{doppel:osdi14,
  title        = {Phase Reconciliation for Contended In-Memory
    Transactions},
  author       = {Neha Narula and Cody Cutler and Eddie Kohler and
    Robert Morris},
  booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '14)},
  year         = 2014,
  month        = oct,
  address      = {Broomfield, Colorado},
}

A differential approach to undefined behavior detection Xi Wang. Ph.D. thesis, Massachusetts Institute of Technology, September 2014.

This thesis studies undefined behavior arising in systems programming languages such as C/C++. Undefined behavior bugs lead to unpredictable and subtle systems behavior, and their effects can be further amplified by compiler optimizations. Undefined behavior bugs are present in many systems, including the Linux kernel and the Postgres database. The consequences range from incorrect functionality to missing security checks.

This thesis proposes a formal and practical approach, which finds undefined behavior bugs by finding “unstable code” in terms of optimizations that leverage undefined behavior. Using this approach, we introduce a new static checker called Stack that precisely identifies undefined behavior bugs. Applying Stack to widely used systems has uncovered 161 new bugs that have been confirmed and fixed by developers.

@phdthesis{xi-phd,
  title        = {A Differential Approach to Undefined Behavior
    Detection},
  author       = {Xi Wang},
  school       = {Massachusetts Institute of Technology},
  year         = 2014,
  month        = sep,
}

ScaleFS: A multicore-scalable file system Rasha Eqbal. Master's thesis, Massachusetts Institute of Technology, August 2014.

It is difficult to achieve durability and crash consistency in file systems along with multicore scalability. Commutative file system operations, which should scale according to the Scalable Commutativity Property, conflict on shared resources like coarse-grained locks and pages present in the page cache or buffer cache. Furthermore, data structures that are on separate cache lines in memory (e.g., directory entries) are grouped together when the file system writes them to disk for durability. This grouping results in additional conflicts.

This thesis introduces a new design approach that decouples the in-memory file system from the on-disk file system, using per core operation logs. This facilitates the use of highly concurrent data structures for the in-memory representation, which is essential for commutative operations to proceed conflict free and hence scale perfectly. The in-memory representation does not propagate updates to the disk representation immediately, instead it simply logs the operation in a per core logical log. A sync or an fsync call processes these operations and applies them to the disk. Techniques based on time stamping linearization points of file system operations ensure crash consistency, and dependency tracking ensures good disk performance.

A prototype file system, ScaleFS, implements this new approach and techniques. Experiments using Commuter and ScaleFS show that the implementation is conflict free for 99% of test cases involving commutative operations.

@mastersthesis{rashae-sm-thesis,
  title        = {{ScaleFS}: A Multicore-Scalable File System},
  author       = {Rasha Eqbal},
  school       = {Massachusetts Institute of Technology},
  year         = 2014,
  month        = aug,
}

Why does cryptographic software fail? A case study and open problems David Lazar, Haogang Chen, Xi Wang, and Nickolai Zeldovich. APSys 2014.

Mistakes in cryptographic software implementations often undermine the strong security guarantees offered by cryptography. This paper presents a systematic study of cryptographic vulnerabilities in practice, an examination of state-of-the-art techniques to prevent such vulnerabilities, and a discussion of open problems and possible future research directions. Our study covers 269 cryptographic vulnerabilities reported in the CVE database from January 2011 to May 2014. The results show that just 17% of the bugs are in cryptographic libraries (which often have devastating consequences), and the remaining 83% are misuses of cryptographic libraries by individual applications. We observe that preventing bugs in different parts of a system requires different techniques, and that no effective techniques exist to deal with certain classes of mistakes, such as weak key generation.

@inproceedings{cryptobugs:apsys14,
  title        = {Why does cryptographic software fail? {A} case study
    and open problems},
  author       = {David Lazar and Haogang Chen and Xi Wang and
    Nickolai Zeldovich},
  booktitle    = {Proceedings of 5th ACM SIGOPS Asia-Pacific Workshop
    on Systems (APSys 2014)},
  year         = 2014,
  month        = jun,
  address      = {Beijing, China},
}

Providing a shared file system in the Hare POSIX multikernel Charles Gruenwald III. Ph.D. thesis, Massachusetts Institute of Technology, June 2014.

Hare is a new multikernel operating system that provides a single system image for multicore processors without cache coherence. Hare allows applications on different cores to share files, directories, file descriptors, sockets, and processes. The main challenge in designing Hare is to support shared abstractions faithfully enough to run applications that run on traditional shared-memory operating systems with few modifications, and to do so while scaling with an increasing number of cores. To achieve this goal, Hare must support shared abstractions (e.g., file descriptors shared between processes) that appear consistent to processes running on any core, but without relying on hardware cache coherence between cores. Moreover, Hare must implement these abstractions in a way that scales (e.g., sharded directories across servers to allow concurrent operations in that directory). Hare achieves this goal through a combination of new protocols (e.g., a 3-phase commit protocol to implement directory operations correctly and scalably) and leveraging properties of non-cache coherent multiprocessors (e.g., atomic low-latency message delivery and shared DRAM). An evaluation on a 40-core machine demonstrates that Hare can run many challenging Linux applications (including a mail server and a Linux kernel build) with minimal or no modifications. The results also show these applications achieve good scalability on Hare, and that Hare's techniques are important to achieving scalability.

@phdthesis{charlesg3-phd,
  title        = {Providing a Shared File System in the {Hare} {POSIX}
    Multikernel},
  author       = {Charles Gruenwald III},
  school       = {Massachusetts Institute of Technology},
  year         = 2014,
  month        = jun,
}

The scalable commutativity rule: Designing scalable software for multicore processors Austin T. Clements. Ph.D. thesis, Massachusetts Institute of Technology, June 2014.

What fundamental opportunities for multicore scalability are latent in software interfaces, such as system call APIs? Can scalability opportunities be identified even before any implementation exists, simply by considering interface specifications? To answer these questions this dissertation introduces the scalable commutativity rule: Whenever interface operations commute, they can be implemented in a way that scales. This rule aids developers in building scalable multicore software starting with interface design and carrying on through implementation, testing, and evaluation.

This dissertation formalizes the scalable commutativity rule and defines a novel form of commutativity named SIM commutativity that makes it possible to fruitfully apply the rule to complex and highly stateful software interfaces.

To help developers apply the rule, this dissertation introduces an automated method embodied in a new tool named Commuter, which accepts high-level interface models, generates tests of operations that commute and hence could scale, and uses these tests to systematically evaluate the scalability of implementations. We apply Commuter to a model of 18 POSIX file and virtual memory system operations. Using the resulting 26,238 scalability tests, Commuter systematically pinpoints many problems in the Linux kernel that past work has observed to limit application scalability and identifies previously unknown bottlenecks that may be triggered by future hardware or workloads.

Finally, this dissertation applies the scalable commutativity rule and Commuter to the design and implementation of a new POSIX-like operating system named sv6. sv6's novel file and virtual memory system designs enable it to scale for 99% of the tests generated by Commuter. These results translate to linear scalability on an 80-core x86 machine for applications built on sv6's commutative operations.

@phdthesis{aclements-phd,
  title        = {The scalable commutativity rule: Designing scalable
    software for multicore processors},
  author       = {Austin T. Clements},
  school       = {Massachusetts Institute of Technology},
  year         = 2014,
  month        = jun,
}

Automatic intrusion recovery with system-wide history Taesoo Kim. Ph.D. thesis, Massachusetts Institute of Technology, June 2014.

Compromises of our computer systems are inevitable. New software vulnerabilities are discovered and exploited daily, but even if the software is bug-free, administrators may inadvertently make mistakes in configuring permissions, or unaware users may click on buttons in application installers with little understanding of its consequences. Unfortunately, recovering from those inevitable compromises leads to days and weeks of wasted effort by users or system administrators, with no conclusive guarantee that all traces of the attack have been cleaned up.

This dissertation presents , an automatic recovery system that repairs a computer after an adversary compromises it, by undoing the adversary's changes while preserving legitimate user actions, with minimal user involvement. During normal operation, records an action history graph to describe the system's execution, enabling to trace the adversary's changes and their effects. During repair, uses the action history graph to undo an unwanted action and its indirect effects by first rolling back its direct effects, and then re-executing legitimate actions that were influenced by that change. To minimize re-execution and user involvement, uses predicates to selectively re-execute only actions that were semantically affected by the adversary's changes, uses refinement to represent high level semantics into the action history graph, and uses compensating actions to handle external effects.

An evaluation of a prototype of for Linux with 2 real-world attacks, 2 synthesized challenge attacks, and 6 attacks from previous work, shows that can handle a wide range of real attacks with minimal user involvement, and preserve user's changes by efficiently re-executing parts of an action history graph. These benefits come at the cost of 35–127% in execution time overhead and of 4–150 GB of log space per day, depending on the workload. For example, a HotCRP paper submission web site incurs 35% slowdown and generates 4 GB of logs per day under the workload from 30 minutes prior to the SOSP 2007 deadline. We believe those overheads are acceptable in systems whose integrity is critical in their operations.

@phdthesis{taesoo-phd,
  title        = {Automatic intrusion recovery with system-wide
    history},
  author       = {Taesoo Kim},
  school       = {Massachusetts Institute of Technology},
  year         = 2014,
  month        = jun,
}

Reducing pause times with Clustered Collection Cody Cutler. Master's thesis, Massachusetts Institute of Technology, June 2014.

Clustered Collection reduces garbage collection pauses in programs with large amounts of live data. A full collection of millions of live objects can pause the program for multiple seconds. Much of this work, however, is repeated from one collection to the next, particularly for programs that modify only a small fraction of their object graphs between collections.

Clustered Collection reduces redundant work by identifying regions of the object graph which, once traced, need not be traced by subsequent collections. Each of these regions, or “clusters,” consists of objects reachable from a single head object. If the collector can reach a cluster's head object, it skips over the cluster, and resumes tracing at the pointers that leave the cluster. If a cluster's head object is not reachable, or an object within a cluster has been written, the cluster collector may have to trace within the cluster. Clustered Collection is complete despite not tracing within clusters: it frees all unreachable objects.

Clustered Collection is implemented as modifications to the Racket collector. Measurements of the code and data from the Hacker News web site show that Clustered Collection decreases full collection pause times by a factor of three. Hacker News works well with Clustered Collection because it keeps gigabytes of data in memory but modifies only a small fraction of that data. Other experiments demonstrate the ability of Clustered Collection to tolerate certain kinds of writes, and quantify the cost of finding clusters.

@mastersthesis{ccutler-sm-thesis,
  title        = {Reducing pause times with {Clustered Collection}},
  author       = {Cody Cutler},
  school       = {Massachusetts Institute of Technology},
  year         = 2014,
  month        = jun,
}

Scaling address-space operations on Linux with TSX Christopher R. Johnson. Master's thesis, Massachusetts Institute of Technology, June 2014.

Concurrent programming is important due to increasing core counts, but scalable concurrency control is difficult and error prone to implement. Hardware Transactional Memory (HTM) addresses this problem by providing hardware support for concurrently executing arbitrary read-modify-write memory transactions. Intel released Transactional Synchronization eXtensions (TSX), a HTM implementation, in select processors to support scalable concurrency control.

This thesis contributes a case study in applying TSX to the Linux virtual memory system, which currently serializes address-space operations with a lock. TSX should provide scalability by supporting concurrent address-space operations. Achieving scalability with TSX, however, turned out to be difficult due to transactional aborts. This thesis details how to identify and resolve abort problems, and it describes the necessary modifications to make address-space operations scale in Linux.

This thesis also describes a new TLB shootdown algorithm, TxShootDown, which removes TLB shootdown from a transactional critical section while avoiding races due to concurrent address-space operations.

@mastersthesis{txvm:crjohns-thesis,
  title        = {Scaling Address-Space Operations on {Linux} with
    {TSX}},
  author       = {Christopher R. Johnson},
  school       = {Massachusetts Institute of Technology},
  year         = 2014,
  month        = jun,
}

Fast bug finding in lock-free data structures with CB-DPOR Jelle van den Hooff. Master's thesis, Massachusetts Institute of Technology, May 2014.

This thesis describes CB-DPOR, an algorithm for quickly finding bugs in lock-free data structures. CB-DPOR is a combination of the CHESS and DPOR model checking algorithms. CB-DPOR performs similar to the concurrently developed preemption-bounded BPOR algorithm.

Codex is a tool for finding bugs in lock-free data structures. Codex implements CB-DPOR and this thesis demonstrates how to use Codex to find bugs. This thesis describes new bugs in open-source lock-free data structures, and compares the performance of CB-DPOR with the earlier model checking algorithms CHESS, DPOR, and PCT. CB-DPOR find bugs one to two orders of magnitude faster than earlier algorithms.

@mastersthesis{codex:jelle-meng,
  title        = {Fast Bug Finding in Lock-Free Data Structures with
    {CB-DPOR}},
  author       = {Jelle van den Hooff},
  school       = {Massachusetts Institute of Technology},
  year         = 2014,
  month        = may,
}

Building web applications on top of encrypted data using Mylar Raluca Ada Popa, Emily Stark, Jonas Helfer, Steven Valdez, Nickolai Zeldovich, M. Frans Kaashoek, and Hari Balakrishnan. NSDI 2014.

Web applications rely on servers to store and process confidential information. However, anyone who gains access to the server (e.g., an attacker, a curious administrator, or a government) can obtain all of the data stored there. This paper presents Mylar, a platform for building web applications, which protects data confidentiality against attackers with full access to servers. Mylar stores sensitive data encrypted on the server, and decrypts that data only in users' browsers. Mylar addresses three challenges in making this approach work. First, Mylar allows the server to perform keyword search over encrypted documents, even if the documents are encrypted with different keys. Second, Mylar allows users to share keys and encrypted data securely in the presence of an active adversary. Finally, Mylar ensures that client-side application code is authentic, even if the server is malicious. Results with a prototype of Mylar built on top of the Meteor framework are promising: porting 6 applications required changing just 36 lines of code on average, and the performance overheads are modest, amounting to a 17% throughput loss and a 50 ms latency increase for sending a message in a chat application.

@inproceedings{mylar:nsdi14,
  title        = {Building web applications on top of encrypted data
    using {Mylar}},
  author       = {Raluca Ada Popa and Emily Stark and Jonas Helfer and
    Steven Valdez and Nickolai Zeldovich and M. Frans Kaashoek and
    Hari Balakrishnan},
  booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '14)},
  year         = 2014,
  month        = apr,
  address      = {Seattle, WA},
}

Easy freshness with Pequod cache joins Bryan Kate, Eddie Kohler, Michael Kester, Neha Narula, Yandong Mao, and Robert Morris. NSDI 2014.

Pequod is a distributed application-level key-value cache that supports declaratively defined, incrementally maintained, dynamic, partially-materialized views. These views, which we call cache joins, can simplify application development by shifting the burden of view maintenance onto the cache. Cache joins define relationships among key ranges; using cache joins, Pequod calculates views on demand, incrementally updates them as required, and in many cases improves performance by reducing client communication. To build Pequod, we had to design a view abstraction for volatile, relationless key-value caches and make it work across servers in a distributed system. Pequod performs as well as other inmemory key-value caches and, like those caches, outperforms databases with view support.

@inproceedings{pequod:nsdi2014,
  title        = {Easy Freshness with {Pequod} Cache Joins},
  author       = {Bryan Kate and Eddie Kohler and Michael Kester and
    Neha Narula and Yandong Mao and Robert Morris},
  booktitle    = {Proceedings of the 11th {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '14)},
  year         = 2014,
  month        = apr,
  address      = {Seattle, WA},
}

2013

Towards optimization-safe systems: Analyzing the impact of undefined behavior Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, and Armando Solar-Lezama. SOSP 2013.

This paper studies an emerging class of software bugs called optimization-unstable code: code that is unexpectedly discarded by compiler optimizations due to undefined behavior in the program. Unstable code is present in many systems, including the Linux kernel and the Postgres database. The consequences of unstable code range from incorrect functionality to missing security checks.

To reason about unstable code, this paper proposes a novel model, which views unstable code in terms of optimizations that leverage undefined behavior. Using this model, we introduce a new static checker called Stack that precisely identifies unstable code. Applying Stack to widely used systems has uncovered 160 new bugs that have been confirmed and fixed by developers.

@inproceedings{stack:sosp13,
  title        = {Towards Optimization-Safe Systems: Analyzing the
    Impact of Undefined Behavior},
  author       = {Xi Wang and Nickolai Zeldovich and M. Frans Kaashoek
    and Armando Solar-Lezama},
  booktitle    = {Proceedings of the 24th ACM Symposium on Operating
    Systems Principles (SOSP 2013)},
  year         = 2013,
  month        = nov,
  address      = {Farmington, Pennsylvania},
}

Asynchronous intrusion recovery for interconnected web services Ramesh Chandra, Taesoo Kim, and Nickolai Zeldovich. SOSP 2013.

Recovering from attacks in an interconnected system is difficult, because an adversary that gains access to one part of the system may propagate to many others, and tracking down and recovering from such an attack requires significant manual effort. Web services are an important example of an interconnected system, as they are increasingly using protocols such as OAuth and REST APIs to integrate with one another. This paper presents Aire, an intrusion recovery system for such web services. Aire addresses several challenges, such as propagating repair across services when some servers may be unavailable, and providing appropriate consistency guarantees when not all servers have been repaired yet. Experimental results show that Aire can recover from four realistic attacks, including one modeled after a recent Facebook OAuth vulnerability; that porting existing applications to Aire requires little effort; and that Aire imposes a 19–30% CPU overhead and 6–9 KB/request storage cost for Askbot, an existing web application.

@inproceedings{aire:sosp13,
  title        = {Asynchronous Intrusion Recovery for Interconnected
    Web Services},
  author       = {Ramesh Chandra and Taesoo Kim and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 24th ACM Symposium on Operating
    Systems Principles (SOSP 2013)},
  year         = 2013,
  month        = nov,
  address      = {Farmington, Pennsylvania},
}

The scalable commutativity rule: Designing scalable software for multicore processors Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler. SOSP 2013.

What fundamental opportunities for scalability are latent in interfaces, such as system call APIs? Can scalability opportunities be identified even before any implementation exists, simply by considering interface specifications? To answer these questions this paper introduces the following rule: Whenever interface operations commute, they can be implemented in a way that scales. This rule aids developers in building more scalable software starting from interface design and carrying on through implementation, testing, and evaluation.

To help developers apply the rule, a new tool named Commuter accepts high-level interface models and generates tests of operations that commute and hence could scale. Using these tests, Commuter can evaluate the scalability of an implementation. We apply Commuter to 18 POSIX calls and use the results to guide the implementation of a new research operating system kernel called sv6. Linux scales for 68% of the 13,664 tests generated by Commuter for these calls, and Commuter finds many problems that have been observed to limit application scalability. sv6 scales for 99% of the tests.

@inproceedings{commutativity:sosp13,
  title        = {The Scalable Commutativity Rule: Designing Scalable
    Software for Multicore Processors},
  author       = {Austin T. Clements and M. Frans Kaashoek and
    Nickolai Zeldovich and Robert T. Morris and Eddie Kohler},
  booktitle    = {Proceedings of the 24th ACM Symposium on Operating
    Systems Principles (SOSP 2013)},
  year         = 2013,
  month        = nov,
  address      = {Farmington, Pennsylvania},
}

Optimizing communication bottlenecks in multiprocessor operating system kernels Silas Boyd-Wickizer. Ph.D. thesis, Massachusetts Institute of Technology, September 2013.

One difficulty of programming multicore processors is achieving performance that scales with the number of cores in the system. A common performance optimization is to increase inter-core parallelism. If the application is sufficiently parallelized, developers might hope that performance would scale as core count increases. Unfortunately for some applications, such as operating system kernels, parallelization reveals inter-core communication as a performance bottleneck. When data is updated on one core and read or written on other cores, the cache coherence protocol serializes accesses to the data. The result is that each access to the shared data can take hundreds to thousands of cycles, depending on how many cores are contending for the data.

This dissertation focuses on optimizing communication bottlenecks caused by update-heavy workloads, where a data structure is frequently updated but rarely read. Such data structures are commonly used for operating system kernel bookkeeping, such as LRU lists, reverse maps in virtual memory, and file system notification queues. This dissertation identifies bottlenecks in the Linux kernel caused by update-heavy data structures, presents a general approach for optimizing communication in update-heavy data structures, and presents a library called OpLog that embodies this approach and helps developers achieve good scalability for update-heavy data structures. OpLog achieves scalability by logging update operations in per-core logs, and combining the logs only when required by a read to the data structure. Measurements on a 48-core AMD server show that applying OpLog to update-heavy data structures in the Linux kernel significantly improves application performance under certain workloads.

@phdthesis{sbw-phd,
  title        = {Optimizing Communication Bottlenecks in
    Multiprocessor Operating System Kernels},
  author       = {Silas Boyd-Wickizer},
  school       = {Massachusetts Institute of Technology},
  year         = 2013,
  month        = sep,
}

Processing analytical queries over encrypted data Stephen Tu, M. Frans Kaashoek, Sam Madden, and Nickolai Zeldovich. VLDB 2013.

Monomi is a system for securely executing analytical workloads over sensitive data on an untrusted database server. Monomi works by encrypting the entire database and running queries over the encrypted data. Monomi introduces split client/server query execution, which can execute arbitrarily complex queries over encrypted data, as well as several techniques that improve performance for such workloads, including per-row precomputation, space-efficient encryption, grouped homomorphic addition, and pre-filtering. Since these optimizations are good for some queries but not others, Monomi introduces a designer for choosing an efficient physical design at the server for a given workload, and a planner to choose an efficient execution plan for a given query at runtime. A prototype of Monomi running on top of Postgres can execute most of the queries from the TPC-H benchmark with a median overhead of only 1.24× (ranging from 1.03× to 2.33×) compared to an un-encrypted Postgres database where a compromised server would reveal all data.

@inproceedings{monomi:vldb13,
  title        = {Processing Analytical Queries over Encrypted Data},
  author       = {Stephen Tu and M. Frans Kaashoek and Sam Madden and
    Nickolai Zeldovich},
  booktitle    = {Proceedings of the 39th International Conference on
    Very Large Data Bases (VLDB 2013)},
  year         = 2013,
  month        = aug,
}

Security bugs in embedded interpreters Haogang Chen, Cody Cutler, Taesoo Kim, Yandong Mao, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. APSys 2013.

Because embedded interpreters offer flexibility and performance, they are becoming more prevalent, and can be found at nearly every level of the software stack. As one example, the Linux kernel defines languages to describe packet filtering rules and uses embedded interpreters to filter packets at run time. As another example, the RAR archive format allows embedding bytecode in compressed files to describe reversible transformations for decompression. This paper presents an analysis of common pitfalls in embedded interpreter implementations, which can lead to security vulnerabilities, and their impact. We hope that these results are useful both in augmenting existing embedded interpreters and in aiding developers in building new, more secure embedded interpreters.

@inproceedings{vm:apsys13,
  title        = {Security Bugs in Embedded Interpreters},
  author       = {Haogang Chen and Cody Cutler and Taesoo Kim and
    Yandong Mao and Xi Wang and Nickolai Zeldovich and M. Frans
    Kaashoek},
  booktitle    = {Proceedings of 4th ACM SIGOPS Asia-Pacific Workshop
    on Systems (APSys 2013)},
  year         = 2013,
  month        = jul,
  address      = {Singapore},
}

Optimizing RAM-latency dominated applications Yandong Mao, Cody Cutler, and Robert Morris. APSys 2013.

Many apparently CPU-limited programs are actually bottlenecked by RAM fetch latency, often because they follow pointer chains in working sets that are much bigger than the CPU’s on-chip cache. For example, garbage collectors that identify live objects by tracing inter-object pointers can spend much of their time stalling due to RAM fetches.

We observe that for such workloads, programmers should view RAM much as they view disk. The two situations share not just high access latency, but also a common set of approaches to coping with that latency. Relatively general-purpose techniques such as batching, sorting, and "I/O" concurrency work to hide RAM latency much as they do for disk.

This paper studies several RAM-latency dominated programs and shows how we apply general-purpose approaches to hide RAM latency. The evaluation shows that these optimizations improve performance by a factor up to 1.4×. Counter-intuitively, even though these programs are not limited by CPU cycles, we found that adding more cores can yield better performance.

@inproceedings{ram-latency:apsys13,
  title        = {Optimizing {RAM}-latency Dominated Applications},
  author       = {Yandong Mao and Cody Cutler and Robert Morris},
  booktitle    = {Proceedings of 4th ACM SIGOPS Asia-Pacific Workshop
    on Systems (APSys 2013)},
  year         = 2013,
  month        = jul,
  address      = {Singapore},
}

Optimizing unit test execution in large software programs using dependency analysis Taesoo Kim, Ramesh Chandra, and Nickolai Zeldovich. APSys 2013.

Tao is a system that optimizes the execution of unit tests in large software programs and reduces the programmer wait time from minutes to seconds. Tao is based on two key ideas: First, Tao focuses on efficiency, unlike past work that focused on avoiding false negatives. Tao implements simple and fast function-level dependency tracking that identifies tests to run on a code change; any false negatives missed by this dependency tracking are caught by running the entire test suite on a test server once the code change is committed. Second, to make it easy for programmers to adopt Tao, it incorporates the dependency information into the source code repository. This paper describes an early prototype of Tao and demonstrates that Tao can reduce unit test execution time in two large Python software projects by over 96% while incurring few false negatives.

@inproceedings{tao:apsys13,
  title        = {Optimizing Unit Test Execution in Large Software
    Programs using Dependency Analysis},
  author       = {Taesoo Kim and Ramesh Chandra and Nickolai Zeldovich},
  booktitle    = {Proceedings of 4th ACM SIGOPS Asia-Pacific Workshop
    on Systems (APSys 2013)},
  year         = 2013,
  month        = jul,
  address      = {Singapore},
}

Practical and effective sandboxing for non-root users Taesoo Kim, and Nickolai Zeldovich. USENIX 2013.

Mbox is a lightweight sandboxing mechanism for non-root users in commodity OSes. Mbox's sandbox usage model executes a program in the sandbox and prevents the program from modifying the host filesystem by layering the sandbox filesystem on top of the host filesystem. At the end of program execution, the user can examine changes in the sandbox filesystem and selectively commit them back to the host filesystem. Mbox implements this by interposing on system calls and provides a variety of useful applications: installing system packages as a non-root user, running unknown binaries safely without network accesses, checkpointing the host filesystem instantly, and setting up a virtual development environment without special tools. Our performance evaluation shows that Mbox imposes CPU overheads of 0.1–45.2% for various workloads. In this paper, we present Mbox's design, efficient techniques for interposing on system calls, our experience avoiding common system call interposition pitfalls, and Mbox's performance evaluation.

@inproceedings{mbox:usenix13,
  title        = {Practical and Effective Sandboxing for Non-root
    Users},
  author       = {Taesoo Kim and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 2013 USENIX Annual Technical
    Conference (USENIX '13)},
  year         = 2013,
  month        = jun,
  address      = {San Jose, CA},
}

Graduate admissions at MIT & comparison-based rank aggregation: A case study Katherine Szeto. Master's thesis, Massachusetts Institute of Technology, June 2013.

Admission to the Graduate program in EECS at MIT is done via an all-electronic system. Applicants submit materials through a web interface and faculty "reviewers" read the applications and make notes via a different interface. Among other comments, reviewers provide a numerical score between 1 and 4 to each application. Admissions decisions are reached via a lengthy process involving discussion among faculty, which is often guided by the numerical scores of the applications. Past data show the scores of the applicants still under consideration during the final stages of the process are almost all 4's. Because of this uniformity, the scores do not provide much resolution into the differences in quality of the applications, and are therefore not as helpful as they could be in guiding admissions decisions.

In this work, we present the use of an additional scoring system that is based on pairwise comparisons between applicants made by faculty reviewers. We present the design we created for this scheme and the code written to implement it, and analyze the data we obtained when the code went live during the 2012–2013 admissions cycle.

@mastersthesis{szeto-meng,
  title        = {Graduate Admissions at {MIT} \& Comparison-based
    Rank Aggregation: A Case Study},
  author       = {Katherine Szeto},
  school       = {Massachusetts Institute of Technology},
  year         = 2013,
  month        = jun,
}

Finding linearization violations in lock-free concurrent data structures Sebastien Dabdoub. Master's thesis, Massachusetts Institute of Technology, May 2013.

Finding bugs in lock-free concurrent programs is hard. This is due in part to the difficulty of reasoning about the correctness of concurrent algorithms and the timing-sensitive nature of concurrent programs. One of the most widely used tools for reasoning about the correctness of concurrent algorithms is the linearization property. This thesis presents a tool for automatic dynamic checking of concurrent programs under the Total-Store-Order (TSO) memory model and a methodology for finding linearization violations automatically with the tool.

@mastersthesis{codex:sdabdoub-meng,
  title        = {Finding Linearization Violations in Lock-Free
    Concurrent Data Structures},
  author       = {Sebastien Dabdoub},
  school       = {Massachusetts Institute of Technology},
  year         = 2013,
  month        = may,
}

RadixVM: Scalable address spaces for multithreaded applications (revised 2014-08-05) Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. EuroSys 2013.

RadixVM is a new virtual memory system design that enables fully concurrent operations on shared address spaces for multithreaded processes on cache-coherent multicore computers. Today, most operating systems serialize operations such as mmap and munmap, which forces application developers to split their multithreaded applications into multiprocess applications, hoard memory to avoid the overhead of returning it, and so on. RadixVM removes this burden from application developers by ensuring that address space operations on non-overlapping memory regions scale perfectly. It does so by combining three techniques: 1) it organizes metadata in a radix tree instead of a balanced tree to avoid unnecessary cache line movement; 2) it uses a novel memory-efficient distributed reference counting scheme; and 3) it uses a new scheme to target remote TLB shootdowns and to often avoid them altogether. Experiments on an 80 core machine show that RadixVM achieves perfect scalability for non-overlapping regions: if several threads mmap or munmap pages in parallel, they can run completely independently and induce no cache coherence traffic.

@inproceedings{radixvm:eurosys13,
  title        = {{RadixVM}: Scalable address spaces for multithreaded
    applications (revised 2014-08-05)},
  author       = {Austin T. Clements and M. Frans Kaashoek and
    Nickolai Zeldovich},
  booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
    2013)},
  year         = 2013,
  month        = apr,
  address      = {Prague, Czech Republic},
}

2012

Improving integer security for systems with KINT Xi Wang, Haogang Chen, Zhihao Jia, Nickolai Zeldovich, and M. Frans Kaashoek. OSDI 2012.

@inproceedings{kint:osdi12,
  title        = {Improving Integer Security for Systems with {KINT}},
  author       = {Xi Wang and Haogang Chen and Zhihao Jia and Nickolai
    Zeldovich and M. Frans Kaashoek},
  booktitle    = {Proceedings of the 10th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '12)},
  year         = 2012,
  month        = oct,
  address      = {Hollywood, California},
}

Efficient patch-based auditing for web application vulnerabilities Taesoo Kim, Ramesh Chandra, and Nickolai Zeldovich. OSDI 2012.

Poirot is a system that, given a patch for a newly discovered security vulnerability in a web application, helps administrators detect past intrusions that exploited the vulnerability. Poirot records all requests to the server during normal operation, and given a patch, re-executes requests using both patched and unpatched software, and reports to the administrator any request that executes differently in the two cases. A key challenge with this approach is the cost of re-executing all requests, and Poirot introduces several techniques to reduce the time required to audit past requests, including filtering requests based on their control flow and memoization of intermediate results across different requests.

A prototype of Poirot for PHP accurately detects attacks on older versions of MediaWiki and HotCRP, given subsequently released patches. Poirot's techniques allow it to audit past requests 12–51× faster than the time it took to originally execute the same requests, for patches to code executed by every request, under a realistic MediaWiki workload.

@inproceedings{poirot:osdi12,
  title        = {Efficient Patch-based Auditing for Web Application
    Vulnerabilities},
  author       = {Taesoo Kim and Ramesh Chandra and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 10th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '12)},
  year         = 2012,
  month        = oct,
  address      = {Hollywood, California},
}

Undefined behavior: What happened to my code? Xi Wang, Haogang Chen, Alvin Cheung, Zhihao Jia, Nickolai Zeldovich, and M. Frans Kaashoek. APSys 2012.

@inproceedings{ub:apsys12,
  title        = {Undefined Behavior: What Happened to My Code?},
  author       = {Xi Wang and Haogang Chen and Alvin Cheung and Zhihao
    Jia and Nickolai Zeldovich and M. Frans Kaashoek},
  booktitle    = {Proceedings of 3rd ACM SIGOPS Asia-Pacific Workshop
    on Systems (APSys 2012)},
  year         = 2012,
  month        = jul,
  address      = {Seoul, South Korea},
}

Recovering from intrusions in distributed systems with Dare Taesoo Kim, Ramesh Chandra, and Nickolai Zeldovich. APSys 2012.

Dare is a system that recovers system integrity after intrusions that spread between machines in a distributed system. Dare extends the rollback-and-reexecute recovery model of Retro to distributed system recovery by solving the following challenges: tracking dependencies across machines, repairing network connections, minimizing distributed repair, and dealing with long-running daemon processes. This paper describes an early prototype of Dare, presents some preliminary results, and discusses open problems.

@inproceedings{dare:apsys12,
  title        = {Recovering from Intrusions in Distributed Systems
    with {Dare}},
  author       = {Taesoo Kim and Ramesh Chandra and Nickolai Zeldovich},
  booktitle    = {Proceedings of 3rd ACM SIGOPS Asia-Pacific Workshop
    on Systems (APSys 2012)},
  year         = 2012,
  month        = jul,
  address      = {Seoul, South Korea},
}

Non-scalable locks are dangerous Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. Linux Symposium 2012.

@inproceedings{locks:linuxsymp,
  title        = {Non-scalable locks are dangerous},
  author       = {Silas Boyd-Wickizer and M. Frans Kaashoek and Robert
    Morris and Nickolai Zeldovich},
  booktitle    = {Proceedings of the Linux Symposium},
  year         = 2012,
  month        = jul,
  address      = {Ottawa, Canada},
}

Executing web application queries on a partitioned database Neha Narula, and Robert Morris. USENIX Webapps 2012.

Partitioning data over multiple storage servers is an attractive way to increase throughput for web-like workloads. However, there is often no one partitioning that yields good performance for all queries, and it can be challenging for the web developer to determine how best to execute queries over partitioned data.

This paper presents Dixie, a SQL query planner, optimizer, and executor for databases horizontally partitioned over multiple servers. Dixie focuses on increasing inter-query parallel speedup by involving as few servers as possible in each query. One way it does this is by supporting tables with multiple copies partitioned on different columns, in order to expand the set of queries that can be satisified from a single server. Dixie automatically transforms SQL queries to execute over a partitioned database, using a cost model and plan generator that exploit multiple table copies.

We evaluate Dixie on a database and query stream taken from Wikipedia, partitioned across ten MySQL servers. By adding one copy of a 13 MB table and using Dixie's query optimizer, we achieve a throughput improvement of 3.2X over a single optimized partitioning of each table and 8.5X over the same data on a single server. On specific queries Dixie with table copies increases throughput linearly with the number of servers, while the best single-table-copy partitioning achieves little scaling. For a large class of joins, which traditional wisdom suggests requires tables partitioned on the join keys, Dixie can find higher-performance plans using other partitionings.

@inproceedings{dixie:usenixwebapps2012,
  title        = {Executing Web Application Queries on a Partitioned
    Database},
  author       = {Neha Narula and Robert Morris},
  booktitle    = {Proceedings of the 3rd USENIX Conference on Web
    Application Development (USENIX Webapps '12)},
  year         = 2012,
  month        = jun,
  address      = {Boston, Massachusetts},
}

Improving network connection locality on multicore systems Aleksey Pesterev, Jacob Strauss, Nickolai Zeldovich, and Robert T. Morris. EuroSys 2012.

Incoming and outgoing processing for a given TCP connection often execute on different cores: an incoming packet is typically processed on the core that receives the interrupt, while outgoing data processing occurs on the core running the relevant user code. As a result, accesses to read/write connection state (such as TCP control blocks) often involve cache invalidations and data movement between cores' caches. These can take hundreds of processor cycles, enough to significantly reduce performance.

We present a new design, called Affinity-Accept, that causes all processing for a given TCP connection to occur on the same core. Affinity-Accept arranges for the network interface to determine the core on which application processing for each new connection occurs, in a lightweight way; it adjusts the card's choices only in response to imbalances in CPU scheduling. Measurements show that for the Apache web server serving static files on a 48-core AMD system, Affinity-Accept reduces time spent in the TCP stack by 30% and improves overall throughput by 24%.

@inproceedings{affinity-accept:eurosys12,
  title        = {Improving Network Connection Locality on Multicore
    Systems},
  author       = {Aleksey Pesterev and Jacob Strauss and Nickolai
    Zeldovich and Robert T. Morris},
  booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
    2012)},
  year         = 2012,
  month        = apr,
  address      = {Bern, Switzerland},
}

Cache craftiness for fast multicore key-value storage Yandong Mao, Eddie Kohler, and Robert Morris. EuroSys 2012.

@inproceedings{masstree:eurosys12,
  title        = {Cache Craftiness for Fast Multicore Key-Value
    Storage},
  author       = {Yandong Mao and Eddie Kohler and Robert Morris},
  booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
    2012)},
  year         = 2012,
  month        = apr,
  address      = {Bern, Switzerland},
}

UFlood: High-throughput flooding over wireless mesh networks Jayashree Subramanian, Robert Morris, and Hari Balakrishnan. Infocom 2012.

This paper proposes UFlood, a flooding protocol for wireless mesh networks. UFlood targets situations such as software updates where all nodes need to receive the same large file of data, and where limited radio range requires forwarding. UFlood's goals are high throughput and low airtime, defined respectively as rate of completion of a flood to the slowest receiving node and total time spent transmitting. The key to achieving these goals is good choice of sender for each transmission opportunity. The best choice evolves as a flood proceeds in ways that are difficult to predict.

UFlood's core new idea is a distributed heuristic to dynamically choose the senders likely to lead to all nodes receiving the flooded data in the least time. The mechanism takes into account which data nearby receivers already have as well as inter-node channel quality. The mechanism includes a novel bit-rate selection algorithm that trades off the speed of high bit-rates against the larger number of nodes likely to receive low bit-rates. Unusually, UFlood uses both random network coding to increase the usefulness of each transmission and detailed feedback about what data each receiver already has; the feedback is critical in deciding which node's coded transmission will have the most benefit to receivers. The required feedback is potentially voluminous, but UFlood includes novel techniques to reduce its cost.

The paper presents an evaluation on a 25-node 802.11 test-bed. UFlood achieves 150% higher throughput than MORE, a high-throughput flooding protocol, using 65% less airtime. UFlood uses 54% less airtime than MNP, an existing efficient protocol, and achieves 300% higher throughput.

@inproceedings{uflood:infocom12,
  title        = {{UFlood}: High-Throughput Flooding over Wireless
    Mesh Networks},
  author       = {Jayashree Subramanian and Robert Morris and Hari
    Balakrishnan},
  booktitle    = {Proceedings of the 31st Infocom},
  year         = 2012,
  month        = mar,
  address      = {Orlando, FL},
}

Scalable address spaces using RCU balanced trees Austin T. Clements, Frans Kaashoek, and Nickolai Zeldovich. ASPLOS 2012.

Software developers commonly exploit multicore processors by building multithreaded software in which all threads of an application share a single address space. This shared address space has a cost: kernel virtual memory operations such as handling soft page faults, growing the address space, mapping files, etc. can limit the scalability of these applications. In widely-used operating systems, all of these operations are synchronized by a single per-process lock. This paper contributes a new design for increasing the concurrency of kernel operations on a shared address space by exploiting read-copy-update (RCU) so that soft page faults can both run in parallel with operations that mutate the same address space and avoid contending with other page faults on shared cache lines. To enable such parallelism, this paper also introduces an RCU-based binary balanced tree for storing memory mappings. An experimental evaluation using three multithreaded applications shows performance improvements on 80 cores ranging from 1.7× to 3.4× for an implementation of this design in the Linux 2.6.37 kernel. The RCU-based binary tree enables soft page faults to run at a constant cost with an increasing number of cores, suggesting that the design will scale well beyond 80 cores.

@inproceedings{rcuvm:asplos12,
  title        = {Scalable Address Spaces Using {RCU} Balanced Trees},
  author       = {Austin T. Clements and Frans Kaashoek and Nickolai
    Zeldovich},
  booktitle    = {Proceedings of the 17th International Conference on
    Architectural Support for Programming Languages and Operating
    Systems ({ASPLOS})},
  year         = 2012,
  month        = mar,
  address      = {London, UK},
}

2011

CPHash: a cache-partitioned hash table Zviad Metreveli, Nickolai Zeldovich, and Frans Kaashoek. MIT CSAIL technical report, November 2011.

@techreport{cphash:tr,
  title        = {{CPHash}: a cache-partitioned hash table},
  author       = {Zviad Metreveli and Nickolai Zeldovich and Frans
    Kaashoek},
  number       = {MIT-CSAIL-TR-2011-051},
  year         = 2011,
  month        = nov,
  institution  = {{MIT} Computer Science and Artificial Intelligence
    Laboratory},
}

Software fault isolation with API integrity and multi-principal modules Yandong Mao, Haogang Chen, Dong Zhou, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. SOSP 2011.

@inproceedings{lxfi:sosp11,
  title        = {Software fault isolation with {API} integrity and
    multi-principal modules},
  author       = {Yandong Mao and Haogang Chen and Dong Zhou and Xi
    Wang and Nickolai Zeldovich and M. Frans Kaashoek},
  booktitle    = {Proceedings of the 23rd ACM Symposium on Operating
    Systems Principles (SOSP 2011)},
  year         = 2011,
  month        = oct,
  address      = {Cascais, Portugal},
}

Intrusion recovery for database-backed web applications Ramesh Chandra, Taesoo Kim, Meelap Shah, Neha Narula, and Nickolai Zeldovich. SOSP 2011.

Warp is a system that helps users and administrators of web applications recover from intrusions such as SQL injection, cross-site scripting, and clickjacking attacks, while preserving legitimate user changes. Warp repairs from an intrusion by rolling back parts of the database to a version before the attack, and replaying subsequent legitimate actions. Warp allows administrators to retroactively patch security vulnerabilities – i.e., apply new security patches to past executions – to recover from intrusions without requiring the administrator to track down or even detect attacks. Warp's time-travel database allows fine-grained rollback of database rows, and enables repair to proceed concurrently with normal operation of a web application. Finally, Warp captures and replays user input at the level of a browser's DOM, to recover from attacks that involve a user's browser. For a web server running MediaWiki, Warp requires no application source code changes to recover from a range of common web application vulnerabilities with minimal user input at a cost of 24–27% in throughput and 2–3.2 GB/day in storage.

@inproceedings{warp:sosp11,
  title        = {Intrusion Recovery for Database-backed Web
    Applications},
  author       = {Ramesh Chandra and Taesoo Kim and Meelap Shah and
    Neha Narula and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 23rd ACM Symposium on Operating
    Systems Principles (SOSP 2011)},
  year         = 2011,
  month        = oct,
  address      = {Cascais, Portugal},
}

Linux kernel vulnerabilities: State-of-the-art defenses and open problems Haogang Chen, Yandong Mao, Xi Wang, Dong Zhou, Nickolai Zeldovich, and M. Frans Kaashoek. APSys 2011.

@inproceedings{vulnerabilities:chen,
  title        = {Linux kernel vulnerabilities: State-of-the-art
    defenses and open problems},
  author       = {Haogang Chen and Yandong Mao and Xi Wang and Dong
    Zhou and Nickolai Zeldovich and M. Frans Kaashoek},
  booktitle    = {Proceedings of 2nd ACM SIGOPS Asia-Pacific Workshop
    on Systems (APSys 2011)},
  year         = 2011,
  month        = jul,
  address      = {Shanghai, China},
}

Retroactive auditing Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. APSys 2011.

@inproceedings{rad:apsys11,
  title        = {Retroactive Auditing},
  author       = {Xi Wang and Nickolai Zeldovich and M. Frans Kaashoek},
  booktitle    = {Proceedings of 2nd ACM SIGOPS Asia-Pacific Workshop
    on Systems (APSys 2011)},
  year         = 2011,
  month        = jul,
  address      = {Shanghai, China},
}

Eyo: Device-transparent personal storage Jacob Strauss, Justin Mazzola Paluska, Chris Lesniewski-Laas, Bryan Ford, Robert Morris, and Frans Kaashoek. USENIX 2011.

Users increasingly store data collections such as digital photographs on multiple personal devices, each of which typically offers a storage management interface oblivious to the contents of the user's other devices. As a result, collections become disorganized and drift out of sync. This paper presents Eyo, a novel personal storage system that provides device transparency: a user can think in terms of "file X", rather than "file X on device Y ", and will see the same set of files on all personal devices. Eyo allows a user to view and manage the entire collection of objects from any of their devices, even from disconnected devices and devices with too little storage to hold all the object content. Eyo synchronizes these collections across any network topology, including direct peer-to-peer links. Eyo provides applications with a storage API with first-class access to object version history in order to resolve update conflicts automatically. Experiments with several applications Eyo usingmedia players, a photo editor, a podcast manager, and an interface emailshow that device transparency requires only minor application changes, and matches the storage and bandwidth capabilities of typical portable devices.

@inproceedings{eyo:usenix11,
  title        = {Eyo: Device-Transparent Personal Storage},
  author       = {Jacob Strauss and Justin Mazzola Paluska and Chris
    Lesniewski-Laas and Bryan Ford and Robert Morris and Frans
    Kaashoek},
  booktitle    = {Proceedings of the 2011 USENIX Annual Technical
    Conference (USENIX '11)},
  year         = 2011,
  month        = jun,
  address      = {Portland, Oregon},
}

A software approach to unifying multicore caches Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. MIT CSAIL technical report, June 2011.

@techreport{mfc:tr11,
  title        = {A Software Approach to Unifying Multicore Caches},
  author       = {Silas Boyd-Wickizer and M. Frans Kaashoek and Robert
    Morris and Nickolai Zeldovich},
  institution  = {{MIT} Computer Science and Artificial Intelligence
    Laboratory},
  number       = {MIT-CSAIL-TR-2011-032},
  year         = 2011,
  month        = jun,
}

CPHash: A cache-partitioned hash table with LRU eviction Zviad Metreveli. Master's thesis, Massachusetts Institute of Technology, June 2011.

In this thesis we introduce CPHash – a scalable fixed size hash table that supports eviction using an LRU list, and CPServer – a scalable in memory key/value cache server that uses CPHash to implement its hash table. CPHash uses computation migration to avoid transferring data between cores. Experiments on a 48 core machine show that CPHash has 2 to 3 times higher throughput than a hash table implemented using scalable fine-grained locks. CPServer achieves 1.2 to 1.7 times higher throughput than a key/value cache server that uses a hash table with scalable fine-grained locks and 1.5 to 2.6 times higher throughput than Memcached.

@mastersthesis{cphash:zviad-meng-thesis,
  title        = {{CPHash}: A Cache-Partitioned Hash Table with {LRU}
    Eviction},
  author       = {Zviad Metreveli},
  school       = {Massachusetts Institute of Technology},
  year         = 2011,
  month        = jun,
}

2010

Intrusion recovery using selective re-execution Taesoo Kim, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. OSDI 2010.

Retro repairs a desktop or server after an adversary compromises it, by undoing the adversary's changes while preserving legitimate user actions, with minimal user involvement. During normal operation, Retro records an action history graph, which is a detailed dependency graph describing the system's execution. Retro uses refinement to describe graph objects and actions at multiple levels of abstraction, which allows for precise dependencies. During repair, Retro uses the action history graph to undo an unwanted action and its indirect effects by first rolling back its direct effects, and then re-executing legitimate actions that were influenced by that change. To minimize user involvement and re-execution, Retro uses predicates to selectively re-execute only actions that were semantically affected by the adversary's changes, and uses compensating actions to handle external effects.

An evaluation of a prototype of Retro for Linux with 2 real-world attacks, 2 synthesized challenge attacks, and 6 attacks from previous work, shows that Retro can often repair the system without user involvement, and avoids false positives and negatives from previous solutions. These benefits come at the cost of 35–127% in execution time overhead and of 4–150 GB of log space per day, depending on the workload. For example, a HotCRP paper submission web site incurs 35% slowdown and generates 4 GB of logs per day under the workload from 30 minutes prior to the SOSP 2007 deadline.

@inproceedings{retro:osdi10,
  title        = {Intrusion recovery using selective re-execution},
  author       = {Taesoo Kim and Xi Wang and Nickolai Zeldovich and M.
    Frans Kaashoek},
  booktitle    = {Proceedings of the 9th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '10)},
  year         = 2010,
  month        = oct,
  address      = {Vancouver, Canada},
}

An analysis of Linux scalability to many cores Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. OSDI 2010.

This paper analyzes the scalability of seven system applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce) running on Linux on a 48-core computer. Except for gmake, all applications trigger scalability bottlenecks inside a recent Linux kernel. Using mostly standard parallel programming techniques—this paper introduces one new technique, sloppy counters—these bottlenecks can be removed from the kernel or avoided by changing the applications slightly. Modifying the kernel required in total 3002 lines of code changes. A speculative conclusion from this analysis is that there is no scalability reason to give up on traditional operating system organizations just yet.

@inproceedings{linux:osdi10,
  title        = {An Analysis of {Linux} Scalability to Many Cores},
  author       = {Silas Boyd-Wickizer and Austin T. Clements and
    Yandong Mao and Aleksey Pesterev and M. Frans Kaashoek and Robert
    Morris and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 9th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '10)},
  year         = 2010,
  month        = oct,
  address      = {Vancouver, Canada},
}

Design and applications of a secure and decentralized distributed hash table Christopher T. Lesniewski-Laas. Ph.D. thesis, Massachusetts Institute of Technology, September 2010.

@phdthesis{ctl-phd,
  title        = {Design and Applications of a Secure and
    Decentralized Distributed Hash Table},
  author       = {Christopher T. Lesniewski-Laas},
  school       = {Massachusetts Institute of Technology},
  year         = 2010,
  month        = sep,
}

Device-transparent personal storage Jacob Strauss. Ph.D. thesis, Massachusetts Institute of Technology, September 2010.

@phdthesis{jastr-phd,
  title        = {Device-Transparent Personal Storage},
  author       = {Jacob Strauss},
  school       = {Massachusetts Institute of Technology},
  year         = 2010,
  month        = sep,
}

Making Linux protection mechanisms egalitarian with UserFS Taesoo Kim, and Nickolai Zeldovich. USENIX Security 2010.

UserFS provides egalitarian OS protection mechanisms in Linux. UserFS allows any user—not just the system administrator—to allocate Unix user IDs, to use chroot, and to set up firewall rules in order to confine untrusted code. One key idea in UserFS is representing user IDs as files in a /proc-like file system, thus allowing applications to manage user IDs like any other files, by setting permissions and passing file descriptors over Unix domain sockets. UserFS addresses several challenges in making user IDs egalitarian, including accountability, resource allocation, persistence, and UID reuse. We have ported several applications to take advantage of UserFS; by changing just tens to hundreds of lines of code, we prevented attackers from exploiting application-level vulnerabilities, such as code injection or missing ACL checks in a PHP-based wiki application. Implementing UserFS requires minimal changes to the Linux kernel-a single 3,000-line kernel module-and incurs no performance overhead for most operations, making it practical to deploy on real systems.

@inproceedings{userfs:sec19,
  title        = {Making {Linux} Protection Mechanisms Egalitarian
    with {UserFS}},
  author       = {Taesoo Kim and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 19th {USENIX} {S}ecurity
    {S}ymposium},
  year         = 2010,
  month        = aug,
  address      = {Washington, D.C.},
}

Separating web applications from user data storage with BStore Ramesh Chandra, Priya Gupta, and Nickolai Zeldovich. USENIX WebApps 2010.

This paper presents BStore, a framework that allows developers to separate their web application code from user data storage. With BStore, storage providers implement a standard file system API, and applications access user data through that same API without having to worry about where the data might be stored. A file system manager allows the user and applications to combine multiple file systems into a single namespace, and to control what data each application can access. One key idea in BStore's design is the use of tags on files, which allows applications both to organize data in different ways, and to delegate fine-grained access to other applications. We have implemented a prototype of BStore in Javascript that runs in unmodified Firefox and Chrome browsers. We also implemented three file systems and ported three different applications to BStore. Our prototype incurs an acceptable performance overhead of less than 5% on a 10Mbps network connection, and porting existing client-side applications to BStore required small amounts of source code changes.

@inproceedings{bstore:webapps10,
  author       = {Ramesh Chandra and Priya Gupta and Nickolai
    Zeldovich},
  title        = {Separating Web Applications from User Data Storage
    with {BS}tore},
  booktitle    = {Proceedings of the 2010 USENIX Conference on Web
    Application Development (USENIX WebApps '10)},
  year         = 2010,
  month        = jun,
  address      = {Boston, Massachusetts},
}

Tolerating malicious device drivers in Linux Silas Boyd-Wickizer, and Nickolai Zeldovich. USENIX 2010.

@inproceedings{sud:usenix10,
  title        = {Tolerating Malicious Device Drivers in {Linux}},
  author       = {Silas Boyd-Wickizer and Nickolai Zeldovich},
  booktitle    = {Proceedings of the 2010 USENIX Annual Technical
    Conference (USENIX '10)},
  year         = 2010,
  month        = jun,
  address      = {Boston, Massachusetts},
}

WhanuaSIP: A secure peer-to-peer communication platform Raymond Chen. Master's thesis, Massachusetts Institute of Technology, June 2010.

This thesis presents a novel mechanism for achieving secure and reliable peer-to-peer communications on the Internet. WhanauSIP merges a Sybil-proof distributed hash table with mature SIP technology to enable instant messaging, audio chat, and video conferencing that is resilient to censoring, eavesdropping, and forgery. Performance and security evaluations performed on the PlanetLab network demonstrate that the majority of resource lookups return within 5 seconds. These results indicate that WhanauSIP delivers practical performance with respect to call session initialization latency for VoIP telephony. Furthermore, the tests demonstrated that lookup performance was minimally affected during a Sybil cluster ID attack, illustrating the network's resilience to malicious adversaries. The thesis delivers three software packages for public use: a general Whanau distributed hash table implementation, a WhanauSIP gateway, and a desktop IM/VoIP client.

@mastersthesis{whanuasip:raymond-meng-thesis,
  title        = {{WhanuaSIP}: A Secure Peer-to-Peer Communication
    Platform},
  author       = {Raymond Chen},
  school       = {Massachusetts Institute of Technology},
  year         = 2010,
  month        = jun,
}

Optimizing MapReduce for multicore architectures Yandong Mao, Robert Morris, and Frans Kaashoek. MIT CSAIL technical report, May 2010.

@techreport{metis:tr,
  title        = {Optimizing {MapReduce} for Multicore Architectures},
  author       = {Yandong Mao and Robert Morris and Frans Kaashoek},
  number       = {MIT-CSAIL-TR-2010-020},
  year         = 2010,
  month        = may,
  institution  = {{MIT} Computer Science and Artificial Intelligence
    Laboratory},
}

Whānau: A Sybil-proof distributed hash table Chris Lesniewski-Laas, and M. Frans Kaashoek. NSDI 2010.

Whānau is a novel routing protocol for distributed hash tables (DHTs) that is efficient and strongly resistant to the Sybil attack. Whānau uses the social connections between users to build routing tables that enable Sybil-resistant lookups. The number of Sybils in the social network does not affect the protocol's performance, but links between honest users and Sybils do. When there are n well-connected honest nodes, Whānau can tolerate up to O(n/log n) such "attack edges". This means that an adversary must convince a large fraction of the honest users to make a social connection with the adversary's Sybils before any lookups will fail.

Whānau uses ideas from structured DHTs to build routing tables that contain O(√n log n) entries per node. It introduces the idea of layered identifiers to counter clustering attacks, a class of Sybil attacks challenging for previous DHTs to handle. Using the constructed tables, lookups provably take constant time. Simulation results, using social network graphs from LiveJournal, Flickr, YouTube, and DBLP, confirm the analytic results. Experimental results on PlanetLab confirm that the protocol can handle modest churn.

@inproceedings{whanau:nsdi10,
  title        = {Wh\=anau: A {Sybil}-Proof Distributed Hash Table},
  author       = {Chris Lesniewski-Laas and M. Frans Kaashoek},
  booktitle    = {Proceedings of the 7th {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '10)},
  year         = 2010,
  month        = apr,
  address      = {San Jose, CA},
}

Locating cache performance bottlenecks using data profiling Aleksey Pesterev, Nickolai Zeldovich, and Robert T. Morris. EuroSys 2010.

Effective use of CPU data caches is critical to good performance, but poor cache use patterns are often hard to spot using existing execution profiling tools. Typical profilers attribute costs to specific code locations. The costs due to frequent cache misses on a given piece of data, however, may be spread over instructions throughout the application. The resulting individually small costs at a large number of instructions can easily appear insignificant in a code profiler's output.

DProf helps programmers understand cache miss costs by attributing misses to data types instead of code. Associating cache misses with data helps programmers locate data structures that experience misses in many places in the application's code. DProf introduces a number of new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime, and which accesses incur expensive cross-CPU cache loads. We present two case studies of using DProf to find and fix cache performance bottlenecks in Linux. The improvements provide a 16--57% throughput improvement on a range of memcached and Apache workloads.

@inproceedings{dprof:eurosys10,
  title        = {Locating Cache Performance Bottlenecks Using Data
    Profiling},
  author       = {Aleksey Pesterev and Nickolai Zeldovich and Robert
    T. Morris},
  booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
    2010)},
  year         = 2010,
  month        = apr,
  address      = {Paris, France},
}

2009

Device transparency: a new model for mobile storage Jacob Strauss, Chris Lesniewski-Laas, Justin Mazzola Paluska, Bryan Ford, Robert Morris, and Frans Kaashoek. HotStorage 2009.

This paper proposes a new storage model, device transparency, in which users view and manage their entire data collection from any of their devices, even from disconnected storage-limited devices holding only a subset of the entire collection.

@inproceedings{eyo:hotstorage09,
  title        = {Device Transparency: a New Model for Mobile Storage},
  author       = {Jacob Strauss and Chris Lesniewski-Laas and Justin
    Mazzola Paluska and Bryan Ford and Robert Morris and Frans
    Kaashoek},
  booktitle    = {Proceedings of the {W}orkshop on {H}ot {T}opics in
    {S}torage and {F}ile {S}ystems ({HotStorage} '09)},
  year         = 2009,
  month        = oct,
  address      = {Big Sky, Montana},
}

Improving application security with data flow assertions Alexander Yip, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. SOSP 2009.

Resin is a new language runtime that helps prevent security vulnerabilities, by allowing programmers to specify application-level data flow assertions. Resin provides policy objects, which programmers use to specify assertion code and metadata; data tracking, which allows programmers to associate assertions with application data, and to keep track of assertions as the data flow through the application; and filter objects, which programmers use to define data flow boundaries at which assertions are checked. Resin's runtime checks data flow assertions by propagating policy objects along with data, as that data moves through the application, and then invoking filter objects when data crosses a data flow boundary, such as when writing data to the network or a file.

Using Resin, Web application programmers can prevent a range of problems, from SQL injection and cross-site scripting, to inadvertent password disclosure and missing access control checks. Adding a Resin assertion to an application requires few changes to the existing application code, and an assertion can reuse existing code and data structures. For instance, 23 lines of code detect and prevent three previously-unknown missing access control vulnerabilities in phpBB, a popular Web forum application. Other assertions comprising tens of lines of code prevent a range of vulnerabilities in Python and PHP applications. A prototype of Resin incurs a 33% CPU overhead running the HotCRP conference management application.

@inproceedings{resin:sosp09,
  title        = {Improving Application Security with Data Flow
    Assertions},
  author       = {Alexander Yip and Xi Wang and Nickolai Zeldovich and
    M. Frans Kaashoek},
  booktitle    = {Proceedings of the 22th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '09)},
  year         = 2009,
  month        = oct,
  address      = {Big Sky, Montana},
}

Whanaungatanga: Sybil-proof routing with social networks Chris Lesniewski-Laas, and M. Frans Kaashoek. MIT CSAIL technical report, September 2009.

@techreport{whanau:tr09,
  title        = {Whanaungatanga: {Sybil}-proof routing with social
    networks},
  author       = {Chris Lesniewski-Laas and M. Frans Kaashoek},
  institution  = {{MIT} Computer Science and Artificial Intelligence
    Laboratory},
  number       = {MIT-CSAIL-TR-2009-045},
  year         = 2009,
  month        = sep,
}

Flexible, wide-area storage for distributed systems using semantic cues Jeremy Stribling. Ph.D. thesis, Massachusetts Institute of Technology, September 2009.

@phdthesis{strib-phd,
  title        = {Flexible, Wide-Area Storage for Distributed Systems
    Using Semantic Cues},
  author       = {Jeremy Stribling},
  school       = {Massachusetts Institute of Technology},
  year         = 2009,
  month        = sep,
}

Improving web site security with data flow management Alexander Yip. Ph.D. thesis, Massachusetts Institute of Technology, September 2009.

@phdthesis{yipal-phd,
  title        = {Improving Web Site Security with Data Flow
    Management},
  author       = {Alexander Yip},
  school       = {Massachusetts Institute of Technology},
  year         = 2009,
  month        = sep,
}

In defense of wireless carrier sense Micah Brodsky, and Robert Morris. SIGCOMM 2009.

Carrier sense is often used to regulate concurrency in wireless medium access control (MAC) protocols, balancing interference protection and spatial reuse. Carrier sense is known to be imperfect, and many improved techniques have been proposed. Is the search for a replacement justified? This paper presents a theoretical model for average case two-sender carrier sense based on radio propagation theory and Shannon capacity. Analysis using the model shows that carrier sense performance is surprisingly close to optimal for radios with adaptive bitrate. The model suggests that hidden and exposed terminals usually cause modest reductions in throughput rather than dramatic decreases. Finally, it is possible to choose a fixed sense threshold which performs well across a wide range of scenarios, in large part due to the role of the noise floor. Experimental results from an indoor 802.11 testbed support these claims.

@inproceedings{cs:sigcomm09,
  title        = {In Defense of Wireless Carrier Sense},
  author       = {Micah Brodsky and Robert Morris},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} '09 Conference},
  year         = 2009,
  month        = aug,
  address      = {Barcelona, Spain},
}

Reinventing scheduling for multicore systems Silas Boyd-Wickizer, Robert Morris, and M. Frans Kaashoek. HotOS 2009.

@inproceedings{o2:hotos09,
  title        = {Reinventing Scheduling for Multicore Systems},
  author       = {Silas Boyd-Wickizer and Robert Morris and M. Frans
    Kaashoek},
  booktitle    = {Proceedings of the 12th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS-XII})},
  year         = 2009,
  month        = may,
  address      = {Monte Verit{\`a}, Switzerland},
}

Efficient file distribution in a flexible, wide-area file system Irene Zhang. Master's thesis, Massachusetts Institute of Technology, May 2009.

WheelFS is a wide-area distributed file system designed to help applications cope with the challenges of sharing data over the wide-area network. A wide range of applications can use WheelFS as a storage layer because applications can control various trade-offs in WheelFS, such as consistency versus availability, using \textit{semantic cues}. One key feature that many applications require from any storage system is efficient file distribution. The storage system needs to be able to serve files quickly, even large or popular ones, and allow users and applications to quickly browse files. Wide-area links with high latency and low throughput make achieving these goals difficult for most distributed storage systems.

This thesis explores using prefetching, a traditional file system optimization technique, in wide-area file systems for more efficient file distribution. This thesis focuses on \textit{Tread}, a prefetcher for WheelFS. Tread includes several types of prefetching to improve the performance of reading files and directories in WheelFS: read-ahead prefetching, whole file prefetching, directory prefetching and a prefetching optimization for WheelFS's built-in cooperative caching. To makes the best use of scarce wide-area resources, Tread adaptively rate-limits prefetching and gives applications control over what and how prefetching is done using WheelFS's semantic cues.

Experiments show that Tread can reduce the time to read a 10MB file in WheelFS by 40% and the time to list a directory with 100 entries by more than 80%. In addition, experiments on Planetlab show that using prefetching with cooperative caching to distribute a 10MB file to 270 clients reduces the average latency for each client to read the file by almost 45%.

@mastersthesis{wheelfs:irene-meng,
  title        = {Efficient File Distribution in a Flexible, Wide-area
    File System},
  author       = {Irene Zhang},
  school       = {Massachusetts Institute of Technology},
  year         = 2009,
  month        = may,
}

Securing wide-area storage in WheelFS Xavid Pretzer. Master's thesis, Massachusetts Institute of Technology, May 2009.

@mastersthesis{wheelfs:xavid-thesis,
  title        = {Securing Wide-area Storage in {WheelFS}},
  author       = {Xavid Pretzer},
  school       = {Massachusetts Institute of Technology},
  year         = 2009,
  month        = may,
}

Flexible, wide-area storage for distributed systems with WheelFS Jeremy Stribling, Yair Sovran, Irene Zhang, Xavid Pretzer, Jinyang Li, M. Frans Kaashoek, and Robert Morris. NSDI 2009.

WheelFS is a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance. WheelFS takes the form of a distributed file system with a familiar POSIX interface. Its design allows applications to adjust the tradeoff between prompt visibility of updates from other sites and the ability for sites to operate independently despite failures and long delays. WheelFS allows these adjustments via semantic cues, which provide application control over consistency, failure handling, and file and replica placement.

WheelFS is implemented as a user-level file system and is deployed on PlanetLab and Emulab. Three applications (a distributed Web cache, an email service and large file distribution) demonstrate that WheelFS's file system interface simplifies construction of distributed applications by allowing reuse of existing software. These applications would perform poorly with the strict semantics implied by a traditional file system interface, but by providing cues to WheelFS they are able to achieve good performance. Measurements show that applications built on WheelFS deliver comparable performance to services such as CoralCDN and BitTorrent that use specialized wide-area storage systems.

@inproceedings{wheelfs:nsdi09,
  title        = {Flexible, Wide-Area Storage for Distributed Systems
    with {WheelFS}},
  author       = {Jeremy Stribling and Yair Sovran and Irene Zhang and
    Xavid Pretzer and Jinyang Li and M. Frans Kaashoek and Robert
    Morris},
  booktitle    = {Proceedings of the 6th {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '09)},
  year         = 2009,
  month        = apr,
  address      = {Boston, MA},
}

Privacy-preserving browser-side scripting with BFlow Alexander Yip, Neha Narula, Maxwell Krohn, and Robert Morris. EuroSys 2009.

Some web sites provide interactive extensions using browser scripts, often without inspecting the scripts to verify that they are benign and bug-free. Others handle users' confidential data and display it via the browser. Such new features contribute to the power of online services, but their combination would allow attackers to steal confidential data. This paper presents BFlow, a security system that uses information flow control to allow the combination while preventing attacks on data confidentiality.

BFlow allows untrusted JavaScript to compute with, render, and store confidential data, while preventing leaks of that data. BFlow tracks confidential data as it flows within the browser, between scripts on a page and between scripts and web servers. Using these observations and assistance from participating web servers, BFlow prevents scripts that have seen confidential data from leaking it, all without disrupting the JavaScript communication techniques used in complex web pages. To achieve these ends, BFlow augments browsers with a new "protection zone" abstraction.

We have implemented a BFlow browser reference monitor and server support. To evaluate BFlow's confidentiality protection and flexibility, we have built a BFlow-protected blog that supports Blogger's third party JavaScript extensions. BFlow is compatible with every legitimate Blogger extension that we have found, yet it prevents malicious extensions from leaking confidential data.

@inproceedings{bflow:eurosys09,
  title        = {Privacy-Preserving Browser-Side Scripting With
    {BFlow}},
  author       = {Alexander Yip and Neha Narula and Maxwell Krohn and
    Robert Morris},
  booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
    2009)},
  year         = 2009,
  month        = mar,
  address      = {Nuremberg, Germany},
}

Ksplice: Automatic rebootless kernel updates Jeff Arnold, and M. Frans Kaashoek. EuroSys 2009.

@inproceedings{ksplice:eurosys,
  title        = {Ksplice: Automatic rebootless kernel updates},
  author       = {Jeff Arnold and M. Frans Kaashoek},
  booktitle    = {Proceedings of the ACM EuroSys Conference (EuroSys
    2009)},
  year         = 2009,
  month        = mar,
  address      = {Nuremberg, Germany},
}

2008

Corey: An operating system for many cores Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, and Zheng Zhang. OSDI 2008.

@inproceedings{corey:osdi08,
  title        = {Corey: An Operating System for Many Cores},
  author       = {Silas Boyd-Wickizer and Haibo Chen and Rong Chen and
    Yandong Mao and Frans Kaashoek and Robert Morris and Aleksey
    Pesterev and Lex Stein and Ming Wu and Yuehua Dai and Yang Zhang
    and Zheng Zhang},
  pages        = {43--57},
  booktitle    = {Proceedings of the 8th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '08)},
  year         = 2008,
  month        = dec,
  address      = {San Diego, California},
}

An extension-oriented compiler Russ Cox. Ph.D. thesis, Massachusetts Institute of Technology, September 2008.

@phdthesis{rsc-phd,
  title        = {An Extension-Oriented Compiler},
  author       = {Russ Cox},
  school       = {Massachusetts Institute of Technology},
  year         = 2008,
  month        = sep,
}

Information flow control for secure web sites Maxwell Krohn. Ph.D. thesis, Massachusetts Institute of Technology, September 2008.

@phdthesis{krohn-phd,
  title        = {Information Flow Control for Secure Web Sites},
  author       = {Maxwell Krohn},
  school       = {Massachusetts Institute of Technology},
  year         = 2008,
  month        = sep,
}

UIA: A global connectivity architecture for mobile personal devices Bryan Alexander Ford. Ph.D. thesis, Massachusetts Institute of Technology, September 2008.

@phdthesis{ford-phd,
  title        = {{UIA}: A Global Connectivity Architecture for Mobile
    Personal Devices},
  author       = {Bryan Alexander Ford},
  school       = {Massachusetts Institute of Technology},
  year         = 2008,
  month        = sep,
}

Vx32: Lightweight user-level sandboxing on the x86 Bryan Ford, and Russ Cox. USENIX 2008.

Code sandboxing is useful for many purposes, but most sandboxing techniques require kernel modifications, do not completely isolate guest code, or incur substantial performance costs. Vx32 is a multipurpose user-level sandbox that enables any application to load and safely execute one or more guest plug-ins, confining each guest to a system call API controlled by the host application and to a restricted memory region within the host’s address space. Vx32 runs guest code efficiently on several widespread operating systems without kernel extensions or special privileges; it protects the host program from both reads and writes by its guests; and it allows the host to restrict the instruction set available to guests. The key to vx32’s combination of portability, flexibility, and efficiency is its use of x86 segmentation hardware to sandbox the guest’s data accesses, along with a lightweight instruction translator to sandbox guest instructions.

We evaluate vx32 using microbenchmarks and whole system benchmarks, and we examine four applications based on vx32: an archival storage system, an extensible public-key infrastructure, an experimental user-level operating system running atop another host OS, and a Linux system call jail. The first three applications export custom APIs independent of the host OS to their guests, making their plug-ins binary-portable across host systems. Compute-intensive workloads for the first two applications exhibit between a 30% slowdown and a 30% speedup on vx32 relative to native execution; speedups result from vx32’s instruction translator improving the cache locality of guest code. The experimental user-level operating system allows the use of the guest OS’s applications alongside the host’s native applications and runs faster than whole-system virtual machine monitors such as VMware and QEMU. The Linux system call jail incurs up to 80% overhead but requires no kernel modifications and is delegation-based, avoiding concurrency vulnerabilities present in other interposition mechanisms.

@inproceedings{vx32:usenix08,
  title        = {Vx32: Lightweight User-level Sandboxing on the x86},
  author       = {Bryan Ford and Russ Cox},
  booktitle    = {Proceedings of the 2008 USENIX Annual Technical
    Conference (USENIX '08)},
  year         = 2008,
  month        = jun,
  address      = {Boston, Massachusetts},
}

Storing and managing data in a distributed hash table Emil Sit. Ph.D. thesis, Massachusetts Institute of Technology, June 2008.

@phdthesis{pt:sit,
  title        = {Storing and Managing Data in a Distributed Hash
    Table},
  author       = {Emil Sit},
  school       = {Massachusetts Institute of Technology},
  year         = 2008,
  month        = jun,
}

Ksplice: An automatic system for rebootless kernel security updates Jeffrey Brian Arnold. Master's thesis, Massachusetts Institute of Technology, May 2008.

Ksplice allows system administrators to apply security patches to their operating system kernels without having to reboot. Based on a source code patch and the kernel source code to be patched, Ksplice applies the patch to the corresponding running kernel, without requiring work from a programmer. To be fully automatic, Ksplice's design is limited to patches that do not introduce semantic changes to data structures, but a study of all significant x86-32 Linux security patches from May 2005 to December 2007 finds that only eight patches of 50 make semantic changes. An evaluation with Debian and kernel.org Linux kernels shows that Ksplice can automatically apply the remaining 42 patches, which means that 84% of the Linux kernel vulnerabilities from this interval can be corrected by Ksplice without the need for rebooting.

@mastersthesis{ksplice:jbarnold-meng,
  title        = {{Ksplice}: An Automatic System for Rebootless Kernel
    Security Updates},
  author       = {Jeffrey Brian Arnold},
  school       = {Massachusetts Institute of Technology},
  year         = 2008,
  month        = may,
}

UsenetDHT: A low-overhead design for Usenet Emil Sit, Robert Morris, and M. Frans Kaashoek. NSDI 2008.

Usenet is a popular distributed messaging and file sharing service: servers in Usenet flood articles over an overlay network to fully replicate articles across all servers. However, replication of Usenet's full content requires that each server pay the cost of receiving (and storing) over 1 Tbyte/day. This paper presents the design and implementation of UsenetDHT, a Usenet system that allows a set of cooperating sites to keep a shared, distributed copy of Usenet articles. UsenetDHT consists of client-facing Usenet NNTP front-ends and a distributed hash table (DHT) that provides shared storage of articles across the wide area. This design allows participating sites to partition the storage burden, rather than replicating all Usenet articles at all sites.

UsenetDHT requires a DHT that maintains durability despite transient and permanent failures, and provides high storage performance. These goals can be difficult to provide simultaneously: even in the absence of failures, verifying adequate replication levels of large numbers of objects can be resource intensive, and interfere with normal operations. This paper introduces Passing Tone, a new replica maintenance algorithm for DHash that minimizes the impact of monitoring replication levels on memory and disk resources by operating with only pairwise communication. Passing Tone's implementation provides performance by using data structures that avoid disk accesses and enable batch operations.

Microbenchmarks over a local gigabit network demonstrate that the total system throughput scales linearly as servers are added, providing 5.7 Mbyte/s of write bandwidth and 7 Mbyte/s of read bandwidth per server. UsenetDHT is currently deployed on a 12-server network at 7 sites running Passing Tone over the wide-area: this network supports our research laboratory's live 2.5 Mbyte/s Usenet feed and 30.6 Mbyte/s of synthetic read traffic. These results suggest a DHT-based design may be a viable way to redesign Usenet and globally reduce costs.

@inproceedings{usenetdht:nsdi08,
  title        = {{UsenetDHT}: A Low-Overhead Design for {Usenet}},
  author       = {Emil Sit and Robert Morris and M. Frans Kaashoek},
  booktitle    = {Proceedings of the 5rd {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '08)},
  year         = 2008,
  month        = apr,
  address      = {San Francisco, CA},
}

An offline foundation for online accountable pseudonyms Bryan Ford, and Jacob Strauss. SocialNets 2008.

Online anonymity often appears to undermine accountability, offering little incentive for civil behavior, but accountability failures usually result not from anonymity itself but from the disposability of virtual identities. A user banned for misbehavior--- e.g., spamming from a free E-mail account or stuffing an online ballot box--- can simply open other accounts or connect from other IP addresses. Instead of curtailing freedom of expression by giving up anonymity, online services and communities should support accountable pseudonyms: virtual personas that can provide both anonymity and accountability. We propose Pseudonym parties, a scheme for creating accountable pseudonyms, which combine in-person social occasions (parties) with technical infrastructure (a pseudonymous sign-on service) to enforce the rule that one real person gets one virtual persona on any participating online service. Pseudonym parties enable the user to adopt different personas in different online spaces without revealing the connection between them, while ensuring that each user has only one accountable pseudonym in each space. Pseudonym parties can be started incrementally in a fully decentralized fashion, can run on volunteer labor with minimal funds, and may even be fun.

@inproceedings{accountable:socialnets08,
  title        = {An Offline Foundation for Online Accountable
    Pseudonyms},
  author       = {Bryan Ford and Jacob Strauss},
  booktitle    = {Proceedings of the First International Workshop on
    Social Network Systems (SocialNets 2008)},
  year         = 2008,
  month        = apr,
  address      = {Glasgow, Scotland},
}

A Sybil-proof one-hop DHT Chris Lesniewski-Laas. SocialNets 2008.

Decentralized systems, such as structured overlays, are subject to the Sybil attack, in which an adversary creates many false identities to increase its influence. This paper describes a one-hop distributed hash table which uses the social links between users to strongly resist the Sybil attack. The social network is assumed to be fast mixing, meaning that a random walk in the honest part of the network quickly approaches the uniform distribution. As in the related SybilLimit system, with a social network of n honest nodes and m honest edges, the protocol can tolerate up to o(n/log n) attack edges (social links from honest nodes to compromised nodes). The routing tables contain O(√m log m) entries per node and are constructed efficiently by a distributed protocol. This is the first sublinear solution to this problem. Preliminary simulation results are presented to demonstrate the approach's effectiveness.

@inproceedings{sybil:socialnets08,
  title        = {A {Sybil}-proof one-hop {DHT}},
  author       = {Chris Lesniewski-Laas},
  booktitle    = {Proceedings of the First International Workshop on
    Social Network Systems (SocialNets 2008)},
  year         = 2008,
  month        = apr,
  address      = {Glasgow, Scotland},
}

Xoc, an extension-oriented compiler for systems programming Russ Cox, Tom Bergan, Austin T. Clements, Frans Kaashoek, and Eddie Kohler. ASPLOS 2008.

Today's system programmers go to great lengths to extend the languages in which they program. For instance, system-specific compilers find errors in Linux and other systems, and add support for specialized control flow to Qt and event-based programs. These compilers are difficult to build and cannot always understand each other's language changes. However, they can greatly improve code understandability and correctness, advantages that should be accessible to all programmers.

We describe an extension-oriented compiler for C called xoc. An extension-oriented compiler, unlike a conventional extensible compiler, implements new features via many small extensions that are loaded together as needed. Xoc gives extension writers full control over program syntax and semantics while hiding many compiler internals. Xoc programmers concisely define powerful compiler extensions that, by construction, can be combined; even some parts of the base compiler, such as GNU C compatibility, are structured as extensions.

Xoc is based on two key interfaces. Syntax patterns allow extension writers to manipulate language fragments using concrete syntax. Lazy computation of attributes allows extension writers to use the results of analyses by other extensions or the core without needing to worry about pass scheduling.

Extensions built using xoc include xsparse, a 345-line extension that mimics Sparse, Linux's C front end, and xlambda, a 170-line extension that adds function expressions to C. An evaluation of xoc using these and 13 other extensions shows that xoc extensions are typically more concise than equivalent extensions written for conventional extensible compilers and that it is possible to compose extensions.

@inproceedings{xoc:asplos08,
  author       = {Russ Cox and Tom Bergan and Austin T. Clements and
    Frans Kaashoek and Eddie Kohler},
  title        = {Xoc, an Extension-Oriented Compiler for Systems
    Programming},
  pages        = {244--254},
  booktitle    = {Proceedings of the 13th International Conference on
    Architectural Support for Programming Languages and Operating
    Systems ({ASPLOS})},
  year         = 2008,
  month        = mar,
  address      = {Seattle, Washington},
}

A comparison of designs for extensible and extension-oriented compilers Austin T. Clements. Master's thesis, Massachusetts Institute of Technology, February 2008.

This thesis considers four extensible and extension-oriented compilers: CIL, Polyglot, xtc, and Xoc. These four compilers represent four distinctly different approaches to the problem of bridging the gap between language design and system implementation. Taking an extension author's point of view, this thesis compares the design of each compiler's extension interface in terms of extension structure, syntactic analysis, semantic analysis, and rewriting.

To perform the comparison, this thesis uses three extensions implemented variously in the four compilers: a bitwise rotation operator, function expressions, and lock checking. These extensions are designed to span a representative space of analysis and rewriting needs.

Based on this comparison, this thesis identifies the following implications of the design decisions of each extension interface: the expressiveness, understandability, and correctness of extension implementations can benefit from domain specific languages and language features tailored to the extension interface; compiler-managed scheduling trades loss of control for automatic extension composability; unifying internal and external program representation improves ease of use and extension composability, but gives up potentially useful control over the internal representation; concrete syntax patterns provide a natural interface to internal program representation, but must be more powerful than simple tree matching to be practical; grammars, types, and syntax interfaces have a natural correspondence; and accounting for semantic information in the final output enables hygienic rewriting, which can simplify extensions.

@mastersthesis{xoc:aclements-meng,
  title        = {A Comparison of Designs for Extensible and
    Extension-Oriented Compilers},
  author       = {Austin T. Clements},
  school       = {Massachusetts Institute of Technology},
  year         = 2008,
  month        = feb,
}

2007

A World Wide Web without walls Maxwell Krohn, Alex Yip, Micah Brodsky, Robert Morris, and Michael Walfish. HotNets 2007.

@inproceedings{w5:hotnets07,
  author       = {Maxwell Krohn and Alex Yip and Micah Brodsky and
    Robert Morris and Michael Walfish},
  title        = {A {World Wide Web} Without Walls},
  booktitle    = {Proceedings of the Sixth {W}orkshop on {H}ot
    {T}opics in {N}etworks ({HotNets-VI})},
  year         = 2007,
  month        = nov,
  organization = {{ACM SIGCOMM}},
  address      = {Atlanta, Georgia},
}

Alpaca: Extensible authorization for distributed services Chris Lesniewski-Laas, Bryan Ford, Jacob Strauss, Robert Morris, and M. Frans Kaashoek. CCS 2007.

Traditional Public Key Infrastructures (PKI) have not lived up to their promise because there are too many ways to define PKIs, too many cryptographic primitives to build them with, and too many administrative domains with incompatible roots of trust. Alpaca is an authentication and authorization framework that embraces PKI diversity by enabling one PKI to "plug in" another PKI's credentials and cryptographic algorithms, allowing users of the latter to authenticate themselves to services using the former using their existing, unmodified certificates. Alpaca builds on Proof-Carrying Authorization (PCA), expressing a credential as an explicit proof of a logical claim. Alpaca generalizes PCA to express not only delegation policies but also the cryptographic primitives, credential formats, and namespace structures needed to use foreign credentials directly. To achieve this goal, Alpaca introduces a method of creating and naming new principals which behave according to arbitrary rules, a modular approach to logical axioms, and a domain-specific language specialized for reasoning about authentication. We have implemented Alpaca as a Python module that assists applications in generating proofs (e.g., in a client requesting access to a resource), and in verifying those proofs via a compact 800-line TCB (e.g., in a server providing that resource). We present examples demonstrating Alpaca's extensibility in scenarios involving inter-organization PKI interoperability and secure remote PKI upgrade.

@inproceedings{alpaca:ccs07,
  title        = {Alpaca: Extensible Authorization for Distributed
    Services},
  author       = {Chris Lesniewski-Laas and Bryan Ford and Jacob
    Strauss and Robert Morris and M. Frans Kaashoek},
  booktitle    = {Proceedings of the 14th {ACM} Conference on Computer
    and Communications Security ({CCS-2007})},
  year         = 2007,
  month        = oct,
  address      = {Alexandria, VA},
}

Information flow control for standard OS abstractions Maxwell Krohn, Alexander Yip, Micah Brodsky, Natan Cliffer, M. Frans Kaashoek, Eddie Kohler, and Robert Morris. SOSP 2007.

@inproceedings{KYBCKMM07,
  author       = {Maxwell Krohn and Alexander Yip and Micah Brodsky
    and Natan Cliffer and M. Frans Kaashoek and Eddie Kohler and
    Robert Morris},
  title        = {Information Flow Control for Standard {OS}
    Abstractions},
  booktitle    = {Proceedings of the 21st {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '07)},
  year         = 2007,
  month        = oct,
  address      = {Stevenson, WA},
}

Structured streams: a new transport abstraction Bryan Ford. SIGCOMM 2007.

Internet applications currently have a choice between stream and datagram transport abstractions. Datagrams efficiently support small transactions and streams are suited for long-running conversations, but neither abstraction adequately supports applications like HTTP that exhibit a mixture of transaction sizes, or applications like FTP and SIP that use multiple transport instances. Structured Stream Transport (SST) enhances the traditional stream abstraction with a hierarchical hereditary structure, allowing applications to create lightweight child streams from any existing stream. Unlike TCP streams, these lightweight streams incur neither 3-way handshaking delays on startup nor TIME-WAIT periods on close. Each stream offers independent data transfer and flow control, allowing different transactions to proceed in parallel without head-of-line blocking, but all streams share one congestion control context. SST supports both reliable and best-effort delivery in a way that semantically unifies datagrams with streams and solves the classic “large datagram” problem, where a datagram's loss probability increases exponentially with fragment count. Finally, an application can prioritize its streams relative to each other and adjust priorities dynamically through out-of-band signaling. A user-space prototype shows that SST is TCP-friendly to within 2%, and performs comparably to a user-space TCP and to within 10% of kernel TCP on a WiFi network.

@inproceedings{sst:sigcomm07,
  title        = {Structured Streams: a New Transport Abstraction},
  author       = {Bryan Ford},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} '07 Conference},
  year         = 2007,
  month        = aug,
  address      = {Kyoto, Japan},
}

Events can make sense Maxwell Krohn, Eddie Kohler, and M. Frans Kaashoek. USENIX 2007.

@inproceedings{krohn:tame,
  author       = {Maxwell Krohn and Eddie Kohler and M. Frans Kaashoek},
  title        = {Events Can Make Sense},
  booktitle    = {Proceedings of the 2007 USENIX Annual Technical
    Conference (USENIX '07)},
  year         = 2007,
  month        = jun,
  address      = {Santa Clara, CA},
}

Don't give up on distributed file systems Jeremy Stribling, Emil Sit, M. Frans Kaashoek, Jinyang Li, and Robert Morris. IPTPS 2007.

Wide-area distributed applications often reinvent the wheel for their storage needs, each incorporating its own special-purpose storage manager to cope with distribution, intermittent failures, limited bandwidth, and high latencies. This paper argues that a distributed file system could provide a reusable solution to these problems by coupling a standard interface with a design suited to wide-area distribution. For concreteness, this paper presents such a file system, called WheelFS, which allows applications to control consistency through the use of semantic cues, and minimizes communication costs by adhering to the slogan read globally, write locally. WheelFS could simplify distributed experiments, CDNs, and Grid applications.

@inproceedings{wheelfs:iptps07,
  title        = {Don't Give Up on Distributed File Systems},
  author       = {Jeremy Stribling and Emil Sit and M. Frans Kaashoek
    and Jinyang Li and Robert Morris},
  booktitle    = {Proceedings of the 6th International Workshop on
    Peer-to-Peer Systems (IPTPS07)},
  year         = 2007,
  month        = feb,
  address      = {Bellevue, WA},
}

2006

Persistent personal names for globally connected mobile devices Bryan Ford, Jacob Strauss, Chris Lesniewski-Laas, Sean Rhea, Frans Kaashoek, and Robert Morris. OSDI 2006.

The Unmanaged Internet Architecture (UIA) provides zero-configuration connectivity among mobile devices through personal names. Users assign personal names through an ad hoc device introduction process requiring no central allocation. Once assigned, names bind securely to the global identities of their target devices independent of network location. Each user manages one namespace, shared among all the user's devices and always available on each device. Users can also name other users to share resources with trusted acquaintances. Devices with naming relationships automatically arrange connectivity when possible, both in ad hoc networks and using global infrastructure when available. A UIA prototype demonstrates these capabilities using optimistic replication for name resolution and group management and a routing algorithm exploiting the user's social network for connectivity.

Demo Video:

AVI: 20MB, H.264 codec; playable with VLC or MPlayer for example.
DVD: 262MB, MPEG-2; burn this ISO image onto a DVD and pop it into a DVD player.

@inproceedings{uia:osdi06,
  title        = {Persistent Personal Names for Globally Connected
    Mobile Devices},
  author       = {Bryan Ford and Jacob Strauss and Chris
    Lesniewski-Laas and Sean Rhea and Frans Kaashoek and Robert Morris},
  booktitle    = {Proceedings of the 7th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '06)},
  year         = 2006,
  month        = nov,
  address      = {Seattle, Washington},
}

Efficient replica maintenance for distributed storage systems Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, Frans Kaashoek, John Kubiatowicz, and Robert Morris. NSDI 2006.

This paper considers replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet. Maintaining replication in such systems can be prohibitively expensive, since every transient network or host failure could potentially lead to copying a server's worth of data over the Internet to maintain replication levels.

The following insights in designing an efficient replication algorithm emerge from the paper's analysis. First, durability can be provided separately from availability; the former is less expensive to ensure and a more useful goal for many wide-area applications. Second, the focus of a durability algorithm must be to create new copies of data objects faster than permanent disk failures destroy the objects; careful choice of policies for what nodes should hold what data can decrease repair time. Third, increasing the number of replicas of each data object does not help a system tolerate a higher disk failure probability, but does help tolerate bursts of failures. Finally, ensuring that the system makes use of replicas that recover after temporary failure is critical to efficiency.

Based on these insights, the paper proposes the Carbonite replication algorithm for keeping data durable at a low cost. A simulation of Carbonite storing 1 TB of data over a 365 day trace of PlanetLab activity shows that Carbonite is able to keep all data durable and uses 44% more network traffic than a hypothetical system that only responds to permanent failures. In comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.

@inproceedings{carbonite:nsdi06,
  title        = {Efficient Replica Maintenance for Distributed
    Storage Systems},
  author       = {Byung-Gon Chun and Frank Dabek and Andreas Haeberlen
    and Emil Sit and Hakim Weatherspoon and Frans Kaashoek and John
    Kubiatowicz and Robert Morris},
  booktitle    = {Proceedings of the 3rd {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '06)},
  year         = 2006,
  month        = may,
  address      = {San Jose, CA},
}

Overcite: A distributed, cooperative citeseer Jeremy Stribling, Jinyang Li, Isaac G. Councill, M. Frans Kaashoek, and Robert Morris. NSDI 2006.

CiteSeer is a popular online resource for the computer science research community, allowing users to search and browse a large archive of research papers. CiteSeer is expensive: it generates 35 GB of network traffic per day, requires nearly one terabyte of disk storage, and needs significant human maintenance.

OverCite is a new digital research library system that aggregates donated resources at multiple sites to provide CiteSeer-like document search and retrieval. OverCite enables members of the community to share the costs of running CiteSeer. The challenge facing OverCite is how to provide scalable and load-balanced storage and query processing with automatic data management. OverCite uses a three-tier design: presentation servers provide an identical user interface to CiteSeer's; application servers partition and replicate a search index to spread the work of answering each query among several nodes; and a distributed hash table stores documents and meta-data, and coordinates the activities of the servers.

Evaluation of a prototype shows that OverCite increases its query throughput by a factor of seven with a nine-fold increase in the number of servers. OverCite requires more total storage and network bandwidth than centralized CiteSeer, but spreads these costs over all the sites. OverCite can exploit the resources of these sites to support new features such as document alerts and to scale to larger data sets.

@inproceedings{overcite:nsdi06,
  title        = {OverCite: A Distributed, Cooperative CiteSeer},
  author       = {Jeremy Stribling and Jinyang Li and Isaac G.
    Councill and M. Frans Kaashoek and Robert Morris},
  booktitle    = {Proceedings of the 3rd {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '06)},
  year         = 2006,
  month        = may,
  address      = {San Jose, CA},
}

Pastwatch: A distributed version control system Alexander Yip, Benjie Chen, and Robert Morris. NSDI 2006.

@inproceedings{pastwatch:nsdi06,
  title        = {Pastwatch: A Distributed Version Control System},
  author       = {Alexander Yip and Benjie Chen and Robert Morris},
  booktitle    = {Proceedings of the 3rd {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '06)},
  year         = 2006,
  month        = may,
  address      = {San Jose, CA},
}

Asbestos: Operating system security for mobile devices Martijn Stevenson. Master's thesis, Massachusetts Institute of Technology, May 2006.

@mastersthesis{armasb:martijn-meng,
  title        = {Asbestos: Operating System Security for Mobile
    Devices},
  author       = {Martijn Stevenson},
  school       = {Massachusetts Institute of Technology},
  year         = 2006,
  month        = may,
}

Proactive replication for data durability Emil Sit, Andreas Haeberlen, Frank Dabek, Byung-Gon Chun, Hakim Weatherspoon, Frans Kaashoek, Robert Morris, and John Kubiatowicz. IPTPS 2006.

@inproceedings{tempo:iptps06,
  title        = {Proactive replication for data durability},
  author       = {Emil Sit and Andreas Haeberlen and Frank Dabek and
    Byung-Gon Chun and Hakim Weatherspoon and Frans Kaashoek and
    Robert Morris and John Kubiatowicz},
  booktitle    = {Proceedings of the 5th International Workshop on
    Peer-to-Peer Systems (IPTPS06)},
  year         = 2006,
  month        = feb,
  address      = {Santa Barbara, CA},
}

User-relative names for globally connected personal devices Bryan Ford, Jacob Strauss, Chris Lesniewski-Laas, Sean Rhea, Frans Kaashoek, and Robert Morris. IPTPS 2006.

Nontechnical users who own increasingly ubiquitous network-enabled personal devices such as laptops, digital cameras, and smart phones need a simple, intuitive, and secure way to share information and services between their devices. User Information Architecture, or UIA, is a novel naming and peer-to-peer connectivity architecture addressing this need. Users assign UIA names by "introducing" devices to each other on a common local-area network, but these names remain securely bound to their target as devices migrate. Multiple devices owned by the same user, once introduced, automatically merge their namespaces to form a distributed personal cluster that the owner can access or modify from any of his devices. Instead of requiring users to allocate globally unique names from a central authority, UIA enables users to assign their own user-relative names both to their own devices and to other users. With UIA, for example, Alice can always access her iPod from any of her own personal devices at any location via the name ipod, and her friend Bob can access her iPod via a relative name like ipod.Alice.

@inproceedings{uia:iptps06,
  title        = {User-Relative Names for Globally Connected Personal
    Devices},
  author       = {Bryan Ford and Jacob Strauss and Chris
    Lesniewski-Laas and Sean Rhea and Frans Kaashoek and Robert Morris},
  booktitle    = {Proceedings of the 5th International Workshop on
    Peer-to-Peer Systems (IPTPS06)},
  year         = 2006,
  month        = feb,
  address      = {Santa Barbara, CA},
}

2005

VXA: A virtual architecture for durable compressed archives Bryan Ford. FAST 2005.

Data compression algorithms change frequently, and obsolete decoders do not always run on new hardware and operating systems, threatening the long-term usability of content archived using those algorithms. Re-encoding content into new formats is cumbersome, and highly undesirable when lossy compression is involved. Processor architectures, in contrast, have remained comparatively stable over recent decades. VXA, an archival storage system designed around this observation, archives executable decoders along with the encoded content it stores. VXA decoders run in a specialized virtual machine that implements an OS-independent execution environment based on the standard x86 architecture. The VXA virtual machine strictly limits access to host system services, making decoders safe to run even if an archive contains malicious code. VXA's adoption of a "native" processor architecture instead of type-safe language technology allows reuse of existing "hand-optimized" decoders in C and assembly language, and permits decoders access to performance-enhancing architecture features such as vector processing instructions. The performance cost of VXA's virtualization is typically less than 15% compared with the same decoders running natively. The storage cost of archived decoders, typically 30-130KB each, can be amortized across many archived files sharing the same compression method.

@inproceedings{vxa:fast05,
  title        = {{VXA}: A Virtual Architecture for Durable Compressed
    Archives},
  author       = {Bryan Ford},
  booktitle    = {Proceedings of the 4th USENIX Conference on File and
    Storage Technologies (FAST '05)},
  year         = 2005,
  month        = dec,
  address      = {San Francisco, California},
}

Routing tradeoffs in dynamic peer-to-peer networks Jinyang Li. Ph.D. thesis, Massachusetts Institute of Technology, November 2005.

@phdthesis{accordion:jinyang,
  title        = {Routing Tradeoffs in Dynamic Peer-to-peer Networks},
  author       = {Jinyang Li},
  school       = {Massachusetts Institute of Technology},
  year         = 2005,
  month        = nov,
}

A distributed hash table Frank Dabek. Ph.D. thesis, Massachusetts Institute of Technology, November 2005.

@phdthesis{dht:dabek,
  title        = {A Distributed Hash Table},
  author       = {Frank Dabek},
  school       = {Massachusetts Institute of Technology},
  year         = 2005,
  month        = nov,
}

Labels and event processes in the Asbestos operating system Petros Efstathopoulos, Maxwell Krohn, Steve VanDeBogart, Cliff Frey, David Ziegler, Eddie Kohler, David Mazières, Frans Kaashoek, and Robert Morris. SOSP 2005.

@inproceedings{EKVFZKMKM05,
  author       = {Petros Efstathopoulos and Maxwell Krohn and Steve
    VanDeBogart and Cliff Frey and David Ziegler and Eddie Kohler and
    David Mazi\`eres and Frans Kaashoek and Robert Morris},
  title        = {Labels and Event Processes in the {Asbestos}
    Operating System},
  booktitle    = {Proceedings of the 20th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '05)},
  year         = 2005,
  month        = oct,
  address      = {Brighton, UK},
}

Sybil-resistant DHT routing George Danezis, Chris Lesniewski-Laas, M. Frans Kaashoek, and Ross Anderson. ESORICS 2005.

Distributed Hash Tables (DHTs) are very efficient distributed systems for routing, but at the same time vulnerable to disruptive nodes. Designers of such systems want them used in open networks, where an adversary can perform a sybil attack by introducing a large number of corrupt nodes in the network, considerably degrading its performance. We introduce a routing strategy that alleviates some of the effects of such an attack by making sure that lookups are performed using a diverse set of nodes. This ensures that at least some of the nodes queried are good, and hence the search makes forward progress. This strategy makes use of latent social information present in the introduction graph of the network.

@inproceedings{sybil:esorics05,
  title        = {Sybil-resistant {DHT} routing},
  author       = {George Danezis and Chris Lesniewski-Laas and M.
    Frans Kaashoek and Ross Anderson},
  booktitle    = {Proceedings of the 10th European Symposium On
    Research In Computer Security},
  year         = 2005,
  month        = sep,
  address      = {Milan, Italy},
}

Integrity and access control in untrusted content distribution networks Kevin Fu. Ph.D. thesis, Massachusetts Institute of Technology, September 2005.

@phdthesis{fu-phd,
  title        = {Integrity and access control in untrusted content
    distribution networks},
  author       = {Kevin Fu},
  school       = {Massachusetts Institute of Technology},
  year         = 2005,
  month        = sep,
}

SoftECC: A system for software memory integrity checking Dave Dopson. Master's thesis, Massachusetts Institute of Technology, September 2005.

@mastersthesis{softecc:ddopson-meng,
  title        = {{SoftECC}: A System for Software Memory Integrity
    Checking},
  author       = {Dave Dopson},
  school       = {Massachusetts Institute of Technology},
  year         = 2005,
  month        = sep,
}

Overcite: A cooperative digital research library Jeremy Stribling. Master's thesis, Massachusetts Institute of Technology, September 2005.

OverCite is a design for a new architecture for a distributed and cooperative research library based on a distributed hash table (DHT). The new architecture harnesses donated resources at many sites to provide document search and retrieval service to researchers worldwide. A preliminary evaluation of an initial OverCite prototype shows that it can service more queries per second than a centralized system, and that it increases total storage capacity by a factor of n/4 in a system of n nodes. OverCite can exploit these additional resources by supporting new features such as document alerts, and by scaling to larger data sets.

@mastersthesis{overcite:strib-ms,
  title        = {OverCite: A Cooperative Digital Research Library},
  author       = {Jeremy Stribling},
  school       = {Massachusetts Institute of Technology},
  year         = 2005,
  month        = sep,
}

Building fast and secure web services with OKWS Maxwell Krohn. Master's thesis, Massachusetts Institute of Technology, September 2005.

@mastersthesis{okws:krohn-ms,
  author       = {Maxwell Krohn},
  title        = {Building Fast and Secure Web Services With {OKWS}},
  school       = {Massachusetts Institute of Technology},
  year         = 2005,
  month        = sep,
}

SSL splitting: securely serving data from untrusted caches Chris Lesniewski-Laas, and M. Frans Kaashoek. Computer Networks 48(5), August 2005.

A popular technique for reducing the bandwidth load on Web servers is to serve the content from proxies. Typically these hosts are trusted by the clients and server not to modify the data that they proxy. SSL splitting is a new technique for guaranteeing the integrity of data served from proxies without requiring changes to Web clients. Instead of relaying an insecure HTTP connection, an SSL splitting proxy simulates a normal Secure Sockets Layer (SSL) connection with the client by merging authentication records from the server with data records from a cache. This technique reduces the bandwidth load on the server, while allowing an unmodified Web browser to verify that the data served from proxies is endorsed by the originating server. SSL splitting is implemented as a patch to the industry-standard OpenSSL library, with which the server is linked. In experiments replaying two-hour access.log traces taken from LCS Web sites over an ADSL link, SSL splitting reduces bandwidth consumption of the server by between 25% and 90% depending on the warmth of the cache and the redundancy of the trace. Uncached requests forwarded through the proxy exhibit latencies within approximately 5% of those of an unmodified SSL server.

@article{ssl-splitting:compnet05,
  title        = {{SSL} splitting: securely serving data from
    untrusted caches},
  author       = {Chris Lesniewski-Laas and M. Frans Kaashoek},
  journal      = {Computer Networks},
  publisher    = {Elsevier},
  volume       = 48,
  number       = 5,
  pages        = {763--779},
  year         = 2005,
  month        = aug,
}

Architecture and evaluation of an unplanned 802.11b mesh network John Bicket, Daniel Aguayo, Sanjit Biswas, and Robert Morris. MobiCom 2005.

This paper evaluates the ability of a wireless mesh architecture to provide high performance Internet access while demanding little deployment planning or operational management. The architecture considered in this paper has unplanned node placement (rather than planned topology), omni-directional antennas (rather than directional links), and multi-hop routing (rather than single-hop base stations). These design decisions contribute to ease of deployment, an important requirement for community wireless networks. However, this architecture carries the risk that lack of planning might render the network's performance unusably low.

For example, it might be necessary to place nodes carefully to ensure connectivity; the omni-directional antennas might provide uselessly short radio ranges; or the inefficiency of multi-hop forwarding might leave some users effectively disconnected.

The paper evaluates this unplanned mesh architecture with a case study of the Roofnet 802.11b mesh network. Roofnet consists of 37 nodes spread over four square kilometers of an urban area. The network provides users with usable performance despite lack of planning: the average inter-node throughput is 627 kbits/second, even though the average route has three hops.

The paper evaluates multiple aspects of the architecture: the effect of node density on connectivity and throughput; the characteristics of the links that the routing protocol elects to use; the usefulness of the highly connected mesh afforded by omni-directional antennas for robustness and throughput; and the potential performance of a single-hop network using the same nodes as Roofnet.

@inproceedings{roofnet:mobicom05,
  title        = {Architecture and Evaluation of an Unplanned 802.11b
    Mesh Network},
  author       = {John Bicket and Daniel Aguayo and Sanjit Biswas and
    Robert Morris},
  booktitle    = {Proceedings of the 11th {ACM} International
    Conference on Mobile Computing and Networking ({MobiCom} '05)},
  year         = 2005,
  month        = aug,
  address      = {Cologne, Germany},
}

Opportunistic routing in multi-hop wireless networks Sanjit Biswas, and Robert Morris. SIGCOMM 2005.

This paper describes ExOR, an integrated routing and MAC protocol that increases the throughput of large unicast transfers in multi-hop wireless networks. ExOR chooses each hop of a packet's route after the transmission for that hop, so that the choice can reflect which intermediate nodes actually received the transmission. This deferred choice gives each transmission multiple opportunities to make progress. As a result ExOR can use long radio links with high loss rates, which would be avoided by traditional routing. ExOR increases a connection's throughput while using no more network capacity than traditional routing.

ExOR's design faces the following challenges. The nodes that receive each packet must agree on their identities and choose one forwarder. The agreement protocol must have low overhead, but must also be robust enough that it rarely forwards a packet zero times or more than once. Finally, ExOR must choose the forwarder with the lowest remaining cost to the ultimate destination.

Measurements of an implementation on a 38-node 802.11b test-bed show that ExOR increases throughput for most node pairs when compared with traditional routing. For pairs between which traditional routing uses one or two hops, ExOR's robust acknowledgments prevent un necessary retransmissions, increasing throughput by nearly 35%. For more distant pairs, ExOR takes advantage of the choice of forwarders to provide throughput gains of a factor of two to four.

@inproceedings{roofnet:exor-sigcomm05,
  title        = {Opportunistic Routing in Multi-Hop Wireless Networks},
  author       = {Sanjit Biswas and Robert Morris},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} '05 Conference},
  year         = 2005,
  month        = aug,
  address      = {Philadelphia, Pennsylvania},
}

Make least privilege a right (not a privilege) Maxwell Krohn, Petros Efstathopoulos, Cliff Frey, Frans Kaashoek, Eddie Kohler, David Mazières, Robert Morris, Michelle Osborne, Steve VanDeBogart, and David Ziegler. HotOS 2005.

@inproceedings{KEFKKMMOVZ05,
  author       = {Maxwell Krohn and Petros Efstathopoulos and Cliff
    Frey and Frans Kaashoek and Eddie Kohler and David Mazi\`eres and
    Robert Morris and Michelle Osborne and Steve VanDeBogart and David
    Ziegler},
  title        = {Make Least Privilege a Right (Not a Privilege)},
  booktitle    = {Proceedings of the 10th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS-X})},
  year         = 2005,
  month        = jun,
  address      = {Santa Fe, NM},
}

Bandwidth-efficient management of DHT routing tables Jinyang Li, Jeremy Stribling, Robert Morris, and M. Frans Kaashoek. NSDI 2005.

Today an application developer using a distributed hash table (DHT) with n nodes must choose a DHT protocol from the spectrum between O(1) lookup protocols and O(log n) protocols. O(1) protocols achieve low latency lookups on small or low-churn networks because lookups take only a few hops, but incur high maintenance traffic on large or high-churn networks. O(log n) protocols incur less maintenance traffic on large or high-churn networks but require more lookup hops in small networks. Accordion is a new routing protocol that does not force the developer to make this choice: Accordion adjusts itself to provide the best performance across a range of network sizes and churn rates while staying within a bounded bandwidth budget.

The key challenges in the design of Accordion are the algorithms that choose the routing table's size and content. Each Accordion node learns of new neighbors opportunistically, in a way that causes the density of its neighbors to be inversely proportional to their distance in ID space from the node. This distribution allows Accordion to vary the table size along a continuum while still guaranteeing at most O(log n) lookup hops. The user-specified bandwidth budget controls the rate at which a node learns about new neighbors. Each node limits its routing table size by evicting neighbors that it judges likely to have failed. High churn (i.e., short node lifetimes) leads to a high eviction rate. The equilibrium between the learning and eviction processes determines the table size.

Simulations show that Accordion maintains an efficient lookup latency versus bandwidth tradeoff over a wider range of operating conditions than existing DHTs.

@inproceedings{accordion:nsdi05,
  title        = {Bandwidth-efficient management of {DHT} routing
    tables},
  author       = {Jinyang Li and Jeremy Stribling and Robert Morris
    and M. Frans Kaashoek},
  booktitle    = {Proceedings of the 2nd {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '05)},
  year         = 2005,
  month        = may,
  address      = {Boston, Massachusetts},
}

Improving web availability for clients with MONET David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Rohit Rao. NSDI 2005.

@inproceedings{monet:nsdi05,
  title        = {Improving Web Availability for Clients with {MONET}},
  author       = {David G. Andersen and Hari Balakrishnan and M. Frans
    Kaashoek and Rohit Rao},
  booktitle    = {Proceedings of the 2nd {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '05)},
  year         = 2005,
  month        = may,
  address      = {Boston, Massachusetts},
}

Mandatory security and performance of services in Asbestos David Patrick Ziegler. Master's thesis, Massachusetts Institute of Technology, May 2005.

@mastersthesis{asbestos:ziegler-meng,
  title        = {Mandatory Security and Performance of Services in
    {Asbestos}},
  author       = {David Patrick Ziegler},
  school       = {Massachusetts Institute of Technology},
  year         = 2005,
  month        = may,
}

Acetone: A system call interface for Asbestos labels Clifford A. Frey. Master's thesis, Massachusetts Institute of Technology, May 2005.

@mastersthesis{acetone:frey-meng,
  title        = {Acetone: A System Call Interface for {Asbestos}
    Labels},
  author       = {Clifford A. Frey},
  school       = {Massachusetts Institute of Technology},
  year         = 2005,
  month        = may,
}

Peer-to-peer communication across network address translators Bryan Ford, Pyda Srisuresh, and Dan Kegel. USENIX 2005.

@inproceedings{p2pnat:usenix05,
  title        = {Peer-to-Peer Communication Across Network Address
    Translators},
  author       = {Bryan Ford and Pyda Srisuresh and Dan Kegel},
  booktitle    = {Proceedings of the 2005 USENIX Annual Technical
    Conference (USENIX '05)},
  year         = 2005,
  month        = apr,
  address      = {Anaheim, California},
}

A performance vs. cost framework for evaluating DHT design tradeoffs under churn Jinyang Li, Jeremy Stribling, Robert Morris, M. Frans Kaashoek, and Thomer M. Gil. Infocom 2005.

Protocols for distributed hash tables (DHTs) incorporate features to achieve low latency for lookup requests in the face of churn, continuous changes in membership. These protocol features can include a directed identifier space, parallel lookups, pro-active flooding of membership changes, and stabilization protocols for maintaining accurate routing. In addition, DHT protocols have parameters that can be tuned to achieve different tradeoffs between lookup latency and communication cost due to maintenance traffic. The relative importance of the features and parameters is not well understood, because most previous work evaluates protocols on static networks.

This paper presents a performance versus cost framework (PVC) that allows designers to compare the effects of different protocol features and parameter values. PVC views a protocol as consuming a certain amount of network bandwidth in order to achieve a certain lookup latency, and helps reveal the efficiency with which protocols use additional network resources to improve latency. To demonstrate the value of PVC, this paper simulates Chord, Kademlia, Kelips, OneHop, and Tapestry under different workloads and uses PVC to understand which features are more important under churn. PVC analysis shows that the key to efficiently using additional bandwidth is for a protocol to adjust its routing table size. It also shows that routing table stabilization is wasteful and can be replaced with opportunistic learning through normal lookup traffic. These insights combined demonstrate that PVC is a valuable tool for DHT designers.

@inproceedings{dhtcomparison:infocom05,
  title        = {A performance vs. cost framework for evaluating
    {DHT} design tradeoffs under churn},
  author       = {Jinyang Li and Jeremy Stribling and Robert Morris
    and M. Frans Kaashoek and Thomer M. Gil},
  booktitle    = {Proceedings of the 24th Infocom},
  year         = 2005,
  month        = mar,
  address      = {Miami, FL},
}

Opportunistic routing in multi-hop wireless networks Sanjit Zubin Biswas. Master's thesis, Massachusetts Institute of Technology, March 2005.

This thesis describes ExOR, an integrated routing and MAC protocol for bulk transfers in multi-hop wireless networks. ExOR exploits the broadcast nature of radios by making forwarding decisions based on which nodes receive each transmission. The spatial diversity among receivers provides each transmission multiple opportunities to make progress in the face of packet losses. As a result ExOR can use long links with high loss rates, which would be avoided by unicast routing.

ExOR operates on batches of packets. The source node includes a list of candidate forwarders in each packet, prioritized by closeness to the destination. Receiving nodes buffer successfully received packets and await the end of the batch. The highest priority forwarder then broadcasts the packets in its buffer, including its copy of the ``batch map'' in each packet. The batch map contains the sender's best guess of the highest priority node to have received each packet. The remaining forwarders then transmit in order, sending only packets which were not acknowledged in the batch maps of higher priority nodes. The forwarders continue to cycle through the priority list until the destination has enough packets to recover the original data using forward error correction.

An evaluation of an implementation on a 38-node 802.11b test-bed shows that ExOR improves bulk data transfer throughput for most node pairs when compared with unicast routing. For pairs between which unicast uses one or two hops, ExOR's robust batch maps prevent unnecessary retransmissions, increasing throughput by nearly 50\%. For longer unicast routes, ExOR takes advantage of spatial diversity, providing gains of a factor of two to four when using a batch size of 10 packets.

@mastersthesis{roofnet:biswas-ms,
  title        = {Opportunistic Routing in Multi-Hop Wireless Networks},
  author       = {Sanjit Zubin Biswas},
  school       = {Massachusetts Institute of Technology},
  year         = 2005,
  month        = mar,
}

Arpeggio: Metadata searching and content sharing with Chord Austin T. Clements, Dan R. K. Ports, and David R. Karger. IPTPS 2005.

Arpeggio is a peer-to-peer file-sharing network based on the Chord lookup primitive. Queries for data whose metadata matches a certain criterion are performed efficiently by using a distributed keyword-set index, augmented with index-side filtering. We introduce index gateways, a technique for minimizing index maintenance overhead. Because file data is large, Arpeggio employs subrings to track live source peers without the cost of inserting the data itself into the network. Finally, we introduce postfetching, a technique that uses information in the index to improve the availability of rare files. The result is a system that provides efficient query operations with the scalability and reliability advantages of full decentralization, and a content distribution system tuned to the requirements and capabilities of a peer-to-peer network.

@inproceedings{arpeggio:iptps05,
  title        = {Arpeggio: Metadata Searching and Content Sharing
    with {C}hord},
  author       = {Austin T. Clements and Dan R. K. Ports and David R.
    Karger},
  booktitle    = {Proceedings of the 4th International Workshop on
    Peer-to-Peer Systems (IPTPS05)},
  year         = 2005,
  month        = feb,
  address      = {Ithaca, NY},
}

Overcite: A cooperative digital research library Jeremy Stribling, Isaac G. Councill, Jinyang Li, M. Frans Kaashoek, David R. Karger, Robert Morris, and Scott Shenker. IPTPS 2005.

CiteSeer is a well-known online resource for the computer science research community, allowing users to search and browse a large archive of research papers. Unfortunately, its current centralized incarnation is costly to run. Although members of the community would presumably be willing to donate hardware and bandwidth at their own sites to assist CiteSeer, the current architecture does not facilitate such distribution of resources. OverCite is a proposal for a new architecture for a distributed and cooperative research library based on a distributed hash table (DHT). The new architecture will harness resources at many sites, and thereby be able to support new features such as document alerts and scale to larger data sets.

@inproceedings{overcite:iptps05,
  title        = {OverCite: A Cooperative Digital Research Library},
  author       = {Jeremy Stribling and Isaac G. Councill and Jinyang
    Li and M. Frans Kaashoek and David R. Karger and Robert Morris and
    Scott Shenker},
  booktitle    = {Proceedings of the 4th International Workshop on
    Peer-to-Peer Systems (IPTPS05)},
  year         = 2005,
  month        = feb,
  address      = {Ithaca, NY},
}

Bit-rate selection in wireless networks John Bicket. Master's thesis, Massachusetts Institute of Technology, February 2005.

@mastersthesis{roofnet:jbicket-ms,
  title        = {Bit-rate Selection in Wireless Networks},
  author       = {John Bicket},
  school       = {Massachusetts Institute of Technology},
  year         = 2005,
  month        = feb,
}

2004

Middleboxes no longer considered harmful Michael Walfish, Jeremy Stribling, Maxwell Krohn, Hari Balakrishnan, Robert Morris, and Scott Shenker. OSDI 2004.

Intermediate network elements, such as network address translators (NATs), firewalls, and transparent caches are now commonplace. The usual reaction in the network architecture community to these so-called middleboxes is a combination of scorn (because they violate important architectural principles) and dismay (because these violations make the Internet less flexible). While we acknowledge these concerns, we also recognize that middleboxes have become an Internet fact of life for important reasons. To retain their functions while eliminating their dangerous side-effects, we propose an extension to the Internet architecture, called the Delegation-Oriented Architecture (DOA), that not only allows, but also facilitates, the deployment of middleboxes. DOA involves two relatively modest changes to the current architecture: (a) a set of references that are carried in packets and serve as persistent host identifiers and (b) a way to resolve these references to delegates chosen by the referenced host.

@inproceedings{doa:osdi04,
  title        = {Middleboxes No Longer Considered Harmful},
  author       = {Michael Walfish and Jeremy Stribling and Maxwell
    Krohn and Hari Balakrishnan and Robert Morris and Scott Shenker},
  booktitle    = {Proceedings of the 6th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '04)},
  year         = 2004,
  month        = dec,
  address      = {San Francisco, California},
}

An empirical study of spam traffic and the use of DNS black lists Jaeyeon Jung, and Emil Sit. SIGCOMM 2004.

@inproceedings{empircal:imc04,
  author       = {Jaeyeon Jung and Emil Sit},
  title        = {An Empirical Study of Spam Traffic and the Use of
    {DNS} Black Lists},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} Internet
    Measurement Conference '04},
  year         = 2004,
  month        = oct,
  address      = {Taormina, Sicily, Italy},
}

Multiq: Automated detection of multiple bottleneck capacities along a path Sachin Katti, Dina Katabi, Charles Blake, Eddie Kohler, and Jacob Strauss. SIGCOMM 2004.

@inproceedings{multiq:imc04,
  title        = {MultiQ: Automated Detection of Multiple Bottleneck
    Capacities Along a Path},
  author       = {Sachin Katti and Dina Katabi and Charles Blake and
    Eddie Kohler and Jacob Strauss},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} Internet
    Measurement Conference '04},
  year         = 2004,
  month        = oct,
  address      = {Taormina, Sicily, Italy},
}

User authentication and remote execution across administrative domains Michael Kaminsky. Ph.D. thesis, Massachusetts Institute of Technology, September 2004.

@phdthesis{kaminsky-phd,
  title        = {User Authentication and Remote Execution Across
    Administrative Domains},
  author       = {Michael Kaminsky},
  school       = {Massachusetts Institute of Technology},
  year         = 2004,
  month        = sep,
}

Link-level measurements from an 802.11b mesh network Daniel Aguayo, John Bicket, Sanjit Biswas, Glenn Judd, and Robert Morris. SIGCOMM 2004.

This paper analyzes the causes of packet loss in a 38-node urban multi-hop 802.11b network. The patterns and causes of loss are important in the design of routing and error-correction protocols, as well as in network planning.

This paper makes the following observations. The distribution of inter-node loss rates is relatively uniform over the whole range of loss rates; there is no clear theshold separating "in range" and "out of range." Most links have relatively stable loss rates from one second to the next, though a small minotiry have very bursty losses at that time scale. Signal-to-noise ratio and distance have little predictive value for loss rate. The large number of links with intermediate loss rates is probably due to multi-path fading rather than attenuation or interference.

The prenomena discussed here are all well-known. The contributions of this paper are an understanding of their relative importance, of how they interact, and of the implications for MAC and routing protocol design.

@inproceedings{roofnet:sigcomm04,
  title        = {Link-level Measurements from an 802.11b Mesh Network},
  author       = {Daniel Aguayo and John Bicket and Sanjit Biswas and
    Glenn Judd and Robert Morris},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} '04 Conference},
  year         = 2004,
  month        = aug,
  address      = {Portland, Oregon},
}

Vivaldi: A decentralized network coordinate system Frank Dabek, Russ Cox, Frans Kaashoek, and Robert Morris. SIGCOMM 2004.

This paper is posted as corrected on August 28, 2004. The original version of the paper incorrectly described how often nodes measure the RTT to another node. Nodes keep one RTT probe RPC outstanding (instead of sending one probe per second)

Large-scale Internet applications can benefit from an ability to predict round-trip times to other hosts without having to contact them first. Explicit measurements are often unattractive because the cost of measurement can outweigh the benefits of exploiting proximity information. Vivaldi is a simple, light-weight algorithm that assigns synthetic coordinates to hosts such that the distance between the coordinates of two hosts accurately predicts the communication latency between the hosts.

Vivaldi is fully distributed, requiring no fixed network infrastructure and no distinguished hosts. It is also efficient: a new host can compute good coordinates for itself after collecting latency information from only a few other hosts. Because it requires little communication, Vivaldi can piggy-back on the communication patterns of the application using it and scale to a large number of hosts.

An evaluation of Vivaldi using a simulated network whose latencies are based on measurements among 1740 Internet hosts shows that a 2-dimensional Euclidean model with height vectors embeds these hosts with low error (the median relative error in round-trip time prediction is 11 percent).

@inproceedings{vivaldi:sigcomm,
  title        = {Vivaldi: A Decentralized Network Coordinate System},
  author       = {Frank Dabek and Russ Cox and Frans Kaashoek and
    Robert Morris},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} '04 Conference},
  year         = 2004,
  month        = aug,
  address      = {Portland, Oregon},
}

Building secure high-performance web services with OKWS Maxwell Krohn. USENIX 2004.

@inproceedings{krohn:okws,
  title        = {Building Secure High-Performance Web Services with
    {OKWS}},
  author       = {Maxwell Krohn},
  address      = {Boston, MA},
  month        = jun,
  booktitle    = {Proceedings of the 2004 USENIX Annual Technical
    Conference (USENIX '04)},
  year         = 2004,
}

REX: Secure, Extensible Remote Execution Michael Kaminsky, Eric Peterson, Daniel B. Giffin, Kevin Fu, David Mazières, and M. Frans Kaashoek. USENIX 2004.

This version of the paper contains corrected values for the round-trip time reported in Section 6.2 and the latencies reported in Table 2. The previous values were only one-way instead of round-trip.

The ubiquitous SSH package has demonstrated the importance of secure remote login and execution. As re- mote execution tools grow in popularity, users require new features and extensions, which are difficult to add to existing systems. REX is a remote execution utility with a novel architecture specifically designed for extensibility as well as security and transparent connection persistence in the face of network complexities such as NAT and dynamic IP addresses. To achieve extensibility, REX bases much of its functionality on a single new abstraction--emulated file descriptor passing across machines. This abstraction is powerful enough for users to extend REX's functionality in many ways without changing the core software or protocol.

REX addresses security in two ways. First, the implementation internally leverages file descriptor passing to split the server into several smaller programs, reducing both privileged and remotely exploitable code. Second, REX selectively delegates authority to processes running on remote machines that need to access other resources. The delegation mechanism lets users incrementally construct trust policies for remote machines. Finally, REX provides mechanisms for accessing servers without globally routable IP addresses, and for resuming sessions when a TCP connection aborts or an endpoint's IP address changes. Measurements of the system demonstrate that REX's architecture does not come at the cost of performance.

@inproceedings{rex:usenix04,
  title        = {{REX}: {S}ecure, {E}xtensible {R}emote {E}xecution},
  author       = {Michael Kaminsky and Eric Peterson and Daniel B.
    Giffin and Kevin Fu and David Mazi{\`e}res and M. Frans Kaashoek},
  address      = {Boston, MA},
  month        = jun,
  pages        = {199--212},
  booktitle    = {Proceedings of the 2004 USENIX Annual Technical
    Conference (USENIX '04)},
  year         = 2004,
}

High-throughput routing for multi-hop wireless networks Douglas S. J. De Couto. Ph.D. thesis, Massachusetts Institute of Technology, June 2004.

@phdthesis{decouto-phd,
  title        = {High-Throughput Routing for Multi-Hop Wireless
    Networks},
  author       = {Douglas S. J. {De Couto}},
  school       = {Massachusetts Institute of Technology},
  year         = 2004,
  month        = jun,
}

On-the-fly verification of rateless erasure codes for efficient content distribution Maxwell N. Krohn, Michael J. Freedman, and David Mazières. IEEE SP 2004.

The quality of peer-to-peer content distribution can suffer when malicious participants intentionally corrupt content. Some systems using simple block-by-block downloading can verify blocks with traditional cryptographic signatures and hashes, but these techniques do not apply well to more elegant systems that use rateless erasure codes for efficient multicast transfers. This paper presents a practical scheme, based on homomorphic hashing, that enables a downloader to perform on-the-fly verification of erasure-encoded blocks.

@inproceedings{otfvec,
  title        = {On-the-Fly Verification of Rateless Erasure Codes
    for Efficient Content Distribution},
  author       = {Maxwell N. Krohn and Michael J. Freedman and David
    Mazi{\`e}res},
  booktitle    = {Proceedings of the {IEEE} Symposium on Security and
    Privacy},
  address      = {Oakland, CA},
  month        = may,
  year         = 2004,
}

Providing asynchronous file I/O for the Plan 9 operating system Jason Hickey. Master's thesis, Massachusetts Institute of Technology, May 2004.

@mastersthesis{plan9:jmhickey-meng,
  title        = {Providing Asynchronous File {I/O} for the {Plan 9}
    Operating System},
  author       = {Jason Hickey},
  school       = {Massachusetts Institute of Technology},
  year         = 2004,
  month        = may,
}

M&M: A passive toolkit for measuring, correlating, and tracking path characteristics Sachin Katti, Dina Katabi, Eddie Kohler, and Jacob Strauss. MIT CSAIL technical report, April 2004.

@techreport{mnm:katti04-tr,
  author       = {Sachin Katti and Dina Katabi and Eddie Kohler and
    Jacob Strauss},
  title        = {{M\&M}: A Passive Toolkit for Measuring,
    Correlating, and Tracking Path Characteristics},
  institution  = {{MIT} Computer Science and Artificial Intelligence
    Laboratory},
  number       = {MIT-CSAIL-TR-945},
  year         = 2004,
  month        = apr,
}

Designing a DHT for low latency and high throughput Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans Kaashoek, and Robert Morris. NSDI 2004.

Designing a wide-area distributed hash table (DHT) that provides high-throughput and low-latency network storage is a challenge. Existing systems have explored a range of solutions, including iterative routing, recursive routing, proximity routing and neighbor selection, erasure coding, replication, and server selection.

This paper explores the design of these techniques and their interaction in a complete system, drawing on the measured performance of a new DHT implementation and results from a simulator with an accurate Internet latency model. New techniques that resulted from this exploration include use of latency predictions based on synthetic coordinates, efficient integration of lookup routing and data fetching, and a congestion control mechanism suitable for fetching data striped over large numbers of servers.

Measurements with 425 server instances running on 150 PlanetLab and RON hosts show that the latency optimizations reduce the time required to locate and fetch data by a factor of two. The throughput optimizations result in a sustainable bulk read throughput related to the number of DHT hosts times the capacity of the slowest access link; with 150 selected PlanetLab hosts, the peak aggregate throughput over multiple clients is 12.8 megabytes per second.

@inproceedings{dhash:nsdi,
  title        = {Designing a {DHT} for low latency and high
    throughput},
  author       = {Frank Dabek and Jinyang Li and Emil Sit and James
    Robertson and M. Frans Kaashoek and Robert Morris},
  booktitle    = {Proceedings of the 1st {USENIX} {S}ymposium on
    {N}etworked {S}ystems {D}esign and {I}mplementation ({NSDI} '04)},
  year         = 2004,
  month        = mar,
  address      = {San Francisco, California},
}

Comparing the performance of distributed hash tables under churn Jinyang Li, Jeremy Stribling, Thomer M. Gil, Robert Morris, and M. Frans Kaashoek. IPTPS 2004.

A protocol for a distributed hash table (DHT) incurs communication costs to keep up with churn---changes in membership---in order to maintain its ability to route lookups efficiently. This paper formulates a unified framework for evaluating cost and performance. Communication costs are combined into a single cost measure (bytes), and performance benefits are reduced to a single latency measure. This approach correctly accounts for background maintenance traffic and timeouts during lookup due to stale routing data, and also correctly leaves open the possibility of different preferences in the tradeoff of lookup time versus communication cost. Using the unified framework, this paper analyzes the effects of DHT parameters on the performance of four protocols under churn.

@inproceedings{dhtcomparison:iptps04,
  title        = {Comparing the performance of distributed hash tables
    under churn},
  author       = {Jinyang Li and Jeremy Stribling and Thomer M. Gil
    and Robert Morris and M. Frans Kaashoek},
  booktitle    = {Proceedings of the 3rd International Workshop on
    Peer-to-Peer Systems (IPTPS04)},
  year         = 2004,
  month        = feb,
  address      = {San Diego, CA},
}

UsenetDHT: A low overhead Usenet server Emil Sit, Frank Dabek, and James Robertson. IPTPS 2004.

@inproceedings{usenetdht:iptps04,
  title        = {{UsenetDHT}: A Low Overhead {Usenet} Server},
  author       = {Emil Sit and Frank Dabek and James Robertson},
  booktitle    = {Proceedings of the 3rd International Workshop on
    Peer-to-Peer Systems (IPTPS04)},
  year         = 2004,
  month        = feb,
  address      = {San Diego, CA},
}

Parsing expression grammars: A recognition-based syntactic foundation Bryan Ford. POPL 2004.

@inproceedings{parsing:popl04,
  author       = {Bryan Ford},
  title        = {Parsing Expression Grammars: A Recognition-Based
    Syntactic Foundation},
  booktitle    = {Proceedings of the Symposium on Principles of
    Programming Languages},
  year         = 2004,
  month        = jan,
}

2003

Unmanaged internet protocol: Taming the edge network management crisis Bryan Ford. HotNets 2003.

@inproceedings{uip:hotnets03,
  author       = {Bryan Ford},
  title        = {Unmanaged Internet Protocol: Taming the Edge Network
    Management Crisis},
  booktitle    = {Proceedings of the Second {W}orkshop on {H}ot
    {T}opics in {N}etworks ({HotNets-II})},
  year         = 2003,
  month        = nov,
  organization = {{ACM SIGCOMM}},
  address      = {Cambridge, Massachusetts},
}

Practical, distributed network coordinates Russ Cox, Frank Dabek, Frans Kaashoek, Jinyang Li, and Robert Morris. HotNets 2003.

Vivaldi is a distributed algorithm that assigns synthetic coordinates to Internet hosts, so that the Euclidean distance between two hosts' coordinates predicts the network latency between them. Each node in Vivaldi computes its coordinates by simulating its position in a network of physical springs. Vivaldi is both distributed and efficient: no fixed infrastructure need be deployed and a new host can compute useful coordinates after collecting latency information from only a few other hosts. Vivaldi can rely on piggy-backing latency information on application traffic instead of generating extra traffic by sending its own probe packets.

This paper evaluates Vivaldi through simulations of 750 hosts, with a matrix of inter-host latencies derived from measurements between 750 real Internet hosts. Vivaldi finds synthetic coordinates that predict the measured latencies with a median relative error of 14 percent. The simulations show that a new host joining an existing Vivaldi system requires fewer than 10 probes to achieve this accuracy. Vivaldi is currently used by the Chord distributed hash table to perform proximity routing, replica selection, and retransmission timer estimation.

@inproceedings{hotnets:vivaldi,
  title        = {Practical, Distributed Network Coordinates},
  author       = {Russ Cox and Frank Dabek and Frans Kaashoek and
    Jinyang Li and Robert Morris},
  booktitle    = {Proceedings of the Second {W}orkshop on {H}ot
    {T}opics in {N}etworks ({HotNets-II})},
  year         = 2003,
  month        = nov,
  organization = {{ACM SIGCOMM}},
  address      = {Cambridge, Massachusetts},
}

Opportunistic routing in multi-hop wireless networks Sanjit Biswas, and Robert Morris. HotNets 2003.

This paper describes Extremely Opportunistic Routing (ExOR), a new unicast routing technique for multi-hop wireless networks. ExOR forwards each packet through a sequence of nodes, deferring the choice of each node in the sequence until after the previous node has transmitted the packet on its radio. ExOR then determines which node, of all the nodes that successfully received that transmission, is the node closest to the destination. That closest node transmits the packet. The result is that each hop moves the packet farther (on average) than the hops of the best possible pre-determined route.

The ExOR design addresses the challenge of choosing a forwarding node after transmission using a distributed algorithm. First, when a node transmits a packet, it includes in the packet a simple schedule describing the priority order in which the potential receivers should forward the packet. The node computes the schedule based on shared measurements of inter-node delivery rates. ExOR then uses a distributed slotted MAC protocol for acknowledgments to ensure that the receivers agree who the highest priority receiver was.

The efficacy of ExOR depends mainly on the rate at which the reception probability falls off with distance. Simulations based on measured radio characteristics suggest that ExOR reduces the total number of transmissions by nearly a factor of two over the best possible pre-determined route.

@inproceedings{roofnet:exor-hotnets03,
  title        = {Opportunistic Routing in Multi-Hop Wireless Networks},
  author       = {Sanjit Biswas and Robert Morris},
  booktitle    = {Proceedings of the Second {W}orkshop on {H}ot
    {T}opics in {N}etworks ({HotNets-II})},
  year         = 2003,
  month        = nov,
  organization = {{ACM SIGCOMM}},
  address      = {Cambridge, Massachusetts},
}

Measuring the effects of internet path faults on reactive routing Nick Feamster, David G. Andersen, Hari Balakrishnan, and M. Frans Kaashoek. SIGMETRICS 2003.

@inproceedings{sigmetrics:routing,
  title        = {Measuring the effects of internet path faults on
    reactive routing},
  author       = {Nick Feamster and David G. Andersen and Hari
    Balakrishnan and M. Frans Kaashoek},
  pages        = {126--137},
  booktitle    = {Proceedings of the 2003 Conference on Measurement
    and Modelling of Computer Systems},
  year         = 2003,
  month        = nov,
  organization = {{ACM SIGMETRICS}},
  address      = {San Diego, CA},
}

Scalable Internet routing on topology-independent node identities Bryan Ford. MIT LCS technical report, October 2003.

@techreport{uip:ford03-tr,
  author       = {Bryan Ford},
  title        = {Scalable {Internet} Routing on Topology-Independent
    Node Identities},
  institution  = {{MIT} Laboratory for Computer Science},
  number       = {MIT-LCS-TR-926},
  year         = 2003,
  month        = oct,
}

A Measurement Study of Available Bandwidth Estimation Tools Jacob Strauss, Dina Katabi, and Frans Kaashoek. SIGCOMM 2003.

@inproceedings{spruce:imc03,
  title        = {{A Measurement Study of Available Bandwidth
    Estimation Tools}},
  author       = {Jacob Strauss and Dina Katabi and Frans Kaashoek},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} Internet
    Measurement Conference '03},
  year         = 2003,
  month        = oct,
  address      = {Miami, Florida},
}

Decentralized user authentication in a global file system Michael Kaminsky, George Savvides, David Mazières, and M. Frans Kaashoek. SOSP 2003.

The challenge for user authentication in a global file system is allowing people to grant access to specific users and groups in remote administrative domains, without assuming any kind of pre-existing administrative relationship. The traditional approach to user authentication across administrative domains is for users to prove their identities through a chain of certificates. Certificates allow for general forms of delegation, but they often require more infrastructure than is necessary to support a network file system.

This paper introduces an approach without certificates. Local authentication servers pre-fetch and cache remote user and group definitions from remote authentication servers. During a file access, an authentication server can establish identities for users based just on local information. This approach is particularly well-suited to file systems, and it provides a simple and intuitive interface that is similar to those found in local access control mechanisms. An implementation of the authentication server and a file server supporting access control lists demonstrate the viability of this design in the context of the Self-certifying File System (SFS). Experiments demonstrate that the authentication server can scale to groups with tens of thousands of members.

@inproceedings{sfs:sosp03,
  title        = {{D}ecentralized User Authentication in a Global File
    System},
  author       = {Michael Kaminsky and George Savvides and David
    Mazi{\`e}res and M. Frans Kaashoek},
  pages        = {60--73},
  booktitle    = {Proceedings of the 19th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '03)},
  year         = 2003,
  month        = oct,
  address      = {Bolton Landing, New York},
}

A high-throughput path metric for multi-hop wireless routing Douglas S. J. De Couto, Daniel Aguayo, John Bicket, and Robert Morris. MobiCom 2003.

This paper presents the expected transmission count metric (ETX), which finds high-throughput paths on multi-hop wireless networks. ETX minimizes the expected total number of packet transmissions (including retransmissions) required to successfully deliver a packet to the ultimate destination. The ETX metric incorporates the effects of link loss ratios, asymmetry in the loss ratios between the two directions of each link, and interference among the successive links of a path. In contrast, the minimum hop-count metric chooses arbitrarily among the different paths of the same minimum length, regardless of the often large differences in throughput among those paths, and ignoring the possibility that a longer path might offer higher throughput.

This paper describes the design and implementation of ETX as a metric for the DSDV and DSR routing protocols, as well as modifications to DSDV and DSR which allow them to use ETX. Measurements taken from a 29-node 802.11b test-bed demonstrate the poor performance of minimum hop-count, illustrate the causes of that poor performance, and confirm that ETX improves performance. For long paths the throughput improvement is often a factor of two or more, suggesting that ETX will become more useful as networks grow larger and paths become longer.

@inproceedings{grid:mobicom03,
  title        = {A High-Throughput Path Metric for Multi-Hop Wireless
    Routing},
  author       = {Douglas S. J. {De Couto} and Daniel Aguayo and John
    Bicket and Robert Morris},
  booktitle    = {Proceedings of the 9th {ACM} International
    Conference on Mobile Computing and Networking ({MobiCom} '03)},
  year         = 2003,
  month        = sep,
  address      = {San Diego, California},
}

SSL splitting: securely serving data from untrusted caches Chris Lesniewski-Laas, and M. Frans Kaashoek. USENIX Security 2003.

SSL splitting is implemented as a patch to the industry-standard OpenSSL library, with which the server is linked. In experiments replaying two-hour access.log traces taken from LCS Web sites over an ADSL link, SSL splitting reduces bandwidth consumption of the server by between 25% and 90% depending on the warmth of the cache and the redundancy of the trace. Uncached requests forwarded through the proxy exhibit latencies within approximately 5% of those of an unmodified SSL server.

@inproceedings{ssl-splitting:usenixsecurity03,
  title        = {{SSL} splitting: securely serving data from
    untrusted caches},
  author       = {Chris Lesniewski-Laas and M. Frans Kaashoek},
  pages        = {187--199},
  booktitle    = {Proceedings of the 12th {USENIX} {S}ecurity
    {S}ymposium},
  year         = 2003,
  month        = aug,
  address      = {Washington, D.C.},
}

Experience with an evolving overlay network testbed David Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. SIGCOMM 33(3), July 2003.

@article{ron:ccr,
  title        = {Experience with an evolving overlay network testbed},
  author       = {David Andersen and Hari Balakrishnan and M. Frans
    Kaashoek and Robert Morris},
  month        = jul,
  volume       = 33,
  number       = 3,
  pages        = {13--19},
  journal      = {{ACM SIGCOMM} Computer Communication Review},
  year         = 2003,
}

Multiprocessor support for event-driven programs Nickolai Zeldovich, Alexander Yip, Frank Dabek, Robert Morris, David Mazières, and Frans Kaashoek. USENIX 2003.

This paper presents a new asynchronous programming library (libasync-smp) that allows event-driven applications to take advantage of multiprocessors by running code for event handlers in parallel. To control the concurrency between events, the programmer can specify a color for each event: events with the same color (the default case) are handled serially; events with different colors can be handled in parallel. The programmer can incrementally expose parallelism in existing event-driven applications by assigning different colors to computationally-intensive events that do not share mutable state.

An evaluation of libasyncsmp demonstrates that applications achieve multiprocessor speedup with little programming effort. As an example, parallelizing the cryptography in the SFS file server required about 90 lines of changed code in two modules, out of a total of about 12,000 lines. Multiple clients were able to read large cached files from the libasync-smp SFS server running on a 4-CPU machine 2.5 times as fast as from an unmodified uniprocessor SFS server on one CPU. Applications without computationally intensive tasks also benefit: an event-driven Web server achieves 1.5 speedup on four CPUs with multiple clients reading small cached files.

@inproceedings{asyncmp:usenix03,
  title        = {Multiprocessor Support for Event-Driven Programs},
  author       = {Nickolai Zeldovich and Alexander Yip and Frank Dabek
    and Robert Morris and David Mazi{\`e}res and Frans Kaashoek},
  booktitle    = {Proceedings of the 2003 USENIX Annual Technical
    Conference (USENIX '03)},
  year         = 2003,
  month        = jun,
  address      = {San Antonio, Texas},
}

Certifying program execution with secure processors Benjie Chen, and Robert Morris. HotOS 2003.

@inproceedings{cerium:hotos03,
  title        = {{C}ertifying Program Execution with Secure
    Processors},
  author       = {Benjie Chen and Robert Morris},
  booktitle    = {Proceedings of the 9th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS-IX})},
  year         = 2003,
  month        = may,
  address      = {Lihue, Hawaii},
}

Robust and efficient data management for a distributed hash table Josh Cates. Master's thesis, Massachusetts Institute of Technology, May 2003.

@mastersthesis{chord:cates-meng,
  title        = {Robust and Efficient Data Management for a
    Distributed Hash Table},
  author       = {Josh Cates},
  school       = {Massachusetts Institute of Technology},
  year         = 2003,
  month        = may,
}

On the feasibility of peer-to-peer web indexing and search Jinyang Li, Boon Thau Loo, Joseph M. Hellerstein, M. Frans Kaashoek, David Karger, and Robert Morris. IPTPS 2003.

@inproceedings{p2psearch:iptps03,
  title        = {On the feasibility of peer-to-peer web indexing and
    search},
  author       = {Jinyang Li and Boon Thau Loo and Joseph M.
    Hellerstein and M. Frans Kaashoek and David Karger and Robert
    Morris},
  booktitle    = {Proceedings of the 2nd International Workshop on
    Peer-to-Peer Systems (IPTPS03)},
  year         = 2003,
  month        = feb,
  address      = {Berkeley, CA},
}

Towards a common API for structured peer-to-peer overlays Frank Dabek, Ben Zhao, Peter Druschel, John Kubiatowicz, and Ion Stoica. IPTPS 2003.

In this paper, we describe an ongoing effort to define common APIs for structured peer-to-peer overlays and the key abstractions that can be built on them. In doing so, we hope to facilitate independent innovation in overlay protocols, services, and applications, to allow direct experimental comparisons, and to encourage application development by third parties. We provide a snapshot of our efforts and discuss open problems in an effort to solicit feedback from the research community.

@inproceedings{iptps:apis,
  title        = {Towards a Common {API} for Structured Peer-to-Peer
    Overlays},
  author       = {Frank Dabek and Ben Zhao and Peter Druschel and John
    Kubiatowicz and Ion Stoica},
  booktitle    = {Proceedings of the 2nd International Workshop on
    Peer-to-Peer Systems (IPTPS03)},
  year         = 2003,
  month        = feb,
  address      = {Berkeley, CA},
}

Chord: A scalable peer-to-peer lookup service for internet applications Ion Stoica, Robert Morris, David Liben-Nowell, David Karger, M. Frans Kaashoek, Frank Dabek, and Hari Balakrishnan. IEEE Transactions on Networking 11, February 2003.

A fundamental problem that confronts peer-to-peer applications is the efficient location of the node that stores a desired data item. This paper presents Chord, a distributed lookup protocol that addresses this problem. Chord provides support for just one operation: given a key, it maps the key onto a node. Data location can be easily implemented on top of Chord by associating a key with each data item, and storing the key/data pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. Results from theoretical analysis and simulations show that Chord is scalable: communication cost and the state maintained by each node scale logarithmically with the number of Chord nodes.

@article{chord:ton,
  title        = {Chord: A Scalable Peer-to-peer Lookup Service for
    Internet Applications},
  author       = {Ion Stoica and Robert Morris and David Liben-Nowell
    and David Karger and M. Frans Kaashoek and Frank Dabek and Hari
    Balakrishnan},
  journal      = {{IEEE} Transactions on Networking},
  volume       = 11,
  year         = 2003,
  month        = feb,
},
}

Looking up data in P2P systems Hari Balakrishnan, M. Frans Kaashoek, David Karger, Robbert Morris, and Ion Stoica. Communications of the ACM 46, February 2003.

@article{acm:lookup,
  title        = {Looking up data in {P2P} systems},
  author       = {Hari Balakrishnan and M. Frans Kaashoek and David
    Karger and Robbert Morris and Ion Stoica},
  volume       = 46,
  month        = feb,
  year         = 2003,
  pages        = {43--48},
  journal      = {Communications of the ACM},
}

Melody: A distributed music-sharing system James Robertson. Master's thesis, Massachusetts Institute of Technology, February 2003.

@mastersthesis{chord:jsr-meng,
  title        = {Melody: A Distributed Music-Sharing System},
  author       = {James Robertson},
  school       = {Massachusetts Institute of Technology},
  year         = 2003,
  month        = feb,
}

SSL splitting and barnraising: Cooperative caching with authenticity guarantees Chris Lesniewski-Laas. Master's thesis, Massachusetts Institute of Technology, February 2003.

SSL splitting is a cryptographic technique to guarantee that public data served by caching Web proxies is endorsed by the originating server. When a client makes a request, the trusted server generates a stream of authentication records and sends them to the untrusted proxy, which combines them with a stream of data records retrieved from its local cache. The combined stream is relayed to the client, a standard Web browser, which verifies the data's integrity. Since the combined stream simulates a normal Secure Sockets Layer (SSL) connection, SSL splitting works with unmodified browsers; however, since it does not provide confidentiality, it is appropriate for applications that require only authentication. The server must be linked to a patched version of the industry-standard OpenSSL library; no other server modifications are necessary. In experiments replaying two-hour access.log traces taken from LCS Web sites over a DSL link, SSL splitting reduces bandwidth consumption of the server by between 25% and 90% depending on the warmth of the cache and the redundancy of the trace. Uncached requests forwarded through the proxy exhibit latencies within approximately 5% of those of an unmodified SSL server.

@mastersthesis{ssl-splitting:ctl-meng,
  title        = {{SSL} splitting and Barnraising: Cooperative caching
    with authenticity guarantees},
  author       = {Chris Lesniewski-Laas},
  school       = {Massachusetts Institute of Technology},
  year         = 2003,
  month        = feb,
}

REX: Secure, modular remote execution through file descriptor passing Michael Kaminsky, Eric Peterson, Kevin Fu, David Mazières, and M. Frans Kaashoek. MIT LCS technical report, January 2003.

The ubiquitous SSH package has demonstrated the importance of secure remote login and execution. This paper presents a new system, REX, designed to provide remote login and execution in the context of the SFS secure distributed file system. REX departs from traditional remote login design and is built around two main mechanisms---file descriptor passing and a user agent process.

File descriptor passing allows REX to be split into several smaller pieces; privileged code can run as its own process to provide enhanced security guarantees. REX also emulates secure file descriptor passing over network connections, allowing users to build extensions to REX outside of the core REX software.

REX uses and extends SFS's agent mechanism to provide a transparent distributed computing environment to users. The agent stores private keys, server nicknames, and other per-user configuration state; REX makes the SFS agent available to programs that it executes on remote machines.

We have an implementation of REX and demonstrate that its flexibility does not come at the cost of performance. Initial REX connections are comparable to those of SSH in speed, while subsequent connections are much faster because REX exploits the SFS agent to cache connection state to avoid costly public-key operations.

@techreport{sfs:rextr03,
  title        = {{REX}: Secure, modular remote execution through file
    descriptor passing},
  author       = {Michael Kaminsky and Eric Peterson and Kevin Fu and
    David Mazi{\`e}res and M. Frans Kaashoek},
  institution  = {{MIT} Laboratory for Computer Science},
  year         = 2003,
  month        = jan,
  number       = {MIT-LCS-TR-884},
  note         = {http://www.pdos.lcs.mit.edu/papers/sfs:rextr03/},
}

2002

Ivy: A read/write peer-to-peer file system Athicha Muthitacharoen, Robert Morris, Thomer Gil, and Benjie Chen. OSDI 2002.

@inproceedings{ivy:osdi02,
  title        = {Ivy: A Read/Write Peer-to-peer File System},
  author       = {Athicha Muthitacharoen and Robert Morris and Thomer
    Gil and Benjie Chen},
  booktitle    = {Proceedings of the 5th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '02)},
  year         = 2002,
  month        = dec,
  address      = {Boston, Massachusetts},
}

A session-based architecture for Internet mobility Alex C. Snoeren. Ph.D. thesis, Massachusetts Institute of Technology, December 2002.

@phdthesis{snoeren-phd,
  title        = {A Session-Based Architecture for {Internet} Mobility},
  author       = {Alex C. Snoeren},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = dec,
}

Tarzan: A peer-to-peer anonymizing network layer Michael J. Freedman, and Robert Morris. CCS 2002.

Tarzan is a peer-to-peer anonymous IP network overlay. Because it provides IP service, Tarzan is general-purpose and transparent to applications. Organized as a decentralized peer-to-peer overlay, Tarzan is fault-tolerant, highly scalable, and easy to manage.

Tarzan achieves its anonymity with layered encryption and multi-hop routing, much like a Chaumian mix. A message initiator chooses a path of peers pseudo-randomly through a restricted topology in a way that adversaries cannot easily influence. Cover traffic prevents a global observer from using traffic analysis to identify an initiator. Protocols toward unbiased peer-selection offer new directions for distributing trust among untrusted entities.

Tarzan provides anonymity to either clients or servers, without requiring that both participate. In both cases, Tarzan uses a network address translator (NAT) to bridge between Tarzan hosts and oblivious Internet hosts.

Measurements show that Tarzan imposes minimal overhead over a corresponding non-anonymous overlay route.

@inproceedings{tarzan:ccs9,
  title        = {Tarzan: A Peer-to-Peer Anonymizing Network Layer},
  author       = {Michael J. Freedman and Robert Morris},
  booktitle    = {Proceedings of the 9th {ACM} Conference on Computer
    and Communications Security ({CCS-9})},
  year         = 2002,
  month        = nov,
  address      = {Washington, D.C.},
}

Programming language optimizations for modular router configurations Eddie Kohler, Robert Morris, and Benjie Chen. ASPLOS 2002.

@inproceedings{click:asplos02,
  title        = {Programming Language Optimizations for Modular
    Router Configurations},
  author       = {Eddie Kohler and Robert Morris and Benjie Chen},
  booktitle    = {Proceedings of the 10th Conference on Architectural
    Support for Programming Languages and Operating Systems (ASPLOS)},
  month        = oct,
  year         = 2002,
}

Performance of multihop wireless networks: Shortest path is not enough Douglas S. J. De Couto, Daniel Aguayo, Benjamin A. Chambers, and Robert Morris. HotNets 2002.

Existing wireless ad hoc routing protocols typically find routes with the minimum hop-count. This paper presents experimental evidence from two wireless test-beds which shows that there are usually multiple minimum hop-count paths, many of which have poor throughput. As a result, minimum-hop-count routing often chooses routes that have significantly less capacity than the best paths that exist in the network. Much of the reason for this is that many of the radio links between nodes have loss rates low enough that the routing protocol is willing to use them, but high enough that much of the capacity is consumed by retransmissions. These observations suggest that more attention be paid to link quality when choosing ad hoc routes; the paper presents measured link characteristics likely to be useful in devising a better path quality metric.

@inproceedings{grid:hotnets02,
  title        = {Performance of Multihop Wireless Networks: Shortest
    Path is Not Enough},
  author       = {Douglas S. J. {De Couto} and Daniel Aguayo and
    Benjamin A. Chambers and Robert Morris},
  booktitle    = {Proceedings of the First {W}orkshop on {H}ot
    {T}opics in {N}etworks ({HotNets-I})},
  year         = 2002,
  month        = oct,
  organization = {{ACM SIGCOMM}},
  address      = {Princeton, New Jersey},
}

Packrat parsing: Simple, powerful, lazy, linear time Bryan Ford. ICFP 2002.

@inproceedings{packrat-parsing:icfp02,
  author       = {Bryan Ford},
  title        = {Packrat Parsing: Simple, Powerful, Lazy, Linear Time},
  booktitle    = {Proceedings of the 2002 International Conference on
    Functional Programming},
  year         = 2002,
  month        = oct,
}

Event-driven programming for robust software Frank Dabek, Nickolai Zeldovich, Frans Kaashoek, David Mazières, and Robert Morris. SIGOPS European Workshop 2002.

Events are a better means of managing I/O concurrency in server software than threads: events help avoid bugs caused by the unnecessary CPU concurrency introduced by threads. Event-based programs also tend to have more stable performance under heavy load than threaded programs. We argue that our libasync-smp non-blocking I/O library makes event-based programming convenient and evaluate extensions to the library that allow event-based programs to take advantage of multi-processors. We conclude that events provide all the benefits of threads, with substantially less complexity; the result is more robust software.

@inproceedings{events:sigops,
  title        = {Event-Driven Programming for Robust Software},
  author       = {Frank Dabek and Nickolai Zeldovich and Frans
    Kaashoek and David Mazi{\`e}res and Robert Morris},
  booktitle    = {Proceedings of the 2002 SIGOPS European Workshop},
  year         = 2002,
  month        = sep,
  address      = {Saint-Emilion, France},
}

Span: An energy-efficient coordination algorithm for topology maintenance in ad hoc wireless networks Benjie Chen, Kyle Jamieson, Hari Balakrishnan, and Robert Morris. ACM Wireless Networks 8(5), September 2002.

This paper presents Span, a power saving technique for multi-hop ad hoc wireless networks that reduces energy consumption without significantly diminishing the capacity or connectivity of the network. Span builds on the observation that when a region of a shared-channel wireless network has a sufficient density of nodes, only a small number of them need be on at any time to forward traffic for active connections.

Span is a distributed, randomized algorithm where nodes make local decisions on whether to sleep, or to join a forwarding backbone as a coordinator. Each node bases its decision on an estimate of how many of its neighbors will benefit from it being awake, and the amount of energy available to it. We give a randomized algorithm where coordinators rotate with time, demonstrating how localized node decisions lead to a connected, capacity-preserving global topology.

Improvement in system lifetime due to Span increases as the ratio of idle-to-sleep energy consumption increases. Our simulations show that with a practical energy model, system lifetime of an 802.11 network in power saving mode with Span is a factor of two better than without. Additionally, Span also improves communication latency and capacity.

@article{span:wireless01,
  title        = {Span: An Energy-Efficient Coordination Algorithm for
    Topology Maintenance in Ad Hoc Wireless Networks},
  author       = {Benjie Chen and Kyle Jamieson and Hari Balakrishnan
    and Robert Morris},
  volume       = 8,
  number       = 5,
  year         = 2002,
  month        = sep,
},
  journal      = {{ACM} Wireless Networks},
}

Packrat parsing: a practical linear-time algorithm with backtracking Bryan Ford. Master's thesis, Massachusetts Institute of Technology, September 2002.

@mastersthesis{packrat-parsing:ford-ms,
  author       = {Bryan Ford},
  title        = {Packrat Parsing: a practical linear-time algorithm
    with backtracking},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = sep,
}

Choosing Internet paths with high bulk transfer capacity Jacob Strauss. Master's thesis, Massachusetts Institute of Technology, Sep 2002.

@mastersthesis{chord:jastr-meng,
  title        = {Choosing {Internet} Paths with High Bulk Transfer
    Capacity},
  author       = {Jacob Strauss},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = sep,
}

Network layer support for overlay networks John Jannotti. Ph.D. thesis, Massachusetts Institute of Technology, August 2002.

@phdthesis{jannotti-phd,
  title        = {Network Layer Support for Overlay Networks},
  author       = {John Jannotti},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = aug,
}

Natron: Overlay routing to oblivious destinations Alexander Yip. Master's thesis, Massachusetts Institute of Technology, Aug 2002.

@mastersthesis{natron:yip-meng,
  title        = {NATRON: Overlay Routing to Oblivious Destinations},
  author       = {Alexander Yip},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = aug,
}

Access control lists for the self-certifying filesystem George Savvides. Master's thesis, Massachusetts Institute of Technology, Aug 2002.

@mastersthesis{sfs:savvides-meng,
  title        = {Access Control Lists for the Self-Certifying
    Filesystem},
  author       = {George Savvides},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = aug,
}

Self-certifying filesystem implementation for Windows David Euresti. Master's thesis, Massachusetts Institute of Technology, Aug 2002.

@mastersthesis{sfs:euresti-meng,
  title        = {Self-Certifying Filesystem Implementation for
    {Windows}},
  author       = {David Euresti},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = aug,
}

Simple and general statistical profiling with PCT Charles Blake, and Steve Bauer. USENIX 2002.

The Profile Collection Toolkit (PCT) provides a novel generalized CPU profiling facility. PCT enables arbitrarily late profiling activation and arbitrarily early report generation. PCT usually requires no re-compilation, re-linking, or even re-starting of programs. Profiling reports gracefully degrade with available debugging data.

PCT uses its debugger controller, dbctl, to drive a debugger's control over a process. dbctl has a configuration language that allows users to specify context-specific debugger commands. These commands can sample general program state, such as call stacks and function parameters.

For systems or situations with poor debugger support, PCT provides several other portable and flexible collection methods. PCT can track most program code, including code in shared libraries and late-loaded shared objects. On Linux, PCT can seamlessly merge kernel CPU time profiles with user-level CPU profiles to create whole system reports.

@inproceedings{pct:usenix02,
  title        = {Simple and General Statistical Profiling with {PCT}},
  author       = {Charles Blake and Steve Bauer},
  pages        = {333--346},
  booktitle    = {Proceedings of the 2002 USENIX Annual Technical
    Conference (USENIX '02)},
  year         = 2002,
  month        = jun,
  address      = {Monterey, California},
}

A keyword set search system for peer-to-peer networks Omprakash D Gnawali. Master's thesis, Massachusetts Institute of Technology, Jun 2002.

@mastersthesis{chord:om_p-meng,
  title        = {A Keyword Set Search System for Peer-to-Peer
    Networks},
  author       = {Omprakash D Gnawali},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = jun,
}

Concurrency control for multi-processor event-driven systems Nickolai Zeldovich. Master's thesis, Massachusetts Institute of Technology, May 2002.

@mastersthesis{sfs:zeldovich-meng,
  title        = {Concurrency Control for Multi-Processor Event-Driven
    Systems},
  author       = {Nickolai Zeldovich},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = may,
}

The Grid Roofnet: a rooftop ad hoc wireless network Benjamin A. Chambers. Master's thesis, Massachusetts Institute of Technology, May 2002.

@mastersthesis{grid:bac-meng,
  title        = {The {Grid} {Roofnet}: a Rooftop Ad Hoc Wireless
    Network},
  author       = {Benjamin A. Chambers},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = may,
}

A peer-to-peer anonymizing network layer Michael J. Freedman. Master's thesis, Massachusetts Institute of Technology, May 2002.

Existing Internet systems implement anonymity at the application layer or through centralized components. A robust, decentralized infrastructure that anonymizes any Internet traffic could benefit a wide array of existing protocols and systems. This anonymous network layer could seamlessly replace the current communications channel, and it could continue to offer anonymity and availability even while components fail maliciously.

This thesis proposes Tarzan, a peer-to-peer anonymous IP network overlay. Because it provides IP service, Tarzan is general-purpose and transparent to applications. Organized as a decentralized peer-to-peer overlay, Tarzan is fault-tolerant, highly scalable, and easy to manage.

Tarzan achieves its anonymity with layered encryption and multi-hop routing, much like a Chaumian mix. A message initiator chooses a path of peers pseudo-randomly in a way that adversaries cannot easily influence. Cover traffic prevents a global observer from drawing conclusions based on traffic analysis as to an initiator's identity.

Tarzan provides anonymity to either clients or servers, without requiring that both participate, presenting the abstraction of a one-way anonymous tunnel. In both cases, Tarzan uses a network address translator (NAT) to bridge between Tarzan hosts and oblivious Internet hosts.

We quantify Tarzan's anonymity properties and show that Tarzan imposes minimal performance overhead over a corresponding non-anonymous overlay route.

@mastersthesis{tarzan:freedman-meng,
  title        = {A Peer-to-Peer Anonymizing Network Layer},
  author       = {Michael J. Freedman},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = may,
}

Herodotus: A peer-to-peer web archival system Timo Burkard. Master's thesis, Massachusetts Institute of Technology, May 2002.

@mastersthesis{chord:tburkard-meng,
  title        = {Herodotus: A Peer-to-Peer Web Archival System},
  author       = {Timo Burkard},
  school       = {Massachusetts Institute of Technology},
  year         = 2002,
  month        = may,
}

Efficient peer-to-peer lookup based on a distributed trie Michael J. Freedman, and Radek Vingralek. IPTPS 2002.

Two main approaches have been taken for distributed key/value lookup operations in peer-to-peer systems: broadcast searches and location-deterministic algorithms. We describe a third alternative based on a distributed trie. This algorithm functions well in a very dynamic, hostile environment, offering security benefits over prior proposals. Our approach takes advantage of working-set temporal locality and global key/value distribution skews due to content popularity. Peers gradually learn system state during lookups, receiving the sought values and/or internal information used by the trie. The distributed trie converges to an accurate network map over time. We describe several modes of information piggybacking, and conservative and liberal variances of the basic algorithm for adversarial settings. Simulations show efficient lookups and low failure rates.

@inproceedings{trie:iptps02,
  title        = {Efficient Peer-To-Peer Lookup Based on a Distributed
    Trie},
  author       = {Michael J. Freedman and Radek Vingralek},
  booktitle    = {Proceedings of the 1st International Workshop on
    Peer-to-Peer Systems (IPTPS)},
  year         = 2002,
  month        = mar,
  address      = {Cambridge, MA},
}

Serving DNS using Chord Russ Cox, Athicha Muthitacharoen, and Robert Morris. IPTPS 2002.

The current domain name system (DNS) couples ownership of domains with the responsibility of serving data for them. The DNS security extensions (DNSSEC) allow verificaton of records obtained by alternate means, opening exploration of alternative storage systems for DNS records. We explore one such alternative using DHash, a peer-to-peer distributed hash table built on top of Chord. Our system inherits Chord's fault-tolerance and load balance properties, at the same time eliminating many administrative problems with the current DNS. Still, our system has significantly higher latencies and other disadvantages in comparison with conventional DNS. We use this comparison to draw conclusions about general issues that still need to be addressed in peer-to-peer systems and distributed hash tables in particular.

@inproceedings{chord:dns02,
  title        = {Serving {DNS} using {Chord}},
  author       = {Russ Cox and Athicha Muthitacharoen and Robert
    Morris},
  booktitle    = {Proceedings of the 1st International Workshop on
    Peer-to-Peer Systems (IPTPS)},
  year         = 2002,
  month        = mar,
  address      = {Cambridge, MA},
}

Security considerations for peer-to-peer distributed hash tables Emil Sit, and Robert Morris. IPTPS 2002.

Recent peer-to-peer research has focused on providing efficient hash lookup systems that can be used to build more complex systems. These systems have good properties when their algorithms are executed correctly but have not generally considered how to handle misbehaving nodes. This paper looks at what sorts of security problems are inherent in large peer-to-peer systems based on distributed hash lookup systems. We examine the types of problems that such systems might face, drawing examples from existing systems, and propose some design principles for detecting and preventing these problems.

@inproceedings{chord:security02,
  title        = {Security Considerations for Peer-to-Peer Distributed
    Hash Tables},
  author       = {Emil Sit and Robert Morris},
  booktitle    = {Proceedings of the 1st International Workshop on
    Peer-to-Peer Systems (IPTPS)},
  year         = 2002,
  month        = mar,
  address      = {Cambridge, MA},
}

Effects of loss rate on ad hoc wireless routing Douglas S. J. De Couto, Daniel Aguayo, Benjamin A. Chambers, and Robert Morris. MIT LCS technical report, March 2002.

This paper uses measurements from two deployed wireless ad hoc networks to illustrate the effects of link loss rates on routing protocol performance. Measurements of these networks show that the radio links between the majority of nodes have substantial loss rates. These loss rates are high enough to decrease forwarding performance, but not high enough to prevent existing ad hoc routing protocols from using the links. Link-level retransmission can mask high loss rates, at the cost of substantial decreases in throughput. Simulations, driven by the observed loss rates, show that the shortest paths chosen by existing routing protocols tend to find routes with much less capacity than is available along the best route.

Based on these observations, we present a routing metric intended to allow routing protocols to find good routes in wireless ad hoc networks. The metric is the expected total number of transmissions required to deliver a packet along a route. This metric favors routes with high throughput and low total impact on spectrum. It is expected to perform better than existing techniques that eliminate links based on loss rate thresholds.

@techreport{grid:losstr01,
  title        = {Effects of Loss Rate on Ad Hoc Wireless Routing},
  author       = {Douglas S. J. {De Couto} and Daniel Aguayo and
    Benjamin A. Chambers and Robert Morris},
  institution  = {{MIT} Laboratory for Computer Science},
  year         = 2002,
  month        = mar,
  number       = {MIT-LCS-TR-836},
}

Introducing Tarzan, a peer-to-peer anonymizing network layer Michael J. Freedman, Emil Sit, Josh Cates, and Robert Morris. IPTPS 2002.

We introduce Tarzan, a peer-to-peer anonymous network layer that provides generic IP forwarding. Unlike prior anonymizing layers, Tarzan is flexible, transparent, decentralized, and highly scalable.

Tarzan achieves these properties by building anonymous IP tunnels between an open-ended set of peers. Tarzan can provide anonymity to existing applications, such as web browsing and file sharing, without change to those applications. Performance tests show that Tarzan imposes minimal overhead over a corresponding non-anonymous overlay route.

@inproceedings{tarzan:iptps02,
  title        = {Introducing {Tarzan}, a Peer-to-Peer Anonymizing
    Network Layer},
  author       = {Michael J. Freedman and Emil Sit and Josh Cates and
    Robert Morris},
  booktitle    = {Proceedings of the 1st International Workshop on
    Peer-to-Peer Systems (IPTPS)},
  year         = 2002,
  month        = mar,
  address      = {Cambridge, MA},
}

Fast and secure distributed read-only file system Kevin Fu, M. Frans Kaashoek, and David Mazières. ACM Transactions on Computer Systems 20(1), February 2002.

@article{sfsro:tocs2002,
  title        = {{F}ast and secure distributed read-only file system},
  author       = {Kevin Fu and M. Frans Kaashoek and David
    Mazi{\`e}res},
  volume       = 20,
  number       = 1,
  year         = 2002,
  month        = feb,
  pages        = {1--24},
  journal      = {{ACM} Transactions on Computer Systems},
}

Fast and flexible application-level networking on exokernel systems Gregory R. Ganger, Dawson R. Engler, M. Frans Kaashoek, Hector M. Briceno, Russell Hunt, and Thomas Pinckney. ACM Transactions on Computer Systems 20(1), February 2002.

Application-level networking is a promising software organization for improving performance and functionality for important network services. The Xok/ExOS exokernel system includes application-level support for standard network services, while at the same time allowing application writers to specialize networking services. This paper describes how Xok/ExOS’s kernel mechanisms and library operating system organization achieve this flexibility, and retrospectively shares our experiences and lessons learned (both positive and negative). It also describes how we used this flexibility to build and specialize three network data services: the Cheetah HTTP server, the webswamp Web benchmarking tool, and an application-level TCP forwarder. Overall measurements show large performance improvements relative to similar services built on conventional interfaces, in each case reaching the maximum possible end-to-end performance for the experimental platform. For example, Cheetah provides factor of 2-4 increases in throughput compared to highly tuned socket-based implementations and factor of 3-8 increases compared to conventional systems. Webswamp can offer loads that are two to eight times heavier. The TCP forwarder provides 50-300% higher throughput while also providing end-to-end TCP semantics that cannot be achieved with POSIX sockets. With more detailed measurements and profiling, these overall performance improvements are also broken down and attributed to the specific specializations described, providing server writers with insights into where to focus their optimization efforts.

@article{exo:tocs2002,
  title        = {{F}ast and flexible Application-Level Networking on
    Exokernel Systems},
  author       = {Gregory R. Ganger and Dawson R. Engler and M. Frans
    Kaashoek and Hector M. Briceno and Russell Hunt and Thomas
    Pinckney},
  volume       = 20,
  number       = 1,
  year         = 2002,
  month        = feb,
  pages        = {49--83},
  journal      = {{ACM} Transactions on Computer Systems},
}

DNS performance and the effectiveness of caching Jaeyeon Jung, Emil Sit, Hari Balakrishnan, and Robert Morris. IEEE Transactions on Networking 10(5), 2002.

@article{dns:ton,
  author       = {Jaeyeon Jung and Emil Sit and Hari Balakrishnan and
    Robert Morris},
  title        = {{DNS} performance and the effectiveness of caching},
  volume       = 10,
  number       = 5,
  year         = 2002,
  pages        = {589--603},
  publisher    = {IEEE Press},
  address      = {Piscataway, NJ, USA},
  journal      = {{IEEE} Transactions on Networking},
}

2001

DNS performance and the effectiveness of caching Jaeyeon Jung, Emil Sit, Hari Balakrishnan, and Robert Morris. SIGCOMM 2001.

@inproceedings{dnscache:sigcommimw01,
  title        = {{DNS} Performance and the Effectiveness of Caching},
  author       = {Jaeyeon Jung and Emil Sit and Hari Balakrishnan and
    Robert Morris},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} Internet
    Measurement Workshop '01},
  year         = 2001,
  month        = nov,
  address      = {San Francisco, California},
}

Wide-area cooperative storage with CFS Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. SOSP 2001.

The Cooperative File System (CFS) is a new peer-to-peer read-only storage system that provides provable guarantees for the efficiency, robustness, and load-balance of file storage and retrieval. CFS does this with a completely decentralized architecture that can scale to large systems. CFS servers provide a distributed hash table (DHash) for block storage. CFS clients interpret DHash blocks as a file system. DHash distributes and caches blocks at a fine granularity to achieve load balance, uses replication for robustness, and decreases latency with server selection. DHash finds blocks using the Chord location protocol, which operates in time logarithmic in the number of servers.

CFS is implemented using the SFS file system toolkit and runs on Linux, OpenBSD, and FreeBSD. Experience on a globally deployed prototype shows that CFS delivers data to clients as fast as FTP. Controlled tests show that CFS is scalable: with 4,096 servers, looking up a block of data involves contacting only seven servers. The tests also demonstrate nearly perfect robustness and unimpaired performance even when as many as half the servers fail.

@inproceedings{cfs:sosp01,
  title        = {Wide-area cooperative storage with {CFS}},
  author       = {Frank Dabek and M. Frans Kaashoek and David Karger
    and Robert Morris and Ion Stoica},
},
  booktitle    = {Proceedings of the 18th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '01)},
  year         = 2001,
  month        = oct,
  address      = {Chateau Lake Louise, Banff, Canada},
}

Resilient overlay networks David Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. SOSP 2001.

@inproceedings{ron:sosp01,
  title        = {Resilient Overlay Networks},
  author       = {David Andersen and Hari Balakrishnan and M. Frans
    Kaashoek and Robert Morris},
},
  booktitle    = {Proceedings of the 18th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '01)},
  year         = 2001,
  month        = oct,
  address      = {Chateau Lake Louise, Banff, Canada},
}

A low-bandwidth network file system Athicha Muthitacharoen, Benjie Chen, and David Mazières. SOSP 2001.

Users rarely consider running network file systems over slow or wide-area networks, as the performance would be unacceptable and the bandwidth consumption too high. Nonetheless, efficient remote file access would often be desirable over such networks---particularly when high latency makes remote login sessions unresponsive. Rather than run interactive programs such as editors remotely, users could run the programs locally and manipulate remote files through the file system. To do so, however, would require a network file system that consumes less bandwidth than most current file systems.

This paper presents LBFS, a network file system designed for low-bandwidth networks. LBFS exploits similarities between files or versions of the same file to save bandwidth. It avoids sending data over the network when the same data can already be found in the server's file system or the client's cache. Using this technique in conjunction with conventional compression and caching, LBFS consumes over an order of magnitude less bandwidth than traditional network file systems on common workloads.

@inproceedings{lbfs:sosp01,
  title        = {A Low-bandwidth Network File System},
  author       = {Athicha Muthitacharoen and Benjie Chen and David
    Mazi{\`e}res},
  pages        = {174--187},
  booktitle    = {Proceedings of the 18th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '01)},
  year         = 2001,
  month        = oct,
  address      = {Chateau Lake Louise, Banff, Canada},
}

A cooperative file system Frank Dabek. Master's thesis, Massachusetts Institute of Technology, September 2001.

@mastersthesis{cfs:dabek-meng,
  title        = {A Cooperative File System},
  author       = {Frank Dabek},
  school       = {Massachusetts Institute of Technology},
  year         = 2001,
  month        = sep,
}

A case study of server selection Tina Tyan. Master's thesis, Massachusetts Institute of Technology, September 2001.

@mastersthesis{chord:tyan-meng,
  title        = {A Case Study of Server Selection},
  author       = {Tina Tyan},
  school       = {Massachusetts Institute of Technology},
  year         = 2001,
  month        = sep,
}

Chord: A scalable peer-to-peer lookup service for Internet applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. SIGCOMM 2001.

A fundamental problem that confronts peer-to-peer applications is to efficiently locate the node that stores a particular data item. This paper presents Chord, a distributed lookup protocol that addresses this problem. Chord provides support for just one operation: given a key, it maps the key onto a node. Data location can be easily implemented on top of Chord by associating a key with each data item, and storing the key/data item pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. Results from theoretical analysis, simulations, and experiments show that Chord is scalable, with communication cost and the state maintained by each node scaling logarithmically with the number of Chord nodes.

@inproceedings{chord:sigcomm01,
  title        = {Chord: A Scalable Peer-to-peer Lookup Service for
    {Internet} Applications},
  author       = {Ion Stoica and Robert Morris and David Karger and M.
    Frans Kaashoek and Hari Balakrishnan},
},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} '01 Conference},
  year         = 2001,
  month        = aug,
  address      = {San Diego, California},
}

Dos and don'ts of client authentication on the web Kevin Fu, Emil Sit, Kendra Smith, and Nick Feamster. USENIX Security 2001.

Client authentication has been a continuous source of problems on the Web. Although many well-studied techniques exist for authentication, Web sites continue to use extremely weak authentication schemes, especially in non-enterprise environments such as store fronts. These weaknesses often result from careless use of authenticators within Web cookies. Of the twenty-seven sites we investigated, we weakened the client authentication on two systems, gained unauthorized access on eight, and extracted the secret key used to mint authenticators from one.

We provide a description of the limitations, requirements, and security models specific to Web client authentication. This includes the introduction of the interrogative adversary, a surprisingly powerful adversary that can adaptively query a Web site.

We propose a set of hints for designing a secure client authentication scheme. Using these hints, we present the design and analysis of a simple authentication scheme secure against forgeries by the interrogative adversary. In conjunction with SSL, our scheme is secure against forgeries by the active adversary.

@inproceedings{webauth:sec10,
  title        = {{D}os and Don'ts of Client Authentication on the Web},
  author       = {Kevin Fu and Emil Sit and Kendra Smith and Nick
    Feamster},
  note         = {An extended version is available as MIT-LCS-TR-818},
  booktitle    = {Proceedings of the 10th {USENIX} {S}ecurity
    {S}ymposium},
  year         = 2001,
  month        = aug,
  address      = {Washington, D.C.},
}

Span: An energy-efficient coordination algorithm for topology maintenance in ad hoc wireless networks Benjie Chen, Kyle Jamieson, Hari Balakrishnan, and Robert Morris. MobiCom 2001.

Improvement in system lifetime due to Span increases as the ratio of idle-to-sleep energy consumption increases, and increases as the density of the network increases. For example, our simulations show that with a practical energy model, system lifetime of an 802.11 network in power saving mode with Span is a factor of two better than without. Span integrates nicely with 802.11---when run in conjunction with the 802.11 power saving mode, Span improves communication latency, capacity, and system lifetime.

@inproceedings{span:mobicom01,
  title        = {Span: An Energy-Efficient Coordination Algorithm for
    Topology Maintenance in Ad Hoc Wireless Networks},
  author       = {Benjie Chen and Kyle Jamieson and Hari Balakrishnan
    and Robert Morris},
  pages        = {85--96},
  booktitle    = {Proceedings of the 7th {ACM} International
    Conference on Mobile Computing and Networking ({MobiCom} '01)},
  year         = 2001,
  month        = jul,
  address      = {Rome, Italy},
}

Capacity of ad hoc wireless networks Jinyang Li, Charles Blake, Douglas S. J. De Couto, Hu Imm Lee, and Robert Morris. MobiCom 2001.

Early simulation experience with wireless ad hoc networks suggests that their capacity can be surprisingly low, due to the requirement that nodes forward each others' packets. The achievable capacity depends on network size, traffic patterns, and detailed local radio interactions. This paper examines these factors alone and in combination, using simulation and analysis from first principles. Our results include both specific constants and general scaling relationships helpful in understanding the limitations of wireless ad hoc networks.

We examine interactions of the 802.11 MAC and ad hoc forwarding and the effect on capacity for several simple configurations and traffic patterns. While 802.11 discovers reasonably good schedules, we nonetheless observe capacities markedly less than optimal for very simple chain and lattice networks with very regular traffic patterns. We validate some simulation results with experiments.

We also show that the traffic pattern determines whether an ad hoc network's per node capacity will scale to large networks. In particular, we show that for total capacity to scale up with network size the average distance between source and destination nodes must remain small as the network grows. Non-local traffic patterns in which this average distance grows with the network size result in a rapid decrease of per node capacity. Thus the question ``Are large ad hoc networks feasible?'' reduces to a question about the likely locality of communication in such networks.

@inproceedings{grid:mobicom01,
  title        = {Capacity of Ad Hoc Wireless Networks},
  author       = {Jinyang Li and Charles Blake and Douglas S. J. {De
    Couto} and Hu Imm Lee and Robert Morris},
  pages        = {61--69},
  booktitle    = {Proceedings of the 7th {ACM} International
    Conference on Mobile Computing and Networking ({MobiCom} '01)},
  year         = 2001,
  month        = jul,
  address      = {Rome, Italy},
}

Location proxies and intermediate node forwarding for practical geographic forwarding Douglas S. J. De Couto, and Robert Morris. MIT LCS technical report, June 2001.

Two main problems prevent the deployment of geographic forwarding in real systems: geographic forwarding requires that all nodes know their locations, and it has trouble routing around local dead ends. This paper presents practical solutions to each problem.

The location proxy technique allows a node that does not know its location to find a nearby location aware node to use as a proxy for geographic forwarding. The technique works well over a large range of densities of location aware nodes, and allows a tradeoff between bandwidth used for routing information and expense of providing location information.

The intermediate node forwarding (INF) mechanism is a probabilistic solution for routing around bad geographic topologies via intermediate geographic locations. Existing solutions unrealistically assume that nodes have identical radio propagation; INF works on a restricted set of situations but makes assumptions that better match reality.

Experiments using the ns simulator show that location proxies and INF are effective enough to make geographic forwarding practical. We believe geographic forwarding will enable scalable ad hoc networking.

@techreport{grid:proxytr01,
  title        = {Location Proxies and Intermediate Node Forwarding
    for Practical Geographic Forwarding},
  author       = {Douglas S. J. {De Couto} and Robert Morris},
  institution  = {{MIT} Laboratory for Computer Science},
  year         = 2001,
  month        = jun,
  number       = {MIT-LCS-TR-824},
}

Flexible control of parallelism in a multiprocessor PC router Benjie Chen, and Robert Morris. USENIX 2001.

SMP Click is a software router that provides both flexibility and high performance on stock multiprocessor PC hardware. It achieves high performance using device, buffer, and queue management techniques optimized for multiprocessor routing. It allows vendors or network administrators to configure the router in a way that indicates parallelizable packet processing tasks, and adaptively load-balances those tasks across the available CPUs.

SMP Click's absolute performance is high: it can forward 494,000 64-byte IP packets per second on a 2-CPU 500 MHz Intel Xeon machine, compared to 302,000 packets per second for uniprocessor Click. SMP Click also scales well for CPU intensive tasks: 4-CPU SMP Click can encrypt and forward 87,000 64-byte packets per second using IPSec 3DES, compared to 23,000 packets per second for uniprocessor Click.

@inproceedings{click:usenix01,
  title        = {Flexible Control of Parallelism in a Multiprocessor
    {PC} Router},
  author       = {Benjie Chen and Robert Morris},
  pages        = {333--346},
  booktitle    = {Proceedings of the 2001 USENIX Annual Technical
    Conference (USENIX '01)},
  year         = 2001,
  month        = jun,
  address      = {Boston, Massachusetts},
}

Building peer-to-peer systems with Chord, a distributed lookup service Frank Dabek, Emma Brunskill, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica, and Hari Balakrishnan. HotOS 2001.

We argue that the core problem facing peer-to-peer systems is locating documents in a decentralized network and propose Chord, a distributed lookup primitive. Chord provides an efficient method of locating documents while placing few constraints on the applications that use it. As proof that Chord's functionality is useful in the development of peer-to-peer applications, we outline the implementation of a peer-to-peer file sharing system based on Chord.

@inproceedings{chord:hotos,
  title        = {Building Peer-to-Peer Systems With {Chord}, a
    Distributed Lookup Service},
  author       = {Frank Dabek and Emma Brunskill and M. Frans Kaashoek
    and David Karger and Robert Morris and Ion Stoica and Hari
    Balakrishnan},
},
  booktitle    = {Proceedings of the 8th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS-VIII})},
  year         = 2001,
  month        = may,
  organization = {{IEEE} {C}omputer {S}ociety},
  address      = {Schloss Elmau, Germany},
}

Resilient overlay networks David Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. HotOS 2001.

@inproceedings{ron:hotos8,
  title        = {Resilient Overlay Networks},
  author       = {David Andersen and Hari Balakrishnan and M. Frans
    Kaashoek and Robert Morris},
  booktitle    = {Proceedings of the 8th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS-VIII})},
  year         = 2001,
  month        = may,
  organization = {{IEEE} {C}omputer {S}ociety},
  address      = {Schloss Elmau, Germany},
}

Dos and don'ts of client authentication on the web Kevin Fu, Emil Sit, Kendra Smith, and Nick Feamster. MIT LCS technical report, May 2001.

@techreport{webauth:tr,
  title        = {{D}os and Don'ts of Client Authentication on the Web},
  author       = {Kevin Fu and Emil Sit and Kendra Smith and Nick
    Feamster},
  institution  = {{MIT} Laboratory for Computer Science},
  year         = 2001,
  month        = may,
  number       = {MIT-LCS-TR-818},
}

Reconsidering internet mobility Alex C. Snoeren, Hari Balakrishnan, and M. Frans Kaashoek. HotOS 2001.

@inproceedings{migrate:hotos8,
  title        = {Reconsidering Internet Mobility},
  author       = {Alex C. Snoeren and Hari Balakrishnan and M. Frans
    Kaashoek},
  booktitle    = {Proceedings of the 8th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS-VIII})},
  year         = 2001,
  month        = may,
  organization = {{IEEE} {C}omputer {S}ociety},
  address      = {Schloss Elmau, Germany},
}

2000

Modular components for network address translation Eddie Kohler, Robert Morris, and Massimiliano Poletto. MIT LCS Click Project technical report, December 2000.

We present a general-purpose toolkit for network address translation in a modular, component-based networking system. Network address translation is a powerful, general technique for building network applications, such as allowing disparate address realms to communicate, load balancing among servers, and changing ordinary proxies into transparent proxies. The components of our toolkit can be combined in a variety of ways to implement these applications and others. The context of this work, the Click modular networking system, makes the NAT components simpler and more understandable. For example, the NAT components concern themselves solely with address translation; related functions, such as classification, are implemented by separate components. This design is more flexible than most existing NAT implementations. The user can choose where network address translation takes place in relation to other router functions, and can use multiple translators in a single configuration. These components have been in use in a production environment for several months.

We describe our design approach, demonstrate its flexibility by presenting a range of examples of its use, and evaluate its performance.

@techreport{click:rewritertr,
  title        = {Modular components for network address translation},
  author       = {Eddie Kohler and Robert Morris and Massimiliano
    Poletto},
  institution  = {{MIT} Laboratory for Computer Science Click Project},
  year         = 2000,
  month        = dec,
  note         = {http://www.pdos.lcs.mit.edu/papers/click-rewriter/},
}

The Click modular router Eddie Kohler. Ph.D. thesis, Massachusetts Institute of Technology, November 2000.

@phdthesis{click:kohler-phd,
  title        = {The {Click} modular router},
  author       = {Eddie Kohler},
  school       = {Massachusetts Institute of Technology},
  year         = 2000,
  month        = nov,
}

Fast and secure distributed read-only file system Kevin Fu, M. Frans Kaashoek, and David Mazières. OSDI 2000.

Internet users increasingly rely on publicly available data for everything from software installation to investment decisions. Unfortunately, the vast majority of public content on the Internet comes with no integrity or authenticity guarantees. This paper presents the self-certifying read-only file system, a content distribution system providing secure, scalable access to public, read-only data.

The read-only file system makes the security of published content independent from that of the distribution infrastructure. In a secure area (perhaps off-line), a publisher creates a digitally-signed database out of a file system's contents. The publisher then replicates the database on untrusted content-distribution servers, allowing for high availability. The read-only file system protocol furthermore pushes the cryptographic cost of content verification entirely onto clients, allowing servers to scale to a large number of clients. Measurements of an implementation show that an individual server running on a 550 Mhz Pentium III with FreeBSD can support 1,012 connections per second and 300 concurrent clients compiling a large software package.

@inproceedings{sfsro:osdi2000,
  title        = {{F}ast and secure distributed read-only file system},
  author       = {Kevin Fu and M. Frans Kaashoek and David
    Mazi{\`e}res},
  pages        = {181--196},
  booktitle    = {Proceedings of the 4th {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} 2000)},
  year         = 2000,
  month        = oct,
  address      = {San Diego, California},
}

CarNet: A scalable ad hoc wireless network system Robert Morris, John Jannotti, Frans Kaashoek, Jinyang Li, and Douglas S. J. De Couto. SIGOPS European Workshop 2000.

CarNet is an application for a large ad hoc mobile network system that scales well without requiring a fixed network infrastructure to route messages. CarNet places radio nodes in cars, which communicate using Grid, a novel scalable routing system. Grid uses geographic forwarding and a scalable distributed location service to route packets from car to car without flooding the network. CarNet will support IP connectivity as well as applications such as cooperative highway congestion monitoring, fleet tracking, and discovery of nearby points of interest.

@inproceedings{grid:sigops-euro9,
  title        = {{C}ar{N}et: A Scalable Ad Hoc Wireless Network
    System},
  author       = {Robert Morris and John Jannotti and Frans Kaashoek
    and Jinyang Li and Douglas S. J. {De Couto}},
},
  note         = {The published version incorrectly lists Douglas De
    Couto's name},
  booktitle    = {Proceedings of the 9th {ACM} {SIGOPS} {E}uropean
    workshop: Beyond the {PC}: New Challenges for the Operating System},
  year         = 2000,
  month        = sep,
  address      = {Kolding, Denmark},
}

A scalable location service for geographic ad hoc routing Jinyang Li, John Jannotti, Douglas S. J. De Couto, David R. Karger, and Robert Morris. MobiCom 2000.

GLS is a new distributed location service which tracks mobile node locations. GLS combined with geographic forwarding allows the construction of ad hoc mobile networks that scale to a larger number of nodes than possible with previous work. GLS is decentralized and runs on the mobile nodes themselves, requiring no fixed infrastructure. Each mobile node periodically updates a small set of other nodes (its location servers) with its current location. A node sends its position updates to its location servers without knowing their actual identities, assisted by a predefined ordering of node identifiers and a predefined geographic hierarchy. Queries for a mobile node's location also use the predefined identifier ordering and spatial hierarchy to find a location server for that node.

Experiments using the ns simulator for up to 600 mobile nodes show that the storage and bandwidth requirements of GLS grow slowly with the size of the network. Furthermore, GLS tolerates node failures well: each failure has only a limited effect and query performance degrades gracefully as nodes fail and restart. The query performance of GLS is also relatively insensitive to node speeds. Simple geographic forwarding combined with GLS compares favorably with Dynamic Source Routing (DSR): in larger networks (over 200 nodes) our approach delivers more packets, but consumes fewer network resources.

@inproceedings{grid:mobicom00,
  title        = {A Scalable Location Service for Geographic Ad Hoc
    Routing},
  author       = {Jinyang Li and John Jannotti and Douglas S. J. {De
    Couto} and David R. Karger and Robert Morris},
  pages        = {120--130},
  booktitle    = {Proceedings of the 6th {ACM} International
    Conference on Mobile Computing and Networking ({MobiCom} '00)},
  year         = 2000,
  month        = aug,
  address      = {Boston, Massachusetts},
}

Programming language techniques for modular router configurations Eddie Kohler, Benjie Chen, M. Frans Kaashoek, Robert Morris, and Massimiliano Poletto. MIT LCS technical report, August 2000.

This paper applies programming language techniques to a high-level system description, both to optimize the system and to prove useful properties about it. The system in question is Click, a modular software router framework. Click routers are built from components called elements. Elements are written in C++, but the user creates a configuration using a simple, declarative data flow language. This language is amenable to data flow analysis and other conventional programming language techniques. Applied to a router configuration, these techniques have high-level results---for example, optimizing the router or verifying its high-level properties. This paper describes several programming language techniques that have been useful in practice, including optimization tools that remove virtual function calls from router definitions and remove redundant parts of adjacent routers. We also present performance results for an extensively optimized standards-compliant IP router. On conventional PC hardware, this router can forward up to 456,000 64-byte packets per second.

@techreport{click:lcstr00,
  title        = {Programming language techniques for modular router
    configurations},
  author       = {Eddie Kohler and Benjie Chen and M. Frans Kaashoek
    and Robert Morris and Massimiliano Poletto},
  institution  = {{MIT} Laboratory for Computer Science},
  year         = 2000,
  month        = aug,
  number       = {MIT-LCS-TR-812},
}

The Click modular router Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. ACM Transactions on Computer Systems 18(3), August 2000.

Click is a new software architecture for building flexible and configurable routers. A Click router is assembled from packet processing modules called elements. Individual elements implement simple router functions like packet classification, queueing, scheduling, and interfacing with network devices. A router configuration is a directed graph with elements at the vertices; packets flow along the edges of the graph. Several features make individual elements more powerful and complex configurations easier to write, including pull connections, which model packet flow driven by transmitting hardware devices, and flow-based router context, which helps an element locate other interesting elements.

Click configurations are modular and easy to extend. A standards-compliant Click IP router has sixteen elements on its forwarding path; some of its elements are also useful in Ethernet switches and IP tunneling configurations. Extending the IP router to support dropping policies, fairness among flows, or Differentiated Services simply requires adding a couple elements at the right place. On conventional PC hardware, the Click IP router achieves a maximum loss-free forwarding rate of 333,000 64-byte packets per second, demonstrating that Click's modular and flexible architecture is compatible with good performance.

@article{click:tocs00,
  title        = {The {Click} modular router},
  author       = {Eddie Kohler and Robert Morris and Benjie Chen and
    John Jannotti and M. Frans Kaashoek},
  volume       = 18,
  number       = 3,
  year         = 2000,
  month        = aug,
  pages        = {263--297},
  journal      = {{ACM} Transactions on Computer Systems},
}

Multops: a data structure for denial-of-service attack detection Thomer M. Gil. Master's thesis, Vrije Universiteit, August 2000.

@mastersthesis{click:gil-ms,
  title        = {MULTOPS: a data structure for denial-of-service
    attack detection},
  author       = {Thomer M. Gil},
  school       = {Vrije Universiteit},
  year         = 2000,
  month        = aug,
}

Self-certifying file system David Mazières. Ph.D. thesis, Massachusetts Institute of Technology, May 2000.

@phdthesis{sfs:mazieres-phd,
  title        = {Self-certifying File System},
  author       = {David Mazi{\`e}res},
  school       = {Massachusetts Institute of Technology},
  year         = 2000,
  month        = may,
}

A study of caching in the internet domain name system Emil Sit. Master's thesis, Massachusetts Institute of Technology, May 2000.

@mastersthesis{click:sit-ms,
  title        = {A Study of Caching in the Internet Domain Name
    System},
  author       = {Emil Sit},
  school       = {Massachusetts Institute of Technology},
  year         = 2000,
  month        = may,
}

Flexible key management with SFS agents Michael Kaminsky. Master's thesis, Massachusetts Institute of Technology, May 2000.

@mastersthesis{sfs:kaminsky-ms,
  title        = {Flexible Key Management with {SFS} Agents},
  author       = {Michael Kaminsky},
  school       = {Massachusetts Institute of Technology},
  year         = 2000,
  month        = may,
}

Multiprocessing with the Exokernel operating system Benjie Chen. Master's thesis, Massachusetts Institute of Technology, February 2000.

Exokernel is a minimal operating system kernel that safely multiplexes hardware resources, while leaving all system abstractions to applications. An exokernel exhibits better performance and offers more functionality because applications can provide optimized system abstractions, at the user-level, based on their needs. Current design of the exokernel system, however, does not support multiprocessor architectures. This thesis presents a symmetric multiprocessing exokernel and demonstrates that unprivileged library implementation of operating system abstractions is viable on a multiprocessor system.

This thesis focus on three issues. First, it presents synchronization strategies used in kernel. Second, this thesis describes three new exokernel interfaces: message passing, kernel support for multithreading, and multiprocessor scheduling. Third, because exokernel applications do not trust each other, traditional synchronization primitives used to guard system abstractions, such as voluntary memory locks, do not function well. This thesis presents and evaluates a strategy for synchronization among untrusted processes. A multiprocessor exokernel and a synchronized library operating system result from this thesis. Performance analysis shows that the overheads of synchronization in both the kernel and the library operating system are small.

@mastersthesis{exo:chen-meng,
  title        = {Multiprocessing with the {Exokernel} Operating
    System},
  author       = {Benjie Chen},
  school       = {Massachusetts Institute of Technology},
  year         = 2000,
  month        = feb,
}

1999

The Click modular router Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek. SOSP 1999.

Please consider citing the journal version of this paper, which has improved explanations, more detail and examples, and significantly better performance results.

Click is a new software architecture for building flexible and configurable routers. A Click router is assembled from packet processing modules called elements. Individual elements implement simple router functions like packet classification, queueing, scheduling, and interfacing with network devices. Complete configurations are built by connecting elements into a graph; packets flow along the graph's edges. Several features make individual elements more powerful and complex configurations easier to write, including pull processing, which models packet flow driven by transmitting interfaces, and flow-based router context, which helps an element locate other interesting elements.

We demonstrate several working configurations, including an IP router and an Ethernet bridge. These configurations are modular---the IP router has 16 elements on the forwarding path---and easy to extend by adding additional elements, which we demonstrate with augmented configurations. On commodity PC hardware running Linux, the Click IP router can forward 64-byte packets at 73,000 packets per second, just 10% slower than Linux alone.

@inproceedings{click:sosp99,
  title        = {The {C}lick modular router},
  author       = {Robert Morris and Eddie Kohler and John Jannotti and
    M. Frans Kaashoek},
  pages        = {217--231},
  booktitle    = {Proceedings of the 17th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '99)},
  year         = 1999,
  month        = dec,
  address      = {Kiawah Island, South Carolina},
}

Separating key management from file system security David Mazières, Michael Kaminsky, M. Frans Kaashoek, and Emmett Witchel. SOSP 1999.

@inproceedings{sfs:sosp99,
  title        = {{S}eparating key management from file system
    security},
  author       = {David Mazi{\`e}res and Michael Kaminsky and M. Frans
    Kaashoek and Emmett Witchel},
  pages        = {124--139},
  booktitle    = {Proceedings of the 17th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '99)},
  year         = 1999,
  month        = dec,
  address      = {Kiawah Island, South Carolina},
}

Linear scan register allocation Massimiliano Poletto, and Vivek Sarkar. ACM Transactions on Programming Languages and Systems 21(5), September 1999.

@article{linearscan,
  title        = {Linear scan register allocation},
  author       = {Massimiliano Poletto and Vivek Sarkar},
  volume       = 21,
  number       = 5,
  year         = 1999,
  month        = sep,
  pages        = {895--913},
  journal      = {{ACM} Transactions on Programming Languages and
    Systems},
}

A readable TCP in the Prolac protocol language Eddie Kohler, M. Frans Kaashoek, and David R. Montgomery. SIGCOMM 1999.

Prolac is a new statically-typed, object-oriented language for network protocol implementation. It is designed for readability, extensibility, and real-world implementation; most previous protocol languages, in contrast, have been based on hard-to-implement theoretical models and have focused on verification. We present a working Prolac TCP implementation directly derived from 4.4BSD. Our implementation is modular---protocol processing is logically divided into minimally-interacting pieces; readable---Prolac encourages top-down structure and naming intermediate computations; and extensible---subclassing cleanly separates protocol extensions like delayed acknowledgements and slow start. The Prolac compiler uses simple global analysis to remove expensive language features like dynamic dispatch, resulting in end-to-end performance comparable to an unmodified Linux 2.0 TCP.

@inproceedings{prolac:sigcomm99,
  title        = {A readable {TCP} in the {Prolac} protocol language},
  author       = {Eddie Kohler and M. Frans Kaashoek and David R.
    Montgomery},
  pages        = {3--13},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} '99 Conference:
    Applications, Technologies, Architectures, and Protocols for
    Computer Communication},
  year         = 1999,
  month        = aug,
  address      = {Cambridge, Massachusetts},
}

Language and compiler support for dynamic code generation Massimiliano Poletto. Ph.D. thesis, Massachusetts Institute of Technology, June 1999.

@phdthesis{tickc:poletto-phd,
  title        = {Language and compiler support for dynamic code
    generation},
  author       = {Massimiliano Poletto},
  school       = {Massachusetts Institute of Technology},
  year         = 1999,
  month        = jun,
}

The secure file system under Windows NT Matthew Rimer. Master's thesis, Massachusetts Institute of Technology, June 1999.

@mastersthesis{sfs:rimer-ms,
  title        = {The Secure File System under {Windows NT}},
  author       = {Matthew Rimer},
  school       = {Massachusetts Institute of Technology},
  year         = 1999,
  month        = jun,
}

Evolving software with an application-specific language Eddie Kohler, Massimiliano Poletto, and David R. Montgomery. WCSSS 1999.

Software systems can be developed through evolution (gradual change) or revolution (reimplementation from scratch). Both approaches have advantages and disadvantages. An evolutionary approach keeps the system working throughout, allowing early problem detection, but tends to retain ingrained design flaws and can result in complex, ad hoc systems. A revolutionary approach is required to change the basic architecture of a system, but many more resources must be invested before the system can be evaluated. In this paper, we describe how we used a little application-specific language to combine these approaches' advantages.

The context of our work is CTAS, the next-generation air traffic control automation system developed originally by NASA. The overall goal was to redesign and reimplement one of the CTAS processes in Java, while retaining its ability to communicate with unmodified processes---a project complicated by CTAS's ad hoc message formats. To address this, we designed a language that combines C code copied from CTAS source, to express the message formats, with new Java code for message actions. A compiler then automatically generates code for marshalling and unmarshalling. The result is a system with both evolutionary and revolutionary properties, exemplified by the use of both old CTAS code and new Java code in the message language.

This paper discusses the language and compiler and evaluates some of the engineering tradeoffs inherent in their design.

@inproceedings{evolving-software:wcsss99,
  title        = {Evolving software with an application-specific
    language},
  author       = {Eddie Kohler and Massimiliano Poletto and David R.
    Montgomery},
  pages        = {94--102},
  booktitle    = {Workshop Record of {WCSSS} '99: The 2nd {ACM}
    {SIGPLAN} Workshop on Compiler Support for Systems Software},
  year         = 1999,
  month        = may,
  address      = {Atlanta, Georgia},
}

A fast Prolac TCP for the real world Jr. Montgomery David Rogers. Master's thesis, Massachusetts Institute of Technology, May 1999.

@mastersthesis{prolac:montgomery-meng,
  title        = {A fast {Prolac} {TCP} for the real world},
  author       = {Montgomery, Jr., David Rogers},
  school       = {Massachusetts Institute of Technology},
  year         = 1999,
  month        = may,
}

An x86 protected mode virtual machine monitor for the MIT Exokernel Charles L. Coffing. Master's thesis, Massachusetts Institute of Technology, May 1999.

@mastersthesis{exo:coffing-meng,
  title        = {An x86 Protected Mode Virtual Machine Monitor for
    the {MIT} {Exokernel}},
  author       = {Charles L. Coffing},
  school       = {Massachusetts Institute of Technology},
  year         = 1999,
  month        = may,
}

PAN: a high-performance active network node supporting multiple mobile code systems Erik L. Nygren, Stephen J. Garland, and M. Frans Kaashoek. OpenArch 1999.

A capsule-based active network transports capsules containing code to be executed on network nodes through which they pass. Active networks facilitate the deployment of new protocols, which can be used without any changes to the underlying network infrastructure. This paper describes the design, implementation, and evaluation of a high-performance active network node which supports multiple mobile code systems. Experiments, using capsules executing unsafe native Intel ix86 object code, indicate that active networks may be able to provide significant flexibility relative to traditional networks with only a small performance overhead (as little as 13% for 1500 byte packets). However, capsules executing JavaVM code performed far worse (with over three times the performance overhead of native code for 128 byte packets), indicating that mobile code system performance is critical to overall node performance.

@inproceedings{pan:openarch99,
  title        = {{PAN}: a high-performance active network node
    supporting multiple mobile code systems},
  author       = {Erik L. Nygren and Stephen J. Garland and M. Frans
    Kaashoek},
  pages        = {78--89},
  booktitle    = {Proceedings of the 2nd {IEEE} Conference on Open
    Architectures and Network Programming ({OpenArch} '99)},
  year         = 1999,
  month        = mar,
  address      = {New York, New York},
}

`C and tcc: A language and compiler for dynamic code generation Massimiliano Poletto, Wilson C. Hsieh, Dawson R. Engler, and M. Frans Kaashoek. ACM Transactions on Programming Languages and Systems 21(2), March 1999.

@article{tickc:toplas,
  title        = {{`C} and {tcc}: A language and compiler for dynamic
    code generation},
  author       = {Massimiliano Poletto and Wilson C. Hsieh and Dawson
    R. Engler and M. Frans Kaashoek},
  volume       = 21,
  number       = 2,
  year         = 1999,
  month        = mar,
  pages        = {324--369},
  journal      = {{ACM} Transactions on Programming Languages and
    Systems},
}

1998

The design, implementation and operation of an email pseudonym server David Mazières, and M. Frans Kaashoek. CCS 1998.

@inproceedings{nymserver:ccs5,
  title        = {The design, implementation and operation of an email
    pseudonym server},
  author       = {David Mazi{\`e}res and M. Frans Kaashoek},
  pages        = {27--36},
  booktitle    = {Proceedings of the 5th {ACM} Conference on Computer
    and Communications Security ({CCS-5})},
  year         = 1998,
  month        = nov,
  address      = {San Francisco, California},
}

The exokernel operating system architecture Dawson R. Engler. Ph.D. thesis, Massachusetts Institute of Technology, October 1998.

@phdthesis{exo:engler-phd,
  title        = {The exokernel operating system architecture},
  author       = {Dawson R. Engler},
  school       = {Massachusetts Institute of Technology},
  year         = 1998,
  month        = oct,
}

Escaping the evils of centralized control with self-certifying pathnames David Mazières, and M. Frans Kaashoek. SIGOPS European Workshop 1998.

@inproceedings{sfs:sigops-euro8,
  title        = {Escaping the evils of centralized control with
    self-certifying pathnames},
  author       = {David Mazi{\`e}res and M. Frans Kaashoek},
},
  booktitle    = {Proceedings of the 8th {ACM} {SIGOPS} {E}uropean
    workshop: Support for composing distributed applications},
  year         = 1998,
  month        = sep,
  address      = {Sintra, Portugal},
}

Framework for implementing file systems in Windows NT Danilo Almeida. Master's thesis, Massachusetts Institute of Technology, May 1998.

@mastersthesis{sfs:almeida-ms,
  title        = {Framework for Implementing File Systems in {Windows
    NT}},
  author       = {Danilo Almeida},
  school       = {Massachusetts Institute of Technology},
  year         = 1998,
  month        = may,
}

Flexible and efficient sharing of protected abstractions George M. Candea. Master's thesis, Massachusetts Institute of Technology, May 1998.

Traditional operating systems are overly restrictive and do not allow user-level applications to modify operating system abstractions. The exokernel operating system architecture safely gives untrusted applications efficient control over hardware and software resources by separating management from protection. Decentralized control, however, makes it very difficult for mutually distrustful applications to share system abstractions.

This thesis presents the design, implementation, and evaluation of the protected abstraction mechanism (PAM), a novel way to safely share user-level abstractions in an exokernel. PAM enables unprivileged, untrusted applications to define and securely share generic abstractions at run-time. PAM achieves a good flexibility-performance combination by eliminating the need for context switches and optimizing for the common case, in which the same abstraction is invoked repeatedly. PAM's design emphasizes simplicity and provable correctness, which makes it easy to understand and use: a couple of manual pages are sufficient for the average programmer to start using PAM.

We report measurements of PAM's performance on null method calls. In spite of the fact that such invocations do not take advantage of PAM's context switch-free operation, the PAM version of a simple abstraction outperforms the equivalent LRPC implementation by over 15% on null method calls. It is also considerably easier to write an abstraction using PAM. We therefore believe the protected abstraction mechanism is a viable solution to the problem of safely sharing user-level abstractions in the exokernel.

@mastersthesis{exo:candea-meng,
  title        = {Flexible and efficient sharing of protected
    abstractions},
  author       = {George M. Candea},
  school       = {Massachusetts Institute of Technology},
  year         = 1998,
  month        = may,
}

Performance evaluation of the Orca shared-object system Henri Bal, Raoul Bhoedjang, Rutger Hofman, Ceriel Jacobs, Koen Langendoen, Tim Ruhl, and M. Frans Kaashoek. ACM Transactions on Computer Systems 16(1), February 1998.

@article{orca:tocs,
  title        = {Performance evaluation of the {Orca} shared-object
    system},
  author       = {Henri Bal and Raoul Bhoedjang and Rutger Hofman and
    Ceriel Jacobs and Koen Langendoen and Tim Ruhl and M. Frans
    Kaashoek},
  pages        = {1--40},
  month        = feb,
  year         = 1998,
  volume       = 16,
  number       = 1,
  journal      = {{ACM} Transactions on Computer Systems},
}

An efficient virtual network interface in the FUGU scalable workstation Kenneth M. Mackenzie. Ph.D. thesis, Massachusetts Institute of Technology, February 1998.

@phdthesis{fugu:mackenzie-phd,
  title        = {An efficient virtual network interface in the {FUGU}
    scalable workstation},
  author       = {Kenneth M. Mackenzie},
  school       = {Massachusetts Institute of Technology},
  year         = 1998,
  month        = feb,
}

Applying exokernel principles to conventional operating systems John Jannotti. Master's thesis, Massachusetts Institute of Technology, February 1998.

@mastersthesis{exo-os:jj-meng,
  title        = {Applying Exokernel Principles to Conventional
    Operating Systems},
  author       = {John Jannotti},
  school       = {Massachusetts Institute of Technology},
  year         = 1998,
  month        = feb,
}

The design and implementation of a high-performance active network node Erik L. Nygren. Master's thesis, Massachusetts Institute of Technology, February 1998.

A capsule-oriented active network transports capsules containing code to be executed on the network nodes that they pass through. This approach makes networks more extensible by allowing new networking protocols to be deployed and used without any changes to the underlying network infrastructure. This thesis project describes the design, implementation, and evaluation of a high-performance practical active network node that can serve as a testbed for research into active network performance and resource management issues. Nodes provide resources to executing capsules containing Intel ix86 object code. Although the current implementation does not yet provide safety or interoperability, the results of experiments performed on the system implemented for this thesis indicate that an active network architecture may be able to provide significant flexibility while only incurring a small performance overhead relative to traditional networks.

@mastersthesis{pan:nygren-meng,
  title        = {The design and implementation of a high-performance
    active network node},
  author       = {Erik L. Nygren},
  school       = {Massachusetts Institute of Technology},
  year         = 1998,
  month        = feb,
}

1997

Mobile computing with the Rover toolkit Anthony Joseph. Ph.D. thesis, Massachusetts Institute of Technology, November 1997.

@phdthesis{rover:adj-phd,
  title        = {Mobile computing with the {Rover} toolkit},
  author       = {Anthony Joseph},
  school       = {Massachusetts Institute of Technology},
  year         = 1997,
  month        = nov,
}

Application performance and flexibility on exokernel systems M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Héctor M. Briceño, Russell Hunt, David Mazières, Thomas Pinckney, Robert Grimm, John Jannotti, and Kenneth Mackenzie. SOSP 1997.

The exokernel operating system architecture safely gives untrusted software efficient control over hardware and software resources by separating management from protection. This paper describes an exokernel system that allows specialized applications to achieve high performance without sacrificing the performance of unmodified UNIX programs. It evaluates the exokernel architecture by measuring end-to-end application performance on Xok, an exokernel for Intel x86-based computers, and by comparing Xok's performance to the performance of two widely-used 4.4BSD UNIX systems (FreeBSD and OpenBSD). The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). This paper also describes insights about the exokernel approach gained through building three different exokernel systems, and presents novel approaches to resource multiplexing.

@inproceedings{exo:sosp97,
  title        = {Application performance and flexibility on exokernel
    systems},
  author       = {M. Frans Kaashoek and Dawson R. Engler and Gregory
    R. Ganger and H{\'e}ctor M. Brice{\~n}o and Russell Hunt and David
    Mazi{\`e}res and Thomas Pinckney and Robert Grimm and John
    Jannotti and Kenneth Mackenzie},
  pages        = {52--65},
  booktitle    = {Proceedings of the 16th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '97)},
  year         = 1997,
  month        = oct,
  address      = {Saint-Mal{\^o}, France},
}

Shared libraries in an exokernel operating system Douglas Karl Wyatt. Master's thesis, Massachusetts Institute of Technology, September 1997.

@mastersthesis{exo:wyatt-meng,
  title        = {Shared libraries in an exokernel operating system},
  author       = {Douglas Karl Wyatt},
  school       = {Massachusetts Institute of Technology},
  year         = 1997,
  month        = sep,
}

Prolac: a language for protocol compilation Eddie Kohler. Master's thesis, Massachusetts Institute of Technology, September 1997.

@mastersthesis{prolac:kohler-ms,
  title        = {Prolac: a language for protocol compilation},
  author       = {Eddie Kohler},
  school       = {Massachusetts Institute of Technology},
  year         = 1997,
  month        = sep,
}

Security and decentralized control in the SFS global file system David Mazières. Master's thesis, Massachusetts Institute of Technology, August 1997.

@mastersthesis{sfs:mazieres-ms,
  title        = {Security and decentralized control in the {SFS}
    global file system},
  author       = {David Mazi{\`e}res},
  school       = {Massachusetts Institute of Technology},
  year         = 1997,
  month        = aug,
}

tcc: a system for fast, flexible, and high-level dynamic code generation Massimiliano Poletto, Dawson R. Engler, and M. Frans Kaashoek. PLDI 1997.

@inproceedings{tickc:pldi97,
  title        = {tcc: a system for fast, flexible, and high-level
    dynamic code generation},
  author       = {Massimiliano Poletto and Dawson R. Engler and M.
    Frans Kaashoek},
  pages        = {109--121},
  booktitle    = {Proceedings of the {ACM} {SIGPLAN} '97 Conference on
    Programming Language Design and Implementation ({PLDI} '97)},
  year         = 1997,
  month        = jun,
  address      = {Las Vegas, Nevada},
}

Secure applications need flexible operating systems David Mazières, and M. Frans Kaashoek. HotOS 1997.

@inproceedings{secure-apps:hotos6,
  title        = {Secure applications need flexible operating systems},
  author       = {David Mazi{\`e}res and M. Frans Kaashoek},
  pages        = {56--61},
  booktitle    = {Proceedings of the 6th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS-VI})},
  year         = 1997,
  month        = may,
  organization = {{IEEE} {C}omputer {S}ociety},
  address      = {Chatham, Cape Cod, Massachusetts},
}

Mobile computing with the Rover toolkit Anthony D. Joseph, Joshua A. Tauber, and M. Frans Kaashoek. IEEE Transactions on Computers 46(3), March 1997.

@article{rover:ieee-toc,
  title        = {Mobile computing with the {Rover} toolkit},
  author       = {Anthony D. Joseph and Joshua A. Tauber and M. Frans
    Kaashoek},
  volume       = 46,
  number       = 3,
  year         = 1997,
  month        = mar,
  pages        = {337--352},
  journal      = {{IEEE} Transactions on Computers},
}

Operating system extensibility through event capture Thomas Pinckney III. Master's thesis, Massachusetts Institute of Technology, February 1997.

@mastersthesis{exo:pinckney-meng,
  title        = {Operating system extensibility through event capture},
  author       = {Thomas {Pinckney III}},
  school       = {Massachusetts Institute of Technology},
  year         = 1997,
  month        = feb,
}

Decentralizing UNIX abstractions in the exokernel architecture Héctor Manuel Briceño Pulido. Master's thesis, Massachusetts Institute of Technology, February 1997.

@mastersthesis{exo:briceno-meng,
  title        = {Decentralizing {UNIX} abstractions in the exokernel
    architecture},
  author       = {H{\'e}ctor Manuel {Brice{\~n}o Pulido}},
  school       = {Massachusetts Institute of Technology},
  year         = 1997,
  month        = feb,
}

Embedded inodes and explicit grouping: exploiting disk bandwidth for small files Gregory R. Ganger, and M. Frans Kaashoek. USENIX 1997.

Small file performance in most file systems is limited by slowly improving disk access times, even though current file systems improve on-disk locality by allocating related data objects in the same general region. The key insight for why current file systems perform poorly is that locality is insufficient - exploiting disk bandwidth for small data objects requires that they be placed adjacently. We describe C-FFS (Co-locating Fast File System), which introduces two techniques, embedded inodes and explicit grouping, for exploiting what disks do well (bulk data movement) to avoid what they do poorly (reposition to new locations). With embedded inodes, the inodes for most files are stored in the directory with the corresponding name, removing a physical level of indirection without sacrificing the logical level of indirection. With explicit grouping, the data blocks of multiple small files named by a given directory are allocated adjacently and moved to and from the disk as a unit in most cases. Measurements of our C-FFS implementation show that embedded inodes and explicit grouping have the potential to increase small file throughput (for both reads and writes) by a factor of 5-7 compared to the same file system without these techniques. The improvement comes directly from reducing the number of disk accesses required by an order of magnitude. Preliminary experience with software-development applications shows performance improvements ranging from 10-300 percent.

@inproceedings{cffs:usenix97,
  title        = {Embedded inodes and explicit grouping: exploiting
    disk bandwidth for small files},
  author       = {Gregory R. Ganger and M. Frans Kaashoek},
  pages        = {1--17},
  booktitle    = {Proceedings of the 1997 {USENIX} Annual Technical
    Conference (USENIX '97)},
  year         = 1997,
  month        = jan,
  address      = {Anaheim, California},
}

High-performance application-specific networking Deborah Anne Wallach. Ph.D. thesis, Massachusetts Institute of Technology, January 1997.

@phdthesis{app-specific-networking:wallach-phd,
  title        = {High-performance application-specific networking},
  author       = {Deborah Anne Wallach},
  school       = {Massachusetts Institute of Technology},
  year         = 1997,
  month        = jan,
}

Building reliable mobile-aware applications using the Rover toolkit Anthony D. Joseph, and M. Frans Kaashoek. ACM Wireless Networks 3(5), 1997.

@article{rover:winet,
  title        = {Building reliable mobile-aware applications using
    the {Rover} toolkit},
  author       = {Anthony D. Joseph and M. Frans Kaashoek},
  volume       = 3,
  number       = 5,
  year         = 1997,
  pages        = {405--419},
  journal      = {{ACM} Wireless Networks},
}

1996

Building reliable mobile-aware applications using the Rover toolkit Anthony D. Joseph, Joshua A. Tauber, and M. Frans Kaashoek. MobiCom 1996.

@inproceedings{rover:mobicom,
  title        = {Building reliable mobile-aware applications using
    the {Rover} toolkit},
  author       = {Anthony D. Joseph and Joshua A. Tauber and M. Frans
    Kaashoek},
  booktitle    = {Proceedings of the 2nd {ACM} International
    Conference on Mobile Computing and Networking ({MobiCom} '96)},
  year         = 1996,
  month        = nov,
  address      = {Rye, New York},
}

Dynamic computation migration in distributed shared memory systems Wilson C. Hsieh, M. Frans Kaashoek, and William E. Weihl. HPCC 1996.

@inproceedings{dynamic-migration:supercomp96,
  title        = {Dynamic computation migration in distributed shared
    memory systems},
  author       = {Wilson C. Hsieh and M. Frans Kaashoek and William E.
    Weihl},
  booktitle    = {Supercomputing '96 Conference Proceedings: The
    international conference on high performance computing and
    communications},
  organization = {{ACM}},
  year         = 1996,
  month        = nov,
  address      = {Pittsburgh, Pennsylvania},
}

Server operating systems M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, and Deborah A. Wallach. SIGOPS European Workshop 1996.

We introduce server operating systems, which are sets of abstractions and runtime support for specialized, high-performance server applications. We have designed and are implementing a prototype server OS with support for aggressive specialization, direct device-to-device access, an event-driven organization, and dynamic compiler-assisted ILP. Using this server OS, we have constructed an HTTP server that outperforms servers running on a conventional OS by more than an order of magnitude and that can safely timeshare the hardware platform with other applications.

@inproceedings{server-os:sigops-euro,
  title        = {Server operating systems},
  author       = {M. Frans Kaashoek and Dawson R. Engler and Gregory
    R. Ganger and Deborah A. Wallach},
  pages        = {141--148},
  booktitle    = {Proceedings of the 7th {ACM} {SIGOPS} {E}uropean
    workshop: Systems support for worldwide applications},
  year         = 1996,
  month        = sep,
  address      = {Connemara, Ireland},
}

ASHs: application-specific handlers for high-performance messaging Deborah A. Wallach, Dawson R. Engler, and M. Frans Kaashoek. SIGCOMM 1996.

@inproceedings{ash:sigcomm96,
  title        = {{ASHs}: application-specific handlers for
    high-performance messaging},
  author       = {Deborah A. Wallach and Dawson R. Engler and M. Frans
    Kaashoek},
  pages        = {40--52},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} '96 Conference:
    Applications, Technologies, Architectures, and Protocols for
    Computer Communication},
  year         = 1996,
  month        = aug,
  address      = {Stanford, California},
}

DPF: fast, flexible message demultiplexing using dynamic code generation Dawson R. Engler, and M. Frans Kaashoek. SIGCOMM 1996.

@inproceedings{dpf:sigcomm96,
  title        = {{DPF}: fast, flexible message demultiplexing using
    dynamic code generation},
  author       = {Dawson R. Engler and M. Frans Kaashoek},
  pages        = {53--59},
  booktitle    = {Proceedings of the {ACM} {SIGCOMM} '96 Conference:
    Applications, Technologies, Architectures, and Protocols for
    Computer Communication},
  year         = 1996,
  month        = aug,
  address      = {Stanford, California},
}

The Rover NNTP proxy Constantine Cristakos. Advanced Undergraduate Project, Massachusetts Institute of Technology, June 1996.

@mastersthesis{rover:nntp,
  title        = {The {Rover} {NNTP} proxy},
  author       = {Constantine Cristakos},
  school       = {Massachusetts Institute of Technology},
  year         = 1996,
  month        = jun,
  type         = {Advanced Undergraduate Project},
}

An evaluation of the Amoeba group communication system M. Frans Kaashoek, and Andrew S. Tanenbaum. ICDCS 1996.

@inproceedings{amoeba-eval:dcs16,
  title        = {An evaluation of the {Amoeba} group communication
    system},
  author       = {M. Frans Kaashoek and Andrew S. Tanenbaum},
  pages        = {436--448},
  booktitle    = {Proceedings of the 16th International Conference on
    Distributed Computing Systems},
  organization = {{IEEE} {C}omputer {S}ociety},
  year         = 1996,
  month        = may,
  address      = {Hong Kong},
}

Atomic recovery units: failure atomicity for logical disks Robert Grimm, Wilson C. Hsieh, Wiebren de Jonge, and M. Frans Kaashoek. ICDCS 1996.

@inproceedings{arus:dcs16,
  title        = {Atomic recovery units: failure atomicity for logical
    disks},
  author       = {Robert Grimm and Wilson C. Hsieh and Wiebren de
    Jonge and M. Frans Kaashoek},
  pages        = {26--37},
  booktitle    = {Proceedings of the 16th International Conference on
    Distributed Computing Systems},
  organization = {{IEEE} {C}omputer {S}ociety},
  year         = 1996,
  month        = may,
  address      = {Hong Kong},
}

VCODE: a retargetable, extensible, very fast dynamic code generation system Dawson R. Engler. PLDI 1996.

@inproceedings{vcode:pldi96,
  title        = {{VCODE}: a retargetable, extensible, very fast
    dynamic code generation system},
  author       = {Dawson R. Engler},
  pages        = {160--170},
  booktitle    = {Proceedings of the {ACM} {SIGPLAN} '96 Conference on
    Programming Language Design and Implementation ({PLDI} '96)},
  year         = 1996,
  month        = may,
  address      = {Philadelphia, Pennsylvania},
}

Exodisk: maximizing application control over storage management Robert Grimm. Master's thesis, Massachusetts Institute of Technology, May 1996.

@mastersthesis{exo:grimm-ms,
  title        = {Exodisk: maximizing application control over storage
    management},
  author       = {Robert Grimm},
  school       = {Massachusetts Institute of Technology},
  year         = 1996,
  month        = may,
}

Issues in building mobile-aware applications with the Rover toolkit Joshua A. Tauber. Master's thesis, Massachusetts Institute of Technology, May 1996.

@mastersthesis{rover:tauber-ms,
  title        = {Issues in building mobile-aware applications with
    the {Rover} toolkit},
  author       = {Joshua A. Tauber},
  school       = {Massachusetts Institute of Technology},
  year         = 1996,
  month        = may,
}

tcc: a template-based compiler for `C Massimiliano Poletto, Dawson R. Engler, and M. Frans Kaashoek. WCSSS 1996.

@inproceedings{tickc:wcsss96,
  title        = {tcc: a template-based compiler for {`C}},
  author       = {Massimiliano Poletto and Dawson R. Engler and M.
    Frans Kaashoek},
  pages        = {1--7},
  booktitle    = {Workshop Record of {WCSSS} '96: The Inaugural
    Workshop on Compiler Support for Systems Software},
  organization = {{ACM} {SIGPLAN}},
  year         = 1996,
  month        = feb,
  address      = {Tuscon, Arizona},
}

`C: A language for efficient, machine-independent dynamic code generation Dawson R. Engler, Wilson C. Hsieh, and M. Frans Kaashoek. POPL 1996.

@inproceedings{tickc:popl96,
  title        = {{`C}: A language for efficient, machine-independent
    dynamic code generation},
  author       = {Dawson R. Engler and Wilson C. Hsieh and M. Frans
    Kaashoek},
  pages        = {131--144},
  note         = {An earlier version is available as MIT-LCS-TM-526},
  booktitle    = {Proceedings of the 23rd {ACM} {SIGPLAN}-{SIGACT}
    Symposium on Principles of Programming Languages ({POPL} '96)},
  year         = 1996,
  month        = jan,
  address      = {St. Petersburg Beach, Florida},
}

1995

Rover: a toolkit for mobile information access Anthony D. Joseph, Alan F. deLespinasse, Joshua A. Tauber, David K. Gifford, and M. Frans Kaashoek. SOSP 1995.

@inproceedings{rover:sosp95,
  title        = {{Rover}: a toolkit for mobile information access},
  author       = {Anthony D. Joseph and {deLespinasse}, Alan F. and
    Joshua A. Tauber and David K. Gifford and M. Frans Kaashoek},
  pages        = {156--171},
  booktitle    = {Proceedings of the 15th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '95)},
  year         = 1995,
  month        = dec,
  address      = {Copper Mountain Resort, Colorado},
}

Exokernel: an operating system architecture for application-level resource management Dawson R. Engler, M. Frans Kaashoek, and James O'Toole Jr.. SOSP 1995.

@inproceedings{exo:sosp95,
  title        = {{E}xokernel: an operating system architecture for
    application-level resource management},
  author       = {Dawson R. Engler and M. Frans Kaashoek and James
    {O'Toole Jr.}},
  pages        = {251--266},
  booktitle    = {Proceedings of the 15th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '95)},
  year         = 1995,
  month        = dec,
  address      = {Copper Mountain Resort, Colorado},
}

CRL: high-performance all-software distributed shared memory Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach. SOSP 1995.

@inproceedings{crl:sosp95,
  title        = {{CRL}: high-performance all-software distributed
    shared memory},
  author       = {Kirk L. Johnson and M. Frans Kaashoek and Deborah A.
    Wallach},
  pages        = {213--226},
  note         = {An earlier version of this work appeared as
    Technical Report MIT-LCS-TM-517, MIT Laboratory for Computer
    Science, March 1995},
  booktitle    = {Proceedings of the 15th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '95)},
  year         = 1995,
  month        = dec,
  address      = {Copper Mountain Resort, Colorado},
}

High-performance all-software distributed shared memory Kirk L. Johnson. Ph.D. thesis, Massachusetts Institute of Technology, December 1995.

@phdthesis{crl:johnson-phd,
  title        = {High-performance all-software distributed shared
    memory},
  author       = {Kirk L. Johnson},
  school       = {Massachusetts Institute of Technology},
  year         = 1995,
  month        = dec,
}

Dynamic computation migration in distributed shared memory systems Wilson C. Hsieh. Ph.D. thesis, Massachusetts Institute of Technology, September 1995.

@phdthesis{dyn-comp-migration:hsieh-phd,
  title        = {Dynamic computation migration in distributed shared
    memory systems},
  author       = {Wilson C. Hsieh},
  school       = {Massachusetts Institute of Technology},
  year         = 1995,
  month        = sep,
  note         = {Also available as MIT LCS tech report
    MIT-LCS-TR-665.},
}

Optimistic active messages: a mechanism for scheduling communication with computation Deborah A. Wallach, Wilson C. Hsieh, Kirk Johnson, M. Frans Kaashoek, and William E. Weihl. PPoPP 1995.

@inproceedings{oam:ppopp95,
  title        = {Optimistic active messages: a mechanism for
    scheduling communication with computation},
  author       = {Deborah A. Wallach and Wilson C. Hsieh and Kirk
    Johnson and M. Frans Kaashoek and William E. Weihl},
  pages        = {217--226},
  booktitle    = {Proceedings of the 5th {ACM} {SIGPLAN} Symposium on
    Principles and Practice of Parallel Programming ({PPoPP} '95)},
  year         = 1995,
  month        = jul,
  address      = {Santa Barbara, California},
}

Implementing sequentially consistent shared objects using broadcast and point-to-point communication Alan Fekete, M. Frans Kaashoek, and Nancy Lynch. ICDCS 1995.

@inproceedings{formal-sequential-consistent:dcs15,
  title        = {Implementing sequentially consistent shared objects
    using broadcast and point-to-point communication},
  author       = {Alan Fekete and M. Frans Kaashoek and Nancy Lynch},
  pages        = {439--449},
  booktitle    = {Proceedings of the 15th International Conference on
    Distributed Computing Systems},
  organization = {{IEEE} {C}omputer {S}ociety},
  year         = 1995,
  month        = jun,
  address      = {Vancouver, British Columbia},
}

Implementing sequentially consistent shared objects using broadcast and point-to-point communication Alan Fekete, M. Frans Kaashoek, and Nancy Lynch. MIT LCS technical report, June 1995.

@techreport{formal-sequential-consistent:tr,
  title        = {Implementing sequentially consistent shared objects
    using broadcast and point-to-point communication},
  author       = {Alan Fekete and M. Frans Kaashoek and Nancy Lynch},
  institution  = {{MIT} Laboratory for Computer Science},
  year         = 1995,
  month        = jun,
  number       = {MIT-LCS-TR-518},
}

Rover Mosaic: e-mail communication for a full-function Web browser Alan F. deLespinasse. Master's thesis, Massachusetts Institute of Technology, June 1995.

@mastersthesis{rover-mosaic:delespinasse-thesis,
  title        = {{Rover} {Mosaic}: e-mail communication for a
    full-function {Web} browser},
  author       = {Alan F. {deLespinasse}},
  school       = {Massachusetts Institute of Technology},
  year         = 1995,
  month        = jun,
}

Exterminate all operating system abstractions Dawson R. Engler, and M. Frans Kaashoek. HotOS 1995.

@inproceedings{exo:hotos5,
  title        = {Exterminate all operating system abstractions},
  author       = {Dawson R. Engler and M. Frans Kaashoek},
  pages        = {78--83},
  booktitle    = {Proceedings of the 5th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS-V})},
  year         = 1995,
  month        = may,
  organization = {{IEEE} {C}omputer {S}ociety},
  address      = {Orcas Island, Washington},
}

AVM: application-level virtual memory Dawson R. Engler, Sandeep K. Gupta, and M. Frans Kaashoek. HotOS 1995.

@inproceedings{avm:hotos5,
  title        = {{AVM}: application-level virtual memory},
  author       = {Dawson R. Engler and Sandeep K. Gupta and M. Frans
    Kaashoek},
  pages        = {72--77},
  booktitle    = {Proceedings of the 5th {W}orkshop on {H}ot {T}opics
    in {O}perating {S}ystems ({HotOS-V})},
  year         = 1995,
  month        = may,
  organization = {{IEEE} {C}omputer {S}ociety},
  address      = {Orcas Island, Washington},
}

The operating system kernel as a secure programmable machine Dawson R. Engler, M. Frans Kaashoek, and James W. O'Toole Jr.. Operating Systems Review 29(1), January 1995.

@article{exo:osr,
  title        = {The operating system kernel as a secure programmable
    machine},
  author       = {Dawson R. Engler and M. Frans Kaashoek and {O'Toole
    Jr.}, James W.},
  year         = 1995,
  month        = jan,
  volume       = 29,
  number       = 1,
  pages        = {78--82},
  journal      = {Operating Systems Review},
  organization = {{ACM}},
}

1994

Dynamic documents: mobile wireless access to the WWW M. Frans Kaashoek, Tom Pinckney, and Joshua A. Tauber. WMCSA 1994.

We propose dynamic documents as an approach to extending and customizing the WWW/Mosaic for mobile computing platforms. Dynamic documents are programs executed on a mobile platform to generate a document; they are implemented as Tcl scripts. We have modified the NCSA Mosaic web client to run the dynamic documents it retrieves through a modified Tcl interpreter. The interpreter is designed to execute only commands that do not violate safety.

To hide the latencies of slow links we have modified the Mosaic client to perform caching and prefetching. The policies for caching and prefetching can be under control of dynamic documents, allowing the strategies to be document-specific. Using dynamic documents, we have built an adaptive email browser that employs application-specific caching and prefetching strategies. Both the browser and the displayed email messages are dynamically customized to the mobile computing environment in which they run.

@inproceedings{dynamic-documents:wmcsa,
  title        = {Dynamic documents: mobile wireless access to the
    {WWW}},
  author       = {M. Frans Kaashoek and Tom Pinckney and Joshua A.
    Tauber},
  pages        = {179--184},
  booktitle    = {Proceedings of the Workshop on Mobile Computing
    Systems and Applications ({WMCSA} '94)},
  organization = {{IEEE} {C}omputer {S}ociety},
  year         = 1994,
  month        = dec,
  address      = {Santa Cruz, California},
}

Storage alternatives for mobile computers Fred Douglis, Ramón Cáceres, M. Frans Kaashoek, Kai Li, Brian Marsh, and Joshua A. Tauber. OSDI 1994.

@inproceedings{mobile-storage-alt:osdi1,
  title        = {Storage alternatives for mobile computers},
  author       = {Fred Douglis and Ram{\'o}n C{\'a}ceres and M. Frans
    Kaashoek and Kai Li and Brian Marsh and Joshua A. Tauber},
  pages        = {25--37},
  booktitle    = {Proceedings of the 1st {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '94)},
  year         = 1994,
  month        = nov,
  address      = {Monterey, California},
}

The exokernel approach to extensibility (panel statement) Dawson R. Engler, M. Frans Kaashoek, and James W. O'Toole Jr.. OSDI 1994.

@inproceedings{exo:osdi1,
  title        = {The exokernel approach to extensibility (panel
    statement)},
  author       = {Dawson R. Engler and M. Frans Kaashoek and {O'Toole
    Jr.}, James W.},
  pages        = 198,
  booktitle    = {Proceedings of the 1st {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '94)},
  year         = 1994,
  month        = nov,
  address      = {Monterey, California},
}

Software prefetching and caching for translation lookaside buffers Kavita Bala, M. Frans Kaashoek, and William Weihl. OSDI 1994.

@inproceedings{software-tlb-prefetch:osdi1,
  title        = {Software prefetching and caching for translation
    lookaside buffers},
  author       = {Kavita Bala and M. Frans Kaashoek and William Weihl},
  pages        = {243--253},
  booktitle    = {Proceedings of the 1st {USENIX} {S}ymposium on
    {O}perating {S}ystems {D}esign and {I}mplementation ({OSDI} '94)},
  year         = 1994,
  month        = nov,
  address      = {Monterey, California},
}

Dynamic documents: extensibility and adaptability in the WWW M. Frans Kaashoek, Tom Pinckney, and Joshua A. Tauber. WWW 1994.

@inproceedings{dynamic-documents:www94,
  title        = {Dynamic documents: extensibility and adaptability in
    the {WWW}},
  author       = {M. Frans Kaashoek and Tom Pinckney and Joshua A.
    Tauber},
  edition      = {developers' day track},
  booktitle    = {Proceedings of the 2nd International {WWW}
    Conference: Mosaic and the Web},
  year         = 1994,
  month        = oct,
  address      = {Chicago, Illinois},
}

DCG: an efficient, retargetable dynamic code generation system Dawson R. Engler, and Todd A. Proebsting. ASPLOS 1994.

@inproceedings{dcg:asplos6,
  title        = {{DCG}: an efficient, retargetable dynamic code
    generation system},
  author       = {Dawson R. Engler and Todd A. Proebsting},
  pages        = {263--272},
  booktitle    = {Proceedings of the 6th International Conference on
    Architectural Support for Programming Languages and Operating
    Systems ({ASPLOS-VI})},
  year         = 1994,
  month        = oct,
  address      = {San Jose, California},
}

The operating system kernel as a secure programmable machine Dawson R. Engler, M. Frans Kaashoek, and James W. O'Toole Jr.. SIGOPS European Workshop 1994.

@inproceedings{exo:sigops-euro,
  title        = {The operating system kernel as a secure programmable
    machine},
  author       = {Dawson R. Engler and M. Frans Kaashoek and {O'Toole
    Jr.}, James W.},
  pages        = {62--67},
  booktitle    = {Proceedings of the 6th {ACM} {SIGOPS} {E}uropean
    workshop: Matching operating systems to application needs},
  year         = 1994,
  month        = sep,
  address      = {Dagstuhl Castle, Wadern, Germany},
}

Efficient implementation of high-level languages on user-level communication architectures Wilson C. Hsieh, Kirk L. Johnson, M. Frans Kaashoek, Deborah A. Wallach, and William E. Weihl. MIT LCS technical report, May 1994.

@techreport{user-level-comm:tr,
  title        = {Efficient implementation of high-level languages on
    user-level communication architectures},
  author       = {Wilson C. Hsieh and Kirk L. Johnson and M. Frans
    Kaashoek and Deborah A. Wallach and William E. Weihl},
  institution  = {{MIT} Laboratory for Computer Science},
  year         = 1994,
  month        = may,
  number       = {MIT-LCS-TR-616},
}

1993

The logical disk: a new approach to improving file systems Wiebren de Jonge, M. Frans Kaashoek, and Wilson C. Hsieh. SOSP 1993.

@inproceedings{logicaldisk:sosp14,
  title        = {The logical disk: a new approach to improving file
    systems},
  author       = {Wiebren de Jonge and M. Frans Kaashoek and Wilson C.
    Hsieh},
  pages        = {15--28},
  booktitle    = {Proceedings of the 14th {ACM} {S}ymposium on
    {O}perating {S}ystems {P}rinciples ({SOSP} '93)},
  year         = 1993,
  month        = dec,
  address      = {Asheville, North Carolina},
}

The persistent relevance of IPC performance Wilson C. Hsieh, M. Frans Kaashoek, and William E. Weihl. WWOS 1993.

@inproceedings{ipc-persistent-relevance:wwos4,
  title        = {The persistent relevance of {IPC} performance},
  author       = {Wilson C. Hsieh and M. Frans Kaashoek and William E.
    Weihl},
  pages        = {186--190},
  booktitle    = {Proceedings of the 4th Workshop on Workstation
    Operating Systems},
  organization = {{IEEE} {C}omputer {S}ociety},
  year         = 1993,
  month        = oct,
  address      = {Napa, California},
}

All papers written by PDOS since '93