MOSBENCH is a set of application benchmarks designed to measure
the scalability of operating systems. It consists of applications
that previous work has shown not to scale well on Linux, as well
as applications that are designed for parallel execution and are
kernel intensive. The applications and workloads are chosen to
stress important parts of many kernel components:
- Exim, a mail server
- Memcached, an object cache
- Apache, a web server
- PostgreSQL, a SQL database
- gmake, a parallel build system
- psearchy, a parallel text indexer
- Metis, a multicore MapReduce library
An Analysis of Linux Scalability to Many Cores
Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey
Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.
In the Proceedings of the 9th USENIX Symposium on Operating
Systems Design and Implementation (OSDI '10), Vancouver, Canada,
October 2010.
This paper analyzes the scalability of seven system
applications (Exim, memcached, Apache, PostgreSQL, gmake,
Psearchy, and MapReduce) running on Linux on a 48-core
computer. Except for gmake, all applications trigger
scalability bottlenecks inside a recent Linux kernel. Using
mostly standard parallel programming techniques—this
paper introduces one new technique, sloppy
counters—these bottlenecks can be removed from the
kernel or avoided by changing the applications slightly.
In total, modifying the kernel required 3002 lines of changed
code. A speculative conclusion from this analysis is that
there is no scalability reason to give up on traditional
operating system organizations just yet.
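The sloppy counters introduced in this paper are per-core counters that cache spare references against a shared central count, so most reference-count updates stay core-local. A rough single-threaded illustration of the idea (the struct, function names, and batch sizes here are invented for exposition; real kernel code would use per-CPU data and proper memory ordering):

```c
#include <stdatomic.h>

#define NCORES 4
#define BATCH  8   /* references a core borrows from the central count at once */

/* hypothetical sketch of a sloppy counter: a shared central count plus
   per-core caches of spare references that absorb most updates */
struct sloppy_counter {
    atomic_long central;      /* references accounted for centrally */
    long local[NCORES];       /* spare references cached by each core */
};

static void sloppy_get(struct sloppy_counter *c, int core)
{
    if (c->local[core] == 0) {
        /* local cache empty: borrow a batch from the central counter */
        atomic_fetch_add(&c->central, BATCH);
        c->local[core] = BATCH;
    }
    c->local[core]--;         /* take one cached reference, core-locally */
}

static void sloppy_put(struct sloppy_counter *c, int core)
{
    c->local[core]++;         /* return the reference to the local cache */
    if (c->local[core] > 2 * BATCH) {
        /* cache grew too large: give a batch back to the central counter */
        atomic_fetch_sub(&c->central, BATCH);
        c->local[core] -= BATCH;
    }
}
```

The true reference count is the central value minus the spares cached on all cores; the contended central counter is touched only when a core's cache runs empty or overflows, which is what removes the bottleneck.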
Scalable Address Spaces Using RCU Balanced Trees
Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich.
In the Proceedings of the 17th International Conference on
Architectural Support for Programming Languages and Operating
Systems (ASPLOS), London, UK, March 2012.
Software developers commonly exploit multicore processors by
building multithreaded software in which all threads of an
application share a single address space. This shared address
space has a cost: kernel virtual memory operations such as
handling soft page faults, growing the address space, mapping
files, etc. can limit the scalability of these applications.
In widely-used operating systems, all of these operations are
synchronized by a single per-process lock. This paper
contributes a new design for increasing the concurrency of
kernel operations on a shared address space by exploiting
read-copy-update (RCU) so that soft page faults can both run
in parallel with operations that mutate the same address space
and avoid contending with other page faults on shared cache
lines. To enable such parallelism, this paper also introduces
an RCU-based binary balanced tree for storing memory mappings.
An experimental evaluation using three multithreaded
applications shows performance improvements on 80 cores
ranging from 1.7× to 3.4× for an implementation of
this design in the Linux 2.6.37 kernel. The RCU-based binary
tree enables soft page faults to run at a constant cost with
an increasing number of cores, suggesting that the design will
scale well beyond 80 cores.
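The read side of this design can be illustrated with a much-simplified sketch (names here are invented; the real implementation uses rcu_read_lock/rcu_dereference and the paper's balanced tree, not the plain unbalanced binary tree below): a fault handler walks the mapping tree using only atomic pointer loads, with no locks and no shared-cache-line writes, while writers publish new nodes with atomic stores and defer freeing replaced ones until readers are done.

```c
#include <stdatomic.h>
#include <stddef.h>

/* one node of a toy mapping tree, keyed by address range */
struct map_node {
    unsigned long start, end;           /* mapping covers [start, end) */
    _Atomic(struct map_node *) left;
    _Atomic(struct map_node *) right;
};

/* Lock-free lookup of the mapping containing addr, as a page-fault
   handler would do on its RCU read side. Each child pointer is loaded
   atomically (standing in for rcu_dereference). */
static struct map_node *map_lookup(_Atomic(struct map_node *) *root,
                                   unsigned long addr)
{
    struct map_node *n = atomic_load(root);
    while (n) {
        if (addr < n->start)
            n = atomic_load(&n->left);
        else if (addr >= n->end)
            n = atomic_load(&n->right);
        else
            return n;                   /* addr falls inside [start, end) */
    }
    return NULL;                        /* no mapping: a true fault */
}
```

Because readers never write shared state, concurrent faults on different cores touch no common cache lines, which is why fault cost stays constant as cores are added.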
Improving network connection locality on multicore systems
Aleksey Pesterev, Jacob Strauss, Nickolai Zeldovich, and
Robert T. Morris.
In the Proceedings of the ACM EuroSys Conference,
Bern, Switzerland, April 2012.
Incoming and outgoing processing for a given TCP connection often
execute on different cores: an incoming packet is typically processed
on the core that receives the interrupt, while outgoing data processing
occurs on the core running the relevant user code. As a result,
accesses to read/write connection state (such as TCP control blocks)
often involve cache invalidations and data movement between cores'
caches. These can take hundreds of processor cycles, enough to
significantly reduce performance. We present a new design, called
Affinity-Accept, that causes all processing for a given TCP connection
to occur on the same core. Affinity-Accept arranges for the network
interface to determine the core on which application processing for
each new connection occurs, in a lightweight way; it adjusts the card's
choices only in response to imbalances in CPU scheduling. Measurements
show that for the Apache web server serving static files on a 48-core
AMD system, Affinity-Accept reduces time spent in the TCP stack by 30%
and improves overall throughput by 24%.
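The mechanism underneath is flow steering: the NIC hashes each incoming packet's connection 4-tuple to pick a receive queue, so every packet of a connection lands on one core, and the design then keeps the application's accept and socket processing on that same core. As a toy illustration of the steering step only (real NICs such as the ixgbe use a Toeplitz hash and an indirection table, not this ad-hoc hash):

```c
/* hypothetical sketch: map a TCP connection's 4-tuple to one of
   ncores receive queues, so all packets of that flow hit one core */
static unsigned flow_core(unsigned saddr, unsigned daddr,
                          unsigned short sport, unsigned short dport,
                          unsigned ncores)
{
    unsigned h = saddr * 31u + daddr;              /* mix the addresses */
    h = h * 31u + (((unsigned)sport << 16) | dport); /* mix the ports */
    return h % ncores;                             /* pick a queue/core */
}
```

Because the hash is deterministic per flow, connection state such as the TCP control block is only ever touched from one core's cache, avoiding the cross-core invalidations the paper measures.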
Non-scalable locks are dangerous
Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.
In the Proceedings of the Linux Symposium, Ottawa, Canada, July 2012.
Several operating systems rely on non-scalable spin locks for
serialization. For example, the Linux kernel uses ticket spin
locks, even though scalable locks have better theoretical
properties. Using Linux on a 48-core machine, this paper
shows that non-scalable locks can cause dramatic collapse in
the performance of real workloads, even for very short
critical sections. The nature and sudden onset of collapse
are explained with a new Markov-based performance model.
Replacing the offending non-scalable spin locks with scalable
spin locks avoids the collapse and requires modest changes to the
kernel.
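The scalable locks used in this work are MCS-style queue locks (see the lock-bench repository below), in which each waiter spins on its own cache line and the holder hands the lock directly to its successor, instead of every waiter hammering one shared word as with ticket locks. A minimal user-space sketch with C11 atomics (the in-kernel implementation differs in detail):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* sketch of an MCS queue lock: each waiter spins on its own node,
   so a lock handoff touches two cores instead of broadcasting to all */
struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

static void mcs_lock(_Atomic(struct mcs_node *) *tail, struct mcs_node *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* append ourselves to the waiter queue; prev is the old tail */
    struct mcs_node *prev = atomic_exchange(tail, me);
    if (prev) {
        atomic_store(&prev->next, me);
        while (atomic_load(&me->locked))   /* spin on our own cache line */
            ;
    }
    /* prev == NULL: queue was empty, we hold the lock immediately */
}

static void mcs_unlock(_Atomic(struct mcs_node *) *tail, struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        /* no visible successor: try to reset the tail to empty */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong(tail, &expected, NULL))
            return;
        /* a successor is mid-enqueue; wait for it to link itself in */
        while (!(succ = atomic_load(&me->next)))
            ;
    }
    atomic_store(&succ->locked, false);    /* hand the lock to it */
}
```

The private-spin property is what avoids collapse: releasing a ticket lock invalidates a cache line that every waiter is polling, while releasing an MCS lock writes only the next waiter's node.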
- $ git clone https://pdos.csail.mit.edu/mosbench/mosbench.git
- The MOSBENCH driver and application suite. See the top-level README for
details on how to run the benchmark.
- $ git clone git://g.csail.mit.edu/pk.git
The patched Linux 2.6.35-rc5 kernel. Our changes are divided among 18
branches. The 'pk' branch is a merge of those branches.
The .config we use is also available.
This repository contains Linux 2.6.37 modified for
scalable address spaces. The 'rcuvm-fault-lock',
'rcuvm-hybrid', and 'rcuvm-pure' branches correspond to the
three refinements in the paper.
This repository contains MCS lock implementations for Linux
2.6.39. The 'mcs-anonvma',
'mcs-as', and 'mcs-dentry' branches correspond to the
experiments in the paper.
NOTE: The latest Linux kernels (2.6.38-rc5 and newer) include
many of the VFS scalability improvements that our patches aim to
provide. We recommend trying those kernels if you're interested
in a scalable VFS.
- $ git clone https://pdos.csail.mit.edu/mosbench/ixgbe.git
The patched ixgbe driver. Our changes are divided among multiple
branches. The 'unified' branch is a merge of those branches.
- $ git clone https://pdos.csail.mit.edu/mosbench/rcuvm.git
- Supplementary Scalable Address Spaces files: Kernel and
MOSBENCH configurations, build scripts, benchmark drivers, and
microbenchmarks. Also includes a copy of the Bonsai source.
- $ git clone https://pdos.csail.mit.edu/mosbench/parsec.git
- The Parsec 2.1 benchmark suite with modifications for
Scalable Address Spaces on the 'rcuvm' branch.
- $ git clone git://g.csail.mit.edu/lock-bench.git
- Benchmarks to stress various spin locks in the Linux kernel.