OS Bugs

Required reading: An emperical study of oeprating systems errors

Overview

Operating systems must obey many rules for correctness and performance. Examples rules:

In addition, there are standard software engineering rules, like use function results in consistent ways.

These rules are typically not checked by a compiler, even though they could be checked by a compiler, in principle. The goal of the meta-level compilation project is to allow system implementors to write system-specific compiler extensions that check the source code for rule violations.

The results are dramatic: many new bugs found (500-1000) in Linux alone. The paper for today studies these bugs and attempts to draw lessons from these bugs.

Metacompilation

A system programmer writes the rule checkers in a high-level, state-machine language (metal). These checkers are dynamically linked into an extensible version of g++, xg++. Xg++ applies the rule checkers to every possible execution path of a function that is being compiled.

An example rule from the OSDI paper:

sm check_interrupts {
   decl { unsigned} flags;
   pat enable = { sti(); } | {restore_flags(flags);} ;
   pat disable = { cli(); };
   
   is_enabled: disable ==> is_disabled | enable ==> { err("double
      enable")};
   ...
A more complete version found 82 errors in the Linux 2.3.99 kernel. Common mistake:
get_free_buffer ( ... ) {
   ....
   save_flags (flags);
   cli ();
   if ((bh = sh->buffer_pool) == NULL)
      return NULL;
   ....
}

Some checkers produce false positives, because of limitations of both static analysis and the checkers, which mostly use local analysis.

How does the block checker work? The first pass is a rule that marks functions as potentially blocking. After processing a function, the checker emits the function's flow graph to a file (including, annotations and functions called). The second pass takes the merged flow graph of all function calls, and produces a file with all functions that have a path in the control-flow-graph to a blocking function call. For the Linux kernel this results in 3,000 functions that potentially could call sleep. Yet another checker like check_interrupts checks if a function calls any of the 3,000 functions with interrupts disabled. Etc.

Case study: check 21 releases of Linux

The paper is primarily about the block, null, and var checker. The primary conclusions of the paper are:

Methodology

Paper discussion

This paper is best discussed by studying every figure.