6.828 2011 Lecture 21: Klee

From notes by Nickolai Zeldovich

another bug lecture
  o/s people have always worried about bugs
    since an o/s bug may affect many people/programs
  even more interested after the Internet became popular
    bug => security hole => read about your s/w in the newspaper
  of course attackers like bug-finding tools too!

how to find bugs?
  static: analyze the code and find potential problems
  runtime: run the code to see if it breaks
  metrics:
    false positive: claims a bug in OK code
    false negative: says nothing about buggy code
    both are a problem

static analysis examples:
  type checking
  variable used before initialized
  nice: few false negatives in a given domain
    compiler *will* find all the errors if I change fn(int x) to fn(int x, int y)
    compiler *proves* that your program has no typing errors
  but rejects some programs that would work (false positives):
    int pte = ...;
    char *p = (pte & 0xfffff000) + KERNBASE;

why does type checking work well?
  don't need to understand the program's spec
  local: no need to e.g. consider all pairs of lines
    can type-check x + f(x) without looking inside f()
  many properties can be statically checked in a type-like framework
  but correctly-typed is not the same as correct

static checking of full program correctness?
  e.g. that f(x)*f(x) == x
  not just low-level properties like typing
  yes; e.g. a recent proof that a microkernel implements its spec
  but: very hard for programs much larger than a page
    requires a formal specification; not easy
    lots of false positives
    may require re-writing correct programs to ease the proof
  someday, but not yet

what about run-time testing?
  run the program, watch what it does, detect errors
  examples:
    divide-by-zero
    page fault from a wild reference
    array bounds
    assertions
  pro: few false positives, since you observed the error
  con: only as good as the test cases, so many false negatives

run-time testing is usually program-specific
  assert()
  test suites -- inputs and desired outputs
    maybe add a test every time you fix a bug
  hard to build a complete test suite
    usually reactive; finds re-emergence of old bugs
  aiming for "coverage" may help
    automatic analysis of whether the suite executed every line
    full code coverage is hard
    Klee paper Table 2 says about 70% for coreutils

how to find unknown bugs w/ run-time testing?  fuzzers
  idea: generate random input and see what breaks
  1. find input sources
     command-line arguments, files
     network packets
  2. write input generation code
     completely random inputs are unlikely to get very "deep":
       if crc32(pkt) == pkt.checksum
       if method == "GET" ...
     usually need a smart input generator:
       generate pkts with random content but correct checksums
       make the packet content have a correct HTTP header
       make the URL follow the server's conventions
     time-consuming, and easy to inject programmer assumptions
  3. how to detect bugs?
     easy if the program crashes
     hard in general -- what's the right output for some random input?
     a fuzzer may need its own model of correctness
       e.g. web server should never send kaashoek's content to rtm

do fuzzers work?
  advantage: no false positives, if our bug detection rules are good
  advantage: no need to control the entire execution env
    can fuzz black-box systems over the network
    no need to recompile into LLVM, etc.
  advantage: no need for source code
  disadvantage: hard to cover everything
    can miss bugs because a particular input didn't happen to be tried
      e.g. if(argv[1] == "--foo")
    or because inputs must have complex structure
      e.g. hard to test a compiler with purely random input
    paper Figure 9 shows random tests often have very low coverage
      though often pretty high!
Klee uses the program's own code to generate structure for test cases
  understands if(argv[1] == "--foo")
  so high coverage (can get past "if" conditions)
  and finds inputs that cause trouble
  finds real bugs in real C programs (GNU Coreutils)

why I like Klee
  ordinary testing: does this input produce a failure?
    what about this other one?  a long process...
  Klee: here's a line of code that might conceivably fail (e.g. a ptr deref)
    is there any input that causes the program to fail there?
  powerful, gets results, and the techniques could probably be used for other purposes

how does Klee work?  "symbolic execution"
  compile the C source into bytecodes
  an interpreter executes the bytecodes

Klee's execution environment:
  a program has "state":
    program counter
    values in variables
    memory objects containing values
    a "path constraint"
  some values are "concrete" -- as in ordinary execution
  some values are "symbolic"
    either inputs -- symbols
    or expressions over symbols, the result of e.g. x + alpha

what happens when the Klee interpreter wants to run an llvm bytecode op?
  if all operands are concrete: run it as usual
  symbolic non-branches: create a new expression for the result
  symbolic branches: try both -- now *two* executions with cloned states
    add a "path constraint" to each, reflecting the branch condition
  to find e.g. a NULL dereference:
    ask the "constraint solver" to find values for the symbols
      satisfying the path constraint and p==0
    if the constraint solver says "yes",
      there is a set of inputs that will cause the failure

example:
  1. read x, y
  2. if x > y:
  3.   x = y
  4. if x < y:
  5.   x = x + 1
  6. assert(x + y != 7)

one path (a, b are the symbols alpha, beta):
  1. pc = { }, x = a, y = b
  2. pc = { a > b }, x = a, y = b (fork)
  3. pc = { a > b }, x = b, y = b
  4. same (no fork)
  6. cannot satisfy b + b == 7

another path:
  2. pc = { a <= b }, x = a, y = b
  4. pc = { a <= b AND a < b }, x = a, y = b
  5. pc = { a <= b AND a < b }, x = a + 1, y = b
  6. can (a + 1 + b) == 7 with a < b?  lots of solutions, e.g. a=0 b=6
Klee would report a bug with exactly such inputs
(e.g. a=0 b=6)

symbolic execution generates a tree of states
  leaves are waiting to execute
  many leaves, so there's a scheduling problem
  what would a loop look like in the tree?

how does Klee know what to check?
  checks every load/store for out-of-bounds
    as if "if(p < start || p >= end) panic()"
    thus Klee has a notion of distinct memory objects, not just memory
  checks every assertion
    assert(a < 100)
    assert(ptr != NULL)
  looks for exit(), to produce coverage test inputs

a real example with a bug: the paper's Figure 1
  pretty complex code, hard to spot the bugs
  input argv[1] &c initialized as an array of symbolic bytes
    with a fixed known size, 2 bytes in this example
  fork state at line 33
  we get to line 7, call expand(argv[1], ...)
  fork again at lines 3, 4, 15
  let's assume Klee is executing the "true" state of line 15's if
    arg = arg1+0 (symbolic)
    so the pc includes { *(arg1+0) == '[' AND argc > 1 AND ... }
  at line 18, arg = arg1+2
    Klee knows arg1 was only two bytes long
    thus out-of-bounds IF the pc can be satisfied
    yes: a one-byte first argument containing just [

what if open(symbolic-name)?
  e.g. cat opening a file specified on the command line
  can't just call the ordinary open() syscall
  Klee maintains a symbolic file system
    forks the state on open with a symbolic file name
    one state for each of a few symbolic files
    and one for a failed open
  why does Klee maintain more than one symbolic file?
    why not have every open(symname) refer to the same symbolic file?
    that would force "diff " to always diff a file against itself
    forking tries both "diff f f" and "diff f g"

how to handle uncertainty about what obj a ptr refers to?
  e.g. p = a[i] where i is symbolic
  need to know, in order to e.g. know how big the obj is for the bounds check
  fork, once for each possible value of i (and thus a[i])

how long might it take Klee to run?
  how does a constraint solver work?
    big bag of tricks, e.g. for x*x = 2
    may have to try every combination of values
      e.g. for MD5(x) = 1234
    so the constraint solver can take a very long time
    Klee eventually gives up

what if the constraint solver produces no answer?
  can't tell if e.g. that assert might fail
  just keep executing the state

what optimizations do they need to make Klee run fast?
  main cost is the constraint solver
  simplify path conditions, for faster constraint solving
    e.g. { a <= b AND a < b } becomes { a < b }
  try to solve each "if" branch, to avoid forking state
  copy-on-write states, for more compact states

how does Klee decide which state to execute?
  often thousands to choose from
  Klee time-slices among them -- but not uniformly
  main goal: good coverage
  * "random path selection"
    better than random state selection
    favors states high in the tree: less constrained, more paths open
    avoids the huge # of states created by loops w/ symbolic conditions
  * "coverage-optimized search" -- schedule states "near" uncovered code

does Klee have false positives?
  probably not
  maybe there are external constraints not known to Klee
    or the assertions weren't precise
  in any case, Klee tries to reproduce bugs on the unmodified app
    it knows all the inputs, and tries to set up the external environment to match

does Klee have false negatives?
  i.e. if Klee finds no bugs, does that mean there are no bugs?
  no: Klee doesn't have a precise defn of "bug"
    by default it finds crashes, but not incorrect answers
  if Klee finds no way to cause an assertion to fail, could it fail anyway?
    yes!  Klee and the constraint solver are time-limited
    the constraint solver might give up before finding a solution
    Klee might give up before finding the right path

Klee evaluation
  Table 2: test coverage
    try to make a set of tests that causes all statements to execute
    Klee beats the hand-generated tests
    why does Klee win?
    why is a big set of coverage tests by itself useful?
      note that Klee does *not* know if the test output is correct!
      (unlike the hand-generated tests)
  Figure 7: bugs found
    it found a bunch of real bugs -- ones the manual tests did not
    generally out-of-bounds mem refs; Klee didn't need to know the prog specs
    e.g. the paste -d\\ bug is much like the Figure 1 tr example
    some of the inputs involve file content as well as arguments
      from files with symbolic contents, Klee deduced the content
      required to force a crash

how could you use Klee for the 6.828 labs?
  what bugs did you introduce?  would Klee catch them?