6.828 2011 Lecture 21: Klee

From notes by Nickolai Zeldovich

another bug lecture
  o/s people have always worried about bugs
    since an o/s bug may affect many people/programs
  even more interested after the Internet became popular
    bug => security hole => read about your s/w in the newspaper
  of course attackers like bug-finding tools too!

how to find bugs?
  static: analyze the code and find potential problems
  runtime: run the code to see if it breaks
  metrics:
    false positive: claims a bug in OK code
    false negative: says nothing about buggy code
    both are a problem

static analysis examples:
  type checking
  variable used before initialized
  nice: few false negatives in a given domain
    compiler *will* find all the errors if I change fn(int x) to fn(int x, int y)
    compiler *proves* that your program has no typing errors
  but rejects some programs that would work (false positives):
    int pte = ...;
    char *p = (pte & 0xfffff000) + KERNBASE;

why does type checking work well?
  don't need to understand the program's spec
  local: no need to e.g. consider all pairs of lines
    can type-check x + f(x) without looking inside f()
  many properties can be statically checked in a type-like framework
  but correctly-typed is not the same as correct

static checking of full program correctness?
  e.g. that f(x)*f(x) == x
  not just low-level properties like typing
  yes; e.g. a recent proof that a microkernel implements its spec
  but: very hard for programs much larger than a page
    requires a formal specification; not easy
    lots of false positives
    may require re-writing correct programs to ease the proof
  someday, but not yet

what about run-time testing?
  run the program, watch what it does, detect errors
  examples:
    divide-by-zero
    page fault from a wild reference
    array bounds
    assertions
  pro: few false positives, since you observed the error
  con: only as good as the test cases, so many false negatives

run-time testing is usually program-specific
  assert()
  test suites -- inputs and desired outputs
    maybe add a test every time you fix a bug
  hard to build a complete test suite
    usually reactive; finds re-emergence of old bugs
  aiming for "coverage" may help
    automatic analysis of whether the suite executed every line
    full code coverage is hard
    Klee paper Table 2 says about 70% for coreutils

how to find unknown bugs w/ run-time testing?  fuzzers
  idea: generate random input and see what breaks
  1. find input sources
     command-line arguments, files
     network packets
  2. write input generation code
     completely random inputs are unlikely to get very "deep":
       if crc32(pkt) == pkt.checksum
       if method == "GET" ...
     usually need a smart input generator:
       generate pkts with random content but correct checksums
       make the packet content have a correct HTTP header
       make the URL follow the server's conventions
     time-consuming, and easy to inject programmer assumptions
  3. how to detect bugs?
     easy if the program crashes
     hard in general -- what's the right output for some random input?
     a fuzzer may need its own model of correctness
       e.g. web server should never send kaashoek's content to rtm

do fuzzers work?
  advantage: no false positives, if our bug detection rules are good
  advantage: no need to control the entire execution env
    can fuzz black-box systems over the network
    no need to recompile into LLVM, etc.
  advantage: no need for source code
  disadvantage: hard to cover everything
    can miss bugs because a particular input didn't happen to be tried
      e.g. if(argv[1] == "--foo")
    or because inputs must have complex structure
      e.g. hard to test a compiler with purely random input
    paper Figure 9 shows random tests often have very low coverage
      though often pretty high!
Klee uses the program's own code to generate structure for test cases
  understands if(argv[1] == "--foo")
  so high coverage (can get past "if" conditions)
  and finds inputs that cause trouble
  finds real bugs in real C programs (GNU Coreutils)

why I like Klee
  ordinary testing: does this input produce a failure?
    what about this other one?  a long process...
  Klee: here's a line of code that might conceivably fail (e.g. a ptr deref)
    is there any input that causes the program to fail there?
  powerful, gets results, and the techniques could probably be used for other purposes

how does Klee work?  "symbolic execution"
  compile the C source into bytecodes
  an interpreter executes the bytecodes

Klee's execution environment:
  a program has "state":
    program counter
    values in variables
    memory objects containing values
    a "path constraint"
  some values are "concrete" -- as in ordinary execution
  some values are "symbolic"
    either inputs -- symbols
    or expressions over symbols, the result of e.g. x + alpha

what happens when the Klee interpreter wants to run an llvm bytecode op?
  if all operands are concrete: run it as usual
  symbolic non-branches: create a new expression for the result
  symbolic branches: try both -- now *two* executions with cloned states
    add a "path constraint" to each, reflecting the branch condition
  to find e.g. a NULL dereference:
    ask the "constraint solver" to find values for the symbols
      satisfying the path constraint and p==0
    if the constraint solver says "yes",
      there is a set of inputs that will cause the failure

example:
  1. read x, y
  2. if x > y:
  3.   x = y
  4. if x < y:
  5.   x = x + 1
  6. assert(x + y != 7)

one path (a, b are the symbols alpha, beta):
  1. pc = { }, x = a, y = b
  2. pc = { a > b }, x = a, y = b (fork)
  3. pc = { a > b }, x = b, y = b
  4. same (no fork)
  6. cannot satisfy b + b == 7

another path:
  2. pc = { a <= b }, x = a, y = b
  4. pc = { a <= b AND a < b }, x = a, y = b
  5. pc = { a <= b AND a < b }, x = a + 1, y = b
  6. can (a + 1 + b) == 7 with a < b?  lots of solutions, e.g. a=0 b=6
Klee would report a bug with exactly such inputs
(e.g. a=0 b=6)

symbolic execution generates a tree of states
  leaves are waiting to execute
  many leaves, so there's a scheduling problem
  what would a loop look like in the tree?

how does Klee know what to check?
  checks every load/store for out-of-bounds
    as if "if(p < start || p >= end) panic()"
    thus Klee has a notion of distinct memory objects, not just memory
  checks every assertion
    assert(a < 100)
    assert(ptr != NULL)
  looks for exit(), to produce coverage test inputs

a real example with a bug: the paper's Figure 1
  pretty complex code, hard to spot the bugs
  input argv[1] &c initialized as an array of symbolic bytes
    with a fixed known size, 2 bytes in this example
  fork state at line 33
  we get to line 7, call expand(argv[1], ...)
  fork again at lines 3, 4, 15
  let's assume Klee is executing the "true" state of line 15's if
    arg = arg1+0 (symbolic)
    so the pc includes { *(arg1+0) == '[' AND argc > 1 AND ... }
  at line 18, arg = arg1+2
    Klee knows arg1 was only two bytes long
    thus out-of-bounds IF the pc can be satisfied
    yes: a one-byte first argument containing just [

what if open(symbolic-name)?
  e.g. cat opening a file specified on the command line
  can't just call the ordinary open() syscall
  Klee maintains a symbolic file system
    forks the state on open with a symbolic file name
    one state for each of a few symbolic files
    and one for a failed open
  why does Klee maintain more than one symbolic file?
    why not have every open(symname) refer to the same symbolic file?
    that would force "diff " to always diff a file against itself
    forking tries both "diff f f" and "diff f g"

how to handle uncertainty about what obj a ptr refers to?
  e.g. p = a[i] where i is symbolic
  need to know, in order to e.g. know how big the obj is for the bounds check
  fork, once for each possible value of i (and thus a[i])

how long might it take Klee to run?
  how does a constraint solver work?
    big bag of tricks, e.g. for x*x = 2
    may have to try every combination of values
      e.g. for MD5(x) = 1234
    so the constraint solver can take a very long time
    Klee eventually gives up

what if the constraint solver produces no answer?
  can't tell if e.g. that assert might fail
  just keep executing the state

what optimizations do they need to make Klee run fast?
  main cost is the constraint solver
  simplify path conditions, for faster constraint solving
    e.g. { a <= b AND a < b } becomes { a < b }
  try to solve each "if" branch, to avoid forking state
  copy-on-write states, for more compact states

how does Klee decide which state to execute?
  often thousands to choose from
  Klee time-slices among them -- but not uniformly
  main goal: good coverage
  * "random path selection"
    better than random state selection
    favors states high in the tree: less constrained, more paths open
    avoids the huge # of states created by loops w/ symbolic conditions
  * "coverage-optimized search" -- schedule states "near" uncovered code

does Klee have false positives?
  probably not
  maybe there are external constraints not known to Klee
    or the assertions weren't precise
  in any case, Klee tries to reproduce bugs on the unmodified app
    it knows all the inputs, and tries to set up the external environment to match

does Klee have false negatives?
  i.e. if Klee finds no bugs, does that mean there are no bugs?
  no: Klee doesn't have a precise defn of "bug"
    by default it finds crashes, but not incorrect answers
  if Klee finds no way to cause an assertion to fail, could it fail anyway?
    yes!  Klee and the constraint solver are time-limited
    the constraint solver might give up before finding a solution
    Klee might give up before finding the right path

Klee evaluation
  Table 2: test coverage
    try to make a set of tests that causes all statements to execute
    Klee beats the hand-generated tests
    why does Klee win?
    why is a big set of coverage tests by itself useful?
      note that Klee does *not* know if the test output is correct!
      (unlike the hand-generated tests)
  Figure 7: bugs found
    it found a bunch of real bugs -- ones the manual tests did not
    generally out-of-bounds mem refs; Klee didn't need to know the prog specs
    e.g. the paste -d\\ bug is much like the Figure 1 tr example
    some of the inputs involve file content as well as arguments
      from files with symbolic contents, Klee deduced the content
      required to force a crash

how could you use Klee for the 6.828 labs?
  what bugs did you introduce?  would Klee catch them?