Software Testing CS 408. Lecture 6: Dynamic Symbolic Execution and Concolic Testing 1/30/18

1 Software Testing CS 408 Lecture 6: Dynamic Symbolic Execution and Concolic Testing 1/30/18

2 Relevant Papers

CUTE: A Concolic Unit Testing Engine for C
Koushik Sen, Darko Marinov, Gul Agha. Department of Computer Science, University of Illinois at Urbana-Champaign. {ksen,marinov,agha}@cs.uiuc.edu
ESEC-FSE '05, September 5-9, 2005, Lisbon, Portugal. Copyright 2005 ACM.

Abstract: In unit testing, a program is decomposed into units which are collections of functions. A part of a unit can be tested by generating inputs for a single entry function. The entry function may contain pointer arguments, in which case the inputs to the unit are memory graphs. The paper addresses the problem of automating unit testing with memory graphs as inputs. The approach used builds on previous work combining symbolic and concrete execution, and more specifically, using such a combination to generate test inputs to explore all feasible execution paths. The current work develops a method to represent and track constraints that capture the behavior of a symbolic execution of a unit with memory graphs as inputs. Moreover, an efficient constraint solver is proposed to facilitate incremental generation of such test inputs. Finally, CUTE, a tool implementing the method is described together with the results of applying CUTE to real-world examples of C code.

Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging
General Terms: Reliability, Verification
Keywords: concolic testing, random testing, explicit path model-checking, data structure testing, unit testing, testing C programs.

1. Introduction: Unit testing is a method for modular testing of a program's functional behavior. A program is decomposed into units, where each unit is a collection of functions, and the units are independently tested. Such testing requires specification of values for the inputs (or test inputs) to the unit. Manual specification of such values is labor intensive and cannot guarantee that all possible behaviors of the unit will be observed during the testing. In order to improve the range of behaviors observed (or test coverage), several techniques have been proposed to automatically generate values for the inputs. One such technique is to randomly choose the values over the domain of potential inputs [4,8,10,21]. The problem with such random testing is twofold: first, many sets of values may lead to the same observable behavior and are thus redundant, and second, the probability of selecting particular inputs that cause buggy behavior may be astronomically small [20].

One approach which addresses the problem of redundant executions and increases test coverage is symbolic execution [1,3,9,22,23,27,28,30]. In symbolic execution, a program is executed using symbolic variables in place of concrete values for inputs. Each conditional expression in the program represents a constraint that determines an execution path. Observe that the feasible executions of a program can be represented as a tree, where the branch points in a program are internal nodes of the tree. The goal is to generate concrete values for inputs which would result in different paths being taken. The classic approach is to use depth first exploration of the paths by backtracking [14]. Unfortunately, for large or complex units, it is computationally intractable to precisely maintain and solve the constraints required for test generation.

To the best of our knowledge, Larson and Austin were the first to propose combining concrete and symbolic execution [16]. In their approach, the program is executed on some user-provided concrete input values. Symbolic path constraints are generated for the specific execution. These constraints are solved, if feasible, to see whether there are potential input values that would have led to a violation along the same execution path. This improves coverage while avoiding the computational cost associated with full-blown symbolic execution which exercises all possible execution paths.

Godefroid et al. proposed incrementally generating test inputs by combining concrete and symbolic execution [11]. In Godefroid et al.'s approach, during a concrete execution, a conjunction of symbolic constraints along the path of the execution is generated. These constraints are modified and then solved, if feasible, to generate further test inputs which would direct the program along alternative paths. Specifically, they systematically negate the conjuncts in the path constraint to provide a depth first exploration of all paths in the computation tree. If it is not feasible to solve the modified constraints, Godefroid et al. propose simply substituting random concrete values.

A challenge in applying Godefroid et al.'s approach is to provide methods which extract and solve the constraints generated by a program. This problem is particularly complex for programs which have dynamic data structures using [...]

KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs
Cristian Cadar, Daniel Dunbar, Dawson Engler. Stanford University.
(Author names are in alphabetical order. Daniel Dunbar is the main author of the KLEE system.)

Abstract: We present a new symbolic execution tool, KLEE, capable of automatically generating tests that achieve high coverage on a diverse set of complex and environmentally-intensive programs. We used KLEE to thoroughly check all 89 stand-alone programs in the GNU COREUTILS utility suite, which form the core user-level environment installed on millions of Unix systems, and arguably are the single most heavily tested set of open-source programs in existence. KLEE-generated tests achieve high line coverage, on average over 90% per tool (median: over 94%), and significantly beat the coverage of the developers' own hand-written test suites. When we did the same for 75 equivalent tools in the BUSYBOX embedded system suite, results were even better, including 100% coverage on 31 of them. We also used KLEE as a bug finding tool, applying it to 452 applications (over 430K total lines of code), where it found 56 serious bugs, including three in COREUTILS that had been missed for over 15 years. Finally, we used KLEE to cross-check purportedly identical BUSYBOX and COREUTILS utilities, finding functional correctness errors and a myriad of inconsistencies.

1 Introduction: Many classes of errors, such as functional correctness bugs, are difficult to find without executing a piece of code. The importance of such testing, combined with the difficulty and poor performance of random and manual approaches, has led to much recent work in using symbolic execution to automatically generate test inputs [11, 14-16, 20-22, 24, 26, 27, 36]. At a high level, these tools use variations on the following idea: Instead of running code on manually- or randomly-constructed input, they run it on symbolic input initially allowed to be "anything". They substitute program inputs with symbolic values and replace corresponding concrete program operations with ones that manipulate symbolic values. When program execution branches based on a symbolic value, the system (conceptually) follows both branches, on each path maintaining a set of constraints called the path condition which must hold on execution of that path. When a path terminates or hits a bug, a test case can be generated by solving the current path condition for concrete values. Assuming deterministic code, feeding this concrete input to a raw, unmodified version of the checked code will make it follow the same path and hit the same bug.

Results are promising. However, while researchers have shown such tools can sometimes get good coverage and find bugs on a small number of programs, it has been an open question whether the approach has any hope of consistently achieving high coverage on real applications. Two common concerns are (1) the exponential number of paths through code and (2) the challenges in handling code that interacts with its surrounding environment, such as the operating system, the network, or the user (colloquially: "the environment problem"). Neither concern has been much helped by the fact that most past work, including ours, has usually reported results on a limited set of hand-picked benchmarks and typically has not included any coverage numbers.

This paper makes two contributions. First, we present a new symbolic execution tool, KLEE, which we designed for robust, deep checking of a broad range of applications, leveraging several years of lessons from our previous tool, EXE [16]. KLEE employs a variety of constraint solving optimizations, represents program states compactly, and uses search heuristics to get high code coverage. Additionally, it uses a simple and straightforward approach to dealing with the external environment. These features improve KLEE's performance by over an order of magnitude and let it check a broad range of system-intensive programs "out of the box".

Automated Whitebox Fuzz Testing
Patrice Godefroid, Microsoft (Research), pg@microsoft.com. Michael Y. Levin, Microsoft (CSE), mlevin@microsoft.com. David Molnar, UC Berkeley, dmolnar@eecs.berkeley.edu.
(The work of this author was done while visiting Microsoft.)

Abstract: Fuzz testing is an effective technique for finding security vulnerabilities in software. Traditionally, fuzz testing tools apply random mutations to well-formed inputs of a program and test the resulting values. We present an alternative whitebox fuzz testing approach inspired by recent advances in symbolic execution and dynamic test generation. Our approach records an actual run of the program under test on a well-formed input, symbolically evaluates the recorded trace, and gathers constraints on inputs capturing how the program uses these. The collected constraints are then negated one by one and solved with a constraint solver, producing new inputs that exercise different control paths in the program. This process is repeated with the help of a code-coverage maximizing heuristic designed to find defects as fast as possible. We have implemented this algorithm in SAGE (Scalable, Automated, Guided Execution), a new tool employing x86 instruction-level tracing and emulation for whitebox fuzzing of arbitrary file-reading Windows applications. We describe key optimizations needed to make dynamic test generation scale to large input files and long execution traces with hundreds of millions of instructions. We then present detailed experiments with several Windows applications. Notably, without any format-specific knowledge, SAGE detects the MS ANI vulnerability, which was missed by extensive blackbox fuzzing and static analysis tools. Furthermore, while still in an early stage of development, SAGE has already discovered 30+ new bugs in large shipped Windows applications including image processors, media players, and file decoders. Several of these bugs are potentially exploitable memory access violations.

1 Introduction: Since the "Month of Browser Bugs" released a new bug each day of July 2006 [25], fuzz testing has leapt to prominence as a quick and cost-effective method for finding serious security defects in large applications. Fuzz testing is a form of blackbox random testing which randomly mutates well-formed inputs and tests the program on the resulting data [13, 30, 1, 4]. In some cases, grammars are used to generate the well-formed inputs, which also allows encoding application-specific knowledge and test heuristics. Although fuzz testing can be remarkably effective, the limitations of blackbox testing approaches are well-known. For instance, the then branch of the conditional statement "if (x==10) then" has only one in 2^32 chances of being exercised if x is a randomly chosen 32-bit input value. This intuitively explains why random testing usually provides low code coverage [28]. In the security context, these limitations mean that potentially serious security bugs, such as buffer overflows, may be missed because the code that contains the bug is not even exercised.

We propose a conceptually simple but different approach of whitebox fuzz testing. This work is inspired by recent advances in systematic dynamic test generation [16, 7]. Starting with a fixed input, our algorithm symbolically executes the program, gathering input constraints from conditional statements encountered along the way. The collected constraints are then systematically negated and solved with a constraint solver, yielding new inputs that exercise different execution paths in the program. This process is repeated using a novel search algorithm with a coverage-maximizing heuristic designed to find defects as fast as possible. For example, symbolic execution of the above fragment on the input x = 0 generates the constraint x ≠ 10. Once this constraint is negated and solved, it yields x = 10, which gives us a new input that causes the program to follow the then branch of the given conditional statement. This allows us to exercise and test additional code for security bugs, even without specific knowledge of the input format. Furthermore, this approach automatically discovers and tests corner cases where programmers may fail to properly allocate memory or manipulate buffers, leading to security vulnerabilities.

In theory, systematic dynamic test generation can lead to full program path coverage, i.e., program verification [16]. In practice, however, the search is typically incomplete both because the number of execution paths in the program un[...]

3 Blackbox vs Whitebox Testing
Korat, Randoop, and QuickCheck effectively employ blackbox methods
- They don't try to analyze the program or methods under test
- Instead, they use random testing along with clever heuristics or user-provided directives (e.g., properties) to improve coverage
In contrast, whitebox testing methods analyze the source and use the output of the analysis to drive testcase generation

4 Dynamic Symbolic Execution
A specific (and particularly useful) kind of whitebox testing is dynamic symbolic execution
Basic idea:
- Maintain both concrete and symbolic representations of program state at every program point
- Use a theorem prover or constraint solver to guide execution along unexplored paths
- Goal: explore all paths in the method under test

5 Program Paths and Computation Trees [slide content not reproduced]

6 Example [slide content not reproduced]

7 Dynamic Techniques
When reachability conditions depend upon specific values drawn from a large universe of values, random testing is unlikely to be effective

8 Symbolic Execution
Rather than simply relying on random execution to reach an error state, we might symbolically evaluate the program. A symbolic value represents a set of concrete values (e.g., a type). Collect constraints on these values along different paths. Use a theorem prover to reason about these constraints.
- To reach the error state, we need to discover a value for x such that (x * 3 == 15) /\ (x % 5 != 0)
- A theorem prover would respond that no such x exists (the first conjunct forces x = 5, and 5 % 5 == 0), allowing us to conclude that the error condition cannot arise.
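The program on the slide is not reproduced in this transcript, so the following is only a minimal Python sketch under assumptions. `program` is a hypothetical function whose error branch is guarded by the two conditions in the constraint above, and `find_witness` is a bounded exhaustive search standing in for the theorem prover. A real prover reasons about the constraint symbolically instead of enumerating values.

```python
def program(x):
    """Hypothetical program under test (assumed shape, not from the slides)."""
    if x * 3 == 15:
        if x % 5 == 0:
            return "ok"
        return "error"  # reachable only if x * 3 == 15 and x % 5 != 0
    return "ok"


def error_path_condition(x):
    """Path condition of the error branch: (x * 3 == 15) and (x % 5 != 0)."""
    return x * 3 == 15 and x % 5 != 0


def find_witness(condition, lo=-10_000, hi=10_000):
    """Bounded brute-force stand-in for a theorem prover.

    Returns a value in [lo, hi] that satisfies `condition`, or None.
    """
    for candidate in range(lo, hi + 1):
        if condition(candidate):
            return candidate
    return None


print(find_witness(error_path_condition))     # None: no input reaches the error
print(find_witness(lambda x: x * 3 == 15))    # 5: the outer branch alone is reachable
```

The bounded search can only report that no witness exists within its range; the conclusion that the error is unreachable for every x is what the prover contributes.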

9 Challenges
- Complexity of symbolic constraints might overwhelm the theorem prover
- How do we define symbolic abstractions of data structures and the heap?
- E.g., how do we encode properties like heap reachability?
- What about library calls or methods that are dynamically linked?
- Scalability?

10 Dynamic Symbolic Execution (Concolic Testing)
This approach allows the theorem prover to be incomplete
- concrete values are used to simplify constraints
No unsatisfiable constraint will be deemed satisfiable
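The core loop of concolic testing can be sketched in a few lines of Python. This is an illustration under assumptions, not the algorithm of any particular tool: `program`, `solve`, and `concolic` are hypothetical names, branch predicates are recorded as plain functions of the input, and a bounded search stands in for the constraint solver. The sketch shows only the run, record, negate, and solve cycle; it does not show the use of concrete values to simplify constraints.

```python
def program(x, trace):
    """Hypothetical program under test; each branch logs (predicate, outcome)."""
    def branch(pred):
        taken = pred(x)
        trace.append((pred, taken))
        return taken

    if branch(lambda v: v > 0):
        if branch(lambda v: v * v == 49):  # unlikely to be hit by random inputs
            return "error"
    return "ok"


def solve(constraints, lo=-1000, hi=1000):
    """Bounded stand-in for a solver: find v with pred(v) == wanted for all pairs."""
    for v in range(lo, hi + 1):
        if all(pred(v) == wanted for pred, wanted in constraints):
            return v
    return None


def concolic(prog, seed):
    """Run concretely, record the path constraint, negate one branch, re-solve."""
    worklist, seen, results = [seed], set(), {}
    while worklist:
        x = worklist.pop()
        trace = []
        outcome = prog(x, trace)
        path = tuple(taken for _, taken in trace)
        if path in seen:
            continue
        seen.add(path)
        results[path] = (x, outcome)
        for i, (pred, taken) in enumerate(trace):
            # Keep the prefix of the path and flip the i-th branch outcome.
            new_input = solve(trace[:i] + [(pred, not taken)])
            if new_input is not None:
                worklist.append(new_input)
    return results


results = concolic(program, seed=0)
print(len(results))             # 3 distinct paths explored
print(results[(True, True)])    # (7, 'error')
```

Starting from the seed 0, the loop first flips `x > 0`, then flips `x * x == 49` along the new path, which is how it reaches the input 7 without guessing it.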

11-22 Example [slide content not reproduced]

23 Constraint Exploration
What are the different constraints that a dynamic symbolic execution strategy might solve in exploring the following computation tree? [computation tree not reproduced]

24 Another Example (Overcoming Complexity) [slide content not reproduced]

25-31 Another Example [slide content not reproduced]

32-35 Soundness [slide content not reproduced]

36 Dynamic vs. Symbolic Execution
A dynamic analysis never examines an execution that could not occur.
- Never returns false positives: every error is a true error
- But, may have false negatives: not all (latent) errors will be discovered
Symbolic execution attempts to examine all paths
- But, not all such paths are actually realizable (feasible)
- It can thus have false positives: not every error it reports will be a true error
- But, will never have false negatives: it will never incorrectly declare a program to be error-free (thus, it is always sound)
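One way to see where the false positives come from is a toy program with two correlated branches. The sketch below is an illustration under assumptions and is not taken from the slides: `run`, `syntactic_error_paths`, and `is_feasible` are hypothetical names, and a bounded search stands in for a constraint solver. Enumerating branch outcomes without checking the path condition flags an error path that no concrete input can follow.

```python
from itertools import product


def run(x):
    """Hypothetical program: the error needs x > 10 and x < 5 at the same time."""
    a = 1 if x > 10 else 0  # branch 1
    if x < 5:               # branch 2
        if a == 1:
            return "error"
    return "ok"


def syntactic_error_paths():
    """Enumerate branch outcomes (x > 10?, x < 5?) while ignoring feasibility."""
    return [(b1, b2) for b1, b2 in product([True, False], repeat=2)
            if b1 and b2]   # the error is reached when both branches are taken


def is_feasible(path, lo=-1000, hi=1000):
    """Bounded stand-in for a solver: does any input follow this path?"""
    b1, b2 = path
    return any((x > 10) == b1 and (x < 5) == b2 for x in range(lo, hi + 1))


reported = syntactic_error_paths()
print(reported)                                             # [(True, True)]
print([is_feasible(p) for p in reported])                   # [False]
print(any(run(x) == "error" for x in range(-1000, 1001)))   # False
```

Concrete execution never produces the error, which matches the dynamic side of the comparison: it reports only errors that actually occur, at the cost of possibly missing inputs it never tried.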

37 Dynamic Symbolic Execution [slide content not reproduced]

38 Another Example [slide content not reproduced]

39 Dealing with Data Structures
What is the likelihood random testing would reach the error state? To trigger the error, a randomly generated testcase would need to:
- pick x > 0
- pick p to be a non-empty linked list
- set the contents of the first element to be the value of foo(x)
- set the tail of the first element to point to p

40-49 Dealing with Data Structures [slide content not reproduced]

50-51 Summary [slide content not reproduced]
