Software Testing CS 408. Lecture 10: Compiler Testing 2/15/18


Andrew Smith

1 Software Testing CS 408 Lecture 10: Compiler Testing 2/15/18

2 Compilers
Clearly a critical part of any software system, and themselves complex pieces of software.
- How should they be tested?
- Random testing?
- Black-box methods?
- White-box methods?
Challenges
- Reasoning about the input space
- Understanding program transformations

3 Finding and Understanding Bugs in C Compilers
Xuejun Yang, Yang Chen, Eric Eide, John Regehr
University of Utah, School of Computing
{jxyang, chenyang, eeide, regehr}@cs.utah.edu
Compiler Validation via Equivalence Modulo Inputs
Vu Le, Mehrdad Afshari, Zhendong Su
Department of Computer Science, University of California, Davis, USA
{vmle, mafshari, su}@ucdavis.edu
Abstract
Compilers should be correct. To improve the quality of C compilers, we created Csmith, a randomized test-case generation tool, and spent three years using it to find compiler bugs. During this period we reported more than 325 previously unknown bugs to compiler developers. Every compiler we tested was found to crash and also to silently generate wrong code when presented with valid input. In this paper we present our compiler-testing tool and the results of our bug-hunting study. Our first contribution is to advance the state of the art in compiler testing. Unlike previous tools, Csmith generates programs that cover a large subset of C while avoiding the undefined and unspecified behaviors that would destroy its ability to automatically find wrong-code bugs. Our second contribution is a collection of qualitative and quantitative results about the bugs we have found in open-source C compilers.
Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging - testing tools; D.3.2 [Programming Languages]: Language Classifications - C; D.3.4 [Programming Languages]: Processors - compilers
General Terms: Languages, Reliability
Keywords: compiler testing, compiler defect, automated testing, random testing, random program generation
1. Introduction
The theory of compilation is well developed, and there are compiler frameworks in which many optimizations have been proved correct. Nevertheless, the practical art of compiler construction involves a morass of trade-offs between compilation speed, code quality, code debuggability, compiler modularity, compiler retargetability, and other goals.
It should be no surprise that optimizing compilers, like all complex software systems, contain bugs. Miscompilations often happen because optimization safety checks are inadequate, static analyses are unsound, or transformations are flawed. These bugs are out of reach for current and future automated program-verification tools because the specifications that need to be checked were never written down in a precise way, if they were written down at all. Where verification is impractical, however, other methods for improving compiler quality can succeed. This paper reports our experience in using testing to make C compilers better.
© ACM. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the 2011 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), San Jose, CA, Jun. 2011.
    int foo (void) {
      signed char x = 1;
      unsigned char y = 255;
      return x > y;
    }
Figure 1. We found a bug in the version of GCC that shipped with Ubuntu Linux for x86. At all optimization levels it compiles this function to return 1; the correct result is 0. The Ubuntu compiler was heavily patched; the base version of GCC did not have this bug.
We created Csmith, a randomized test-case generator that supports compiler bug-hunting using differential testing. Csmith generates a C program; a test harness then compiles the program using several compilers, runs the executables, and compares the outputs. Although this compiler-testing approach has been used before [6, 16, 23], Csmith's test-generation techniques substantially advance the state of the art by generating random programs that are expressive, containing complex code using many C language features, while also ensuring that every generated program has a single interpretation.
To have a unique interpretation, a program must not execute any of the 191 kinds of undefined behavior, nor depend on any of the 52 kinds of unspecified behavior, that are described in the C99 standard.
For the past three years, we have used Csmith to discover bugs in C compilers. Our results are perhaps surprising in their extent: to date, we have found and reported more than 325 bugs in mainstream C compilers including GCC, LLVM, and commercial tools. Figure 1 shows a representative example. Every compiler that we have tested, including several that are routinely used to compile safety-critical embedded systems, has been crashed and also shown to silently miscompile valid inputs. As measured by the responses to our bug reports, the defects discovered by Csmith are important. Most of the bugs we have reported against GCC and LLVM have been fixed. Twenty-five of our reported GCC bugs have been classified as P1, the maximum, release-blocking priority for GCC defects. Our results suggest that fixed test suites, the main way that compilers are tested, are an inadequate mechanism for quality control.
We claim that Csmith is an effective bug-finding tool in part because it generates tests that explore atypical combinations of C language features. Atypical code is not unimportant code, however; it is simply underrepresented in fixed compiler test suites. Developers who stray outside the well-tested paths that represent a compiler's "comfort zone" (for example by writing kernel code or embedded systems code, using esoteric compiler options, or automatically generating code) can encounter bugs quite frequently. This is a significant problem for complex systems. Wolfe [30], talking about independent software vendors (ISVs), says: "An ISV with a complex code can work around correctness, turn off the optimizer in one or two files, and usually they have to do that for any of the compilers they use" (emphasis ours).
As another example, the front
Abstract
We introduce equivalence modulo inputs (EMI), a simple, widely applicable methodology for validating optimizing compilers. Our key insight is to exploit the close interplay between (1) dynamically executing a program on some test inputs and (2) statically compiling the program to work on all possible inputs. Indeed, the test inputs induce a natural collection of the original program's EMI variants, which can help differentially test any compiler and specifically target the difficult-to-find miscompilations. To create a practical implementation of EMI for validating C compilers, we profile a program's test executions and stochastically prune its unexecuted code. Our extensive testing in eleven months has led to 147 confirmed, unique bug reports for GCC and LLVM alone. The majority of those bugs are miscompilations, and more than 100 have already been fixed. Beyond testing compilers, EMI can be adapted to validate program transformation and analysis systems in general. This work opens up this exciting, new direction.
Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging - testing tools; D.3.2 [Programming Languages]: Language Classifications - C; H.3.4 [Programming Languages]: Processors - compilers
General Terms: Algorithms, Languages, Reliability, Verification
Keywords: Compiler testing, miscompilation, equivalent program variants, automated testing
1. Introduction
Compilers are among the most important, widely used and complex software ever written. Decades of extensive research and development have led to much increased compiler performance and reliability. Perhaps less known to application programmers is that production compilers do also contain bugs, and in fact quite a few. However, compiler bugs are hard to distinguish from the much more frequent bugs in applications because often they manifest only indirectly as application failures.
Thus, when compiler bugs occur, they frustrate programmers and may lead to unintended application behavior and disasters, especially in safety-critical domains. Compiler verification has been an important and fruitful area for the verification grand challenge in computing research [9].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. PLDI '14, June 9-11, 2014, Edinburgh, United Kingdom. Copyright 2014 ACM /14/06... $
Besides traditional manual code review and testing, the main compiler validation techniques include testing against popular validation suites (such as Plum Hall [21] and SuperTest [1]), verification [12, 13], translation validation [20, 22], and random testing [28]. These approaches have complementary benefits. For example, CompCert [12, 13] is a formally verified optimizing compiler for a subset of C, targeting the embedded software domain. It is an ambitious project, but much work remains to have a fully verified production compiler that is correct end-to-end. Another good example is Csmith [28], a recent work that generates random C programs to stress-test compilers. To date, it has found a few hundred bugs in GCC and LLVM, and helped improve the quality of the most widely used C compilers. Despite this incredible success, the majority of the reported bugs were compiler crashes, as it is difficult to steer its random program generation to specifically exercise a compiler's most critical components, its optimization phases.
We defer to Section 5 for a detailed survey of related work.
Equivalence Modulo Inputs (EMI)
This paper introduces a simple, broadly applicable concept for validating compilers. Our vision is to take existing real-world code and transform it in a novel, systematic way to produce different, but equivalent variants of the original code. To this end, we introduce equivalence modulo inputs (EMI) for a practical, concrete realization of the vision. The key insight behind EMI is to exploit the interplay between dynamically executing a program P on a subset of inputs and statically compiling P to work on all inputs. More concretely, given a program P and a set of input values I from its domain, the input set I induces a natural collection of programs C such that every program $Q \in C$ is equivalent to P modulo I: $\forall i \in I,\ Q(i) = P(i)$. The collection C can then be used to perform differential testing [16] of any compiler Comp: if $Comp(P)(i) \neq Comp(Q)(i)$ for some $i \in I$ and $Q \in C$, Comp has a miscompilation.
Next we provide some high-level intuition behind EMI's effectiveness (Section 2 illustrates this insight with two concrete, real examples for Clang and GCC respectively). The EMI variants can specifically target a compiler's analysis and optimization phases, and stress-test them to reveal latent compiler bugs. Indeed, although an EMI variant Q is only equivalent to P modulo the input set I, the compiler has to perform all its (static) analysis and optimizations to produce correct code for Q over all inputs. In addition, P's EMI variants, while semantically equivalent w.r.t. I, can have quite different static data- and control-flow. Since data- and control-flow information critically affects which optimizations are enabled and how they are applied, the EMI variants not only help exercise the optimizer differently, but also demand the exact same output on I from the code generated by these different optimization strategies. This is the very fact that we crucially leverage.
EMI has several unique advantages: It is general and easily applicable to finding bugs in compilers, analysis and transformation tools for any language.

4 Example LLVM bug
$ clang -m32 -O0 test.c ; ./a.out
$ clang -m32 -O1 test.c ; ./a.out
Aborted (core dumped)

5 Example
    int foo (void) {
      signed char x = 1;
      unsigned char y = 255;
      return x > y;
    }
Bug in the GCC shipped with Ubuntu for x86, at all optimization levels: the function is compiled to return 1, but the correct result is 0.

6 Csmith
Random generator: Csmith -> C program -> gcc -O0, gcc -O2, clang -Os -> results -> vote: majority / minority
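The voting step at the end of this pipeline can be sketched in a few lines of Python. This is a minimal illustration, not Csmith's actual test driver: the function name `vote`, the dictionary layout, and the sample checksum strings are assumptions made for the example. In practice the outputs would come from compiling one generated program with each configuration and running the resulting executable.

```python
from collections import Counter


def vote(outputs):
    """Split compiler configurations into majority and minority by output.

    `outputs` maps a configuration label (e.g. "gcc -O0") to the text
    printed by the executable it produced. Hypothetical helper, not part
    of Csmith.
    """
    counts = Counter(outputs.values())
    top_output, top_count = counts.most_common(1)[0]
    # Without a strict majority there is no trustworthy reference answer,
    # so every configuration is returned as a suspect.
    if top_count * 2 <= len(outputs):
        return None, sorted(outputs)
    minority = sorted(cfg for cfg, out in outputs.items() if out != top_output)
    return top_output, minority


# Made-up outputs for one generated program.
results = {
    "gcc -O0": "checksum = 1A2B",
    "gcc -O2": "checksum = 1A2B",
    "clang -Os": "checksum = 9F00",
}
expected, suspects = vote(results)
print(expected, suspects)
```

The vote only works because the generated program has a single interpretation; if the program relied on undefined behavior, a disagreeing compiler would not necessarily be wrong.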

7 Requirements
Unambiguous: avoid undefined or unspecified behaviors that create ambiguous meanings of a program
- Integer undefined behavior
- Use without initialization
- Unspecified evaluation order
- Use of dangling pointer
- Null pointer dereference
- OOB array access
Expressiveness: support most commonly used C features
- Integer operations
- Loops (with break/continue)
- Conditionals
- Function calls
- Const and volatile
- Structs and bitfields
- Pointers and arrays
- Goto

8 Avoiding Undefined/Unspecified Behaviors
Problem | Generation Time Solution | Run Time Solution
Integer undefined behaviors | Constant folding/propagation, algebraic simplification | Safe math wrappers
Use without initialization | Explicit initializers | -
OOB array access | Force index within range | Take modulus
Null pointer dereference | Inter-procedural points-to analysis | -
Use of dangling pointers | Inter-procedural points-to analysis | -
Unspecified evaluation order | Inter-procedural effect analysis | -
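The two run-time solutions named on this slide, safe math wrappers and taking the modulus of an index, can be modeled as follows. This is a rough Python model of what would be C code in a generated program. The function names are hypothetical, and the choice to fall back to the first operand when an operation would be undefined is an assumption of this sketch rather than something stated on the slide.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def safe_add_int32(a, b):
    """Model of a safe math wrapper for signed 32-bit addition.

    Signed overflow is undefined in C, so the wrapper tests the operands
    first and falls back to a fixed, well-defined value (assumed here: a).
    """
    if (b > 0 and a > INT32_MAX - b) or (b < 0 and a < INT32_MIN - b):
        return a
    return a + b


def safe_div_int32(a, b):
    """Model of a wrapper for signed 32-bit division.

    Division by zero and INT32_MIN / -1 are the undefined cases.
    """
    if b == 0 or (a == INT32_MIN and b == -1):
        return a
    # C truncates toward zero while Python's // floors, so divide magnitudes.
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def safe_index(i, length):
    """'Take modulus': fold any index into the range [0, length)."""
    return i % length


print(safe_add_int32(INT32_MAX, 1), safe_div_int32(-7, 2), safe_index(13, 10))
```

The point of the wrappers is that every compiler must agree on the result for every operand pair, which keeps the differential comparison meaningful.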

9 Generation Time Analyzer and Code Generator (diagram): assign, LHS *q, RHS call func_2; validate; ok? no

10 Generation Time Analyzer and Code Generator (diagram): assign, LHS, RHS call func_2

11 Generation Time Analyzer and Code Generator (diagram): assign, LHS *p, RHS call func_2; validate; ok? yes; update facts
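Slides 9 to 11 appear to show a propose-and-validate loop: the code generator proposes a left-hand side, the generation-time analyzer checks it against what it knows, and accepted statements update that knowledge. The following toy sketch illustrates the idea under strong simplifying assumptions. The fact table layout and the names `validate_deref` and `generate_store` are hypothetical, and Csmith's real analysis is inter-procedural and far more detailed.

```python
import random

# Targets that make a dereference undefined.
UNSAFE = {"null", "dangling"}


def validate_deref(pointer, points_to):
    """Accept *pointer only if it cannot be null or dangling.

    Pointers with no recorded facts are conservatively treated as
    possibly null.
    """
    targets = points_to.get(pointer, {"null"})
    return not (targets & UNSAFE)


def generate_store(candidates, facts, rng):
    """Propose `*ptr = func_2();` until the analyzer accepts one."""
    pool = list(candidates)
    rng.shuffle(pool)
    for ptr in pool:
        if validate_deref(ptr, facts["points_to"]):
            # Update facts: everything ptr may point to is now written.
            facts["initialized"] |= facts["points_to"][ptr]
            return f"*{ptr} = func_2();"
    # No safe candidate: the generator would pick another statement kind.
    return None


facts = {
    "points_to": {"q": {"x", "null"}, "p": {"y"}},
    "initialized": set(),
}
stmt = generate_store(["q", "p"], facts, random.Random(0))
print(stmt)
```

Here `*q` is rejected because `q` may be null, while `*p` is accepted and the facts are updated, mirroring the "no" and "yes" outcomes on the slides.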

12 From March 2008 to June 2011:
Compiler | Bugs reported (fixed)
GCC | 104 (86)
LLVM | 228 (221)
Others (CompCert, icc, armcc, tcc, cil, suncc, open64, etc.) | 50
Total | 382
- Accounts for 1% of total valid GCC bugs reported in the same period
- Accounts for 3.5% of total valid LLVM bugs reported in the same period
Do they matter?
- 25 priority 1 bugs for GCC
- 8 of our bugs were re-reported by others

13 Equivalence Modulo Inputs
- Relax equivalence with respect to a given input
  - Variants must satisfy $P(i) = P_k(i)$ on input i
  - But may differ on other input j: $P(j) \neq P_k(j)$
- Exploit close interplay between
  - Dynamic program execution on some input
  - Static compilation for all inputs
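A small concrete instance may help. In the sketch below, `p` and `p_k` agree on every input in the chosen set but differ outside it. The function bodies, the input set, and the helper `equivalent_modulo` are made up for illustration.

```python
def p(x):
    if x > 100:  # never taken for the inputs listed below
        return x * 2
    return x + 1


def p_k(x):
    # Variant: the untaken branch was rewritten, so behavior
    # differs only for x > 100.
    if x > 100:
        return 0
    return x + 1


def equivalent_modulo(f, g, inputs):
    """True when f and g produce the same result on every given input."""
    return all(f(i) == g(i) for i in inputs)


INPUTS = [0, 7, 100]
print(equivalent_modulo(p, p_k, INPUTS), p(101), p_k(101))
```

A compiler does not know which inputs will be used, so it must compile both functions correctly for all inputs, even though a tester only needs to compare them on `INPUTS`.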

14 Equivalence Modulo Inputs (diagram): profile program P on input I; code is marked executed or unexecuted

15 Equivalence Modulo Inputs (diagram): mutate; each variant takes input I and produces output O

16 Equivalence Modulo Inputs (diagram): mutate; each variant takes input I and produces output O; the variants are equivalent wrt I
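The profile-then-mutate steps on slides 14 to 16 can be sketched on a toy program representation. Everything here is an assumption made for illustration: a "program" is a list of guarded statements rather than C source, and `run`, `emi_variant`, and `set_out` are hypothetical names. A real implementation would use coverage data from an actual execution and edit the program text.

```python
import random


def run(program, x, trace=None):
    """Run a toy program: a list of (name, guard, effect) statements."""
    state = {"x": x, "out": 0}
    for name, guard, effect in program:
        if guard(state):
            effect(state)
            if trace is not None:
                trace.add(name)  # record that this statement executed
    return state["out"]


def emi_variant(program, inputs, rng, rate=0.5):
    """Profile on `inputs`, then randomly delete unexecuted statements."""
    executed = set()
    for x in inputs:
        run(program, x, executed)
    # Executed statements are always kept; each unexecuted statement is
    # dropped with probability `rate`.
    return [s for s in program if s[0] in executed or rng.random() >= rate]


def set_out(value_fn):
    """Build an effect that stores value_fn(state) into state['out']."""
    def effect(state):
        state["out"] = value_fn(state)
    return effect


P = [
    ("s1", lambda s: True, set_out(lambda s: s["x"] + 1)),
    ("s2", lambda s: s["x"] < 0, set_out(lambda s: -s["x"])),
    ("s3", lambda s: s["x"] > 100, set_out(lambda s: 0)),
]
INPUTS = [1, 5, 42]
Q = emi_variant(P, INPUTS, random.Random(0), rate=1.0)
print([name for name, _, _ in Q])
```

Because only statements that never ran on `INPUTS` are removed, `Q` behaves like `P` on those inputs. To test a compiler, both `P` and `Q` would be compiled and run on `INPUTS`; any difference in output points to a miscompilation.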

17 Example revisited
Test c in GCC test suite; unexecuted
$ clang -m32 -O0 test.c ; ./a.out
$ clang -m32 -O1 test.c ; ./a.out

18 Example revisited
Reduced version
$ clang -m32 -O0 test.c ; ./a.out
$ clang -m32 -O1 test.c ; ./a.out
Aborted (core dumped)

19 Autopsy
- GVN: load struct using 32-bit load
- SRoA: read past the struct's end -> undefined behavior
$ clang -m32 -O0 test.c ; ./a.out
$ clang -m32 -O1 test.c ; ./a.out
Aborted (core dumped)

20 Effectiveness
Bug counts | GCC | LLVM | Total
Reported | 111 | 84 | 195
Marked duplicate | 28 | 7 | 35
Confirmed | 79 | 68 | 147
Fixed | 56 | 54 | 110
Bug types | GCC | LLVM | Total
Wrong code | 46 | 49 | 95
Crash | 23 | 10 | 33
Performance | 10 | 9 | 19
