The Care and Feeding of Wild-Caught Mutants


Michael Vaughn and David Bingham Brown
December 19th, 2015

Abstract

We propose and implement a technique for providing more thorough mutation testing of software test suites by mining publicly available source code repositories. We derive wild-caught mutants from these repositories and investigate their effectiveness in evaluating software test suites compared to the traditional mutators used in mutation testing.

1 Introduction

One of the biggest threats to validity in debugging research is the size and quality of the defect corpus used for evaluation. Existing techniques for empirical evaluation of code analyzers, test methodologies, and other debugging tools and techniques are fairly simplistic. Rigorous scientific analysis requires well-documented and repeatable test conditions, leading to conditions that are either contrived, as is the case for the hand-introduced bugs in the Siemens Suite, or are small batches of painstakingly isolated bugs from real projects, as René Just produced for his studies[1]. A reasonable question to ask, then, is whether a more robust and scalable means of reproducibly introducing software defects could be devised.

Mutation analysis has gained traction in recent years as a means of evaluating the coverage of test suites. In particular, research has shown[1] that mutation coverage is a useful predictor of a test suite's effectiveness, independent of code coverage. However, as René Just showed, 17% of software faults could not be accounted for with normal mutants. The simplicity of the model makes it useful in industrial settings, where a degree of scientific validity can be sacrificed in order to obtain a reasonably useful tool.
Given the observation that certain real-world defects are disjoint from the class of defects that can be generated by these mutations, however, basic mutation analysis can be too artificial to find general use as a tool for scientific testing of engineering tools.

We expand on the core idea of mutation testing (automated bug introduction as a means to evaluate the quality of a testing suite) by developing, and analyzing the utility of, a suite of tools for inserting novel, human-generated mutants into a codebase for test suite evaluation. To accomplish this, we scoured GitHub[2], a large, publicly accessible source code repository, for small, single-line commits. Working on the assumption that the primary reason one makes a single-line commit to a repository is to fix a bug, we extract these commits and create a reverse patch from each one; if the commit exists to fix a bug, reversing the patch should reinsert the fixed bug into the codebase it is applied to. To ensure that the commits are actually applicable to other codebases, we extract these potential mutants in an identifier-agnostic way; that is, we maintain keywords and operators from the programming language used, but treat identifiers as wildcards to be matched to the host codebase upon insertion.

We provide a wild mutant extraction and insertion toolchain: (scraping tool here); mutgen, a mutant extraction tool; and mutins, a mutant insertion tool. We provide experimental evidence using source control projects and testing suites in the C language (though the toolchain does, at present, support most languages that lack semantic value for whitespace) to demonstrate the utility of our technique.

2 Development

All tools developed as part of the project can be found at mutants.

2.1 Repository Mining

To obtain mutants, we decided to mine code from public GitHub repositories. To start out with, we decided to investigate C repositories, as the language's comparatively simple syntax and semantics lead us to believe that a greater proportion of C's reverse patches should be applicable compared to more (syntactically) complex object-oriented languages.

In order to obtain a reasonably sized set of patches, we decided to target repositories with the most commits. However, the GitHub search API does not expose the number of commits. As a heuristic, we decided to instead select the repositories with the most forks, as GitHub's API does allow users to order search results by fork count. We felt this heuristic was reasonable, as we generally think projects with a large number of forks are subject to broad interest, and thus a higher rate of development. Quantitatively, this assumption appears warranted, as the top 20 projects by this metric include the Linux kernel, memcached, and Redis.

To perform the scraping, we created two automated tools. The first was a small Python script which could automatically submit small batches of search queries to the GitHub server, and build a list of all projects matching the query.
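A fork-ordered, paginated listing of this kind can be sketched as follows. This is a minimal illustration rather than the authors' script: the function names and the injected `fetch_page` callable are hypothetical, and the only outside facts assumed are the GitHub repository search endpoint and its `q`, `sort`, `order`, `per_page`, and `page` parameters.

```python
import time
from typing import Callable, Dict, List

# GitHub's repository search endpoint (the paper used OctoHub to reach it).
SEARCH_URL = "https://api.github.com/search/repositories"


def search_params(language: str, page: int, per_page: int = 100) -> Dict[str, str]:
    """Build the query parameters for one page of a fork-ordered search."""
    return {
        "q": f"language:{language}",
        "sort": "forks",      # commit counts are not exposed, so rank by forks
        "order": "desc",
        "per_page": str(per_page),
        "page": str(page),
    }


def list_top_repos(
    fetch_page: Callable[[Dict[str, str]], List[dict]],
    language: str,
    limit: int,
    per_page: int = 100,
    pause_seconds: float = 0.0,
) -> List[str]:
    """Collect repository names page by page until `limit` names are found.

    `fetch_page` is a hypothetical callable that sends one search request
    and returns the list of result items for that page.
    """
    names: List[str] = []
    page = 1
    while len(names) < limit:
        items = fetch_page(search_params(language, page, per_page))
        if not items:
            break  # no more results
        names.extend(item["full_name"] for item in items)
        page += 1
        time.sleep(pause_seconds)  # pause between pages to respect rate limits
    return names[:limit]
```

In practice `fetch_page` would issue an HTTP GET against `SEARCH_URL` with these parameters and return the `items` array of the JSON response; keeping it injectable lets the paging logic be exercised without network access.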
To simplify the query script, we used OctoHub[3], a small Python library which lets users programmatically build API queries. We then built a script to send small numbers of paginated queries, to avoid running afoul of GitHub's API limits. By doing this, we were able to build a list of the top 625 C projects, in descending order of fork count. We used the top 50 of these, comprising some 850 million lines of commits, for our experiments.

Once this was done, we created a small program which checked out each repository on the list, and then performed the repository scraping locally. Our local scraper is a Python script which iterates backwards through the commit history, outputting each diff (in unified diff format), along with the revision number and commit message, to a text file. The construction of this scraper was simplified by using GitPython[4], which provides a robust object-oriented implementation of the relevant git functionality, such as reading commit histories and constructing diffs.

2.2 Mutant Extraction: mutgen

We developed mutgen, the second element of our toolchain, to extract potential wild mutants from the unified diff files scraped from GitHub. mutgen identifies potential mutants by isolating single-line changes from the source control commits it reads as input. Both lines of each identified commit are tokenized with help from language specification files, provided at the command line, detailing language keywords and operators; the tokenizer ignores whitespace and uses rules that are simple yet (largely) universal among programming languages for processing keywords, identifiers, literals, and whitespace: operators are consumed greedily, while keywords and identifiers must be separated by operators or whitespace.

    Usage: mutgen [options]
    Options:
      --help              display this text
      -k KEYWORD_FILE     load language keywords from KEYWORD_FILE
      -o OPERATOR_FILE    load language operators from OPERATOR_FILE
      -i INPUT_FILES...   extract mutants from INPUT_FILES
      -x EXTRACT_FILE     store extracted mutants in EXTRACT_FILE

    Figure 1: mutgen command line options

Once the two lines (the before and after lines) are tokenized, mutgen analyzes both to ensure that it is possible to generate the before line from the after line. mutins is not yet robust enough to synthesize identifiers or literals, so mutgen requires that potential mutants not require the synthesis of new information; that is, the before state must be able to be generated solely from identifiers and literals matched in the after state. Figure 2 shows example single-line commits as processed by mutgen; the commit in figure 2a would be discarded because the before state cannot be generated from the after state (as the identifier y is unique to the before state, and thus cannot be generated solely from the after state). Figure 2c shows the tokenized mutant fully extracted; in the mutant extract language of mutgen and mutins, : indicates a keyword, . an operator, and $ an identifier. Keywords and operators contain their literal value, while identifiers are given an index to be used in converting the after state to the before, with -1 indicating that the identifier (like the y in figure 2b) is unused.

    - if (x && y)
    + if (x)
    (a) Invalid potential mutant

    - if (x)
    + if (x && y)
    (b) Valid potential mutant

    + :if .( $1 .&& $-1 .)
    - :if .( $1 .)
    (c) Mutant extracted from (b)

    Figure 2: Example potential mutants
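The tokenization and validity check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not mutgen itself (which is not shown in the paper): the keyword and operator lists are a small hypothetical subset of C, literals are treated as wildcards in the same way as identifiers, and the function names are invented for the example.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical language specification (a small subset of C).
KEYWORDS = {"if", "else", "while", "for", "return"}
OPERATORS = ["&&", "||", "==", "!=", "<=", ">=", "(", ")", "{", "}",
             ";", "<", ">", "=", "+", "-", "*", "/", "%", "!", "&", "|", ","]

# Longest operators first, so that "&&" is consumed greedily before "&".
_OPS = sorted(OPERATORS, key=len, reverse=True)
TOKEN_RE = re.compile(
    "(?P<op>" + "|".join(re.escape(op) for op in _OPS) + ")"
    r"|(?P<word>[A-Za-z_]\w*)"
    r"|(?P<lit>\d+)"
)

Token = Tuple[str, str]  # (kind, text); kind is ":" keyword, "." operator, "$" wildcard


def tokenize(line: str) -> List[Token]:
    """Split one source line into keyword, operator, and wildcard tokens."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(line):
        if line[pos].isspace():
            pos += 1  # whitespace only separates tokens
            continue
        match = TOKEN_RE.match(line, pos)
        if match is None:
            raise ValueError(f"cannot tokenize {line[pos:]!r}")
        text = match.group(0)
        if match.lastgroup == "op":
            tokens.append((".", text))
        elif match.lastgroup == "word" and text in KEYWORDS:
            tokens.append((":", text))
        else:
            tokens.append(("$", text))  # identifier or literal (assumption)
        pos = match.end()
    return tokens


def extract_mutant(before: str, after: str) -> Optional[Tuple[str, str]]:
    """Return (after_pattern, before_pattern), or None if the before line
    would require synthesizing an identifier absent from the after line."""
    before_toks, after_toks = tokenize(before), tokenize(after)

    # Number each distinct wildcard by first appearance in the after state.
    index: Dict[str, int] = {}
    for kind, text in after_toks:
        if kind == "$" and text not in index:
            index[text] = len(index) + 1

    used = set()
    for kind, text in before_toks:
        if kind == "$":
            if text not in index:
                return None  # new information would have to be invented
            used.add(text)

    def render(tokens: List[Token], mark_unused: bool) -> str:
        parts = []
        for kind, text in tokens:
            if kind == "$":
                slot = -1 if (mark_unused and text not in used) else index[text]
                parts.append(f"${slot}")
            else:
                parts.append(kind + text)
        return " ".join(parts)

    return render(after_toks, True), render(before_toks, False)
```

Applied to the commit in figure 2b, `extract_mutant("if (x)", "if (x && y)")` yields the pair of patterns shown in figure 2c, while the commit in figure 2a is rejected because `y` cannot be recovered from the after state.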
Our initial expectation that valid mutants would be rare proved to be untrue, and our initial run of mutgen over our scraped corpus yielded almost three million mutants, more than it was reasonably possible to evaluate. After manually combing through a subset of the mutants produced by the initial run, we added heuristics to mutgen to cull both commits likely to be comments (e.g., those containing several identifiers in a row, indicating that the committed text is more likely to be natural language than a programming language, or containing several repeated operators, often seen as horizontal lines drawn in comments) and those complex enough to be unlikely to be found in other codebases (that is, those that have after states containing more than four identifiers to be matched). The result of applying these culling heuristics was to reduce the generated set from roughly three million potential mutants, most of them unusable, to roughly thirty thousand, of which a larger proportion were likely to be matched. Figure 3 shows some example mutants identified by mutgen (see footnote 1).

    - }
    + } else
    (a)

    - while (i < n);
    + while (i < n)
    (b)

    + TMPFILE
    - TMPFILE % 512
    (c)

    Figure 3: Example extracted mutants

Notably, while mutgen (and the rest of the toolchain) has only been tested for the C programming language, the entire system is designed to be usable, via operator and keyword specifications, on virtually any programming language that does not assign semantic value to whitespace (see footnote 2); the companion tool mutins reads these language specifications from the extracted mutant file generated by mutgen. Applied to the entire scraped corpus (approximately 850 million lines of commits), mutgen extracted 29,704 possibly viable mutants in less than ten minutes on desktop-grade hardware.

2.3 Mutant Insertion: mutins

mutins, the final element of the toolchain, reads mutants and language definitions from the mutant extract file generated by mutgen and inserts them into a codebase specified as a list of files. mutins offers several command line options to facilitate automatic and repeated use of the tool, since, due to the nature of the mutants generated, many will cause the resulting code to fail to compile. In addition to strictly random use (by default mutins chooses a mutant at random from the mutant extract file and inserts it at a randomly selected insertion point in the target codebase), mutins can be forced to use a specific random number seed (we use the C++ STL's implementation of the Mersenne Twister[5] for pseudo-random number generation).
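The value of a fixed seed is that a given mutant and insertion point can be reproduced exactly on a later run. A minimal sketch of seeded selection is shown below; it uses Python's `random.Random`, which is also a Mersenne Twister generator, in place of the C++ implementation the tool uses. The function name is hypothetical, and restricting the choice to mutants with at least one match is an assumption of this sketch rather than documented mutins behavior.

```python
import random
from typing import Optional, Sequence, Tuple


def choose_insertion(
    match_counts: Sequence[int], seed: Optional[int] = None
) -> Optional[Tuple[int, int]]:
    """Pick a (mutant index, insertion point index) pair.

    match_counts[i] is the number of places mutant i matches in the
    target codebase. With the same seed the same pair is returned.
    """
    rng = random.Random(seed)  # Mersenne Twister, seeded for repeatability
    candidates = [i for i, count in enumerate(match_counts) if count > 0]
    if not candidates:
        return None  # no mutant can be inserted anywhere
    mutant = rng.choice(candidates)
    point = rng.randrange(match_counts[mutant])
    return mutant, point
```

Running `choose_insertion(counts, seed=7)` twice with the same `counts` gives the same pair, which is what makes a failed compilation or an interesting test result reproducible.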
Mutant insertion works much like mutant extraction in reverse; mutins tokenizes the input files with the keyword and operator lists provided in the mutant extract file, and then finds possible insertion points by identifying token sequences matching the after state of the chosen mutant. Once an insertion point is selected, the range of text represented by the after state tokens is replaced by text synthesized from the before state tokens, matching identifier to identifier in the synthesized code.

    Usage: mutins [options]
    Options:
      --help              display this text
      -v                  verbose output
      -c                  only count potential matches, do not insert
      -i MATCH_INDEX      insert the match at the specific index
      -r RANDOM_SEED      use RANDOM_SEED to initialize random number
                          generator (this is ignored if the -i option is used)
      -m MUTANT_INDEX     use only the mutant found at the (zero-based)
                          index MUTANT_INDEX
      -x EXTRACT_FILE     load mutants (and language data) from EXTRACT_FILE
      -t TARGET_FILES...  attempt to insert mutant into TARGET_FILE
      -b                  skip backup (by default, modified files are
                          copied to file.orig before mutant insertion)

    Figure 4: mutins command line options

Footnote 1: We posit no explanation why one would need to perform modulo arithmetic on a variable named TMPFILE, but do note that stripping a modulo operation does provide an interesting mutation for use in mutation testing.

Footnote 2: A conversation with Ben Liblit yielded a simple and elegant method to implement support for semantic whitespace à la Python, but this has not yet been implemented.

3 Evaluation

For our experimental validation, we chose to replicate a subset of an experiment that J.H. Andrews, L.C. Briand, and Y. Labiche describe in [6]. Their experiment takes a program from the SIR repository[7], and randomly generates a number of test suites by randomly choosing from the artifact's tests. They then measure the mutation adequacy of each suite by running it over the set of all possible program mutations. By collecting these measurements, they construct a model of the statistical distribution of the mutant detection rate over arbitrary test suites, which they compare to a similarly constructed approximation of the distribution for hand-seeded faults.

3.1 Target Program

While Andrews works with a wide variety of programs from SIR, including the Siemens suite, we decided to work with Space [7]. Space is an appealing subject for this form of experimentation, as it is a mature piece of software that has been subject to years of production use. Because of this, Space is also the only program Andrews tested which used real faults instead of hand-introduced ones. Thus, we can already get a sense of how wild mutants fare against the test suites' detection rates for real faults. Moreover, at 6,199 lines of code, it is larger than the other classical Siemens Suite programs. This size is large enough that mutins can generate 117,744 possible mutants, which is a non-trivial set that is still small enough to explore exhaustively.
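A count such as the one above comes from enumerating every position where some mutant's after state matches the target's token stream. The sketch below illustrates that matching and the subsequent substitution on pre-tokenized input. It is not the mutins implementation: the function names are hypothetical, it operates on token lists rather than text ranges, and it assumes that a repeated wildcard index must bind to the same identifier each time.

```python
from typing import Dict, List, Optional, Tuple, Union

Token = Tuple[str, str]                    # (kind, text) from the target file
PatternToken = Tuple[str, Union[str, int]]  # wildcards carry an integer index


def match_at(
    tokens: List[Token], start: int, pattern: List[PatternToken]
) -> Optional[Dict[int, str]]:
    """Try to match `pattern` at `start`; return wildcard bindings or None."""
    if start + len(pattern) > len(tokens):
        return None
    bindings: Dict[int, str] = {}
    for (kind, text), (pkind, pvalue) in zip(tokens[start:], pattern):
        if kind != pkind:
            return None
        if pkind == "$":
            if pvalue == -1:
                continue  # unused wildcard: matches any identifier
            if bindings.setdefault(pvalue, text) != text:
                return None  # same index bound to two different names
        elif text != pvalue:
            return None  # keyword or operator must match literally
    return bindings


def find_insertion_points(
    tokens: List[Token], after_pattern: List[PatternToken]
) -> List[int]:
    """List every token offset where the after state matches."""
    return [i for i in range(len(tokens))
            if match_at(tokens, i, after_pattern) is not None]


def apply_mutant(
    tokens: List[Token],
    start: int,
    after_pattern: List[PatternToken],
    before_pattern: List[PatternToken],
) -> List[Token]:
    """Replace the matched after state with the synthesized before state."""
    bindings = match_at(tokens, start, after_pattern)
    if bindings is None:
        raise ValueError("after state does not match at this position")
    replacement = [
        (kind, bindings[value]) if kind == "$" else (kind, value)
        for kind, value in before_pattern
    ]
    return tokens[:start] + replacement + tokens[start + len(after_pattern):]
```

Summing `len(find_insertion_points(...))` over all extracted mutants corresponds to what the `-c` option in figure 4 reports, and each (mutant, offset) pair is one candidate mutation of the target program.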
3.2 Procedure

Prior to testing, we obtained 5,000 100-case test suites by repeatedly shuffling the list of available test cases and taking the first 100 elements, thus ensuring that no single suite contained duplicate test cases. Next, we ran mutins on Space, in order to identify each possible point at which a wild mutant can be inserted. We recorded each possible insertion in a list of entries that could be fed to our suite execution framework at a later time. We then divided the space of 117,744 mutants into batches of six mutants each, as prior experimentation indicated that such a batch could complete in one to two hours. We packaged each batch as part of an HTCondor job, which applied each mutation in sequence and ran all six against each of the 5,000 test suites. Test successes and failures were recorded, and sent back from each job.

Once the test results were returned, we performed some simple analyses on the results. For each test suite S, we calculated the mutation adequacy score, Am(S), where Am(S) is the ratio of mutants detected by the suite to the total number of mutants [6]. We also recorded the number of mutants that successfully compiled, in order to get a general sense of the feasibility of mutant insertion.

3.3 Experimentation Framework and Tooling

As the experimental procedure calls for exhaustively building and testing each possible mutant for a subject program, a significant amount of computation time is required to obtain a reasonably informative data set. We decided to use UW's Center for High Throughput Computing, which provides a robust environment for large-scale grid computing via the HTCondor framework [8]. Since each job can be run independently of the others, the parallelization is simply a matter of appropriately packaging the mutator, along with the target program and associated suites.

This packaging posed a significant difficulty, as the computing pool's Linux environments are heterogeneous, and host machines are not guaranteed to have any specific version or build of many non-core programs, if any version is indeed present. In particular, many nodes have neither GCC nor the headers needed to build code, which required us to create relocatable binaries of both GCC and glibc, which we could pass in as part of the job. Obtaining such a version of GCC is non-trivial, and requires a significant amount of configuration and testing to ensure that the correct versions of libraries are built, and that no subtle discrepancies are present in the toolchain.
Moreover, HTCondor jobs can be located in an arbitrary directory of the host system. This poses a problem, as a naive build of GCC and glibc may experience various linking and loading errors in this situation. We eventually discovered crosstool-ng [9], which is a configurable tool intended to create cross-compiler toolchains. After some investigation and experimentation, we were able to correctly build a version of GCC with the desired properties.

As the experiment used the SIR repository [7], we built a Python framework for building and executing arbitrary experiments on it. By taking advantage of SIR's standardized framework, we constructed a system which can be used to move various objects to staging areas, build test suites, and invoke external tools, such as mutins, to manipulate the source code. We also created a similar set of Bash scripts, with similar functionality, in case Python was either unavailable, or more robust shell-style functionality was required. By packaging this with the relocatable GCC, we were effectively able to construct a framework for reproducible software engineering experiments. By packaging the desired artifact from the SIR repository, the Python framework, and the compiler, along with a top-level script that invokes the necessary behaviors, the experimenter can present the experiment in such a way that any interested party can simply obtain the package, and begin tweaking and experimenting.

4 Results and Discussion

Given our data we were able to record a few basic statistical metrics.
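For concreteness, the two metrics used in this section can be computed as sketched below. The function names and the representation of results as sets of mutant identifiers are hypothetical, and the sketch follows the definition in Section 3.2 literally; whether the denominator should count all inserted mutants or only those that compiled is an assumption the reader may need to adjust.

```python
from typing import Dict, Set


def mutation_adequacy(detected: Set[int], all_mutants: Set[int]) -> float:
    """Am(S): fraction of mutants detected by one test suite S."""
    if not all_mutants:
        raise ValueError("no mutants to score against")
    return len(detected & all_mutants) / len(all_mutants)


def average_adequacy(
    detected_by_suite: Dict[str, Set[int]], all_mutants: Set[int]
) -> float:
    """Mean Am(S) over every test suite in the experiment."""
    scores = [mutation_adequacy(d, all_mutants)
              for d in detected_by_suite.values()]
    return sum(scores) / len(scores)


def compilation_rate(compiled: int, total: int) -> float:
    """Fraction of inserted mutants that compiled successfully."""
    return compiled / total


# The counts reported in this section give a rate of roughly 19 percent.
rate = compilation_rate(20_638, 108_134)
```

A suite that detects two of four mutants scores 0.5, and averaging that with a suite that detects all four gives 0.75.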

    Total mutants                    108,134
    Successfully compiled mutants    20,638
    Compilation rate
    Average Am(S)

Interestingly, nearly 20 percent of the inserted mutants compiled, which was much greater than the rate of less than 5 percent we originally expected. This is still comparatively low; Andrews reported a compilation success rate of 92 percent[6]. However, at scale, mutation still appears feasible, as our set of compiled mutants is roughly twice as large as Andrews's 11,379 mutant set.

Curiously, the wild mutants were far more difficult to detect than either the real-world faults or the mutants generated by Andrews. The average Am(S) for real faults and for generated mutants were recorded as .75 and .75 respectively, making them nearly 1.5 times easier to catch. Every wild mutant we tested was recorded as being caught by at least one test suite, so each one introduced some fault. This seems to indicate that wild mutants tend to induce more subtle variations than those produced by other operators. Andrews asserts that an Am(S) lower than the rate for real faults is an argument against the realism of hand-seeded faults[6]. However, given the apparently subtle behavior of the reverse patches, we feel that more careful analysis of both our results and Andrews's results is warranted.

5 Threats to Validity

Currently, the most significant threat to validity is the relative immaturity of our data collection and analysis software. In particular, as stated before, HTCondor is a challenging system to develop for; our testing and compilation frameworks required a significant number of false starts and reworkings before we had a successful execution. Thus, our test executors may still have bugs that altered our experiment in some way, or some unforeseen aspect of the execution environment may be altering the behavior of the program in a hard-to-detect way.
6 Related Work

René Just's evaluation of mutation testing's external validity, and subsequent analysis of the limitations[1] of mutation testing, served as the main impetus for our work. In particular, careful consideration of his discussion of the classes of faults which cannot be expressed in terms of basic mutation operators was our primary inspiration in searching for a more realistic set of mutation operators.

Jia and Harman [10] wrote a robust survey of the history of mutation testing, including a section delineating the various techniques for testing mutation frameworks, along with an overview of the most significant works of mutation evaluation. In addition, they discuss various subject programs used in testing and give a detailed list of programs used for evaluation, sorted by the number of papers using each. Their work quickly pointed us towards the SIR repository as a viable set of tools for mutation testing. Moreover, after Dr. Liblit told us about James Andrews's mutation framework, we found Andrews's evaluation experiment in the bibliography of Jia and Harman's survey.

7 Future Work

The clearest short-term objective is to continue comparing our mutation tool against the results of Andrews's experiment. With our tools, it should be fairly straightforward to perform the experimental procedure on the other 7 SIR programs he analyzed. Additionally, he performed more sophisticated statistical analyses on his results, such as the statistical significance of variation between test suites as well as between mutation styles. Given our existing framework for exhaustively searching the mutation space of a given program, we feel we can efficiently replicate the rest of his experiment in a matter of weeks.

Another reasonable axis of evaluation is the derivability relationship between the basic operators provided by common mutation frameworks and our wild-caught mutants. Specifically, it is reasonable to inquire what proportion of wild mutants can be derived from some bounded number of applications of a mutation operator. For mutant insertions where both the before and after code can be described as functions that map from one state of the variables the before code touches to a state of the variables touched by the after code, there may be a way to apply syntax-guided synthesis [11] in attempting to derive the mutation.

8 Conclusion

Mutation testing is predicated on software engineering researchers' and practitioners' testing needs. In particular, test suites and bug finders, like all other software, need to be extensively tested and verified, which requires a large corpus of test cases. Mutation testing provides one way of quickly and reproducibly introducing large numbers of faults into a known piece of software. However, as Just demonstrates in [1], simple mutation operators cannot span the full space of software faults. Wild mutants derived from reversed patch data bridge this gap. By reflecting real-world code changes, these mutants will affect code in a manner that was deemed significant in some context.
Moreover, given the vast quantities of publicly available patch data, large sets of candidate operations can be collected and evaluated in a matter of days. Given the one in five probability that such a mutant will successfully compile, and their distinctive behavior with respect to test suites, we believe wild mutants are objects deserving further study.

9 Acknowledgements

This research was performed using the compute resources and assistance of the UW-Madison Center For High Throughput Computing (CHTC) in the Department of Computer Sciences. The CHTC is supported by UW-Madison, the Advanced Computing Initiative, the Wisconsin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National Science Foundation, and is an active member of the Open Science Grid, which is supported by the National Science Foundation and the U.S. Department of Energy's Office of Science.

References

[1] R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, and G. Fraser, "Are mutants a valid substitute for real faults in software testing?" FSE, [Online]. Available: http://homes.cs.washington.edu/~rjust/publ/mutants_real_faults_fse_2014.pdf.
[2] GitHub, Inc. (2015). GitHub - where software is built, [Online]. Available: https://www.github.com/.
[3] A. Swartz. (2013). OctoHub: Low level Python and CLI interface to GitHub, [Online].
[4] M. Trier. (2008). GitPython, [Online]. Available: GitPython/.
[5] M. Matsumoto and T. Nishimura, "Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator," ACM Trans. on Modeling and Computer Simulation, [Online]. Available: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/mt/articles/mt.pdf.
[6] J. H. Andrews, L. C. Briand, and Y. Labiche, "Is mutation an appropriate tool for testing experiments?" in Proceedings of the 27th International Conference on Software Engineering, ser. ICSE '05, St. Louis, MO, USA: ACM, 2005.
[7] H. Do, S. G. Elbaum, and G. Rothermel, "Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact," Empirical Software Engineering: An International Journal, vol. 10, no. 4.
[8] M. Litzkow, M. Livny, and M. Mutka, "Condor - a hunter of idle workstations," in Proceedings of the 8th International Conference of Distributed Computing Systems.
[9] Y. E. Morin. (2013). Crosstool-ng, [Online].
[10] Y. Jia and M. Harman, "An analysis and survey of the development of mutation testing," IEEE Trans. Softw. Eng., vol. 37, no. 5, Sep. 2011.
[11] R. Alur et al., "Syntax-guided synthesis."
