
Test Suite Minimization
An Empirical Investigation

by
Jeffery von Ronne

A PROJECT

submitted to

Oregon State University
University Honors College

in partial fulfillment of the requirements for the degree of

Honors Bachelors of Science in Computer Science (Honors Scholar)

Presented May 28, 1999
Commencement June 1999

AN ABSTRACT OF THE THESIS OF

Jeffery von Ronne for the degree of Honors Bachelors of Science in Computer Science presented on May 28, 1999.

Title: Test Suite Minimization: An Empirical Investigation.

Abstract approved: Gregg Rothermel

Test suite minimization techniques attempt to reduce the cost of saving and reusing tests during software maintenance by eliminating redundant tests from test suites. A potential drawback of these techniques is that in minimizing a test suite, they might reduce the ability of that test suite to reveal faults in the software. Previous studies have shown that sometimes this reduction is small, but sometimes it is severe. This work investigates the minimization process, the factors that can affect its performance, and techniques for reducing this loss.

Test Suite Minimization
An Empirical Investigation

by
Jeffery von Ronne

A PROJECT

submitted to

Oregon State University
University Honors College

in partial fulfillment of the requirements for the degree of

Honors Bachelors of Science in Computer Science (Honors Scholar)

Presented May 28, 1999
Commencement June 1999

Honors Bachelors of Science in Computer Science project of Jeffery von Ronne presented on May 28, 1999

APPROVED:

Mentor, representing Computer Science

Committee Member, representing Mathematics

Committee Member and Chair, Department of Computer Science

Dean of University Honors College

I understand that my project will become part of the permanent collection of Oregon State University Honors College. My signature below authorizes release of my project to any reader upon request.

Jeffery von Ronne, Author

Acknowledgment

Many thanks are due to Dr. Rothermel, who provided much advice and guidance during the past year, as well as collaborating on the work in this thesis. My other committee members were Dr. Robby Robson and Dr. Michael Quinn. Dr. Roland Untch of Middle Tennessee State University provided the mutation data necessary for the experiments with PSSC minimization. Chengyun Chu prepared the Space program and assisted in the preparation of the mutation data. Dr. Mary Jean Harrold and Christie Hong of Ohio State University and Jeffery Ostrin also collaborated on parts of this work. The "Siemens" programs were provided by Siemens Corporate Research. The Space program came from the European Space Agency via Drs. Pasquini and Phyllis. The NSF funded my work through a Research Experience for Undergraduates grant to Dr. Rothermel. The equipment and other collaborators were funded in part by grants from Microsoft and the NSF. Thanks, everyone.

Contributing Co-Authors

The second and third chapters of this thesis are based on an article entitled "Experiments to Assess the Cost-Benefits of Test Suite Minimization" by Dr. Gregg Rothermel, Dr. Mary Jean Harrold (Ohio State University), Christie Hong (Ohio State University), and myself, which is currently in preparation for submission to Transactions on Software Engineering. That article is a revised and expanded version of an earlier paper, entitled "An Empirical Study of the Effects of Minimization on the Fault Detection Capabilities of Test Suites," which was authored by Dr. Gregg Rothermel, Dr. Mary Jean Harrold, Christie Hong, and Jeffery Ostrin, and presented at the November 1998 International Conference on Software Maintenance.

Table of Contents

1. Introduction and Motivation
   Motivation
   Overview of This Thesis
2. Background and Literature Review
   Test suite minimization
   Previous empirical work
      The Wong98 study
      The Wong97 study
3. Edge-Minimization Experiments
   Research Questions
   Measures and Tools
      Measures
         Measuring savings
         Measuring costs
      Tool infrastructure
   Experiments with smaller C programs
      Subject programs, faulty versions, test cases, and test suites
      Experiment design
      Threats to validity
   Minimization of edge-coverage-adequate test suites
      Test suite size reduction
      Fault detection effectiveness reduction
   Minimization of randomly generated test suites
      Test suite size reduction
      Fault detection effectiveness reduction
   Experiment with the Space Program
      Subject program, faulty versions, test cases, and test suites
      Experiment design
      Threats to validity
      Data and Analysis
         Test suite size reduction
         Fault detection effectiveness reduction
   Comparison to Previous Empirical Results
4. A New Minimization Technique
   Mutation Analysis and Minimization
      Mutation Analysis and Sensitivity
      Adapting Sensitivity for use as a Coverage Criterion
   An Algorithm to Facilitate Minimization based on PSSC

      A Conventional Test Suite Minimization Heuristic
      A Multi-Hit Minimization Algorithm
      Using the Multi-Hit Reduction Algorithm for PSSC Minimization
      Asymptotic Analysis of the Multi-Hit Reduction Algorithm
   An Experiment with PSSC Minimization
      Experimental Design
      Results
         Minimized Test Suite Size
         Minimized Test Suite Performance
5. Conclusion
   Results
   Practical Implications
   Limitations of This Investigation and Future Work
Bibliography

List of Figures

3-1. Percentage of Inputs that Expose Each Fault
3-2. Size Distribution among Unminimized Test Suites for the Siemens Programs
3-3. Size of Minimized vs. Size of Original Test Suites
3-4. Percent Reduction in Test Suite Size vs. Original Test Suite Size
3-5. Minimization: Percentage Effectiveness Reduction vs. Original Size
3-6. Effectiveness in Original and after Minimization vs. Original Size
Random Reduction: Percentage Effectiveness Reduction vs. Original Suite Size
Minimization and Random Reduction: Fault Detection vs. Original Size
Random Reduction: Percent Effectiveness Reduction
Percentage of Test Cases that Expose Each of Space's Faults
Size of Minimized Test Suites vs. Size of Original Test Suites
Percent Reduction in Test Suite vs. Original Test Suite Size
Percent Reduction in Effectiveness vs. Original Size
Original and Minimized: Faults Detected vs. Original Size
The Harrold, Gupta, and Soffa Test Suite Minimization Algorithm
A Multi-Hit Test Suite Reduction Algorithm
A C Program
Sizes of Test Suites after PSSC Minimization
Average Test Suite Size vs. Average Number of Faults Detected

List of Tables

3-1. The Siemens Programs
3-2. Correlation Between Size Reduction and Original Size
3-3. Minimization: Correlation between Effectiveness Reduction and Original Size
Random Reduction: Correlation between Effectiveness Loss and Original Size
Comparison of Fault Detection Reduction
Comparison of Fault Detection Reduction Variance
The Space Application
Correlation between Size Reduction and Initial Size
Correlation between Initial Size and Effectiveness Reduction
Average Reductions in Fault Detection Effectiveness
Fault detection abilities of tests used in the Wong98 study
The Initial Test Suite for the Example Program
The Coverage Requirements for the Example Program

Chapter 1. Introduction and Motivation

1.1. Motivation

Testing is an important but expensive task necessary for the construction of high-quality software. As such, there is great potential for any practical technique that enables the detection of more faults with limited software testing funds.

One testing strategy is to orient the testing regimen around concrete, achievable criteria. These include functional tests, designed to exercise the program's documented features, and also structural tests, designed to exercise each statement in the program. It is thought that a testing regimen designed around explicit criteria such as these is more effective than either random or ad hoc testing.

1. Random testing is selecting inputs at random, from some input distribution, and using those as test cases. Ad hoc testing is testing with inputs chosen by the tester with no explicit selection criteria.

In fact, experimentation, such as that done by researchers at Siemens, has shown that structural testing based on either control-flow or dataflow coverage criteria can show significantly better fault detection than random testing [Hutchins94].

Coverage criteria are also used as a stopping point, to decide when a program is sufficiently tested. In this case, additional tests are added until the test suite has achieved a specified coverage level according to a specific adequacy criterion. For example, to achieve statement coverage adequacy for a program, one would add test cases to the test suite until each statement in that program is executed by at least one of the test cases.

It is often the case that as a program evolves, additional tests are needed to maintain adequate coverage. Sometimes, as the test suite grows, it can become prohibitively expensive to execute on new versions of the program.

These test suites will often contain test cases that are no longer needed to satisfy the coverage criteria, because they are now obsolete or redundant [Chen96, Harrold93, Horgan92, Offutt95].

2. Obsolete test cases no longer exercise any coverage items. Redundant test cases are those that exercise only coverage items that are also exercised by other test cases in the test suite.

For example, Harrold et al. propose that a reduced test suite, made up of the smallest subset of the test cases that still exercises all of the coverage items, could be used in place of the original test suite [Harrold93]. The reduced subset of the original test suite will be referred to as a minimized test suite, and the process of obtaining the minimized test suite will be called minimization.

Unfortunately, minimized test suites are not without drawbacks. In addition to the cost of determining the reduced set, minimization may remove test cases that detect program faults that are not detected by other test cases satisfying the same criterion. In the worst case, a minimized test suite will no longer detect any of the faults that would be detected by the original test suite. This work begins to quantify this loss, over a limited range of coverage criteria, programs, program faults, and test cases, and compares it to the benefit of reduced test suite size.

1.2. Overview of This Thesis

Some studies have shown that minimization can result in significant savings in test suite size with little reduction in the ability of the minimized test suite to detect faults [Wong95, Wong97, Wong98]. This work, however, shows that this is not necessarily the case. For the combination of programs, faults, and types of test suites we utilized in two empirical studies, the loss in fault detection was substantial.

While a third study showed a less extreme loss in fault detection, that loss was still both statistically and practically significant.

These findings motivated the search for alternative coverage criteria that could be used in place of, or in conjunction with, structural criteria. This resulted in a new coverage criterion: Probabilistic Statement Sensitivity Coverage (PSSC). In the process, a new minimization heuristic was developed.

The next chapter will discuss coverage criteria, test suite minimization, and previous work. The third chapter will discuss the experiments we conducted to assess the performance of a conventional minimization technique. Chapter 4 introduces the PSSC criterion, explains how it could be used, and compares its performance to conventional techniques. Finally, the conclusion will recap our experimental results, explain the practical consequences of this work, and suggest areas for further study.

Chapter 2. Background and Literature Review

2.1. Test suite minimization

The test suite minimization problem may be stated as follows [Harrold93]:

Given: Test suite T; a set of test case requirements r_1, r_2, ..., r_n that must be satisfied to provide the desired test coverage of the program; and subsets of T, T_1, T_2, ..., T_n, one associated with each of the r_i's, such that any one of the test cases t_j belonging to T_i can be used to test r_i.

Problem: Find a representative set of test cases from T that satisfies all of the r_i's.

The r_i's in the foregoing statement can represent various test case requirements, such as source statements, decisions, definition-use associations, or specification items. A representative set of test cases that satisfies all of the r_i's must contain at least one test case from each T_i; such a set is called a hitting set of the group of sets T_1, T_2, ..., T_n. To achieve a maximum reduction, it is necessary to find the smallest representative set of test cases. This subset of the test suite, however, is a minimum-cardinality hitting set of the T_i's, and the problem of finding such a set is NP-complete [Garey79]. Thus, minimization techniques resort to heuristics. Several test suite minimization techniques have been proposed (e.g., [Chen96, Harrold93, Horgan92, Offutt95]); in this work we utilize the technique of Harrold, Gupta, and Soffa [Harrold93].
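To make the flavor of such heuristics concrete, the sketch below applies a simple greedy rule, repeatedly selecting the test case that satisfies the most still-unsatisfied requirements, to a small hypothetical coverage matrix. It is only an illustration of the general approach, not the Harrold, Gupta, and Soffa algorithm used later in this thesis, which additionally orders requirements by how many test cases can satisfy them.

/*
 * Minimal greedy test-suite reduction sketch (illustration only, with
 * hypothetical coverage data).
 */
#include <stdio.h>

#define NUM_TESTS 6
#define NUM_REQS  5

/* covers[t][r] = 1 if test case t satisfies requirement r. */
static const int covers[NUM_TESTS][NUM_REQS] = {
    {1, 0, 0, 1, 0},
    {0, 1, 0, 0, 0},
    {1, 1, 0, 0, 0},
    {0, 0, 1, 0, 1},
    {0, 0, 1, 0, 0},
    {0, 0, 0, 0, 1},
};

int main(void) {
    int satisfied[NUM_REQS] = {0};
    int selected[NUM_TESTS] = {0};
    int remaining = NUM_REQS;

    while (remaining > 0) {
        int best = -1, best_gain = 0;
        /* Choose the unselected test that satisfies the most unsatisfied requirements. */
        for (int t = 0; t < NUM_TESTS; t++) {
            if (selected[t]) continue;
            int gain = 0;
            for (int r = 0; r < NUM_REQS; r++)
                if (!satisfied[r] && covers[t][r]) gain++;
            if (gain > best_gain) { best_gain = gain; best = t; }
        }
        if (best < 0) break;  /* remaining requirements cannot be satisfied by any test */
        selected[best] = 1;
        printf("select test case %d\n", best);
        for (int r = 0; r < NUM_REQS; r++)
            if (covers[best][r] && !satisfied[r]) { satisfied[r] = 1; remaining--; }
    }
    return 0;
}

For the small matrix above, the sketch selects three of the six test cases; like any heuristic for this NP-complete problem, it does not guarantee a minimum-cardinality hitting set.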

2.2. Previous empirical work

Many empirical studies of software testing have been performed. Some of these studies, such as those reported in References [Frankl93, Hutchins94, Wong94], provide only indirect data about the effects of test suite minimization, through consideration of the effects of test suite size on the costs and benefits of testing. Other studies, such as the study reported in Reference [Graves98], provide only indirect data about the effects of test suite minimization, through a comparison of regression test selection techniques that do or do not practice minimization.

1. Whereas minimization considers a program and test suite, regression test selection considers a program, test suite, and modified program version, and selects test cases that are appropriate for that version without removing them from the test suite. The problems of regression test selection and test suite minimization are thus related but distinct. For further discussion of regression test selection, see Reference [Rothermel96].

Recent studies by Wong, Horgan, London, and Mathur [Wong95, Wong98] and by Wong, Horgan, Mathur, and Pasquini [Wong97], however, directly examine the costs and benefits of test suite minimization. We refer to these studies collectively as the Wong studies, and individually as the Wong98 and Wong97 studies. We summarize the results of these studies here; the references provide further details.

2. Reference [Wong98] (1998) extends work reported earlier in Reference [Wong95] (1995); thus, except where otherwise noted, we here focus on the most recent (1998) reference.

2.2.1. The Wong98 study

The Wong98 study involved ten common C UNIX utility programs, including nine programs ranging in size from 90 to 289 lines of code and one program of 842 lines of code.

For each of these programs, the researchers used a random domain-based test generator to generate an initial test case pool; the number of test cases in these pools ranged from 156 to 997. No attempt was made, in generating these pools, to achieve complete coverage of program components (blocks, decisions, or definition-use associations).

The researchers next drew multiple distinct test suites from their test case pools by randomly selecting test cases. The resulting test suites achieved basic block coverages ranging from 50% to 95%; overall, 1198 test suites were generated. Reference [Wong98] reports the sizes of the resulting test suites as averages over groups of test suites that achieved similar coverage: 270 test suites belonged to groups in which average test suite size ranged upward from 9.7 test cases, and 928 test suites belonged to groups in which average test suite size ranged from only 1 to 4.43 test cases.

The researchers enlisted graduate students to inject simple mutation-like faults into each of the subject programs. The researchers excluded faults that could not be detected by any test case. All told, 181 faulty versions of the programs were retained for use in the study. To assess the difficulty of detecting these faults, the researchers measured the percentages of test cases, in the associated test pools, that were able to detect the faults. Of the 181 faults, 78 (43%) were Quartile I faults, detectable by fewer than 25% of the associated test cases; 42 (23%) were Quartile II faults, detectable by between 25% and 50% of the associated test cases; 37 (20%) were Quartile III faults, detectable by between 50% and 75% of the associated test cases; and 24 (13%) were Quartile IV faults, detectable by at least 75% of the associated test cases.

The researchers minimized their test suites using ATACMIN [Horgan92], a minimization tool based on an implicit enumeration algorithm that found exact minimization solutions for all of the test suites utilized in the study.

Test suites were minimized with respect to block, decision, and all-uses dataflow coverage. The researchers measured the reduction in test suite size achieved through minimization, and the reduction in fault-detection effectiveness of the minimized test suites. The researchers also repeated this procedure on the entire test pools (effectively treating these test pools as if they were test suites). Finally, they used null hypothesis checking to determine whether the minimized test suites had better fault detection capabilities than test suites of the same size generated randomly from the unminimized test suites.

The researchers drew several overall conclusions from the study, including the following:

- As the coverage achieved by initial test suites increased, minimization produced greater savings with respect to those test suites, at rates ranging from 0% (for several of the 50-55% coverage suites) to 72.79% (for one of the 90-95% block coverage suites).
- As the coverage achieved by initial test suites increased, minimization produced greater losses in the fault-detection effectiveness of those suites. However, losses in fault detection effectiveness were small compared to savings in test suite size: in all but one case, reductions were less than 7.27 percent, and most reductions were less than 4.99 percent.
- Fault difficulty partially determined whether minimization caused losses in fault-detection effectiveness: Quartile I and II faults were more easily missed than Quartile III and IV faults following minimization.
- The null hypothesis testing showed that minimized test suites retain a size/effectiveness advantage over their random counterparts.

The authors draw the following overall conclusion: "...when the size of a test set is reduced while the coverage is kept constant, there is little or no reduction in its fault detection effectiveness... A test set which is minimized to preserve its coverage is likely to be as effective for detecting faults at a lower execution cost." [Wong98]

2.2.2. The Wong97 study

Whereas the Wong98 study examined test suite minimization on 10 common Unix utilities, the Wong97 study involved a single C application developed for the European Space Agency to aid in the management of large antenna arrays. At 6,100 executable lines, this application is several times the size of the largest program used for the Wong98 study.

Unlike the Wong98 study, in which an initial pool of test cases was generated randomly based solely on program specifications, the Wong97 study used a pool of 1,000 test cases generated based on an operational profile. In the Wong98 study, test suites were generated and categorized based on block coverage. For the Wong97 study, two different procedures were followed for generating test suites: the first to create test suites of fixed size, and the second to create test suites of fixed block coverage.

For the fixed-size test suites, test cases were chosen randomly from the test pool until the desired number of test cases had been selected. In all, 120 test suites were generated in this manner: 30 distinct test suites for each of the target sizes of 50, 100, 150, and 200.

For the fixed-coverage test suites, test cases were chosen randomly from the test pool until the test suite reached the desired coverage. Only test cases that added coverage were added to the fixed-coverage test suites. In all, 180 test suites were generated in this manner: 30 distinct test suites for each of the target coverages, ranging from 50% to 75% block coverage.

Whereas the faults in the Wong98 study were injected by graduate students, the faults used in the Wong97 study were obtained from an error log maintained during the creation of the application. The researchers selected 16 of these faults, of which all but one were detected by fewer than 7% of the test cases, making them similar in detection difficulty to the Quartile I faults used in the Wong98 study. The exceptional fault was detected by 320 (32%) of the test cases.

As in the Wong98 study, all of the test suites were minimized using ATACMIN. In both studies, the size of each test suite was reduced while the coverage was kept constant. In the Wong97 study, however, minimization with respect to block coverage was the only minimization attempted. Reductions in test suite size and in fault detection effectiveness were measured. Finally, null hypothesis testing was used to compare test suites minimized for coverage to test suites that were randomly minimized.
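The "add a test case only if it increases coverage" construction used for the fixed-coverage suites can be sketched as follows. The pool size, block count, coverage predicate, and target coverage here are all hypothetical; the sketch only illustrates the procedure described above.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define POOL_SIZE  1000   /* hypothetical test pool size */
#define NUM_BLOCKS 200    /* hypothetical number of basic blocks */

/* In a real tool this would come from coverage instrumentation; here it is a
   stand-in that assigns each test an arbitrary fixed pattern of blocks. */
static int block_covered_by(int t, int b) {
    return (t + b) % 7 == 0;
}

int main(void) {
    int covered[NUM_BLOCKS] = {0};
    int covered_count = 0;
    double target = 0.60;             /* e.g., a 60% block-coverage target */

    srand((unsigned) time(NULL));
    while ((double) covered_count / NUM_BLOCKS < target) {
        int t = rand() % POOL_SIZE;   /* random draw from the pool */
        int adds_coverage = 0;
        for (int b = 0; b < NUM_BLOCKS; b++)
            if (!covered[b] && block_covered_by(t, b)) {
                covered[b] = 1;
                covered_count++;
                adds_coverage = 1;
            }
        if (adds_coverage)
            printf("add test case %d (coverage now %.1f%%)\n",
                   t, 100.0 * covered_count / NUM_BLOCKS);
        /* test cases that add no new coverage are simply discarded */
    }
    return 0;
}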

The researchers drew the following overall conclusions from the study:

- There were substantial reductions in size achieved by minimizing the fixed-size test suites. For the fixed-coverage test suites, reductions in size also occurred but were smaller.
- As in the Wong98 study, the effectiveness reductions of the minimized test suites were smaller than the size reductions, so that minimized test suites retained a size/effectiveness advantage over the unminimized test suites. The average effectiveness reduction due to minimization was less than 7.3%, and most reductions were less than 3.6%.
- The null hypothesis testing again showed that minimized test suites retain a size/effectiveness advantage over their random counterparts.

Thus, the Wong97 study supports the findings of the Wong98 study, while broadening the scope of the investigation in terms of both the programs under scrutiny and the types of initial test suites utilized.

Chapter 3. Edge-Minimization Experiments

3.1. Research Questions

The Wong studies leave a number of open research questions, primarily concerning the extent to which the results observed in those studies generalize to other testing situations. Among the open questions are the following, which motivate the present work.

1. How does minimization fare in terms of costs and benefits when test suites have a wider range of sizes than the test suites utilized in the Wong studies?
2. How does minimization fare in terms of costs and benefits when test suites are coverage-adequate?
3. How does minimization fare in terms of costs and benefits when test suites contain additional coverage-redundant test cases?

The first and third questions are addressed by the Wong97 study in its use of fixed-size test suites; however, that study examines only one program. Neither of the Wong studies considers the second question.

Test suites used in practice often contain test cases designed not for code coverage but, rather, to exercise product features, specification items, or exceptional behaviors. Such test suites may contain larger numbers of test cases, and larger numbers of coverage-redundant test cases, than the test suites utilized in the Wong98 study or than the coverage-based test suites utilized in the Wong97 study.

Similarly, a typical tactic for utilizing coverage-based testing is to begin with a base of specification-based tests and add additional tests to achieve complete coverage. Such test suites may also contain greater coverage redundancy than the coverage-based test suites utilized in the Wong studies, but can be expected to distribute coverage more evenly than the fixed-size test suites constructed by random selection for the Wong97 study.

It is important to understand the cost-benefit tradeoffs involved in minimizing such test suites. Thus, to investigate these tradeoffs, we performed a family of experiments.

3.2. Measures and Tools

We now discuss the measures and tools utilized in our experiments; subsequent sections discuss the individual experiments. Let T be a test suite, and let T_min be the reduced test suite that results from the application of a minimization technique to T.

3.2.1. Measures

We need to measure both the costs and the savings of test suite minimization.

Measuring savings. Test suite minimization lets testers spend less time executing test cases, examining test results, and managing the data associated with testing. These savings in time depend on the extent to which minimization reduces test suite size. Thus, to measure the savings that can result from test suite minimization, we can follow the methodology used in the Wong studies and measure the reduction in test suite size achieved by minimization.

For each program, we measure savings in terms of the number and the percentage of tests eliminated by minimization. (The former measure provides a notion of the magnitude of the savings; the latter lets us compare and contrast savings across test suites of varying sizes.) The number of tests eliminated is given by (|T| - |T_min|), and the percentage of tests eliminated is given by ((|T| - |T_min|) / |T|) * 100.

This approach makes several assumptions: it assumes that all test cases have uniform costs, it does not differentiate between components of cost such as CPU time or human time, and it does not directly measure the compounding of savings that results from using the minimized test suites over a sequence of subsequent releases. This approach, however, has the advantage of simplicity; using it, we can draw several conclusions that are independent of these assumptions and can compare our results with those achieved in the Wong studies.

Measuring costs. There are two costs to consider with respect to test suite minimization. The first is the cost of executing a minimization tool to produce the minimized test suite. However, a minimization tool can be run following the release of a product, automatically and during off-peak hours, and in this case the cost of running the tool may be noncritical. Moreover, having minimized a test suite, the cost of minimization is amortized over the uses of that suite on subsequent product releases, and thus assumes progressively less significance in relation to other costs.

The second cost to consider is more significant. Test suite minimization may discard some test cases that, if executed, would reveal defects in the software.

Discarding these test cases reduces the fault detection effectiveness of the test suite. The cost of this reduced effectiveness may be compounded over uses of the test suite on subsequent product releases, and the effects of the missed faults may be critical. Thus, in this experiment, we focus on the costs associated with discarding fault-revealing test cases. We considered two methods for calculating reductions in fault detection effectiveness.

On a per-test-case basis: One way to measure the cost of minimization in terms of effects on fault detection, given faulty program P and test suite T, is to identify the test cases in T that reveal a fault in P but are not in T_min. This quantity can be normalized by the number of fault-revealing test cases in T. One problem with this approach is that multiple test cases may reveal a given fault. In this case, some test cases could be discarded without reducing fault-detection effectiveness; this measure penalizes such a decision.

On a per-test-suite basis: Another approach is to classify the results of test suite minimization, relative to a given fault in P, in one of three ways: (1) no test case in T is fault-revealing, and, thus, no test case in T_min is fault-revealing; (2) some test case in both T and T_min is fault-revealing; or (3) some test case in T is fault-revealing, but no test case in T_min is fault-revealing. Case 1 denotes situations in which T is inadequate. Case 2 indicates a use of minimization that does not reduce fault detection, and Case 3 captures situations in which minimization compromises fault detection.

The Wong experiments utilized the second approach; we do the same. For each program, we measure reduced effectiveness in terms of the number and the percentage of faults for which T_min contains no fault-revealing test cases but T does contain fault-revealing test cases. More precisely, if F denotes the number of faults revealed by T over the faulty versions of program P, and F_min denotes the number of faults revealed by T_min over those versions, the number of faults lost is given by (F - F_min), and the percentage reduction in fault-detection effectiveness due to minimization is given by ((F - F_min) / F) * 100.

Note that this method of measuring the cost of minimization calculates cost relative to a fixed set of faults. This approach also assumes that missed faults have equal costs, an assumption that typically does not hold in practice.

3.2.2. Tool infrastructure

To perform our experiments we required several tools. First, we required a test suite minimization tool; to obtain this, we implemented the algorithm of Harrold, Gupta, and Soffa [Harrold93] within the Aristotle program analysis system [Harrold97]. The Aristotle system also provided us with code instrumenters for use in determining edge coverage.
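The two measures defined in Section 3.2.1 are straightforward to compute once |T|, |T_min|, F, and F_min are known; the sketch below uses hypothetical numbers purely to illustrate the arithmetic.

#include <stdio.h>

/* Percentage of test cases eliminated: ((|T| - |Tmin|) / |T|) * 100. */
static double size_reduction(int t, int t_min) {
    return 100.0 * (t - t_min) / t;
}

/* Percentage reduction in fault-detection effectiveness: ((F - Fmin) / F) * 100. */
static double effectiveness_reduction(int f, int f_min) {
    return 100.0 * (f - f_min) / f;
}

int main(void) {
    /* Hypothetical example: a 40-test suite minimized to 8 tests,
       detecting 5 of the 7 faults detected by the original suite. */
    printf("size reduction: %.1f%%\n", size_reduction(40, 8));                   /* 80.0% */
    printf("effectiveness reduction: %.1f%%\n", effectiveness_reduction(7, 5));  /* 28.6% */
    return 0;
}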

3.3. Experiments with smaller C programs

Our first two experiments address our research questions on several small C programs, similar in size to the C utilities utilized in the Wong98 study. In this section we first describe details common to these two experiments; we then report the results of the experiments in turn.

3.3.1. Subject programs, faulty versions, test cases, and test suites

We used seven C programs as subjects (see Table 3-1). The programs range in size from 138 to 516 lines of C code and perform a variety of functions. Each program has several faulty versions, each containing a single fault. Each program also has a large test pool. The programs, versions, and test pools were assembled by researchers at Siemens Corporate Research for a study of the fault-detection capabilities of control-flow and data-flow coverage criteria [Hutchins94]. We refer to these programs collectively as the Siemens programs.

Table 3-1. The Siemens Programs

- totinfo: information measure
- schedule1: priority scheduler
- schedule2: priority scheduler
- tcas: altitude separation
- printtok1: lexical analyzer
- printtok2: lexical analyzer
- replace: pattern replacement

The researchers at Siemens sought to study the fault-detecting effectiveness of coverage criteria. Therefore, they created faulty versions of the seven base programs by manually seeding those programs with faults, usually by modifying a single line of code in the program.

In a few cases they modified between two and five lines of code. Their goal was to introduce faults that were as realistic as possible, based on their experience with real programs. Ten people performed the fault seeding, working mostly without knowledge of each other's work [Hutchins94].

For each of the seven programs, the researchers at Siemens created a large test pool containing possible test cases for the program. To populate these test pools, they first created an initial set of black-box test cases according to good testing practices, based on the tester's "understanding of the program's functionality and knowledge of special values and boundary points that are easily observable in the code" [Hutchins94], using the category partition method and the Siemens Test Specification Language tool [Balcer89, Ostrand88]. They then augmented this set with manually created white-box test cases to ensure that each executable statement, edge, and definition-use pair in the base program or its control flow graph was exercised by at least 30 test cases.

To obtain meaningful results with the seeded versions of the programs, the researchers retained only faults that were "neither too easy nor too hard to detect" [Hutchins94], which they defined as being detectable by at least 3 and at most 350 test cases in the test pool associated with each program.

1. When we execute these faulty versions, we find four faults that are not detected, and three that are detected by only one or two test cases. This difference may be attributable to some factor involving the system on which we are executing our tests; the difference does not impact the results of our study.

Figure 3-1. Percentage of Inputs that Expose Each Fault (boxplots, one per subject program, of the percentage of test-pool inputs that reveal each fault)

Figure 3-1 shows the sensitivity to detection of the faults in the Siemens versions relative to the test pools; the boxplots illustrate that the sensitivities of the faults vary within and between versions, but overall are all lower than 19.77%. Therefore, all of these faults were, in the terminology of the Wong studies, Quartile I faults, detectable by fewer than 25% of the test pool inputs.

2. A boxplot is a standard statistical device for representing data sets [Johnson92]. In these plots, each data set's distribution is represented by a box. The box's height spans the central 50% of the data, and its upper and lower ends mark the upper and lower quartiles. The middle of the three horizontal lines within the box represents the median. The vertical lines attached to the box indicate the tails of the distribution.

To investigate our research questions we required coverage-adequate test suites that exhibit redundancy in coverage, and we required these in a range of sizes. To create these test suites, we utilized the edge coverage criterion.

The edge coverage criterion is similar to the decision coverage criterion used in the Wong98 study, but is defined on control flow graphs.

3. A test suite T is edge-coverage adequate for program P iff, for each edge e in each control flow graph for some procedure in P, if e is dynamically exercisable, then there exists at least one test case t in T that exercises e. A test case t exercises an edge e = (n_1, n_2) in control flow graph G iff t causes execution of the statement associated with n_1, followed immediately by the statement associated with n_2.

We used the Siemens program test pools to obtain coverage-adequate test suites for each subject program. Our test suites consist of a varying number of test cases selected randomly from the associated test pool, together with any additional test cases required to achieve 100% coverage of coverable edges.

4. To randomly select test cases from the test pools, we used the C pseudo-random-number generator rand, seeded initially with the output of the C time system call, to obtain an integer that we treated as an index i into the test pool (modulo the size of that pool).

We did not add any particular test case to any particular test suite more than once. To ensure that these test suites would possess varying ranges of coverage redundancy, we randomly varied the number of randomly selected test cases over sizes ranging from 0 to 0.5 times the number of lines of code in the program. Altogether, we generated 1,000 test suites for each program.

Figure 3-2 provides views of the range of sizes of the test suites created by the process just described. The boxplots illustrate that, for each subject program, our test suite generation procedure yielded a collection of test suites whose sizes are relatively evenly distributed across the range of sizes utilized for that program.
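The random selection described in footnote 4 amounts to the following sketch; the pool size and target count here are hypothetical, and the step that restores 100% edge coverage afterward is omitted.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define POOL_SIZE 2000   /* hypothetical test pool size */
#define TARGET    75     /* hypothetical number of randomly selected test cases */

int main(void) {
    int in_suite[POOL_SIZE] = {0};
    int chosen = 0;

    srand((unsigned) time(NULL));        /* seed rand with the time system call */
    while (chosen < TARGET) {
        int i = rand() % POOL_SIZE;      /* treat the draw as an index into the pool */
        if (!in_suite[i]) {              /* no test case is added to a suite twice */
            in_suite[i] = 1;
            chosen++;
            printf("add test case %d\n", i);
        }
    }
    /* Additional test cases would then be added until every dynamically
       exercisable edge is covered, making the suite edge-coverage adequate. */
    return 0;
}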

Figure 3-2. Size Distribution among Unminimized Test Suites for the Siemens Programs (boxplots of test suite size for each subject program)

The all-uses-coverage-adequate suites are larger on average than the edge-coverage-adequate suites because, in general, more tests are required to achieve all-uses coverage than to achieve edge coverage.

Analysis of the fault-detection effectiveness of these test suites shows that, except for eight of the edge-coverage-based test suites for schedule2, every test suite revealed at least one fault in the set of faulty versions of the associated program. Thus, although each fault individually is difficult to detect relative to the entire test pool for the program, almost all of the test suites utilized in the study possessed at least some fault-detection effectiveness relative to the set of faulty programs utilized.

3.3.2. Experiment design

The experiments were run using a full-factorial design with 1,000 size-reduction and 1,000 effectiveness-reduction measures per cell.

5. The single exception involved schedule2, for which only 992 measures were available with respect to edge-coverage-based test suites, due to the exclusion of the eight test suites that did not expose any faults.

The independent variables manipulated were:

- The subject program (the seven programs, each with a variety of faulty versions).
- Test suite size (between 0 and 0.5 times lines-of-code test cases randomly selected from the test pool, together with additional test cases as necessary to achieve code coverage).

For each subject program, we applied minimization techniques to each of the sample test suites for that program. We then computed the size and effectiveness reductions for these test suites.

3.3.3. Threats to validity

In this section we discuss potential threats to the validity of our experiments with the Siemens programs.

Threats to internal validity are influences that can affect the dependent variables without the researcher's knowledge, and that thus affect any supposition of a causal relationship between the phenomena underlying the independent and dependent variables.

In these experiments, our greatest concerns for internal validity involve the fact that we do not control for the structure of the subject programs or the locality of program changes.

Threats to external validity are conditions that limit our ability to generalize our results. The primary threats to external validity for this study concern the representativeness of the artifacts utilized. The Siemens programs, though nontrivial, are small, and larger programs may be subject to different cost-benefit tradeoffs. Also, there is exactly one seeded fault in each faulty version of the Siemens programs; in practice, programs have much more complex error patterns. Furthermore, the faults in the Siemens programs were deliberately chosen (by the Siemens researchers) to be faults that were relatively difficult to detect. (However, the fact that the faults in these programs were not chosen by us does eliminate one potential source of bias.) Finally, the test suites we utilized represent only two types of test suite that could occur in practice if a mix of non-coverage-based and coverage-based testing were utilized. These threats can only be addressed by additional studies utilizing a wider range of artifacts.

Threats to construct validity arise when measurement instruments do not adequately capture the concepts they are supposed to measure. For example, in this experiment our measures of cost and effectiveness are very coarse: they treat all faults as equally severe and all test cases as equally expensive.

3.4. Minimization of edge-coverage-adequate test suites

Our first experiment addresses our research questions by applying minimization to the Siemens programs and their edge-coverage-adequate test suites. In reporting results, we first consider test suite size reduction, and then we consider fault detection effectiveness reduction.

3.4.1. Test suite size reduction

Figure 3-3 depicts the sizes of the minimized edge-coverage-adequate test suites for the seven Siemens programs, plotted against original test suite size. The data for each program P is depicted by a scatterplot containing a point for each of the test suites utilized for P. As the figure shows, the average size of the minimized test suites ranges from approximately 5 (for tcas) to 12 (for replace). For each program, the minimized test suites demonstrate little variance in size, with tcas exhibiting the least variance (between 4 and 5 test cases) and printtok1 showing the greatest (between 5 and 14 test cases). Considered across the range of original test suite sizes, minimized test suite size for each program is also relatively stable.

Figure 3-4 depicts the percentage reduction in test suite size produced by minimization, in terms of the formula discussed in Section 3.2.1, for each of the subject programs. The data for each program P is represented by a scatterplot containing a point for each of the test suites utilized for P; each point shows the percentage size reduction achieved for a test suite versus the size of that test suite prior to minimization. Visual inspection of the plots indicates a sharp increase in test suite size reduction over the first quartile of test suite sizes, tapering off as size increases beyond the first quartile. The data gives the impression of fitting a hyperbolic curve. To verify the correctness of this impression, we performed least-squares regression to fit the data depicted in these plots with a hyperbolic curve. Table 3-2 shows the best-fit curve for each of the subject programs, along with its square of correlation, r^2.

6. r^2 is a dimensionless index that ranges from zero to 1.0, inclusive, and is the fraction of variation in the values of y that is explained by the least-squares regression of y on x [Moore99].
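The hyperbolic form of these fits is what one would expect given the stability of minimized suite size noted above. Assuming, for illustration only, that the minimized suite size for a program is roughly a constant c, the percentage size reduction for an original suite of x test cases is

    reduction(x) = ((x - |T_min|) / x) * 100  ~  (1 - c/x) * 100,

which is the form y = 100 * (1 - c/x) fitted in Table 3-2, with c close to the program's average minimized suite size.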

Figure 3-3. Size of Minimized vs. Size of Original Test Suites (one scatterplot per subject program: totinfo, schedule1, schedule2, tcas, printtok1, printtok2, and replace; x-axis: original test suite size, y-axis: minimized test suite size, with the average marked)

Figure 3-4. Percent Reduction in Test Suite Size vs. Original Test Suite Size (one scatterplot per subject program: totinfo, schedule1, schedule2, tcas, printtok1, printtok2, and replace)

These fits indicate a strong hyperbolic correlation between percentage reduction in test suite size (the savings of minimization) and original test suite size.

Table 3-2. Correlation Between Size Reduction and Original Size

program    | regression equation         | r^2
totinfo    | y = 100 * (1 - (5.2762/x))  | 0.99
schedule1  | y = 100 * (1 - (.../x))     | 0.96
schedule2  | y = 100 * (1 - (.../x))     | 0.94
tcas       | y = 100 * (1 - (4.9719/x))  | 1.00
printtok1  | y = 100 * (1 - (7.4978/x))  | 0.90
printtok2  | y = 100 * (1 - (6.7776/x))  | 0.93
replace    | y = 100 * (1 - (12.18/x))   | 0.99

Our experimental results indicate that test suite minimization can produce savings in test suite size on coverage-adequate, coverage-redundant test suites. The results also indicate that as test suite size increases, the savings produced by test suite minimization increase; this is a consequence of the relatively stable size of the minimized suites.

3.4.2. Fault detection effectiveness reduction

Figure 3-5 depicts the cost (reduction in fault detection effectiveness) incurred by minimization, in terms of the formula discussed in Section 3.2.1, for each of the seven subject programs. The data for each program P is represented by a scatterplot containing a point for each of the test suites utilized for P; each point shows the percentage reduction in fault detection effectiveness observed for a test suite versus the size of that test suite prior to minimization.

Figure 3-6 illustrates the magnitude of the fault detection effectiveness reduction observed for the seven subject programs. Again, this figure contains a scatterplot for each program; however, we find it most revealing to depict faults detected versus original test suite size, simultaneously for both test suites minimized for edge coverage (black) and original test suites (grey). The solid lines in the plots denote average numbers of faults detected over the range of original test suite sizes; the gap between these lines indicates the magnitude of the fault detection effectiveness reduction for test suites minimized for edge coverage.

The plots show that the fault detection effectiveness of test suites can be severely compromised by minimization. For example, on replace, the largest of the programs, minimization reduces fault-detection effectiveness by over 50% on more than half of the test suites, with average fault loss ranging from 4 faults to 20 across the range of test suite sizes. Also, although there are cases in which minimization does not reduce fault-detection effectiveness (e.g., on printtok1), there are also cases in which minimization reduces the fault-detection effectiveness of test suites by 100% (e.g., on schedule2).

Visual inspection of the plots suggests that reduction in fault detection effectiveness increases slightly as test suite size increases. Test suites in the smallest size ranges do produce effectiveness losses of less than 50% more frequently than they produce losses in excess of 50%, a situation not true of the larger test suites. Even the smallest test suites, however, exhibit effectiveness reductions in most cases: for example, on replace, test suites containing fewer than 50 test cases exhibit an average effectiveness reduction of nearly 40% (a fault detection reduction ranging from 4 to 8 faults), and few such test suites lose no effectiveness.

Figure 3-5. Minimization: Percentage Effectiveness Reduction vs. Original Size (one scatterplot per subject program: totinfo, schedule1, schedule2, tcas, printtok1, printtok2, and replace)

Figure 3-6. Effectiveness in Original and after Minimization vs. Original Size (one plot per subject program showing faults detected vs. original test suite size, for original suites and for suites minimized for edge coverage, with the averages of each marked)

Table 3-3. Minimization: Correlation between Effectiveness Reduction and Original Size

program    | regression line 1 | r^2 | regression line 2 | r^2 | regression line 3      | r^2
totinfo    | y = .13x          | ... | y = 9.56 Ln(x)    | ... | y = -.2x^2 + .44x      | ...
schedule1  | y = .15x          | ... | y = 1.3 Ln(x)     | ... | y = -.2x^2 + ...x      | ...
schedule2  | y = .28x          | ... | y = 17.7 Ln(x)    | ... | y = -.4x^2 + .89x      | ...
tcas       | y = .68x          | ... | y = 22.18 Ln(x)   | ... | y = -.2x^2 + ...x      | ...
printtok1  | y = .16x          | ... | y = 14.68 Ln(x)   | ... | y = -.1x^2 + .44x      | ...
printtok2  | y = .7x           | ... | y = 6.82 Ln(x)    | ... | y = -.1x^2 + .19x      | ...
replace    | y = .11x          | ... | y = 13.7 Ln(x)    | ... | y = -.1x^2 + .41x      | ...

In contrast to the plots of size reduction, the plots of fault detection effectiveness reduction do not give a strong impression of closely fitting any curve or line: the data is much more scattered than the data for test suite size reduction. Our attempts to fit linear, logarithmic, and quadratic regression curves to the data validate this impression: the data in Table 3-3 reveals little linear, logarithmic, or quadratic correlation between reduction in fault detection effectiveness and original test suite size.

These results indicate that test suite minimization can compromise the fault-detection effectiveness of coverage-adequate, coverage-redundant test suites. However, the results only weakly suggest that as test suite size increases, the reduction in the fault-detection effectiveness of those test suites will increase.
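The thesis does not say which tool produced these fits. Purely as an illustration of how such a fit and the r^2 statistic from footnote 6 can be obtained, the sketch below computes a least-squares line and its r^2 for a handful of hypothetical (original size, percentage effectiveness reduction) points; a logarithmic or quadratic fit would substitute ln(x) or add an x^2 term as the regressor.

#include <stdio.h>

/* Fit y = a + b*x by least squares and return r^2, the fraction of the
   variation in y explained by the regression. */
static double r_squared(const double *x, const double *y, int n, double *a, double *b) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) { sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i]; }
    *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);   /* slope */
    *a = (sy - *b * sx) / n;                          /* intercept */
    double mean_y = sy / n, ss_tot = 0, ss_res = 0;
    for (int i = 0; i < n; i++) {
        double fit = *a + *b * x[i];
        ss_tot += (y[i] - mean_y) * (y[i] - mean_y);
        ss_res += (y[i] - fit) * (y[i] - fit);
    }
    return 1.0 - ss_res / ss_tot;
}

int main(void) {
    /* Hypothetical (original suite size, % effectiveness reduction) pairs. */
    double x[] = {10, 20, 40, 80, 160};
    double y[] = {12, 30, 25, 45, 38};
    double a, b;
    double r2 = r_squared(x, y, 5, &a, &b);
    printf("y = %.2f + %.2fx, r^2 = %.2f\n", a, b, r2);
    return 0;
}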


More information

Using Excel for Graphical Analysis of Data

Using Excel for Graphical Analysis of Data Using Excel for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters. Graphs are

More information

Incorporating Varying Test Costs and Fault Severities into Test Case Prioritization

Incorporating Varying Test Costs and Fault Severities into Test Case Prioritization Proceedings of the 23rd International Conference on Software Engineering, May, 1. Incorporating Varying Test Costs and Fault Severities into Test Case Prioritization Sebastian Elbaum Department of Computer

More information

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs. 1 2 Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs. 2. How to construct (in your head!) and interpret confidence intervals.

More information

Averages and Variation

Averages and Variation Averages and Variation 3 Copyright Cengage Learning. All rights reserved. 3.1-1 Section 3.1 Measures of Central Tendency: Mode, Median, and Mean Copyright Cengage Learning. All rights reserved. 3.1-2 Focus

More information

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. 1 CHAPTER 1 Introduction Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. Variable: Any characteristic of a person or thing that can be expressed

More information

Dataflow-based Coverage Criteria

Dataflow-based Coverage Criteria Dataflow-based Coverage Criteria W. Eric Wong Department of Computer Science The University of Texas at Dallas ewong@utdallas.edu http://www.utdallas.edu/~ewong Dataflow-based Coverage Criteria ( 2012

More information

Requirements satisfied : Result Vector : Final : Matrix M. Test cases. Reqmts

Requirements satisfied : Result Vector : Final : Matrix M. Test cases. Reqmts Introduction Control flow/data flow widely studied No definitive answer to effectiveness Not widely accepted Quantitative measure of adequacy criteria Effectiveness Whether cost of testing methods is justified

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

Directed Test Suite Augmentation: Techniques and Tradeoffs

Directed Test Suite Augmentation: Techniques and Tradeoffs Directed Test Suite Augmentation: Techniques and Tradeoffs Zhihong Xu, Yunho Kim, Moonzoo Kim, Gregg Rothermel, Myra B. Cohen Department of Computer Science and Engineering Computer Science Department

More information

Middle School Math Course 3

Middle School Math Course 3 Middle School Math Course 3 Correlation of the ALEKS course Middle School Math Course 3 to the Texas Essential Knowledge and Skills (TEKS) for Mathematics Grade 8 (2012) (1) Mathematical process standards.

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Fault Class Prioritization in Boolean Expressions

Fault Class Prioritization in Boolean Expressions Fault Class Prioritization in Boolean Expressions Ziyuan Wang 1,2 Zhenyu Chen 1 Tsong-Yueh Chen 3 Baowen Xu 1,2 1 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093,

More information

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing 1. Introduction 2. Cutting and Packing Problems 3. Optimisation Techniques 4. Automated Packing Techniques 5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing 6.

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

6. Relational Algebra (Part II)

6. Relational Algebra (Part II) 6. Relational Algebra (Part II) 6.1. Introduction In the previous chapter, we introduced relational algebra as a fundamental model of relational database manipulation. In particular, we defined and discussed

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information

Consistent Measurement of Broadband Availability

Consistent Measurement of Broadband Availability Consistent Measurement of Broadband Availability FCC Data through 12/2015 By Advanced Analytical Consulting Group, Inc. December 2016 Abstract This paper provides several, consistent measures of broadband

More information

Neuro-fuzzy admission control in mobile communications systems

Neuro-fuzzy admission control in mobile communications systems University of Wollongong Thesis Collections University of Wollongong Thesis Collection University of Wollongong Year 2005 Neuro-fuzzy admission control in mobile communications systems Raad Raad University

More information

A MORPHOLOGY-BASED FILTER STRUCTURE FOR EDGE-ENHANCING SMOOTHING

A MORPHOLOGY-BASED FILTER STRUCTURE FOR EDGE-ENHANCING SMOOTHING Proceedings of the 1994 IEEE International Conference on Image Processing (ICIP-94), pp. 530-534. (Austin, Texas, 13-16 November 1994.) A MORPHOLOGY-BASED FILTER STRUCTURE FOR EDGE-ENHANCING SMOOTHING

More information

Chapter 9. Software Testing

Chapter 9. Software Testing Chapter 9. Software Testing Table of Contents Objectives... 1 Introduction to software testing... 1 The testers... 2 The developers... 2 An independent testing team... 2 The customer... 2 Principles of

More information

Online Supplement to Minimax Models for Diverse Routing

Online Supplement to Minimax Models for Diverse Routing Online Supplement to Minimax Models for Diverse Routing James P. Brumbaugh-Smith Douglas R. Shier Department of Mathematics and Computer Science, Manchester College, North Manchester, IN 46962-1276, USA

More information

Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique

Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique James A. Jones and Mary Jean Harrold College of Computing, Georgia Institute of Technology Atlanta, Georgia, U.S.A. jjones@cc.gatech.edu,

More information

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order. Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good

More information

SYS 6021 Linear Statistical Models

SYS 6021 Linear Statistical Models SYS 6021 Linear Statistical Models Project 2 Spam Filters Jinghe Zhang Summary The spambase data and time indexed counts of spams and hams are studied to develop accurate spam filters. Static models are

More information

Supplementary text S6 Comparison studies on simulated data

Supplementary text S6 Comparison studies on simulated data Supplementary text S Comparison studies on simulated data Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath Corresponding author: shorvath@mednet.ucla.edu Overview In this document we illustrate

More information

Lecture Notes 3: Data summarization

Lecture Notes 3: Data summarization Lecture Notes 3: Data summarization Highlights: Average Median Quartiles 5-number summary (and relation to boxplots) Outliers Range & IQR Variance and standard deviation Determining shape using mean &

More information

Maintaining Mutual Consistency for Cached Web Objects

Maintaining Mutual Consistency for Cached Web Objects Maintaining Mutual Consistency for Cached Web Objects Bhuvan Urgaonkar, Anoop George Ninan, Mohammad Salimullah Raunak Prashant Shenoy and Krithi Ramamritham Department of Computer Science, University

More information

Consistent Measurement of Broadband Availability

Consistent Measurement of Broadband Availability Consistent Measurement of Broadband Availability By Advanced Analytical Consulting Group, Inc. September 2016 Abstract This paper provides several, consistent measures of broadband availability from 2009

More information

Review of Regression Test Case Selection Techniques

Review of Regression Test Case Selection Techniques Review of Regression Test Case Selection Manisha Rani CSE Department, DeenBandhuChhotu Ram University of Science and Technology, Murthal, Haryana, India Ajmer Singh CSE Department, DeenBandhuChhotu Ram

More information

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER Akhil Kumar and Michael Stonebraker EECS Department University of California Berkeley, Ca., 94720 Abstract A heuristic query optimizer must choose

More information

UNIT-4 Black Box & White Box Testing

UNIT-4 Black Box & White Box Testing Black Box & White Box Testing Black Box Testing (Functional testing) o Equivalence Partitioning o Boundary Value Analysis o Cause Effect Graphing White Box Testing (Structural testing) o Coverage Testing

More information

Midterm Wednesday Oct. 27, 7pm, room 142

Midterm Wednesday Oct. 27, 7pm, room 142 Regression Testing Midterm Wednesday Oct. 27, 7pm, room 142 In class, closed book eam Includes all the material covered up (but not including) symbolic eecution Need to understand the concepts, know the

More information

HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS

HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS An Undergraduate Research Scholars Thesis by DENISE IRVIN Submitted to the Undergraduate Research Scholars program at Texas

More information

Information and Software Technology

Information and Software Technology Information and Software Technology 55 (2013) 897 917 Contents lists available at SciVerse ScienceDirect Information and Software Technology journal homepage: www.elsevier.com/locate/infsof On the adoption

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SAS SAS (originally Statistical Analysis Software ) is a commercial statistical software package based on a powerful programming

More information

Lecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 2.1- #

Lecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 2.1- # Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series by Mario F. Triola Chapter 2 Summarizing and Graphing Data 2-1 Review and Preview 2-2 Frequency Distributions 2-3 Histograms

More information

A Hybrid Recursive Multi-Way Number Partitioning Algorithm

A Hybrid Recursive Multi-Way Number Partitioning Algorithm Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence A Hybrid Recursive Multi-Way Number Partitioning Algorithm Richard E. Korf Computer Science Department University

More information

Using Excel for Graphical Analysis of Data

Using Excel for Graphical Analysis of Data EXERCISE Using Excel for Graphical Analysis of Data Introduction In several upcoming experiments, a primary goal will be to determine the mathematical relationship between two variable physical parameters.

More information

Chapter 2 Describing, Exploring, and Comparing Data

Chapter 2 Describing, Exploring, and Comparing Data Slide 1 Chapter 2 Describing, Exploring, and Comparing Data Slide 2 2-1 Overview 2-2 Frequency Distributions 2-3 Visualizing Data 2-4 Measures of Center 2-5 Measures of Variation 2-6 Measures of Relative

More information

UNIT-4 Black Box & White Box Testing

UNIT-4 Black Box & White Box Testing Black Box & White Box Testing Black Box Testing (Functional testing) o Equivalence Partitioning o Boundary Value Analysis o Cause Effect Graphing White Box Testing (Structural testing) o Coverage Testing

More information

Efficient Regression Test Model for Object Oriented Software

Efficient Regression Test Model for Object Oriented Software Efficient Regression Test Model for Object Oriented Software Swarna Lata Pati College of Engg. & Tech, Bhubaneswar Abstract : This paper presents an efficient regression testing model with an integration

More information

Learning internal representations

Learning internal representations CHAPTER 4 Learning internal representations Introduction In the previous chapter, you trained a single-layered perceptron on the problems AND and OR using the delta rule. This architecture was incapable

More information

Error Analysis, Statistics and Graphing

Error Analysis, Statistics and Graphing Error Analysis, Statistics and Graphing This semester, most of labs we require us to calculate a numerical answer based on the data we obtain. A hard question to answer in most cases is how good is your

More information

STA121: Applied Regression Analysis

STA121: Applied Regression Analysis STA121: Applied Regression Analysis Variable Selection - Chapters 8 in Dielman Artin Department of Statistical Science October 23, 2009 Outline Introduction 1 Introduction 2 3 4 Variable Selection Model

More information

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

More information

Chapter 5. Track Geometry Data Analysis

Chapter 5. Track Geometry Data Analysis Chapter Track Geometry Data Analysis This chapter explains how and why the data collected for the track geometry was manipulated. The results of these studies in the time and frequency domain are addressed.

More information

Fingerprint Classification Using Orientation Field Flow Curves

Fingerprint Classification Using Orientation Field Flow Curves Fingerprint Classification Using Orientation Field Flow Curves Sarat C. Dass Michigan State University sdass@msu.edu Anil K. Jain Michigan State University ain@msu.edu Abstract Manual fingerprint classification

More information

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 23 CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 3.1 DESIGN OF EXPERIMENTS Design of experiments is a systematic approach for investigation of a system or process. A series

More information

Exploiting a database to predict the in-flight stability of the F-16

Exploiting a database to predict the in-flight stability of the F-16 Exploiting a database to predict the in-flight stability of the F-16 David Amsallem and Julien Cortial December 12, 2008 1 Introduction Among the critical phenomena that have to be taken into account when

More information

Graphical Analysis of Data using Microsoft Excel [2016 Version]

Graphical Analysis of Data using Microsoft Excel [2016 Version] Graphical Analysis of Data using Microsoft Excel [2016 Version] Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters.

More information

Fault Localization for Firewall Policies

Fault Localization for Firewall Policies Fault Localization for Firewall Policies JeeHyun Hwang 1 Tao Xie 1 Fei Chen Alex X. Liu 1 Department of Computer Science, North Carolina State University, Raleigh, NC 7695-86 Department of Computer Science

More information

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation Unit 5 SIMULATION THEORY Lesson 39 Learning objective: To learn random number generation. Methods of simulation. Monte Carlo method of simulation You ve already read basics of simulation now I will be

More information

Multi-Way Number Partitioning

Multi-Way Number Partitioning Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09) Multi-Way Number Partitioning Richard E. Korf Computer Science Department University of California,

More information

Data Preprocessing. Data Preprocessing

Data Preprocessing. Data Preprocessing Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

More information

The Elliptic Curve Discrete Logarithm and Functional Graphs

The Elliptic Curve Discrete Logarithm and Functional Graphs Rose-Hulman Institute of Technology Rose-Hulman Scholar Mathematical Sciences Technical Reports (MSTR) Mathematics 7-9-0 The Elliptic Curve Discrete Logarithm and Functional Graphs Christopher J. Evans

More information

Exploring and Understanding Data Using R.

Exploring and Understanding Data Using R. Exploring and Understanding Data Using R. Loading the data into an R data frame: variable

More information

+ Statistical Methods in

+ Statistical Methods in 9/4/013 Statistical Methods in Practice STA/MTH 379 Dr. A. B. W. Manage Associate Professor of Mathematics & Statistics Department of Mathematics & Statistics Sam Houston State University Discovering Statistics

More information

Question 1: What is a code walk-through, and how is it performed?

Question 1: What is a code walk-through, and how is it performed? Question 1: What is a code walk-through, and how is it performed? Response: Code walk-throughs have traditionally been viewed as informal evaluations of code, but more attention is being given to this

More information

USING CONVEX PSEUDO-DATA TO INCREASE PREDICTION ACCURACY

USING CONVEX PSEUDO-DATA TO INCREASE PREDICTION ACCURACY 1 USING CONVEX PSEUDO-DATA TO INCREASE PREDICTION ACCURACY Leo Breiman Statistics Department University of California Berkeley, CA 94720 leo@stat.berkeley.edu ABSTRACT A prediction algorithm is consistent

More information

Ballista Design and Methodology

Ballista Design and Methodology Ballista Design and Methodology October 1997 Philip Koopman Institute for Complex Engineered Systems Carnegie Mellon University Hamershlag Hall D-202 Pittsburgh, PA 15213 koopman@cmu.edu (412) 268-5225

More information

A Controlled Experiment Assessing Test Case Prioritization Techniques via Mutation Faults

A Controlled Experiment Assessing Test Case Prioritization Techniques via Mutation Faults A Controlled Experiment Assessing Test Case Prioritization Techniques via Mutation Faults Hyunsook Do and Gregg Rothermel Department of Computer Science and Engineering University of Nebraska - Lincoln

More information

7 Fractions. Number Sense and Numeration Measurement Geometry and Spatial Sense Patterning and Algebra Data Management and Probability

7 Fractions. Number Sense and Numeration Measurement Geometry and Spatial Sense Patterning and Algebra Data Management and Probability 7 Fractions GRADE 7 FRACTIONS continue to develop proficiency by using fractions in mental strategies and in selecting and justifying use; develop proficiency in adding and subtracting simple fractions;

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

BIO 360: Vertebrate Physiology Lab 9: Graphing in Excel. Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26

BIO 360: Vertebrate Physiology Lab 9: Graphing in Excel. Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26 Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26 INTRODUCTION Graphs are one of the most important aspects of data analysis and presentation of your of data. They are visual representations

More information

Part I: Preliminaries 24

Part I: Preliminaries 24 Contents Preface......................................... 15 Acknowledgements................................... 22 Part I: Preliminaries 24 1. Basics of Software Testing 25 1.1. Humans, errors, and testing.............................

More information

Chapter 3. Set Theory. 3.1 What is a Set?

Chapter 3. Set Theory. 3.1 What is a Set? Chapter 3 Set Theory 3.1 What is a Set? A set is a well-defined collection of objects called elements or members of the set. Here, well-defined means accurately and unambiguously stated or described. Any

More information

Quartile, Deciles, Percentile) Prof. YoginderVerma. Prof. Pankaj Madan Dean- FMS Gurukul Kangri Vishwavidyalaya, Haridwar

Quartile, Deciles, Percentile) Prof. YoginderVerma. Prof. Pankaj Madan Dean- FMS Gurukul Kangri Vishwavidyalaya, Haridwar Paper:5, Quantitative Techniques for Management Decisions Module:6 Measures of Central Tendency: Averages of Positions (Median, Mode, Quartile, Deciles, Percentile) Principal Investigator Co-Principal

More information

WELCOME! Lecture 3 Thommy Perlinger

WELCOME! Lecture 3 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important

More information

Class 17. Discussion. Mutation analysis and testing. Problem Set 7 discuss Readings

Class 17. Discussion. Mutation analysis and testing. Problem Set 7 discuss Readings Class 17 Questions/comments Graders for Problem Set 6 (4); Graders for Problem set 7 (2-3) (solutions for all); will be posted on T-square Regression testing, Instrumentation Final project presentations:

More information

A Virtual Laboratory for Study of Algorithms

A Virtual Laboratory for Study of Algorithms A Virtual Laboratory for Study of Algorithms Thomas E. O'Neil and Scott Kerlin Computer Science Department University of North Dakota Grand Forks, ND 58202-9015 oneil@cs.und.edu Abstract Empirical studies

More information

A Distributed Formation of Orthogonal Convex Polygons in Mesh-Connected Multicomputers

A Distributed Formation of Orthogonal Convex Polygons in Mesh-Connected Multicomputers A Distributed Formation of Orthogonal Convex Polygons in Mesh-Connected Multicomputers Jie Wu Department of Computer Science and Engineering Florida Atlantic University Boca Raton, FL 3343 Abstract The

More information