
PROFILE ANALYSIS TECHNIQUES FOR OBSERVATION-BASED SOFTWARE TESTING

by

DAVID ZAEN LEON CESIN

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Dissertation Adviser: Dr. Andy Podgurski

Department of Electrical Engineering and Computer Science
CASE WESTERN RESERVE UNIVERSITY

January 2005

CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of ____________________, candidate for the Ph.D. degree*.

(signed) ____________________ (chair of the committee)

(date) ____________________

*We also certify that written approval has been obtained for any proprietary material contained therein.

Table of Contents

List of Tables
List of Figures
Abstract

Chapter 1 Introduction
    1.1 Related Work
    1.2 Main Contributions

Chapter 2 Experimental Test Bed
    2.1 Data sets
        2.1.1 Jikes Data Set
        2.1.2 Javac Data Set
        2.1.3 Synthetic GCC Data Set
        2.1.4 Hybrid GCC Data Set
    2.2 Profiling Techniques
    2.3 Dissimilarity Metrics
    2.4 Hierarchical clustering

Chapter 3 Test Suite Visualization
    3.1 Introduction
    3.2 Visualization Techniques
        Correspondence Analysis
        Multidimensional Scaling
            Scaling by Majorizing a Complicated Function (SMACOF)
            Hierarchical MDS
            Energy minimization
            Ordinal MDS
    3.3 Applications
        Comparison of test populations
        Analysis of the profile distribution
        Analysis of the distribution of failures
    3.4 Experimental Section
        Comparison between correspondence analysis and Multidimensional Scaling
        Comparison between Multidimensional Scaling techniques
        Data sets
        Evaluation criteria
        Results
    3.5 Conclusions

Chapter 4 Test case selection
    4.1 Introduction
    4.2 Techniques for Test-Case Selection
    4.3 Experimental Setup
    4.4 Experimental Results and Discussion
        Baseline techniques
        Simple Random Sampling
        Test Suite Minimization
        Distribution-Based Techniques
        Effects of dissimilarity metric
        Effects of granularity
        Comparison between techniques
    4.5 Conclusions

Chapter 5 Test Case Prioritization
    5.1 Introduction
    5.2 Techniques for Test Case Prioritization
    5.3 Experimental Setup
    5.4 Experimental results and discussion
        Coverage-based techniques
        Distribution-based techniques
    5.5 Conclusions

Conclusions
References

List of Tables

Table 1 - Average nearest-neighbor rank change obtained for every combination of dataset and algorithm
Table 2 - Median nearest-neighbor rank change obtained for every combination of dataset and algorithm
Table 3 - Percentage of points whose nearest neighbors had a rank change of zero for every combination of dataset and algorithm
Table 4 - Raw stress values obtained for every combination of dataset and algorithm
Table 5 - Raw energy values obtained for every combination of dataset and algorithm
Table 6 - Results of coverage maximization
Table 7 - Profile granularity comparison for Javac. Percentage of profile elements covered by a test suite that covers all the elements of a different profile
Table 8 - APFD results for coverage-based prioritization techniques
Table 9 - APFD results for the mixed prioritization techniques. The best result for each row is in boldface

List of Figures

Figure 1 - MDS display of the Large GCC data set. Test suite executions in black, user executions in grey
Figure 2 - Sample dendrogram
Figure 3 - Fitting of a line through a two-dimensional cloud of points
Figure 4 - MDS display of the GCC data set, binary dissimilarity. Computed by Classical Scaling + SMACOF
Figure 5 - MDS display of the GCC data set, binary dissimilarity. Computed by Classical Scaling + SMACOF. Lines are drawn between nearest neighbors
Figure 6 - MDS display of the Large GCC data set. Test suite executions in black, user executions in grey
Figure 7 - CA display of the Large GCC data set. Test suite executions in black, user executions in grey
Figure 8 - MDS display of the Large GCC data set. Executions are shaded according to optimization level for the compilation. Darker points have higher optimization
Figure 9 - CA display of the Large GCC data set. Executions are shaded according to optimization level for the compilation. Darker points have higher optimization
Figure 10 - CA display of the Small GCC data set. Column points representing functions. Lighter points represent the functions of the stupid register allocator
Figure 11 - CA display of the Small GCC data set. Executions are shaded according to optimization level for the compilation. Darker points have higher optimization
Figure 12 - MDS display of the Small GCC data set. Executions are shaded according to optimization level for the compilation. Darker points have higher optimization
Figure 13 - MDS display of the Small GCC data set. Stars represent failed executions
Figure 14 - MDS display of the Javac data set. Stars represent failed executions
Figure 15 - MDS display of the Jikes data set. Stars represent failed executions
Figure 16 - CA display of the small GCC data set. Convex hulls represent the result of automated clustering
Figure 17 - MDS display of the small GCC data set. Convex hulls represent the result of automated clustering
Figure 18 - Ordinal MDS display for the Small GCC data set. Binary dissimilarities
Figure 19 - GCC Coverage and Simple Random Sampling results
Figure 20 - Jikes Coverage and Simple Random Sampling results
Figure 21 - Javac Coverage and Simple Random Sampling results
Figure 22 - Frequency with which different numbers of defects were found by coverage maximization. GCC data set. Basic block pair profiles
Figure 23 - GCC One-per-cluster sampling results across dissimilarity metrics
Figure 24 - Jikes One-per-cluster sampling results across dissimilarity metrics
Figure 25 - Javac One-per-cluster sampling results across dissimilarity metrics
Figure 26 - Javac One-per-cluster sampling results across dissimilarity metrics
Figure 27 - GCC One-per-cluster sampling results across profile granularities
Figure 28 - Jikes One-per-cluster sampling results across profile granularities
Figure 29 - Javac One-per-cluster sampling results across profile granularities
Figure 30 - Javac One-per-cluster sampling results across profile granularities. Selected granularities
Figure 31 - Comparison between techniques, GCC data set. Number of defects found
Figure 32 - Comparison between techniques, GCC data set. Number of failures selected
Figure 33 - Comparison between techniques, Jikes data set. Number of defects found
Figure 34 - Comparison between techniques, Jikes data set. Number of failures selected
Figure 35 - Comparison between techniques, Javac data set. Number of defects found
Figure 36 - Comparison between techniques, Javac data set. Number of failures selected
Figure 37 - Repeated coverage results. GCC data set
Figure 38 - Repeated coverage results. Jikes data set
Figure 39 - Repeated coverage results. Javac data set
Figure 40 - Prioritization results. GCC data set. Function call profiles
Figure 41 - Prioritization results. GCC data set. Basic block pair profiles
Figure 42 - Prioritization results. Jikes data set. Function call profiles
Figure 43 - Prioritization results. Jikes data set. Basic block pair profiles
Figure 44 - Prioritization results. Javac data set. Method call profiles
Figure 45 - Prioritization results. Javac data set. Combined profiles
Figure 46 - Prioritization results. Jikes data set. Basic block pair profiles. Combined methods

Profile Analysis Techniques for Observation-Based Software Testing

Abstract

by

David Zaen Leon Cesin

Observation-based testing is a software-testing paradigm based on the idea of observing the behavior of the program when it is executed under a variety of test cases. The runtime behavior of a program can be summarized in profiles, which can then be analyzed for a variety of purposes useful to the tester. This dissertation presents techniques for test suite visualization, test case selection and test case prioritization based on profile data, and includes extensive experiments on large, real-world applications to compare these techniques with ones from the literature.

Test suite visualization is the application of multivariate visualization techniques to profile data in order to visually study the composition of the test suite and its interaction with the program. Two techniques are examined for this purpose, Correspondence Analysis and Multidimensional Scaling, and a novel algorithm for the latter is presented and studied. Example applications of test suite visualization are provided.

Test case selection is the problem of selecting a small set of tests from a large test suite such that the most defects are revealed when this subset is executed. Test case prioritization is the problem of finding an optimal scheduling of the tests in a test suite so that the number of defects found early during testing is maximized. Other researchers have tried to address these problems using profile information, by looking at the amount of code executed by a subset of tests. Dickinson proposed some methods for test-case selection that consider the distribution of the profiles in the profile space by using cluster analysis on the profiles. This work was later extended in conjunction with the author. These methods will be presented in this work, together with novel methods for test case prioritization. Experimental validations and comparisons of all of these methods will be presented, including comparison criteria that were missing from earlier work. The results suggest that profile analysis is a useful tool for software testers, and that studying the distribution of tests in a profile space can be more beneficial than concentrating on code coverage.

Chapter 1 Introduction

Most modern software programs have a very long lifetime, and changes are made throughout their lifecycle, all of which have to be tested. It is common for a program to first undergo testing on each component (unit testing), then testing once the components are assembled into a working program (integration testing), usage testing done by the developers (alpha testing), and limited deployment for testing by the final users (beta testing); then, after the software is completed and widely deployed, the software undergoes small changes, all of which have to be checked to make sure that old functionality is not damaged (regression testing).

Software testing research has traditionally focused on how to generate tests for all the above stages, how to decide when a program has been thoroughly tested, how to write programs so as to simplify the testing process, and so on. Observation-based testing, by contrast, is a new testing paradigm that seeks to leverage pre-existing test sets in ways that can help the testing process. For example, during beta testing users might witness a failure but fail to report it, or they might not realize that something went wrong, since they are not experts on the program. These executions form a large pool of tests which, if recorded and reviewed by a developer, might reveal many defects in the program. On the other hand, the large number of executions would prevent a developer from reviewing all these tests. The techniques presented in this work allow a developer to examine the test suite as a whole and select a subset of tests that are not redundant, on which effort can be concentrated.

Observation-based testing techniques can be used whenever there is a large pool of tests that can be applied to the current program, but only a subset of which can be reviewed by a developer, and sometimes also in cases where these tests can't all be executed in a timely manner but some information is already known about them. With the large amount of testing done on modern programs, and the possibility of recording user executions, it is common for mature software products to have large test suites, which would benefit from these analysis techniques.

The main feature of the techniques presented in this work is that, while traditional techniques consider the effects of each test on the program separately, the techniques presented here also consider the relationships between the behaviors induced by different tests. For example, it is common wisdom among programmers that programs tend to fail in rarely-used corner cases. This implies that when looking at a test suite, if a test is identified that makes the program behave very differently from every other test, it is important to check its output, since it may be more likely to fail.

Observation-based testing relies on using program profiling to characterize test executions and then using multivariate statistical analysis techniques to extract useful information from these profiles. A program profile is a set of statistics about a given program execution. For example, one can summarize the behavior of a program during an execution by the number of times each statement in the program was executed. This produces a profile consisting of a list of numbers, one for every statement, which indicates the execution count of that statement. This kind of low-level information is hard to interpret automatically. For example, it is hard to determine which features of the program were actually used, unless a developer determines by hand which statements implement each feature. On the other hand, these profiles can easily be compared to one another, allowing a developer to determine, for example, whether two tests used the same sections of the program. Moreover, one can compare two profiles using a dissimilarity metric, which measures how different two test executions are from each other by doing a calculation on the two profiles, and this information can be used for software testing purposes.

Analyzing program profiles of executions made by end users is a way of studying the operational distribution of a program [33]. Not all users actually use all the features of a program, or even the same features. Any given program execution may match one of many usage scenarios, each with a different probability. It then becomes necessary to test the software under each of these usage scenarios. By analyzing the distribution of the profiles, one can partition the tests into groups of similar executions, and these would approximate the usage scenarios the tester is interested in.

This dissertation presents techniques for analyzing these profiles for three different purposes. Chapter 3 describes techniques for visually displaying the relationships between the profiles to allow the test suite to be studied as a whole. Chapter 4 describes techniques for selecting a small set of tests out of a large universe of tests in a way that tends to increase the likelihood of finding defects with this subset. Finally, Chapter 5 describes techniques for reordering the tests in a test suite so that defects are likely to be found earlier in the execution of the test suite. A short high-level introduction to these ideas will be given here, and later expanded in the appropriate chapters.

Test suite visualization, as done in Chapter 3, allows the developer to study the distribution of the tests, and of certain subsets in the test suite. For example, Figure 1 shows an MDS display for executions of the GCC compiler, including runs belonging to the GCC test suite, and a set of tests in which GCC was used to compile some standard user programs (Emacs, ls, gcc, etc.).

Figure 1 - MDS display of the Large GCC data set. Test suite executions in black, user executions in grey.

Each point in the display represents a test execution, with dark points representing runs belonging to the test suite and gray points representing user executions. Points are placed close together in the display when the executions they represent are similar, and far apart when they are different. This display contains some horizontal patterns that, after some manual investigation, were found to represent the different optimization levels that can be requested from the compiler. That is, the executions in which the compiler was not asked to perform any optimization were placed in the bottom group in the display. This implies that, as expected, the optimization level heavily influences the runtime behavior of the compiler. Additionally, when looking at the display one can see that the user and test executions are separated, which confirms the belief that hand-written test cases are not representative of how the program is used by end users. This shows that automatic analysis of the profile information can lead to meaningful information about the executions, some of which can be used during software testing.
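As an aside for readers who want to experiment, a display of this kind can be produced from any precomputed dissimilarity matrix with off-the-shelf software. The following Python sketch uses scikit-learn's stock SMACOF-based MDS on a randomly generated stand-in matrix; it is merely an illustration, not the dissertation's own MDS algorithms or data:

    import numpy as np
    from sklearn.manifold import MDS

    # Stand-in for real execution profiles; in practice each row would be
    # a profile and D would come from a metric like those in Chapter 2.
    rng = np.random.default_rng(0)
    profiles = rng.random((50, 200))
    diff = profiles[:, None, :] - profiles[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))  # pairwise Euclidean dissimilarities

    # Place one 2-D point per execution so that inter-point distances
    # approximate the dissimilarities in D.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)
    # coords can now be drawn as a scatter plot, e.g. coloring test-suite
    # runs black and user runs grey as in Figure 1.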

Groups like the ones in Figure 1 can be found automatically by cluster analysis, which is a technique used to partition a set into subsets, or clusters, which contain points that are more similar to each other than to points in other clusters. This forms the basis for the test-selection techniques discussed in Chapter 4. The idea is that, when testing a program, one wants to test all the different behaviors of the program, and by selecting one test from each cluster of tests a large number of different behaviors are tested. This also minimizes the amount of redundancy in the selected subset, since the groups from which the tests are selected are, by construction, maximally different from each other. Sampling from clusters in the profile space can also be used for the related problem of prioritizing test cases, that is, reordering the tests so that more defects are found earlier during the execution of the test suite.

All the experiments presented in this work were performed on large, real-world subject programs. This avoids the questions of whether these algorithms are efficient enough and whether the results even generalize to larger programs, questions common for software engineering research that tests its algorithms only on small, often specially created programs. The test suites employed are likewise large test sets written by third parties. On the other hand, all three programs included in the experiments are compilers, for two different programming languages and written by three different groups. This work does not answer the question of whether these techniques can be applied to other kinds of programs. Some work has been done by other researchers on using these techniques for other kinds of programs [8][44][45], but the analysis was not as thorough as the one presented in this work.

This dissertation is organized as follows. The rest of this chapter presents previous work related to this dissertation, by the author and other researchers, followed by a list of the main contributions to the field contained in this dissertation. Chapter 2 presents background material on profiling and clustering that will be used throughout the dissertation, as well as the experimental setup used to obtain the data that will be analyzed. The actual discussion of techniques and experimental evaluation is separated into three chapters. Chapter 3 contains the description and evaluation of the test suite visualization techniques. Chapter 4 describes all the test case selection techniques and presents experiments for evaluation. Chapter 5 does the same for test case prioritization techniques. Finally, overall conclusions from this work are presented.

1.1 Related Work

This section describes the main research areas that have motivated this research, and related work by other researchers. Additionally, Chapters 3, 4 and 5 include more detailed reviews of related work for visualization, test-case selection and test-case prioritization, respectively.

The idea of examining the distribution of program profiles was first explored by Podgurski et al [35][36], who used cluster analysis of execution profiles and stratified sampling to estimate software reliability, and found that failures are often grouped together in small clusters. This observation was later used by Dickinson in his doctoral work [8] for testing purposes, where he studied the idea of using a similar technique for selecting a subset of tests from a large test suite, and also an adaptive technique to select additional tests once some failures are found. Dickinson, Leon and Podgurski presented additional results for these techniques in [9], and a second adaptive technique in [10]. These experiments found that all of these techniques selected a subset with more failed executions than random sampling did. No evaluation was done to compare them with other techniques in the literature, or to measure the number of defects actually revealed by these subsets, two shortcomings that will be addressed in Chapter 4.

Using multivariate visualization techniques for analyzing the distribution of the tests in the profile space was first proposed by the author in his Master's thesis [28], which included the usage of Correspondence Analysis [15] for studying test suites, including some experiments with small programs that showed the feasibility of the approach. Leon, Podgurski and White [30] extended this work by applying it to a large test set for a real-world application, the GCC C compiler.

Another way to display profile information for software development was demonstrated by Jones et al [25]. They propose displaying the profile information, together with knowledge about test failures and the program text, to aid in debugging. A colored listing of the program is displayed, with lines that were executed only in failed tests colored red, lines executed only in successful tests colored green, and the rest shaded in between. The brightness of the color for each statement is determined by how often that statement is executed. This helps the developer localize faults, by concentrating on lines that were actually executed in failed executions.

Other researchers have proposed the use of profile information for testing purposes without looking at the overall distribution in the profile space, for the objectives of test-case selection and prioritization, particularly in the area of regression test selection. In this scenario it is common to know the changes made to the program, and many techniques have been proposed to make use of this information, together with the profiles from an older version, to select test cases that may be affected by these changes [7][14][18][38][42]. The test case selection work in this dissertation concentrates on the idea of selecting a subset of tests for a single version of the program, based on the profile information.

Perhaps the earliest technique for reducing the size of a test suite based, at least partially, on profile information was proposed by Leung and White [31]. Their work included determining which tests in a regression test suite had become obsolete. Tests were first divided into specification and structural tests, the latter of which exist only to cover additional parts of the program. Leung and White proposed running the test suite and looking at the profile information to determine whether any structural test was no longer adding to the coverage, at which point it could be considered obsolete, thereby reducing the size of the test suite.

Harrold et al [19] formalized the problem of reducing the size of the test suite without decreasing program coverage and gave a heuristic for finding a small subset. Their definition was very broad, including both regression test selection and the test case selection problem. Basically, every test execution satisfies a set of test-case requirements, and the test suite reduction problem then becomes finding the smallest test suite such that all those requirements are met. If these requirements correspond to coverage of statements, then this matches Leung and White's characterization of structural tests. Harrold et al point out that finding the smallest such subset is NP-complete, and therefore suggest a heuristic algorithm that involves first selecting the tests that satisfy the most rarely satisfied requirements. They performed some experiments on data flow profiles of small programs and small test suites and found that the algorithm does reduce the test suite by as much as 60%, but their experiments don't include fault-detection comparisons for these subsets. These experiments were later extended with larger programs in [39], where the rate of fault detection for each test set was evaluated. They found that the reduced test suite was not as effective as the original, and that these results varied widely.

Wong et al [47][48] also consider this problem, which they call test set minimization, and use an exponential algorithm to solve the NP-complete problem exactly rather than approximating it. Their experiments, with relatively small programs and automatically generated test suites, show a marked reduction in the size of some of their test suites, with a small loss in fault-detection ability.

Harder et al [16] propose a different method for, among other things, minimizing test suites, which is based on their concepts of operational abstractions and operational differences. An operational abstraction is basically an approximation of the program's formal specification, which is arrived at by observing the runtime behavior of the program. For example, in [16] they use program invariants at different points of the program as their operational abstractions. Since these are only guesses at the correct specifications, or invariants in this case, a new test case might violate those invariants, at which point the invariants have to be relaxed or modified. Their operational difference technique compares the abstractions deduced from two different test suites to decide which is better. They propose using operational differences for test suite minimization by starting with the whole test suite and examining every test one at a time. Any test that can be removed without changing the operational abstractions induced by the test suite is considered redundant and can be discarded. Notice that in this case, the operational abstractions are a kind of profile of the execution, describing what effects the test had on the program.

A related problem is that of prioritization of a large test suite. Rothermel et al [40][41] point out that some programs can have very large test suites, whose running times are on the order of weeks, so it is advantageous if defects are found earlier in the testing process. They present some techniques for prioritizing test cases based on profile data and also on previous knowledge about the likelihood that any given test would expose a failure, and found that their techniques could outperform a random ordering. Elbaum et al [11] extend this work for use in regression testing, and create new versions that can use knowledge about the changes made to the software to influence the prioritization process. Additional work has been done by various researchers to include additional information, such as historical information about the test suite [27] and varying test costs and fault severities [12], and some of these techniques have been implemented and used in industry [43].

An important question for observation-based testing is the relationship between the profile information and the actual occurrence of failures. Harrold et al [17] evaluated different profile types, or program spectra, and the effect that triggering a failure in the software has on them. They found that the complete-path spectra, path-count spectra and branch-count spectra were all altered during a failed execution. Reps et al [37] proposed examining the effect of the date on path spectra to discover year 2000 problems and related issues. In this dissertation, profile types will be studied in terms of their effect on the results of the test case selection and prioritization techniques presented.

In addition, [34] presents some experiments in which a subset of the profile features that correlate with the occurrence of failures is selected, and the above formulae are then used to calculate distances between failed executions according to those selected features. Their results show that the distances calculated in this way can be used to identify sets of tests that fail because of different defects.

1.2 Main Contributions

This dissertation extends previous work on test suite visualization, and on test case selection and prioritization. This section details the contributions to each area.

Using multivariate visualization for studying test suites was first proposed by the author in [28], which suggested using Correspondence Analysis for this purpose. This work was later extended with more experiments in [30]. This dissertation evaluates another technique, Multidimensional Scaling [5], and compares it with Correspondence Analysis for test suite visualization (see section 3.4.4). Additionally, a new algorithm for computing a Multidimensional Scaling display was developed in order to decrease the error in the display (Section ), and its performance is compared with that of other techniques presented in the literature (Section 3.4.5).

The test case selection techniques evaluated in this dissertation were first described by Dickinson [8] and by Dickinson, Leon and Podgurski [9][10]. The experiments in these original papers only evaluated the number of failed executions selected by their techniques, and only compared their results to those achieved by simple random sampling. Additionally, only function-call counts were used as profiles in their experiments. The experiments in this dissertation extend the previous work in multiple ways. First, the results of the test case selection algorithms are evaluated not only in terms of the number of failed test cases selected, but also in terms of the number of program defects revealed by those tests. Second, the distribution-based techniques were compared against a coverage-based test-suite minimization technique as described in the literature. Finally, all of these experiments were done with different kinds of profiles for each subject program, to study the effects of profile type on the results. We have already published some of these results in [29].

Using distribution-based techniques for test-case prioritization was first proposed by the author and Andy Podgurski in [29]. These experiments are repeated here, including additional profile types for one of the programs. Additionally, two new prioritization algorithms based on the dendrogram of a hierarchical clustering are presented and evaluated. Both of these compare favorably to coverage-based methods, and to the previous distribution-based methods.

In addition, all of the data sets used in these experiments come from large, real-world subject programs, which, until recently, was rare in testing experiments. Evaluating third-party coverage-based techniques with these data sets provides information about whether those techniques generalize to these kinds of programs. Additionally, these experiments include data-flow profiles for a large program (see section 2.2). This kind of profile has previously been used for test-case selection only in very small programs, usually a single function. This is probably the first time data flow across a whole program has been studied for these purposes.

Chapter 2 Experimental Test Bed

The observation-based testing techniques presented in this dissertation were evaluated on data derived from real-world subject programs and test suites. In general, all experiments were done so as to reflect real-world usage of these techniques. This means that the programs under test are all real-world applications written by third parties, and the test suites are also widely distributed. This chapter describes the programs, test suites and profiling tools used to derive this data, as well as some basic algorithms common to the multiple analysis techniques described in this work. Readers interested in the observation-based testing techniques themselves may prefer to skip this chapter. On the other hand, some of the information might be useful to those attempting to implement these techniques in practice, as this chapter describes the methods used when gathering data (profiling) and some concepts used in the analyses discussed in the next chapters.

This chapter is divided into four sections. First is a description of the subject programs and test suites used in this work. Second is an overview of the different profiling techniques available for analyzing executions. Third is a description of the techniques used for comparing profiles in this dissertation, and finally there is a description of the algorithm used throughout this work for clustering executions.

2.1 Data sets

Four data sets were studied for this work. All come from profiling compilers for the C and Java languages. Compilers were chosen because they are complex programs for which there are readily available inputs, which are all easy to replay. All the inputs to the compilers were contained in source files and command-line arguments, thus avoiding the issues with replaying interactive programs. Three compilers were studied: the GCC C compiler, IBM's Jikes compiler and Sun's Javac compiler. All three were tested with synthetic test suites, and GCC was additionally tested with a mix of synthetic tests and normal user inputs. The rest of this section discusses the program and inputs for each of these data sets in more detail, and section 2.2 discusses the different profiling tools used in the experiments.

2.1.1 Jikes Data Set

This data set was obtained by profiling IBM's Jikes compiler for Java, version 1.15 [24]. This compiler is written in C++ and compiles Java source code to Java bytecode. For these experiments, the compiler was tested against IBM's Jacks test suite [20], which tests conformance to the Java Language Specification [21]. This test suite is refined over time; we used the test suite current as of February. Each test comprises one or more source files which can be compiled in one or more program executions. For each test, it is known whether the compiler should accept the source code, accept it and produce warnings, or reject it with an error. Additionally, there are some tests in which the resulting bytecode is then executed to check that it can be loaded by the JVM and that it works correctly. Given this information, the scripts included with the test suite can determine whether the compiler passed the test or not. For simplicity, only those tests in which the compiler is executed only once are considered, since in a failed test with multiple executions it is not known which of the executions worked correctly and which failed.

Additionally, all the failed tests were manually classified by Wes Masri according to their cause, in order to study automated techniques for classification of software failure reports, as reported in [34]. This classification of the failed executions will be used when determining how many defects were found by the test selection and prioritization techniques. Overall, there are 3149 runs in this data set, of which 225 failed; the failures were manually classified into 107 defects.

Jikes was profiled with the gcov profiler, which produced profiles for function calls, basic blocks and basic-block edges. There were 3,644 function execution counts, 11,502 basic-block execution counts, and 12,996 basic-block-edge execution counts (after removing duplicate counts, as described in section 2.2).

2.1.2 Javac Data Set

This test set uses Sun Microsystems' Javac compiler for Java. This is the standard compiler used by most Java programmers, and it is itself written in Java. For this data set, Javac was tested with the same test suite as used for the Jikes data set. Build 1.3.1_02-b02 of the compiler was used. The same version of the Jacks test suite was used for this experiment, but a different number of tests was obtained, since the test suite contains some Javac-specific and some Jikes-specific tests.

Since Javac is written in Java, a different profiler was used than for the other programs. In this case a data-flow profiler was used, which provided profiles for functions, caller-callee pairs, basic blocks, basic-block edges, definition-use pairs and combined profiles, as described in section 2.2. This profiler was written by Andy Podgurski and the author, and works by instrumenting the bytecode of the program.

For this data set, there were 3140 executions, with 223 failures, which were classified into 67 defects. After removing redundant information, this produced 1022 function execution counts, 2123 function-pair counts, 3655 basic-block execution counts, 4307 basic-block-edge execution counts, 9620 def-use-pair execution counts, and combined counts.

2.1.3 Synthetic GCC Data Set

The GNU Compiler Collection includes compilers for different languages, including C, C++, Java, Fortran and others. In this study, the version of the GCC C compiler distributed with the Debian GNU/Linux distribution was used. The GCC team also maintains a regression test suite for their compilers, to which tests are often added when a defect is found or fixed. In order to include tests for defects contained in version 2.95 of the compiler, the tests were conducted using the test suite distributed with a later version of GCC.

To further contrast with the Jikes and Javac data sets, only the execution tests from the test suite were used. These are tests where the compiler accepts the source code without error and produces an executable, but, in failing test cases, the resulting executable is miscompiled and does not work correctly. The defects exposed by these tests should be harder to detect by profile analysis than tests which simply check whether the compiler accepts a language construct. For example, in a test which wrongly rejects some conformant source code, the profile would reflect the printing of an error message and subsequent exit without code generation. By including only execution tests, this data set should reflect the behavior of subtle, hard-to-find defects in a software program.

For this data set, GCC was profiled with the gcov profiler, as explained in section 2.2, providing profiles for function calls, basic blocks and basic-block edges. The test executions were also classified according to the defects they found. In the end, this data set included 3,333 tests, of which 136 failed, exposing 26 defects. The profiles included 2,214 function counts, 28,144 basic-block counts and 36,407 basic-block-edge counts.

2.1.4 Hybrid GCC Data Set

The above data sets are all based around synthetic test cases, since those test suites include tools to determine which tests failed, making these experiments possible. The hybrid GCC data set contains both synthetic tests and operational executions based on real-world usage. This data set will be used in Chapter 3 to compare synthetic tests to operational executions.

The experiments were conducted by profiling the GCC C compiler. Two sets of inputs were used, synthetic tests and operational executions. The synthetic tests consisted of the GCC test suite as shipped with GCC version 2.95 (the test suite for the exact version profiled was not available at the time). All the tests were included, not just the execution tests as in the Synthetic GCC data set, for a total of 6064 executions. For the operational tests, the compiler was used to compile release versions of publicly available programs, including the emacs editor, the bash shell, the bison parser generator, the GNU chess AI, the dejagnu testing framework, the GNU file utilities, the finger network client, the gcc compiler, the gdb debugger, the gettext internationalization package, the GNU interactive tools, the guile scheme plugin framework, the GNU hello world program, the indent pretty printer, the GNU internet utilities, a jpeg library, the less pager, the GNU C++ library, the lynx web browser, the mc file manager, the oleo spreadsheet, the rcs version control software, the GNU shell utilities, the GNU smalltalk implementation, the spell spell checker, sunclock, the wget file download client, xabacus, xscreensaver and the zlib compression library. All of the programs were compiled using their default makefiles, including optimization and other parameters passed to the compiler, in order to imitate the way these programs would be compiled in the field. A total of 32 programs were included, adding 1807 more compilations, for a total of 7871 executions.

For this data set, GCC was profiled using the function-call profiler gprof; only function-call profiles were obtained. A total of 2370 functions were profiled. Notice that for this data set there is no information about which tests succeeded, so it will not be used for the test selection and prioritization experiments. Instead, it will be used in Chapter 3 when examining the use of visualization techniques to study the relationship between the profiles of test executions and those of real-world executions.

2.2 Profiling Techniques

Observation-based testing relies on obtaining data about the behavior that a program exhibits when running under different inputs. This section describes the different profiling techniques used in this work. These are a sampling-based technique, an execution-monitoring technique based on code instrumentation, and a more advanced instrumentation-based technique for data-flow profiling.

The most basic profiling technique is a statistical one, done by sampling the state of the program at regular intervals. It is based on the idea that, at any given time, it is possible to determine what code the program is currently executing, and its context, by simply stopping the program and examining the program counter and call stack. This process can be repeated regularly, hundreds or thousands of times a second, to gather information about the most commonly executed portions of the program. Moreover, this technique allows for an estimation of the percentage of time that the program spends executing each element. Because of this, the technique is very commonly used for code optimization, as it can identify hot spots in the program. That is, it tells the developer which function, loop or basic block the program spends the most time on, and which therefore might be a good candidate for optimization. The sampling method allows this identification to be independent of whether a section of code takes a long time to run, or whether it is a quickly-executing section of code that is executed many times. Many programs are available to perform this kind of analysis; for example, prof and gprof for Unix environments, hprof for Java environments (bundled with Sun's Java Virtual Machine) and Visual Studio 6's profiler for C and C++ programs can all be used for sampling-based profiling.

The downside to this technique is that quickly-executing, rarely-used functions might not be reflected in the profile, given the low likelihood of a sample being taken while such a function is executing. This effect is undesirable when using techniques such as coverage maximization, which intends to make sure that all the code in a program is executed by the test suite (see Chapter 4). Another downside is that if the program is executed twice on the same input, the profiles might be slightly different simply because the samples were taken at different times.
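To make the sampling idea concrete, the following is a minimal, illustrative sketch (not a tool used in this dissertation) of a statistical profiler in Python for Unix-like systems: a profiling timer periodically interrupts the program, and the signal handler records which function the interrupted stack frame belongs to.

    import collections
    import signal

    samples = collections.Counter()

    def _take_sample(signum, frame):
        # "frame" is whatever the program was executing when the timer
        # fired; tallying its function name approximates the fraction of
        # CPU time spent in each function.
        samples[frame.f_code.co_name] += 1

    def run_profiled(func, interval=0.01):
        signal.signal(signal.SIGPROF, _take_sample)
        signal.setitimer(signal.ITIMER_PROF, interval, interval)
        try:
            return func()
        finally:
            signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop sampling

    def workload():
        return sum(i * i for i in range(5_000_000))

    run_profiled(workload)
    print(samples.most_common())

Note how a fast, rarely executed function may never coincide with a sample, which is exactly the drawback described above.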

Another profiling technique involves directly monitoring the execution of each section of code. Two techniques are available for this purpose: code instrumentation, and using properties of the runtime environment.

Code instrumentation means changing the actual executable code of the program to insert bookkeeping statements at select points. For example, statements can be inserted at the beginning of a function, so that a counter is increased whenever that function is executed. This approach is implemented in the GNU coverage profiler gcov. Gcov works in conjunction with GCC to profile every basic block in the program. This is done by passing special command-line flags (-fprofile-arcs and -ftest-coverage) to GCC when the program is being compiled, which cause GCC to insert instrumentation on each basic block during the code-generation phase. When the resulting program is executed, every time a branch between basic blocks is traversed, the inserted instrumentation increases a counter specific to that branch. These counters are written out to a file when the program shuts down. The program gcov reads this information, together with the program's debugging information, and prints a human-readable program listing annotated with execution counts. (For this work, gcov was modified to create a more machine-readable output.) GCC itself can also read this information for use in feedback-based optimization, in which optimization decisions are based on knowledge of branch probabilities derived from the profiles.

In some environments it is possible to monitor the execution of a program without changing the executable code. For example, the profiler included with Microsoft Visual Studio 6.0 can perform code-coverage profiling by setting debugging breakpoints on every function or line of code. Whenever execution reaches such a breakpoint, the program is stopped and a secondary profiler process is informed, which records the reaching of the breakpoint and then resumes the program. This has the advantage of not changing the program, but is very slow, since a context switch occurs every time a function or line of code is executed. Another example of this approach is the Java Virtual Machine Profiler Interface [22], an API which allows a library to be loaded by a Java virtual machine in order to monitor the execution of the program being run by the JVM. One of the capabilities provided is to notify the profiler whenever the JVM starts execution of a new method. This capability was used in [29] to build a function-call profiler for observation-based testing experiments. Again, this technique allows precise monitoring of an execution without changing the program's bytecode.

The third technique used in this work is a more advanced instrumentation-based profiling that also monitors data flow in the program. Basically, data flows between program elements when an instruction B uses data that was stored by an instruction A; (A, B) is called a definition-use pair, or DU pair. The data flow profiler counts, for each pair of statements, how many times data stored by one is used by the other. This includes data stored in local variables, function parameters, class static variables, object member variables and array elements.

The data flow profiler used for the Javac data set was written by Andy Podgurski and the author. It is composed of two parts, the instrumenter and the profiler proper. The instrumenter uses the Byte-Code Engineering Library (BCEL) [4] to modify the program to be profiled, in order to insert calls to the profiler proper in appropriate places. These calls include:

- Notifications of function entry and exit
- Notification of entry into a basic block
- Notification of execution of a data-storage instruction (including modification of local variables, array elements, object fields and static variables)
- Notification of execution of a data-read instruction

At runtime, the instrumented program executes those calls, which notify the profiler proper of these events. The profiler proper's main task is to keep track of the last instruction that defined each data element. Whenever a data-storage notification is received, the data element's last-defining instruction is updated. When a data-read instruction is reached, the profiler looks up the last defining instruction and records the exercise of a DU pair between the defining instruction and the currently executing one. All of these DU-pair usages are tallied, and a list of DU pairs and counts is printed when the program exits. The function entry and exit notifications are needed to correctly handle local variables, by keeping, for each thread, a copy of the call stack with information about the last definition of each local variable. Note that this also allows the profiler to handle multi-threaded programs, including detecting data flow between threads. In addition, the profiler also keeps track of the number of function calls, caller-callee pairs (the number of times each function calls each other function), basic-block executions and transfers of control between basic blocks.

This data-flow profiler works on Java programs, without need for source code. It has been applied to real-world applications successfully, and can keep track of a large number of constructs, as evidenced by the Javac data set.
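The bookkeeping performed by the profiler proper amounts to a simple table update on each notification. The following Python sketch is illustrative only (the actual profiler is written in Java and driven by BCEL-inserted notifications; the element and instruction names below are hypothetical):

    from collections import Counter

    class DUPairProfiler:
        def __init__(self):
            self.last_def = {}          # data element -> last defining instruction
            self.du_counts = Counter()  # (defining instr, using instr) -> count

        def on_store(self, element, instruction):
            # Data-storage notification: remember which instruction defined it.
            self.last_def[element] = instruction

        def on_load(self, element, instruction):
            # Data-read notification: tally a DU pair with the last definer.
            definer = self.last_def.get(element)
            if definer is not None:
                self.du_counts[(definer, instruction)] += 1

    profiler = DUPairProfiler()
    profiler.on_store("x", "m1:5")   # x defined at (hypothetical) instruction m1:5
    profiler.on_load("x", "m1:9")    # x used at m1:9
    profiler.on_load("x", "m2:3")    # same definition, a second use
    print(profiler.du_counts)
    # Counter({('m1:5', 'm1:9'): 1, ('m1:5', 'm2:3'): 1})

In the real profiler, local variables are additionally keyed by a per-thread copy of the call stack, which is what the function entry and exit notifications make possible.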

The above profiling techniques are a good example of the different approaches to profiling a program, and give an overview of the range of information that can be obtained about an execution through profiling. Other profiling methods are available. For example, Xiaohong Wang used variable-value profiling when applying Correspondence Analysis to examine numerical programs, for which control-flow profiles don't reveal much, since most of the code runs most of the time. The idea is to gather information about the values taken by a program's variables at runtime. Independently, the GCC team included a variable-value profiler in GCC 3.4 for use in value-profile-based transformations for code optimization [6], which might be useful for observation-based testing in the future. Wes Masri has been working on information-flow profiling, a more advanced version of data-flow profiling in which all of the variables influencing each variable are identified, which allows interactions between subsystems of a program during each test to be identified. Similarly, dynamic slicing [46] can be used to determine which statements in a program influence which other statements during each test execution. These techniques would produce a much larger amount of data than the ones used in this dissertation, and it is not obvious what changes would need to be made to the current analysis techniques to cope once the full profile exceeds the available memory, e.g. once the profiles become multiple gigabytes in size.

2.3 Dissimilarity Metrics

Once profiles have been gathered for all the tests, it is necessary to be able to compare them meaningfully. Comparing two profiles for equality is simple, since they are just vectors, but it is also useful to determine the degree of similarity between two profiles. For example, in debugging it can be useful, given a test A, to find another, similar test, in order to check whether it fails in the same way.

The question of how to measure the dissimilarity, or distance, between two executions has been examined before. Dickinson [8] described multiple dissimilarity metrics for use in observation-based testing. For example, the simplest measure would be the Euclidean distance:

$d_{Euclidean}(a, b) = \sqrt{\sum_k \left( P_{a,k} - P_{b,k} \right)^2}$

where $P$ is the profile matrix, in which rows represent test executions and columns represent profile features (e.g. functions, basic blocks, etc.); $P_{i,j}$ is the count of how many times test $i$ exercised feature $j$.

Dickinson found that the simple Euclidean distance metric did not work well in observation-based testing. In essence, if a profile feature $j$ can be executed up to 1000 times, and feature $k$ is executed only zero or one times in any given run, the distance between two executions will be affected much more by feature $j$ than by feature $k$, and it may be that feature $k$ is related to a defect while feature $j$ is not. The best-performing dissimilarity metrics in past experiments [8][9][10][28] have been the three that directly deal with this scaling issue. These three were proposed by Dickinson and will be used throughout the current work. They are called the proportional, binary and proportional-binary metrics. All three perform a transformation of the profile matrix, and then apply the Euclidean distance formula to the transformed profiles.

The proportional metric attempts to deal directly with the scaling issue by normalizing every column of the profile matrix so that the minimum value is zero and the maximum is one. The final distance formula then becomes:

$d_{Proportional}(a, b) = \sqrt{\sum_k \left( \frac{P_{a,k} - P_{b,k}}{\max_j P_{j,k} - \min_j P_{j,k}} \right)^2}$

The binary metric is based on the hypothesis that most software defects are related to whether certain parts of the program are executed or not, and not to the actual number of times they are executed. To this end, all of the profile features are transformed into 0 or 1, depending on whether the count was nonzero. The final distance formula is:

$d_{Binary}(a, b) = \sqrt{\sum_k \left( NonZero(P_{a,k}) - NonZero(P_{b,k}) \right)^2}$

That is, for each profile element, the inner summation adds 1 if one test executed the feature but the other did not, and zero otherwise (either both tests executed it, or neither did). Notice that if two tests exercised the same parts of the program, but a different number of times, this distance metric will consider the two runs identical.

The proportional-binary metric combines these two metrics, giving the same weight to the execution information and the count information. This is done by creating a table in which the normalized, proportional profile is augmented with a second set of columns for the binary profile. The Euclidean distance is calculated between the rows of this extended table, giving the final formula:

$d_{Prop\text{-}bin}(a, b) = \sqrt{\sum_k \left( \frac{P_{a,k} - P_{b,k}}{\max_j P_{j,k} - \min_j P_{j,k}} \right)^2 + \sum_k \left( NonZero(P_{a,k}) - NonZero(P_{b,k}) \right)^2}$

These three metrics have been shown to produce good results when used with distribution-based test case selection algorithms (see Chapter 4), and also produce useful displays when combined with Multidimensional Scaling (see Chapter 3).
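For concreteness, the three transformed metrics can be implemented directly on a profile matrix. The Python sketch below is illustrative only, not the implementation used in this dissertation; the guard against zero-range columns is an added assumption for features whose count never varies:

    import numpy as np

    def euclidean(P, a, b):
        return np.sqrt(((P[a] - P[b]) ** 2).sum())

    def proportional(P, a, b):
        # Normalize each column (feature) to the range [0, 1] before comparing.
        span = P.max(axis=0) - P.min(axis=0)
        span[span == 0] = 1.0  # assumed guard: constant features contribute 0
        return np.sqrt((((P[a] - P[b]) / span) ** 2).sum())

    def binary(P, a, b):
        # Compare only which features were executed, not how many times.
        na, nb = (P[a] > 0).astype(float), (P[b] > 0).astype(float)
        return np.sqrt(((na - nb) ** 2).sum())

    def prop_binary(P, a, b):
        # Euclidean distance over the proportional columns augmented with
        # the binary columns.
        return np.sqrt(proportional(P, a, b) ** 2 + binary(P, a, b) ** 2)

    # Two runs covering the same features with different counts are
    # identical under the binary metric but not the proportional one:
    P = np.array([[10.0, 0.0, 3.0],
                  [ 2.0, 0.0, 1.0],
                  [ 0.0, 5.0, 1.0]])
    print(binary(P, 0, 1))        # 0.0
    print(proportional(P, 0, 1))  # ~1.28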

38 2.4 Hierarchical clustering In this dissertation, a recurring theme is that of partitioning test executions into groups of similar executions. This is known in the literature as Cluster Analysis. Cluster analysis algorithms analyze a population in order to produce a partition of the members of the population such that the elements in each partition, or cluster, are more similar to one another than to members of other clusters. Many clustering algorithms have been proposed in the literature, varying in accuracy, runtime performance and other attributes, such as the type of input expected. All clustering algorithms analyze a set of points, but some take as input a coordinate for each point (in our case, the profile counts), and others take as inputs a table of distances between points. The simplest clustering algorithm to describe is the k-means algorithm. This is an iterative algorithm that takes the coordinates of a set of points, and at every step keeps track of a set of k partitions of these points. For each partition, the centroid is calculated by averaging the members coordinate vectors (hence k-means). Once the centroids are computed, points are redistributed to the partition whose mean is closest to that point. These two steps are performed repeatedly until convergence. K-means usually produces a good clustering but is very slow in practice, since it needs to calculate kn distances on every iteration, and it may be necessary to execute many iterations. For observation-based testing research, it is desirable to have an algorithm that is efficient, allows separate calculation of distances, in order to study the influence of the type of metric used (see section 2.3), and that can efficiently produce multiple clusterings of different sizes. In the current work, hierarchical agglomerative clustering with 36

39 Figure 2 - Sample dendrogram average linkage is used. This algorithm takes as input the number of points n, the number of desired clusters k, and the distances between every pair of points. The algorithm works as follows: Start with n clusters, each containing a single point. Distances between these clusters are the given distances between points. For i = n-1 down to k o Find the closest pair of clusters, call these A and B, and remove them from the set of clusters. o Generate a new cluster C, containing all the points of A and B. o Generate new distances from cluster C to all the other remaining clusters. The distance between cluster C and some other cluster D is the average distance between the elements of C and the elements of D. Return the current set of clusters. (k clusters remain) Another possible product of this algorithm is a representation of the progress of the algorithm, called a dendrogram. Figure 2 shows a dendrogram for a possible clustering for a set of 6 points. In this case, runs 1 and 2 were the most similar, and so were joined first, then 3 and 4. Then the cluster containing 3 and 4 was closest to run 5, 37

Another possible product of this algorithm is a representation of the progress of the algorithm, called a dendrogram.

Figure 2 - Sample dendrogram

Figure 2 shows a dendrogram for a possible clustering of a set of 6 points. In this case, runs 1 and 2 were the most similar, and so were joined first, then 3 and 4. Then the cluster containing 3 and 4 was closest to run 5, and so on. The dendrogram allows the user to determine, given a desired number of clusters, which points belong to which cluster, and also how to split or join clusters if a different number of them is desired.

This algorithm has the following advantages:

- Distances between executions are calculated a priori, allowing the developer to choose a dissimilarity metric suitable for the data. Likewise, it allows a researcher the freedom to test different dissimilarity metrics and their effect on the resulting clustering.
- The algorithm is relatively efficient, needing only n-k passes through the data in order to complete. In practice, the running time of this algorithm is small when compared to the time necessary to calculate the matrix of distances between executions.
- The algorithm iteratively calculates clusterings for different numbers of clusters, ranging from the number of points, n, down to some stopping limit, k. This makes it simple to modify the algorithm to print multiple clusterings, of different sizes, in a single run.
- There is a nesting property between clusterings of different sizes produced by this algorithm: the clustering for k+1 clusters is the same as that for k clusters, except that one of the clusters in the latter is split into two in the former. This property will be used for the Hierarchical MDS algorithm (Chapter 3) and also for the dendrogram-guided prioritization algorithm (Section 5.4.2).

The downsides to this algorithm are that:

- The produced clusterings are not as accurate as those produced by other algorithms, e.g., k-means. This hierarchical algorithm is basically a greedy algorithm, and once two clusters are joined, they cannot be separated. It is not clear whether this affects the results of the test-case selection techniques described in Chapter 4.
- The algorithm requires a dissimilarity matrix to be calculated and kept in memory. The calculation of this matrix is a time-consuming process in practice. For research purposes, this is not a problem, since each matrix can be calculated once for each data set and reused for different experiments. Additionally, the size of the dissimilarity matrix poses a constraint on the size of the data set. A test suite with 8000 executions produces a dissimilarity matrix 128MB in size, and this memory requirement grows quadratically with the test suite size.

The rest of this dissertation will repeatedly use this clustering algorithm. The limitations of this algorithm do not affect the data sets studied here. If it becomes necessary to apply the test selection and classification techniques to a much larger data set, in which the calculation of the dissimilarity metric poses a problem, some clustering algorithms specialized for large data sets are discussed in the data mining literature. For example, CLARA [26] uses a two-stage algorithm where a dissimilarity matrix is calculated for a random sample of points and those are clustered first, and then the rest of the points are assigned to those clusters only. Notice that all the dissimilarity metrics in Section 2.3 rely on a transformation of the data followed by a Euclidean distance calculation, so it should be possible to first transform the data and then use a standard clustering algorithm.

These optimizations on the kind of clustering algorithm used are beyond the scope of this dissertation.
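To make the transformation idea just mentioned concrete, the sketch below builds the extended proportional-plus-binary table once and hands it to an off-the-shelf coordinate-based clusterer, since Euclidean distance between its rows equals the proportional-binary dissimilarity. The use of scikit-learn's KMeans is an illustrative assumption, not the dissertation's setup.

import numpy as np
from sklearn.cluster import KMeans

def extended_table(profiles):
    # Rescale counts by each column's range (the proportional columns) and
    # append 0/1 execution indicators (the binary columns).
    rng = profiles.max(axis=0) - profiles.min(axis=0)
    rng[rng == 0] = 1.0  # assumed guard for constant columns
    return np.hstack([profiles / rng, (profiles > 0).astype(float)])

profiles = np.random.poisson(2.0, size=(100, 40)).astype(float)
labels = KMeans(n_clusters=5, n_init=10).fit_predict(extended_table(profiles))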

Chapter 3
Test Suite Visualization

3.1 Introduction

In order to thoroughly test a software program, a large number of tests have to be executed, which makes it hard or impossible for the programmers to know the composition and properties of the test suite. Even for a regression test suite, where each test is introduced by a programmer for a specific purpose, it is impossible for anyone to know all the thousands of tests in the test suite, and what they were meant to test. This makes it hard for the tester to answer questions like:

- Are there redundant tests in the test suite?
- Does the behavior of the software exhibit similar patterns across multiple tests?
- Are some behaviors of the program more heavily tested than others?
- What properties of the tests have the most influence on the global behavior of the program?

Multivariate test suite visualization techniques are useful in answering these kinds of exploratory questions, allowing a tester to get an intuitive feel for the contents of the test suite. These techniques have proven to be invaluable research tools for developing the test selection and prioritization techniques presented in later chapters.

The basic idea of test suite visualization is to create a graphical representation of the profiles for the tests in the suite. By exploring these displays, the developer can examine the runtime behavior of the program on the different tests. Two different techniques are discussed in this chapter, correspondence analysis and multidimensional scaling. Both of these techniques create a scatter plot of points, where each point represents a test in the test suite, and points are placed together in the display if the tests they represent behaved similarly.

This implies that patterns and clusters in the test suite will be reflected in the display. Of course, not all the information about the relationships between tests in a large test suite can be represented in two dimensions, so any such display will have an error associated with it. Part of the objective of this chapter is to select techniques that minimize this error, and that can be computed efficiently for such large data sets. The remainder of this chapter will present the techniques we use for this purpose, a series of examples of the applicability of each technique, and finally a comparison of the different techniques, in terms of usefulness and accuracy.

3.2 Visualization Techniques

Correspondence Analysis

Correspondence Analysis is a technique used to analyze a data matrix by creating a low-dimensional display with the minimum loss of information. Such a display consists of a scatter plot with two sets of points, one representing the rows of the matrix and one representing the columns. The points are arranged in such a way that similar rows in the matrix (those whose values are correlated) will be represented by row points in the display that are close to each other, and likewise for columns. In addition, row points will be placed in the direction of column points representing the columns which most influenced the given row's placement.

In an observation-based testing application, the row points would represent individual test runs, and the columns would represent the profile features, e.g. functions in a function call profile. This way, one can examine the relationship between tests, between functions, and between runs and the functions they use. The use of correspondence analysis for observation-based testing was researched in the author's master's thesis and is discussed there in more depth. The rest of this section will include an overview of the algebra behind the technique and discuss its properties and limitations.

Figure 3 - Fitting of a line through a two-dimensional cloud of points.

Correspondence analysis is based on the principle of orthogonal projections. Consider Figure 3a, which shows a two-dimensional cloud of points that we would like to represent along a line (one dimension) while preserving the geometric relationships between the points as much as possible. This can be done by fitting a line (called the principal axis) through the cloud of points and then projecting the points perpendicularly onto this line, as shown in Figure 3b, giving the display in Figure 3c. The process of selecting the line onto which the points are projected is at first reminiscent of the process of fitting a line by linear regression so as to minimize the sum of squared distances (errors) between the line and the data points.

The difference is that in linear regression the distance is measured as the difference in Y coordinates, while in correspondence analysis the error is the actual distance between the line and the point, that is, measured along a line segment perpendicular to the fitted line.

The process described above can be applied just as easily to an n-dimensional set of points to determine a one-dimensional representation. Usually, this representation will not include all the information in the original n-dimensional data. On the other hand, the information not represented in this display can be easily extracted by projecting the data onto the (n-1)-dimensional hyperplane perpendicular to the principal axis. This results in a cloud of points in n-1 dimensions, which can be analyzed again in the same manner. Repeated applications of this algorithm produce additional principal axes, each perpendicular to the others, together with coordinates along each axis for each original point. This process produces an ordering, giving a first, second, third principal axis and so on. Since each principal axis is fit so as to maximize the amount of information displayed, and each remaining axis only represents a part of the remaining information, the axes end up being ordered so that the earlier axes are more important, and the final ones end up basically representing noise in the data. By using coordinates along two axes, e.g. the first and second principal axes, one can produce a two-dimensional scatter plot that can be displayed on a computer screen and shows the main features of the distribution. Additionally, since the latter principal axes are less important, the program only needs to calculate the first few, which greatly decreases the time required to generate the display.
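The repeated fit-and-project procedure is equivalent to taking the singular value decomposition of the centered data, which yields all principal axes at once, ordered by importance. The sketch below shows that bare projection idea only; it omits the row and column weighting that full correspondence analysis adds, and the names are illustrative.

import numpy as np

def principal_axes_display(X, dims=2):
    # Center the cloud of points, then let the SVD produce the principal
    # axes (rows of Vt), ordered from most to least information captured.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Project perpendicularly onto the first `dims` axes.
    return Xc @ Vt[:dims].T

points = np.random.rand(50, 10)          # 50 points in 10 dimensions
coords = principal_axes_display(points)  # 2-D coordinates for a display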

Another interesting property of this algorithm is that the columns of the matrix can also be represented. The i-th column can be represented by a point along the i-th axis, at a certain distance from the origin, and these points can be projected onto the principal axes the same way as the row points. This makes it easy to see which columns explain the placement of row points along each axis. The closer the column points are to the origin in the projection, the less they influence the placement of the row points. Row points will be placed in the direction of column points in which they have a high count and away from those in which they have a low count.

Correspondence analysis is based on the above principle, but additional calculations are done to give the display more statistical soundness and symmetry, and to make up for the fact that some columns of the data will have different scales. In a correspondence analysis display, two row points are close together if the corresponding rows have values with high correlation. Likewise with columns, as transposing the input table produces the same display, except for swapping the row and column points. As above, row points are placed in the direction of column points whose columns are emphasized in the row, and vice versa. For more detail, the reader is referred to [15] and [28].

In practice, correspondence analysis is defined in terms of the matrix's Singular Value Decomposition (SVD) and its generalized version. Calculating the SVD of a matrix is a well-understood problem, and there are multiple efficient algorithms for calculating it. Since we only want the most significant principal axes, an iterative algorithm can be used to compute the solution effectively. Our implementation of the correspondence analysis algorithm reduces the problem to finding the eigenvalues and eigenvectors of a matrix and then uses an implicitly restarted Lanczos iteration, which efficiently calculates the largest eigenvalues and corresponding eigenvectors [1][2], and is also easily distributed across computers, if necessary.

Our current program can distribute the input data across different computers and perform the calculation in a distributed manner, so long as the total memory across the computers is enough to hold the input. This was very useful when this research was started, but current computers have enough memory to store common data sets. In the future, this program will allow developers to examine data sets much larger than the ones presented in this work.

Multidimensional Scaling

Multidimensional Scaling (MDS) is the name for a family of techniques that, given a set of points and the ideal distances between those points, create a low-dimensional arrangement of points such that the distances in this output arrangement match the input distances as closely as possible. This can be used to produce a two-dimensional scatter plot where each point represents an execution, and the distances approximate the precalculated distances. For software testing, this allows the tester to calculate distances between tests with any dissimilarity metric, and then see a picture representing the way the metric organizes the executions. Also, as described in Section 2.4, clustering algorithms can also use a dissimilarity metric, making MDS useful to display the results of automatic clustering.

The classical approach to MDS uses a dimensionality reduction technique based on orthogonal projections. It was first proposed by Torgerson and Gower and relies on taking the eigen-decomposition of the matrix of squared distances, in a similar manner to the way that correspondence analysis uses the singular value decomposition to arrive at an optimal arrangement of points [5]. This constrains the resulting configuration to an orthogonal projection of the original n-dimensional points, but can be done very efficiently, as it only involves some preprocessing of the matrix and then finding the largest eigenvalues and corresponding eigenvectors of the matrix.
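The following is a sketch of this classical-scaling preprocessing under the usual Torgerson formulation (double-centering the squared distances before the eigen-decomposition). The exact preprocessing used by the dissertation's implementation is not spelled out here, so treat this as the textbook variant.

import numpy as np

def classical_scaling(D, dims=2):
    # Double-center the squared distance matrix...
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # ...then keep the eigenvectors of the largest eigenvalues as coordinates.
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dims]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))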

Modern Multidimensional Scaling is based on the notion of using an iterative approach where the output arrangement is refined at every step to minimize an error function, usually called a loss function. The most common family of such functions are called stress functions, and they are based on the sum of the squared error between the input and output dissimilarities. For example, the raw stress of an arrangement X is defined as:

$$\sigma_r(X) = \sum_{i<j} w_{ij} \left[ \delta_{ij} - d_{ij}(X) \right]^2$$

where $\delta_{ij}$ is the desired dissimilarity between points i and j, and $d_{ij}(X)$ is their distance in the configuration X. Typically, the weight $w_{ij}$ is set to 1 for all i, j, except in those applications where some dissimilarities are unknown, in which case they are set to 0. In our applications we assume all dissimilarities are known, since they can be calculated from the profiles.

The value of the raw stress function is affected by the scale of the dissimilarities, so oftentimes it is normalized as:

$$\sigma_n(X) = \frac{\sigma_r(X)}{\sum_{i<j} w_{ij} \delta_{ij}^2}$$

This measure will be used when comparing different MDS techniques.

Notice that minimizing the above measure does not guarantee that local features will be preserved, but rather that the overall characteristics of the display will be represented closely. This means, in particular, that the nearest neighbors of a point might end up far away in the display as long as the overall error is still low. This effect, combined with the fact that MDS algorithms have to stop at a local minimum of the stress function, makes it so that the nearest-neighbor relationship isn't always preserved.

Figure 4 - MDS display of the GCC data set, binary dissimilarity. Computed by Classical Scaling + SMACOF.

Figure 5 - MDS display of the GCC data set, binary dissimilarity. Computed by Classical Scaling + SMACOF. Lines are drawn between nearest neighbors.

This becomes a problem in observation-based testing, where the nearest-neighbor relationship is important, as we want to be able to graphically see which test is most similar to another one of interest. For example, consider Figure 5, where lines are drawn connecting the points in Figure 4 that were meant to be nearest neighbors. In this case, the algorithm converged to a local minimum where nearest neighbors are far from each other.

Unlike correspondence analysis, there are many variants of MDS, involving different loss functions, algorithms, and ways of arriving at the starting MDS configuration. In the rest of this section some of these different alternatives will be presented, focusing on variants that preserve the nearest-neighbor relationship.

Scaling by Majorizing a Complicated Function (SMACOF)

The basic multidimensional scaling algorithm employed in this work is the iterative majorization algorithm described by Borg and Groenen [5], which they call the SMACOF (Scaling by Majorizing A COmplicated Function) algorithm.

Iterative Majorization attempts to minimize the stress function $\sigma_r(X)$ by iteratively replacing it with a simpler function $\tau(X, Z)$, called the majorization function of $\sigma_r(X)$, with the following properties:

- $\tau(X, Z)$ can be minimized over X in one step;
- $\tau(X, Z)$ dominates $\sigma_r(X)$, that is, $\sigma_r(X) \le \tau(X, Z)$ for all X and Z;
- the majorization function touches the surface defined by $\sigma_r(X)$ at the supporting point Z, that is, $\sigma_r(Z) = \tau(Z, Z)$.

These properties imply that the following inequalities hold for all X and Z, where the minimum of $\tau(X, Z)$ over X is attained at $X^*$:

$$\sigma_r(X^*) \le \tau(X^*, Z) \le \tau(Z, Z) = \sigma_r(Z)$$

At each iteration of a majorization algorithm, the minimum of the function $\tau(X, Z)$ over X is found and, if a stopping condition is not satisfied, Z is replaced by the corresponding value of X. Thus, the algorithm produces a non-increasing sequence of $\sigma$ values, whose last element is a local minimum. Borg and Groenen show that the stress function can be majorized with the following inequality:

$$\sigma_r(X) \le \eta_\delta^2 + \mathrm{tr}(X'VX) - 2\,\mathrm{tr}(X'B(Z)Z) = \tau(X, Z)$$

where $\eta_\delta^2 = \sum_{i<j} w_{ij} \delta_{ij}^2$, tr is the trace function $\mathrm{tr}(X) = \sum_i X_{ii}$, V has components

$$v_{ij} = \begin{cases} -w_{ij} & \text{if } i \ne j \\ \sum_{j=1,\, j \ne i}^{n} w_{ij} & \text{if } i = j \end{cases}$$

and B(Z) is a function of Z that produces a matrix with components:

$$b_{ij} = \begin{cases} -w_{ij}\,\delta_{ij} / d_{ij}(Z) & \text{if } i \ne j \text{ and } d_{ij}(Z) \ne 0 \\ 0 & \text{if } i \ne j \text{ and } d_{ij}(Z) = 0 \\ -\sum_{j=1,\, j \ne i}^{n} b_{ij} & \text{if } i = j \end{cases}$$

and $w_{ij}$, $\delta_{ij}$ and $d_{ij}(Z)$ are as defined above. $\tau(X, Z)$ is minimized when $2VX - 2B(Z)Z = 0$, which can be solved for X as a system of linear equations. This provides a way to generate a new configuration X from a given configuration Z. Since the matrix V is not of full rank, the Moore-Penrose inverse, $V^+ = (V + \mathbf{1}\mathbf{1}')^{-1} - n^{-2}\,\mathbf{1}\mathbf{1}'$, is used instead, and the last term is dropped, generating the recurrence relation

$$X = (V + \mathbf{1}\mathbf{1}')^{-1} B(Z)Z$$

Since this requires multiplication by the inverse of a matrix, the numerical accuracy of this algorithm on a fixed-precision floating point system depends on the composition of the weight matrix. If the generated matrix $V + \mathbf{1}\mathbf{1}'$ is ill-conditioned, the algorithm will not converge correctly because of the loss of precision. This is a property inherent to the matrix being inverted, and not of the algorithm that uses it. Section 3.4 includes information as to how often this poses a problem in actual usage.
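For the common unit-weight case, $V + \mathbf{1}\mathbf{1}' = nI$, so the update collapses to the simple Guttman transform $X = B(Z)Z / n$. The following is a compact sketch of that special case; the general case would invert $V + \mathbf{1}\mathbf{1}'$ as above, and all names here are illustrative.

import numpy as np

def pairwise_distances(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def smacof_unit_weights(delta, Z, iters=100):
    # delta: target dissimilarity matrix (floats); Z: starting configuration,
    # e.g. the output of classical scaling.
    n = Z.shape[0]
    for _ in range(iters):
        d = pairwise_distances(Z)
        # ratio[i, j] = delta_ij / d_ij, with 0 where d_ij = 0
        ratio = np.divide(delta, d, out=np.zeros_like(delta), where=d > 0)
        B = -ratio
        np.fill_diagonal(B, ratio.sum(axis=1))  # b_ii = -sum of off-diagonals
        Z = B @ Z / n                           # Guttman transform
    return Z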

This algorithm converges to a locally-optimal solution efficiently. As mentioned before, it does not try to keep any sort of geometrical correspondence between the output arrangement and the input points, but rather tries to move points to minimize the error. This lack of constraints allows it to represent the input more closely, but it loses the ability to easily interpret the display, as there is no feature similar to the column points in correspondence analysis. In practice, we have found that manual examination often allows one to determine what the main features of the distribution are, based on test names and other data.

Notice that this is an iterative algorithm in which the error decreases on every iteration. While this is a reasonable approach, consider Figures 4 and 5. In this case, a set of points have their nearest neighbors placed far away in the display. To put one of the edge points near the other edge, an iterative algorithm would have to move the points towards the middle first, which temporarily increases the error, as these points are meant to be kept far apart from the middle points; this therefore cannot be done by SMACOF. This illustrates the importance of choosing a good initial configuration for the algorithm. Borg and Groenen [5] suggest using classical scaling, as described earlier in this chapter, to generate an initial configuration, but this is what leads to Figure 5. Another simple technique is to use a random starting configuration, but this is not likely to produce better results.

Hierarchical MDS

When using iterative algorithms to calculate a display, points that start out far apart tend to stay apart, since position changes from one iteration to another are small, and sometimes, in order to move a point to its optimal spot, the total stress might have to temporarily increase, which an iterative algorithm will avoid. Since clustering can be used to determine a grouping of points such that similar points are together, one can use the results of clustering to guide the placement of points, or to arrive at a starting configuration. The first to explore this idea was Basalaj [3], in an incremental algorithm for MDS based on single-link clustering. In Basalaj's algorithm, the clustering results are exploited to speed up the computation without sacrificing too much accuracy: first, in a full scaling step, a subset of the points are placed by an algorithm where all their positions are optimized, thereby creating a set of points to act as a skeleton; then additional points are added using single scaling, which does not affect the position of previously placed points.

The current work presents a new clustering-assisted MDS algorithm, called Hierarchical MDS. This algorithm uses agglomerative hierarchical clustering together with the SMACOF algorithm to create a display where the nearest-neighbor relationship is kept as closely as possible. Hierarchical MDS uses the SMACOF algorithm to come up with a good representation for clusters rather than points, and iteratively increases the number of clusters until only one-element clusters remain. Once an optimal configuration for n clusters is found, one can use that result as the starting configuration for the SMACOF algorithm for n+1 clusters by simply placing the two new cluster points in place of the cluster point to be split, as described below. This makes it so that similar points will be initially placed close together, since they start at the same position once their cluster is split, which makes it easier for the SMACOF algorithm to keep them together if necessary.

To understand how the clusters are split, it is useful to remember that the hierarchical algorithm described in Section 2.4 iteratively joins the most similar pair of clusters to reduce the number of clusters in the set. This process can be summarized in a dendrogram, as shown in Section 2.4. This dendrogram also shows how the process can be reversed: that is, starting with one big cluster, the dendrogram gives a way to split the cluster into two, and, in general, how to arrive at n+1 clusters when one has n clusters.

Hierarchical MDS uses the dendrogram to guide the splitting of clusters into their component parts, and it also uses the distances between clusters calculated by the clustering algorithm as input to SMACOF.

More formally, Hierarchical MDS works in two phases. First is the clustering phase, in which a dendrogram is created by repeatedly joining the most similar pair of clusters. Whenever a new cluster is created, its distance to other clusters is calculated by average linkage; that is, the distance between two clusters $C_a$ and $C_b$ is, by definition, the average over all pairs of objects $a \in C_a$ and $b \in C_b$ of the dissimilarity between a and b. There are other methods of measuring distances between clusters, most notably the minimum- and maximum-linkage methods, where the distance between two clusters is defined as the distance between the closest (or farthest) pair of points a and b, where $a \in C_a$ and $b \in C_b$. Average linkage was selected because it provides a better and less biased approximation of the distances between points across clusters, which lends itself better to minimizing the error in the display.

In the second phase, HMDS undoes the clustering, one step at a time, as follows:

1. When producing a k-dimensional display, start with the k+1 clusters at the top of the dendrogram. One can always calculate a configuration for k+1 points in k dimensions without any error. For simplicity, one can assign random initial positions to the corresponding points and use the SMACOF algorithm with the cluster dissimilarities (computed during the clustering phase) to reposition the points in k dimensions.

2. For a = 3 to n, where n is the number of execution profiles:
   a. Let $C_i$ and $C_j$ be the two clusters that were merged during the clustering phase to obtain the current set of clusters, and let $C_m$ be the cluster obtained by merging them. Replace the point corresponding to cluster $C_m$ in the current configuration with two points, for clusters $C_i$ and $C_j$.
   b. For each pair of clusters $C_r$ and $C_s$, set the weight associated with the dissimilarity $\delta_{rs}$ between $C_r$ and $C_s$ to $w_{rs} = |C_r|\,|C_s|$.
   c. Use the SMACOF algorithm with the cluster dissimilarities and associated weights to update the configuration generated in step (2a).

Note that, in practice, the latter iterations often converge after a single step of the SMACOF algorithm in (2c). This is to be expected, since those iterations introduce pairs of points whose clusters are very similar to each other. Since the setup time for the SMACOF algorithm grows with the number of points, one can speed up the procedure by splitting a number of clusters at once before running SMACOF, once enough iterations have been done. In practice, we have found that a good compromise, for data sets of about 3000 points, is to do the first 1000 splits one by one, and then split 500 clusters at a time. This allows the main features of the population to be laid out carefully, and later saves time for the calculation without sacrificing much accuracy, as the later iterations converge after only a few steps of the SMACOF algorithm, even though many new points are added.

Note that the algorithm described above also assigns weights to the dissimilarities before executing the SMACOF algorithm. There are two obvious possibilities for assigning the weights used by SMACOF to calculate stress: one is to set all cluster differences to have the same importance (e.g., unit weights), and the other is to make the cluster weights proportional to the size of the cluster, as shown in step (2b).

The motivation behind the latter option is that the distance $\delta_{rs}$ is meant to approximate the actual distances between the points in clusters r and s, and there are $|C_r|\,|C_s|$ such distances. Assigning these weights forces the SMACOF algorithm to behave as if there were actually that many points being fit, but with the constraint that points in one cluster must be kept together. Experimentally, we found that when using unit weights, freshly split points would sometimes move to opposite sides of a dense cluster containing hundreds or thousands of points. These freshly split points would move even further apart when that dense cluster was split. By using weights proportional to the size of the cluster, points are less likely to move over a point representing a dense cluster, since the distance between these two points is kept more accurately throughout the SMACOF algorithm, rather than being allowed to go to zero and back up.
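The following condenses the two phases of Hierarchical MDS into one routine. It is a simplified sketch: it leans on scipy's average-linkage dendrogram, recomputes cluster dissimilarities from the full distance matrix at every split (which a real implementation would avoid), and delegates the refinement in step (2c) to a caller-supplied refine(delta, W, Z) standing in for weighted SMACOF; the unit-weight sketch earlier could be substituted, ignoring W. All names are illustrative, and `dist` is assumed symmetric with a zero diagonal.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def hierarchical_mds(dist, refine, dims=2, rng=None):
    rng = rng or np.random.default_rng(0)
    n = dist.shape[0]
    merges = linkage(squareform(dist), method="average")
    # Cluster ids: 0..n-1 are singletons; id n+i is the result of merge i.
    members = {i: [i] for i in range(n)}
    for i, row in enumerate(merges):
        members[n + i] = members[int(row[0])] + members[int(row[1])]
    root = n + len(merges) - 1
    active, pos = [root], {root: np.zeros(dims)}
    for step in range(len(merges) - 1, -1, -1):   # undo merges, top-down
        parent, a, b = n + step, int(merges[step, 0]), int(merges[step, 1])
        # Replace the parent's point with two points at (almost) the same spot.
        active.remove(parent)
        active += [a, b]
        for c in (a, b):
            pos[c] = pos[parent] + 1e-6 * rng.standard_normal(dims)
        # Average-linkage dissimilarities and |Cr||Cs| weights between clusters.
        delta = np.array([[dist[np.ix_(members[r], members[s])].mean()
                           for s in active] for r in active])
        np.fill_diagonal(delta, 0.0)
        sizes = [len(members[c]) for c in active]
        W = np.outer(sizes, sizes)
        Z = refine(delta, W, np.array([pos[c] for c in active]))
        pos.update(zip(active, Z))
    return np.array([pos[i] for i in range(n)])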

Energy minimization

Basalaj [3] describes another alternative for obtaining a better representation of small dissimilarities with multidimensional scaling, independently of the algorithm used to calculate it. His technique sets each weight $w_{ij}$ of the stress formula equal to the inverse square of the corresponding dissimilarity value $\delta_{ij}$. This has the effect of penalizing a representation error $(\delta_{ij} - d_{ij})^2$ more if $\delta_{ij}$ is small than if it is large. Basalaj calls the resulting loss function energy:

$$\sigma_E(X) = \frac{2}{n(n-1)} \sum_{i<j} \frac{(\delta_{ij} - d_{ij}(X))^2}{\delta_{ij}^2}$$

If the SMACOF algorithm is used to minimize energy instead of stress, it should in principle do a better job of representing small dissimilarities, possibly at the expense of increased error in representing large dissimilarities. Basalaj states that the energy function may be interpreted as the total energy of a fully connected spring system, with an anchor for each object and springs connecting it to all other anchors. The relaxed length of the spring connecting two anchors is given by the (scaled) dissimilarity between the corresponding objects, and the actual length of the spring is the Euclidean distance between the anchors. The spring constant for the spring connecting anchors i and j is $\delta_{ij}^{-2}$.

One can also use Hierarchical MDS when minimizing energy, by setting the weight for the dissimilarity $\delta_{rs}$ between $C_r$ and $C_s$ to $w_{rs} = |C_r|\,|C_s| / \delta_{rs}^2$. Note that the resulting configuration, while a locally-optimal solution to energy minimization, will not in general be an optimal solution to the stress minimization problem.
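Since energy is just weighted stress with a particular weight choice, the only new code an implementation needs is the weight assignment. A short sketch, with hypothetical helper names:

import numpy as np

def energy_weights(delta):
    # w_ij = 1 / delta_ij^2, leaving zero-dissimilarity pairs at weight 0.
    W = np.zeros_like(delta)
    nz = delta > 0
    W[nz] = 1.0 / delta[nz] ** 2
    return W

def hmds_energy_weights(delta, sizes):
    # Cluster pairs: w_rs = |C_r| |C_s| / delta_rs^2.
    return np.outer(sizes, sizes) * energy_weights(delta)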

Ordinal MDS

For software engineering purposes, the tester is not necessarily interested in the exact value of the distance between two executions, but rather in the ranking of executions according to their distance to a known one. That is, a tester might be interested in finding the nearest neighbor of an execution, the second nearest, and so on. This suggests that it would be useful to choose a visualization algorithm that conserves the ordering of the dissimilarities, even if not necessarily the actual values. Borg and Groenen present a technique for this purpose, called Ordinal Multidimensional Scaling.

The stress function presented earlier defines an MDS model called Absolute MDS. In general, one can modify the stress function to allow transformations of the input dissimilarity. That is, each dissimilarity $\delta_{ij}$ is replaced by a disparity value $f(\delta_{ij})$, where f is an admissible function, chosen in some optimal way. In this case, the formula for raw stress is:

$$\sigma_r(X) = \sum_{i<j} w_{ij} \left[ f(\delta_{ij}) - d_{ij}(X) \right]^2$$

Depending on the kind of function used, MDS models can be divided into metric and nonmetric. Metric models use some transformation of the dissimilarities where the actual magnitude of the values influences the result, such as linear, logarithmic or exponential functions. In nonmetric models, only the ordinal properties (ordering, equality) of the dissimilarities are used in the transformation. For an example of when these transformations might be useful, consider a study where a person is asked how similar certain smells are to each other. These measurements imply a dissimilarity matrix that can be plotted with MDS, but they will also be subjective and will not represent measurements on any formal scale. In this case, it is necessary to decide how much of the data to preserve, and it might make sense to only keep track of the ordering of dissimilarities.

The function f for transforming the dissimilarities is not completely defined in advance; rather, a family of functions is selected, and the function is fit at the same time that the MDS display is calculated, in order to arrive at a function and display that minimize the stress measure. For the SMACOF algorithm, after each iteration is carried out as before, the current distances $d_{ij}(X)$ are calculated from the new point configuration X, and the function f is fit so that the disparities $f(\delta_{ij})$ match the distances in the display as closely as possible.
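In the ordinal case, this fitting step is a monotone regression over the display distances taken in the rank order of the input dissimilarities. Kruskal's up-and-down-blocks algorithm, mentioned below, solves the same problem as the pool-adjacent-violators sketch here, which operates on the vectorized upper-triangle pairs; names are illustrative.

import numpy as np

def monotone_disparities(delta, d):
    # delta, d: 1-D arrays over the point pairs (input dissimilarity and
    # current display distance). Returns disparities f(delta) that are
    # monotone in delta and as close to d as possible (least squares).
    order = np.argsort(delta)
    blocks = []                        # each block: [mean, size]
    for v in d[order].astype(float):
        blocks.append([v, 1.0])
        # Pool adjacent blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, s2 = blocks.pop()
            m1, s1 = blocks.pop()
            blocks.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    fitted = np.concatenate([[m] * int(s) for m, s in blocks])
    disparities = np.empty_like(fitted)
    disparities[order] = fitted        # back to the original pair order
    return disparities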

In ordinal MDS, any transformation that preserves the rank order of the dissimilarities is admissible; that is, f is chosen so that $\delta_{ij} < \delta_{kl}$ implies $f(\delta_{ij}) \le f(\delta_{kl})$, such as, for example, a monotonically increasing step function. Thus, ordinal MDS tries to produce a display in which the ordering of the dissimilarities is preserved, but not necessarily their ratios. Preserving the ordering of dissimilarities between program executions would of course preserve nearest-neighbor relationships, up to ties in the input or output dissimilarities. Kruskal's up-and-down-blocks algorithm for monotone regression can be used to find an optimal update of the disparities in ordinal MDS [5].

3.3 Applications

In this section, some applications of MDS and correspondence analysis will be presented. The data sets presented in Chapter 2 will be displayed with the correspondence analysis and hierarchical MDS algorithms explained in Section 3.2. What follows is a series of examples of different applications of these visualization techniques, as applied to those data sets.

3.3.1 Comparison of test populations

When combining multiple sets of tests, it is possible that both sets make the software behave similarly, and are therefore indistinguishable from each other. One says, in that case, that both sets come from the same operational distribution. An MDS or CA display can be useful when comparing sets of tests, in that if both sets produce similar profiles, they will be indistinguishable in the display.

Figure 6 - MDS display of the Large GCC data set. Test suite executions in black, user executions in grey.

Figure 7 - CA display of the Large GCC data set. Test suite executions in black, user executions in grey.

For a practical example of this technique, consider Figures 6 and 7, which show, respectively, MDS and CA displays of the large GCC data set, with test and operational executions shaded differently. In the displays, test executions are separate from operational executions, indicating that the profiles are different enough from one another. This confirms the intuition that synthetic test cases make the program behave in a different manner than normal usage.

This also has implications for software reliability estimation, that is, the estimation of how often software fails when deployed. If a developer wants to estimate the reliability of the software during normal operations, test executions should not be taken into account, since they have a different distribution, and therefore their reliability is not related to the reliability that the software would achieve in the field. Using actual operational executions instead of test executions is already standard practice in software reliability estimation, because of the above intuition, which is confirmed by these displays.

Figure 8 - MDS display of the Large GCC data set. Executions are shaded according to optimization level for the compilation. Darker points have higher optimization.

Figure 9 - CA display of the Large GCC data set. Executions are shaded according to optimization level for the compilation. Darker points have higher optimization.

Likewise, it should be possible to compare other sets of executions, for example, comparing executions from different users, to check whether there are differences in the way people use the software. Notice that this is partly a subjective comparison. For a more accurate comparison, one can use statistical tests, for example permutation tests, to accurately determine the probability that two sets of executions come from the same distribution. These two techniques are complementary to each other, in a manner similar to the way simple single-variable statistical tests can be used in addition to histograms to compare two populations. Multivariate visualization techniques allow the user to analyze the distributions as a whole and see, for example, which kinds of runs are more common in one set than the other. This application was first explored in [30] on the above data set.
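A sketch of such a permutation test on profile data follows; the test statistic here (the distance between the two sets' mean profiles) is an illustrative choice, not one prescribed by this work.

import numpy as np

def permutation_test(A, B, trials=1000, seed=0):
    # A, B: profile matrices for the two sets of executions.
    rng = np.random.default_rng(seed)
    observed = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    pooled = np.vstack([A, B])
    hits = 0
    for _ in range(trials):
        idx = rng.permutation(len(pooled))
        pa, pb = pooled[idx[:len(A)]], pooled[idx[len(A):]]
        if np.linalg.norm(pa.mean(axis=0) - pb.mean(axis=0)) >= observed:
            hits += 1
    # Estimated probability of a statistic this extreme under relabeling.
    return (hits + 1) / (trials + 1)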

3.3.2 Analysis of the profile distribution

For a developer, it is interesting to study the profiles and to see which characteristics of the execution most influence them. Since multivariate visualization displays reflect the distribution of the profiles, one can analyze the former to get some intuition about the latter.

Figure 10 - CA display of the Small GCC data set. Column points representing functions. Lighter points represent the functions of the stupid register allocator.

Figure 11 - CA display of Small GCC data set. Executions are shaded according to optimization level for the compilation. Darker points have higher optimization.

As an example, consider the small and large GCC data sets. Some linear patterns can be seen in the correspondence analysis display in Figure 9, most notably a mostly-separate group in the left side of the display. When examining the column points (Figure 10) on the left side of the display, it was noticed that among them were the functions for GCC's stupid register allocator, a subsystem used only when GCC is requested to compile a program with no optimization. From this it was hypothesized that the distribution was influenced by the optimization level for the compilation. Figures 9 and 11 show correspondence analysis displays for the large and small GCC data sets, respectively, both of which have a separate section with the no-optimization runs, and higher optimizations are placed towards the right of both displays. Figures 8 and 12 show similar patterns in the MDS displays, where the different optimization levels are somewhat layered on top of one another. This indicates that the optimization level requested of the compiler is a large factor in the execution, as would be expected.

This kind of analysis is also useful when examining subsets of the test suite, e.g. failed executions or tests selected by a selection technique, since it allows one to visually determine the kinds of tests being selected. This application was first studied in [28] and [30]. Several other studies have included this technique since. For example, Joyce Varghese used multidimensional scaling and correspondence analysis displays to analyze the distribution of test executions for a multicasting, file-distribution middleware layer [44].

3.3.3 Analysis of the distribution of failures

For software engineering research, it is interesting to study the distribution of failures in the profile space. This has been very useful in developing the test selection and prioritization techniques presented in Chapters 4 and 5.

Figure 12 - MDS display of the Small GCC data set. Executions are shaded according to optimization level for the compilation. Darker points have higher optimization.

Figure 13 - MDS display of the Small GCC data set. Stars represent failed executions.

For an example of this application, consider Figures 13-15, which show MDS displays of the small GCC, Javac and Jikes data sets, highlighting the position of the failed executions in each test suite. The following conclusions have been drawn from these displays:

- The distribution of failures is uneven. All three displays have regions with many failures and other regions with only successes. This suggests the usage of clustering together with stratified sampling to efficiently find failures.

- Failures are usually concentrated in groups. This means that, once a failure is found, it should be possible to look for other failures with a high probability of success by looking for tests in the neighborhood of the known failures. This suggests the adaptive sampling and failure pursuit techniques in Chapter 4.
- In some cases, the failures produce linear patterns in the display, rather than round ones. This motivated the creation of the failure pursuit algorithm described in Section 4.2, which can follow these patterns through the profile space.

Figure 14 - MDS display of the Javac data set. Stars represent failed executions.

Figure 15 - MDS display of the Jikes data set. Stars represent failed executions.

More work has been done on studying the distribution of defects in profile space and on determining the profile features that influence them; see, for example, [34].

3.4 Experimental Section

This section analyzes the results of the different techniques described in Section 3.2. First, a comparison between correspondence analysis and Multidimensional Scaling will be provided, showing the qualitative advantages of each technique, as a guide for choosing one when evaluating a data set.

Then, a comparison of the different techniques for creating an MDS display, among those discussed in Section 3.2, will be presented.

3.4.1 Comparison between correspondence analysis and Multidimensional Scaling

This chapter has so far described two techniques for multivariate data visualization, but no attempt will be made to select a better one, since these two techniques are complementary to each other. This section describes the differences between the two techniques and shows under which situations one is more suitable than the other.

- Correspondence analysis column points and row points: Correspondence analysis computes display positions not only for each row in the profile matrix (test executions), but also for each column (profile feature). While the algebra relating row and column points is complicated, the intuition is simple to understand: row points are attracted to column points in which that row has a higher count than average. This allows for quick interpretation of the correspondence analysis displays. This technique was used in Section 3.3.2 to understand the patterns in the displays of the small and large GCC data sets. Likewise, the column points relate to one another in that profile features that have similar counts end up together in the display. This allows the exploration of the runtime characteristics of the different parts of the program. While MDS does produce a meaningful display, whose gross features can be related to those in a CA display, MDS does not produce column points, or any other simple way to study the resulting display to understand why points are placed the way they are.

It is possible to use CA to determine the main contributors to the profile distribution, and then see how those correlate to the points' positions in the MDS display.

- Display of automatic clustering results: While CA uses a preset dissimilarity metric, MDS can be applied to any precalculated distance matrix. One useful application of this is to use MDS to visualize the results of automatic clustering, by using the same distance metric for both. For example, Figures 16 and 17 show, respectively, correspondence analysis and multidimensional scaling displays of the small GCC data set. The test executions were clustered into 33 clusters (1% of the number of runs in the test suite) by the automated clustering algorithm discussed in Section 2.4, and convex hulls were drawn around the points corresponding to each of the clusters. While the two displays show the same information, it is easy to see in the MDS display that the runs with no optimization (see Section 3.3.2) are clustered on their own. Then the middle of the display contains two large clusters which enclose most of the tests, and the rest of the clusters encompass smaller subpopulations and outliers. Again, these two displays contain the same information, but the MDS display is easier to interpret.

Figure 16 - CA display of the small GCC data set. Convex hulls represent the result of automated clustering.

Figure 17 - MDS display of the small GCC data set. Convex hulls represent the result of automated clustering.


More information

Week - 01 Lecture - 04 Downloading and installing Python

Week - 01 Lecture - 04 Downloading and installing Python Programming, Data Structures and Algorithms in Python Prof. Madhavan Mukund Department of Computer Science and Engineering Indian Institute of Technology, Madras Week - 01 Lecture - 04 Downloading and

More information

Client-server application testing plan

Client-server application testing plan Client-server application testing plan 1. INTRODUCTION The present plan contains and describes testing strategy principles applied for remote access system testing. The plan is intended to be used by project

More information

Software Testing 2. OOD and Testability. White box vs Black box Testing. Software Testing 2 Semester 1, 2006

Software Testing 2. OOD and Testability. White box vs Black box Testing. Software Testing 2 Semester 1, 2006 Software Testing 2 Jens Dietrich OOD and Testability Component based design and component based unit testing. Design that isolates component that are difficult to test (automatically) (such as user interfaces).

More information

Finding Firmware Defects Class T-18 Sean M. Beatty

Finding Firmware Defects Class T-18 Sean M. Beatty Sean Beatty Sean Beatty is a Principal with High Impact Services in Indianapolis. He holds a BSEE from the University of Wisconsin - Milwaukee. Sean has worked in the embedded systems field since 1986,

More information

Test design techniques

Test design techniques INF3121 : Software Testing 12. 02. 2015 Lecture 4 Test design techniques Lecturer: Raluca Florea INF3121/ 12.02.2015 / Raluca Florea 1 Overview 1. The test development process 2. Categories of test design

More information

Black-box Testing Techniques

Black-box Testing Techniques T-76.5613 Software Testing and Quality Assurance Lecture 4, 20.9.2006 Black-box Testing Techniques SoberIT Black-box test case design techniques Basic techniques Equivalence partitioning Boundary value

More information

Part I: Preliminaries 24

Part I: Preliminaries 24 Contents Preface......................................... 15 Acknowledgements................................... 22 Part I: Preliminaries 24 1. Basics of Software Testing 25 1.1. Humans, errors, and testing.............................

More information

Bridge Course On Software Testing

Bridge Course On Software Testing G. PULLAIAH COLLEGE OF ENGINEERING AND TECHNOLOGY Accredited by NAAC with A Grade of UGC, Approved by AICTE, New Delhi Permanently Affiliated to JNTUA, Ananthapuramu (Recognized by UGC under 2(f) and 12(B)

More information

SOFTWARE DESIGN COSC 4353 / Dr. Raj Singh

SOFTWARE DESIGN COSC 4353 / Dr. Raj Singh SOFTWARE DESIGN COSC 4353 / 6353 Dr. Raj Singh UML - History 2 The Unified Modeling Language (UML) is a general purpose modeling language designed to provide a standard way to visualize the design of a

More information

Language Translation. Compilation vs. interpretation. Compilation diagram. Step 1: compile. Step 2: run. compiler. Compiled program. program.

Language Translation. Compilation vs. interpretation. Compilation diagram. Step 1: compile. Step 2: run. compiler. Compiled program. program. Language Translation Compilation vs. interpretation Compilation diagram Step 1: compile program compiler Compiled program Step 2: run input Compiled program output Language Translation compilation is translation

More information

Configuration Management for Component-based Systems

Configuration Management for Component-based Systems Configuration Management for Component-based Systems Magnus Larsson Ivica Crnkovic Development and Research Department of Computer Science ABB Automation Products AB Mälardalen University 721 59 Västerås,

More information

SOFT 437. Software Performance Analysis. Ch 7&8:Software Measurement and Instrumentation

SOFT 437. Software Performance Analysis. Ch 7&8:Software Measurement and Instrumentation SOFT 437 Software Performance Analysis Ch 7&8: Why do we need data? Data is required to calculate: Software execution model System execution model We assumed that we have required data to calculate these

More information

A CAN-Based Architecture for Highly Reliable Communication Systems

A CAN-Based Architecture for Highly Reliable Communication Systems A CAN-Based Architecture for Highly Reliable Communication Systems H. Hilmer Prof. Dr.-Ing. H.-D. Kochs Gerhard-Mercator-Universität Duisburg, Germany E. Dittmar ABB Network Control and Protection, Ladenburg,

More information

When do We Run a Compiler?

When do We Run a Compiler? When do We Run a Compiler? Prior to execution This is standard. We compile a program once, then use it repeatedly. At the start of each execution We can incorporate values known at the start of the run

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK AUTOMATION TESTING IN SOFTWARE DEVELOPEMENT KALPESH PARMAR Persistent Systems Limited,

More information

Verification and Validation. Assuring that a software system meets a user s needs. Verification vs Validation. The V & V Process

Verification and Validation. Assuring that a software system meets a user s needs. Verification vs Validation. The V & V Process Verification and Validation Assuring that a software system meets a user s needs Ian Sommerville 1995/2000 (Modified by Spiros Mancoridis 1999) Software Engineering, 6th edition. Chapters 19,20 Slide 1

More information

ExMAn: A Generic and Customizable Framework for Experimental Mutation Analysis 1

ExMAn: A Generic and Customizable Framework for Experimental Mutation Analysis 1 ExMAn: A Generic and Customizable Framework for Experimental Mutation Analysis 1 Jeremy S. Bradbury, James R. Cordy, Juergen Dingel School of Computing, Queen s University Kingston, Ontario, Canada {bradbury,

More information

Pretty-printing of kernel data structures

Pretty-printing of kernel data structures Pretty-printing of kernel data structures Daniel Lovasko Charles University in Prague lovasko@freebsd.org Abstract One of the key features of a debugger is the ability to examine memory and the associated

More information

Chapter 8. Achmad Benny Mutiara

Chapter 8. Achmad Benny Mutiara Chapter 8 SOFTWARE-TESTING STRATEGIES Achmad Benny Mutiara amutiara@staff.gunadarma.ac.id 8.1 STATIC-TESTING STRATEGIES Static testing is the systematic examination of a program structure for the purpose

More information

In this Lecture you will Learn: Testing in Software Development Process. What is Software Testing. Static Testing vs.

In this Lecture you will Learn: Testing in Software Development Process. What is Software Testing. Static Testing vs. In this Lecture you will Learn: Testing in Software Development Process Examine the verification and validation activities in software development process stage by stage Introduce some basic concepts of

More information

UNIT 5 - UML STATE DIAGRAMS AND MODELING

UNIT 5 - UML STATE DIAGRAMS AND MODELING UNIT 5 - UML STATE DIAGRAMS AND MODELING UML state diagrams and modeling - Operation contracts- Mapping design to code UML deployment and component diagrams UML state diagrams: State diagrams are used

More information

An Empirical Evaluation of Test Adequacy Criteria for Event-Driven Programs

An Empirical Evaluation of Test Adequacy Criteria for Event-Driven Programs An Empirical Evaluation of Test Adequacy Criteria for Event-Driven Programs Jaymie Strecker Department of Computer Science University of Maryland College Park, MD 20742 November 30, 2006 Abstract In model-based

More information

Computer Science II Lab 3 Testing and Debugging

Computer Science II Lab 3 Testing and Debugging Computer Science II Lab 3 Testing and Debugging Introduction Testing and debugging are important steps in programming. Loosely, you can think of testing as verifying that your program works and debugging

More information

Higher-order Testing. Stuart Anderson. Stuart Anderson Higher-order Testing c 2011

Higher-order Testing. Stuart Anderson. Stuart Anderson Higher-order Testing c 2011 Higher-order Testing Stuart Anderson Defining Higher Order Tests 1 The V-Model V-Model Stages Meyers version of the V-model has a number of stages that relate to distinct testing phases all of which are

More information

Exploring the file system. Johan Montelius HT2016

Exploring the file system. Johan Montelius HT2016 1 Introduction Exploring the file system Johan Montelius HT2016 This is a quite easy exercise but you will learn a lot about how files are represented. We will not look to the actual content of the files

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

Chapter 1: Introduction to Computers and Java

Chapter 1: Introduction to Computers and Java Chapter 1: Introduction to Computers and Java Starting Out with Java: From Control Structures through Objects Fifth Edition by Tony Gaddis Chapter Topics Chapter 1 discusses the following main topics:

More information

8/23/2014. Chapter Topics. Introduction. Java History. Why Program? Java Applications and Applets. Chapter 1: Introduction to Computers and Java

8/23/2014. Chapter Topics. Introduction. Java History. Why Program? Java Applications and Applets. Chapter 1: Introduction to Computers and Java Chapter 1: Introduction to Computers and Java Starting Out with Java: From Control Structures through Objects Fifth Edition by Tony Gaddis Chapter Topics Chapter 1 discusses the following main topics:

More information

Software Testing for Developer Development Testing. Duvan Luong, Ph.D. Operational Excellence Networks

Software Testing for Developer Development Testing. Duvan Luong, Ph.D. Operational Excellence Networks Software Testing for Developer Development Testing Duvan Luong, Ph.D. Operational Excellence Networks Contents R&D Testing Approaches Static Analysis White Box Testing Black Box Testing 4/2/2012 2 Development

More information

Semantic Estimation for Texts in Software Engineering

Semantic Estimation for Texts in Software Engineering Semantic Estimation for Texts in Software Engineering 汇报人 : Reporter:Xiaochen Li Dalian University of Technology, China 大连理工大学 2016 年 11 月 29 日 Oscar Lab 2 Ph.D. candidate at OSCAR Lab, in Dalian University

More information

CA Test Data Manager Key Scenarios

CA Test Data Manager Key Scenarios WHITE PAPER APRIL 2016 CA Test Data Manager Key Scenarios Generate and secure all the data needed for rigorous testing, and provision it to highly distributed teams on demand. Muhammad Arif Application

More information

ISTQB Advanced Level (CTAL)

ISTQB Advanced Level (CTAL) ISTQB Advanced Level (CTAL) 2012 Syllabus - Overview Mike Smith Chairman, Advanced Level Working Group (ALWG) December 2012 Contents 1 2 3 4 5 6 Introduction to ISTQB CTAL 2012: What s changed? CTAL 2012:

More information

Getting Started with Code Coverage/Eclipse

Getting Started with Code Coverage/Eclipse Getting Started with Code Coverage/Eclipse Code Coverage/Eclipse is the modernized GUI for Compuware s Xpediter/Code Coverage product. With it, users can create reports detailing testing efficiency and

More information

BitTorrent Traffic Classification

BitTorrent Traffic Classification BitTorrent Traffic Classification Atwin O. Calchand, Van T. Dinh, Philip Branch, Jason But Centre for Advanced Internet Architectures, Technical Report 090227A Swinburne University of Technology Melbourne,

More information

The Path Not Taken: Maximizing the ROI of Increased Decision Coverage

The Path Not Taken: Maximizing the ROI of Increased Decision Coverage The Path Not Taken: Maximizing the ROI of Increased Decision Coverage Laura Bright Laura_bright@mcafee.com Abstract Measuring code coverage is a popular way to ensure that software is being adequately

More information

Executing Evaluations over Semantic Technologies using the SEALS Platform

Executing Evaluations over Semantic Technologies using the SEALS Platform Executing Evaluations over Semantic Technologies using the SEALS Platform Miguel Esteban-Gutiérrez, Raúl García-Castro, Asunción Gómez-Pérez Ontology Engineering Group, Departamento de Inteligencia Artificial.

More information

Particle Swarm Optimization applied to Pattern Recognition

Particle Swarm Optimization applied to Pattern Recognition Particle Swarm Optimization applied to Pattern Recognition by Abel Mengistu Advisor: Dr. Raheel Ahmad CS Senior Research 2011 Manchester College May, 2011-1 - Table of Contents Introduction... - 3 - Objectives...

More information

Verification Overview Testing Theory and Principles Testing in Practice. Verification. Miaoqing Huang University of Arkansas 1 / 80

Verification Overview Testing Theory and Principles Testing in Practice. Verification. Miaoqing Huang University of Arkansas 1 / 80 1 / 80 Verification Miaoqing Huang University of Arkansas Outline 1 Verification Overview 2 Testing Theory and Principles Theoretical Foundations of Testing Empirical Testing Principles 3 Testing in Practice

More information

Best Practices for Alert Tuning. This white paper will provide best practices for alert tuning to ensure two related outcomes:

Best Practices for Alert Tuning. This white paper will provide best practices for alert tuning to ensure two related outcomes: This white paper will provide best practices for alert tuning to ensure two related outcomes: 1. Monitoring is in place to catch critical conditions and alert the right people 2. Noise is reduced and people

More information

Cate: A System for Analysis and Test of Java Card Applications

Cate: A System for Analysis and Test of Java Card Applications Cate: A System for Analysis and Test of Java Card Applications Peter Pfahler and Jürgen Günther Email:peter@uni-paderborn.de jguenther@orga.com Universität Paderborn, Department of Computer Science, D-33098

More information

Understanding the Open Source Development Model. » The Linux Foundation. November 2011

Understanding the Open Source Development Model. » The Linux Foundation. November 2011 » The Linux Foundation Understanding the Open Source Development Model November 2011 By Ibrahim Haddad (PhD) and Brian Warner, The Linux Foundation A White Paper By The Linux Foundation This paper presents

More information

Ch 1: The Architecture Business Cycle

Ch 1: The Architecture Business Cycle Ch 1: The Architecture Business Cycle For decades, software designers have been taught to build systems based exclusively on the technical requirements. Software architecture encompasses the structures

More information

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs Managing Open Bug Repositories through Bug Report Prioritization Using SVMs Jaweria Kanwal Quaid-i-Azam University, Islamabad kjaweria09@yahoo.com Onaiza Maqbool Quaid-i-Azam University, Islamabad onaiza@qau.edu.pk

More information

Automatic Generation of Execution Traces and Visualizing Sequence Diagrams

Automatic Generation of Execution Traces and Visualizing Sequence Diagrams Automatic Generation of Execution Traces and Visualizing Sequence Diagrams 2IM91- Master Thesis ES Report Author: Kaushik Srinivasan (0786828) Host Company: Research& Development Department, Canon Oce

More information

Compilers. History of Compilers. A compiler allows programmers to ignore the machine-dependent details of programming.

Compilers. History of Compilers. A compiler allows programmers to ignore the machine-dependent details of programming. Compilers Compilers are fundamental to modern computing. They act as translators, transforming human-oriented programming languages into computer-oriented machine languages. To most users, a compiler can

More information

Software Service Engineering

Software Service Engineering Software Service Engineering Lecture 4: Unified Modeling Language Doctor Guangyu Gao Some contents and notes selected from Fowler, M. UML Distilled, 3rd edition. Addison-Wesley Unified Modeling Language

More information

Computer Science and Software Engineering University of Wisconsin - Platteville 9-Software Testing, Verification and Validation

Computer Science and Software Engineering University of Wisconsin - Platteville 9-Software Testing, Verification and Validation Computer Science and Software Engineering University of Wisconsin - Platteville 9-Software Testing, Verification and Validation Yan Shi SE 2730 Lecture Notes Verification and Validation Verification: Are

More information

Software Engineering Testing and Debugging Debugging

Software Engineering Testing and Debugging Debugging Software Engineering Testing and Debugging Debugging Prof. Dr. Peter Thiemann Universität Freiburg 13.07.2009 Today s Topic Last Lecture Bug tracking Program control Design for Debugging Input simplification

More information

BECOME A LOAD TESTING ROCK STAR

BECOME A LOAD TESTING ROCK STAR 3 EASY STEPS TO BECOME A LOAD TESTING ROCK STAR Replicate real life conditions to improve application quality Telerik An Introduction Software load testing is generally understood to consist of exercising

More information

Today. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time

Today. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time Today Lecture 4: We examine clustering in a little more detail; we went over it a somewhat quickly last time The CAD data will return and give us an opportunity to work with curves (!) We then examine

More information

Today s Topic. Software Engineering Testing and Debugging Debugging. Today s Topic. The Main Steps in Systematic Debugging

Today s Topic. Software Engineering Testing and Debugging Debugging. Today s Topic. The Main Steps in Systematic Debugging Today s Topic Software Engineering Testing and Debugging Debugging Prof. Dr. Peter Thiemann Last Lecture Bug tracking Program control Design for Debugging Input simplification Universität Freiburg 22.06.2011

More information

Lecture 20: SW Testing Presented by: Mohammad El-Ramly, PhD

Lecture 20: SW Testing Presented by: Mohammad El-Ramly, PhD Cairo University Faculty of Computers and Information CS251 Software Engineering Lecture 20: SW Testing Presented by: Mohammad El-Ramly, PhD http://www.acadox.com/join/75udwt Outline Definition of Software

More information

Best practices for OO 10 content structuring

Best practices for OO 10 content structuring Best practices for OO 10 content structuring With HP Operations Orchestration 10 two new concepts were introduced: Projects and Content Packs. Both contain flows, operations, and configuration items. Organizations

More information