Multivariate Visualization in Observation-Based Testing


David Leon, Andy Podgurski, and Lee J. White
Electrical Engineering and Computer Science Department, Case Western Reserve University, Olin Building, Cleveland, Ohio, USA
dzl@po.cwru.edu, andy@eecs.cwru.edu, leew@eecs.cwru.edu

ABSTRACT
We explore the use of multivariate visualization techniques to support a new approach to test data selection, called observation-based testing. Applications of multivariate visualization are described, including: evaluating and improving synthetic tests; filtering regression test suites; filtering captured operational executions; comparing test suites; and assessing bug reports. These applications are illustrated by the use of correspondence analysis to analyze test inputs for the GNU GCC compiler.

Keywords
Software testing, observation-based testing, multivariate visualization, multivariate data analysis, data visualization, correspondence analysis.

1 INTRODUCTION
The traditional paradigm for testing software is to construct test cases that cause runtime events that are likely to reveal certain kinds of defects if they are present. Examples of such events include: the use of program features; execution of statements, branches, loops, functions, or other program elements; flow of data between statements or procedures; program variables taking on boundary values or other special values; message passing between objects or processes; GUI events; and synchronization events. It is generally feasible to construct test cases to induce events of interest if the events involve a program's external interfaces, as in functional testing (black-box testing, specification-based testing). However, it is often extremely difficult to create tests that induce specific events internal to a program, as required in structural testing (glass-box testing, code-based testing). For this reason, functional testing is the primary form of testing used in practice. Structural testing, if it is employed at all, usually takes the form of assessing the degree of structural coverage achieved by functional tests, that is, the extent to which the tests induce certain internal events. Structural coverage is assessed by profiling the executions induced by functional tests, that is, by instrumenting or monitoring the program under test in order to collect data about the degree of coverage achieved. If necessary, the functional tests are augmented in an ad hoc manner to improve structural coverage.

The difficulty of constructing test data to induce internal program events suggests an alternative paradigm for testing software. This form of testing, which we call observation-based testing, emphasizes what is relatively easy to do and de-emphasizes what is difficult to do. It calls for first obtaining a large amount of potential test data as expeditiously as possible, e.g., by constructing functional tests, simulating usage scenarios, capturing operational inputs, or reusing existing test suites.
The potential test data is then used to run a version of the software under test that has been instrumented to produce execution profiles characterizing the program's internal events. Next, the potential test data and/or the profiles it induces are analyzed in order to filter the test data: select a smaller set of test data that induces events of interest or that has other desirable properties. To enable large volumes of potential test data to be analyzed inexpensively, the analysis techniques that are used must be fully or partially automated. Finally, the output resulting from the selected tests is checked for conformance to requirements. This last step typically requires manual effort, either in checking actual output or in determining expected output.

Many forms of execution profiling can be used in observation-based testing. For example, one may record the occurrences of any of the kinds of program events that have traditionally been of interest in testing. Typically, a profile takes the form of a vector of event counts, although other forms, such as a call graph, may be used in observation-based testing. Since execution profiles are often very large (ones with thousands of event counts are common), automated help is essential for analyzing them. In structural testing, profiles are usually summarized by computing simple coverage measures, such as the number of program statements that were executed at least once during testing.

However, more sophisticated multivariate data analysis techniques can extract additional information from profile data. For example, [10] and [11] report experiments in which automatic cluster analysis of branch traversal profiles, used together with stratified random sampling, increased the accuracy of software reliability estimates, because it tended to isolate failures in small clusters. Among the most promising multivariate data analysis techniques for use in observation-based testing are multivariate visualization techniques like correspondence analysis and multidimensional scaling. In essence, these computer-intensive techniques project many-dimensional execution profiles onto a two-dimensional display, producing a scatter plot that preserves important relationships between the profiles. This permits a human user to visually observe these relationships and, with the aid of interactive tools, to explore their significance for software testing.

We present the initial results of a project whose goal is to explore the potential applications of multivariate visualization techniques to software testing and, ultimately, to develop a methodology for employing them. Section 2 gives an overview of two applicable multivariate visualization techniques: correspondence analysis and multidimensional scaling. Section 3 describes several applications of multivariate visualization in observation-based testing. Section 4 presents a case study in which correspondence analysis is applied to a large data set. Future research is discussed in Section 5. Related work is surveyed in Section 6. Section 7 concludes.

2 VISUALIZATION TECHNIQUES
In this section, we present a brief overview of two multivariate visualization techniques that are applicable to observation-based software testing: correspondence analysis and multidimensional scaling. Both techniques are distinguished by their ability to handle data of large volume and high dimensionality. Correspondence analysis is used in the case study described in Section 4.

Correspondence Analysis
Correspondence analysis is one of many names for a data analysis and visualization technique that has been independently discovered in many fields [3]. It is used to analyze an n-dimensional data set and represent it in a few dimensions with the least possible loss of information. This representation allows the user to visually analyze the relationships between different data points, and also between these points and the original dimensions. One way to think about calculating the correspondence analysis display is to first fit a line through the n-dimensional space so as to maximize the variance of the points along the direction of the line. Then the coordinates of the points are modified so as to take away any variance along this direction, and the process is repeated. The projections of the original points onto the fitted lines (axes) correspond to the points' coordinates on the display. The displayed points are called row points, because they correspond to rows in the data matrix. Correspondence analysis assigns weights to both points and dimensions, e.g., to compensate for differences in measurement units. Display points representing the original dimensions, which are called column points, can also be displayed in order to show how the position of row points is affected by various dimensions. Correspondence analysis is usually computed using singular value decomposition (SVD) [7]. SVD is a well-known matrix decomposition technique for which there are efficient, highly optimized algorithms.

In observation-based testing, the input data for correspondence analysis is a matrix in which each row corresponds to an execution of the software under test and each column corresponds to a profile feature. Each row of the matrix can be considered the coordinates of a point in a space with as many dimensions as there are profile features. Analyzing this matrix with correspondence analysis yields low-dimensional displays that show the relationships between the test points.
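To make the SVD-based computation concrete, the following sketch (ours, not part of the original study; it assumes NumPy and a count matrix with no all-zero rows or columns) computes row and column principal coordinates from a matrix of event counts. For the data set described in Section 4, the rows would be the individual compiler executions and the columns the profiled functions.

```python
import numpy as np

def correspondence_analysis(counts, n_axes=2):
    """Correspondence analysis of a non-negative count matrix via SVD.

    `counts` has one row per execution and one column per profile feature
    (e.g., per-function call counts); rows and columns are assumed to have
    non-zero totals.  Returns principal coordinates of the rows (executions)
    and columns (features) on the first `n_axes` principal axes."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                                   # correspondence matrix
    r = P.sum(axis=1)                                 # row masses (weights)
    c = P.sum(axis=0)                                 # column masses (weights)
    # Standardized residuals; subtracting the outer product removes the
    # trivial dimension, and the weighting compensates for feature scales.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U[:, :n_axes] * sv[:n_axes]) / np.sqrt(r)[:, None]
    col_coords = (Vt[:n_axes].T * sv[:n_axes]) / np.sqrt(c)[:, None]
    return row_coords, col_coords
```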
Multidimensional Scaling
Multidimensional scaling is the name for a family of techniques used to project a set of n-dimensional data points onto a plane, given only a matrix of dissimilarities between the points [1]. This matrix is computed from a data matrix by applying a dissimilarity metric (e.g., Euclidean or Manhattan distance) to each pair of rows. The positions of the points on the display are such that the distance between any pair of points reflects as closely as possible the degree of dissimilarity between them. A simple approach is to select a starting configuration of points and evaluate how closely it approximates the input. The points are then moved to decrease the error, and the process is repeated until a minimum is found. This results in a very flexible approach to finding a display, since there are different ways of calculating the dissimilarity matrix, computing the error, and finding a solution.
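As an illustration of this iterative scheme, the toy sketch below (ours, not the implementation used in the study; it assumes SciPy, Euclidean dissimilarities, and a fixed step size) minimizes raw stress by gradient descent. Production MDS codes use more robust update rules such as SMACOF.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mds_display(data, n_iter=300, step=0.01, seed=0):
    """Toy multidimensional scaling: place the rows of `data` in the plane so
    that display distances approximate their Euclidean dissimilarities,
    by gradient descent on the raw stress  sum_{i<j} (d_ij - delta_ij)^2."""
    delta = squareform(pdist(data, metric='euclidean'))   # dissimilarity matrix
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=1e-2, size=(len(delta), 2))      # starting configuration
    for _ in range(n_iter):
        d = squareform(pdist(X)) + 1e-12                  # current display distances
        ratio = (d - delta) / d
        np.fill_diagonal(ratio, 0.0)
        # Gradient of the raw stress with respect to the point positions.
        grad = 2.0 * (ratio.sum(axis=1)[:, None] * X - ratio @ X)
        X -= step * grad                                  # move points to reduce the error
    return X
```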

3 APPLICATIONS
Multivariate visualization techniques like correspondence analysis and multidimensional scaling enable a tester to visualize the distribution of execution profiles induced by a set of potential test cases. They also reveal significant features of the corresponding population of executions (these population features should not be confused with profile features). Typical examples of such features include unusual executions, clusters of similar executions, and regions of the profile space without any executions. In addition, visualization often reveals other features of an execution population that are visually striking but whose significance is not immediately obvious. Upon further investigation, such features may turn out to be significant for testing, as we shall see in Section 4.

Besides revealing the distribution and features of an execution population, multivariate visualization techniques provide means of comparing two or more execution populations and of relating an individual execution to other executions. These capabilities have several applications in observation-based testing, which are described in the remainder of this section. They include: evaluating synthetic test data; filtering regression test suites; filtering captured operational executions; comparing test suites; and assessing bug reports.

Evaluating Synthetic Test Data
Multivariate visualization techniques can be used to evaluate and improve a set of test data derived synthetically, e.g., by constructing functional tests or by simulating usage scenarios. As mentioned in the Introduction, it is customary to evaluate such test data by measuring the degree of structural coverage it achieves. Visualization techniques can go further by revealing relationships among test cases. Because the displays produced by these techniques are computed from all columns of a data matrix, they can reveal relationships involving multiple event counts.

An outlier or isolated point in a display indicates a test case that induces unusual behavior. Such test cases are usually desirable, because they exercise aspects of a program not exercised by other test cases. A dense cluster of points in a display suggests that the corresponding test data is redundant with respect to the kinds of events that have been profiled. It may be beneficial to eliminate most of the redundant tests and replace them with more varied ones. In order to decide whether this is appropriate, it is necessary to carry out a more detailed analysis of the cluster, for example, by examining other views of the data or by using a different form of profiling. It is desirable for a visualization tool for use in software testing to support such analysis interactively.

An empty region R in the display may indicate that the test set fails to exercise important behaviors of the software under test. In this case, it is desirable to augment the test set with one or more tests whose profiles yield points in R when displayed. However, it is possible that there are no inputs to the software that will produce profiles in R, and in general it is undecidable whether there is any input to a program that will produce a profile with specified values. One approach to augmenting a test set is to trawl for suitable inputs: obtain additional inputs from any source (e.g., from beta testing); execute the software on them to produce new profiles; display the new profiles together with the original ones; and observe whether any of the new points fall into region R. Another approach to augmenting a test set is to obtain a characterization of the kinds of profiles that would yield points in R and then attempt to construct one or more test cases that produce such profiles. Ideally, a visualization tool for use in testing would produce such a characterization of a display region on request.
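The trawling check can be approximated mechanically. The sketch below is only an illustration (it reuses the hypothetical correspondence_analysis function from the Section 2 sketch and treats R as an axis-aligned rectangle on the first principal plane): it recomputes the display for the old and new profiles together and reports which new executions land in R.

```python
import numpy as np

def trawl_region(old_counts, new_counts, x_range, y_range):
    """Report which newly obtained executions fall into a (rectangular)
    display region R that was empty for the original test set.

    The display is recomputed for old and new profiles together; returned
    indices refer to rows of `new_counts`."""
    combined = np.vstack([old_counts, new_counts])
    row_coords, _ = correspondence_analysis(combined, n_axes=2)
    new_pts = row_coords[len(old_counts):]
    inside = ((new_pts[:, 0] >= x_range[0]) & (new_pts[:, 0] <= x_range[1]) &
              (new_pts[:, 1] >= y_range[0]) & (new_pts[:, 1] <= y_range[1]))
    return np.flatnonzero(inside)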
A display may reveal features of a test set whose explanation requires further investigation. These may be indicated by point sets with distinctive shapes. For example, the displays shown in Section 4 exhibit linear, curvilinear, and triangular point sets, among others. To determine whether the test set adequately exercises the software under test, it may be important to understand the factors that underlie such display features. To discover these factors, it may suffice to examine the documentation for the corresponding test cases. If such documentation does not exist or is not sufficient, it is necessary to conduct a detailed analysis of the profiles produced by the test cases and of the software under test.

Because correspondence analysis can simultaneously display points representing the rows of the data matrix and points representing the columns, it can reveal the extent to which the position of a point is explained by individual profile features. Essentially, a row point is attracted to the column points corresponding to the most prominent features in its profile (those with high values). By considering the affinity of row points for certain column points, the tester can gain an understanding of which events characterize a test case. A set of column points that are close together corresponds to a set of profile features (event counts) that are correlated with each other. The correlations between these features might be explained by an unobserved, latent variable or factor (in the sense of factor analysis [8]). This can be confirmed only by detailed analysis of the software under test.

Filtering Regression Test Suites
A notable special case of evaluating synthetic test data is analyzing a regression test suite in order to eliminate redundant test cases or to add new tests necessary to exercise new features of the software under test. Several authors have proposed techniques for identifying a minimal or safe subset of a regression test suite, e.g., see [5,13]. For this to be worthwhile, the cost of the analysis it entails must be less than the cost of running and evaluating the tests that are eliminated. Multivariate visualization techniques are applicable to this problem when the cost of evaluating regression tests dominates the cost of running them, e.g., because the tests must be evaluated manually. To apply these techniques, the executions induced by the original test suite must be profiled. Visualization is then used to select a subset that spans the range of tests in the original suite but that contains no redundant tests. This is done as follows. All outliers or isolated points are selected. One representative is chosen from each roughly elliptical cluster of points. With other features of the display, one representative is selected from each region and extremity of the feature.
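The cluster-representative step of this procedure can be roughed out automatically, although it is intended to be interactive. The following sketch is our illustration only (the use of k-means via scikit-learn and the fixed cluster count are assumptions, not part of the described method); it keeps the test whose display point lies closest to each cluster centre.

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_cluster_representatives(coords, test_ids, n_clusters=10):
    """Keep one regression test per cluster of display points: the test whose
    point lies closest to each cluster centre.  Outliers and other display
    features would still be picked out interactively."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(coords)
    keep = []
    for k in range(n_clusters):
        members = np.flatnonzero(km.labels_ == k)
        dists = np.linalg.norm(coords[members] - km.cluster_centers_[k], axis=1)
        keep.append(test_ids[members[np.argmin(dists)]])
    return keep
```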

Multivariate visualization techniques permit a variety of criteria to be used in filtering regression tests, since they can be used with any kind of profile.

Filtering Captured Operational Inputs
A serious problem with synthetic test data is that it does not reflect the way the software under test will be used in the field. Even if it reveals defects, it may not reveal those having a significant impact on the software's reliability as perceived by users. By contrast, operational testing (beta testing, field testing) does reflect the way software is used in the field, and it may also reduce the amount of in-house testing (alpha testing) that software developers must do. In operational testing, the software to be tested is provided to some of its intended users to employ as they see fit over an extended period. The advantages of operational testing are somewhat offset by the fact that beta users often fail to observe or report failures, because they are unfamiliar with the software's specification and because testing is not their primary occupation. This problem can be addressed by using a capture/replay tool to capture executions in the field, so they can later be replayed and examined in detail by trained testing personnel. If many executions are captured, it may be practical to examine only a fraction of them in this way. Rather than examining a random sample of executions, it is desirable to filter the captured sample to identify executions with unusual characteristics that may be associated with failure. Multivariate visualizations can be used to filter operational executions in much the same way they can be used to filter regression test suites.

Comparing Potential Test Suites
Multivariate visualization techniques also provide a means of comparing test suites derived in different ways. As such, they are potentially useful both to practitioners and to researchers. For example, a set of synthetically generated tests can be compared with captured beta-test executions in order to see how well the synthetic tests approximate operational usage. Such a comparison might be used to modify testing procedures to better reflect patterns of operational usage. Captured executions obtained from different user populations can be compared visually in order to understand differences in their usage patterns that should be addressed in future testing.

Assessing Bug Reports
Software development organizations often have such a backlog of bug reports about a product that when a new report comes in, they cannot address it immediately. Rather, they must prioritize it and focus on repairing the high-priority bugs first. Multivariate visualization provides a means by which a developer can gain insight into the significance of a newly reported bug. This requires the developer to maintain a large repository of operational executions of the product, captured from a random sample of user sites. If a new bug report includes an input that elicits a failure, the execution E the input induces can be profiled, and this profile can be displayed together with profiles of the captured executions in the repository. The executions that are close to E in the display can then be identified, replayed, and examined to determine whether they also fail in the same way. If the repository reflects the way the software is used in the field, this procedure will indicate the relative frequency with which the bug causes failures in the field.
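A minimal sketch of the lookup step follows (ours; the use of Euclidean distance between normalized profiles is an assumption). Given the profile of the failing execution and the repository's profile matrix, it ranks captured executions by similarity so the nearest ones can be replayed first.

```python
import numpy as np

def nearest_captured_executions(failure_profile, repository_profiles, k=20):
    """Rank captured operational executions by how similar their profiles are
    to the profile E of a newly reported failing execution, so that the
    closest ones can be replayed and examined first."""
    rep = repository_profiles / repository_profiles.sum(axis=1, keepdims=True)
    fail = failure_profile / failure_profile.sum()
    dists = np.linalg.norm(rep - fail, axis=1)   # distance between relative-frequency profiles
    return np.argsort(dists)[:k]                 # indices of the k most similar executions
```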
4 CASE STUDY
In this section we present a case study illustrating several of the applications of multivariate visualization described in Section 3. In this case study, correspondence analysis is used to analyze two sets of inputs to the C-language compiler of the GNU Compiler Collection (GCC) [2]. The profile data consists of function call counts as reported by GNU's function coverage profiler, gprof. That is, each time the compiler was executed, the number of times each of the compiler's functions was called was recorded. The execution platform was a Sun Ultra 5 workstation running SunOS 5.7.

One set of inputs was the test suite for GCC 2.95. (The test suite for the exact version used was not publicly available at the time.) This set of inputs executed the C compiler 6064 times, yielding just as many profiles. A second set of inputs was included for comparison with this test suite. It consists of publicly available programs for which the source code is also available. Most of them come from either the GNU project or the X Windows consortium X.Org [14]. These programs were selected to represent a wide variety of applications. Among them are some file, shell, and compression utilities, a compiler (GCC), a debugger (gdb), a text editor (Emacs), some X Windows programs, an AI program (GNU Chess), and some network daemons and clients. A total of 32 programs were included, adding 1807 more compilations, for a total of 7871 executions. For all of these programs, the default makefiles were used, including optimization choices, etc. A total of 2370 different functions were called during at least one of these executions. The result was a data matrix of 7871 rows by 2370 columns. Calculating the correspondence analysis display for this data set takes roughly one hour on a Pentium III 450 with 256 MB of memory.

Interpreting the Correspondence Analysis Display
Once correspondence analysis has been carried out, one ends up with a set of axes and the data points' coordinates on these axes. The axes are called principal axes. They are ordered with respect to the amount of variance in the data they account for, the most important being the first principal axis. The plane defined by the first and second principal axes is called the first principal plane; the third and fourth principal axes define the second principal plane, and so on. It is also possible to take different pairs of axes, though the planes they define have no special names.
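For illustration, a display such as the ones discussed below can be drawn from the coordinates produced by the correspondence-analysis sketch in Section 2 (a hypothetical helper, assuming Matplotlib; plotting the second principal plane would use axes 3 and 4, i.e. axis_x=2 and axis_y=3, with at least four axes retained).

```python
import matplotlib.pyplot as plt

def plot_principal_plane(row_coords, col_coords, axis_x=0, axis_y=1):
    """Scatter plot of one principal plane: executions as round (row) points
    and profiled functions as square (column) points."""
    fig, ax = plt.subplots()
    ax.scatter(row_coords[:, axis_x], row_coords[:, axis_y],
               s=10, marker='o', label='executions (row points)')
    ax.scatter(col_coords[:, axis_x], col_coords[:, axis_y],
               s=25, marker='s', label='functions (column points)')
    ax.set_xlabel('principal axis %d' % (axis_x + 1))
    ax.set_ylabel('principal axis %d' % (axis_y + 1))
    ax.legend()
    return ax
```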

[Figure 1: (a) First principal plane for a subset of the GCC data, including row points (round) and column points (squares). (b) Names of the functions for the lower cluster of column points.]

Figure 1a shows the first principal plane for a subset of the GCC data. This is a scatter plot using the first principal axis as the x axis and the second as the y axis. Each round point in this figure corresponds to an execution of the GCC compiler. The distances between points in the figure reflect the n-dimensional distances between the corresponding profiles. That is, if two points are far apart in the display, then the corresponding executions have very different profiles.

As mentioned previously, correspondence analysis can also provide information about the reasons for a point's location in the display. That is, it identifies which of the point's features were most important in determining its placement. Consider Figure 1a. It contains two sets of points: the round points are row points (test cases), whereas the square points are column points. That is, each square point corresponds to one column of the input matrix, which in turn represents one of the features being profiled, in this case one function. The column points provide two pieces of information. First, the distances between column points can be interpreted in the same way as those for row points. That is, if two column points are far apart, they are essentially unrelated, while points that are close together represent a set of functions whose call counts were linearly related in all runs. These sets of related functions are called factors (see Section 3). On the other hand, when two functions are related in a non-linear way, they will be separated, and the row points will be arranged in a curve between these two column points, representing the relationship between these functions. In Figure 1a, one can see a few different factors in the display: one to the left, one on the bottom, and some others around the cloud of row points. The names of the functions in the box can be seen in Figure 1b.

The second piece of information given by the column points arises from their relationship with the row points. If a row point is close to a column point, it means that the execution represented by the row point used the function represented by the column point more often than average. For example, the executions on the bottom of Figure 1a used the functions listed in Figure 1b very often. Any row point can be interpreted as a linear combination of column points, since the location of each row point corresponds to how many times each of the functions was called. This means that the position of a row point in the display is a weighted average of the positions of all column points. For example, if an execution used only functions f and g, and it used them the same number of times, its point would be halfway between f's point and g's point. A point that is very close to the origin represents an execution that did not stress the functions that are represented in that plane. On a different principal plane, it might be very far from the origin, since different factors will be seen there.
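This barycentric reading of the display can be checked numerically. In the usual formulation, each row point is the weighted average of the column points taken in standard coordinates (principal coordinates divided by the singular value of each axis), with the execution's relative call frequencies as weights. The sketch below is ours and assumes the column coordinates and singular values come from an extended version of the correspondence-analysis sketch in Section 2 that also returns the singular values.

```python
import numpy as np

def row_points_from_column_points(counts, col_coords, singular_values):
    """Transition (barycentric) relation of correspondence analysis: each
    execution's display point is the weighted average of the column points
    in standard coordinates, weighted by its relative call frequencies."""
    profiles = counts / counts.sum(axis=1, keepdims=True)  # row profiles (weights)
    col_standard = col_coords / singular_values            # rescale column points
    return profiles @ col_standard                         # equals the row principal coordinates
```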
Finding Unusual Executions
By looking at the correspondence analysis display of the GCC data, one can immediately identify some points that are far away from all the others. This means that GCC's behavior during these executions was remarkably different from its behavior during most others. For example, Figure 2 displays the first principal plane of the correspondence analysis. There is an outlier on the right side of the display, very far from the rest of the points. This indicates a very special execution, which made GCC behave in an interesting way. This point, and all such points, should be chosen as test cases worth looking at.

After an outlier has been identified, the display is recalculated without taking that point into consideration. The reason for this step is that an outlier greatly influences the display, which makes the representation of all other points less accurate. After recalculating, it is possible to see more of the structure of the data, as shown in Figure 3.

[Figure 2: First principal plane of the GCC data set (initial display).]
[Figure 3: First principal plane after removing one outlier.]
[Figure 4: Second principal plane after removing one outlier.]
[Figure 5: First principal plane after 24 iterations of the jackknifing algorithm, shaded according to optimization level.]

After this step, one can look for more outliers. If none are found in the first principal plane, one can look for them in other principal planes. For example, there are no obvious outliers in Figure 3, but looking at the second principal plane (Figure 4), one can see another outlier. This is because even though the first principal plane is free of outliers, there may still be points that strongly influence the remaining dimensions.

Fortunately, there is a technique related to correspondence analysis, called jackknifing, that identifies outliers automatically. Jackknifing checks each point to determine whether removing it would cause a principal axis to rotate more than 45 degrees. If so, that point is considered an outlier and is actually removed. A new display is then calculated, and the process is repeated to identify all outliers. Although jackknifing itself is very fast, recomputing the display is more expensive. The whole process can be run overnight and requires no human intervention. Figure 5 shows the display after removing outliers 24 times in this manner. Eighty-two outliers were identified and removed in total. (More than one outlier can be identified in any one step.)

A problem occurs when a plane has a small cluster of points that is distant from the rest. For example, as will be seen below, Figure 8 contains a small cluster of points on the right. Technically, none of these points is an outlier, since they are not unique: there are four of them in that region. One may instead label the entire cluster as an outlier and remove all its points. Instead of checking all of the executions in the cluster for conformance to requirements, one or two of them might be checked. Outlier clusters have an adverse effect on the jackknifing algorithm: removing any one of the points by itself might not influence the display, so none of them will be labeled as an outlier. To date, our only means of identifying clusters of outliers is by visual inspection. It is better to do this before running the jackknifing algorithm, so that the algorithm can run on a more accurate display. Figure 9 shows the result of removing the outlier cluster from Figure 8.
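A single pass of the jackknife check might look like the sketch below (ours, not the tool used in the study; it tests only the rotation of the first principal axis and recomputes one decomposition per candidate point, which is consistent with the observation that recomputing the display is the expensive part).

```python
import numpy as np

def jackknife_outliers(counts, angle_threshold_deg=45.0):
    """One pass of the jackknife-style check: flag an execution if deleting
    its row rotates the first principal axis by more than the threshold."""
    def first_axis(N):
        P = N / N.sum()
        r, c = P.sum(axis=1), P.sum(axis=0)
        # Guard against features unused in the reduced data set.
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, np.maximum(c, 1e-12)))
        _, _, Vt = np.linalg.svd(S, full_matrices=False)
        return Vt[0]                               # direction of the first principal axis

    v_full = first_axis(counts)
    outliers = []
    for i in range(counts.shape[0]):
        v_i = first_axis(np.delete(counts, i, axis=0))
        cos = abs(v_full @ v_i)                    # the sign of an axis is arbitrary
        angle = np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))
        if angle > angle_threshold_deg:
            outliers.append(i)
    return outliers
```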

[Figure 6: First principal plane after 24 iterations of the jackknifing algorithm, shaded according to optimization level. (a) shows only points representing test suite runs; (b) shows only points representing user programs.]

Comparing and Augmenting Test Suites
The GCC test suite turns out not to cover all of the functions of the compiler. Just by adding the user programs, 27 functions were executed that were not exercised by the test suite. This suggests that the test suite might be improved by adding more test cases. Figure 5 shows the correspondence analysis for the whole GCC data set. This includes both the existing GCC test suite and the set of user programs compiled for comparison. To make it easier to differentiate between these two data sets, Figure 6a shows the same display with only the test suite points plotted, and Figure 6b plots only the user program points. Putting the two together yields Figure 5.

Considering Figures 6a and 6b, it is obvious that the test suite compilations behave very differently from user program compilations. In particular, the test suite programs concentrate towards the top left part of the display, while the user programs are closer to the bottom of the display. This confirms the common knowledge that handmade test cases may not reflect the way the program is actually utilized by end users.

Examining these pictures, one can see that the test suite stresses the compiler functions at the top left of the display. There are a few user programs that lie in that region, so it is a good thing that region has been tested. On the other hand, many user programs lie in a region at the bottom of the display, where there are no test suite programs covering that combination of functions. This indicates that the test suite could be augmented by adding one or more of these user programs, or by designing tests that produce this behavior.

Another way to approach this problem is to assume there is no test suite and pick a set of test cases from the executions in Figure 5. Intuitively, one or more test cases should be picked from each region, since executions in different regions have different behavior. One can also look at the features of the population and select tests from each cluster, etc. This process is repeated for subsequent principal planes, being careful to mark the tests that have already been chosen from higher planes. This way more features can be taken into account without introducing redundant tests.

Identifying Significant Features in the Display
By looking at Figure 5, one can see distinct features in the display. These correspond to ways in which the factors in the display are related in the population. For example, consider the dark points in Figure 6a. All of these points exhibit a similar correlation between the factor at the bottom of the display and the factor at the left. This means that for these points, the two factors have an inverse linear relationship. But notice that this correlation is different for different shades, while it is basically nonexistent for the points in Figure 6b. This means that these different groups of executions of the compiler behave in remarkably different ways, showing that the relationships between the components of the compiler are not static but depend on the specific attributes of the input. Moreover, this shows that these different behaviors are discrete, since there is essentially no middle ground. When selecting test cases, it is desirable to test all such program behaviors, and therefore one needs to select tests from each region of the display.
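One crude way to automate "one or more test cases per region" is to bin each principal plane into a grid and keep one execution per occupied cell, skipping executions already selected on an earlier plane. The sketch below is only our illustration of that idea; the grid granularity and the extremity-first ordering are assumptions, not part of the described procedure.

```python
import numpy as np

def select_by_region(coords_per_plane, n_bins=8):
    """Pick one execution per occupied grid cell of each principal plane,
    skipping executions already chosen on an earlier plane.
    `coords_per_plane` is a list of (n_executions, 2) coordinate arrays."""
    chosen, taken = [], set()
    for coords in coords_per_plane:
        lo, hi = coords.min(axis=0), coords.max(axis=0)
        cells = np.floor((coords - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
        seen_cells = set()
        for idx in np.argsort(-np.abs(coords).sum(axis=1)):  # visit extremities first
            cell = tuple(cells[idx])
            if int(idx) in taken or cell in seen_cells:
                continue
            chosen.append(int(idx))
            taken.add(int(idx))
            seen_cells.add(cell)
    return chosen
```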

[Figure 7: First principal plane when analyzing only runs with no optimization. Black points indicate test suite runs; light points indicate user programs.]
[Figure 8: First principal plane when analyzing only tests with optimization enabled.]
[Figure 9: First principal plane after removing the first cluster of outliers from the set of points with optimization.]

In order to understand this data set better, it was necessary to determine what these regions represent. It was first noticed that there were differences between test suite programs and user programs. Then, different explanations were explored for the regions in the display. In the end, the simplest one proved to be appropriate: the regions represent the level of optimization of the compiler during that execution. Figures 5, 6a, and 6b are shaded according to optimization levels, with black being runs that were not optimized. Different optimization levels require different behaviors from the compiler, and this fact is represented in the display. Selecting test cases from each region of the display therefore ensures that each optimization level of the compiler is tested. Although the need for such tests should be obvious to someone knowledgeable about compilers, the test selection procedure we have described does not require such knowledge. This is especially important when the workings of the software under test are poorly understood.

Once we have determined that executions with and without optimization behave very differently, we can choose to examine them separately. Figure 7 shows the first principal plane for points without optimization, and Figure 8 shows the points with optimization. Unlike Figures 6a and 6b, these displays are calculated separately, so there is no relationship between the displays in Figures 7 and 8. Figure 7 shows that test suite and user programs are still fairly disjoint. Additional features can be seen in the data, both linear and otherwise. Again, this suggests there are different kinds of executions that should be checked. Figure 8 shows a cluster of points to the right. It can be seen that these points stress some aspect of the compiler in a special way. It turns out that these points correspond to test cases for testing the built-in memcpy functions, using high optimization. So again, the display suggests meaningful test cases. After removing that cluster of outliers, we get Figure 9.

Eliminating Redundant Executions
The correspondence analysis display can point a tester to sets of tests that might be redundant. Tests that are far apart in the display have very different profiles; likewise, tests that are close together have similar profiles. One has to be careful, though: similarity in this case does not mean equality. It simply indicates that the tests coincide with respect to the factors displayed in the current principal plane. Therefore, when looking for similar executions, it is necessary to check whether the points coincide in several planes. Once a group of seemingly similar executions has been found, it is possible to analyze it further to establish which points really are redundant. This can be done by comparing the profiles with standard statistical techniques, or by examining this smaller group again with correspondence analysis or another multivariate visualization technique. Using correspondence analysis on this smaller set of data would allow the tester to look at the differences between its points without the display being influenced by the differences between the other points. Moreover, looking at the display's column points allows the tester to see what these differences are and decide whether they are meaningful, or whether the tests really are redundant.
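The profile-level confirmation step can be sketched as follows (ours; cosine similarity of normalized profiles and the 0.99 threshold are arbitrary stand-ins for "standard statistical techniques").

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def likely_redundant_pairs(counts, candidate_idx, threshold=0.99):
    """Compare the full profiles of a group of seemingly similar executions,
    independent of any particular principal plane, and report pairs whose
    relative call frequencies are almost identical."""
    sub = counts[candidate_idx].astype(float)
    sub = sub / sub.sum(axis=1, keepdims=True)
    sim = 1.0 - squareform(pdist(sub, metric='cosine'))   # pairwise cosine similarity
    pairs = []
    for a in range(len(candidate_idx)):
        for b in range(a + 1, len(candidate_idx)):
            if sim[a, b] >= threshold:
                pairs.append((candidate_idx[a], candidate_idx[b]))
    return pairs
```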

[Figure 10: Binary tree data set. Light points indicate failures.]

5 FUTURE RESEARCH
In order to develop a comprehensive methodology for the use of multivariate visualization in observation-based testing, it is necessary to understand how visualizations are affected by the type of software being studied, the types of defects it contains, the type of profiles that are generated, and the type of visualization technique that is employed. We therefore plan to conduct a substantial empirical study that will address these issues. In this study, different visualization techniques are to be used to analyze a variety of profile types from several representative types of software.

One particularly important issue is the extent to which multivariate visualization can distinguish executions that exhibit failures from other executions. The case study presented in Section 4 did not address this issue, because we did not know which, if any, inputs caused the GCC compiler to fail. However, we have preliminary evidence from other programs that multivariate visualization can distinguish failure behavior. Consider, for example, the correspondence analysis display in Figure 10. It was computed from line-count profiles of a simple program that implements a binary search tree. During execution of this program, a number of key-value pairs are inserted and deleted, and the results are printed out at intervals. The program contains a defect that was deliberately placed in its deletion routine. A second program was used to check whether the tree program's output was correct. The black points in Figure 10 correspond to successful executions, while the light points indicate failed executions. This display clearly separates many failures from successful executions. Consequently, if tests are selected from the major regions of the point cloud, the defect is certain to be revealed. Interestingly, statement coverage would not necessarily reveal this defect, because the defect affects only deletions of internal nodes and causes a failure only if the tree's contents are printed out soon after an internal node is deleted. (Subsequent deletions can cause coincidental correctness.) In Figure 10, the left corner of the point cloud corresponds to executions with only insertions; on the right are executions with many deletions. On the top are executions with insertions and deletions of internal nodes. We shall investigate whether the separation of failures and successful executions exhibited by the binary tree data set is common with other, more substantial programs and different forms of profiling.

6 RELATED WORK
A number of authors have addressed topics closely related to observation-based testing and multivariate visualization of execution profiles. Hanson et al. describe the use of several multivariate analysis techniques, including star plots and principal components analysis, for studying usage of UNIX operating system commands in support of user-interface enhancement [4]. Podgurski et al. examine the use of the multivariate analysis technique cluster analysis for improving the accuracy and efficiency of software reliability estimation [10,11]. In this application, program executions are captured in the field and later replayed and profiled. The profiles are then clustered to obtain a stratified sampling design for estimating reliability.
Podgurski et al. report experiments in which cluster analysis of branch-traversal profiles permitted the failure frequency of several programs to be estimated, using stratified random sampling, more accurately than it could be with simple random sampling. This result was explained by the fact that cluster analysis of profiles isolated some program failures in very small clusters. Reps et al. explore the use of a type of execution profile called a path spectrum for discovering Year 2000 problems and related issues, and they propose several other applications of path spectra to software maintenance and testing [12]. They also describe a prototype system called DynaDiff for comparing path spectra, which produces graphical representations of individual spectra to facilitate their comparison. Harrold et al. evaluated several types of program spectra (profiles) empirically, to determine how well they indicate the occurrence of execution failures [6]. They observed that failures were likely to be indicated by differences in complete-path spectra, path-count spectra, and branch-count spectra. They also observed that differences in such spectra are more likely to indicate failures than are differences in execution-trace spectra. Pavlopoulou and Young describe how monitoring residual test coverage in software that is deployed or undergoing beta testing can be used to validate the thoroughness of testing in the development environment [9]. They describe a prototype system that monitors residual statement coverage in Java programs, and they present performance measurements that suggest the performance impact of monitoring is acceptable.

7 CONCLUSION
We have described a new approach to testing, called observation-based testing, which calls for obtaining a large amount of potential test data as expeditiously as possible and then filtering that data to obtain a much smaller subset on which to actually evaluate the software under test. Filtering potential test cases involves profiling the executions they induce and then analyzing the resulting profiles with automated help. We have proposed the use of multivariate visualization techniques for analyzing profiles, described several applications of the techniques, and presented a case study in which correspondence analysis was used to analyze potential test cases for the GCC compiler.

REFERENCES
1. Borg, I. and Groenen, P. Modern Multidimensional Scaling: Theory and Applications. Springer.
2. GCC. The GCC Home Page. Free Software Foundation.
3. Greenacre, M.J. Theory and Applications of Correspondence Analysis. Academic Press.
4. Hanson, S.J., Kraut, R.E., and Farber, J.M. Interface design and multivariate analysis of UNIX command use. ACM Transactions on Office Information Systems 2, 1 (March 1984).
5. Harrold, M.J., Gupta, R., and Soffa, M.L. A methodology for controlling the size of a test suite. ACM Transactions on Software Engineering and Methodology 2, 3 (July 1993).
6. Harrold, M.J., Rothermel, G., Wu, R., and Yi, L. An empirical investigation of program spectra. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (Montreal, Canada, June 1998).
7. Harville, D.A. Matrix Algebra From a Statistician's Perspective. Springer-Verlag.
8. Krzanowski, W.J. Principles of Multivariate Analysis: A User's Perspective. Oxford Science Publications.
9. Pavlopoulou, C. and Young, M. Residual test coverage monitoring. Proceedings of the 21st International Conference on Software Engineering (Los Angeles, CA, May 1999), ACM Press.
10. Podgurski, A., Masri, W., McCleese, Y., Wolff, F.G., and Yang, C. Estimation of software reliability by stratified sampling. ACM Transactions on Software Engineering and Methodology 8, 9 (July 1999).
11. Podgurski, A. and Yang, C. Partition testing, stratified sampling, and cluster analysis. Proceedings of the First ACM Symposium on Foundations of Software Engineering (Los Angeles, CA, December 1993), ACM Press.
12. Reps, T., Ball, T., Das, M., and Larus, J. The use of program profiling for software maintenance with applications to the Year 2000 Problem. Proceedings of the 6th European Software Engineering Conference and 5th ACM SIGSOFT Symposium on the Foundations of Software Engineering (Zurich, Switzerland, September 1997), ACM Press.
13. Rothermel, G. and Harrold, M.J. A safe, efficient regression test algorithm. IEEE Transactions on Software Engineering 6, 10 (April 1997).
14. X.Org. X.Org.


2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

Adaptive Waveform Inversion: Theory Mike Warner*, Imperial College London, and Lluís Guasch, Sub Salt Solutions Limited

Adaptive Waveform Inversion: Theory Mike Warner*, Imperial College London, and Lluís Guasch, Sub Salt Solutions Limited Adaptive Waveform Inversion: Theory Mike Warner*, Imperial College London, and Lluís Guasch, Sub Salt Solutions Limited Summary We present a new method for performing full-waveform inversion that appears

More information

A DH-parameter based condition for 3R orthogonal manipulators to have 4 distinct inverse kinematic solutions

A DH-parameter based condition for 3R orthogonal manipulators to have 4 distinct inverse kinematic solutions Wenger P., Chablat D. et Baili M., A DH-parameter based condition for R orthogonal manipulators to have 4 distinct inverse kinematic solutions, Journal of Mechanical Design, Volume 17, pp. 150-155, Janvier

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

Minsoo Ryu. College of Information and Communications Hanyang University.

Minsoo Ryu. College of Information and Communications Hanyang University. Software Reuse and Component-Based Software Engineering Minsoo Ryu College of Information and Communications Hanyang University msryu@hanyang.ac.kr Software Reuse Contents Components CBSE (Component-Based

More information

BRANCH COVERAGE BASED TEST CASE PRIORITIZATION

BRANCH COVERAGE BASED TEST CASE PRIORITIZATION BRANCH COVERAGE BASED TEST CASE PRIORITIZATION Arnaldo Marulitua Sinaga Department of Informatics, Faculty of Electronics and Informatics Engineering, Institut Teknologi Del, District Toba Samosir (Tobasa),

More information

This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane?

This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane? Intersecting Circles This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane? This is a problem that a programmer might have to solve, for example,

More information

Dr. N. Sureshkumar Principal Velammal College of Engineering and Technology Madurai, Tamilnadu, India

Dr. N. Sureshkumar Principal Velammal College of Engineering and Technology Madurai, Tamilnadu, India Test Case Prioritization for Regression Testing based on Severity of Fault R. Kavitha Assistant Professor/CSE Velammal College of Engineering and Technology Madurai, Tamilnadu, India Dr. N. Sureshkumar

More information

Part 5. Verification and Validation

Part 5. Verification and Validation Software Engineering Part 5. Verification and Validation - Verification and Validation - Software Testing Ver. 1.7 This lecture note is based on materials from Ian Sommerville 2006. Anyone can use this

More information

How to re-open the black box in the structural design of complex geometries

How to re-open the black box in the structural design of complex geometries Structures and Architecture Cruz (Ed) 2016 Taylor & Francis Group, London, ISBN 978-1-138-02651-3 How to re-open the black box in the structural design of complex geometries K. Verbeeck Partner Ney & Partners,

More information

Topics in Software Testing

Topics in Software Testing Dependable Software Systems Topics in Software Testing Material drawn from [Beizer, Sommerville] Software Testing Software testing is a critical element of software quality assurance and represents the

More information

Exploratory Data Analysis EDA

Exploratory Data Analysis EDA Exploratory Data Analysis EDA Luc Anselin http://spatial.uchicago.edu 1 from EDA to ESDA dynamic graphics primer on multivariate EDA interpretation and limitations 2 From EDA to ESDA 3 Exploratory Data

More information

Texture Mapping using Surface Flattening via Multi-Dimensional Scaling

Texture Mapping using Surface Flattening via Multi-Dimensional Scaling Texture Mapping using Surface Flattening via Multi-Dimensional Scaling Gil Zigelman Ron Kimmel Department of Computer Science, Technion, Haifa 32000, Israel and Nahum Kiryati Department of Electrical Engineering

More information

Chemometrics. Description of Pirouette Algorithms. Technical Note. Abstract

Chemometrics. Description of Pirouette Algorithms. Technical Note. Abstract 19-1214 Chemometrics Technical Note Description of Pirouette Algorithms Abstract This discussion introduces the three analysis realms available in Pirouette and briefly describes each of the algorithms

More information

Software Testing Fundamentals. Software Testing Techniques. Information Flow in Testing. Testing Objectives

Software Testing Fundamentals. Software Testing Techniques. Information Flow in Testing. Testing Objectives Software Testing Fundamentals Software Testing Techniques Peter Lo Software Testing is a critical element of software quality assurance and represents the ultimate review of specification, design and coding.

More information

Unit Testing as Hypothesis Testing

Unit Testing as Hypothesis Testing Unit Testing as Hypothesis Testing Jonathan Clark September 19, 2012 You should test your code. Why? To find bugs. Even for seasoned programmers, bugs are an inevitable reality. Today, we ll take an unconventional

More information

MONIKA HEINER.

MONIKA HEINER. LESSON 1 testing, intro 1 / 25 SOFTWARE TESTING - STATE OF THE ART, METHODS, AND LIMITATIONS MONIKA HEINER monika.heiner@b-tu.de http://www.informatik.tu-cottbus.de PRELIMINARIES testing, intro 2 / 25

More information

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student Organizing data Learning Outcome 1. make an array 2. divide the array into class intervals 3. describe the characteristics of a table 4. construct a frequency distribution table 5. constructing a composite

More information

Statistical Models for Management. Instituto Superior de Ciências do Trabalho e da Empresa (ISCTE) Lisbon. February 24 26, 2010

Statistical Models for Management. Instituto Superior de Ciências do Trabalho e da Empresa (ISCTE) Lisbon. February 24 26, 2010 Statistical Models for Management Instituto Superior de Ciências do Trabalho e da Empresa (ISCTE) Lisbon February 24 26, 2010 Graeme Hutcheson, University of Manchester Principal Component and Factor Analysis

More information

Testing! Prof. Leon Osterweil! CS 520/620! Spring 2013!

Testing! Prof. Leon Osterweil! CS 520/620! Spring 2013! Testing Prof. Leon Osterweil CS 520/620 Spring 2013 Relations and Analysis A software product consists of A collection of (types of) artifacts Related to each other by myriad Relations The relations are

More information

White Paper. Abstract

White Paper. Abstract Keysight Technologies Sensitivity Analysis of One-port Characterized Devices in Vector Network Analyzer Calibrations: Theory and Computational Analysis White Paper Abstract In this paper we present the

More information

A METRIC BASED EVALUATION OF TEST CASE PRIORITATION TECHNIQUES- HILL CLIMBING, REACTIVE GRASP AND TABUSEARCH

A METRIC BASED EVALUATION OF TEST CASE PRIORITATION TECHNIQUES- HILL CLIMBING, REACTIVE GRASP AND TABUSEARCH A METRIC BASED EVALUATION OF TEST CASE PRIORITATION TECHNIQUES- HILL CLIMBING, REACTIVE GRASP AND TABUSEARCH 1 M.Manjunath, 2 N.Backiavathi 1 PG Scholar, Department of Information Technology,Jayam College

More information

Principal Component Analysis of Lack of Cohesion in Methods (LCOM) metrics

Principal Component Analysis of Lack of Cohesion in Methods (LCOM) metrics Principal Component Analysis of Lack of Cohesion in Methods (LCOM) metrics Anuradha Lakshminarayana Timothy S.Newman Department of Computer Science University of Alabama in Huntsville Abstract In this

More information

Test Data Generation based on Binary Search for Class-level Testing

Test Data Generation based on Binary Search for Class-level Testing Test Data Generation based on Binary Search for Class-level Testing Sami Beydeda, Volker Gruhn University of Leipzig Faculty of Mathematics and Computer Science Department of Computer Science Applied Telematics

More information

Two-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California

Two-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California Two-Dimensional Visualization for Internet Resource Discovery Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California 90089-0781 fshli, danzigg@cs.usc.edu

More information

Higher-order Testing. Stuart Anderson. Stuart Anderson Higher-order Testing c 2011

Higher-order Testing. Stuart Anderson. Stuart Anderson Higher-order Testing c 2011 Higher-order Testing Stuart Anderson Defining Higher Order Tests 1 The V-Model V-Model Stages Meyers version of the V-model has a number of stages that relate to distinct testing phases all of which are

More information

Modeling with Uncertainty Interval Computations Using Fuzzy Sets

Modeling with Uncertainty Interval Computations Using Fuzzy Sets Modeling with Uncertainty Interval Computations Using Fuzzy Sets J. Honda, R. Tankelevich Department of Mathematical and Computer Sciences, Colorado School of Mines, Golden, CO, U.S.A. Abstract A new method

More information

CFDnet: Computational Fluid Dynamics on the Internet

CFDnet: Computational Fluid Dynamics on the Internet CFDnet: Computational Fluid Dynamics on the Internet F. E. Ham, J. Militzer and A. Bemfica Department of Mechanical Engineering Dalhousie University - DalTech Halifax, Nova Scotia Abstract CFDnet is computational

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Visualizing Multi-Dimensional Functions in Economics

Visualizing Multi-Dimensional Functions in Economics Visualizing Multi-Dimensional Functions in Economics William L. Goffe Dept. of Economics and International Business University of Southern Mississippi Hattiesburg, MS 3946 Bill.Goffe@usm.edu June, 1999

More information

Lecture 15 Software Testing

Lecture 15 Software Testing Lecture 15 Software Testing Includes slides from the companion website for Sommerville, Software Engineering, 10/e. Pearson Higher Education, 2016. All rights reserved. Used with permission. Topics covered

More information

Chapter 3. Requirement Based System Test Case Prioritization of New and Regression Test Cases. 3.1 Introduction

Chapter 3. Requirement Based System Test Case Prioritization of New and Regression Test Cases. 3.1 Introduction Chapter 3 Requirement Based System Test Case Prioritization of New and Regression Test Cases 3.1 Introduction In this chapter a new prioritization technique has been proposed with two new prioritization

More information

Towards Cohesion-based Metrics as Early Quality Indicators of Faulty Classes and Components

Towards Cohesion-based Metrics as Early Quality Indicators of Faulty Classes and Components 2009 International Symposium on Computing, Communication, and Control (ISCCC 2009) Proc.of CSIT vol.1 (2011) (2011) IACSIT Press, Singapore Towards Cohesion-based Metrics as Early Quality Indicators of

More information

A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY

A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY KARL L. STRATOS Abstract. The conventional method of describing a graph as a pair (V, E), where V and E repectively denote the sets of vertices and edges,

More information

Bar Graphs and Dot Plots

Bar Graphs and Dot Plots CONDENSED LESSON 1.1 Bar Graphs and Dot Plots In this lesson you will interpret and create a variety of graphs find some summary values for a data set draw conclusions about a data set based on graphs

More information

Optimization I : Brute force and Greedy strategy

Optimization I : Brute force and Greedy strategy Chapter 3 Optimization I : Brute force and Greedy strategy A generic definition of an optimization problem involves a set of constraints that defines a subset in some underlying space (like the Euclidean

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Interactive Campaign Planning for Marketing Analysts

Interactive Campaign Planning for Marketing Analysts Interactive Campaign Planning for Marketing Analysts Fan Du University of Maryland College Park, MD, USA fan@cs.umd.edu Sana Malik Adobe Research San Jose, CA, USA sana.malik@adobe.com Eunyee Koh Adobe

More information

A CELLULAR, LANGUAGE DIRECTED COMPUTER ARCHITECTURE. (Extended Abstract) Gyula A. Mag6. University of North Carolina at Chapel Hill

A CELLULAR, LANGUAGE DIRECTED COMPUTER ARCHITECTURE. (Extended Abstract) Gyula A. Mag6. University of North Carolina at Chapel Hill 447 A CELLULAR, LANGUAGE DIRECTED COMPUTER ARCHITECTURE (Extended Abstract) Gyula A. Mag6 University of North Carolina at Chapel Hill Abstract If a VLSI computer architecture is to influence the field

More information

Boundary/Contour Fitted Grid Generation for Effective Visualizations in a Digital Library of Mathematical Functions

Boundary/Contour Fitted Grid Generation for Effective Visualizations in a Digital Library of Mathematical Functions Boundary/Contour Fitted Grid Generation for Effective Visualizations in a Digital Library of Mathematical Functions Bonita Saunders Qiming Wang National Institute of Standards and Technology Bureau Drive

More information

Using surface markings to enhance accuracy and stability of object perception in graphic displays

Using surface markings to enhance accuracy and stability of object perception in graphic displays Using surface markings to enhance accuracy and stability of object perception in graphic displays Roger A. Browse a,b, James C. Rodger a, and Robert A. Adderley a a Department of Computing and Information

More information

Verification and Validation. Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1

Verification and Validation. Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1 Verification and Validation Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1 Verification vs validation Verification: "Are we building the product right?. The software should

More information

Review of Regression Test Case Selection Techniques

Review of Regression Test Case Selection Techniques Review of Regression Test Case Selection Manisha Rani CSE Department, DeenBandhuChhotu Ram University of Science and Technology, Murthal, Haryana, India Ajmer Singh CSE Department, DeenBandhuChhotu Ram

More information

MSc Software Testing and Maintenance MSc Prófun og viðhald hugbúnaðar

MSc Software Testing and Maintenance MSc Prófun og viðhald hugbúnaðar MSc Software Testing and Maintenance MSc Prófun og viðhald hugbúnaðar Fyrirlestrar 31 & 32 Structural Testing White-box tests. 27/1/25 Dr Andy Brooks 1 Case Study Dæmisaga Reference Structural Testing

More information

NUMERICAL METHODS PERFORMANCE OPTIMIZATION IN ELECTROLYTES PROPERTIES MODELING

NUMERICAL METHODS PERFORMANCE OPTIMIZATION IN ELECTROLYTES PROPERTIES MODELING NUMERICAL METHODS PERFORMANCE OPTIMIZATION IN ELECTROLYTES PROPERTIES MODELING Dmitry Potapov National Research Nuclear University MEPHI, Russia, Moscow, Kashirskoe Highway, The European Laboratory for

More information

TDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended.

TDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended. Previews of TDWI course books offer an opportunity to see the quality of our material and help you to select the courses that best fit your needs. The previews cannot be printed. TDWI strives to provide

More information

Learning to Learn: additional notes

Learning to Learn: additional notes MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.034 Artificial Intelligence, Fall 2008 Recitation October 23 Learning to Learn: additional notes Bob Berwick

More information

A Vision System for Automatic State Determination of Grid Based Board Games

A Vision System for Automatic State Determination of Grid Based Board Games A Vision System for Automatic State Determination of Grid Based Board Games Michael Bryson Computer Science and Engineering, University of South Carolina, 29208 Abstract. Numerous programs have been written

More information

Fault Class Prioritization in Boolean Expressions

Fault Class Prioritization in Boolean Expressions Fault Class Prioritization in Boolean Expressions Ziyuan Wang 1,2 Zhenyu Chen 1 Tsong-Yueh Chen 3 Baowen Xu 1,2 1 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093,

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

DI TRANSFORM. The regressive analyses. identify relationships

DI TRANSFORM. The regressive analyses. identify relationships July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,

More information

Feature Selection Using Principal Feature Analysis

Feature Selection Using Principal Feature Analysis Feature Selection Using Principal Feature Analysis Ira Cohen Qi Tian Xiang Sean Zhou Thomas S. Huang Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign Urbana,

More information

Regression III: Advanced Methods

Regression III: Advanced Methods Lecture 3: Distributions Regression III: Advanced Methods William G. Jacoby Michigan State University Goals of the lecture Examine data in graphical form Graphs for looking at univariate distributions

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Data Mining: Exploring Data. Lecture Notes for Chapter 3 Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Look for accompanying R code on the course web site. Topics Exploratory Data Analysis

More information

Chapter 4. Clustering Core Atoms by Location

Chapter 4. Clustering Core Atoms by Location Chapter 4. Clustering Core Atoms by Location In this chapter, a process for sampling core atoms in space is developed, so that the analytic techniques in section 3C can be applied to local collections

More information