Investigating Java Type Analyses for the Receiver-Classes Testing Criterion

Pierre-Luc Brunelle, Computer Eng. Dept., École Polytechnique, Montréal, PQ, Canada
Ettore Merlo, Computer Eng. Dept., École Polytechnique, Montréal, PQ, Canada
Giuliano Antoniol, Faculty of Engineering, University of Sannio, Benevento, Italy

Abstract

This paper investigates the precision of three linear-complexity type analyses for Java software: Class Hierarchy Analysis (CHA), Rapid Type Analysis (RTA) and Variable Type Analysis (VTA). Precision is measured relative to class targets. Class-target results are useful in the context of the receiver-classes criterion, an object-oriented testing strategy that aims to exercise every possible class binding of the receiver object reference at each dynamic call site. In this context, a more precise analysis decreases the number of infeasible bindings to cover and thus reduces the time spent devising test data sets. This paper also introduces two novel variations to VTA, called the iteration and intersection variants. We present experimental results on the precision of CHA, RTA and VTA on a set of 17 Java programs, corresponding to a total of 600 kloc of source code. Results show that, on average, RTA suggests 13% fewer bindings than CHA, standard VTA suggests 23% fewer bindings than CHA, and VTA with the two variations together suggests 32% fewer bindings than CHA.

1. Introduction

Inheritance, polymorphism and dynamic binding are three major characteristics of object-oriented (OO) languages. These features offer great benefits, such as reuse and abstraction, but at the expense of increased testing complexity [5]. Polymorphism allows an object reference to be bound to objects of different types. With dynamic binding, the method actually invoked at a dynamic call site is determined by the runtime type of the receiver. Hence, a single call site can in fact execute different statements.
To increase confidence in the program, a call site should be tested for every class whose objects can be bound to the receiver. The rationale is that even if one class correctly implements the called method, there is no guarantee that the others will. Hence, testing a call site for just one binding of the receiver is not sufficient: it must be tested for every possible class binding of the receiver. In essence, this is what the receiver-classes criterion is about.

The question that arises is how to determine the possible class bindings. One simple way is to use the inheritance hierarchy of the program and assume that a receiver can be bound to objects of its declared class and all its subclasses. This technique is essentially Class Hierarchy Analysis (CHA) [7]. It yields a safe estimate, but one that could include many spurious classes. Other analyses have been developed that are also safe but report fewer possible classes. Rapid Type Analysis (RTA) [3] and Variable Type Analysis (VTA) [16] are two such analyses. All three analyses are conservative, fast and exhibit linear time complexity. We have made two improvements to VTA, called the iteration variant (VTA_n) and the intersection variant (VTA∩). Both variations increase the precision of VTA, at the expense of increased computation time (VTA_n) or additional manual intervention (VTA∩).

This paper measures the receiver resolution precision of CHA, RTA and VTA on a set of 17 Java programs. For VTA, results are reported for standard VTA, VTA_n and VTA∩. The precision of these type analyses is evaluated with respect to the number of infeasible classes they report. This measurement is useful: if one analysis turns out to be much more precise than the others, a tester would be well advised to use it, since she would have fewer infeasible bindings to cover.
All code examples and all test programs are written in Java, although the criterion and the type analyses can in principle be applied to other OO languages such as C++. The paper is organized as follows. Section 2 recalls basic definitions. Section 3 introduces the receiver-classes criterion and its variants. Section 4 describes the three type analyses and the variations to VTA, and section 5 describes the problems faced by two of the analyses when applied to Java software. Section 6 describes the experiments while section 7 presents and discusses the results. Finally, section 8 presents some related work and section 9 concludes and proposes new research leads.

2. Definitions

Some object-oriented (OO) notions, including OO call sites and polymorphism, are at the heart of the receiver-classes criterion, so we recall their definitions, referring to the following small Java program to ease the discussion.

     1 public class T {
     2   public static void main(String[] args) {
     3     T t = new U();
     4     t.f(args);
     5   }
     6   public void f(Object v) {
     7     String s = v.toString();
     8     System.out.println("s = " + s);
     9   }
    10 }
    11 class U extends T { }

Figure 1. Small example

The program is made of a class named T and a subclass named U. Class T defines two methods, one of which allows only static binding (main) and one which also allows dynamic binding (f). At line 7 the toString method is called on v, an object reference of declared type Object. Line 7 is made of two statements: first a method call, then an assignment. The statement v.toString() is a call site. The call site is made of a sender object (t, defined at line 3), a receiver object (v), a method (toString) and its parameters (none). For some call sites the receiver object is not explicitly specified; it is then the implicit this object.

Recall that Object is the root class of the Java class hierarchy: every class, whether defined by a user or provided in the standard library, is a subclass of Object. In this example, T is a subclass of Object even though the relationship has not been specified explicitly. At line 3 an object of type U is created and assigned to the object reference t. Polymorphism allows t to be bound to an object of class U.

We define infeasible types as follows: a type T is said to be infeasible for an object reference v iff v is not bound to an object of type T in any execution of the program. We say that T is an infeasible class binding for v. The receiver resolution precision (often just called precision in this paper) is the ratio of the number of actual possible types to the number of possible types found by a conservative analysis.
Increasing the receiver resolution precision is equivalent to decreasing the number of infeasible types.

3. Receiver-classes criterion

A concise definition of the receiver-classes criterion has been proposed by Rountev, Milanova and Ryder [14]: the receiver-classes criterion requires testing every dynamic call site for every possible class of objects that may be bound to the call site's receiver object reference. Some elements of this definition should be highlighted:

• Even if a class inherits the method invoked at the call site, it still must be tested. Returning to the example of figure 1, class U should be tested at line 4, although U doesn't override method f. The rationale is that T.f could call a method implemented by U.
• The receiver-classes criterion is only concerned with dynamic call sites. All other call sites (including constructor invocations and calls to static methods) are not taken into account and should be covered by other testing strategies.
• The receiver-classes criterion is a white-box technique. It requires access to source code or to object code that contains enough information, such as bytecode.

The receiver-classes criterion can be applied at two scopes, program scope and component scope. At program scope, the tester treats the whole program as a single entity. The target classes are tested only in the way they are currently used. To execute the test cases, test data is entered using the same means a user (either a human or another program) would use (i.e. command-line options, standard input stream, files, GUI input, etc.). Chen and Kao [6] applied the criterion at program scope. We also apply the criterion at program scope in this paper. At component scope, the program is conceptually split into components, i.e. units that accomplish specific functionality. The tester tests each component separately, simulating arbitrary clients.
The component scope is used by Rountev et al. [14].

3.1. Polymorphism faults

The receiver-classes criterion targets polymorphism faults. It has not been designed to cover other kinds of faults. Therefore, before using this testing technique, one should make sure that polymorphism faults in OO languages are frequent enough to justify its cost (in time). At least three studies show that polymorphism faults are indeed an important part of the faults in OO software.

Chen and Kao [6] investigated three subsystems written in C++ totalling 40 kloc. They analyzed the bug reports and found that 15% of the faults (30 out of 200) were due to inheritance or polymorphism. Functional testing was quite poor at finding those faults: after running over 800 test cases, only a third (10 out of 30) were detected. Simple white-box testing did not prove effective either: running another 250 test cases and achieving around 90% statement and branch coverage, only 3 additional polymorphism faults were detected, leaving over half of the polymorphism faults (17 out of 30) undetected.

The likelihood of polymorphism faults has also been established for Java. Kim et al. [9] designed mutation operators for Java. Out of 13 operators designed to represent OO faults, three specifically represent polymorphism faults: CRT (Compatible Reference Type replacement), ICE (class Instance Creation Expression changes) and OMR (Overriding Method Removal). The CRT and ICE operators represent client-side errors, while OMR relates to server-side errors. ICE represents the binding of an object reference to a newly allocated object of a wrong (but compatible) type. More generally, a reference could be bound to any object of a wrong but compatible type. In another article [8], they present an extensive list of hypothesized flaws for the Java language, four of which concern polymorphism faults.

Offutt and Alexander [13] also studied object-oriented faults, dividing them into nine categories. In another study [2], they compared the power of a procedural testing criterion with that of three OO coupling criteria (see section 3.3). They first injected a total of 899 faults (belonging to five of the nine categories) into 10 test programs. Then, they ran test cases designed to achieve 100% coverage of all criteria. They found that branch coverage revealed only 12% of the injected faults (running a total of 23 test cases), while the OO criteria revealed 37%, 63% and 82% of the faults, requiring respectively 26, 55 and 817 test cases. They concluded that the coupling criteria are better at detecting the object-oriented faults used in the experiment.

We conclude from these studies that polymorphism faults are not uncommon in languages such as C++ and Java.
Also, traditional testing techniques, whether white-box or black-box, are not sufficient to find most polymorphism faults.

3.2. Liskov Substitution Principle

The Liskov Substitution Principle (LSP) is a criterion that is useful when designing the type families of a system. Under this principle, for two types to be related by inheritance, "the objects of the subtype ought to behave the same as those of the supertype as far as anyone or any program using the supertype object can tell" [11]. The objects of the subtype must not break the assumptions clients can infer from the specification of the supertype.

The receiver-classes criterion partly tests the conformance of a program to the Liskov Substitution Principle. The criterion verifies that the specific assumptions made by specific clients hold for the server class families as they stand at the time of testing. At least two kinds of errors can be detected: (1) a client at some point makes wrong assumptions regarding a server class hierarchy, or (2) a derived class in a server hierarchy does not respect the behavior of the base server class. Note that the emphasis is on the client side, not on the server side: even if a given client shows no faults after being submitted to test cases achieving receiver-classes coverage, another client with other assumptions could fail. Also, the client could fail if new subclasses are added to the server hierarchy, either because these subclasses do not conform to the base class behavior or because they trigger wrong client assumptions.

To summarize, the receiver-classes criterion is designed to uncover these faults [4]:

1. An object reference is bound to an object of the wrong type.
2. A client assumes that a server class family conforms to the LSP when in fact some server class does not.
3. A client using a server class family that conforms to the LSP makes assumptions that violate a method's specification.

3.3. Variants

Chen and Kao [6] propose the all-bindings criterion to uncover polymorphism faults. Their criterion differs from the receiver-classes criterion in two ways. First, it covers not only method calls but also access to data members. Second, at call sites all possible bindings of both the receiver object and the parameters (if any) must be tested in combination.

McDaniel and McGregor [12] adapted Robust Testing from procedural software, aiming to detect both polymorphism and state faults. At each call site, combinations of possible bindings of sender object, receiver object and parameter objects, along with sender state, receiver state and parameter states, must be tested. Since the number of combinations could be enormous, they use Orthogonal Array System Testing to reduce the number of combinations while remaining systematic.

Alexander and Offutt [1] [2] designed four criteria to cover the def-use pairs that occur in a given method, taking into account the calls made from that method. Two of these criteria, all-poly-classes and all-poly-coupling-defs-uses, consider polymorphism and dynamic binding. To cover all-poly-classes, a possible receiver binding at a call site is tested as many times as the call site appears in def-use pairs, considering the feasible type substitutions. The all-poly-coupling-defs-uses criterion further requires that all def-use paths be covered.

Binder [4] proposes the Polymorphic Message Test strategy. It applies at component scope. The main difference with the receiver-classes criterion is that it tests the method bindings, not the object bindings. Hence, a subclass that does not re-implement a given method does not have to be tested.
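The difference between testing method bindings and testing object bindings can be made concrete with a small sketch. The classes below (Account, OverdraftAccount, InheritedCallDemo) are hypothetical and do not come from the paper or its benchmarks; they only illustrate the rationale given for the criterion, namely that an inherited method may call a method implemented by the subclass, so the same call site can behave differently for a receiver class that does not re-implement the invoked method.

```java
// Hypothetical illustration; none of these classes appear in the paper.
class InheritedCallDemo {
    // The dynamic call site under test: the receiver may be bound to either class.
    static int client(Account receiver) {
        return receiver.withdraw(120);
    }

    public static void main(String[] args) {
        System.out.println(client(new Account()));          // receiver bound to Account
        System.out.println(client(new OverdraftAccount())); // receiver bound to OverdraftAccount
    }
}

class Account {
    protected int balance = 100;

    // Inherited unchanged by every subclass.
    int withdraw(int amount) {
        if (amount <= limit()) {
            balance -= amount;
        }
        return balance;
    }

    // Hook that subclasses may override.
    int limit() {
        return balance;
    }
}

class OverdraftAccount extends Account {
    // Only the hook is overridden; withdraw() itself is inherited as-is.
    @Override
    int limit() {
        return balance + 50;
    }
}
```

Under a method-binding strategy the call receiver.withdraw(120) has a single target method, so one test would suffice; under the receiver-classes criterion both Account and OverdraftAccount must be exercised at that call site, which is what exposes the differing balances.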

4. Type analyses

A type analysis determines the types of the objects that can be bound to an object reference v at a program point p. Informally, we call these types the possible types (or possible bindings) of v at p. We can show that it is safe (conservative) to include infeasible types by studying how the results of type analyses are used.

Type analyses were originally used to optimize method calls. Conceptually, a type analysis finds at each call site the possible types of the receiver. The set of possible target methods can then be determined, yielding the call graph. If a dynamic call site has a single target method, an optimizer can replace the dynamic call with a static call. Furthermore, the optimizer can inline the target method. Note that it is conservative with respect to these two optimizations to include infeasible types. For testing it is also conservative to add infeasible types: this would lead a tester to attempt infeasible test cases, but would not remove a valid test case.

All investigated analyses are conservative. This is an important requirement. If the analyses failed to be conservative, a tester achieving 100% coverage of receiver-classes would falsely believe that all possible bindings had been properly tested. In this section we recall the description of three fast, linear-complexity analyses: Class Hierarchy Analysis (CHA), Rapid Type Analysis (RTA) and Variable Type Analysis (VTA). All three type analyses studied are flow- and context-insensitive, so the possible types for an object reference v are the same at all program points.

To illustrate how the three analyses work, we use a small program example shown in figure 2. Consider the call site v.m() at line 5. It is obvious by inspection that, no matter the input given to the program, the only possible binding of v at line 5 is to the object of class A allocated at line 4. This result is shown in figure 3, along with the results of the type analyses.
Note that the possible classes also represent the bindings that would have to be covered to satisfy the receiver-classes criterion. Let's see how the analyses compare.

4.1. Class Hierarchy Analysis

Of all type analyses studied in this paper, Class Hierarchy Analysis [7] is the simplest to implement and the least precise. CHA uses only the declared type of an object reference and the class hierarchy of the program to determine the possible types of the object reference. This set of types is equal to the type family of the declared type. Recall that the type family of type T includes T and all its subtypes. In the example, v is of declared type A and A is the parent of classes B, C, D and E. Hence, Class Hierarchy Analysis determines that the possible types of v are A, B, C, D and E.

     1 public class A {
     2   public static void main(String[] args) {
     3     A v = new B();
     4     v = new A();
     5     int res = v.m();
     6     A v2 = new C();
     7   }
     8   public D never_called() {
     9     return new D();
    10   }
    11   public int m() { return 1; }
    12 }
    13 class B extends A {}
    14 class C extends A {}
    15 class D extends A {}
    16 class E extends A {}

Figure 2. Type analysis example

    Analysis     Possible classes of v at line 5
    CHA          {A, B, C, D, E}
    pRTA         {A, B, C, D}
    oRTA         {A, B, C}
    VTA          {A, B}
    Inspection   {A}

Figure 3. Partial results of type analyses on figure 2

4.2. Rapid Type Analysis

There are two versions of Rapid Type Analysis [3], a pessimistic version (which we denote pRTA) and an optimistic version (oRTA). Conceptually, both analyses intersect the results of CHA with the instantiated types of the program. Indeed, if a class is never instantiated in the program, then an object reference cannot be bound to an object of that class. Both versions use a single set of instantiated types for the whole program. The results of RTA will always be at least as precise as those of CHA because of the properties of set intersection.
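As a minimal sketch of the set operations just described, the following Java fragment computes the CHA type family for the hierarchy of figure 2 and then intersects it with a set of instantiated classes, as RTA does conceptually. The class name ChaRtaSketch, its methods and the hard-coded hierarchy are assumptions made for illustration; the instantiated-class sets are supplied by hand to match the pRTA and oRTA rows of figure 3 rather than being collected from bytecode as a real implementation would.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; not the Soot implementation used in the paper.
class ChaRtaSketch {
    // Direct-subclass relation of figure 2: B, C, D and E all extend A.
    static final Map<String, List<String>> SUBCLASSES = Map.of(
        "A", List.of("B", "C", "D", "E"),
        "B", List.of(), "C", List.of(), "D", List.of(), "E", List.of());

    // CHA: the declared type plus all of its transitive subtypes (its type family).
    static Set<String> cha(String declaredType) {
        Set<String> family = new TreeSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(declaredType));
        while (!work.isEmpty()) {
            String type = work.pop();
            if (family.add(type)) {
                work.addAll(SUBCLASSES.getOrDefault(type, List.of()));
            }
        }
        return family;
    }

    // RTA: the CHA result intersected with one program-wide set of instantiated classes.
    static Set<String> rta(String declaredType, Set<String> instantiated) {
        Set<String> result = cha(declaredType);
        result.retainAll(instantiated);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(cha("A"));
        // Classes instantiated anywhere in the program (pessimistic view).
        System.out.println(rta("A", Set.of("A", "B", "C", "D")));
        // Classes instantiated in methods reachable from main (optimistic view).
        System.out.println(rta("A", Set.of("A", "B", "C")));
    }
}
```

Because rta only removes elements from the CHA result, its output is always a subset of the CHA output, which is the set-intersection property the text relies on.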
The pessimistic version (pRTA) analyzes every method of the program to collect the instantiated classes. Thus, for the example, it finds {A, B, C, D} as the set of instantiated types. The optimistic version (oRTA) builds the set of instantiated types as it builds the call graph. It analyzes only the methods that can actually be called. Thus, oRTA does not analyze method A.never_called, so class D is not included in the set of instantiated types. This explains why the set of possible types of v as determined by oRTA includes only three classes, as opposed to four for pRTA. In the rest of this paper, the acronym RTA refers to the optimistic version.

4.3. Variable Type Analysis

Variable Type Analysis was recently designed by the Sable Research Group at McGill University [16]. This technique is more precise than the previous ones because object references of the same declared type can be associated with different sets of possible types. To this end, it uses a type propagation graph whose nodes represent object references. Each node is annotated with a set of possible types, and edges show the propagation of types. The algorithm proceeds in several steps.

1. Nodes initialization. A node is created for each object reference in the program. A node representing the return value is created for each method returning an object. Primitive types are not assigned any node.
2. Edges initialization. For each assignment between object references (including implicit assignments due to parameter bindings and exception object bindings in catch clauses), an edge is added from the node representing the right-hand side reference to the node representing the left-hand side reference. An already built call graph is used when creating the edges between actual and formal parameters, and from return values to assigned object references.
3. Types initialization. For each assignment from an object allocation expression to an object reference, the set of types of the node representing the object reference is augmented with the instantiated class.
4. Strongly connected components (SCC) collapse. An SCC exists when there is a sequence of assignments between n object references, of the form v2 = v1; v3 = v2; ...; v1 = vn;. The nodes that form an SCC are merged together. Thus, the object references that they represent will have the same possible types. The set of types associated with an SCC is the union of the types of the nodes it contains. The graph becomes a DAG. A node now represents either an SCC or a single object reference.
5. Propagation. The propagation follows a topological order of the graph, so a single pass is sufficient to propagate all types. The types associated with a node are the union of its initial types (steps 3 and 4) and the types of all its predecessors in the graph.

Returning to the example in figure 2, one node (call it n) is created to represent object reference v in the type propagation graph (step 1). Since v is not assigned any object reference, the graph contains no edge pointing to this node (step 2). At line 3, v is assigned a newly allocated object of type B and at line 4, an object of type A, so the initial set of types for n is {A, B} (step 3). This is also n's final set of types, since this node is not part of any SCC (step 4) and no types are propagated to it (step 5).

4.4. VTA variants for testing

We should note that VTA is pessimistic. It uses a call graph which remains the same throughout the algorithm. This avoids iteration and keeps the time complexity of the algorithm linear with respect to the size of the program. In [16], the authors showed that using two passes of VTA (with the second pass being fed the call graph produced by the first pass) could slightly improve the results for some test programs. Since our goal is to compute the possible types of objects for testing purposes, and performing those tests will usually take much longer than the calculations, we can spare a few minutes of calculation, as long as the calculation time does not explode. We have therefore modified VTA to include an iterative algorithm with a simple convergence criterion. We call this variation of VTA the iteration variant, denoted VTA_n. The iterations end when the type propagation graph remains constant: (1) its nodes (and SCCs) remain the same and (2) the types associated with the nodes also remain the same.
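The type propagation performed by standard VTA can be sketched in a few lines of Java. This is an illustrative approximation, not the Soot implementation: instead of collapsing SCCs and propagating in topological order (steps 4 and 5), it uses a plain worklist that pushes types along edges until nothing changes, which reaches the same final type sets but without the single-pass guarantee. The class VtaSketch, its method names and the node name neverCalledReturn are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of a type propagation graph; a worklist replaces SCC collapse.
class VtaSketch {
    final Map<String, Set<String>> types = new HashMap<>();       // node -> possible types
    final Map<String, List<String>> successors = new HashMap<>(); // assignment edges

    // Step 1: one node per object reference (or method return value).
    void node(String ref) {
        types.putIfAbsent(ref, new TreeSet<>());
        successors.putIfAbsent(ref, new ArrayList<>());
    }

    // Step 2: the assignment "lhs = rhs" adds an edge from rhs to lhs.
    void assign(String lhs, String rhs) {
        node(lhs);
        node(rhs);
        successors.get(rhs).add(lhs);
    }

    // Step 3: the assignment "ref = new Cls()" seeds the node with class Cls.
    void allocate(String ref, String cls) {
        node(ref);
        types.get(ref).add(cls);
    }

    // Steps 4-5, approximated: propagate along edges until a fixed point is reached.
    void propagate() {
        Deque<String> work = new ArrayDeque<>(types.keySet());
        while (!work.isEmpty()) {
            String n = work.pop();
            for (String succ : successors.get(n)) {
                if (types.get(succ).addAll(types.get(n))) {
                    work.add(succ); // succ gained types, so revisit its successors
                }
            }
        }
    }

    // Graph for the program of figure 2.
    static VtaSketch figure2() {
        VtaSketch g = new VtaSketch();
        g.allocate("v", "B");                 // line 3
        g.allocate("v", "A");                 // line 4
        g.allocate("v2", "C");                // line 6
        g.allocate("neverCalledReturn", "D"); // line 9, return-value node
        g.propagate();
        return g;
    }

    public static void main(String[] args) {
        System.out.println(figure2().types.get("v"));
    }
}
```

Since no edge leads into the node for v, classes C and D never reach it, which is why VTA reports only {A, B} in figure 3 while RTA, with its single program-wide set, cannot separate v from v2.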
We describe another variation of VTA called the intersection variant, denoted VTA∩. As we will see in section 5.2, VTA sometimes propagates classes that are never instantiated, due to the way it handles dynamic methods, i.e. methods that can allocate objects of any class, such as Class.newInstance. It is often possible to investigate the callers of these methods to determine which classes are actually instantiated, reducing the number of infeasible classes. Note that this requires manual intervention.

5. Java-specific hurdles

There are two features of the Java language that RTA and VTA must deal with. One is native methods and the other is dynamic class loading. These two features overlap: classes are dynamically loaded in some native methods.

5.1. Native methods

Native methods are implemented in other programming languages, usually C or assembly. Their declaration is qualified with the native keyword and they don't have a body. The Java Native Interface (JNI) [10] is the mechanism that allows native methods to be called from Java-implemented methods, to call Java-implemented methods, and to access static and instance fields of Java classes. Like regular methods, native methods can allocate objects and propagate objects. Native methods are compiled or assembled into the platform executable format.

We have used analyses implemented in the Soot framework [18], which operate solely on Java bytecode. Therefore the analyses cannot analyze the source code or object code of native methods. In particular, they cannot analyze object allocation and object propagation. If RTA and VTA are to be conservative, they must somehow handle these two object manipulation activities.

In native methods, objects are created via calls to the JNI functions AllocObject, NewObject, NewObjectA and NewObjectV. Objects are created by the Java Virtual Machine during its startup, in native library methods or in user-defined native methods. None of the programs we have used in the experiments contain user-defined native methods, so we only have to deal with objects created by the JVM and in native library methods.

Object allocation in native methods is a problem for both RTA and VTA. If RTA is to be conservative and also be more precise than CHA, then manual inspection is necessary. We have thus manually investigated and then specified the classes that can be instantiated in native methods. VTA also needs manual inspection to determine the connections between the nodes of the native methods and the rest of the graph (i.e. object propagation), and the types that are assigned initially to these nodes (i.e. object allocation). Class VTANativeAdjustor, provided in the Soot framework, specifies the effects of native methods from Blackdown JDK.

5.2. Dynamic class loading

In Java methods, objects are usually allocated using the new operator. There are other ways to allocate objects, which in this paper we call dynamic class loading.
It allows a program to instantiate an object without having to specify its class name at compile time. In JDK 1.3.1, four dynamic methods can be called from user code:

1. java.lang.Class.newInstance()
2. java.lang.reflect.Constructor.newInstance(Object[])
3. java.io.ObjectInputStream.readObject()
4. java.rmi.Naming.lookup(String)

The first two methods are part of the Java Reflection mechanism. The third reads an object from a stream, such as a file. The last is part of Java RMI (Remote Method Invocation). All these methods return an Object. These dynamic methods cause problems for RTA and VTA because these analyses assume that all objects are allocated using a new expression.

We have manually analyzed the test programs' source code, looking for calls to these methods. We have been able to determine, for each application method, the set of classes that could be dynamically loaded. We have found that 8 projects (out of 17) call at least one dynamic method. This is consistent with Sweeney and Tip's findings: in one of their studies [17], 9 out of 13 test programs used dynamic loading. Note that dynamic class loading usage varies greatly across projects, even across projects of the same size: SEA calls dynamic methods extensively, while Soot calls them only once.

The impact of dynamic class loading is different for RTA and VTA. For RTA, the effects are global: if a class is instantiated at a single point in the program, then it becomes part of the whole-program set of instantiated types. Therefore, the calls to these methods must be investigated, and their effects specified. We have to make sure that every class that can be dynamically loaded is added to the set of instantiated types (so RTA remains conservative) and at the same time we have to be as precise as we can. For VTA, the effects are (to some extent) localized: the types returned from the dynamic methods do not propagate to all nodes of the type propagation graph.
Therefore, even if no analysis is performed on the use of these methods, VTA still finds fewer infeasible types than CHA. Moreover, in Soot's implementation of VTA, Class.newInstance is handled in a special way in order to reduce the number of infeasible types. Starting from the variables to which its return value is assigned, an algorithm proceeds through their successors to find casts, and adds the cast class plus all its subclasses as possible classes for the variables. (A rationale for this algorithm is that programmers know what they are doing: if they insert a cast, it is because they expect the returned object to be of that type.) If a method is invoked on the object before it has been cast, then the algorithm assumes that its possible types are all classes.

This improvement has some limitations. First, it is only applied to Class.newInstance, not to the other dynamic methods. This means that for this call site (assuming v is an ObjectInputStream):

    Integer i = (Integer) v.readObject();

every class will be propagated to i, and will be propagated further if i is assigned to other variables. Second, sometimes the object returned by Class.newInstance is used as a receiver without prior casting. Thus, infeasible types can propagate in the graph.

Due to these limitations, it can be worthwhile to specify the types that can actually be created in the dynamic methods. This variation (VTA∩) increases the precision at the expense of manual intervention. Since RTA relies on a specification of dynamically loaded classes found by manual inspection while standard VTA only relies on automatic analysis, for some call sites RTA can yield fewer types than standard VTA. But VTA∩ will always find as few or fewer types than RTA.

Table 1. Test programs
Columns: Name, Source LOC, Classes, Bytecode files (kB)
Rows: BCEL, BIOM, DTDexplorer, FastWars, GAV, Humanoid, JavaDocHelper, JFlex, JReversePro, JTop, Mork, Muffin, RabbIT, SEA, SiteCompiler, Soot, Urlchk, TOTAL, AVERAGE

6. Experiments

We have carried out experiments with two goals in mind. The main goal is to compare, for the receiver of each call site in a program, the number of possible types that are found using CHA, RTA and VTA. The second goal is to find out the further reduction that can be obtained using our variants, namely the iteration (VTA_n) and intersection (VTA∩) variants.

For every receiver, the set of types found by CHA will be a superset of the set of types found by either RTA or VTA∩. The magnitude of the reduction obtained by RTA or VTA is of the utmost importance: if the reduction is small, then the extra effort (both in implementation and computation time) may not be justified. On the other hand, if the reduction is substantial, then a tester would be well advised to use one of the more sophisticated analyses.

The advantage of increasing the receiver resolution precision (or, stated another way, of decreasing the number of infeasible types for a receiver) is that a tester wastes less time figuring out that a test case is impossible to execute. Also, it can sometimes be difficult for a tester to be sure that a test case is really impossible to execute, rather than that she has not tried hard enough. We stress that increasing the receiver resolution precision merely decreases the number of test cases to try to execute; it does not change the number of polymorphism bugs that will be found.
In other words, the same number of bugs will be found with less effort (and less frustration). The experimental results will guide testers in choosing the analysis (and the improvements, if applicable) that yields the best effort/precision tradeoff Experimental set-up We have used the Soot framework [18] to compare the receiver resolution precision of the analyses. Soot is a Java bytecode optimizer. Since it operates on bytecode, it can analyze library classes as easily as application classes. It has been designed to support new analyses. Implementations of CHA and standard VTA are already provided in the framework. We have made a number of additions to the framework: ffl We have designed and implemented the iteration and intersection variations to VTA, as described in section 4.4. For the iteration variant, we have built a chain of analyses, as shown in figure 4. We see that orta builds its own call graph from scratch, the first pass of VTA uses orta s call graph, the second pass of VTA uses the call graph built by the first pass, and so on until VTA intersected with the instantiated types (VTA ) uses the call graph of the last pass of VTA (VTA n ). ffl VTA and RTA need to know about entry points, i.e. the methods that can be invoked without a specific call in the program. These are not trivial: unlike C/C++, the main method is not the sole entry point. In the original implementation of VTA, applet s entry points were not taken into account; we have added them since many of our benchmarks are applets. ffl We have implemented both the optimistic and pessimistic versions of RTA. In this paper results are reported solely for the optimistic version. The experiment was carried over a group of 17 test programs, totalling 600 kloc (excluding library code). The test programs are presented in table 1. The programs have been chosen from a number of different domains, including games (e.g. FastWars and Humanoid), scientific (e.g. SEA, BIOM and GAV) and compiler (e.g. 
Soot, JReversePro, and SiteCompiler). To the best of our knowledge, they are all unrelated, except for GAV and BIOM. Two test programs

(Soot and SEA) contain more than 100 kloc. The number of lines of code was extracted using wc.

Figure 4. Propagation of call graphs (nodes: CHA, pRTA, oRTA, VTA_1, ..., VTA_n, VTA∩).

7. Results

Table 2 shows the results of applying the three type analyses (including the two VTA variants) to the test programs. We are interested in the cumulative number of possible benchmark types for benchmark call sites. This figure corresponds to the number of bindings to cover. The cumulative number of types is the sum of the number of types over all call sites. The possible types comprise only types defined in the application, not in libraries. Benchmark call sites are located in a class that is part of the test program itself, not in a library it uses. To illustrate this definition, let's look at the following class:

     1  public class App {
     2    public static void main(String[] args) {
     3      C c = new C();
     4      A a = c;
     5      LibClass l = new LibClass();
     6      a.toString(); // 3 types: {A, B, C}
     7      a.toString(); // 3 types: {A, B, C}
     8      c.toString(); // 1 type: {C}
     9      l.getClass(); // 0 types
    10    }
    11  }

Assume that class A has subclasses B and C, that all three are concrete benchmark classes, and that class LibClass is a library class with no subclass. For this program, CHA would report seven possible types. Note that at line 7 three more possible types are counted, even though the receiver (a) is the same as at line 6. At line 9 no type is counted because CHA determines that object reference l can only be bound to an object of class LibClass, which is a library class. Using these results, a tester would try to cover seven bindings. Only three of these bindings are feasible (with object references a and c being bound to the object of class C allocated at line 3). All three bindings can be covered with a single test data set.
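The counting rule in the App example can be sketched in a few lines of Java. This is only a toy model under stated assumptions: the hierarchy is encoded by hand as a child-to-parent map, the class and method names (ReceiverTypeCount, cha, rta, cumulativeCha, cumulativeRta) are hypothetical and are not part of Soot, and the RTA-style filter simply intersects the CHA answer with the set of instantiated classes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the App example; none of these names come from Soot. */
final class ReceiverTypeCount {
    // Child -> parent edges of the example hierarchy (B and C extend A).
    static final Map<String, String> PARENT = new HashMap<>();
    // Concrete classes defined by the test program itself; LibClass is excluded.
    static final Set<String> BENCHMARK = Set.of("A", "B", "C");

    static {
        PARENT.put("B", "A");
        PARENT.put("C", "A");
    }

    /** True if cls equals ancestor or inherits from it. */
    static boolean isSubtypeOf(String cls, String ancestor) {
        for (String c = cls; c != null; c = PARENT.get(c)) {
            if (c.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    /** CHA-style answer: all benchmark subtypes of the receiver's declared type. */
    static Set<String> cha(String declaredType) {
        Set<String> types = new HashSet<>();
        for (String cls : BENCHMARK) {
            if (isSubtypeOf(cls, declaredType)) {
                types.add(cls);
            }
        }
        return types;
    }

    /** RTA-style answer (assumption): the CHA set restricted to instantiated classes. */
    static Set<String> rta(String declaredType, Set<String> instantiated) {
        Set<String> types = cha(declaredType);
        types.retainAll(instantiated);
        return types;
    }

    /** Sum of the CHA set sizes over all call sites. */
    static int cumulativeCha(List<String> receivers) {
        int total = 0;
        for (String declared : receivers) {
            total += cha(declared).size();
        }
        return total;
    }

    /** Sum of the RTA-style set sizes over all call sites. */
    static int cumulativeRta(List<String> receivers, Set<String> instantiated) {
        int total = 0;
        for (String declared : receivers) {
            total += rta(declared, instantiated).size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Declared receiver types at lines 6, 7, 8 and 9 of the App class.
        List<String> receivers = List.of("A", "A", "C", "LibClass");
        // Classes allocated at lines 3 and 5.
        Set<String> instantiated = Set.of("C", "LibClass");
        System.out.println("CHA: " + cumulativeCha(receivers));
        System.out.println("RTA: " + cumulativeRta(receivers, instantiated));
    }
}
```

Under these assumptions the sketch counts seven bindings for CHA, matching the text, and three for the RTA-style filter, which in this small example coincides with the three feasible bindings.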
Returning to table 2, the column CHA shows the cumulative number of possible types as computed by CHA; RTA is for optimistic RTA; VTA_1 is for standard VTA; VTA_n is VTA with the convergence criterion; finally, VTA∩ is for the intersection between VTA_n and the instantiated types. Four reductions are shown, each of some analysis over CHA, in order starting from RTA: r_rta is RTA over CHA, r_vta is VTA_1 over CHA, r_n is VTA_n over CHA, and r∩ is VTA∩ over CHA. The reduction of the number of possible types for program P when using analysis X over analysis C is computed as:

$$ r = 100 \times \frac{C[P] - X[P]}{C[P]} \qquad (1) $$

where C[P] represents the number of possible types found in program P by analysis C, and X[P] the number found by analysis X.

We see that, on average for all benchmark call sites, RTA finds 13% fewer possible types than CHA (line AVERAGE, column r_rta). This means that a tester would try to cover 13% fewer bindings, and the bindings removed are all infeasible. RTA is more precise than CHA for all but three programs.

The reduction obtained by standard VTA is more substantial: 23% of the types found by CHA are not found by standard VTA (column r_vta). Standard VTA is more precise than CHA for all test programs. The reductions range from 3% (Urlchk) to 68% (FastWars). Note that for two programs (Mork and SEA), the first pass of VTA yields more types than RTA. This is due to the handling of dynamic methods, as explained in section 5.2.

Iteration is quite worthwhile, reducing the number of possible types for 7 test programs, with an additional average reduction of possible types over standard VTA of 5% (column r_n). The number of iterations needed to achieve convergence is surprisingly constant: 7 iterations are sufficient for all but two test programs (Soot and BCEL), which need 8 iterations.

The intersection variation also produces significant improvements, decreasing the number of possible types by another 5% on average (column r∩). Thirteen test programs benefited from this variant.
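As a quick check of equation (1), the reduction can be computed as below. The class name Reduction and the counts 200 and 150 are hypothetical illustration values, not figures from table 2.

```java
/** Sketch of equation (1); the counts used in main are made-up illustration values. */
final class Reduction {
    /**
     * Percentage of the possible types reported by the baseline analysis C
     * that are no longer reported by analysis X, for one program P.
     */
    static double reduction(long baselineTypes, long analysisTypes) {
        if (baselineTypes <= 0) {
            throw new IllegalArgumentException("baseline count must be positive");
        }
        return 100.0 * (baselineTypes - analysisTypes) / baselineTypes;
    }

    public static void main(String[] args) {
        // Hypothetical program: the baseline reports 200 possible types, X reports 150.
        System.out.println(reduction(200, 150)); // 25.0
    }
}
```

A reduction of 0 means that X reports exactly as many possible types as the baseline; a larger value means fewer bindings for the tester to attempt.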
We have also shown the weighted average reductions

(last line in table 2). A weighted average reduction is computed the same way as a normal reduction, except that it operates on the TOTAL line (i.e., P = TOTAL). The weighted average reductions are even better than the average reductions for all analyses; for example, VTA∩ finds 40% fewer types than CHA. This means that our results are not biased by a few small programs that would exhibit higher reductions than larger programs.

Table 2. Possible types. Columns: Name; cumulative number of possible types (CHA, RTA, VTA_1, VTA_n, VTA∩); reductions in % (r_rta, r_vta, r_n, r∩). Rows: JReversePro, FastWars, JFlex, SEA, Soot, Humanoid, RabbIT, Muffin, DTDexplorer, Mork, Urlchk, BCEL, JavaDocHelper, SiteCompiler, JTop, GAV, BIOM, followed by AVERAGE, TOTAL and WEIGHT. AVG lines.

7.1. Discussion

For our test programs, RTA suggests 13% fewer bindings than CHA. Since RTA is easily implemented and is fast to execute (as shown in Bacon and Sweeney [3], the speed difference between the two analyses is marginal), testers should always prefer RTA over CHA to determine the possible types of receiver objects at call sites. VTA is harder to implement than RTA, and it is significantly slower than RTA (although it is a linear-time algorithm), but the payoff is great: the number of bindings to cover is cut by another 10%. Each of our variations further cuts this figure by 5%. Both of them are easily implemented. The intersection variation is fast to execute, while the convergence variation takes more time. Overall, VTA∩ suggests 32% fewer bindings than CHA.

The receiver resolution precision could be further improved if some sort of declarative type assertion were available in programming languages. Unfortunately, only cast constructs are available in Java. Although it may not be the best practice, casts could be inserted to reduce the number of possible types. This could be especially helpful when using container objects. To illustrate, suppose we have a simple container as shown in figure 5 and we run VTA.
In the type propagation graph, there will be a single node representing attribute x for all Container objects. Method test inserts a String object in the first container and an Integer object in the second container. The effect is that the set of types of the node corresponding to Container.x in the type propagation graph will contain two types, String and Integer. Thus, VTA determines that at line 10 the c1.get() call can return an object of type String or Integer, although by inspection we see that only one type (i.e. String) is possible. Appropriate casts could be inserted at statements where an element is retrieved from a container object. Thus, line 10 would become:

    String s = ((String) c1.get()).toString();

These insertions would help all studied type analyses. To see how effective this simple technique can be, table 3 shows the number of possible types for program Mork, before and after the transformation. We see that by making only 6 one-line modifications, CHA finds a third fewer possible types, and VTA 20% fewer types.
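The merging effect described above can be illustrated with a deliberately small model. This is a sketch under assumptions, not VTA as implemented in Soot: the class FieldNodeSketch and its methods are hypothetical, the type propagation graph is reduced to a map from node names to sets of type names, and a cast is modeled as a filter that keeps only the type named in the cast (sufficient here because String and Integer have no subclasses).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of a single field node shared by all Container objects. */
final class FieldNodeSketch {
    // Node name -> set of types that may reach the node.
    private final Map<String, Set<String>> reaching = new HashMap<>();

    /** Records that an object of the given type flows into the node. */
    void flow(String node, String type) {
        reaching.computeIfAbsent(node, k -> new TreeSet<>()).add(type);
    }

    /** Types that may be read back from the node. */
    Set<String> typesAt(String node) {
        return new TreeSet<>(reaching.getOrDefault(node, new TreeSet<>()));
    }

    /** Assumed cast model: only the type named in the cast survives. */
    static Set<String> afterCast(Set<String> types, String castType) {
        Set<String> kept = new TreeSet<>();
        for (String t : types) {
            if (t.equals(castType)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        FieldNodeSketch graph = new FieldNodeSketch();
        // c1.put(new String("!")) and c2.put(new Integer(10)) both write Container.x.
        graph.flow("Container.x", "String");
        graph.flow("Container.x", "Integer");

        Set<String> fromGet = graph.typesAt("Container.x");
        System.out.println("c1.get() may return: " + fromGet);
        System.out.println("after (String) cast: " + afterCast(fromGet, "String"));
    }
}
```

Because both put calls write to the same node, the receiver of toString at line 10 is given two possible types; once the cast is taken into account, only String remains.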

     1  public class Container {
     2    private Object x;
     3    public void put(Object v) { x = v; }
     4    public Object get() { return x; }
     5    public static void test() {
     6      Container c1 = new Container();
     7      Container c2 = new Container();
     8      c1.put(new String("!"));
     9      c2.put(new Integer(10));
    10      String s = c1.get().toString();
    11    }
    12  }

Figure 5. Container

Table 3. Results of applying appropriate casts (program Mork). Columns: CHA, RTA, VTA_1, VTA_n, VTA∩. Rows: Original, Modified.

8. Related work

Our study is related to the ones developed by the designers of RTA [3] and VTA [16]. The main difference between their experiments and ours is that we are interested in class targets, not method targets. Basically, the results presented in [3] and [16] go one step further than ours: after determining the set of possible types at call sites, they determine the methods that can be called. For example, referring to figures 2 and 3, all analyses find that a single method can be called at line 5, namely A.m. Note that in this example, although there is no difference in the number of target methods when comparing the analyses, there is a difference in the number of target classes.

The designers of RTA and VTA developed their analyses with program optimization in mind. The number of resolved call sites is often used to compare type analyses in the context of program optimization. A dynamic call site is said to be resolved when an analysis finds that it has only one possible method target. Bacon and Sweeney [3] found that RTA could resolve about 20% more call sites than CHA. Sundaresan et al. [16] found that VTA was more effective than RTA at resolving call sites for nearly all their test programs: one pass of VTA could remove between 3% and 22% of benchmark call edges (average 11%), while RTA could remove between 2% and 4% of the edges (average 4%). These results led us to expect a reduction of the number of possible types when using RTA over CHA, and in general a reduction when using VTA over RTA.
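The distinction between class targets and method targets can be made concrete with a small sketch. The hierarchy used here (A declares m, B and C inherit it) is a hypothetical stand-in and is not taken from figures 2 and 3; the class TargetsSketch and its methods are likewise illustrative names.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy comparison of class targets and method targets at one call site. */
final class TargetsSketch {
    // Hypothetical hierarchy: B and C extend A.
    static final Map<String, String> PARENT = Map.of("B", "A", "C", "A");
    // Methods declared directly in each class; only A declares m.
    static final Map<String, Set<String>> DECLARED = Map.of("A", Set.of("m"));

    /** Walks up the hierarchy to find the method that a receiver of class cls executes. */
    static String lookup(String cls, String method) {
        for (String c = cls; c != null; c = PARENT.get(c)) {
            if (DECLARED.getOrDefault(c, Set.of()).contains(method)) {
                return c + "." + method;
            }
        }
        return null; // no implementation found in this toy hierarchy
    }

    /** Method targets induced by a set of possible receiver classes. */
    static Set<String> methodTargets(Set<String> receiverClasses, String method) {
        Set<String> targets = new TreeSet<>();
        for (String cls : receiverClasses) {
            String target = lookup(cls, method);
            if (target != null) {
                targets.add(target);
            }
        }
        return targets;
    }

    /** A call site is resolved when it has exactly one possible method target. */
    static boolean isResolved(Set<String> receiverClasses, String method) {
        return methodTargets(receiverClasses, method).size() == 1;
    }

    public static void main(String[] args) {
        Set<String> classTargets = Set.of("A", "B", "C");
        System.out.println("class targets: " + classTargets.size());
        System.out.println("method targets: " + methodTargets(classTargets, "m"));
        System.out.println("resolved: " + isResolved(classTargets, "m"));
    }
}
```

In this toy setting the call site is resolved for optimization purposes, since only A.m can run, yet the receiver-classes criterion still asks for three bindings. This is why a more precise analysis can matter for testing even when it does not change the set of method targets.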
The authors of [3] and [16] published results for target methods rather than for target classes, although they likely computed both. Section 7 in this paper presents empirical results on the number of possible types at call sites, for Java software.

In a closely related study, Rountev, Milanova and Ryder [15] compare the absolute precision of four static type analyses (CHA, RTA, 0-CFA and Andersen points-to). The absolute precision is computed by comparing the number of bindings an analysis suggests with the number of feasible bindings, as determined by manual examination. Results are presented for a set of 8 tasks, each one testing part of the methods and the fields of between 8 and 24 classes. The precision is reported for two testing criteria, receiver-classes and target-methods. Section 7 in this paper presents the relative precision of analyses (i.e. the precision of one analysis as compared to another analysis), for a set of 17 test programs made of between 5 and 1406 classes.

9. Conclusion

The receiver-classes criterion is an object-oriented testing criterion designed to detect polymorphism faults. The criterion requires knowledge about the possible types of receivers. We have studied three fast and conservative type analyses that can be used to find these types: Class Hierarchy Analysis, Rapid Type Analysis and Variable Type Analysis. We have introduced two variations to VTA, the iteration variant (VTA_n) and the intersection variant (VTA∩). We have compared the number of bindings suggested by each analysis and found that RTA was more precise than CHA (it suggested on average 13% fewer bindings) and that VTA with both variations was much more precise than CHA (32% fewer bindings). These results agree with the ones published concerning method targets. The bindings ruled out by RTA or VTA are infeasible; hence, a tester would never have been able to conceive test data sets to cover these bindings.
Thus, using a more precise technique neither increases nor decreases the number of bugs found; it simply decreases the number of infeasible bindings a tester tries to cover. Therefore, we suggest that VTA be used to determine the bindings at dynamic call sites, although the analysis is slightly more complex and takes more time and memory to run.

Further research might examine the intuition that using a more precise analysis decreases the time spent testing: Is this intuition valid? If so, what is the testing time reduction? What is the impact on the total development effort? Another research direction concerns programming and design practices. We have seen that casting retrieved objects to specific types can substantially reduce the number of infeasible bindings. We would like to know if there are other practices that can aid with this criterion or hamper its use. Finally, this study dealt solely with Java software. It would be interesting to replicate the study for other OO languages such as C++.

10. Acknowledgments

The authors deeply wish to thank Laurie Hendren and the members of the Sable research group at McGill University for their support, discussions and the distribution of the Soot framework. Special thanks go to Ondrej Lhotak. The authors also wish to thank the reviewers for their helpful comments. This research work has been partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

[1] R. T. Alexander and A. J. Offutt. Criteria for Testing Polymorphic Relationships. In Proceedings of the Eleventh International Symposium on Software Reliability Engineering (ISSRE '00), October.
[2] R. T. Alexander, J. Offutt, and J. M. Bieman. Fault detection capabilities of coupling-based OO testing. In Proceedings of the Thirteenth International Symposium on Software Reliability Engineering (ISSRE '02), November.
[3] D. F. Bacon and P. F. Sweeney. Fast Static Analysis of C++ Virtual Function Calls. In Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA '96). ACM.
[4] R. V. Binder. Testing Object-Oriented Systems: Models, Patterns and Tools. The Addison-Wesley object technology series. Addison-Wesley, Upper Saddle River, NJ.
[5] M.-H. Chen and H. M. Kao. Effect of Class Testing on the Reliability of Object-Oriented Programs. In Proceedings of the Eighth International Symposium on Software Reliability Engineering (ISSRE '97). IEEE.
[6] M.-H. Chen and H. M. Kao. Testing Object-Oriented Programs - An Integrated Approach. In Proceedings of the Tenth International Symposium on Software Reliability Engineering (ISSRE '99). IEEE.
[7] J. Dean, D. Grove, and C. Chambers. Optimization of Object-Oriented Programs Using Static Class Hierarchy Analysis. In Proceedings of the Ninth European Conference on Object-Oriented Programming (ECOOP '95), August.
[8] S. Kim, J. A. Clark, and J. McDermid. The Rigorous Generation of Java Mutation Operators Using HAZOP. Technical report, High Integrity Systems Engineering Group, Department of Computer Science, The University of York.
[9] S. Kim, J. A. Clark, and J. McDermid. Class Mutation: Mutation Testing for Object-Oriented Programs. In OOSS: Object-Oriented Software Systems, Net.ObjectDays 2000, October.
[10] S. Liang. The Java Native Interface: Programmer's Guide and Specification. The Java Series. Addison-Wesley.
[11] B. H. Liskov and J. M. Wing. A Behavioral Notion of Subtyping. ACM Transactions on Programming Languages and Systems, 16(6), November.
[12] R. McDaniel and J. D. McGregor. Testing the Polymorphic Interactions between Classes. Technical Report, Department of Computer Science, Clemson University, March.
[13] J. Offutt, R. Alexander, Y. Wu, Q. Xiao, and C. Hutchinson. A fault model for subtype inheritance and polymorphism. In Proceedings of the Twelfth International Symposium on Software Reliability Engineering (ISSRE '01), pages 84-93, November.
[14] A. Rountev, A. Milanova, and B. G. Ryder. Class Analysis for Testing of Polymorphism in Java Software. Technical Report DCS-TR-432, Department of Computer Science, Rutgers University, February.
[15] A. Rountev, A. Milanova, and B. G. Ryder. Fragment Class Analysis for Testing of Polymorphism in Java Software. In Proceedings of the International Conference on Software Engineering (ICSE '03), May.
[16] V. Sundaresan, L. J. Hendren, C. Razafimahefa, R. Vallée-Rai, P. Lam, E. Gagnon, and C. Godin. Practical Virtual Method Call Resolution for Java. Technical Report, Sable Research Group, School of Computer Science, McGill University, November.
[17] P. F. Sweeney and F. Tip. Extracting Library-Based Object-Oriented Applications. In ACM SIGSOFT Software Engineering Notes, volume 25, November.
[18] R. Vallée-Rai, E. Gagnon, L. J. Hendren, P. Lam, P. Pominville, and V. Sundaresan. Optimizing Java Bytecode Using the Soot Framework: Is It Feasible? In Compiler Construction, 9th International Conference (CC 2000), pages 18-34, 2000.

Class Analysis for Testing of Polymorphism in Java Software

Atanas Rountev, Ana Milanova, Barbara G. Ryder
Rutgers University, New Brunswick, NJ 08903, USA
{rountev,milanova,ryder}@cs.rutgers.edu

Abstract