Change and Fault Proneness in an Object-Oriented Software System


Aditya Joshi, Sagar Patil, Omkar Shelke and Aniket Purandare
Guided by Prof. Vaishali Nandedkar
University of Pune, Department of Information Technology, PVPIT Bavdhan

Abstract

Object-oriented software development is a complex task that requires significant effort from potentially thousands of developers to produce a fault-free product. Dormant faults may result in inefficient software, particularly in large-scale systems. Since modern codebases can be millions of lines long, it is difficult to go through each module to determine potential and existing errors. If developers know which classes are more prone to faults, it is easier for them to debug those classes. In our proposed system, we use OOP metrics to predict change and fault proneness in an OOP-based system.

Keywords: change proneness, fault proneness

I. INTRODUCTION

Early detection of fault-prone and change-prone classes has a non-trivial effect on developer productivity. In our system, we examine both the change proneness and the fault proneness of a class. The change proneness of a class is defined as the probability that its source code will change in the next release. We can estimate this from the extent of changes in previous releases. It has also been observed empirically that there is a positive correlation between object-oriented metrics and change proneness. Along with concrete classes, our system predicts the change proneness of interfaces. Interfaces are less likely than classes to change as the software evolves.

Fault proneness is defined as the probability that there will be a fault in one or more modules of a class. We use a supervised machine learning approach to predict fault proneness. Random forest is an ensemble-based algorithm that has proved effective for classification and regression problems, and we apply it to predict the fault proneness of a class.
Since software modules and their dependencies can be represented as a directed graph, we represent our modules as nodes, and a directed edge from node i to node j indicates that node i depends on node j. We pre-process this graph and find its strongly connected components. Every node in this graph belongs to exactly one strongly connected component. Modules in the same component are highly dependent on each other, and hence changes or faults might propagate from one module in a component to another. Developers can also package highly dependent modules together and abstract them away, thus making packages loosely coupled and following the principles of good object-oriented design.

Author contact: Aditya Joshi: 1adityajoshi@gmail.com; Sagar Patil: sagarvp 1@rediffmail.com; Omkar Shelke: om shelke@yahoo.co.in; Aniket Purandare: purandareaniket@gmail.com

Fig. 1. Overall system overview

II. PROPOSED SYSTEM

The system we describe has three main components: one that finds change proneness, one that finds fault proneness, and one that finds strongly connected components in the software dependency graph. The change proneness module consists of two sub-modules, one that analyses history-based change proneness and one that evaluates source-code-based metrics. We have also developed a sub-module that works exclusively to predict changes in interfaces.

In the fault proneness module, we use random forests to predict faulty modules. This algorithm falls under the category of supervised machine learning and requires two phases of processing. In the first (training) phase, the input is a set of labelled vectors, called the training data. After this phase, the software automatically classifies input files as fault prone or not fault prone. Finally, we process the software graph and find the strongly connected components using the Kosaraju-Sharir algorithm.

A. Change proneness

In this section, we describe the two techniques used to calculate the probability of the change proneness of an input class. A third technique is used to calculate the change proneness of an interface.

1) History based: For an input class, we extract the data of past releases, which can be obtained from the release log or a version control system. For every pair of consecutive releases, we examine whether changes have been made. We consider a random variable X that takes the value 1 if there are non-trivial changes from release $t_i$ to $t_{i+1}$, and 0 otherwise. We determine that a change has occurred using a variant of the longest common subsequence (LCS) algorithm, a common way to measure how similar two strings are. We do this for all consecutive releases, and then estimate the probability of a change in the next release as simply

$$P_h(i) = \frac{\text{Number of changes}}{\text{Number of releases}}$$

where the subscript h indicates that the probability is based on the history log. The procedure that determines whether non-trivial changes have occurred from release $t_i$ to $t_{i+1}$ is given in Algorithm 1. Let there be n releases for a particular class, and let arr be the array that contains the locations of the different releases.

Algorithm 1 History Based Change Proneness
    procedure HISTORY BASED
        counter ← 0
        n ← number of files in revision log
        ε ← numerical threshold
        for i ← 0 to n - 2 do (i is the index of a revision file)
            if LCS(i, i + 1) > ε then
                counter ← counter + 1
        return counter / n

Fig. 2. History based change proneness

2) Source code based: Source-code-based change proneness accounts for two types of changes: changes that are caused by the class itself, and changes that are propagated to the class through its dependencies on other classes. We extract the OOP metrics of a class using a tool called ANTLR, which parses the Java source code and extracts variables, methods, etc. in accordance with a custom context-free grammar for Java. We denote the change proneness of a class i as $P_s(i)$, where the subscript s indicates that the probability is based on the source code.
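Before turning to the individual metrics, the history-based estimate of Algorithm 1 can be illustrated with a minimal Python sketch. The text does not specify its LCS variant, so the `change_ratio` function below (the fraction of lines not covered by the LCS of two revisions) is an assumption, as are the function names, the line-level comparison and the default threshold `epsilon`. Revisions are passed as strings rather than as file locations.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def change_ratio(old_lines, new_lines):
    """Assumed LCS variant: share of lines NOT covered by the LCS (0 = identical)."""
    longest = max(len(old_lines), len(new_lines))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(old_lines, new_lines) / longest


def history_change_proneness(revisions, epsilon=0.1):
    """P_h = (number of non-trivial changes) / (number of releases), as in Algorithm 1."""
    n = len(revisions)
    if n == 0:
        return 0.0
    counter = 0
    for i in range(n - 1):  # compare each release with its successor
        old_lines = revisions[i].splitlines()
        new_lines = revisions[i + 1].splitlines()
        if change_ratio(old_lines, new_lines) > epsilon:
            counter += 1
    return counter / n


# Hypothetical revision history of one class (one string per release).
revisions = ["a\nb\nc", "a\nb\nc", "a\nx\ny", "a\nx\ny\nz"]
print(history_change_proneness(revisions))
```

In this toy history the first pair of releases is identical and the other two pairs differ by more than the threshold, so two changes are counted over four releases and the estimate is 0.5, following the counter / n convention of Algorithm 1.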
In our system, $P_s(i)$ depends on a number of OOP metrics such as Weighted Methods per Class (WMC), Coupling Between Objects (CBO), number of variables, number of methods, and Response For a Class (RFC). WMC is defined as the sum of the cyclomatic complexities of all the methods in a class. CBO includes efferent coupling, which is the number of classes the given class depends upon, and afferent coupling, which is the number of classes that depend upon the given class. RFC is the total number of methods that can potentially be executed in response to a message received by an object of the class. The final change proneness based on source code is given by the following formula:

$$P_s(i) = 1 - e^{-\lambda x_i t}$$

In the formula above, $\lambda$ is a suitable constant that is estimated empirically from past data, and $t$ is the number of time units before the next release, where every time unit can potentially alter the class. For simplicity, we take $t = 1$; that is, we consider the probability of change for class i in the next release. Here, $x_i$ is a linear combination of the metrics given above, each with a certain weight: if the metrics of a class are $m_1, m_2, \dots, m_n$ and the corresponding weights are $w_1, w_2, \dots, w_n$, then $x_i$ is the sum of the $m_j$ weighted by the $w_j$.
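As a worked illustration of the source-code-based estimate, the following Python sketch computes $x_i$ and $P_s(i)$ for a single class, assuming the formula reads $P_s(i) = 1 - e^{-\lambda x_i t}$. The metric values, the weights and the value of λ are hypothetical, chosen only to make the arithmetic concrete, and the function names are not taken from the original system.

```python
import math


def weighted_metric_sum(metrics, weights):
    """x_i: linear combination of a class's metric values with their weights."""
    return sum(weights[name] * value for name, value in metrics.items())


def source_change_proneness(metrics, weights, lam, t=1.0):
    """P_s(i) = 1 - exp(-lambda * x_i * t); t = 1 gives the next-release estimate."""
    x = weighted_metric_sum(metrics, weights)
    return 1.0 - math.exp(-lam * x * t)


# Hypothetical metric values and weights for one class (illustrative only).
metrics = {"CBO": 4, "RFC": 12, "WMC": 9}
weights = {"CBO": 0.5, "RFC": 0.2, "WMC": 0.3}

x_i = weighted_metric_sum(metrics, weights)  # 0.5*4 + 0.2*12 + 0.3*9
p_s = source_change_proneness(metrics, weights, lam=0.05)
print(f"x_i = {x_i:.2f}, P_s = {p_s:.3f}")
```

Because the exponent is never positive for non-negative metrics and weights, the result stays between 0 and 1, and it grows towards 1 as the weighted metric sum increases.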

Written out, the linear combination of the metrics for class i is

$$x_i = m_1 w_1 + m_2 w_2 + \dots + m_n w_n = \sum_{j=1}^{n} m_j w_j$$

Algorithm 2 Source code Based Change Proneness
    procedure SOURCE BASED
        files ← file array from the package
        λ ← suitable constant
        t ← 1 (to calculate the change proneness for the next release)
        for file i in files do
            m_1 ← CBO
            m_2 ← RFC
            m_3 ← WMC
            w_1 ← weight (importance) of CBO
            w_2 ← weight (importance) of RFC
            w_3 ← weight (importance) of WMC
            x_i ← m_1·w_1 + m_2·w_2 + m_3·w_3
            P_s(i) ← 1 - e^(-λ·x_i·t)

Fig. 3. Source code based change proneness

Finally, to calculate the total change proneness of a class i, we take the average of the change proneness obtained by the history-based and source-based approaches, which is given by

$$P_{final}(i) = \frac{P_s(i) + P_h(i)}{2}$$

3) Interface: We use metrics such as CBO (Coupling Between Objects), LCOM (Lack of Cohesion Of Methods), NOC (Number Of Children) and WMC (Weighted Methods per Class) to predict the change proneness of interfaces. Along with these object-oriented metrics, we use certain external cohesion metrics for interfaces, such as IUC (Interface Usage Cohesion). After extracting these metrics, we use the same formula and algorithm described above to find the change proneness of the given interface.

B. Fault proneness

1) Random Forest: Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. A decision tree is a commonly used machine learning technique for classifying data. However, the estimate of a single decision tree can deviate significantly from the true probability. Hence, we use multiple decision trees and average their results, which reduces this variance. Ensembles of this kind are widely used for classification tasks. In our system, we classify classes as either fault prone or not fault prone based on a five-component vector. The output is the probability that a class is fault prone.
The components of the vector are the same metrics described for change proneness, namely WMC, RFC, CBO, number of variables and number of methods. At each node of a tree, we make a binary decision: if the relevant component of the vector is less than the threshold, we go left; otherwise we go right. The algorithm to build a random forest is given in Algorithm 3. The entropy function determines which attribute we should split the data by, using Shannon's entropy formula. The attribute that provides the maximum information gain (the largest reduction in entropy) is selected, and the data is split accordingly. After building n trees, for every class that we want to classify, we simply traverse all the trees sequentially and take the average of the probabilities that the class is fault prone.

Algorithm 3 Fault proneness - Random forest training
    procedure BUILD TREE(NODE, VECTORS)
        vectors ← current array of vectors
        n ← current node
        attributes ← remaining attributes
        if leaf node reached then
            for vector i in vectors do
                if i is fault prone then
                    n.pos ← n.pos + 1
                if i is not fault prone then
                    n.neg ← n.neg + 1
            return
        threshold ← attribute with max Shannon entropy
        less ← array whose attribute is less than threshold
        more ← array whose attribute is at least threshold
        for vector i in vectors do
            if i.attr < threshold then
                less ← i (add to the less array)
            if i.attr ≥ threshold then
                more ← i (add to the more array)
        Build Tree(n.left, less)
        Build Tree(n.right, more)

    procedure BUILD FOREST
        forest ← array of trees
        vectors ← current array of vectors (training data)
        for i ← 1 to n do
            forest[i] ← Build Tree(i, vectors)

C. Strongly Connected Components

In our internal representation, each class is represented as a node, and we refer to classes as nodes in this discussion. A directed edge from node i to node j indicates that node i depends on node j. We say that node i depends on node j if an object of node j is called in node i. In graph theory, node i is said to be strongly connected to node j if there is a directed path from node i to node j and from node j to node i. This property is transitive; that is, if i is strongly connected to j and j is strongly connected to k, then i is strongly connected to k. Thus, every graph can be split into such strongly connected components. In our representation, we split our nodes (classes) into strongly connected components using the linear-time algorithm of Kosaraju and Sharir. Nodes belonging to the same strongly connected component are highly dependent on each other, so when a node is predicted to be faulty, developers should pay attention to all the nodes in its component, as bugs in this node can propagate to other nodes in the same component. In Fig. 4, classes a, b and e are in one component, classes f and g are in another, and classes c, d and h are in the last component.
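The decomposition into strongly connected components can be sketched in Python as follows. This is a minimal, recursive rendering of the two-pass Kosaraju-Sharir procedure, not the authors' implementation. The edge list is a hypothetical dependency graph chosen so that its components match the three described for Fig. 4; the actual edges of the figure are not given in the text.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Kosaraju-Sharir: two depth-first passes over the graph, O(V + E)."""
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: record nodes in order of DFS completion on the original graph.
    visited, finish_order = set(), []

    def dfs_finish(node):
        visited.add(node)
        for nxt in graph[node]:
            if nxt not in visited:
                dfs_finish(nxt)
        finish_order.append(node)

    for node in sorted(nodes):
        if node not in visited:
            dfs_finish(node)

    # Pass 2: DFS on the reversed graph, in decreasing order of finish time.
    assigned, components = set(), []

    def dfs_collect(node, component):
        assigned.add(node)
        component.add(node)
        for nxt in reverse[node]:
            if nxt not in assigned:
                dfs_collect(nxt, component)

    for node in reversed(finish_order):
        if node not in assigned:
            component = set()
            dfs_collect(node, component)
            components.append(component)
    return components


# Hypothetical "depends on" edges between classes a..h (illustrative only).
edges = [("a", "b"), ("b", "c"), ("b", "e"), ("b", "f"), ("c", "d"),
         ("c", "g"), ("d", "c"), ("d", "h"), ("e", "a"), ("e", "f"),
         ("f", "g"), ("g", "f"), ("h", "d"), ("h", "g")]
components = strongly_connected_components(edges)
print([sorted(c) for c in components])
```

The recursive depth-first search keeps the sketch short; for dependency graphs with many thousands of classes an iterative traversal with an explicit stack would avoid Python's recursion limit.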
Fig. 4. Strongly connected classes

Algorithm 4 Fault proneness - Random forest classification
    procedure CLASSIFY(NODE, VECTOR)
        n ← current node
        if n is leaf then
            return n.pos / (n.pos + n.neg)
        attribute threshold ← threshold for the attribute type of the current node
        vector attribute ← value of that attribute in vector
        if vector attribute < attribute threshold then
            return Classify(n.left, vector)
        if vector attribute ≥ attribute threshold then
            return Classify(n.right, vector)
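Algorithms 3 and 4 can be sketched together in Python as follows. This is an illustrative reading rather than the authors' implementation: the split is chosen by maximum information gain, each tree is trained on a bootstrap sample of the training vectors (an assumption, since the pseudocode does not say how the trees are made to differ), and the per-node random attribute subsets often used in random forests are omitted. The training vectors, the labelling rule and all function names are hypothetical.

```python
import math
import random


def entropy(labels):
    """Shannon entropy, in bits, of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def best_split(rows, labels):
    """Return (gain, attribute, threshold) with maximum information gain, or None."""
    base, best = entropy(labels), None
    for attr in range(len(rows[0])):
        for threshold in sorted({row[attr] for row in rows}):
            less = [y for row, y in zip(rows, labels) if row[attr] < threshold]
            more = [y for row, y in zip(rows, labels) if row[attr] >= threshold]
            if not less or not more:
                continue
            remainder = (len(less) * entropy(less) + len(more) * entropy(more)) / len(labels)
            gain = base - remainder
            if best is None or gain > best[0]:
                best = (gain, attr, threshold)
    return best


def build_tree(rows, labels, max_depth=8):
    """Algorithm 3: split on the best attribute; leaves keep pos/neg counts."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None or split[0] <= 0:
        pos = sum(labels)
        return {"pos": pos, "neg": len(labels) - pos}
    _, attr, threshold = split
    less = [(r, y) for r, y in zip(rows, labels) if r[attr] < threshold]
    more = [(r, y) for r, y in zip(rows, labels) if r[attr] >= threshold]
    return {
        "attr": attr,
        "threshold": threshold,
        "left": build_tree([r for r, _ in less], [y for _, y in less], max_depth - 1),
        "right": build_tree([r for r, _ in more], [y for _, y in more], max_depth - 1),
    }


def classify(node, vector):
    """Algorithm 4: descend to a leaf and return pos / (pos + neg)."""
    while "attr" in node:
        node = node["left"] if vector[node["attr"]] < node["threshold"] else node["right"]
    return node["pos"] / (node["pos"] + node["neg"])


def build_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap sample (an assumption of this sketch)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest


def fault_proneness(forest, vector):
    """Average the per-tree probabilities that the class is fault prone."""
    return sum(classify(tree, vector) for tree in forest) / len(forest)


# Hypothetical training vectors: [WMC, RFC, CBO, number of variables, number of methods].
rows = [[10 + 4 * k, 20 + (7 * k) % 13, 3 + k % 5, 5 + (3 * k) % 7, 8 + (5 * k) % 11]
        for k in range(20)]
labels = [1 if row[0] >= 50 else 0 for row in rows]  # toy rule: high WMC is fault prone

forest = build_forest(rows, labels)
print(fault_proneness(forest, [90, 25, 5, 8, 12]))
print(fault_proneness(forest, [12, 25, 5, 8, 12]))
```

Storing the positive and negative counts at each leaf, as Algorithm 3 does, is what lets the forest return a probability rather than a hard label: each tree contributes the fault-prone fraction of its leaf, and the forest averages those fractions.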


III. RESULTS AND OBSERVATIONS

We include results obtained by our project using an implementation of a Randomized Queue and a Deque, which is available here: gl/wkwyr9

In the following table, CP indicates change proneness and FP indicates fault proneness. FP is calculated using our training data obtained from sample past projects.

Class               CP        FP
Deque               28.8 %    %
Randomized Queue    %         %
Subset              5.82 %    %

Results are presented in Fig. 5, Fig. 6 and Fig. 7. The jfreechart and jung libraries were used for the visualization of the data.

Fig. 5. Change proneness results

Fig. 6. Dependency graph of classes

Fig. 7. Fault proneness results

IV. CONCLUSION

Thus, our proposed system uses ensemble-based techniques to classify a class as either fault prone or not. We also use a probabilistic approach to predict the change proneness of a class and of an interface. Finally, we identify strongly connected software modules, so that these can be packaged together and errors from highly fault-prone classes can be prevented. This can considerably reduce the effort taken to detect faults in today's large-scale software systems.

ACKNOWLEDGMENT

We take this opportunity to thank our project guide, Prof. Vaishali S. Nandedkar, Head, Department of Information Technology, PVPIT Pune, for her valuable guidance in the completion of this project.
