Software Bug Classification using Suffix Tree Clustering (STC) Algorithm


International Journal of Computer Science and Technology (IJCST), Vol. 2, Issue 1, March 2011
ISSN: 2229-4333 (Print), ISSN: 0976-8491 (Online)

Naresh Kumar Nagwani (1), Dr. Shrish Verma (2)
(1) Department of CS&E, National Institute of Technology Raipur.
(2) Department of ET&C, National Institute of Technology Raipur.

Abstract

Suffix Tree Clustering (STC) is one of the popular text clustering algorithms. STC has a number of applications, the most popular being web document clustering. Software bug data contains a number of attributes such as bug-id, summary (title), description, comments, status and version, and most of the important attributes hold text data. Since software bug repositories consist mostly of text, STC can be applied to create clusters of software bug records. In this paper the STC algorithm is used for software bug classification. First, clusters are created from the bug repositories; then a label is assigned to each cluster, which indicates the class of the cluster. The STC implementation is available as part of the Carrot2 framework. The designed technique is evaluated using common clustering parameters.

Keywords

Software Bug Classification, STC Clustering, Bug Clustering, Software.

I. Introduction

A bug is a defect in software. A bug indicates unexpected behavior with respect to a given requirement during software development. During software testing, such unexpected behavior is identified by software testers or quality engineers and marked as a bug. In this paper, defect and bug are used as synonyms. Bugs are managed and tracked using a number of available tools such as Bugzilla, Perforce and JIRA.

A. Bug Repositories

Most open source projects and larger projects manage their software development data using dedicated tools. Bug tracking tools are used to manage the bugs associated with the software. These bug tracking systems provide online interfaces to the various users associated with a project. Internally, the tools manage bug repositories in which all bugs and related data are stored. For example, the bugs of the Mozilla project are tracked using Bugzilla, which provides all Mozilla bugs in the form of an online repository. By specifying the bug id in the Mozilla online repository, any user can fetch the required bug information; the repository URL takes the bug id in its id= parameter.

B. Cluster Analysis

Cluster analysis is a statistical method which identifies groups of objects that share similar characteristics. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A number of clustering algorithms are available, and a number of techniques exist for measuring the distances between data points. Some of the popular distance functions and clustering algorithms used by the suggested data mining model are explained here.

1. Major Clustering Approaches

There exist a large number of clustering algorithms. The choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. In general, the major clustering methods can be classified into the following categories: partitioning algorithms, hierarchical algorithms, density-based, grid-based and model-based methods.

C. Suffix Tree Clustering (STC) Algorithm

The first clustering algorithm to take advantage of the association between words, not only their frequencies, was Suffix Tree Clustering, used in Grouper [30,31]. STC attempts to cluster documents or search results according to the identical phrases they contain. Phrases are incorporated into STC in order to make use of the proximity and order of words, which have more descriptive power than keywords. Apart from forming clusters, phrases can be used for labeling the clusters created. STC is organized into two main phases: discovering phrase clusters (also called base clusters) and combining similar ones to form merged clusters (or simply clusters).

The Suffix Tree Clustering (STC) algorithm groups the input texts according to the identical phrases they share [31]. The rationale behind this approach is that phrases, compared to single keywords, have greater descriptive power, which results from their ability to retain the relationships of proximity and order between words. A great advantage of STC is that phrases are used both to discover and to describe the resulting groups.

The algorithm works in two main phases: a base cluster discovery phase and a base cluster merging phase. In the first phase a generalized suffix tree of all the texts' sentences is built using words as basic elements. After all sentences are processed, the tree nodes contain information about the documents in which particular phrases appear. Using that information, documents that share the same phrase are grouped into base clusters, of which only those whose score exceeds a predefined Minimal Base Cluster Score are retained. In the second phase, a graph representing the relationships between the discovered base clusters is built, based on their similarity and on the value of the Merge Threshold. Base clusters belonging to coherent subgraphs of that graph are merged into final clusters. A detailed example illustrating the STC algorithm, along with its evaluation based on standard Information Retrieval metrics and user feedback, is presented in [30,31].

STC algorithm

Step 1: Cleaning: stemming, sentence boundary identification, punctuation elimination.
Step 2: Suffix tree construction: produces base clusters (internal nodes); base clusters are scored based on size and phrase score (which depends on length and word quality).
Step 3: Merging base clusters: highly overlapping clusters are merged.
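The three steps above can be illustrated with a minimal Python sketch. It is not the Carrot2 implementation and it does not build a real suffix tree: a dictionary of word phrases stands in for the tree's internal nodes, stemming is omitted, and the stop-word list, the scoring formula (cluster size times number of non-stop words) and the 0.5 merge threshold are all assumptions chosen for illustration. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "to", "is", "when", "and"}


def tokenize(text):
    """Step 1 (cleaning): lower-case and drop punctuation; stemming is omitted."""
    return re.findall(r"[a-z0-9]+", text.lower())


def base_clusters(docs, max_len=4, min_docs=2):
    """Step 2: group documents by shared word phrases (a stand-in for a suffix tree)."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = tokenize(text)
        for start in range(len(words)):  # every suffix of the document
            last = min(start + max_len, len(words))
            for end in range(start + 1, last + 1):
                phrase_docs[tuple(words[start:end])].add(doc_id)
    clusters = []
    for phrase, members in phrase_docs.items():
        content_words = [w for w in phrase if w not in STOP_WORDS]
        score = len(members) * len(content_words)  # assumed scoring formula
        if len(members) >= min_docs and score > 0:
            clusters.append((phrase, frozenset(members), score))
    return clusters


def merge_clusters(clusters, threshold=0.5):
    """Step 3: merge base clusters whose document sets overlap heavily."""
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            a, b = clusters[i][1], clusters[j][1]
            common = len(a & b)
            if common / len(a) > threshold and common / len(b) > threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, cluster in enumerate(clusters):
        groups[find(i)].append(cluster)

    merged = []
    for members in groups.values():
        best = max(members, key=lambda c: c[2])  # highest-scoring phrase labels the cluster
        doc_ids = set().union(*(c[1] for c in members))
        merged.append((" ".join(best[0]), sorted(doc_ids)))
    return merged


bug_summaries = [
    "Crash when opening settings dialog",
    "Browser crash when opening bookmarks",
    "Login page shows wrong error message",
    "Wrong error message on login failure",
]
for label, doc_ids in sorted(merge_clusters(base_clusters(bug_summaries))):
    print(label, doc_ids)
```

On these four made-up bug summaries the sketch yields two clusters, labeled "crash when opening" and "wrong error message". The label of each merged cluster is simply its highest-scoring phrase, which mirrors the idea that STC uses phrases both to discover and to describe clusters.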

Cluster labeling algorithms group a set of documents based on a similarity score between them and then identify phrases which are representative of the cluster. They pick commonly occurring phrases in the documents and compute a score for each phrase.

D. Carrot2 Framework

The Carrot2 framework was proposed by Stefanowski and Weiss. It is an open source, component-based framework for text clustering. It allows substituting components for input (for example, snippets from other search engines), filtering, stemming, distance measure, clustering and output.

In this paper the web clustering technique STC is applied to software bug classification. It is a classification-by-clustering technique, which uses the web clustering algorithm STC for the classification of software bugs.

This paper is divided into six sections. Section two discusses previous work in the field. Section three explains the proposed methodology in detail; implementation and result evaluation are covered in sections four and five. Finally, the conclusion and future scope of the proposed work are given in section six.

II. Related and Previous Work Done

Work related to mining software repositories is discussed in this section. Software bug repositories contain a huge amount of knowledge patterns. The xFinder approach is proposed by Kagdi et al. to recommend expert developers by mining the version archives of a system [19]. The basic premise of this approach is that the developers who contributed substantial changes to a specific part of the source code in the past are likely to best assist in its current or future change. Some investigation and analysis of bug fixing is done by Ayewah and Pugh [28]. Several past projects introduce and refine an approach to finding fix-inducing commits that is based on creating a link between the bug report database and the code repository using commit messages [4,12,20,32].

The concepts of neighbors and links are introduced by Luo et al. [6] for document clustering. Some of the common problems of text clustering, such as big volume, high dimensionality and complex semantics, are studied by Stefanowski et al. [16], and solutions to these problems are also proposed. Some of the suggested solutions are subspace clustering and ontologies.

A generic open source framework for pre-processing of software bug repositories was designed by Nagwani and Verma [24]. The framework was implemented in Java, including a GUI. It was designed for extracting software bugs from online software bug repositories and parsing the retrieved files. All the parsed data can be saved in a local database, and the user can also fetch bug records from the local database. The framework is implemented in a generic way and can be plugged into any online software bug repository.

A weighted bug similarity model is proposed by Nagwani and Singh [27] for discovering similar bugs in software bug repositories. In the proposed model a bug is transformed into an object with a number of attributes. To measure the similarity between two bugs, the similarities of all attributes are calculated and weights are assigned to the similarity values; then, using a suitable threshold value, bugs can be marked as similar.

A data mining model is designed and implemented in Java by Nagwani and Verma [25] for predicting the fix duration of newly incoming software bugs in software bug repositories. The proposed model uses the similarity of textual information in software bugs. For any newly created bug, all similar bugs are first discovered in the software bug repository; then their average fix duration is calculated and used as the predicted fix duration of the new bug.

A GUI (Graphical User Interface) bug mining model is proposed by Nagwani and Singh [26]. The model was proposed to discover similar and duplicate GUI bugs for the graphical user interface of any software. The similarity of GUI components and associated events is counted for detecting similar and duplicate GUI bugs.

Grouper [30,31] is a snippet-based clustering engine based on the HuskySearch meta search engine. The main feature of Grouper is the introduction of a phrase-analysis algorithm called STC (Suffix Tree Clustering). In essence, the algorithm builds a suffix tree of the phrases in snippets; each representative phrase becomes a candidate cluster, and candidates with large overlap are merged together. The main contribution of Grouper lies in the complexity of the clustering algorithm, which allows for very fast processing of large result sets.

In [9] the problem of improving cluster labels is studied. A machine learning approach is used: given a corpus of training data, the system is able to identify the most salient phrases and use them as cluster titles. The algorithm is therefore supervised. Various issues related to Web clustering engines, such as acquisition and preprocessing, are addressed by Carpineto et al. [5]. Cluster label optimization and performance-related issues are also discussed in that study.

Eclipse [7] is a multi-language software development environment comprising an integrated development environment (IDE) and an extensible plug-in system. It is written primarily in Java and can be used to develop applications in Java.

Some of the problems of text clustering, such as the very high dimensionality of the data, the very large size of the databases and the understandability of the cluster description, are analyzed by Beil et al. [8], and an approach is proposed which uses frequent item (term) sets for text clustering. Such frequent sets are discovered using algorithms for association rule mining; clusters are then created based on the frequent term sets.

Java [14] is a general-purpose, concurrent, class-based, object-oriented language that is specifically designed to have as few implementation dependencies as possible. JBoss Seam [15] is an application framework for building Web 2.0 applications. JIRA [18] is an issue tracking, bug tracking and project tracking tool for software development teams.

A semi-automated approach is proposed by Fluri et al. [1] to discover patterns of source code change types using agglomerative hierarchical clustering. They found that change type patterns do describe development activities and affect the control flow, the exception flow, or change the API.

A standard for the classification of software anomalies was provided by the IEEE in 1993 [10] and revised in 2009 [11]. It provides a uniform approach to the classification of anomalies found in software and its documentation, and describes the processing of anomalies discovered during any software life cycle phase. Various case studies which provide quantitative data on categories of software faults, and the applicability of these fault category distributions to fault injection, are studied by Ploski et al. [13].

Various challenges in hierarchical document clustering are studied by Fung et al. [2]; they focus on key challenges such as high dimensionality, high volume of data, ease of browsing, and meaningful cluster labels. In addition, a number of document clustering algorithms


are reviewed. Wang et al. [34] have proposed discovering semantically similar terms using WordNet. Several methods have been implemented and evaluated. They have also proposed the Semantic Similarity Retrieval Model (SSRM), a general document similarity and information retrieval method suitable for retrieval in conventional document collections and the Web. Jalbert and Weimer [29] have proposed a system that automatically classifies duplicate bug reports as they arrive, to save developer time. This system uses surface features, textual semantics, and graph clustering to predict duplicate status.

III. Methodology

The overall methodology of software bug clustering using the STC algorithm is represented in fig. 1, fig. 2 and fig. 3. The methodology is divided into six stages.

Fig. 1: Retrieving Software Bugs from Online Repositories
Fig. 2: Software Bug Clustering Using STC Algorithm

A. Data Access Layer to extract the data from local database

Software bugs are retrieved from online software bug repositories, parsed, and saved to the local database.

B. Bug to object transformation

Once a software bug record has been retrieved into the local database and is available for data mining operations, it is transformed into a Java object, so that it can be stored and processed further using the Java collection API (Application Programming Interface).

C. Applying stop word elimination and stemming

The popular text mining pre-processing techniques of stop word elimination and stemming are applied here in order to pre-process the software bug records. Most software bug attributes are textual; hence they need to be prepared for mining. Stop words carry no meaning for knowledge discovery and hence need to be eliminated. Stemming is required in order to unify the terms present in a text document, so that knowledge patterns can be discovered effectively.

D. Passing the software bug objects to the Carrot2 framework

Once the software bug records have been transformed into Java objects and collected in a Java collection, the collection is passed to the Carrot2 framework, which is implemented in Java for web document clustering, to cluster the software bug records. The STC algorithm is selected for performing the software bug clustering.

E. Output: software bug clusters with labels

As soon as STC completes its task of creating clusters of software bugs, the output is generated for each cluster, with a cluster label assigned to it. This label indicates the class of the cluster. This method is an example of classification using clustering.

F. Calculating entropy and purity for each cluster

Finally, various cluster parameters are evaluated to assess the performance of the software bug clustering algorithm. Four parameters are evaluated: purity, entropy, the number of clusters created, and the time to create the clusters in milliseconds.

Fig. 3: Steps in Evaluating the STC Algorithm for Software Bug Clustering

IV. Implementation

The implementation is done using open source technologies. Java [14] is used as the primary programming language. Eclipse [7] is used as the IDE (Integrated Development Environment), MySql [23] is used as the local database for storing the software bug records, and JDBC (Java Data Base Connectivity) is used for reading software bug records from the MySql database into Java. Pre-processing of the software bug records is done in two phases. First the stop words are eliminated using the Weka API, and in the second phase stemming is done using Porter's stemmer, implemented in Java. The Carrot2 framework is a Java implementation of web document clustering. It includes an implementation of the STC web clustering algorithm, which is used to implement the proposed algorithm.

V. Experiments and Result Evaluation

Cluster quality is evaluated using various metrics. In this paper four metrics are used: purity, entropy, the number of clusters generated for different numbers of software bug records, and the time to create the clusters. The cluster quality can be judged on the basis of these parameters.

One way of measuring the quality of a clustering solution is cluster purity. Purity assumes that all samples of a cluster are predicted to be members of the actual dominant class for that cluster. Let there be k clusters of the dataset D, and let the size of cluster $C_j$ be $|C_j|$. Let $|C_j|_{class=i}$ denote the number of items of class i assigned to cluster j. The purity of this cluster is given by

$$\mathrm{purity}(C_j) = \frac{1}{|C_j|} \max_i \left( |C_j|_{class=i} \right) \tag{1}$$

The overall purity of a clustering solution can be expressed as a weighted sum of the individual cluster purities:

$$\mathrm{purity} = \sum_{j=1}^{k} \frac{|C_j|}{|D|} \, \mathrm{purity}(C_j) \tag{2}$$
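Equations (1) and (2) can be checked with a few lines of Python. This is a minimal sketch under the assumption that the true class of every bug in a cluster is known; the function names and the class labels ("ui", "db", "net") are made up for illustration and do not come from the paper's data.

```python
from collections import Counter


def cluster_purity(labels):
    """Equation (1): share of the cluster taken by its dominant class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def overall_purity(clusters):
    """Equation (2): cluster purities weighted by relative cluster size."""
    total = sum(len(labels) for labels in clusters)
    return sum(len(labels) / total * cluster_purity(labels) for labels in clusters)


# Each inner list holds the true classes of the bugs placed in one cluster.
clusters = [
    ["ui", "ui", "ui", "db"],
    ["db", "db"],
    ["net", "ui", "net", "net"],
]
print([cluster_purity(c) for c in clusters])
print(overall_purity(clusters))
```

For these three clusters the per-cluster purities are 0.75, 1.0 and 0.75, and the size-weighted overall purity is 0.4 x 0.75 + 0.2 x 1.0 + 0.4 x 0.75 = 0.8.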

In general, the larger the value of purity, the better the clustering solution.

Entropy is a measure of randomness or irregularity; looking for the most random distribution corresponds to looking for the distribution with maximal entropy. The entropy of a probability distribution P is

$$H(P) = -\sum_i p_i \log p_i \tag{3}$$

The entropy is calculated as a function of the probability $p_i$ of a software bug object belonging to a particular class i.

Based upon the proposed methodology and cluster evaluation parameters, the implementation is done and the parameters are calculated. Three software bug repositories are taken for the experiment, with different numbers of software bug records. The number of clusters generated for different numbers of software bugs in the different repositories is given in Table 1. The graph plotted for this data is shown in fig. 4. The time to create clusters for different numbers of software bug records in the different repositories is given in Table 2. The corresponding graph is shown in fig. 5.

Table 1: Number of Clusters Created in STC Algorithm (columns: Number of Bugs, Jboss-Seam, Mozilla, MySQL)
Table 2: Time To Create Clusters in STC Algorithm (columns: Number of Bugs, Jboss-Seam, Mozilla, MySQL)
Fig. 4: Time to Create Clusters in STC Algorithm

The purity values calculated for the clusters created for different numbers of software bug records in the different repositories are given in Table 3. The maximum purity value is achieved for the JBoss-Seam repository. These values are plotted in the graph shown in fig. 5.

Table 3: Purity in STC Algorithm (columns: Number of Bugs, Jboss-Seam, Mozilla, MySQL)
Fig. 5: Purity Measured for Created Clusters in STC Algorithm

The entropy values for different numbers of software bug records and different software repositories are calculated and given in Table 4. The graph plotted for the corresponding values is shown in fig. 6.

Table 4: Entropy in STC Algorithm (columns: Number of Bugs, Jboss-Seam, Mozilla, MySQL)
Fig. 6: Entropy Measured for Created Clusters in STC Algorithm

VI. Conclusion and Future Work

In this paper the STC clustering algorithm is used for software bug clustering, and it is demonstrated that STC can be used effectively for this purpose. The Carrot2 framework is used in the implementation, and software bug clusters are created using the STC algorithm. Various clustering parameters are also calculated to evaluate the quality of the clusters created. Using STC, a cluster label is assigned to each created cluster, which indicates the class of the cluster. This is an effective way of classifying software bugs in a short time, and the cluster purity obtained is acceptable. Future work could apply more pre-processing to the software bug repositories in order to reduce the mining time, compare other web clustering algorithms with STC, and design hybrid algorithms to improve software bug classification.

References

[1] Beat Fluri, Emanuel Giger, Harald C. Gall, "Discovering Patterns of Change Types", IEEE, 2008.
[2] Benjamin C. M. Fung, Ke Wang, Martin Ester, "Hierarchical Document Clustering", The Encyclopedia of Data Warehousing and Mining, John Wang (ed.), Idea Group, pp. 1-7.
[3] "Bugzilla, An open source web-based general-purpose bug tracker and testing tool originally developed and used by the Mozilla", [Online].
[4] Williams, J. Spacco, "SZZ revisited: verifying when changes induce fixes", In DEFECTS '08: Proceedings of the 2008 workshop on Defects in large software systems, pp. 32-36, New York, NY, USA, ACM.
[5] Claudio Carpineto, Stanislaw Osinski, Giovanni Romano, Dawid Weiss, "A Survey of Web Clustering Engines", ACM Computing Surveys, Vol. 41, No. 3, Article 17.
[6] Congnan Luo, Yanjun Li, Soon M. Chung, "Text document clustering based on neighbors", Elsevier Data & Knowledge Engineering 68 (2009).
[7] "Eclipse, A multi-language software development environment comprising an integrated development environment (IDE) and an extensible plug-in system", [Online].
[8] Florian Beil, Martin Ester, Xiaowei Xu, "Frequent Term-Based Text Clustering", SIGKDD '02, Edmonton, Alberta, Canada.
[9] H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma, J. Ma, "Learning to Cluster Web Search Results", In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval.
[10] IEEE Standard Classification for Software Anomalies, IEEE Std.
[11] IEEE Standard Classification for Software Anomalies, IEEE Std (revision of IEEE Std).
[12] J. Sliwerski, T. Zimmermann, A. Zeller, "When do changes induce fixes?", In MSR '05: Proceedings of the 2005 international workshop on Mining software repositories, pp. 1-5, New York, NY, USA, ACM.
[13] Jan Ploski, Matthias Rohr, Peter Schwenkenberg, Wilhelm Hasselbring, Software Engineering Group, TrustSoft, "Research Issues in Software Fault Categorization", ACM SIGSOFT Software Engineering Notes, Vol. 32, No. 6.
[14] "Java, Open source programming language", [Online].
[15] "JBoss Seam, a web application framework developed by JBoss", [Online] Available: JBSEAM.
[16] Jerzy Stefanowski, Dawid Weiss, "Comprehensible and Accurate Cluster Labels in Text Clustering", Conference RIAO2007, Pittsburgh PA, U.S.A., May 30-June 1, C.I.D. Paris, France.
[17] Jiawei Han, Micheline Kamber, "Data Mining: Concepts & Techniques", 2nd Edition, Morgan Kaufmann Publishers.
[18] "Jira, An issue tracking, bug tracking and project tracking tool for software development teams", [Online].
[19] Kagdi, H., Hammad, M., Maletic, J. I., "Who Can Help Me with this Source Code Change?", in Proc. of IEEE International Conference on Software Maintenance, Beijing, China, September 28-October.
[20] L. Aversano, L. Cerulo, C. Del Grosso, "Learning from bug-introducing changes to prevent fault prone code", In IWPSE '07: Ninth international workshop on Principles of software evolution, New York, NY, USA, ACM.
[21] "Mozilla, a global community dedicated to building free, open source products like Firefox web browser and Thunderbird software", [Online] Available: bugzilla.mozilla.org/.
[22] "MySql Bugs", [Online] Available: mysql.com.
[23] "MySql, A relational database management system (RDBMS) that runs as a server providing multi-user access to a number of databases", [Online] Available: mysql.com.
[24] Naresh Kumar Nagwani, Shrish Verma, "An Open Source Framework for Data Pre-processing of Online Software Bug Repositories", CiiT International Journal of Data Mining Knowledge Engineering, Vol. 1, No. 7, September.
[25] Naresh Kumar Nagwani, Shrish Verma, "Predictive Data Mining Model for Software Bug Estimation Using Average Weighted Similarity", IEEE 2nd International Advance Computing Conference (IEEE IACC 2010), to be held on 19-20th February, 2010 at Thapar University, Patiala.
[26] Naresh Kumar Nagwani, Pradeep Singh, "Bug Mining Model Based on Event-Component Similarity to Discover Similar and Duplicate GUI Bugs", IEEE International Advance Computing Conference, IACC-2009, Patiala, Punjab, India, 2009.
[27] Naresh Kumar Nagwani, Pradeep Singh, "Weight Similarity Measurement Model Based, Object Oriented Approach for Bug Databases Mining to Detect Similar and Duplicate bugs", International Conference on Advances in Computing, Communication and Control, ICAC-2009, ACM SIGART, Mumbai, Maharashtra, India, 2009, [Online] Available: m?id= &type=proceeding&coll=portal&dl=
[28] Nathaniel Ayewah, William Pugh, "Learning from Defect Removals", IEEE MSR 2009.
[29] Nicholas Jalbert, Westley Weimer, "Automated Duplicate Detection for Bug Tracking Systems", IEEE International Conference on Dependable Systems & Networks, Anchorage, Alaska, June.
[30] O. Zamir, O. Etzioni, "Grouper: A Dynamic Clustering Interface for Web Search Results", Computer Networks 31(11-16), 1999.
[31] O. Zamir, O. Etzioni, "Web Document Clustering: A Feasibility Demonstration", In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
[32] Sunghun Kim, E. James Whitehead Jr., Yi Zhang, "Classifying Software Changes: Clean or Buggy?", IEEE Transactions on Software Engineering, Vol. 34, No. 2, March/April 2008.
[33] "Trac, A project management and bug/issue tracking system", [Online].
[34] Xiaoyin Wang, Lu Zhang, Tao Xie, John Anvik, Jiasu Sun, "An Approach to Detecting Duplicate Bug Reports using Natural Language and Execution Information", ACM ICSE '08, May 10-18, 2008, Leipzig, Germany, 2008.

Naresh Kumar Nagwani was born on 15th February 1980 at Raipur, India. He completed his graduation in Computer Science & Engineering in 2001 from Guru Ghasidas University, Bilaspur. He completed his post graduation (M.Tech.) in Information Technology from ABV-Indian Institute of Information Technology, Gwalior. His areas of interest are DBMS, Data Mining, Text Mining and Information Retrieval. His employment experience includes SSCET Bhilai, Team Lead at Persistent Systems Limited, and NIT Raipur. Presently he is an assistant professor in the department of computer science & engineering, National Institute of Technology, Raipur.

Dr. Shrish Verma completed his graduation in Electronics & Telecommunication Engineering and his post graduation (M.Tech.) in Computer Engineering from the Indian Institute of Technology, Kharagpur. He completed his PhD in Engineering from Pt. Ravi Shankar Shukla University, Raipur. Presently he is head and associate professor of the department of information technology, National Institute of Technology, Raipur.
