A Case Study on the Similarity Between Source Code and Bug Reports Vocabularies


Diego Cavalcanti, Dalton Guerrero, Jorge Figueiredo

Software Practices Laboratory (SPLab), Federal University of Campina Grande, Campina Grande, PB, Brazil
{diegot,dalton,abrantes}@dsc.ufcg.edu.br

Abstract. Bug tracking systems, such as Bugzilla, are widely used in software development projects to register and manage bug reports. Based on the information provided in bug reports by software users, developers must identify and locate the defective code. This task, however, can be quite challenging and time-consuming. In our research, we investigate means to develop tools and techniques that can help developers bridge the gap between bug reports and defective code effectively and efficiently. Our current research centers on the hypothesis that we can exploit the similarities and dissimilarities between the vocabularies of bug reports and software source code. In this paper, we present a preliminary case study on this subject, developed over the data of the Eclipse IDE project. Using information retrieval techniques, we analyzed more than 4,000 bug reports, which impact more than 3,000 different classes of the software. Our results indicate that almost 90% of the analyzed bug reports impact up to three different classes and that more than half of the similarities between vocabularies are at most 25%. Therefore, we conclude that relating source code and bug reports is not a trivial task for developers without a systematic approach.

1. Introduction

Bug tracking systems, such as Bugzilla, GNATS and JIRA, are widely used in large software development projects. In this kind of system, users and developers register bug reports containing technical information about a problematic software system and free-form text that describes the problems encountered. They can also suggest improvements and comment on existing bug reports [Anvik et al. 2006].

Usually, bug reports are the only source of information about software problems. Developers use them to find bugs in the code and fix them. To do so, programmers have to understand the code and find the specific entities to change. For large projects with several modules, each containing thousands of lines of code, this task has a significant cost [Eisenbarth et al. 2003]. The situation is even worse for newcomers, since they do not know the code yet and must learn about it on their own.

Locating a bug is simple when the bug report contains low-level information about it, such as patches or stack traces [Bettenburg et al. 2008]. However, only a few bug reports contain this kind of information [Schröter et al. 2010]. In such situations, developers need additional effort, since they have to rely only on free-form text.

This research has been partially supported by MCT/CNPq-14/2009 project, through grant number /

In summary, the main problem addressed in this paper is that developers spend great effort analyzing software entities (e.g., classes) to find which ones will be impacted by the maintenance requested in a bug report.

Since we want to focus on bug reports of widely used systems, we use Bugzilla-based repositories, which are used by numerous large projects, such as Eclipse, Mozilla, Gnome, RedHat and Apache. Nevertheless, we believe that any approach used with Bugzilla can be generalized to other bug reports with minor changes, since other bug tracking systems store information similar to Bugzilla's.

In our research, we investigate means to develop tools and techniques that can help developers bridge the gap between bug reports and defective code effectively and efficiently. In this paper, we present a preliminary case study on the existing relation between source code and bug reports, especially regarding their vocabularies. Our study is developed over the Eclipse bug data set, which contains more than 4,000 bug reports. Our results indicate that most bug reports impact a small number of different classes and that the similarity between their vocabularies is lower than 50%. The analysis of bug report and source code vocabularies helped us to improve the way these vocabularies are processed and to propose a technique that recommends the Java classes most likely to be impacted by a bug report, using its vocabulary as a descriptor.

The remainder of this paper is organized as follows. Section 2 explains key concepts of Bugzilla-based repositories and information retrieval. Section 3 presents the study performed and the results obtained, followed by Section 4, which describes the proposed technique.
Finally, Section 5 presents some related work and Section 6 concludes the paper.

2. Background

2.1. Bugzilla-based repositories

Bug reports from Bugzilla have a similar structure. Figure 1 presents a sample bug report from Eclipse's Bugzilla repository (source: bug.cgi?id=62741). Each bug report is identified by a unique id and is composed of four sections: pre-defined fields, free-form text, attachments and dependencies [Anvik et al. 2006].

Pre-defined fields store information about the faulty software (e.g., product, component, version and platform used) and about the bug report itself (e.g., status, priority, target milestone, developer assigned to solve it, keywords and timestamp). Most of the data is supplied by the reporter when the report is filed; the rest is automatically generated or supplied by the project manager.

The free-form text includes the title of the report, a description and comments. The title is commonly a one-line summary of the bug. The description should contain a detailed account of the bug, steps to reproduce it and any other information that can help developers identify and solve it. The additional comments record discussions about possible approaches to solving the bug and pointers to other bug reports that may contain more information about the problem or that appear to be duplicates.

Figure 1. A sample Bugzilla bug report from Eclipse

Finally, attachments allow non-textual information (e.g., a screenshot of the error) to be added to the bug report, and dependencies track which bugs are prerequisites for the resolution of other bugs.

2.2. Key Concepts of Information Retrieval

Manning et al. [Manning et al. 2008] define information retrieval (IR) as the activity which aims to find material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). In our study, we use IR to extract, process and analyze the vocabulary of source code entities, which is composed of identifiers, in order to satisfy an information need expressed by a bug report description. Before working with IR, we need to understand some key concepts:

Tokens are the most basic units of a document. Usually, tokens are the words of a text split on whitespace, excluding certain characters, such as punctuation.

Stop words removal is the process of excluding from the vocabulary the tokens that represent extremely common words, which appear in almost every text (e.g., the, is, a, but). Removing them is important because they add little value when selecting a document that matches a query.

Token normalization is the process of normalizing a token so that matches can occur despite superficial differences in the character sequences (e.g., plug-in and plugin; or Class and class).

Stemming is the process of reducing a token to its root using some heuristic (in our case, Porter's algorithm [Porter 1997]). Stemming is needed because, when processing a text, one generally wants to treat different forms of a word as the same (e.g., program, programs and programming). Moreover, some words are derivationally related and have similar meanings, such as am, are, is and be, and we need to treat them as a single term.

Term is a token that has been normalized and stemmed.

Document is a sequence of terms.

3. Study and Results

We performed a preliminary case study on the similarity between source code and bug report vocabularies in order to find out whether bug reports are somehow related to the vocabulary of the classes they impact. In our work, IR documents are source code entities, and their vocabularies are composed of the names of packages, classes, fields and methods.

3.1. Study Setup

To perform the study, we need fixed bug reports, each one mapped to the entities impacted by its solution; that is, for each bug report, we need to know which classes were changed to fix it. This way, we can analyze the relation between the vocabulary of a specific bug report and the vocabularies of its impacted classes.

Downloading bug reports is a trivial task, because the Bugzilla system provides a default URL pattern to export each report in XML format. However, mapping each bug report to the source code entities it has impacted is not as easy, since Bugzilla usually does not store this kind of information. Some studies in the literature have performed this kind of mapping by mining software repositories [Śliwerski et al. 2005, Fischer et al. 2003, Zimmermann et al. 2007]. They search for references to bug reports in commit messages (e.g., "Fixed" and "bug #53784"). We can use their results as a reference to guide us in this task.

In our study, we used an Eclipse bug data set provided by Schröter et al. [Schröter et al. 2006]. They provide data for three versions of Eclipse, namely versions 2.0, 2.1 and 3.0. We use the latter, which has 3,333 impacted classes mapped to 4,136 bug reports from dec/2003 to dec/

3.2. Similarity between vocabularies of bug reports and source code in the Eclipse project

Using the Eclipse 3.0 bug data set, which contains more than 4K bug reports, we plotted the number of classes per bug report (presented in Figure 2). The chart shows that most bug reports impact few classes, while a few bug reports impact many classes (e.g., 113 classes). In fact, 99% of bug reports impact fewer than 20 classes.

Figure 3 presents the density of classes per bug report. Almost 60% of bug reports impact only one class, and almost 90% impact one, two or three classes. Therefore, in general, a developer has to identify, for each bug report, fewer than four classes to change among the thousands of classes contained in the project.

We analyzed the relationship between the vocabularies of bug reports and source code in the simplest way: we extracted both vocabularies; processed them with information retrieval approaches, namely tokenization, stop words removal and stemming; and calculated the cosine similarity between their term vectors. We did not apply any other treatment (e.g., filtering of terms) because we wanted to check whether the most basic approach was enough to achieve good results (i.e., similarity above 75%). However, we discovered that the similarity between vocabularies without any specific treatment is very low.
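The basic pipeline described in Section 3.2 (tokenization, stop words removal, stemming, then cosine similarity over term vectors) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in the study: the stop-word list is a tiny hypothetical sample, `crude_stem` is a naive suffix stripper standing in for Porter's algorithm, and the names `to_terms` and `cosine_similarity` are ours.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "an", "but", "and", "of", "to", "in", "when"}


def crude_stem(token):
    """Naive suffix stripping; a placeholder for Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def to_terms(text):
    """Tokenize, normalize to lowercase, drop stop words, then stem."""
    tokens = re.findall(r"[A-Za-z]+", text)
    normalized = [t.lower() for t in tokens]
    return [crude_stem(t) for t in normalized if t not in STOP_WORDS]


def cosine_similarity(terms_a, terms_b):
    """Cosine of the angle between two raw term-frequency vectors."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical bug report text and class vocabulary (package, class,
# field and method names joined into one string).
report_terms = to_terms("The editor crashes when opening files")
class_terms = to_terms("org eclipse ui editor open close file")
print(f"similarity: {cosine_similarity(report_terms, class_terms):.2f}")
```

A real stemmer would conflate many more word forms than this placeholder does, so the similarity values it produces are only indicative.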

Figure 2. Number of classes per bug report of Eclipse 3.0 bug data set

Figure 3. Density of classes per bug report of Eclipse 3.0 bug data set

Figure 4 presents the similarities obtained from the analyzed vocabularies. We grouped the similarities into four categories: from 0% to 25%, from 25% to 50%, from 50% to 75% and from 75% to 100%. As we can see, more than half of the similarities fall between 0% and 25%, and more than 90% of them are lower than 50%. Therefore, we can conclude that processing vocabularies in the simplest way, without any specific treatment for bug reports and source code, is not enough to predict classes from bug reports. To use bug reports as descriptors of source code, one needs to improve the way both vocabularies are processed.

Figure 4. Summary of similarity of vocabularies from Eclipse 3.0 bug data set

This study helped us to understand the existing relation between both vocabularies and led us to adapt and improve information retrieval techniques, aiming to increase the accuracy of impacted class prediction.

4. Our Proposed Approach

After analyzing the case study results, we propose a technique to relate bug report and source code vocabularies, aiming to achieve reasonable accuracy in locating the classes impacted by a bug report. Our hypothesis is that we can exploit the similarities and dissimilarities between the vocabularies in order to predict such classes from reports. We rely on the assumption that a bug report description (as free-form text) carries information about the problem domain related to the described bug.

The term software vocabulary comprises all identifier names present in the code [Deissenboeck and Pizka 2006] (e.g., names of packages, classes, fields and methods). They are the primary source of conceptual information for comprehending a software system [Rajlich and Wilde 2002]. These identifiers are chosen by stakeholders and generally represent the problem domain [Haiduc and Marcus 2008].

Figure 5 presents an overview of our technique. It is broken into five steps, as follows:

Step 1: we extract the vocabulary of the source code and apply information retrieval algorithms (e.g., tokenization, stop words removal and stemming) to process the terms.

Step 2: we index the terms from the classes, using the search engine library Apache Lucene. The index stores statistics about terms in order to make term-based search more efficient, so indexing the terms lets us search efficiently over the entities' vocabulary.

Step 3: we extract the vocabulary of the bug report and process it in the same way as before. However, we do not index the bug report's vocabulary.

Step 4: we use the bug report's vocabulary as a query to search over the vocabulary of the source code entities.

Step 5: we score the classes of the source code and return a ranking of them to the user. For that purpose, we use a combination of the Vector Space Model (VSM) [Salton et al. 1975] of information retrieval and the Boolean model [Lashkari et al. 2009] to determine how relevant each class is to the bug report vocabulary (the query).

5. Related Work

Nowadays, bug reports are used for various purposes. Research has been carried out on mining bug repositories to help with software maintenance. Various studies have used bug reports to track features over time [Fischer et al. 2003], understand how people describe problems [Ko et al. 2006], extract structural information from bug reports [Bettenburg et al. 2008], automatically assign them to developers [Matter et al. 2009, Anvik et al. 2006], assign artifacts to bug reports [Čubranić and Murphy 2003] and improve their quality [Bettenburg et al. 2007].
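As an aside on the technique of Section 4, Steps 2, 4 and 5 can be sketched roughly as follows. This is a simplified illustration, not the proposed implementation: a small in-memory inverted index stands in for Apache Lucene, the Boolean model is assumed to act as an OR filter (a class is a candidate if it shares at least one term with the query), and the VSM score is an unnormalized TF-IDF dot product. The class names, terms and the functions `build_index` and `rank_classes` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(class_terms):
    """Step 2: map each term to the classes containing it, with counts."""
    index = defaultdict(dict)
    for class_name, terms in class_terms.items():
        for term, count in Counter(terms).items():
            index[term][class_name] = count
    return index


def rank_classes(index, n_classes, query_terms):
    """Steps 4 and 5: Boolean OR match, then TF-IDF scoring and ranking."""
    scores = defaultdict(float)
    for term, query_count in Counter(query_terms).items():
        postings = index.get(term, {})
        if not postings:
            continue  # term occurs in no class, so it cannot match
        idf = math.log(1 + n_classes / len(postings))
        for class_name, tf in postings.items():
            # Dot product of TF-IDF weights (no length normalization).
            scores[class_name] += (query_count * idf) * (tf * idf)
    # Highest score first; ties broken by class name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical, already processed vocabularies (Steps 1 and 3).
classes = {
    "TextEditor": ["text", "editor", "save", "open"],
    "FileDialog": ["file", "dialog", "open"],
    "Compiler": ["compil", "pars"],
}
bug_report_terms = ["editor", "open", "crash"]

index = build_index(classes)
ranking = rank_classes(index, len(classes), bug_report_terms)
for class_name, score in ranking:
    print(f"{class_name}: {score:.3f}")
```

Classes that share no term with the bug report never enter the ranking, which is the role assumed here for the Boolean model; the rarer a shared term is across classes, the more it contributes to the score.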

Figure 5. Our Technique

However, to the best of our knowledge, only two studies [Canfora and Cerulo 2005, Canfora and Cerulo 2006] have proposed an approach similar to ours, which maps entities to bug reports. Both are very similar, differing only in level of granularity (the first retrieves classes and the second, lines of code). The precision they obtained ranges from 30% to 78%. We believe that such a wide range makes it hard to assess the effectiveness of their technique. Our study aims to better understand the behaviour of software and bug report vocabularies and to propose a technique that outperforms their results. Moreover, although their main objective is the same as ours, the retrieval process differs, because they do not use the software vocabulary.

6. Conclusion

The results of our study showed that most bug reports (almost 90%) impact up to three classes. Moreover, without any specific treatment of the source code and bug report vocabularies, we discovered that they generally have less than 50% similarity. Therefore, there is a relation between bug reports and source code, albeit with low similarity. This led us to propose a technique that adapts and improves information retrieval approaches, aiming to increase the accuracy of impact prediction from a bug report.

As future work, we intend to implement the proposed technique, and then experiment with it and improve it. Moreover, we aim to provide the developer community with a tool that automatically applies our approach, predicting which classes should be changed to fix a given bug report.

References

Anvik, J., Hiew, L., and Murphy, G. (2006). Who should fix this bug? In International Conference on Software Engineering.
Bettenburg, N., Just, S., Schröter, A., Weiß, C., Premraj, R., and Zimmermann, T. (2007). Quality of bug reports in Eclipse. In OOPSLA Workshop on Eclipse Technology eXchange.
Bettenburg, N., Premraj, R., Zimmermann, T., and Kim, S. (2008). Extracting structural information from bug reports. In IEEE Working Conference on Mining Software Repositories.

Canfora, G. and Cerulo, L. (2005). Impact analysis by mining software and change request repositories. In IEEE International Symposium on Software Metrics.
Canfora, G. and Cerulo, L. (2006). Fine grained indexing of software repositories to support impact analysis. In IEEE Working Conference on Mining Software Repositories.
Čubranić, D. and Murphy, G. (2003). Hipikat: Recommending pertinent software development artifacts. In International Conference on Software Engineering.
Deissenboeck, F. and Pizka, M. (2006). Concise and consistent naming. Software Quality Journal, 14(3).
Eisenbarth, T., Koschke, R., and Simon, D. (2003). Locating features in source code. IEEE Transactions on Software Engineering.
Fischer, M., Pinzger, M., and Gall, H. (2003). Analyzing and relating bug report data for feature tracking. Published by the IEEE Computer Society.
Haiduc, S. and Marcus, A. (2008). On the use of domain terms in source code. In IEEE International Conference on Program Comprehension.
Ko, A. J., Myers, B. A., and Chau, D. H. (2006). A linguistic analysis of how people describe software problems. In IEEE Symposium on Visual Languages and Human-Centric Computing.
Lashkari, A., Mahdavi, F., and Ghomi, V. (2009). A Boolean model in information retrieval for search engines. In International Conference on Information Management and Engineering.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Matter, D., Kuhn, A., and Nierstrasz, O. (2009). Assigning bug reports using a vocabulary-based expertise model of developers. In IEEE Working Conference on Mining Software Repositories.
Porter, M. F. (1997). An algorithm for suffix stripping. Morgan Kaufmann Publishers Inc.
Rajlich, V. and Wilde, N. (2002). The role of concepts in program comprehension. In Proceedings of the 10th International Workshop on Program Comprehension. IEEE.
Salton, G., Wong, A., and Yang, C. (1975). A vector space model for information retrieval. Journal of the American Society for Information Science, 18(11).
Schröter, A., Bettenburg, N., and Premraj, R. (2010). Do stack traces help developers fix bugs? In IEEE Working Conference on Mining Software Repositories.
Schröter, A., Zimmermann, T., Premraj, R., and Zeller, A. (2006). If your bug database could talk. In International Symposium on Empirical Software Engineering.
Śliwerski, J., Zimmermann, T., and Zeller, A. (2005). When do changes induce fixes? In IEEE Working Conference on Mining Software Repositories.
Zimmermann, T., Premraj, R., and Zeller, A. (2007). Predicting defects for Eclipse. In International Workshop on Predictor Models in Software Engineering. IEEE Computer Society.
