Impact Analysis by Mining Software and Change Request Repositories


Impact Analysis by Mining Software and Change Request Repositories

Gerardo Canfora, Luigi Cerulo
RCOST Research Centre on Software Technology
Department of Engineering, University of Sannio
Viale Traiano, Benevento, Italy
{canfora,

Abstract

Impact analysis is the identification of the work products affected by a proposed change request, either a bug fix or a new feature request. In many open-source projects, such as KDE, Gnome, Mozilla, and OpenOffice, change requests and related data are stored in a bug tracking system such as Bugzilla [1]. These data, together with the data stored in a versioning system, such as CVS [2], are a valuable source of information on which useful analyses can be performed. In this paper we propose a method to derive the set of source files impacted by a proposed change request. The method exploits information retrieval algorithms to link the change request description with the set of historical source file revisions impacted by similar past change requests. The method is evaluated by applying it to four open-source projects.

1. Introduction

Nowadays, a huge amount of data regarding version history is freely available in open-source projects (e.g., SourceForge). They provide useful information on the evolution of a software project in terms of who changed what and when. New opportunities for data analysis have been proposed in recent years, not only in the field of software evolution, but also for change propagation [23], fault analysis [11], software complexity [7], software reuse [19], and social networks [17]. These issues have been faced mainly by analyzing the product instead of the process. For example, Graves et al. [11] found that the number of times code has been changed is a better indication of how many faults it will contain than its length; Ying et al. [23] report that historical change patterns can be used to recommend potentially relevant source code to a developer performing a modification task.
In this paper we propose an approach for deriving the set of source files impacted by a proposed change request by using information retrieval techniques. We use the historical data stored in a versioning system, namely CVS [2], and in a bug tracking system, Bugzilla [1]. In the open-source community Bugzilla is usually used to track bugs and to manage enhancement feature requests. Information stored within Bugzilla is semi-structured and largely composed of textual information: a short description, a long description, and a set of discussion comments about the resolution of the change request. Our hypothesis is that the textual data carried by the bug tracking system during the resolution of a change request is a useful descriptor of the impacted files to be considered in the impact analysis of future similar change requests. The approach is useful primarily for the developer, but also for the project manager. A developer can learn which files he/she should work on to resolve a given change request. We show with a case study that the set of files returned by our approach is right in no less than 30% of cases and reaches a maximum of 78%. A project manager can take advantage of this approach to obtain an estimate of which files will be impacted in the next release; this is useful to focus testing effort.
The paper provides two main contributions: we show that the historical information stored in a versioning system and a bug tracking system is useful not only for analyzing software evolution but also for impact analysis; and we verify with four case studies that the information stored in historical change requests is a useful descriptor of source files when used for impact analysis. The paper is organized as follows: the next section gives a review of related work regarding impact analysis and how software repositories have been used in different contexts; the third section introduces our approach and shows how software repositories such as CVS and Bugzilla are used to support impact analysis; the fourth section presents

the application and validation of the approach in four case studies; the fifth section discusses the results obtained; the final section concludes the paper with suggestions for further development.

2. Related work

Traditionally, impact analysis has been performed by statically analyzing the product [5]. Many approaches are based on traceability analysis and dependence analysis. Traceability analysis identifies affected software objects using explicit traceability relationships. Some methods use a traceability matrix to represent relationships, and the impacted objects are inferred by calculating the transitive closure of the matrix. Dependence analysis attempts to assess the effects of a change on semantic dependencies between program entities; one technique is to use static and/or dynamic slicing [14]. Expert judgement and code inspection are also used; however, expert predictions have been shown to be frequently incorrect [16], and source code inspection can be prohibitively expensive [20]. The availability of data on the software process, such as those deriving from software and change repositories, can provide new opportunities for impact analysis. In particular, approaches to predict the impact and propagation of changes can be found in [24, 23]. They use heuristics, such as historical co-change and co-authorship, to derive the set of impacted source files or program entities. A method to evaluate the performance of change propagation heuristics has been introduced in [12]. These methods predict the impacted files starting from a given initial source file; the approach we present in this paper starts from a change request description. Many software engineering artifacts are represented in natural language, and it is not unusual to find in the literature methods that take advantage of information retrieval algorithms and natural language processing approaches.
As an example, in [4] a probabilistic information retrieval model is used to map source code artifacts to documentation, and in [18] the same approach is extended, obtaining better performance, with the use of latent semantic indexing.

3. Background

In the open-source community, project management is usually performed with two systems: a source version control system, such as CVS, and a bug tracking system, such as Bugzilla. The first is used mainly for file sharing and versioning, while the second is used for change management, covering both bugs and new feature requests. In this section an overview of these two systems is given.

3.1. CVS

CVS can handle revisions of textual files by storing the differences between subsequent revisions in a repository. In open-source projects CVS is used to store every file regarding the software system under development: this usually comprises source code, project documentation, internationalization text, artwork, and help text. Every file in the repository evolves through a sequence of revisions; a software release represents a snapshot of the CVS repository. Revisions are stored together with the user id of whoever performed the commit, the date, and a textual block representing the user's comment on the change. The complete history of file revisions can be retrieved. In this paper we focus on the RCS log file, which identifies artifacts in the source repository (i.e., the file name and its path in the source tree) and the list of modification records describing the change history of each artifact from its initial check-in to the current release.
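As a rough illustration, a modification record of this kind might be parsed as follows; the sample record and the exact field layout are assumptions, since real `cvs log` output varies across CVS versions:

```python
import re

# One "cvs log" modification record (format assumed; real output varies).
RECORD = """\
revision 1.42
date: 2004/11/22 10:07:13;  author: jdoe;  state: Exp;  lines: +12 -3
Fix cursor width in textarea, bug #261054
"""

# Header line: date, author, state, and the added/deleted line counts.
HEADER = re.compile(
    r"date: (?P<date>[^;]+);\s+author: (?P<author>[^;]+);\s+"
    r"state: (?P<state>[^;]+);\s+lines: \+(?P<added>\d+) -(?P<deleted>\d+)"
)

def parse_record(text):
    lines = text.strip().splitlines()
    rev = lines[0].split()[1]                  # "revision 1.42" -> "1.42"
    fields = HEADER.match(lines[1]).groupdict()
    fields["revision"] = rev
    fields["comment"] = "\n".join(lines[2:])   # free-text check-in comment
    return fields

rec = parse_record(RECORD)
```

The free-text comment at the end of each record is the part the approach later indexes as a file descriptor.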
A modification record starts with the revision number and contains the following fields: date, the date and time of check-in; author, an identifier of the person who did the check-in; state, one of the following values: exp, meaning experimental, or dead, meaning that the file has been removed; lines, the number of lines added and deleted with respect to the previous version of the file; and a final block of free text that contains informal data entered by the author during the check-in process.

3.2. Bugzilla

Bugzilla is a database for bugs. Its main use is to report bugs and to assign them to the appropriate developers. However, Bugzilla also handles requests for enhancement. In the open-source community it is largely used for proposing changes, assigning tasks, discussing different enhancement solutions, and so on. Developers use Bugzilla to keep a to-do list as well as to prioritize, schedule, and track dependencies. The Bugzilla database is accessible via HTTP, and some of its reports can be exported in XML. A bug, or a request for enhancement, has a life cycle in Bugzilla. When a new bug is discovered it becomes by default UNCONFIRMED, because what happens to a bug when it is first reported depends on who reported it. A QA (Quality Assurance) person, usually one of the maintainers of the project, looks at it and confirms that it exists before it gets turned into a NEW bug. When a bug becomes NEW, a developer will probably look at it and either accept it or give it to someone else. When a bug is fixed it is marked RESOLVED and its resolution is set to FIXED. Other valid resolutions are: INVALID, when the problem described is not a bug; WONTFIX, when the problem described is a bug which will never be fixed; DUPLICATE, when the problem is a duplicate of an existing bug;

and WORKSFORME, when all attempts at reproducing the bug in the current build were futile. In this paper we refer to a Change Request (CR) as an item tracked by Bugzilla that can be either a bug or a request for enhancement. In the following, an excerpt in XML format of the information contained in a CR is shown.

<bug>
  <bugid>261054</bugid>
  <creationts> :07 PST</creationts>
  <shortdesc>cursor size too wide in a textarea</shortdesc>
  <deltats> :18 PST</deltats>
  <product>firefox</product>
  <component>general</component>
  <bugstatus>resolved</bugstatus>
  <resolution>fixed</resolution>
  <reporter>ffes@users.sourceforge.net</reporter>
  <assignedto>aaronleventhal@moonset.net</assignedto>
  <longdesc>
    <who>ffes@users.sourceforge.net</who>
    <bugwhen> :07 PST</bugwhen>
    <thetext>[] the cursor is much wider then in input type []
      This makes it hard to determine []
      Expected Results: The cursor size of the input should also be used in the []</thetext>
  </longdesc>
</bug>

A CR is enclosed in a bug tag, which contains: bugid, a unique identifier assigned by Bugzilla; creationts, the date and time of CR creation; shortdesc, a short description; deltats, the date and time of the last CR state change; product, the product name; component, the component of the system; reporter, who submitted the CR; assignedto, who was assigned the CR for resolution; and longdesc, a structure comprising a long description of the CR (thetext), who submitted it (who), and when (bugwhen). Bugzilla and CVS are often used in the open-source community in a disciplined and standard way [10]. Open-source development processes usually follow guidelines developed around CVS and Bugzilla. One of these is to keep track of the files impacted by a CR. The CVS and Bugzilla databases are not correlated, but a common practice is to include in a CVS commit comment the id number of the CR tracked by Bugzilla. This is a commonly used heuristic to generate a one-to-many relation between CRs and file revisions [8].
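The id-matching heuristic [8] can be sketched as a simple pattern match over check-in comments; the exact commenting conventions differ from project to project, so the pattern below is an assumption:

```python
import re

# Bugzilla ids referenced in a CVS commit comment. Projects write these
# ids in varying styles ("bug 1234", "#1234", ...), so this pattern is
# only an illustrative assumption.
BUG_ID = re.compile(r"(?:[Bb][Uu][Gg]|#)\s*#?\s*(\d{4,6})")

def linked_crs(commit_comment):
    """Return the set of Bugzilla CR ids mentioned in a check-in comment."""
    return {int(m) for m in BUG_ID.findall(commit_comment)}

comment = "Fix cursor size in textarea. Patch by aaron, fixes bug #261054."
```

Applied over the whole RCS log, this yields the one-to-many relation between a CR and the file revisions it impacted.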
4. The proposed approach

4.1. Impact analysis process

Impact analysis is the identification of the work products affected by a proposed change request, either a bug fix or a new feature request. Our approach to impact analysis is based on the hypothesis that the set of revision comments of a file, together with the set of CRs that previously impacted it, is a good descriptor of the file to support impact analysis of new CRs. We use textual similarity to retrieve past CRs similar to a new CR and to compute the impacted files. This is supported by the fact that CVS and Bugzilla are extensively used as tools for sharing knowledge during the software development process. The clarity of textual information, such as bug comments, bug descriptions, feature proposals, and so on, is a critical need in an environment in which no in-person meetings, phone calls, or coffee-break discussions are possible. As a matter of fact, the quality of textual information is high, especially for mature projects such as KDE, Mozilla, and OpenOffice, for which a history of about seven years is stored in their CVS and Bugzilla repositories. Textual similarity is a critical part of our approach. The Information Retrieval community has dealt with text similarity for a long time: given a set of text documents and a user information need represented as a set of words, or more generally as free text, the information retrieval problem is to retrieve all documents relevant to the user [22]. The process depicted in Figure 1 shows our approach. Each work file is the result of a set of revisions in the CVS repository, and each revision can be the impacted modification of a CR in the Bugzilla database. These relationships are depicted in the figure as one-to-many and many-to-one database relations between the Work File, Revision, and Fixed CR entities. A user submits a new CR to the Bugzilla database.
If it is a bug, it is usually composed of a short text description and a long text description in which the steps to reproduce the bug are explained. Otherwise, if it is an enhancement feature request, a description of what is desired is given in the long description. The new CR is then assigned to a developer in order to resolve it. The developer has to comprehend the CR and determine the files of the source tree that will be impacted by it. A developer who has dealt with similar CRs in the past has a wide knowledge of the files involved and can easily locate the files that should be changed. Otherwise, if he/she does not have such knowledge, he/she has to analyze the problem more deeply and usually asks for the help of other developers involved in the process. It is not unusual to find in the discussion comments of a CR topics regarding similar past behaviors resolved in other CRs. Our approach attempts to keep track of similar past CRs that have been correctly resolved and fixed in order to give the developer a first

ranked list of probable impacted files, which can be the starting point for the change.

Figure 1. Impacted files retrieval process

The ranked list is the output of an information retrieval algorithm that compares the query representation of the new CR with the XML representation of the work files indexed through an indexing process. The new CR query representation and the XML file descriptor representation are built by the descriptor builder process shown in Figure 2. A new CR item passes through a number of steps in which some standard query transformations are performed. In the first step, field extractor, the set of textual information used to build the query is extracted. We have considered two alternatives: one in which only the short description of the CR is considered, and another in which the long description is also considered. In some cases a wider query description gives more precise results. The second step, term tokenizer, subdivides the text into the terms that will be considered in the query. A token is a sequence of alphanumeric characters separated by non-alphanumeric characters; we have discarded tokens consisting only of digits. The third step, stemmer, reduces each term to its root: for example, verb conjugations are reduced to the infinitive, and plurals to singulars. This is a common step of an information retrieval process, in which terms are stemmed in order to collapse terms with the same meaning into a single term. In our case we have used the Porter stemmer algorithm for English [21]. The fourth step, stopwords, serves as a filter of common words that are not discriminant for the text.
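The four transformation steps above (field extraction, tokenizing, stemming, stop-word filtering) can be sketched as follows; the stop-word list and the crude suffix stripper are placeholders for the richer vocabulary and the Porter stemmer the approach actually uses:

```python
import re

# Placeholder stop words: a few common English words plus domain terms
# the approach discards (e.g. "bug", "feature", project names).
STOP_WORDS = {"the", "a", "of", "in", "is", "bug", "feature", "kde"}

def tokenize(text):
    # Tokens are alphanumeric runs; purely numeric tokens are discarded.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)
            if not t.isdigit()]

def stem(term):
    # Crude suffix stripping, standing in for the Porter stemmer [21].
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def cr_descriptor(short_desc):
    """Turn a CR short description into a query term list."""
    terms = [stem(t) for t in tokenize(short_desc)]
    return [t for t in terms if t not in STOP_WORDS]

query = cr_descriptor("Cursor size too wide in a textarea")
```

The same chain is applied to the texts that make up a work-file descriptor, so that queries and documents share one term space.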
We have used a common stop-word vocabulary used in the context of English text retrieval, enriched with a set of words picked from the software system domain. For example, we have discarded the words bug and feature, and words naming the system under consideration, such as kde, koffice, and so on. The final step, new CR descriptor builder, builds the query in the language used by the retrieval engine, which in our case is represented by the set of terms produced in the previous phases.

Figure 2. Descriptors builder process

The XML file descriptor representation passes through similar steps and is built from the fixed CR items in relation with the file and from the revision check-in free-text comments. Also in this case we have considered various alternatives for file representation: one in which only revision comments are included, and another in which fixed CR short and long descriptions are also considered. The ranked impacted file list is retrieved through an information retrieval algorithm that scores each work file on the basis of its relevance to the new CR query representation. In information retrieval, queries and documents are described by a set of index terms. Let T = {t_1, t_2, ..., t_n} denote the set of terms used in the collection of documents. Both a document d and a query q are represented as a vector (x_1, x_2, ..., x_n) with x_i = 1 if t_i belongs to the document/query and x_i = 0 otherwise. In our approach we have used a probabilistic information retrieval model in which the relevance of a document with respect to a query is computed by evaluating P(R | d, q), that is, the probability that a given document d is relevant to a given query q. Different probabilistic models have been proposed in the literature to evaluate this probability in order to score each document in the collection. We have used the model introduced in [13]. It assumes that each term is associated with a topic, and that a document may be about the topic or not. Statistical measures of term occurrences in documents are used to estimate the probability. In particular, a document d is scored with respect to a query q by using the following scoring function:

S(d, q) = Σ_{t ∈ q} W(t)

which sums the weight of each query term with respect to the document d on the basis of a weighting function W. An overview of weighting functions can be found in [6]. We have used the following [13]:

W(t) = (TF_t · (k1 + 1)) / (k1 · ((1 - b) + b · DL / AVDL) + TF_t) · log(N / ND_t)

where TF_t is the frequency of term t in the document, DL is the document length (i.e., number of terms), AVDL is the average document length in the collection, N is the number of documents in the collection, and ND_t is the number of documents in which the term t appears. The constant k1 determines how much the weight reacts to increasing TF_t, and b is a normalization factor.
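The scoring and weighting functions above can be implemented directly; the corpus statistics passed in below are hypothetical:

```python
import math

K1, B = 1.2, 0.75  # the constants k1 and b used in the weighting function

def weight(tf, dl, avdl, n_docs, nd):
    """W(t): term weight as defined above.

    tf: frequency of the term in the document (TF_t)
    dl: document length in terms (DL); avdl: average length (AVDL)
    n_docs: documents in the collection (N)
    nd: documents containing the term (ND_t)
    """
    return ((tf * (K1 + 1))
            / (K1 * ((1 - B) + B * dl / avdl) + tf)
            * math.log(n_docs / nd))

def score(doc_terms, query_terms, avdl, n_docs, doc_freq):
    """S(d, q): sum of W(t) over query terms occurring in the document."""
    dl = len(doc_terms)
    return sum(
        weight(doc_terms.count(t), dl, avdl, n_docs, doc_freq[t])
        for t in query_terms if t in doc_terms
    )
```

For an average-length document containing a query term once, the weight reduces to log(N / ND_t), which matches the intuition that rarer terms discriminate better.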
We use the values k1 = 1.2 and b = 0.75, which are recommended values for generic English text.

4.2. Tool support

We have developed a toolkit to support the process depicted in Figure 1. All steps of the process are completely automated, including the download from the CVS and Bugzilla repositories. In particular, we have developed a Java program that parses a CVS log file, and a script that submits a request to the Bugzilla repository and downloads the output of a set of fixed CRs. We use the WebL scripting language [15], together with its information extraction features, to parse the HTML Bugzilla output and generate an XML version of a CR. The Java program correlates file revisions with Bugzilla CRs and generates the set of work files, represented in XML, ready to be parsed by the indexing engine. The indexing process and the matching algorithm have been implemented in Java using an information retrieval framework named Terrier [3].

5. Case study

We have applied the impact analysis approach in four case studies with different characteristics (Tab. 1). The first three systems come from the KDE open-source project, a Unix window manager with many desktop applications. They include a small system, kcalc, which is a desktop calculator; a medium system, kpdf, which is a pdf file viewer; and a medium-large system, kspread, which is a spreadsheet contained in the koffice environment. The fourth, a medium-large system, firefox, is the Mozilla Internet desktop browser, which has a plugin architecture with many additional features. The size, in terms of work files, of each system is shown in the table, together with the number of CRs fixed in the last six months and the start date of the project.

Table 1. Open-source projects

project | # work files | # CRs fixed in the last six months | starts
kcalc   |              |                                    | Apr 1997
kpdf    |              |                                    | Aug 2002
kspread |              |                                    | Apr 1998
firefox |              |                                    | Sep 2002
The results have been assessed using two widely accepted information retrieval metrics, namely Precision and Recall [22]. Precision is the ratio between the number of relevant documents retrieved for a given query and the total number of documents retrieved for that query. Recall is the ratio between the number of relevant documents retrieved for a given query and the total number of relevant documents for that query. In our case study, recall indicates how many of the truly impacted files have been correctly predicted, and precision indicates how many of the predicted impacted files are right. We use the same methodology used for evaluating an information retrieval algorithm: the retrieved ranked list of documents is considered at different cut levels [22]. A cut level N is the list of the first N documents. For each cut level the behavior of precision and recall is analyzed and traced on a graph. It is worth noting that the performance of the approach

depends on the amount of textual information that describes the work files of the system. A mature system has much more information than a system that has not been released yet. We show this with two analyses. In the first, we consider a snapshot of the system at a given time and use a set of past CRs (fixed in the previous six months) to build an oracle, evaluating the performance with the leave-one-out assessment technique [9]. For a set of CRs we find the impacted work files by using a correlation heuristic based on the presence of the Bugzilla id number in the revision comments of the file [8]. We think this is a good oracle, as CRs managed in Bugzilla basically follow accepted guidelines, one of which is to indicate in the check-in comment the Bugzilla id of the relevant CR. This is generally true for CRs regarding bug fixing; in addition, for the systems we have considered, enhancement requests are usually tracked in the same way. To evaluate the precision and recall for each CR, we build the index without the textual information of the current CR: in particular, we eliminate from the index the revisions impacted by the CR and the description of the CR itself. This guarantees that the impacted files are retrieved only through different CRs whose descriptions are similar to the current CR. In the second analysis, we consider the evolution of the system over time. We sort a set of fixed CRs by their creation date and time, and for each CR we include in the index only those revisions and CRs preceding the creation date and time of the CR under consideration. We do this for the Kspread system in an interval from 1999 to 2003, and we show that precision and recall increase because more information is available to build the work file descriptors.

6. Results

In this section we discuss the results obtained in the two case studies.
Different aspects regarding the information included in the work file descriptor and the new CR descriptor have been considered (Tab. 2). The first, labeled A, considers a partial set of information: revision comments for work files, and the CR short description for the new CR descriptor; it excludes past CR descriptors in work files. The second, labeled B, considers a minimal set of CR descriptors to be included in work files: revision comments and CR short descriptions for work files, and again the CR short description for the new CR descriptor. The third, labeled C, also considers a partial set of information for work files: it excludes the revision comments in work files and considers both the short and the long descriptions of CRs. The last, labeled D, considers the full set of information for both work files and the new CR.

Table 2. Combination of different descriptors

Figure 3. Kcalc results

Figures 3, 4, 5, and 6 show the results of the first type of analysis for each of the four systems analyzed. Each figure shows a set of four precision-recall curves, one for each descriptor combination of Table 2, labeled from A to D. The curves have been traced by observing the top 30 files ranked by the information retrieval algorithm and averaging the precision and recall over the number of CRs considered for each system. In each graph there are 30 points, each one corresponding to a cut level from 1 to 30. The upper limit of 30 has been chosen because, in each system, the number of impacted files is no more than 30. The first set of points should be read as a measure of overall precision, while the last set of points as a measure of overall recall.
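Reading a curve point at cut level N amounts to computing precision and recall over the top N retrieved files; a minimal sketch, with a hypothetical ranked list and oracle set:

```python
def precision_recall_at(ranked_files, relevant, n):
    """Precision and recall over the top-n files of a ranked list."""
    top = ranked_files[:n]
    hits = sum(1 for f in top if f in relevant)
    return hits / len(top), hits / len(relevant)

# Hypothetical retrieval result and oracle set of truly impacted files.
ranked = ["kcalc.cpp", "kcalc_core.cpp", "dlabel.cpp", "main.cpp"]
oracle = {"kcalc.cpp", "main.cpp"}
```

Averaging these pairs over all evaluated CRs, for each n from 1 to 30, yields one precision-recall curve per descriptor combination.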
For example, for Kcalc, the first point means that over the 28 CRs considered for averaging, the first work file retrieved is correct 22 times (78%, label D). The last point in the same example means that over the 28 CRs the top 30 work files retrieved cover the set of correct work files 26 times entirely (i.e., in 26 cases recall is 100%), once at 97%, and once at 50%. The resulting average recall is (26 · 100% + 97% + 50%) / 28 ≈ 98%. A general consideration is that precision and recall grow when more information is used to index work files, and

more terms are used to represent a new CR.

Figure 4. Kpdf results

Figure 5. Kspread results

Figure 6. Firefox results

The curves show that CRs are better descriptors than revision comments in all systems except Kpdf and Firefox, in which the behavior is inverted. We suppose that the motivation lies in the fact that Kpdf has reused code from xpdf, a pdf viewer for X Windows, and Firefox has reused code from the Mozilla and Netscape browsers. Code reuse usually implies a bulk check-in of the reused system into the source tree. This operation carries in only the file revisions of the system being reused, not CRs, so the only source of descriptors is revision comments. Of course, more analyses are needed to support this position. At the first cut level the precision of Kpdf ranges from a maximum of 45% to a minimum of 36%, while at the 30th cut level recall ranges from a maximum of 85% to a minimum of 70%. An overall good performance can be seen for Kcalc: at the first cut level its precision ranges from a maximum of 78% to a minimum of 38%, while at the 30th cut level recall ranges from a maximum of 98% to a minimum of 82%. We suppose this is mainly due to the fact that Kcalc is small and its architecture is quite simple: 3 or 4 modules, each encapsulated in one or two files. If a system is well modularized, then changes impact a small set of work files. This induces a good topic discrimination of work file descriptors, in the sense that different topics belong to different modules. For example, if the topic of a new CR regards the user interface, the impacted files should be the set of files belonging to the user interface module.
Kspread and Firefox exhibit overall similar performance: precision is lower than 39% for Kspread and 36% for Firefox, while recall is lower than 79% for Kspread and 67% for Firefox. Figure 7 shows the results of the second type of analysis performed on Kspread. The figure shows a set of precision and recall curves averaged over a set of 16 fixed CRs. The Kspread project started in April 1998, and as new file revisions are performed over time, new information becomes available to build work file descriptors. We have considered five milestones at which the work file descriptors have been built considering only information available before that time. For example, the time period labeled 2002 considers revision comments and CRs respectively submitted and fixed before 01 January 2002. The graph shows that the information available on 01 January 2000 does not yield good descriptors, while starting from 01 January 2001 the performance for precision, at the first cut level, grows to 44%. For successive time periods (2002 to 2003) no further evident improvement can be seen. This may be because during that period change activities on the software did not produce discriminant descriptors; for example, debugging activities may have dealt with the same topics.
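The time-incremental indexing used in this second analysis, where only revisions and CRs that predate a chosen milestone contribute to the descriptors, can be sketched as follows (data layout and file names hypothetical):

```python
from datetime import datetime

def index_before(history, cutoff):
    """Build work-file descriptors from events strictly before `cutoff`.

    `history` maps file name -> list of (timestamp, descriptor_text)
    pairs taken from revision comments and fixed-CR descriptions; files
    with no history before the cutoff are left out of the index.
    """
    index = {}
    for name, events in history.items():
        texts = [text for when, text in events if when < cutoff]
        if texts:
            index[name] = " ".join(texts)
    return index

history = {
    "kspread_table.cc": [
        (datetime(2000, 3, 1), "fix cell formula recalculation"),
        (datetime(2002, 6, 9), "sorting of selected ranges"),
    ],
    "kspread_doc.cc": [
        (datetime(2002, 8, 2), "crash loading large documents"),
    ],
}
index = index_before(history, cutoff=datetime(2002, 1, 1))
```

Moving the cutoff forward milestone by milestone reproduces the growing-descriptor setting evaluated for Kspread.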

Figure 7. Kspread time incremental results

7. Conclusions

Software and change repositories give new opportunities for software assessment. In this paper an approach to predict impacted files from a change request description has been presented. The approach has given promising results that warrant further improvement and investigation. A direction of improvement that we will follow is to consider different levels of granularity of the impacted entities. In this paper we have considered the file level, but program entities and architectural entities, such as modules, can also be considered. We believe that descriptors at the module level and the program entity level can improve the performance of the retrieval algorithm, especially for large systems. Furthermore, different retrieval algorithms can be considered, and we plan to compare different approaches based on language models. Further investigation of the analysis regarding the evolution of the system over time would also be interesting: in particular, what happens to descriptors between relevant milestones, such as software releases, can be an important indicator to improve the performance of work file descriptors.

References

[1] Bugzilla bug tracking system.
[2] CVS concurrent versions system.
[3] Terrier Information Retrieval Platform. Lecture Notes in Computer Science. Springer.
[4] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering traceability links between code and documentation. IEEE Trans. Softw. Eng., 28(10), 2002.
[5] R. S. Arnold and S. A. Bohner. Impact analysis: towards a framework for comparison. In ICSM '93: Proceedings of the Conference on Software Maintenance. IEEE Computer Society, 1993.
[6] F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. "Is this document relevant? ...probably": a survey of probabilistic models in information retrieval. ACM Comput. Surv., 30(4), 1998.
[7] S. G. Eick, T. L. Graves, A. F. Karr, J. S. Marron, and A. Mockus. Does code decay?
Assessing the evidence from change management data. IEEE Trans. Softw. Eng., 27(1):1-12, 2001.
[8] M. Fischer, M. Pinzger, and H. Gall. Populating a release history database from version control and bug tracking systems. In International Conference on Software Maintenance (ICSM 2003), Amsterdam, The Netherlands, Sept. 2003.
[9] M. Stone. Cross-validatory choice and assessment of statistical predictions (with discussion). J. of the Royal Statistical Society, Series B, 36, 1974.
[10] K. Fogel and M. Bar. Open Source Development with CVS. Coriolis, 1999.
[11] T. L. Graves, A. F. Karr, J. S. Marron, and H. Siy. Predicting fault incidence using software change history. IEEE Trans. Softw. Eng., 26(7), 2000.
[12] A. Hassan and R. Holt. Predicting change propagation in software systems. In Proceedings of the 20th IEEE International Conference on Software Maintenance (ICSM 2004). IEEE Computer Society Press, Sept. 2004.
[13] K. S. Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: development and comparative experiments. Inf. Process. Manage., 36(6), 2000.
[14] M. Kamkar. An overview and comparative classification of program slicing techniques. J. Syst. Softw., 31(3), 1995.
[15] T. Kistler and H. Marais. WebL: a programming language for the Web. Computer Networks and ISDN Systems, 30(1-7), Apr. 1998.
[16] M. Lindvall and K. Sandahl. How well do experienced software developers predict software change? J. Syst. Softw., 43(1):19-27, 1998.
[17] L. López, J. González-Barahona, and G. Robles. Applying social network analysis to the information in CVS repositories. In International Workshop on Mining Software Repositories (MSR 2004), at the 26th International Conference on Software Engineering, 2004.
[18] A. Marcus and J. I. Maletic. Recovering documentation-to-source-code traceability links using latent semantic indexing. In International Conference on Software Engineering. IEEE Computer Society Press, May 2003.
[19] A. Michail. Data mining library reuse patterns using generalized association rules.
In ICSE 00: Proceedings of the 22nd international conference on Software engineering, pages ACM Press, [20] S. L. Pfleeger. Software Engineering: Theory and Practice. PrenticeHall, Upper Saddle River, NJ, [21] M. F. Porter. An algorithm for suffix stripping. pages , 1997.

[22] B. Ribeiro-Neto and R. Baeza-Yates. Modern Information Retrieval. Addison-Wesley.
[23] A. T. T. Ying, G. C. Murphy, R. Ng, and M. C. Chu-Carroll. Predicting source code changes by mining revision history. IEEE Trans. Softw. Eng., 30, Sept. 2004.
[24] T. Zimmermann, P. Weisgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. In ICSE '04: Proceedings of the 26th International Conference on Software Engineering. IEEE Computer Society, 2004.
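The approach concluded above links the textual description of a new change request to "descriptors" built from the past change requests that impacted each file, and ranks files by textual similarity. The following toy sketch illustrates that idea with a plain TF-IDF/cosine ranker in Python. It is a minimal illustration under assumptions, not the authors' implementation: the file names, descriptor contents, and query are invented, and the paper's actual retrieval models (run on the Terrier platform) differ.

```python
import math
from collections import Counter

# Toy "file descriptors": for each source file, the bag of words
# accumulated from past change requests that impacted it (invented data).
descriptors = {
    "kspread_cell.cc": "cell formula recalculation wrong result sum",
    "kspread_view.cc": "view scrolling repaint toolbar icon",
    "kspread_doc.cc":  "document load save corrupt file format",
}

def tfidf_vectors(docs):
    """Build a TF-IDF vector (dict word -> weight) for each named text."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()                      # document frequency of each word
    for words in tokenized.values():
        df.update(set(words))
    vectors = {}
    for name, words in tokenized.items():
        tf = Counter(words)
        vectors[name] = {w: (c / len(words)) * math.log(1 + n_docs / df[w])
                         for w, c in tf.items()}
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_files(change_request, descriptors):
    """Rank files by similarity between a new change request and each
    file descriptor built from past change requests."""
    docs = dict(descriptors)
    docs["__query__"] = change_request
    vecs = tfidf_vectors(docs)
    query = vecs.pop("__query__")
    return sorted(((cosine(query, v), name) for name, v in vecs.items()),
                  reverse=True)

ranking = rank_files("sum formula gives wrong result", descriptors)
for score, fname in ranking:
    print(f"{fname}: {score:.3f}")
```

The top-ranked files form the predicted impact set for the incoming request; evaluating how many truly impacted files appear near the top is what precision/recall figures such as Figure 7 measure.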


More information

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price.

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price. Code Duplication New Proposal Dolores Zage, Wayne Zage Ball State University June 1, 2017 July 31, 2018 Long Term Goals The goal of this project is to enhance the identification of code duplication which

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Parmenides. Semi-automatic. Ontology. construction and maintenance. Ontology. Document convertor/basic processing. Linguistic. Background knowledge

Parmenides. Semi-automatic. Ontology. construction and maintenance. Ontology. Document convertor/basic processing. Linguistic. Background knowledge Discover hidden information from your texts! Information overload is a well known issue in the knowledge industry. At the same time most of this information becomes available in natural language which

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Inverted Indexes Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Basic Concepts Inverted index: a word-oriented mechanism for indexing a text collection to speed up the

More information

QUERY RECOMMENDATION SYSTEM USING USERS QUERYING BEHAVIOR

QUERY RECOMMENDATION SYSTEM USING USERS QUERYING BEHAVIOR International Journal of Emerging Technology and Innovative Engineering QUERY RECOMMENDATION SYSTEM USING USERS QUERYING BEHAVIOR V.Megha Dept of Computer science and Engineering College Of Engineering

More information

CANDIDATE LINK GENERATION USING SEMANTIC PHEROMONE SWARM

CANDIDATE LINK GENERATION USING SEMANTIC PHEROMONE SWARM CANDIDATE LINK GENERATION USING SEMANTIC PHEROMONE SWARM Ms.Susan Geethu.D.K 1, Ms. R.Subha 2, Dr.S.Palaniswami 3 1, 2 Assistant Professor 1,2 Department of Computer Science and Engineering, Sri Krishna

More information

Lecture 2 Pre-processing Concurrent Versions System Archives

Lecture 2 Pre-processing Concurrent Versions System Archives Lecture 2 Pre-processing Concurrent Versions System Archives Debian Developer Coordinates Distributed Development Distributed Development I hope no one else is editing the same parts of the code as I am.

More information