Comments Adaptation. Diploma Thesis May 14, Adaptation of Source Comments and API Documentation When Source Code Changes. Edoardo P. J.


Diploma Thesis, May 14, 2007

Comments Adaptation: Adaptation of Source Comments and API Documentation When Source Code Changes

Edoardo P. J. Beutler of Winterthur, Switzerland ( )

supervised by Prof. Dr. Harald C. Gall and Beat Fluri

Department of Informatics, software evolution & architecture lab


Diploma Thesis
Author: Edoardo P. J. Beutler
Project period:
Software Evolution & Architecture Lab, Department of Informatics, University of Zurich

Acknowledgements

First I would like to thank the whole S.E.A.L. team at the Department of Informatics of the University of Zurich for their advice, the comfortable working atmosphere, and the many enjoyable lunch breaks. In particular I would like to thank Professor Harald Gall for giving me the opportunity to write this thesis, Beat Fluri as my supervising assistant, and Michael Würsch and Emanuel Giger for their continuous introductions to, and support with, the existing systems. Next I would like to thank Beat and my mother Adriana Beutler Romanò for proofreading my thesis. Last but not least I would like to thank my parents Adriana and Werner Beutler for their support during my years of study.

Abstract

Source comments and API documentation are an important part of the source code of software systems. Proper documentation increases the readability as well as the maintainability of source code. In this diploma thesis we investigate whether and when source comments change. We want to know whether comment changes are based on a change of the source code or not. If the comment changes are based on source code changes, we want to know whether the changes to the comments are made together with the source code changes or in a later revision. To investigate this, we take advantage of CVS versioning data. Based on CHANGEDISTILLER, an existing Eclipse plugin, we implemented a plugin to find matching source code and comment changes. We used the plugin to evaluate two mid-size Java projects.

Summary (Zusammenfassung)

Comments and interface documentation are an important part of the source code of software systems. Good documentation increases both the readability and the maintainability of the source code. In this diploma thesis we investigate whether and when source comments change. We are interested in whether a comment change is caused by a change of the source code and, if so, whether the comment change happened together with the source code change or only later. To investigate this, we use CVS versioning data. We implemented an Eclipse plugin that builds on CHANGEDISTILLER, an existing plugin, and examines both comment and source code changes for commonalities. We then used the plugin to examine two mid-size projects with respect to commonalities between their source code and comment changes.

Contents

1 Introduction
    Motivation
    Envisioned outcome
    Structure
2 Related Work
    Comment change detection
    Comment to source code matching
    Vector space based Information Retrieval model
    Recovering traceability links using Latent Semantic Indexing
    Applicability for our approach
3 Comments Adaptation
    Outline of process
    Process overview
    Terminology
    Numeric analysis of source code to source comment change couplings
    Computing the number of changes
    Retrieving the results
    Rating the computed rankings
    Comment extraction and block building for line comments
    Mapping source code nodes to source comments
    Types of source comment to source code matchings
    Preparing the data
    The source comment to source code mapping algorithm
    Shortcomings of the algorithm
    Tracking comment nodes over multiple revisions
    Comment similarity measure
    Possible changes and their recognition
    Interpretation of the changes
    Summary
4 Implementation
    Counting the number of comment changes in the simple analysis
    GNU diff
    Building blocks of line comments
    Adaptation of candidates as corresponding nodes
    Saving the results of the analysis
5 Evaluation
    Validation and verification
    Evaluation process
    Evaluation of the Azureus BitTorrent Client
    The data evaluated
    Numeric analysis
    Detailed analysis
    Evaluation of the Eclipse Java Development Tools Core Component
    The data evaluated
    Numeric analysis
    Detailed analysis
6 Conclusions
    Summary of contribution
    Lessons Learned
    Future Work
A Too specific nodes
B Contents of the CD-ROM

List of Figures

    2.1 Process to recover the traceability
    Outline of the procedure for our simple numeric analysis
    The outline of the steps performed during an in depth analysis. The colors are used to highlight the different chapters
    Overview over the rating of the ranking pairs
    Possible approaches to analyze the comment tracks
    Change recognition example. Evolution of a comment and its corresponding source code over several revisions

List of Tables

    2.1 Example for a Term-Document-Matrix
    The table shows the number of source code and comment changes. The rating in the last column is 1 if there is a coupling between the number of source code and comment changes, 0 if there is no coupling, and -1 if the numbers cannot be rated
    Results when analyzing the ArgoUML class Project. There are 172 revisions of the file (only 171 are analyzed because revision 1.1, as the first revision, has no preceding revision to compare with)
    Examples for heavily changed classes
    Example for a repeating pattern
    Results from the analysis of Azureus. Unmodified basic version. All revisions are included
    Results from the analysis of Azureus. Modified version: the not rateable revisions have been eliminated to get a meaningful result
    Number of changes found in comments and their related source code nodes
    Numbers of common changes and when they happened
    Matching of change types for common changes
    Results from the analysis of the JDT Core Components. Unmodified basic version. All revisions are included
    Results from the analysis of the JDT Core Components. Modified version: the not rateable revisions have been eliminated to get a meaningful result
    Number of changes found in comments and their related source code nodes
    Numbers of common changes and when they happened
    Matching of change types for common changes

List of Listings

    1.1 Different comment types
    Preceding, current and succeeding nodes
    One comment for one source code statement. Project.java Revision
    Three comment blocks describing one method. Project.java Revision
    One line comment describing several source code nodes. Project.java Revision
    3.6 Comment succeeding the source code, but still on the same line. Project.java Revision
    A line comment, linking between the preceding and the succeeding source code. Project.java Revision
    The method wins against the conditional
    Javadoc example with different tags
    Identical comment: source code movement in relation to the comment. Version
    Identical comment: source code movement in relation to the comment. Version
    Identical comment: source code movement in relation to the comment. Alternate version
    Six line comments belonging to one block. Project.java Revision
    Javadoc with a succeeding modifier, being part of a field declaration

Chapter 1
Introduction

1.1 Motivation

By now it is a commonplace that code is read much more often than it is written. Probably everybody used to programming is familiar with the challenges of keeping the documentation up to date. Before writing any code the programmer does not feel like writing comments; after all, there is no code yet and he does not know what exactly the code will look like. While programming, when the mind is focused on writing code and on the many details to take care of, the programmer is reluctant to write comments. They can still be written when the code works and no longer changes that often. Once the code is written, the programmer knows what he has written and there is so much more to do, so why comment now? Some (probably not so long) time later the programmer no longer knows in detail why his solution to a problem looks as it does, so serious commenting would need a lot of time, since the code has to be thought through again. Because this makes commenting much more painstaking, the programmer does not want to comment his old code.

So the task of commenting the source code is often neglected; nevertheless, everybody who has been programming for longer than a week or two knows the value of good comments. Comments allow a fast overview and understanding of the code. Where there are multiple solutions, they can inform the programmer or his successors, maybe years after the code has been written, why a certain solution was preferred over another. Comments are crucial to sustaining good maintainability of source code. Maintenance can be nearly impossible if comments are outdated (e.g., not adapted after source code changes) or do not exist at all.

Knowing the situation, it seems interesting to know how this ambivalent relation (love to have, but not as much to write) between programmers and comments influences actual projects.
Are there often bugs due to outdated or missing comments and documentation (e.g., wrong use of a changed interface)? How much time is consumed by trying to reengineer source code which, with a few comments, would be easy to understand and maintain? Are there similarities between different projects?

Thanks to today's versioning and bug tracking techniques used in most software development projects, there is plenty of historical data available. Nowadays there are lots of projects using advanced systems for versioning (e.g., Concurrent Versioning System (CVS)^1, Subversion^2, or Rational ClearCase^3) or bug tracking (e.g., Bugzilla^4 or Mantis^5). Especially in the case of versioning, most of this data is rarely used because it is only seen as insurance for the programmers. Yet there is much more potential in such databases. It would be interesting to use this memory of existing software systems. With it, it could be possible to make predictions about bug-prone parts of a system, or to give advice to developers about how to write, handle, and maintain comments as simply as possible.

The first steps in using such historical data to make statements on the usage of source code comments and documentation in software development projects are to extract the source comments and documentation from the source code, to track them over time, and to find out whether and when changes happen. These first steps are the goal of our work.

    /**
     * Javadoc with tags, documenting a method
     * @param nr the number to extract the root from
     * @return returns the result as a double precision number.
     */
    public double squareroot(double nr){
        //Variable for the result - described in a line comment
        double result;
        /* A block comment:
         * delegation of the actual computing */
        result = Math.sqrt(nr);
        return result;
    }

Listing 1.1: Different comment types.

1.2 Envisioned outcome

In this diploma thesis we aim at an extension of the CHANGEDISTILLER plugin [FG06] for Eclipse^6. We want to be able to track the changes of source comments, i.e., to know whether and when these changes are made: are source code changes often made together with comment changes? Are they made in the following revision, or maybe never at all? Source comments here include Javadoc API^7 documentation as well as single line and block comments (e.g., see Listing 1.1). We aim at extracting the source comments and mapping them to their corresponding source code nodes retrieved by CHANGEDISTILLER.
Finally we aim at evaluating our work using two Java projects to conduct a case study. The projects we are going to analyze are the Eclipse Java Development Tools Core Component (JDT Core)^8 with around 1.4 thousand classes and 37 thousand revisions and the Azureus BitTorrent Client^9 with about 2.8 thousand classes and 26 thousand revisions.

1.3 Structure

The remainder of the thesis is structured as follows: In Chapter 2 we present other work related to source code to documentation traceability. Chapter 3 introduces our approach to find changes in source comments. Further, we show how we bind the source comments to their related source code and track them over the lifetime of a file. Finally we discuss the analysis and the interpretation of the results. In Chapter 4 we present details on several specific points of the implementation. Chapter 5 shows the validation and verification we have done. Then the evaluation process is introduced and finally the results from the evaluation are presented. In Chapter 6 conclusions are drawn and we show some ideas for future work.

Footnotes:
1 Concurrent Versioning System (CVS), last visited on May 4
2 Subversion, last visited on May 4
3 Rational ClearCase, last visited on May 4
4 Bugzilla, last visited on May 4
5 Mantis, last visited on May 4
6 Eclipse, last visited on May 2
7 API: Application Programming Interface
8 Java Development Tools Core Component (JDT Core), last visited on May 2
9 Azureus BitTorrent Client, http://azureus.sourceforge.net/, last visited on May 2, 2007

Chapter 2
Related Work

In this chapter we look at related work in the field of comment change detection and comment to source code matching. We discuss the differences from, and the applicability to, our approach.

2.1 Comment change detection

We have found no work on change detection whose main focus is the detection of source comment changes. There is a lot of work on change detection, but only on the detection of source code changes. Changes of source comments are not treated; normally they are even explicitly excluded.

2.2 Comment to source code matching

In the field of source code to comment matching there is hardly any work available either. We found no work on the matching of source comments to source code. As in the field of source code change detection, comments are in general explicitly eliminated. However, there are papers concerning the matching of source code to free text documentation that is external to the code. The major part of these papers introduces techniques for linking source code to documentation during the development of a system. These are mostly unsuited, or not applicable at all, for recovering links in existing systems. They are usually based on certain grammars that the programmers have to use during development to establish and trace the couplings. In the following subsections two different approaches are shown that are able to retrieve source code to documentation links for existing systems.

2.2.1 Vector space based Information Retrieval model

In their papers "Recovering traceability links between code and documentation" [ACC+02] and "Information Retrieval Models for Recovering Traceability Links between Code and Documentation" [ACCL00], G. Antoniol et al. introduce a method to recover traceability links based on vector space Information Retrieval (IR). A precondition of this method is that programmers use meaningful names in their source code.
The whole algorithm works as follows (see Figure 2.1 for a graphical overview):

1. In a first step the source code and the documentation are prepared:
   - Code extern documentation (upper path in the figure): The whole documentation is normalized, i.e., all letters are converted to lower case, all stop words (e.g., articles, punctuation, numbers) are removed, and all plurals are converted into singular as well as all inflected verbs into the infinitive form.
   - Source code (lower path in the figure): For each source code class a query is built. Identifiers consisting of different words are decomposed (e.g., printStackTrace to print, Stack, and Trace). Finally the words are normalized (same steps as for the code extern documentation).
2. The indexer builds an index, using a vocabulary built from the documentation (respectively the source code) itself.
3. Finally the document classifier computes the similarity between documentation and queries. As a result, a scored list of documents is returned for each class.

[Figure 2.1: Process to recover the traceability. Elements: code extern documentation, text normalisation, indexer; source code, query extraction, indexer; document classifier; scored document list.]

2.2.2 Recovering traceability links using Latent Semantic Indexing

This idea, introduced by Andrian Marcus and Jonathan I. Maletic in "Recovering documentation-to-source-code traceability links using latent semantic indexing" [MM03], is one of the few that is also applicable to existing (e.g., legacy) systems. In this approach the links between the code extern (prose) documentation and the source code are established using Latent Semantic Indexing (a brief introduction follows below). Marcus and Maletic use the source comments and the identifier names (e.g., method or field names) to produce semantics. These semantics are used to construct traceability links to the documentation external to the source code. Of course this only delivers useful results if the naming is meaningful in the source code components as well as in the documentation.
Many approaches use a predefined vocabulary (mostly for statistical analysis), which requires expensive preprocessing and complex string manipulations. This solution, since it is based on Latent Semantic Indexing, uses no such vocabulary. Marcus and Maletic state that, because the preprocessing and string manipulation are saved, their solution is faster than others.

Latent Semantic Indexing in a nutshell

Latent Semantic Indexing (LSI) was introduced by Scott C. Deerwester et al. in the article "Indexing by Latent Semantic Analysis" [DDL+90]. LSI can be used to identify main components (or concepts) of huge amounts of data. If, for example, the general expression Ship is identified, the expressions Boat, Cutter, or Shallop are covered as well. LSI can also help to identify places where Ship is mentioned but is not a relevant match (e.g., a contest where a boat trip can be won).

LSI is based on mathematical matrix operations. First a Term-Document-Matrix is built. In that matrix the number of occurrences of each term (i.e., word) per document is saved. In Table 2.1, for example, the term amazing appears three times in document 2, once in document 5, and never in documents 1, 3, 4, and 6. If necessary the table can also be weighted.

Table 2.1: Example for a Term-Document-Matrix. (Rows: the terms this, term, is, amazing; columns: doc 1 to doc 6.)

In a next step, the singular-value decomposition (SVD) of the matrix X is computed (see the formula below). The resulting matrices T and D have orthonormal columns and S is diagonal. Then, by omitting the lowest singular values, the dimension can be reduced to a limit k that is not fixed in advance (the reduced matrices are $S_k$ and, analogously, $U_k$, which here denotes the reduced term matrix). Finally queries q can be transformed to the semantic space (they are seen as special documents of size $m \times 1$).

SVD:
\[ X = T\,S\,D^{T} \]
Query:
\[ Q = q^{T}\,U_k\,\mathrm{diag}(S_k) \]

After transforming all the documents to the semantic space (same procedure as for q), q can be compared to the documents using the inner product or cosine similarity. The advantage of LSI is that it solves the synonym problem. The disadvantages are its weaker treatment of polysemy (i.e., one word having different meanings) and the relatively high amount of computing power needed.

2.3 Applicability for our approach

Notwithstanding the apparent similarity of matching source code to source comments or to source extern documentation, we found these tasks to be not as exchangeable as they seem.
Normally, documentation external to the source code is much more detailed and contains more labeled connections to the source code than comments within the source code do. By labeled connections we mean that there has to be a description of which part of the source code is documented. Connections between source comments and the source code, on the other hand, are defined mainly through their proximity. A source comment does not have to explain in words which source code node it refers to. So the position of a source comment inside the document is an important, not to say the most important, part of the information. It is crucial for recovering the link to the source code. Based on these two points, we think that these approaches are not applicable in a first step, but can be interesting in the future, when it comes to optimizing the results and recognizing special types of source code to source comment couplings. What we doubt is that these advanced techniques are useful in a first phase. Due to the mentioned differences between the problems, it is not possible to use the same algorithms (i.e., unaltered, without major changes) to find source code to source comment couplings.
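As a rough illustration of the vector space based Information Retrieval model described earlier in this chapter, the following Java sketch decomposes identifiers, normalises text, and compares a class query with a document by cosine similarity. It is not the implementation of Antoniol et al.: the class VectorSpaceSketch, its method names, and the tiny stop word list are assumptions made for this example, and the conversion of plurals and inflected verbs is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; names and stop words are assumptions.
class VectorSpaceSketch {

    // Assumed, deliberately tiny stop word list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "to", "and"));

    // Splits an identifier at lower-to-upper case boundaries and underscores,
    // then lowercases the parts, e.g. "printStackTrace" -> [print, stack, trace].
    static List<String> splitIdentifier(String identifier) {
        List<String> parts = new ArrayList<>();
        for (String part : identifier.split("(?<=[a-z0-9])(?=[A-Z])|_")) {
            if (!part.isEmpty()) {
                parts.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return parts;
    }

    // Normalises free text: lower case, letters only, stop words removed.
    static List<String> normalise(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Counts how often each term occurs.
    static Map<String, Integer> termFrequencies(List<String> terms) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : terms) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity of two term-frequency vectors; 0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A query built with splitIdentifier would be scored against every document vector, and the documents sorted by descending cosine value would form the scored list for that class. Because the sketch has no stemming, "prints" and "print" count as different terms, which is exactly the gap the normalisation step of the original method is meant to close.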

Chapter 3
Comments Adaptation

In this chapter we present the tools and algorithms we use to extract and analyze the source code to comment couplings.

3.1 Outline of process

Here we give an overview of the processes introduced in this chapter. Further, we explain the terminology used.

3.1.1 Process overview

Simple numeric analysis of change couplings

In the first part we show a relatively simple tool to perform a quantitative comparison. This comparison is able to provide hints as to whether source code changes are coupled to changes in the source comments or not.

[Figure 3.1: Outline of the procedure for our simple numeric analysis. Elements: revisions of a class, comments of a revision, GNU diff, number of comment changes, database, number of source code changes, add up occurrences.]

Figure 3.1 depicts this simple analysis. First we get all available revisions of the class we want to analyze. From the revisions we extract only the comments. Then each pair of consecutive comment versions is compared and the number of changes is summed. The numbers of source code changes for that revision are fetched from the database and the resulting numbers are saved for further analysis.
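How comment changes might be counted from diff output can be sketched as follows. The sketch assumes that the comments of two consecutive revisions are compared with GNU diff in its default output format, where each hunk starts with a header such as 3a4,5 (add), 5,7d4 (delete), or 8c9 (change), and that every hunk counts as one change. The class DiffChangeCounter is hypothetical; the counting actually used in the thesis is described in its Section 4.1 and may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts comment changes from diff's default output.
class DiffChangeCounter {

    // Hunk header of the default ("normal") diff format,
    // e.g. "3a4,5", "5,7d4" or "8c9".
    private static final Pattern HUNK =
            Pattern.compile("^\\d+(?:,\\d+)?([acd])\\d+(?:,\\d+)?$");

    // Returns {inserts, deletes, updates}; one change per hunk (an assumption).
    static int[] countChanges(String diffOutput) {
        int[] counts = new int[3];
        for (String line : diffOutput.split("\\R")) {
            Matcher m = HUNK.matcher(line);
            if (!m.matches()) {
                continue; // content lines start with '<', '>' or "---"
            }
            switch (m.group(1).charAt(0)) {
                case 'a':
                    counts[0]++; // lines were added: insert
                    break;
                case 'd':
                    counts[1]++; // lines were removed: delete
                    break;
                default:
                    counts[2]++; // 'c', lines were changed: update
                    break;
            }
        }
        return counts;
    }
}
```

The three counts per revision could then be stored next to the source code change counts from the database. Counting hunks rather than lines is a design choice of this sketch: it treats a multi-line comment that was rewritten as one update.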

In depth analysis process

For the in depth analysis more work has to be done. For a graphical overview see Figure 3.2. First the source code of all revisions of a file is fetched from the database. Then for each revision all the comments are extracted. Several line comments have to be combined into one whenever they comment the same piece of source code. When this is done, the comments have to be mapped to their corresponding source code. After that these mappings can be tracked over multiple revisions, and the comment changes as well as the changes in their related source code can be analyzed. To do this, we take advantage of the source code changes stored in the database. At the end we write the results to a file for further analysis.

[Figure 3.2: The outline of the steps performed during an in depth analysis. The colors are used to highlight the different chapters. Elements: database, revisions of a class (source code), comment extraction and block building for line comments, mapping source code nodes to source comments, tracking comment nodes over multiple revisions, result extraction, source code change operations.]

3.1.2 Terminology

Here we introduce several important expressions used in this and the following chapters.

Node, source node, source code node, and source comment node
    A source comment node is one comment, e.g., a block of Javadoc, a block comment, or a line comment. A source code node is, with few exceptions, a line of source code. The expressions node and source node include source code nodes as well as source comment nodes. We talk about nodes because we are working with Abstract Syntax Trees (AST). Their nodes are the source code or comment elements we analyze.

Preceding node, current node, and succeeding node
    The current node is the one we are analyzing at the moment. When talking about preceding (respectively succeeding) nodes, we mean the nodes immediately before (respectively after) the current node. In the example in Listing 3.1 the current comment node //output has the variable declaration for i as preceding code and the method invocation of System.out.println(...) as succeeding code. The same terms are used for whole revisions of a class, e.g., revision 1.34 precedes revision 1.35, which in turn is succeeded by the next revision.

    public static void main(String[] args){
        int i = 0;
        //output
        System.out.println("i is " + i);
    }

Listing 3.1: Preceding, current and succeeding nodes.

Track of nodes
    A track of nodes, either source code or comment, is the sequence of one specific node, tracked over all revisions of a class.

Change type insert, delete, and update
    We define the three operations insert, delete, and update of source code as change types. A source code insert is a part of the source code that has been added in the current revision; analogously, a delete is a deleted part of the source code, and an update is a changed part of the source code.

Corresponding node or related node
    The corresponding (or related) source code node for a certain comment node is the source code described by the comment. Analogously, the corresponding comment of a source code node is the comment describing the source code node.

Common change
    A common change is a change of a source comment together with a change of the corresponding source code node at the same time, i.e., in the same revision. Common changes can be split into two different types, namely common changes of
    1. the same change type, e.g., code and comment are both updated.
    2. a different type, e.g., the code has been updated, while the comment has been deleted.

Comment block or block of comments
    These expressions signify blocks of different comment types (not the same as a block comment). A comment block can be either a single Javadoc or block comment, or one or more line comments combined into a block. In Listing 3.2 we can see one block, consisting of two line comments.

    //This line comments, as well as
    //this one, the following source code
    System.out.println("Success!");

Listing 3.2: A block, consisting of two line comments.
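The definitions of change type and common change given in the terminology can be expressed compactly in code. The following Java sketch is only an illustration of those definitions; the names CommonChangeSketch, ChangeType, Coupling, and classify are assumptions for this example and are not taken from CHANGEDISTILLER.

```java
import java.util.Optional;

// Hypothetical model of the terminology; not part of CHANGEDISTILLER.
class CommonChangeSketch {

    // The three change types defined in the terminology.
    enum ChangeType { INSERT, DELETE, UPDATE }

    // Possible outcomes for one comment and its corresponding code node
    // within a single revision.
    enum Coupling { NO_COMMON_CHANGE, COMMON_SAME_TYPE, COMMON_DIFFERENT_TYPE }

    // codeChange and commentChange describe what happened to the source code
    // node and to its corresponding comment in the same revision; an empty
    // Optional means that the node did not change in that revision.
    static Coupling classify(Optional<ChangeType> codeChange,
                             Optional<ChangeType> commentChange) {
        if (!codeChange.isPresent() || !commentChange.isPresent()) {
            // Only one side (or neither) changed: no common change.
            return Coupling.NO_COMMON_CHANGE;
        }
        return codeChange.get() == commentChange.get()
                ? Coupling.COMMON_SAME_TYPE
                : Coupling.COMMON_DIFFERENT_TYPE;
    }
}
```

For instance, an updated code node whose comment was deleted in the same revision would be classified as a common change of a different type, matching the second case in the definition above.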
3.2 Numeric analysis of source code to source comment change couplings

In a first step we wanted to see, on a simple numeric basis, whether we are able to find clues that the adaptation of comments is related to changing source code. This is a first analysis of the relation between code and comments and is independent of the in depth analysis presented in the next sections. The results of this analysis can be used as a complement to the specific results, but the two computations do not directly rely on each other.

3.2.1 Computing the number of changes

For this analysis we implemented an Eclipse plugin. The plugin uses the database created by EVOLIZER^1 and CHANGEDISTILLER to get information on source code changes. For each revision of each class (i.e., .java file) we get the number of changes with respect to the preceding revision, i.e., the number of insert, delete, and update operations. These numbers are compared to the numbers of inserts, deletes, and updates in the source comments. For implementation details on the computation of the number of comment changes see Section 4.1. As a result we get, for each revision of a class, the total number of inserts, deletes, and updates for all source code nodes as well as for all comment nodes.

3.2.2 Retrieving the results

A simple comparison of the computed total numbers does not deliver a useful result. The numbers themselves, as well as percentages, are not significant because the relation between them is missing. Assume, e.g., that there are three inserts in the source code but only one in the comments, so there are three times as many source code inserts as comment inserts. But what does that mean, and how can it be interpreted? It can mean that two source code inserts are uncommented, but it can just as well be one source comment commenting all three source code inserts. We have nothing to base a decision on.

We decided to do a comparison on a quantity basis. We ranked the three change types by the frequency of their appearance; e.g., assuming three source code inserts, one delete, and five updates, the ranking is updates (5) before inserts (3) and deletes (1). This is done for the number of source code changes as well as for the number of source comment changes. Having computed the rankings, we can compare the order of the ranks. This allows us to make statements such as "in the source code, as well as in the comments, there were more updates than inserts or deletes."
This provides a hint as to whether there are common changes of source code and comments in a revision or not. The underlying assumption is that a high number of source code and comment changes of the same type makes it more probable that there are common changes in a revision. See Table 3.1 for an example output. We used the class org.argouml.kernel.Project from the ArgoUML Project^2 for the example.

3.2.3 Rating the computed rankings

As we can see in the example in Table 3.1, each revision gets a rating of either -1, 0, or 1. The list after this paragraph explains the different ratings. In each case the expected statistical probability is given at the end, because in a random distribution not every rating is equally probable. The probability is calculated by dividing the number of recognized cases by the number of all possible cases (see Figure 3.3 for more details).

A further distinction handles a special case: three identical values are treated differently, depending on their value. When the total number of changes (for source code or comments) is zero, this is rated with a -1, otherwise with a 1. This fact is not reflected in our probability measure, so the expected probability for a -1 is slightly too high and the one for a 1 accordingly too low. To take account of this special case, we reduce the probability for a rating of -1 by 3% and increase the one for a rating of 1 by the same number of percentage points. The value of 3% is based on a number of tests in which we compared how often the two cases really appear. Normally three identical values only occur when no changes to the source (code or comment) were applied, i.e., in 97% of the occurrences there were no changes.

Footnotes:
1 EVOLIZER, last visited on May 4
2 ArgoUML Project, last visited on May 2, 2007
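The ranking step that precedes the rating can be sketched in Java as shown below. The sketch groups the three change types by their counts, most frequent first, and lets equally frequent types share a rank. The class ChangeRanking and its methods are assumptions for illustration; in particular, sameTopType only covers the clearest situation in which two rankings agree and does not reproduce the full rating rules.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of the ranking step; not the plugin's actual code.
class ChangeRanking {

    enum ChangeType { INSERT, DELETE, UPDATE }

    // Groups the change types by count, most frequent group first.
    // Types with equal counts end up in the same group (shared rank).
    static List<List<ChangeType>> rank(int inserts, int deletes, int updates) {
        TreeMap<Integer, List<ChangeType>> byCount =
                new TreeMap<>(Collections.reverseOrder());
        byCount.computeIfAbsent(inserts, k -> new ArrayList<>()).add(ChangeType.INSERT);
        byCount.computeIfAbsent(deletes, k -> new ArrayList<>()).add(ChangeType.DELETE);
        byCount.computeIfAbsent(updates, k -> new ArrayList<>()).add(ChangeType.UPDATE);
        return new ArrayList<>(byCount.values());
    }

    // True if both rankings have the same, unshared most frequent change type.
    // This is only the simplest agreement check, not the complete rating.
    static boolean sameTopType(List<List<ChangeType>> code,
                               List<List<ChangeType>> comment) {
        return code.get(0).size() == 1 && code.get(0).equals(comment.get(0));
    }
}
```

With the example from the text (three inserts, one delete, five updates), rank(3, 1, 5) yields the groups [UPDATE], [INSERT], [DELETE] in that order. A revision without any changes yields a single group containing all three types, which corresponds to the case that cannot be rated.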

[Figure 3.3: Overview over the rating of the ranking pairs. The figure shows the mappings between two rankings; there are 91 possible mappings, and each matching results in a rating of -1, 0, or 1. In a ranking, the leftmost number indicates the change type which occurred most, the rightmost the least frequent (1 = insert, 2 = delete, 3 = update, and 0 = two or all three types were equally frequent).]

28 14 Chapter 3. Comments Adaptation revision insert delete update insert delete update rating code code code comment comment comment Table 3.1: The table shows the number of source code and comment changes. The rating In the last column is 1 if there is a coupling between the number of source code and comment changes, 0 if there is no coupling and -1 if the numbers can not be rated. A 1 is rated only if there were no changes: neither in the source code, nor in the comments. Because there can not be any common changes, when there are no changes at all, such revisions are not rated. The expected probability is 13 out of 91 cases, thus or 14.29%. If we reduce this value (to accommodate for the zero not zero values distinction) by 3% this results in a reduction of 0.43 percent points and in a probability of respectively 13.86%. There are two conditions, rated with a The most frequent change type in the source code is not the same as the most frequent comment change type and neither the first place of the source code, nor the one of the comment is shared. Shared first place means that the first and second most frequent have the same number of changes. Revision in Table 3.1 is an example for such a decision. The most changes in the source code were deletes, but the most changes in the source comment were updates). 2. The most frequent change type of either source or comment is the least frequent change type of the other and they are not shared first (respectively) third places, i.e., having the same number of changes as the second most frequent. For an example see revision in Table 3.1. The most frequent comment change type are deletes, which is the least frequent source code change type. There are 36 of the 91 cases leading to a 0 as rating. Accordingly the probability is or 39.56% All not yet mentioned cases result in a rating of 1. 
These are all the cases where the most frequent change types match, as well as most cases where two change types with the same number of changes are involved. The expected probability is 42/91 or 46.15% (42 out of the 91 cases). This value has to be increased by the 0.43 percentage points subtracted in the -1 case, resulting in a total of 46.58%.
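The arithmetic behind the three expected percentages can be retraced with a few lines of Java. This is only an illustrative sketch: the class name `ExpectedRatings` and its constants are hypothetical, and the 3% reduction is applied exactly as described in the text, i.e., relative to the share of the unrated cases and then added to the share of the cases rated 1.

```java
import java.util.Locale;

public class ExpectedRatings {

    // Case counts taken from the text: 91 ranking pairs in total.
    static final int TOTAL = 91;
    static final int UNRATED = 13;      // rating -1
    static final int NO_COUPLING = 36;  // rating 0
    static final int COUPLING = 42;     // rating 1
    static final double REDUCTION = 0.03;

    /** Returns the expected percentages for the ratings {1, 0, -1}. */
    static double[] expectedPercentages() {
        double unrated = 100.0 * UNRATED / TOTAL;
        // 3% of the unrated share is moved to the cases rated 1.
        double shift = unrated * REDUCTION;
        return new double[] {
            100.0 * COUPLING / TOTAL + shift,
            100.0 * NO_COUPLING / TOTAL,
            unrated - shift
        };
    }

    public static void main(String[] args) {
        double[] p = expectedPercentages();
        System.out.printf(Locale.ROOT,
                "rating 1: %.2f%%, rating 0: %.2f%%, rating -1: %.2f%%%n",
                p[0], p[1], p[2]);
    }
}
```

Since the 0.43 percentage points are only moved from one rating to another, the three expected percentages still add up to 100%.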

As the last part of this section, we present the results we obtain when using our plugin to analyze our example class (i.e., ArgoUML's Project.java). We do not go into detail here, but only show Table 3.2. During the case studies (see Chapter 5), we give a detailed analysis and explanation of similar tables.

Ranking                      does match...   does not match...   is not rateable   total
number of occurrences
percentage of occurrences    32.75%          17.54%              49.71%            %
expected percentage          46.58%          39.56%              13.86%            %
occurrence / expectation

Table 3.2: Results when analyzing the ArgoUML class Project. There are 172 revisions of the file (only 171 are analyzed, because revision 1.1, as the first revision, has no preceding revision to compare with).

3.3 Comment extraction and block building for line comments

We use the ASTParser of Eclipse to extract the source comment nodes. The parser builds an Abstract Syntax Tree (AST) from the source code. Unfortunately, this AST contains only the Javadoc comments. Line and block comments are excluded because they lack a distinct link to the source code: the parser cannot be sure where to put these comments in the tree. Nevertheless, the comments are returned. The parser generates a List containing all the comments, including the Javadoc comments, which are already contained in the AST. While Javadoc and block comments can be taken as they are, line comments have to be combined, where necessary, into a block of line comments. For more details see Section 4.2. Remember that we use the expressions comment block or block of comments (but not block comment) for a single Javadoc comment, a single block comment, or one or more line comments combined into a block.

3.4 Mapping source code nodes to source comments

Finding the corresponding source code node for a certain comment block, while in most cases a trivial task for humans, is hard for a computer.
Matching is difficult as long as the computer only does syntactical matching and no semantic matching, i.e., as long as it matches formal things but not abstract things, e.g., words but not their meaning. This is probably one of the main reasons why comments are explicitly excluded in most of the work in this field (see Chapter 2). First we show the different possibilities for source comments and their mapping to source code nodes, i.e., the patterns to recognize. Then we present our matching algorithm, and finally we show the cases recognized and the ones not recognized by our algorithm.

Types of source comment to source code matchings

All examples used in this section are taken from different revisions of ArgoUML's file Project.java (class org.argouml.kernel.Project).

One source comment to one source code node

In the majority of cases one source comment block can be matched to one source code node.

    /**
     * True if we are in the proces of making a project, otherwise false
     */
    private static boolean _creatingProject;

Listing 3.3: One comment for one source code statement. Project.java Revision 1.44

    /**
     * Moves some object to trash. This mechanisme must be rethought since
     * it only deletes an object completely from the project
     * @param obj The object to be deleted
     * @see org.argouml.kernel.Project#trashInternal
     */
    ////////////////////////////////////////////////////////////////
    // trash related methos
    //Attention: whole Trash mechanism should be rethought concerning nsuml
    public void moveToTrash(Object obj) {... }

Listing 3.4: Three comment blocks describing one method. Project.java Revision 1.44

Normally the source code succeeds the comment. This case can be expected as the default. See Listing 3.3 for an example.

Multiple source comments to one source code node

There are also cases with several source comments for one source code node. Here it has to be decided whether all the comments match the succeeding source code node or whether a part of them matches the preceding node. See Listing 3.4 for an example.

One source comment to multiple source code nodes

As there can be multiple comments for one source code node, it is also possible to have one source comment for multiple source code nodes. This can be, for example, a comment stating that constant declarations follow, as we can see in Listing 3.5. This is not trivial to recognize. The algorithm would have to understand the semantics of the source code to find such occurrences, especially for comments that are not very detailed (like the simple "constants" in the example).

Source comment succeeding a source code node

There can also be a source comment succeeding the corresponding source code.
This comment can be either on the same line as the source code (as, e.g., in Listing 3.6) or on a succeeding line (e.g., see the block comment in Listing 3.7). This can be handled analogously to the occurrences of one comment succeeded by one source code node.

    // constants
    public static final String SEPARATOR = "/";
    public final static String FILE_EXT = ".argo";
    public final static String TEMPLATES = "/org/argouml/templates/";

Listing 3.5: One line comment describing several source code nodes. Project.java Revision 1.1

    class ResetStatsLater implements Runnable {
        public void run() {
            Project.resetStats();
        }
    } /* end class ResetStatsLater */

Listing 3.6: Comment succeeding the source code, but still on the same line. Project.java Revision 1.1

    while (iter.hasNext()) {
        currentMember = iter.next();
        if (currentMember instanceof ProjectMemberTodoList) {
            /* No need to have several of these */
            return;
        }
    }
    // got past the veto, add the member
    _members.addElement(pm);

Listing 3.7: A line comment, linking between the preceding and the succeeding source code. Project.java Revision 1.97

Source comment linking between two source code nodes

As a last type of matching, there are source comments linking between two source code nodes. In Listing 3.7 we can see an example of a line comment that first comments on the decision taken in the preceding while loop and then states what is going to happen in the succeeding method invocation. As already seen for the mapping of one comment to several source code nodes, this is undecidable without semantic checking.

Preparing the data

Some preliminary work has to be done before the actual matching and mapping. Besides the obvious extraction of the comments described in Section 3.3, we have to find the candidates for the corresponding source code node of each source comment block. This task can be divided into two subtasks.

1. Generally we consider the comment as corresponding to the source code node preceding or succeeding it. We find these nodes by searching for the source code nodes with the start positions closest to the comment block.

2. In a second step we have to check the possible candidates for their plausibility and adapt them if needed. More details on this can be found in Section 4.3.
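The first subtask can be sketched in Java as follows. This is only a minimal sketch under simplifying assumptions, not the implementation of the plugin: the class `Node`, its `start` offset, and the method `findCandidates` are hypothetical names, and the source code nodes are assumed to be available as a flat list sorted by start position. The plausibility check of the second subtask is not covered.

```java
import java.util.Arrays;
import java.util.List;

public class CandidateFinder {

    /** Hypothetical stand-in for a source code node: a label and a start offset. */
    static final class Node {
        final String label;
        final int start;

        Node(String label, int start) {
            this.label = label;
            this.start = start;
        }
    }

    /**
     * Returns {preceding, succeeding}: the nodes whose start positions are
     * closest before and after the start of the comment block. An entry is
     * null if no such node exists. Assumes nodes are sorted by start position.
     */
    static Node[] findCandidates(List<Node> nodes, int commentStart) {
        Node preceding = null;
        Node succeeding = null;
        for (Node node : nodes) {
            if (node.start < commentStart) {
                preceding = node;      // remember the last node before the comment
            } else if (node.start > commentStart) {
                succeeding = node;     // first node after the comment
                break;
            }
        }
        return new Node[] { preceding, succeeding };
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                new Node("method declaration", 0),
                new Node("return statement", 140));
        Node[] candidates = findCandidates(nodes, 45);
        System.out.println("preceding:  " + candidates[0].label);
        System.out.println("succeeding: " + candidates[1].label);
    }
}
```

Note that with start positions as the only criterion, an enclosing structure such as a method declaration can become the preceding candidate of a comment inside its body.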

    public boolean isTrue(String thesis){
        /*
         * magic box:
         * the return value indicates whether the thesis is true or not.
         */
        return secretMagicThesisTester(thesis);
    }

Listing 3.8: The method wins against the conditional

The source comment to source code mapping algorithm

Here we present the algorithm we use to decide which of the candidates found in the last section is the corresponding source code node for a certain comment. The algorithm works with a rating system: there are conditions that grant points to the candidates. Points are granted for the following:

0.5 points for the succeeding node. The source code node succeeding the source comment gets half a point. The reason is that a succeeding source code node is more probable than a preceding one, because comments normally precede the corresponding source code. We use only 0.5 points instead of 1 point because this condition is only intended to decide what would otherwise be a draw; it is not intended to influence the result in other situations. Thus, if both candidates are equally probable, the succeeding node is taken as the corresponding node.

1 point if on the same line. If the source comment is on the same line as a source code node, one point is granted. Comments sharing a line with source code normally comment that code.

    int iter = 0; //Iterator for while loop

1 point if on the preceding or, respectively, succeeding line. Source code nodes on the line immediately preceding or, respectively, succeeding the source comment are granted a point. By immediately we mean that there is not more than one line break between the nodes. The reason is that comments are normally in direct proximity to their corresponding source code node, whereas there is normally more free space between a source code node and a comment related to another source code node.
1 point for each matching word. Each word appearing in the source comment as well as in the source code node grants a point. For example, the following listing grants two points, one for the int and one for the iter appearing in both the source code node and the source comment.

    int iter = 0; //The int iterator iter is used for the next loop

After summing the points for both candidates, there could in principle be three results: the preceding node wins, the succeeding node wins, or there is a draw. As explained, the half point for the succeeding node makes a draw impossible, therefore two possible results remain: either the preceding or the succeeding source code node has more points than the other. The node with more points wins and is saved as the corresponding source code node for the current comment.

After first tests, we recognized one important weakness of the algorithm. Large source code structures always win and are thus seen as the corresponding node. This is because they often contain the source comment itself. In Listing 3.8 there is an example comment with a preceding method node and a succeeding return statement. For a reader who understands the semantics, it is obvious that the comment describes what is returned by the return statement. Thus the corresponding node is the return statement. However, the algorithm in its first version comes to a different result.

In the analysis with the algorithm, the return statement gets 1.5 points through its position (0.5 points for being the succeeding node and 1 point for only one line break in between) and 2 more points for the matchings of return and thesis. This comes to a total of 3.5 points for the return statement. The method gets 1 point for the proximity, as well as 2 points for the return statement it contains (as calculated before). The matching of the header and of the comment itself grants 16 additional points (i.e., 1 from the thesis in the header and 15 from the comment itself: 1 for each word plus 2 because "the" appears twice), resulting in a total of 19 points. So the preceding method node clearly wins with its 19 points against the 3.5 points of the return statement, which should win.

To counter such situations, we decided to work only with the header for larger structures. In the example case, we only take public boolean isTrue(String thesis) to calculate the points. Doing so, the comment itself and the succeeding node (i.e., the return statement) are no longer contained in the preceding node. The result in the example is as follows: for the return statement nothing changes, so the succeeding node still gets its 3.5 points. This is different for the method. The method still gets 1 point for the proximity and 1 point for the matching of thesis in the header.
The method's total is 2 points, which loses against the 3.5 points of the return statement.

Shortcomings of the algorithm

Our algorithm is able to recognize most of the source code to source comment matching types described in the section on the types of source comment to source code matchings. However, there are some types that are not recognizable. The ones the algorithm recognizes are:

- One source comment to one source code node.
- Multiple source comments to one source code node.
- Source comment succeeding a source code node.

The algorithm does not recognize the following:

- One source comment to multiple source code nodes.
- Source comment linking between two source code nodes.

The reason we decided not to implement a recognition for them is the high complexity of the problem, as well as the much higher computing power that is needed to get at least some results. To get good recognition results in such situations, an algorithm inevitably has to understand the semantics of the given source code. Only then can it find reliable source comment to source code matchings in real life environments.[3] A number of random samples of such comment nodes that are not recognizable by our algorithm showed that the comment normally contains no syntactic clues an algorithm can use to find out whether a comment describes one node or several nodes. Hence, the syntactical analysis (and analogously also a statistical one) can deliver only few results, but needs a lot of computing power for the large number of different possible solutions.

[3] By real life environment we mean the source code of an arbitrary project, i.e., source code which is not explicitly designed to be understood by an algorithm, but is designed for humans.
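The rating system of the mapping algorithm, including the header-only refinement, can be sketched in Java as follows. This is a simplified sketch and not the plugin's code: the class `CommentMappingScore` and its methods are hypothetical, a candidate is reduced to its text and the number of line breaks between it and the comment, and each distinct shared word is counted once (the thesis example counts a repeated "the" more than once when the comment itself is part of the node).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class CommentMappingScore {

    /** Splits text into lower-cased words, dropping punctuation and comment markers. */
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /**
     * Rates one candidate node for a comment.
     *
     * @param comment           text of the comment block
     * @param nodeText          text of the candidate (only the header for large structures)
     * @param lineBreaksBetween number of line breaks between comment and candidate
     * @param succeeding        true if the candidate follows the comment
     */
    static double score(String comment, String nodeText,
                        int lineBreaksBetween, boolean succeeding) {
        double points = succeeding ? 0.5 : 0.0;   // tie-breaker for the succeeding node
        if (lineBreaksBetween == 0) {
            points += 1.0;                        // on the same line as the comment
        } else if (lineBreaksBetween == 1) {
            points += 1.0;                        // on the directly adjacent line
        }
        Set<String> common = words(comment);
        common.retainAll(words(nodeText));        // words shared by comment and node
        return points + common.size();
    }

    public static void main(String[] args) {
        String comment = "/* magic box: the return value indicates "
                + "whether the thesis is true or not. */";
        double header = score(comment, "public boolean isTrue(String thesis)", 1, false);
        double ret = score(comment, "return secretMagicThesisTester(thesis);", 1, true);
        System.out.println("method header: " + header + ", return statement: " + ret);
        System.out.println(ret > header ? "return statement wins" : "method header wins");
    }
}
```

For the example of Listing 3.8 the sketch arrives at the same totals as the text: 2 points for the method header and 3.5 points for the return statement. Because only the succeeding node can receive the half point, the two totals can never be equal.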

    /**
     * Returns the character found at the specified position in the
     * given String.
     *
     * @param str The String to get the character from.
     * @param index An index, specifying a position within the String.
     * @return The character at the specified position.
     * @throws IndexOutOfBoundsException is thrown if the index is not
     * within the String.
     */
    public char charAt(String str, int i) throws IndexOutOfBoundsException{
        return str.charAt(i);
    }

Listing 3.9: Javadoc example with different tags.

In detail, we have to check for each of the m comment nodes every possible combination of source code nodes. For n source code nodes, the number of combinations is

\[ \binom{n}{0} + \dots + \binom{n}{n-1} + \binom{n}{n} = \sum_{k=0}^{n} \binom{n}{k} = 2^n \]

So the total runtime is m \cdot 2^n, thus exponential, which is really slow. Even if we constrain the search by only allowing connected blocks of code nodes, the runtime is still of polynomial (i.e., cubic) order. By connected code blocks we mean that no gaps of non-corresponding nodes are allowed between corresponding ones.

3.5 Tracking comment nodes over multiple revisions

To find changes in the source comments and in the corresponding source code nodes (especially changes over several versions), it is necessary to track a comment over the whole chain of revisions. In the first subsection we discuss the source comment similarity measure we use to find changes (and their types) between two versions of a comment. In the next subsection we present our tracking algorithm as well as the algorithm to recognize the changes (and change types) of a comment. Finally, we discuss the evaluation of the found comment chains to get meaningful results.

Comment similarity measure

When looking for changes in comment nodes we can distinguish between Javadoc and line or block comments. In contrast to the structureless line and block comments, Javadoc has a syntax; thus it can be divided into different tags, which are related to certain parts of, e.g., a method declaration, as in Listing 3.9.
We would like to add that our current analysis uses only a part of the information available. We only analyze whether there has been a change and, if so, what type of change it has been. Details such as which part of a comment node has been changed are not evaluated so far.
