The University of Saskatchewan Department of Computer Science. Technical Report #


The Road to Software Clone Management: A Survey
Minhaz F. Zibran, Chanchal K. Roy
University of Saskatchewan, Canada
February 13, 2012

Contents

1 Introduction and Motivation
2 A Systematic Review on Clone Literature
  2.1 Threat to Validity
3 Code Clone
  3.1 Clones beyond Source Code
  3.2 Clone Relationship
  3.3 Clone Granularity
    3.3.1 Which Level of Granularity Is Appropriate?
  3.4 Intentional and Accidental Clones
  3.5 Clone Detection Techniques
    Strengths and Weaknesses of Clone Detection Techniques
    Challenges in the Empirical Evaluation of Clone Detection Tools
  3.6 Clone Evolution
    Clone Genealogy
    Genealogy Extraction:
    Clone Change Patterns
    Need for Improvements
    Visualization of Clone Evolution
4 Clone Management
  4.1 Clone Management Strategies
  4.2 Design Space for a Clone Management System
    Architectural Centrality
    Triggering of Clone Management Activity
    Scope of Clone Management Activity
  4.3 Clone Management Activities
  4.4 Integrated Clone Detection
  4.5 Clone Documentation
  4.6 Clone Tracking
    Incremental Clone Detection
  4.7 Clone Annotation
  4.8 Techniques for Reengineering/Refactoring of Clones
    Generics and Templates
    Design Level Approaches
    Design Patterns:
    Traits:
    Aspects:
    4.8.3 Synchronized Modification
    Consistent Renaming
    Refactoring Patterns
    Tool Support for Refactoring Patterns:
  4.9 Analysis and Identification of Clones for Refactoring
    Visualization of Distribution and Properties of Clones
    Analysis to Find Clone Based Reengineering Opportunity
    Clone Categorization Based On Reengineering Opportunity:
  Cost-benefit Analysis and Scheduling of Refactoring
  Verification of Clone Modification/Refactoring
5 Industrial Adoption of Clone Management
6 Conclusion
Bibliography

List of Figures

1 Yearly number of distinct authors contributing to clone research
2 Categories of publications on software clone research in different years
3 Proportion of publications in each category over the period
4 A clone genealogy with two lineages over versions v_k through v_{k+3} [201]
5 Clone change patterns and types of genealogies [171]
6 Clone management workflow
7 Kapser and Godfrey [107] Taxonomy: clone categorization based on location and functionality

List of Tables

1 Categories of theses on software clone research
2 Code Clone Detection Techniques (extended from [126])
3 Summary of techniques for clone genealogy extraction
4 Summary of clone management support from integrated tools
5 Summary of tool support for incremental clone detection
6 Summary of clone visualization techniques (extended from Jiang et al. [94])
7 Balazinska et al. [16] Taxonomy: categories of function/method level clones based on (dis)similarity caused by different types of differences
8 Koni-N'Sapu [121] Taxonomy: applicability of different refactoring patterns on different categories of clones
9 Schulze et al. [178] Taxonomy: applicability of object-oriented (first three) and aspect-oriented (last three) refactoring patterns to refactor categories of clones
10 Torres [189] Taxonomy: clone categorization based on concept location
11 Comparison of Code Clone Refactoring Schedulers

1 Introduction and Motivation

Copying existing code and pasting it somewhere else, followed by major or minor edits, is a common practice that developers adopt to increase productivity. Such a reuse mechanism typically results in duplicate or very similar code fragments residing in the code base. Those duplicate or near-duplicate code segments are commonly known as code clones. There are many reasons why developers intentionally perform such code cloning. An obvious reason is the reuse of an existing implementation without re-inventing the wheel. More comprehensive discussions on the reasons for code cloning can be found elsewhere [106, 161, 163]. Code clones may also appear in the code base without the awareness of the developers. Such unintentional/accidental clones may be introduced, for example, due to the use of certain design patterns, the use of certain APIs to accomplish similar programming tasks, code conventions imposed by the organization, and so on.

The reuse mechanism by code cloning offers some benefits. For instance, cloning existing code that is already known to be flawless might save developers from mistakes they might have made had they implemented the same functionality from scratch. It also saves the time and effort of devising the logic and typing the corresponding code. Code cloning may also help in decoupling classes or components and facilitate independent evolution of similar feature implementations.

On the other end of the spectrum, code clones may also be detrimental in many cases. Obviously, redundant code may inflate the code base and may increase resource requirements. This may be crucial for embedded systems and systems such as hand-held devices, telecommunication switches, and small sensor systems. Moreover, cloning a code snippet that contains an unknown fault may result in propagation of that fault to all copies of the faulty fragment.
From the maintenance perspective, a change in one code segment may necessitate consistent changes in all clones of that fragment. Any inconsistency may introduce bugs or vulnerabilities in the system. Fowler et al. [58] recognize code clones as a serious kind of code smell.

However, during the software development process, duplication cannot be avoided at times. For example, duplication may be forced by limitations of the programming language, which may lack the mechanisms necessary to implement an efficient generic solution to the problem at hand. Code generators may also generate duplicated code that developers do not want to modify. Previous research reports empirical evidence that a significant portion (generally 9%-17% [206]) of a typical software system consists of cloned code, and the proportion of code clones in the code base may be as low as 5% [163] and as high as 50% [162].

Indeed, due to the negative impact of code clones on the maintenance effort, one might want to remove code clones by active refactoring wherever feasible. However, in reality, aggressive refactoring of code clones appears not to be a very good idea [39], and not all clones are really removable through refactoring. Due to the dual role of code clones in the development and maintenance of software systems, as well as the pragmatic difficulty in avoiding or removing them, researchers and practitioners have agreed that code clones should be detected and managed efficiently.

Since the emergence of software clones as a research area, significant contributions over the years have made the field grow into a fairly mature area of research. Nonetheless, over the entire course of software clone research there have been notably three surveys. Koschke [125], in 2007, presented a brief summary of the important findings about different aspects of software clones, including the causes and effects of cloning, clone avoidance, detection, and evolution, along with a set of open questions. In the same year, Roy and Cordy [163] published another survey containing a thorough review of those same areas with specific focus on clone detection tools and techniques. Recently, Pate et al. [159] published a systematic review of 30 studies on clone evolution only. In this paper, we present an extensive survey on code clone research with strong emphasis on clone management.

This paper is organized as follows. In Section 2, we present a systematic review of a repository of 262 papers on software clone research published over 19 years. The review draws a bird's-eye view of the overall contributions and growth along different dimensions of software clone research. The remainder of this survey is the outcome of careful investigation of literature beyond the said repository, and of analysis in the light of our experience. Section 3 starts with the necessary background, including the definition and types of clones. Section 3.3 describes the different granularities at which code clones can be addressed. Section 3.4 distinguishes intentional and accidental clones. In Section 3.5, we describe the different clone detection techniques along with their strengths and weaknesses.
Section 3.6 addresses clone evolution with emphasis on clone change patterns based on the genealogy based evolution model. Section 4 begins with the management of clones beyond detection. Section 4.1 characterizes different strategies for clone management. In Section 4.2, we briefly describe the design space for a clone management system. Section 4.3 starts the discussion on the different clone management activities, including integrated clone detection (Section 4.4), clone documentation (Section 4.5), tracking (Section 4.6) and annotation (Section 4.7). Section 4.8 presents the techniques for clone removal or clone based reengineering. In Section 4.9, we describe the analyses for the identification of potential clones as candidates for refactoring/reengineering. The cost-benefit analysis and scheduling of clone refactoring are presented next, followed by a discussion on the verification of clone refactoring. Our understanding of the challenges for industrial adoption of clone management is presented in Section 5, and finally, Section 6 concludes the paper.

2 A Systematic Review on Clone Literature

There has been more than a decade of research in the field of software clones. To understand the growth and trends in the different dimensions of clone research, we carried out a quantitative review of the related publications. Robert Tairas at the University of Alabama at Birmingham has been maintaining a repository [181] of scholarly articles that make significant contributions in the area. To date, the corpus consists of 264 scholarly articles published between 1994 and 2012 in different refereed venues, including Ph.D., M.Sc., and Diploma theses. The repository organizes the publications by categorizing them based on their contributions in four major subareas of clone research. The categories are as follows:

Analysis: This category contains publications that analyze the various traits of software clones, their reasons, existence, and effects in software systems, as well as publications that investigate clone reengineering opportunities and implications. A majority of such publications report findings from qualitative or quantitative empirical studies.

Detection: Publications in this category address techniques and tools for the detection of software clones.

Tool Evaluation: This category comprises the publications that contribute to the quantitative or qualitative evaluation of the techniques and tools for clone detection.

Management: Publications in this category address the issues, techniques and tools for the management of code clones beyond detection.

Other than the publications that fall in the aforementioned four categories, the corpus also contains three publications categorized as Survey of Overall Research. After careful reading of those articles, we put them in the Analysis category. Moreover, the repository classified the published theses simply by whether they were outcomes of Ph.D., M.Sc., or Diploma programs.
We also went through them and, based on their research focus, classified them (Table 1) into the four categories mentioned above. In our review, we include a total of 262 scholarly articles and theses published from 1994 through 2011. Since it is now early 2012 and more contributions are expected to appear by the end of 2012, we deliberately excluded publications after 2011 so that the yearly counts remain comparable. The repository also contains a list of tools, events, research groups, and links relevant to software clone research, which we excluded from the review. We collected the information by performing automated parsing (followed by manual verification) of the web (HTML) interface of the repository.

Table 1: Categories of theses on software clone research

| Thesis type | Title | Author | Year | Category |
|---|---|---|---|---|
| Ph.D. Thesis | Representation, Analysis, and Refactoring Techniques to Support Code Clone Maintenance | R. Tairas | 2010 | Management |
| Ph.D. Thesis | Scalable Detection of Similar Code: Techniques and Applications | L. Jiang | 2009 | Detection |
| Ph.D. Thesis | Toward an Understanding of Software Code Cloning as a Development Practice | C. Kapser | 2009 | Management |
| Ph.D. Thesis | Assessing the Effect of Source Code Characteristics on Changeability | A. Lozano | 2009 | Analysis |
| Ph.D. Thesis | Detection and Analysis of Near-Miss Software Clones | C. Roy | 2009 | Detection |
| Ph.D. Thesis | Code Clone Analysis Methods for Efficient Software Maintenance | Y. Higo | 2006 | Analysis |
| Ph.D. Thesis | Effective Clone Detection Without Language Barriers | M. Rieger | 2005 | Detection |
| Ph.D. Thesis | Automated Duplicated-Code Detection and Procedure Extraction | R. Komondoor | 2003 | Detection |
| M.Sc. Thesis | Improving Clone Detection for Models | P. Pfaehler | 2009 | Detection |
| M.Sc. Thesis | Clone Detection Using Dependence Analysis and Lexical Analysis | Y. Jia | 2007 | Detection |
| M.Sc. Thesis | Clone Detection Using Pictorial Similarity in Slice Traces | Y. Jafar | 2007 | Detection |
| M.Sc. Thesis | Visualizing and Understanding Code Duplication in Large Software Systems | Z. Jiang | 2006 | Analysis |
| Diploma Thesis | Incremental Clone Detection | N. Göde | 2008 | Detection |
| Diploma Thesis | CPC: An Eclipse Framework for Automated Clone Life Cycle Tracking and Update Anomaly Detection | V. Weckerle | 2008 | Management |
| Diploma Thesis | Semi Automatic Removal of Duplicated Code | Y. Liu | 2004 | Management |
| Diploma Thesis | Automated Detection Of Code Duplication Clusters | R. Wettel | 2004 | Detection |
| Diploma Thesis | A Scenario Based Approach for Refactoring Duplicated Code in Object Oriented Systems | G. Koni-N'Sapu | 2001 | Management |

Figure 1: Yearly number of distinct authors contributing to clone research

Figure 2: Categories of publications on software clone research in different years (series: Analysis, Detection, Management, Tool Evaluation)

Figure 1 plots the number of distinct authors contributing to clone research in the years from 1994 through 2011.

As the figure indicates, the clone research community has experienced significant growth over recent years. In Figure 2, we present the number of publications appearing each year that contribute to each of the four sub-areas of clone research. As seen in the figure, early work on software clones was dominated by research on clone detection, with some work on analysis. In recent years, the work on clone analysis and detection has grown significantly, while clone management has emerged and is growing as a significant research topic. Despite the fast growth of the clone research community, the work on clone management has remained much smaller than that on analysis and detection, which can be perceived more clearly from Figure 3. This, in combination with the recognized importance of research in clone management, points to the further need and potential for research in this sub-area.

Figure 3: Proportion of publications in each category over the period

It can also be noticed from both Figure 2 and Figure 3 that, over the entire span of software clone research, very little work has focused on the evaluation of clone detection techniques or tools, although more than 40 different clone detection tools have been produced realizing a wide variety of techniques [168]. Indeed, the detection of clones is a fundamental topic for software clone research, and the effectiveness of clone management largely depends on clone detection. Therefore, more work on the evaluation of clone detection tools is necessary to identify scope for further improvements. However, the task can be challenging for a number of reasons, which we discuss later under the challenges in the empirical evaluation of clone detection tools.

2.1 Threat to Validity

The corpus of publications used in our systematic review contains a significant number of scholarly articles published over 19 years. Still, a few publications might have escaped the repository. For example, two M.Sc. theses [5, 171] that were completed at the end of 2011 are yet to appear in the corpus. However, to preserve the reproducibility of our systematic review, we did not include such few missing publications. In fact, the corpus is the result of a consistent effort towards exhaustive preservation of publications that contributed significantly to software clone research. Hence, we believe that those few missing articles, if included, would not cause confounding variation in the findings of our review. Indeed, the remainder of this survey is based on relevant literature beyond the corpus used in the systematic review.

3 Code Clone

Though duplicate or similar code fragments are roughly known as code clones, the definition of a clone has remained more or less vague over the last decade. The vagueness is reflected in the definition given by Ira Baxter: "Clones are segments of code that are similar according to some definition of similarity" (Ira Baxter, 2002) [125]. Such a definition depends on how similarity is defined, and also raises the question of how much code can be regarded as a code segment. Clone research over the past decade has somewhat addressed these issues, and the following categorizing definitions of code clones are now widely accepted [125, 163, 161].

Type-1 Clone: Identical code fragments except for variations in white-spaces and comments are Type-1 clones.

Type-2 Clone: Structurally/syntactically identical fragments except for variations in the names of identifiers, literals, types, layout and comments are called Type-2 clones.

Type-3 Clone: Code fragments that exhibit similarity as of Type-2 clones and also allow further differences such as additions, deletions or modifications of statements are known as Type-3 clones.

Type-4 Clone: Code fragments that exhibit identical functional behaviour but are implemented through very different syntactic structures are known as Type-4 clones.

Indeed, a number of other similarity based taxonomies were also proposed by Mayrand et al. [148], Balazinska et al. [16], Bellon et al. [28, 127], Davey et al. [42], and Kontogiannis [122].
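The first three definitions can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions, not a method taken from the surveyed tools: it compares two Python fragments through the standard `tokenize` module, treats equal token streams (comments and layout ignored) as Type-1, equal streams after abstracting identifiers and literals as Type-2, and a similarity above a threshold as Type-3. The function names, the sample fragments, the token-level (rather than statement-level) comparison, and the threshold value 0.7 are all hypothetical choices. Type-4 clones are outside the reach of such textual comparison.

```python
import difflib
import io
import keyword
import tokenize

DROPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
LAYOUT = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def token_stream(source, normalize=False):
    """Return the fragment's tokens, ignoring comments and blank lines."""
    stream = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in DROPPED:
            continue
        if tok.type in LAYOUT:
            # Keep the block structure but not the amount of whitespace.
            stream.append(tokenize.tok_name[tok.type])
        elif (normalize and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            stream.append("ID")   # abstract every identifier
        elif normalize and tok.type in (tokenize.NUMBER, tokenize.STRING):
            stream.append("LIT")  # abstract every literal
        else:
            stream.append(tok.string)
    return stream


def classify(a, b, threshold=0.7):
    """Classify a pair of fragments; the threshold is an assumed value."""
    if token_stream(a) == token_stream(b):
        return "Type-1"
    norm_a, norm_b = token_stream(a, True), token_stream(b, True)
    if norm_a == norm_b:
        return "Type-2"
    matcher = difflib.SequenceMatcher(None, norm_a, norm_b, autojunk=False)
    if matcher.ratio() >= threshold:
        return "Type-3"
    return "not a clone"


# Hypothetical sample fragments.
ORIGINAL = "total = 0\nfor x in items:\n    total += x  # accumulate\n"
RELAYOUT = "total = 0\n\nfor x in items:\n        total+=x\n"
RENAMED = "acc = 1\nfor v in values:\n    acc += v\n"
GAPPED = "acc = 0\nfor v in values:\n    if v > 0:\n        acc += v\n"
UNRELATED = "print('hello')\n"
```

With these samples, `classify(ORIGINAL, RELAYOUT)` reports a Type-1 pair, `classify(ORIGINAL, RENAMED)` a Type-2 pair, and `classify(ORIGINAL, GAPPED)` a Type-3 pair, while the unrelated fragment is rejected. Changing the assumed threshold changes which pairs count as Type-3.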

As described, the first three types of clones are defined based on similarity in the program text, while Type-4 clones are defined based on semantic similarity. Thus, Type-4 clones are also called semantic clones. On the other hand, Type-1 clones are also known as exact clones, Type-2 clones are sometimes called renamed clones, and the Type-2 and Type-3 clones jointly are called near-miss clones [163]. The terms parameterized clone or p-match clone are often used in the community to refer to a subset of Type-2 clones, where there must be a bijective mapping between the identifiers of the two Type-2 clones [125]. Thus, renaming (arbitrary or systematic) of identifiers is allowed in Type-2 clones, whereas for parameterized clones, the renaming must be consistent/systematic [163]. For Type-3 clones, the deletion of a statement from one code segment can be considered as an addition of the statement in the other. The differences in the added or changed statements in Type-3 clones are often called gaps [193], and thus Type-3 clones are sometimes also referred to as gapped clones [125, 163, 193].

While the definitions of Type-1 and Type-2 clones are quite precise, the definition of Type-3 clones still remains vague [125]. The definition does not precisely indicate how much difference, in terms of addition, modification, or deletion of statements, is allowed for code segments to be regarded as Type-3 clones. Practitioners commonly consider code segments as Type-3 clones when the difference in the statements remains below a (dis)similarity threshold [11, 140, 164, 205]. However, a consensus on an appropriate value for such a threshold is yet to be established [125].
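The distinction between an arbitrarily renamed Type-2 clone and a parameterized (p-match) clone can be sketched as a check for a bijective identifier mapping. The Python fragment below is a minimal illustration under assumptions of our own: the regular-expression tokenizer, the tiny keyword set, and the function names are hypothetical, and literals are compared verbatim for simplicity.

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "for", "while", "return", "int"}  # illustrative only


def split(source):
    """Return (shape, identifiers): the token stream with identifiers
    abstracted to 'ID', and the identifiers in order of occurrence."""
    shape, identifiers = [], []
    for tok in TOKEN.findall(source):
        if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            identifiers.append(tok)
            shape.append("ID")
        else:
            shape.append(tok)
    return shape, identifiers


def same_shape(a, b):
    """True if the fragments differ at most in identifier names."""
    return split(a)[0] == split(b)[0]


def is_parameterized_match(a, b):
    """True if the renaming between a and b is a one-to-one mapping."""
    shape_a, ids_a = split(a)
    shape_b, ids_b = split(b)
    if shape_a != shape_b:
        return False
    forward, backward = {}, {}
    for x, y in zip(ids_a, ids_b):
        # setdefault records the first pairing and returns it afterwards,
        # so any later disagreement breaks the bijection.
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True


# Hypothetical sample fragments.
A = "sum = sum + x; count = count + 1;"
CONSISTENT = "total = total + y; n = n + 1;"
INCONSISTENT = "total = acc + y; n = n + 1;"
MERGED = "t = t + t; n = n + 1;"
```

Here `CONSISTENT` renames every identifier of `A` systematically and is a parameterized match. `INCONSISTENT` maps `sum` to two different names, and `MERGED` maps two different names to `t`; both keep the same shape as `A`, so under these assumptions they are renamed clones but not parameterized ones.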
3.1 Clones beyond Source Code

Ongoing research also attempts to deal with clones in software artifacts other than the source code [99], such as clones in higher level code structure [21, 22] or high level concepts [146], clones in the models of formal model based development [46, 47], in UML domain models [180], UML sequence diagrams [142], in the graph based Matlab/Simulink models [160], and duplication in requirement specification documents [99, 100]. However, this paper focuses on the management of clones in the source code only.

3.2 Clone Relationship

A clone relationship exists between two code segments that are clones of each other according to the specification of similarity prescribed by the definitions stated above. Such a clone relationship is symmetric (i.e., if code segment A is a clone of B, then B is also a clone of A). Moreover, for Type-1 and Type-2 clones, the relationship is also transitive (i.e., if code segment A is a clone of B, and B is a clone of C, then A is also a clone of C). However, such a transitive property may not hold for Type-3 clones [27]. The aforementioned definitions also imply that a subset relationship exists among the Type-1, Type-2, and Type-3 clones. Mathematically, Type-$i$ $\subset$ Type-$j$, for $i \in \{1, 2\}$ and $j = i + 1$ [119].

Two code segments that are clones of each other (i.e., having a clone relationship between them) are called a clone pair. A clone-class or clone-group is a set of code segments such that any two of them are clone pairs. The term clone is used in the software community in the following two ways.

clone as a noun refers to a code fragment that is, according to the aforementioned definitions, similar enough to one or more other code segments. For instance, "our objective is to find all clones in the code-base".

clone as a verb indicates the act of producing a code segment (e.g., by copy-pasting) that is, according to the aforementioned definitions, similar enough to one or more other code segments. For example, "I cloned the function to reuse it in my context".

3.3 Clone Granularity

The definitions of all four types of clones are based on the notion of a code segment, but how much contiguous code can be considered a code fragment is not made specific. Thus, contiguous portions of code at different levels of granularity have been used in the literature. As far as source code is concerned, the most commonly used granularities are at the level of the entire source file, class definition, method body, code block, and statements, which yield the notion of code clones of the following five types:

File clone: When two files are found to contain similar enough source code, they are called file clones.

Class clone: Two classes of an object-oriented code can be considered class clones if they have identical or near identical code.

Function clone: Two functions are considered clones when the bodies of the functions consist of code that is similar enough.

Block clone: When two blocks of code (marked with opening and closing braces or indentation, or the like) are similar enough, they are called block clones.

Arbitrary statements clone: When two groups of statements at arbitrary regions of the source file are found to be similar enough, they are also regarded as clones (CCFinder detects such clones).
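As an illustration of the function granularity and of the clone-class notion from Section 3.2, the following sketch groups the functions of a Python source text by a fingerprint of their parameters and bodies. It is a simplification under assumptions of our own: only exact (Type-1) function clones are found, the fingerprint is the standard `ast.dump` output, and the function name and sample code are hypothetical. Since fingerprint equality is an equivalence relation, any two members of a resulting group form a clone pair, as the clone-class definition requires.

```python
import ast
from collections import defaultdict


def function_clone_classes(source):
    """Group function names whose parameters and bodies are identical
    once comments and layout are discarded by the parser."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # The function name is deliberately left out of the key.
            body = "\n".join(ast.dump(stmt) for stmt in node.body)
            groups[(ast.dump(node.args), body)].append(node.name)
    # A clone class needs at least two members.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Hypothetical sample source.
SAMPLE = '''
def area(w, h):
    # rectangle
    result = w * h
    return result

def size(w, h):
    result = w*h   # same body, different layout
    return result

def perimeter(w, h):
    return 2 * (w + h)
'''
```

For the sample, `function_clone_classes(SAMPLE)` yields one clone class containing `area` and `size`; `perimeter` has no counterpart and is not reported.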

3.3.1 Which Level of Granularity Is Appropriate?

According to the definition of a clone, a pair of very small portions of code, such as two identical identifiers, two similar statements, or functions or blocks each having only one statement, can also be valid candidates for clones. However, those tiny code segments cannot be real clones of pragmatic significance. Thus, practitioners typically disregard code segments that are smaller than a given threshold. Again, due to the varying contexts and techniques for clone detection, there has been no consensus in the community on such a threshold, though a minimum of 20 to 30 tokens [118, 172] or three to five lines of code [167, 206, 203, 205] is a common threshold used in practice.

We believe that the chosen granularity of the code segments should exhibit some characteristics so that the clones detected at that granularity actually become useful from the maintenance perspective. In this regard, we propose the following desired characteristics, as inspired by our experience and the criteria proposed by Giesecke [62].

Coverage: The set of all code segments should cover maximal behavioural aspects of the software.

Significance: Each code segment should contain the implementation of a significant functionality.

Intelligibility: Each code segment should constitute a sufficient amount of code such that a developer can understand its purpose with little effort.

Reusability: The code segments should feature a high probability for informal reuse.

A source file typically contains a large bulk of code (a Java source file may contain multiple classes, interfaces, and so on). Though the set of all source files covers all behavioural aspects of the software, informal reuse at the file level is unlikely in a software system.
Classes in object-oriented systems also satisfy the coverage criterion, but they are also unlikely to be informally reused, as more elegant concepts such as inheritance and delegation are available for their formal reuse. Methods or blocks (a method body itself forms a block) cover almost all behavioural aspects, and those that are missed (e.g., library/package inclusion, declarations and initializations) can be neglected [62]. Methods and blocks contain implementations of significant functionality that can be understood with less effort than an entire source file or class. Methods isolate a functionality, and so often do blocks, and thus they are likely to be informally reused, as there is no easy rigorous way to reuse methods from another context [62].

Code segments at the arbitrary statement level satisfy the coverage criterion, but not the other three criteria. A single statement or a short sequence of statements typically does not cover a unit of significant functionality. The meaning of a statement in general is likely to be incomprehensible without considering its context, and the context of the host block or function. Moreover, sequences of import/include statements, declaration/initialization statements, or sequences of statements spanning code boundaries, such as the boundaries of two functions, classes, or blocks, do not exhibit significant potential for reuse; rather, those should be ignored in many cases.

Thus, we suggest that code segments at the level of functions or blocks can be the most suitable granularity for dealing with code clones, especially for maintenance. Giesecke [62] also argued in favour of the function as the most adequate level of clone granularity. Indeed, block level granularity also covers functions, though it does not distinguish a block that constitutes a function body from one that does not. Driven by the same understanding, Higo et al. developed CCShaper [79, 76] to post-process the clone detection result from CCFinder to extract block level clones as the potential candidates for refactoring.

3.4 Intentional and Accidental Clones

Code clones in a software system can appear in two ways. First, the programmer copies existing code, pastes it in another place, and thus reuses the implementation with or without further modifications. The resulting code may still remain similar to the original and form a clone-pair. Such clones created by the programmer's deliberate copy-paste-modification activities are known as intentional clones or copy-pasted clones. There are a number of reasons why programmers create such intentional clones. An obvious reason is to reuse an existing implementation without re-inventing the wheel. Second, similar code segments may also appear in the system without the programmer's intention, and often the developer can even be unaware of the creation of such clones.
For example, due to the developer s mental model, frequently used idioms are reproduced from memory instead of deliberate copy-paste [161]. Even di erent developers may also produce very similar code while solving a similar problem, or due to the dictation of certain APIs they use, or using the same design pattern [203]. Such similar code snippets that are not produced by deliberate copy-paste operations are known as unintentional clones or accidental clones. Practically, there are many reasons behind the creation of intentional and accidental code clones in a software system. Further details on the root causes of cloning can be found elsewhere [109, 110, 125, 163]. 3.5 Clone Detection Techniques Over more than a decade of code clone research a number of techniques have been devised for the detection of code clones and many clone detection tools have been 11

18 developed. Prominent techniques can be categorized as shown in the Table 2. The categorization is based on the type of information used in the analysis/comparison and the type of approaches used for the analysis. The first technique (Tracking Clipboard Operations) in the table is an action-trace based technique and the rest others are code similarity based techniques. In this section, we provide a brief summary of di erent clone detection techniques, and point out the strengths and weakness of those techniques in general. More detailed descriptions of those techniques can be found in the corresponding papers and elsewhere [163]. Tracking Clipboard Operations: This technique of clone detection is based on the assumption that programmers copy-paste activities are the primary reason for the creation of code clones. So, the technique simply tracks clipboard activities in the editor (inside IDEs such as Eclipse) when a programmer copies a code segment and reuses by pasting it. The copied and the pasted code segments are recorded as clone-pairs. Metrics Comparison: Metrics based techniques are usually used to detect function clones. The techniques are based on the assumption that similar code fragments should yield very similar values for di erent software metrics (e.g., cyclomatic complexity, fan-in, fan-out). Typically, for the code segments a set of metrics are gathered into vectors. The di erences in the vectors are calculated, where the close vectors (e.g., measured by Euclidean distance) indicate that their corresponding code fragments are clones. Texual Comparison: Text based techniques compare program text, typically line by line, with or without normalizing the text by renaming the identifiers, filtering out the comments and di erences in the layout. Token Based Comparison: The entire program is transformed into a stream of tokens (i.e., individual units/words of meaning) through lexical analysis. 
Then the token stream is scanned to find similar token subsequences, and the original code portions corresponding to those subsequences are reported as clones. Syntax Comparison: Syntax comparison based techniques are developed on the fact that similar code segments should also have similar syntactic structure. Thus, the program is parsed to produce syntax tree, where the similar subtrees indicate that their corresponding code segments are clones. PDG Based Comparison: For a given program, a set of PDGs (Program Dependency Graphs) [57] are produced based on the data and control dependencies among the statements of the program. The code segments corresponding to the isomorphic subgraphs are identified and reported as clones. Comparison of Low Level Form of Code: Instead of analyzing and comparing textual source code, the techniques analyze the lower level code (e.g., assembly code, Java bytecode) as obtained from the transformation done by the compiler. Other Techniques: Beside the aforementioned prominent techniques for clone 12

19 Table 2: Code Clone Detection Techniques (extended from [126]) Tracking Clipboard Operations copy-pasted code segments are recorded as clones [37, 45, 83, 197] Metrics Comparison comparing metrics for functions [123, 124, 132, 148] comparing metrics for web sites [48, 133] Textual Comparison hashing of strings per line, then textual comparison [95, 96] hashing of strings per line, then visual comparison using dotplots [52] latent semantic indexing for identifiers and comments [146] syntactic pretty-printing and normalization, then textual [164] comparison between lines syntactic pretty-printing and normalization, then comparison [205] of fingerprinted lines using su xtree syntactic pretty-printing and normalization, then hash based [190] comparison of functions/blocks Token-based (Lexical) Comparison su x trees for tokens per line [10, 11, 12] token normalizations, then su x tree/array based search [23, 105, 112] data mining for frequent token sequences [140] Syntax Comparison hashing of syntax trees and tree comparison [25, 198, 8] data mining for frequent syntax subtrees [194] serialization of syntax trees and su x-tree detection [55, 127, 138] derivation of syntax patterns and pattern matching [54] metrics for syntax trees and metric vector comparison [91, 154] PDG Based Comparison approximate search for similar subgraphs in PDGs [61, 78, 81, 120, 128] Comparison of Low Level Form of Code comparing Java bytecode [13, 175] comparison of compiled (assembler) code [43, 44, 170] 13

detection, other techniques, such as anti-unification [35, 138], formal methods [175], and combinations of distinct techniques [60, 137] have also been explored. Tracing of abstract memory states during the execution of the program was also attempted to detect semantic clones [116].

Strengths and Weaknesses of Clone Detection Techniques

Due to the orthogonal nature of the different approaches for clone detection, it is inappropriate to rate one technique over another. Each of the techniques has its strengths and weaknesses. In this section, we briefly discuss the strengths and weaknesses of different clone detection techniques. More comprehensive qualitative evaluations of the existing clone detection techniques and tools can be found elsewhere [165, 168].

Clone detection using clipboard operations (copy-paste) has the benefit that clones are captured at the time of their creation. However, the technique raises a number of questions and limitations regarding the pragmatic design decisions to be made for realizing the technique in a tool:
- How much (granularity) of copy-pasted code should be recorded as clones? Should copy-pasted code that spans beyond a syntactic block be considered a clone?
- Copy-pasted code may later be modified and become very different from the original. Should the very dissimilar code segments still be treated as clones (as done in CloneScape [37]), or does some similarity based decision have to be made?
- Since the tracking of copy-paste activity is coupled with a certain editor, the technique is vulnerable to changes in the source code outside the editor.
- The technique cannot deal with unintentional/accidental clones, or clones in legacy systems.

The effectiveness of metrics based techniques depends on the selection of a broad set of orthogonal significant metrics. Typically, metrics based techniques may be computationally expensive because of the overhead of calculating the chosen metrics.
Moreover, because different code segments may, at times, have the same value for certain metrics, the metric based techniques may report many false positives.

The advantage of simple textual comparison is that the technique, in general, is independent of programming languages and the program does not need to be complete or syntactically correct. However, the technique (without advanced normalization and filtering of differences in layouts) may be susceptible to even minor changes in the source code [126].

Due to the use of lexical analysis only, the token based techniques can also operate on incomplete or syntactically incorrect programs. But, the token based

techniques may detect code clones that span beyond the boundaries of syntactic blocks.

The techniques based on syntax comparison suffer from the overhead of invoking a parser to generate a parse tree. Due to the use of a parser, the technique is coupled with programming language syntax. The parser based syntax comparison techniques are very accurate in detecting clones that have similar syntactic structure. Still, such techniques are vulnerable to even subtle differences in the nesting of code, and so may fail to detect many potential clones. Earlier empirical evaluations [28, 27, 9] also suggest low recall of syntax comparison based clone detection techniques.

PDG based techniques analyze the source code at a level higher than that of syntax comparison based techniques, and thus can capture the program semantics to some extent. PDG based techniques can detect clones consisting of non-contiguous code. However, producing PDGs is computationally expensive. In addition, finding isomorphic subgraphs from PDGs is an NP-hard problem [126]. Therefore, approximation algorithms are used, which do not aim for the optimum. The PDG based approach is tolerant to reordering of certain program elements, which is both a strength and a weakness of the technique. The ordering of statements specifies the execution path of a program written in a traditional imperative language. Because the PDG based techniques often disregard such order and report clones composed of non-contiguous code segments, there remains a fairly good chance of false positives in detecting clones of practical significance.

Clone detection in the compiled code (e.g., assembly code, binary executable, Java bytecode) requires the source code to be complete and syntactically correct. Moreover, compilation generates the lower level code that often normalizes syntactic variants of source code into a compact canonical representation free from the comments and layout differences present in the source code.
This, in principle, should make it easier to capture semantically equivalent constructs that might look different in the source code written in a high level language [43]. On the other hand, a small difference in the source code may produce very different sequences of bytes in the compiled code, and thus finding code clones based on bytecode similarities may be even more difficult than finding clones in source code [13]. Moreover, clone detection based on the analysis of compiled code is typically performed at class level, because distinguishing the portions corresponding to functions, blocks, or other levels of granularity and mapping those back to locations in the higher level source code can be very challenging.

The purpose of clone detection techniques is not limited to clone management for improved software maintenance. The varying strengths of different clone detection techniques lead to their application in solving diverse research problems such as plagiarism detection [31, 74], program fault localization [26, 36, 89, 92, 103], malware detection [34, 196], finding crosscutting concern code [33], aspect mining [113, 32],

quality assessment of requirements specification [101], and more. Indeed, the choice of an appropriate clone detection technique largely depends on the context and the purpose [126].

Challenges in the Empirical Evaluation of Clone Detection Tools

Typically, the effectiveness of clone detection is measured in terms of precision (p) and recall (r),

$$p = \frac{|C_s \cap C_d|}{|C_d|}, \qquad r = \frac{|C_s \cap C_d|}{|C_s|}$$

where $C_d$ denotes the set of clones that a particular clone detection tool identifies, and $C_s$ refers to the set of clones actually existing in the code base. But how can one accurately find all the clones in a system? A reference corpus is necessary to serve as the set of all clones in a system. To obtain this, manual investigation can be a solution for small systems, but the time and effort needed for such a subjective analysis is likely to be impractical for a fairly large system. However, the reference corpus needs to be large to be able to evaluate the scalability of the clone detector under consideration. Therefore, in practice [190, 205], one or more clone detectors that are already known to be effective are applied to detect clones from a system, and the union of all the sets of clones reported by those tools is used as a reference corpus to examine the precision and recall of the clone detection tool to be evaluated.

Using a similar idea, Bellon et al. [28] carried out a case study to compare the effectiveness of six clone detectors (CCFinder, CloneDR, Dup, Duploc, Duplix and CLAN) representing different clone detection techniques. Bellon's study [28], which was performed about six years ago, is still the most comprehensive quantitative study on the comparative evaluation of clone detectors, and the reference benchmark produced from the study is the only notable benchmark data available to date [126]. A similar but smaller scale study was also conducted earlier by Bailey and Burd [9].
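The precision and recall definitions above, together with the practice of using the union of several tools' reports as the reference corpus, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited study: the fragment identifiers, the `pair` helper, and the exact-match comparison of clone pairs are all hypothetical simplifications (benchmark studies typically match fragments by overlap rather than exact equality).

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Assumption: a clone pair is an unordered pair of fragment identifiers,
# written here as hypothetical "file:start-end" strings.
ClonePair = FrozenSet[str]


def pair(a: str, b: str) -> ClonePair:
    """Build an unordered clone pair so (a, b) equals (b, a)."""
    return frozenset((a, b))


def reference_corpus(reports: Iterable[Set[ClonePair]]) -> Set[ClonePair]:
    """Union of the clone sets reported by already-trusted detectors (C_s)."""
    corpus: Set[ClonePair] = set()
    for report in reports:
        corpus |= report
    return corpus


def precision_recall(detected: Set[ClonePair],
                     reference: Set[ClonePair]) -> Tuple[float, float]:
    """p = |C_s & C_d| / |C_d| and r = |C_s & C_d| / |C_s|."""
    hits = len(detected & reference)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Reports of two trusted tools (made-up data for illustration only).
tool_a = {pair("a.c:1-10", "b.c:5-14"), pair("a.c:20-30", "c.c:1-11")}
tool_b = {pair("a.c:1-10", "b.c:5-14"), pair("d.c:3-9", "e.c:3-9")}
reference = reference_corpus([tool_a, tool_b])

# Report of the tool under evaluation (C_d).
candidate = {pair("b.c:5-14", "a.c:1-10"), pair("x.c:1-5", "y.c:1-5")}

p, r = precision_recall(candidate, reference)
print(f"precision={p:.2f} recall={r:.2f}")
```

Note that any clone pair missed by every trusted tool is absent from the reference, so a correct detection of such a pair by the candidate counts against its precision in this scheme.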
Bellon's benchmark data was constructed as the union of the sets of clones reported by the clone detectors subject to the study. Then the precision and recall for each tool was computed based on how many of the benchmark clones a particular tool detected. Thus, the result was skewed in favour of those tools that contributed more to the construction of the benchmark data. Moreover, only 2% of the clones in the reference set were manually verified. Hence, there is a possibility that the benchmark data included a significant number of false positives reported by the participating tools.

The Need for Reference Benchmark: The above discussion suggests that a reliable large set of benchmark data (similar to what is used in comparative studies of algorithms for numerical linear algebra) is still necessary, and new tools should be evaluated with the standard benchmark data before publication [125]. Both precision and recall should be reported, as precision or recall alone draws only a partial picture. Moreover, both the runtime complexity and memory efficiency should be reported, since these two criteria affect the scalability of a technique. Theoretical runtime complexity alone can be misleading, because a high memory requirement may cause time-consuming page swapping at the operating system level and increase the execution time in practice [169].

The creation of a reliable large benchmark data set with manually verified clones can be a mammoth task. The tedious work of manual investigation and verification can be accomplished by collaborative efforts from the community. The mutation framework [166] proposed by Roy and Cordy can also be used for the purpose by injecting a set of artificially created (and manually verified) clone fragments in various locations of a certain code base, and the set of synthetic clones can serve as a reference benchmark. Indeed, the vagueness in the definitions of clones may lead to disagreement in different human evaluators' perception of clones, as found in the study of Walenstein et al. [195]. Therefore, the purpose of the benchmark data should be defined first, and the definition of similarity should be formalized accordingly, as well as the level of clone granularity. A purpose could be, for example, the evaluation of clone detection techniques for clone maintenance and removal.

3.6 Clone Evolution

Software development and maintenance in practice follow a dynamic process. With the growth of the program source, code clones also evolve from version to version. Many studies have been conducted to date for understanding the overall evolution [4, 3, 143, 65, 206], stability of cloned code [14, 66, 82, 130, 131, 151], the relation of clone evolution with software faults [19, 69, 179], and other characteristics of clone evolution.
While the high level studies inform the characteristics and impact of code cloning, lower level analyses that investigate the change patterns in the evolution of individual clone fragments can suggest techniques for optimizing clone management, including refactoring and removal. A recent survey on clone evolution [159] is already available. Hence, we keep this section brief, with specific focus on the evolution of individual clone fragments and their change patterns.

Clone Genealogy

Kim et al. [117, 118] first coined the term clone genealogy, which refers to a set of one or more lineage(s) originating from the same clone-group. A clone lineage is a sequence of clone-groups evolving over a series of versions (revisions or releases) of the software system. Thus, a clone genealogy (Figure 4) describes the evolution history of a particular clone-group over subsequent versions of the system. The extraction


of clone genealogies from a series of versions of a program has been identified as the fundamental task to study the evolution of individual clone-groups.

Figure 4: A clone genealogy with two lineages over versions v_k through v_{k+3} [201]

Genealogy Extraction: To map clones across subsequent versions of a program (i.e., to extract clone genealogies), mainly four different approaches are found in the literature, as summarized in Table 3. While most of these approaches [19, 14, 118, 172, 7] focused on genealogies of Type-1 and Type-2 clones, gCad [171, 173] is the only Type-3 clone genealogy extractor released to date as a separate tool.

Clone Change Patterns

Recent genealogy based studies [118, 172, 173] on clone change patterns characterized the evolution of a particular clone-group based on the following four categories of transitions between subsequent versions of the software system.
- Addition/Grow: One or more clone fragments are added to the clone-group in the next version.
- Deletion/Shrink: One or more clones disappear from the clone-group in the next version.
- Consistent Change: All clones in the clone-group experience the same set of changes during transition to the next version.

Table 3: Summary of techniques for clone genealogy extraction

1. Approach: Separate clone detection from each version, and then similarity based heuristic mapping of clones in pairs of subsequent versions
   Strength: flexible
   Weakness: quadratic runtime [70], susceptible to large change in clone
   Citation: [14, 118, 172]

2. Approach: Clones detected from the first version are mapped to consecutive versions based on change logs obtained from source code repositories
   Strength: faster than the above technique
   Weakness: can miss the clones introduced after the first version [171]
   Citation: [7, 19, 129, 187]

3. Approach: Clones are mapped during the incremental clone detection that used source code changes between revisions
   Strength: faster than the above two techniques
   Weakness: cannot operate on the clone detection results obtained from traditional non-incremental tools [171]
   Citation: [68, 155]

4. Approach: Separate clone detection from each version, functions are mapped across subsequent versions, then clones are mapped based on the mapped functions
   Strength: improved runtime
   Weakness: susceptible to similar overloaded/overridden functions
   Citation: [145, 171, 173]

*A combination of the first and second approaches was also used in some studies [29]

- Inconsistent Change: During transition to the next version, at least one clone in the clone-group experiences changes that are inconsistent with changes in other member(s) of the clone-group.
- Same/No Change: The clone-group moves to the next version without experiencing any change in the members.

On the basis of the aforementioned categorization, clone genealogies are classified as follows:
- Static Genealogy: A genealogy where the clones in the clone-group do not experience any change over the series of subsequent versions.
- Inconsistently Changed Genealogies: A genealogy where one or more clones in the clone-group experience inconsistent changes during transition between any two subsequent versions.
- Consistently Changed Genealogies: A genealogy where none of the clones in the clone-group experience inconsistent changes during transition to subsequent versions.
- Dead Genealogy: A genealogy that does not survive till the last version (i.e., the clone-group disappears before the last version).


Figure 5: Clone change patterns and types of genealogies [171]

- Alive Genealogy: A genealogy that survives at the last version of the software system.

Aversano et al. [7] further characterized the inconsistent change pattern into two categories:
- Independent Evolution: In independent evolution, the clones of a clone-group, once changed inconsistently, evolve independently across versions. Thus, independent evolution may cause branching with multiple lineages in a genealogy.
- Late Propagation: In late propagation, one or more clones in a clone-group may experience inconsistent changes and may disappear from the clone-group in the next version, but at a later version those clones re-appear in the clone-group.

Studies [14, 29] on clone change patterns revealed that inconsistent changes in clones sometimes caused program faults. Moreover, late propagation is reported to have an even more significant correlation with software defects, and is thus concluded to be more risky than other clone genealogies [19].

Using the clone genealogy model, Saha [171] recently studied the change patterns in the evolution of clones in five open-source software systems. Major findings from the study report some common properties of frequently changed clone-groups, such as cohesive clone-groups having a small number of clones experienced more changes than
