Performance Evaluation and Comparative Analysis of Code-Clone-Detection Techniques and Tools


Harpreet Kaur 1* (Assistant Professor) and Raman Maini 2 (Professor)
Computer Engineering Department, Punjabi University, Patiala

Abstract

Because code cloning is a recent area of research in software engineering, a good understanding of the available code-clone-detection techniques is crucial. Clones increase maintenance cost and lead to poor software quality. This paper combines two parts: a literature review of code-clone-detection techniques, and experimental work that evaluates techniques chosen from that literature. It first lists the various studies and then evaluates the performance of three chosen techniques (text-based, token-based and tree-based) by means of automated tools. The Netbeans-Javadoc, JBoss and Java-Quizz source codes have been examined to validate the results. The analysis shows that the token-based approach reports more false positives than the other techniques. The text-based and token-based approaches have higher precision than the tree-based approach, but the tree-based approach has higher recall. Token-based, tree-based and metric-based approaches are useful in combination with refactoring tools. In terms of speed, the text-based approach is suitable for small projects, whereas the token-based technique also scales to large projects. Tree-based and token-based techniques work effectively for detecting near-miss clones and give more reliable results. The DuDe, CCFinder, SolidSDD and CloneDR tools have been used for validation. The experimental work shows that DuDe is suitable for small projects, whereas CCFinder scales from small to large projects. CCFinder reports false positives because of its token-based approach, while CloneDR produces fewer false positives than CCFinder.
The aim of the paper is to find the strengths and weaknesses of these techniques, which will help in selecting a clone detection technique for a particular purpose.

Keywords: Code Clone, Software Maintenance, Code fragment, Clone-class

ISSN: IJSEIA Copyright © 2017 SERSC

1. Introduction

Code cloning is a well-known problem in software engineering and leads to poor software quality. The reasons for copying code fragments are: 1) making a copy of code is simpler and faster than writing it from scratch; 2) producing more source code leads to better incentives for programmers in industry [6]. Techniques and tools for detecting duplicate code are a main concern in software maintenance research. Some definitions related to code cloning are given below.

Code Fragment: A code fragment (CF) is any sequence of code lines (with or without comments). It can be of any granularity, e.g., a function definition, a begin-end block, or a sequence of statements [9].

Code Clone: A code clone is a code portion in source files that is similar or identical to another code portion [7]. A code portion CP1 is a clone of another code portion CP2 if they are similar to each other by some relation, that is, if and only if f(CP1) = f(CP2), where f is any similarity function [9].

Clone Pair: A pair of code portions or fragments is called a clone pair if a clone relation exists between them.

Types of Clones

Type-1: Identical code fragments except for variations in whitespace, layout and comments. The fragments are exact copies of each other, which is why Type-1 clones are widely known as exact clones.

Type-2: Syntactically identical fragments except for variations in identifiers, literals, types, whitespace, layout and comments. Two fragments are Type-2 clones when they are similar except for the names of declared identifiers (variables, constants, classes, methods and so on), types, layout and comments.

Type-3: Copied fragments with further modifications, such as added, removed or modified statements, in addition to variations in identifiers, literals, types, whitespace, layout and comments.

Type-4: Two or more code fragments that perform the same computation but are implemented through different syntactic variants. Such fragments are semantically similar, and they need not have been copied from anywhere: two fragments forming a Type-4 clone may have been developed by different programmers to implement the same functionality.

2. Literature Surveys of Clone Detection Techniques

Clone detection attempts to find duplicate code within a whole software system, whether it was copied exactly or modified afterwards. Several techniques are available to detect duplicate code.

A) Token-Based Clone detection technique: Kamiya et al. [8] described the token-based detection process (Figure 1), which consists of four steps:
1. Lexical Analysis: Each line of the source files is divided into tokens according to the lexical rules of the respective programming language.
The tokens generated from all source files are concatenated into one single token sequence, which is then easy to analyze. White space, comments and tabs are removed from the source code in preprocessing.
2. Transformation: Identifiers are replaced with customized tokens by applying transformation rules. The replaced information is kept as a backup so that results can later be mapped back to the original text.
3. Match Detection: The transformed token sequences are compared efficiently using a similarity detection (token suffix-tree) algorithm, and similar sequences are reported as clone pairs.
4. Formatting: Each location of a clone pair is converted into line numbers in the original source files.

The clone detection tool Dup [22] represents source code as a sequence of lines and detects clones line by line. It performs two steps: 1) identifiers in the source code are replaced by a special identifier; 2) matches are extracted by a suffix-tree algorithm [10] with O(n) time complexity (n is the number of lines in the input). The line-by-line method is weak against modifications of the line structure. Token suffix trees scale very well in time and space because of their linear complexity [23]. Studies ([9], [11]) have shown that token-based clone detection approaches suffer from many false positives; the technique has high recall but low precision.

Ueda et al. [21] developed Gemini, a maintenance support environment that visualizes clones from the output of CCFinder. Gemini provides a GUI with a scatter plot and a metrics graph of code clones. The scatter plot shows graphically where code clones occur among the source files, and the metrics graph shows metric values for every clone. Using Gemini, one can identify the code clones that deserve attention in the maintenance stage.

B) Text Based Clone detection technique: In this approach, the entire source code is treated as a sequence of strings. One line is compared with another by applying string matching algorithms, and similar strings are reported as clones. Because the method is purely textual, raw source code is used for detection and no transformation is applied to it, although in some cases white space and comments may need to be removed first. Ducasse et al. [6] proposed an approach in which the source code is first transformed into an internal format by removing comments and white space, and comparison algorithms are then run on that internal data. The result of the transformation is called the effective file, and the comparison is performed on it.
In this approach, one line of source code is taken as a code fragment. As an example, the C line
if( code & pcobjtype ){ /* print type */
is condensed to
if(code&pcobjtype){
by removing the comment and the spaces. In this algorithm, every source line is compared with every other source line using string matching. If two strings match exactly, a Boolean true value is returned; otherwise a Boolean false value is returned. This value is stored in a matrix whose coordinates are the positions of the two compared lines in their respective ordered collections. The result is presented in the form of a dot plot.

S. Lee et al. [25] developed the SDD (Similar Data Detection) algorithm, which finds exact clones and identical parts of software. SDD keeps its complexity under control by using an inverted index and an index. The authors reported that SDD shows better results than PMD. SDD also detects modified clones by using an n-neighbor distance concept. Moreover, SDD is language independent.
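The line-by-line comparison described for the text-based approach can be illustrated with a minimal sketch. This is not the implementation of Duploc or any other tool named here: the function names (`normalize`, `match_matrix`, `diagonal_runs`) and the `min_len` threshold are hypothetical, and the comment stripping is a deliberately naive assumption that ignores multi-line comments and comment markers inside string literals.

```python
import re


def normalize(line: str) -> str:
    """Condense a line: drop /* ... */ and // comments, then all whitespace.

    Naive assumption: comments start and end on the same line and no
    string literal contains comment markers.
    """
    line = re.sub(r"/\*.*?\*/", "", line)
    line = re.sub(r"//.*", "", line)
    return re.sub(r"\s+", "", line)


def match_matrix(lines_a, lines_b):
    """Boolean matrix: cell [i][j] is True when line i of A equals line j of B
    after normalization. Empty lines never match."""
    norm_a = [normalize(l) for l in lines_a]
    norm_b = [normalize(l) for l in lines_b]
    return [[bool(a) and a == b for b in norm_b] for a in norm_a]


def diagonal_runs(matrix, min_len=2):
    """Return (start_a, start_b, length) for each diagonal run of matches.

    A diagonal in the dot plot corresponds to consecutive matching lines,
    i.e. a candidate exact clone.
    """
    runs = []
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    for i in range(rows):
        for j in range(cols):
            if not matrix[i][j]:
                continue
            if i > 0 and j > 0 and matrix[i - 1][j - 1]:
                continue  # inside a run that was already reported
            length = 0
            while (i + length < rows and j + length < cols
                   and matrix[i + length][j + length]):
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


file_a = ["int x = 0;", "x = x + 1;  // bump", "return x;"]
file_b = ["int y;", "int x=0;", "x = x+1;", "return x;"]
print(diagonal_runs(match_matrix(file_a, file_b)))
```

Comparing every line with every other line is quadratic in the number of lines, which is consistent with the observation that purely textual comparison suits smaller inputs.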

J.R. Cordy et al. [26] discussed a lightweight text-based approach to detect near-miss clones. They applied pretty-printing and code normalization to find code clones: code lines are broken into parts, and clones are extracted by comparing the broken-up text after normalization. The UPI (Unique Percentage of Items) is calculated, and on that basis gaps of unique lines are detected. The whole technique is implemented in the tool NICAD, which is parser-based and language-specific but reasonably lightweight because it uses simple line matching. The case studies covered are Abyss [2], of 1500 lines, and Weltab [3]. These two were taken as test beds because results had already been published for them.

C) Metric Based Clone detection Technique: In this technique, different software metrics of the code are gathered, and clones are detected on the basis of similar metric values. First, a set of software metrics is calculated for syntactic units such as functions, classes, or even statements; the values of these metrics are then compared. If two syntactic units exhibit the same metric values, they can be regarded as a clone pair. Mayrand et al. [4] used various metrics to detect clones. Functions with similar metric values are returned as clone pairs. The metrics are calculated from the names, layouts, expressions and control flow of functions.

D) Abstract-syntax-trees (AST) Based: In this approach, the abstract syntax tree of a program is produced using a parser for the language. A tree matching technique is then applied to the generated AST to detect similar subtrees. When a match is found between two subtrees, the source code of the similar subtrees is returned as a clone pair. Baxter et al. [5] use a hash function to partition the subtrees of the abstract syntax tree of a program. Subtrees in the same partition are then compared using a tree matching technique.
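The idea of partitioning subtrees by a hash and comparing only within a partition can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of Baxter et al. or of CloneDR: it uses Python's standard `ast` module on Python source, the names `shape`, `size` and `candidate_clones` are hypothetical, the signature keeps only node types (so renamed identifiers and changed literals still match), and it reports only exact structural matches rather than applying a similarity threshold.

```python
import ast
from collections import defaultdict


def shape(node):
    """Structural signature of a subtree: node type names only, so the
    actual identifier names and literal values are ignored."""
    children = ",".join(shape(c) for c in ast.iter_child_nodes(node))
    return f"{type(node).__name__}({children})"


def size(node):
    """Number of AST nodes in the subtree rooted at node."""
    return sum(1 for _ in ast.walk(node))


def candidate_clones(source, min_nodes=8):
    """Group statement subtrees with the same signature.

    The dictionary plays the role of the hash partition: only subtrees that
    land in the same bucket are considered clones of each other. Returns a
    list of line-number groups, one per bucket with two or more members.
    """
    tree = ast.parse(source)
    buckets = defaultdict(list)
    for node in ast.walk(tree):
        # min_nodes is an assumed threshold that filters trivial subtrees
        if isinstance(node, ast.stmt) and size(node) >= min_nodes:
            buckets[shape(node)].append(node.lineno)
    return [sorted(lines) for lines in buckets.values() if len(lines) > 1]


SAMPLE = """
def area(w, h):
    total = w * h
    return total + 1

def volume(a, b):
    result = a * b
    return result + 1
"""

print(sorted(candidate_clones(SAMPLE)))
```

In this sample the two functions and their assignment statements share a signature despite using different identifiers, which corresponds to a Type-2 match at the syntax-tree level.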
A comparable strategy was also proposed by Yang [2], who used dynamic programming to find differences between versions of the same file. The approach presented by Jiang et al. [24] has been implemented in the tool Deckard, which is platform independent. Characteristic vectors of the AST are calculated in a Euclidean space, and those vectors are then merged to compute the similarity among subtrees. LSH (Locality Sensitive Hashing) is used to cluster similar vectors: it hashes two similar vectors to the same hash value with arbitrarily high probability and two distant vectors with arbitrarily low probability, and clones are found in this way. The case studies used to evaluate Deckard are the JDK and the Linux kernel, as shown in Table 1.

Table 1. Case Studies under Deckard Tool
Columns: Tool; Case study, number of files and LOC covered; Number of files not parsed; Parameters for comparison
Deckard; JDK (8532 Java files, ,418,767 LOC) and Linux kernel (7,988 C files, 5,287,090 LOC); only 2 files not parsed for JDK; characterization vector
CloneDR; fails to work on the whole JDK at once, so 9 groups of 1000 files each were made; 81 files not parsed for JDK; similarity metric
CP-Miner; evaluated for the Linux kernel; -; distance gap is used to find clones

E) Program dependency graph (PDG) Based: A program dependency graph shows the control flow and data flow dependencies of a program. Isomorphic subgraphs in a program dependency graph are reported as clone pairs. The PDG method can detect Type-3 and Type-4 clones because semantic information is carried in the PDG. Liu [10] implemented a plagiarism detection algorithm in the tool Gplag, which is based on the PDG approach.

Related Comparison Studies: Rysselberghe et al. [13] compared three techniques: simple line-based matching, parameterized matching, and metric fingerprints. The research process used during the experiment is based on the Goal-Question-Metric paradigm, with questions such as: What sorts of matches are found? How accurate are the results? How useful is the information gained? They concluded that simple line matching is purely language independent, whereas all the other techniques need some kind of configuration. Function block duplication is found by the metric fingerprint technique, and general duplication is found by the other techniques. No false matches are found by simple line matching, a few are found by the parameterized technique, and even more are found by metric fingerprints, because the characterization of expressions lacks accuracy. The number of false matches for metric fingerprints therefore depends on how expressions are characterized and on the length of the code fragments being processed, while the number of recognizable matches is high for this technique.

Zibran et al. [27] discussed a focused approach that searches for clones of a selected code segment, known as the seed segment, instead of detecting all clones in the entire code base. Available techniques are limited in finding Type-3 clones and are mainly implemented as stand-alone tools, which leaves the area of clone-aware development uncovered. The seed fragment is compared with the search space (the whole source code) to find Type-3 clones in it.
A fingerprinting technique is used to generate fingerprints for the unique lines, and a syntax tree is then generated for the whole fingerprint sequence. A suffix tree is generated for the generalized fingerprint sequence using Ukkonen's online algorithm. Eclipse's JDT APIs are used to generate ASTs (Abstract Syntax Trees).

Fabio Calefato et al. [28] described a semi-automated approach to find clones in the scripting code of web applications. The approach is useful for selecting function clones and inspecting the selected script functions, and it is both effective and efficient at identifying function clones in web applications.

Muhammad Asaduzzaman [29] observed that, although a number of clone detection tools are available, handling raw clone data remains a challenge because the data is textual and large in volume. To address this issue, the framework VisCad is proposed for performing large-scale code clone analysis. It also acts as a maintenance support environment. VisCad offers various visualization techniques, a number of metrics and data filtering options, so users can analyze and identify distinctive code clones.

3. Research Methodology

The methodology used to carry out the survey is as follows.
Step 1: Information is collected from primary research as well as from empirical observations. Research papers covering different techniques and search criteria have been studied.
Step 2: A set of tools is chosen, which were either developed or used to implement and test different code clone detection techniques.
Step 3: Different questions are framed to cover issues related to detection techniques.

Step 4: The case studies most often used in research papers are also listed; these help in analyzing clone detection techniques and may serve as benchmarks.

The information flow for gathering research data is shown in Figure 1.

Figure 1. Flow of Gathering Information

Based on the literature survey, the various tools and techniques for detecting clones are summarized in Table 2 and Table 3 below.

Table 2. Overview of Clone Detection Techniques
Text Based: dependent on layout; very few chances of false positives; the line-by-line method does not detect line breaks; sometimes finds clones that are not syntactic.
Token Based: no need of parsing; independent of layout; adaptable to new languages [24]; high recall; chances of many false positives.
Tree Based: parsing is performed, which makes this technique complex; high precision; rate of false positives is low compared to other techniques; finds syntactic clones.
Metric Based: coarse-grained abstractions for a piece of code; actually useful for evaluation on the basis of functions, not individual statements.
Helpful in refactoring: listed for two of the techniques.

Compatibility with Refactoring Techniques: Token-based approaches work on parameterized matching and are robust against rename operations. They therefore work best in combination with fine-grained refactoring tools that operate at the level of statements (e.g., Extract Method). Metric fingerprints (representative of the parse-tree-based techniques) are good at revealing duplicated subroutines irrespective of small differences, and therefore work best in combination with refactoring tools that operate at the method level (e.g., Remove Method and Pull Up Method) [13].

The study of code clone detection techniques in the literature shows that detecting duplicate code manually is impossible for large software. The following points are of concern:

1. Because various techniques are available for finding clones, the question arises of which technique should be followed for clone detection. This question motivates our experiment with different techniques for duplication detection in source code.
2. Each technique detects a different number of clones in the same software. Some code clones can be missed by one technique and detected by another. It is also possible that the detected clones are not of real concern.
3. Which technique is best suited to improving a system design with minimal effort?

Table 3. Comparison of Tools
Columns: Tool; Availability (free/paid); Clone relation; Clone types

Text Based (internal representation of source code: lines)
DuDe [30]; Free; CP; Type-1
Duploc [31]; -; CP; -
SDD [25]; Free; -; Type-1
NICAD [26]; Full Free; CP; Type-1, Type-2 and Type-3

Token Based (internal representation: tokens)
Dup [22]; Free; CP; Type-1 and Type-2
CCFinder [8]; Free; CP; Type-1 and Type-2
CP-Miner [32]; Free; -; -
SolidSDD [36]; Free/paid (both); CP; Type-1, Type-2 and Type-3
Clone Detective [33]; -; CP; -

Tree Based (internal representation: nodes in the AST, Abstract Syntax Tree)
CloneDR [5]; Free (evaluation version) / full version (paid); CP; Type-1, Type-2 and Type-3
cpdetector [11]; Trial version available; CP; Type-1 and Type-2
Deckard [24]; Trial version available; CP; Type-1, Type-2 and Type-3

Metric Based (internal representation: functions and methods)
Mayrand et al. [1]; -; function blocks or methods; Type-1, Type-2 and Type-3

Matching algorithms: suffix tree algorithm (Baker and CCFinder); DPM, Dynamic Pattern Matching (Duploc and Kontogiannis); hash value comparison (Baxter, Mayrand's, and SMC); characteristic vectors of the AST calculated in a Euclidean space (Deckard); use of an inverted index and an index (SDD).

Mostly Used Case Studies: ScoreMaster, TextEdit [20], Brahms [1], JMocha, JavaParser of JMetric [10], ANTLR (Version 2.7.1) [1], NetBeans-Javadoc, JBoss.

With all of the above issues in mind, three techniques (text-based, token-based and tree-based) are chosen for comparison by means of three case studies, Netbeans-Javadoc, Java-Quizz and Jboss SP1-src (all in Java), listed in Table 4, using the automated tools listed in Table 5.

Table 4. Case Studies
Columns: Case; Language; Total number of files processed
Netbeans Javadoc; Java; 21
Java Quizz; Java; 10
Jboss SP1-src; Java; 4951

Table 5. Tools and Background Processing Technique
Columns: Tool; Working technique; Reference
DuDe; Text-based approach; [17]
SolidSDD; Token-based approach; [19]
CCFinder; Token-based approach; [18]
CloneDR; Tree-based approach; [20]

DuDe is a language-independent code clone detector that works on the text-based approach. Clone detection is performed on duplication chains. The tool is written in Java and runs on every major platform. Although DuDe is text-based, it combines small duplicated fragments into larger ones by allowing gaps in its scatter-plot representation.

CCFinder converts source code into tokens and then transforms those tokens into special tokens using the lexical rules of the respective programming language. CCFinder detects clone portions that have different syntax but similar meaning. Another purpose is to filter out code portions with specified structure patterns. The token sequence helps to detect clones with different line structures, which cannot be detected by a line-by-line algorithm.

SolidSDD (The Duplicate Code Detector) is a tool for finding and analyzing duplicate code (i.e., code clones). It identifies clones introduced into source code during development, for instance by copy-paste operations. SolidSDD supports C, C++, C# and Java. It provides a graphical interface to assess code duplication characteristics and can also locate the position of clones in the software stack. The graphical interface is helpful to developers, architects and software managers for refactoring purposes.

In CloneDR, an annotated parse tree (AST) is generated. Subtrees are then compared by metrics based on a hash function, and the source code of similar subtrees is returned as clones. The hash function enables parameterized matching and the detection of gapped clones, particularly if the gaps are inside a line. The trial version analyzes the whole project but reports only 10 sample clones of medium size (max 50 lines).
4. Evaluation Criteria

Different techniques are suitable under different conditions. The qualitative and quantitative parameters outlined in the literature for the evaluation of techniques are:

Qualitative parameters: Suitable, Confidence, Relevance, Focus, No. of clones reported
Quantitative parameters: False positives, Kind of matches detected, Precision, Recall

An effort has been made to compare techniques and tools using the following qualitative criteria:

Criteria 1. No. of Clones: Which technique finds the largest number of clones in each file of the given project?

Criteria 2. Suitable: Which technique is suitable for detecting Type-1, Type-2, Type-3 and Type-4 clones? Which technique detects clones that are suitable for refactoring?
Criteria 3. Relevance: Does the technique prioritize the matches found for refactoring purposes? For example, a segment of code that is copied again and again is more relevant for refactoring, because its removal has a direct impact on the code. Likewise, code clones in the same class are easier to modify than clones in different classes [13].
Criteria 4. Confidence: Does the code-clone-detection tool give reliable results, or is a line-by-line manual inspection necessary [13]?
Criteria 5. Focus: Does one have to concentrate on a single class, or is it also possible to assess an entire project [13]?
Criteria 6. Scalable to Speed: How well does the speed of the technique adapt to the size of the project, from small projects to large ones?

4.1 Various Issues

Some of the issues or research questions concerning clone detection techniques are:

Q: What kinds of matches are found? This covers the overall report of the amount of duplication existing in all program files, and the programming constructs that one can restructure using a particular tool.

Q: How accurate are the results?
- Which technique results in more false positives, that is, incorrectly identified pieces of duplicated code?
- The number of useless matches, that is, matches which are not relevant for refactoring.
- The number of recognizable matches, that is, matches which are interesting for refactoring.

Precision: Precision is the percentage of accurate clones detected relative to the total clones detected by the technique.
Precision = (relevant clones in detection) / (total detected candidates)

Recall: Recall is the number of reference clone groups detected by each technique relative to all of the reference clone groups.
Recall = (number of relevant clones detected) / (total number of relevant clones in database)

Q: How much execution time is consumed by the detection technique?
Clone detection techniques take time to process the input source code and detect clones, for applications ranging from small to large.

Q: Are the code clones detected by each technique helpful for deriving information about design and maintenance issues? That is, do the detected clones provide information about structural clones, reusable components, bug removal opportunities and so on, which helps in understanding the design of the application in terms of components?
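The precision and recall definitions given earlier in this subsection can be made concrete with a short sketch. The function name `precision_recall` and the clone identifiers are hypothetical and the numbers are made up purely for illustration; they are not results from this study. The sketch also assumes that a detected clone counts as relevant only when it matches a reference clone exactly, whereas real evaluations often accept partial overlap.

```python
def precision_recall(detected, reference):
    """Compute (precision, recall) for a set of detected clone candidates
    against a reference set of known relevant clones.

    precision = relevant clones in detection / total detected candidates
    recall    = relevant clones detected / total relevant clones
    """
    detected, reference = set(detected), set(reference)
    relevant_detected = detected & reference
    precision = len(relevant_detected) / len(detected) if detected else 0.0
    recall = len(relevant_detected) / len(reference) if reference else 0.0
    return precision, recall


# Illustrative data only: four reported candidates, three reference clones.
reported = {"clone-1", "clone-2", "clone-3", "clone-4"}
known = {"clone-1", "clone-2", "clone-5"}
p, r = precision_recall(reported, known)
print(f"precision={p:.2f} recall={r:.2f}")
```

A detector that reports many candidates can raise recall while lowering precision, which is the trade-off described above for token-based tools.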

Q: Are the clones detected by the techniques removable? That is, can the clones detected by a technique be refactored or not? Which techniques work well in combination with refactoring tools?

Q: Which technique should be used to find different types of clones? This addresses the applicability of a technique to a type of clone, for example which technique is useful for finding Type-1, Type-2, Type-3 or Type-4 clones.

Q: What should be the minimum length of a clone? The minimum number of statements (threshold value) that should be considered a clone is a very important factor. If the threshold is very large, the clone detection technique will report fewer clones. If the threshold is very small, a large number of clones will be detected, many of which may be spurious. In previous studies the minimum clone length varies (in CCFinder the clone length is 30, but in Clone Miner it is a value less than 30).

4.2 Mostly Used Case Studies in Literature

- ScoreMaster is a Java application automatically generated for the Enhydra web server. Since a large portion of the code has been generated automatically, it contains a high level of duplication.
- TextEdit is a project that is distributed with Borland's JBuilder to demonstrate GUI programming in Java. Because of its educational nature it contains little duplication [20].
- Brahms is music sequencing and notation software for Linux, written in C++ and earlier known as KooBase. The small amount of duplication present is of a different nature because the code was written manually in an open source context [1].
- JMocha is a Java beans benchmark developed by IBM [11].
- JavaParser of JMetric is, as its name indicates, a Java parser generated by Java for the JMetric project. It is a larger example of automatically generated code full of duplication [10].
- ANTLR (Version 2.7.1) [1].
ANTLR (ANother Tool for Language Recognition) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing C++ or Java actions. ANTLR includes 189 files and its size is 42000 LOC [37].

5. Results and Discussions

The analysis has been performed by comparing clone detection techniques and by comparing automated tools based on those techniques.

5.1 Comparison of Techniques

In order to evaluate the performance of the code-clone-detection techniques, the open-source codes Netbeans-Javadoc, Java-Quizz and Jboss SP1-src were taken. This section reports the results of the techniques using the criteria discussed in Section 4. Table 6 gives the total number of clones detected by each technique, after which all the criteria discussed above are evaluated.

Table 6. Number of Clones Detected by Three Techniques (Netbeans Javadoc)
Columns: Approach; Tool used; Clones reported; Complexity
Token Based; CCFinder; 28 clone sets; O(LENGTH(longest clone) * Tokens)
Tree Based; CloneDR; 19 clone sets; O(Subtrees of AST)
Text Based; DuDe; 40 clone pairs; O(LOC)

Analysis on the basis of Chosen Criteria:

A) No. of Clones: The number of reported clones is also important in assessing clone detection techniques and tools. The text-based technique reports the maximum number of clones in the whole source code, but the token-based technique detects more code clones in each individual file, as shown in the graphs below (only one case study is shown here).

CASE A: Netbeans Javadoc

Figure 2. No. of Clones in Each File (Text-based) (Netbeans-Javadoc)
Figure 3. No. of Clones in Each File (Tree-based) (Netbeans-Javadoc)

Figure 4. No. of Clones in Each File (Token-based) (Netbeans-Javadoc)

Text-Based Techniques: Because textual approaches apply negligible or no transformation to the source code during pre-processing, text-based techniques and tools are not good at detecting Type-3 near-miss clones. In text-based techniques there are fewer chances of finding uninteresting clones, because an exact match of the text is required [14].

B) Suitable: All the techniques find exact clones, but the evaluation shows that the textual and token-based techniques find matching case structures. It is easier to remove these kinds of clones by combining different functions or methods under one single common name. The text-based technique also gives more details about the renaming of variables in the source code, which makes it easy to locate situations where duplicate code can be removed. Tree-based techniques are more oriented towards finding near-miss clones (Type-3), as shown in Table 7, than text-based and token-based techniques.

Text-based code clone detectors rely solely on the textual representation of the source code. Only minor transformations are performed, such as normalizing whitespace, comments and layout. This makes it difficult for the clone detector to detect Type-2, Type-3 and Type-4 clones [16].

The token-based detector tokenizes the whole source code and works on a token sequence as the representation of the code. During the creation of this token sequence, identifiers, whitespace and layout are normalized. Therefore, the clone detector should be able to detect Type-1 and Type-2 clones. However, Type-3 and Type-4 clones are difficult for the clone detector, since they interrupt a token chain [16].

The tree-based detector uses an abstract syntax tree as its work object. As in the token-based approach, whitespace, layout and identifiers are normalized. Type-1 and Type-2 clones should be detectable by this clone detector.
Also, loop transformations can be detected easily, since the AST representations of a for loop and a while loop are very similar [16].
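The difference between textual and token-based matching discussed in this subsection can be shown with a small sketch. It is a simplification under stated assumptions and not CCFinder's transformation rules: the keyword set, the placeholder tokens `$id` and `$lit`, and the function names are hypothetical, comments are not handled, and whole sequences are compared directly instead of being searched with a suffix tree.

```python
import re

# Assumed, deliberately tiny keyword set for the illustration.
KEYWORDS = {"int", "for", "while", "if", "else", "return"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def tokenize(code):
    """Split code into identifiers, integer literals and single symbols.
    Whitespace and line breaks disappear; comments are not handled."""
    return TOKEN_RE.findall(code)


def normalize_tokens(tokens):
    """Replace every identifier with $id and every integer literal with
    $lit, keeping keywords and punctuation as they are."""
    out = []
    for tok in tokens:
        if tok in KEYWORDS:
            out.append(tok)
        elif IDENT_RE.fullmatch(tok):
            out.append("$id")
        elif tok.isdigit():
            out.append("$lit")
        else:
            out.append(tok)
    return out


original = "int total = 0;\ntotal = total + price;"
renamed = "int  sum=0;  sum = sum + cost;"                          # Type-2 variant
extended = "int total = 0; log(total); total = total + price;"      # Type-3 variant

same_text = original == renamed
same_tokens = normalize_tokens(tokenize(original)) == normalize_tokens(tokenize(renamed))
gapped = normalize_tokens(tokenize(original)) == normalize_tokens(tokenize(extended))
print(same_text, same_tokens, gapped)
```

The renamed fragment differs as text but matches after token normalization, while the inserted statement in the third fragment interrupts the token chain, so the whole-sequence comparison fails for it.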

Evaluation: Clone Detection Report for Project File C:/Documents and Settings/Administrator/Desktop/parameters-final-try.prj, using the CloneDR tool

Table 7. Near-miss Clones Reported by Tree-based Technique (Clone Detection Statistics)
Columns: Statistic; Value
File Count; 21
Total Source Lines of Code (SLOC); 2852
Total CloneSets; 19
Exact-match CloneSets; 9
Near-miss CloneSets; 10
Number of cloned SLOC; 469
SLOC in clones %; 16.3%

C) Relevance: With respect to this criterion, all techniques behave more or less the same. All of them report clones in terms of clone pairs, clone sets, clone length, line numbers and file names. No information is given about the priority of any clone. In the token-based approach, however, unwanted clones can be filtered to some extent by using the filter-file and filter-clone-set options, as in CCFinder. Tree-based clone analysis (Baxter et al., 1998) attempts to be more precise than a textual or token-based approach by building the abstract syntax tree. Token generation still depends to some extent on the transformation rules applied to the original source code, whereas the generated tree gives an exact view of the code and its information flow. The tree-based technique therefore gives more accurate results than the other two techniques.

The token-based technique finds some uninteresting code clones, as CCFinder does. CCFinder is a language-dependent code-clone detector. The grey parts in Figure 5 represent clones between files A and B. The variable and method names in the code fragments are different. Because the CCFinder algorithm transforms user-defined names into the same special token, code fragments with different variable names (for example, where some variable names were changed after copy and paste) are detected as clones [14]. However, clone-length and file-name information alone is not sufficient for a real evaluation of relevant code clones.
All techniques would provide information like the class-name to which a particular clone belongs, so that the user can have a better view, which will help in refactoring. Copyright c 2017 SERSC 43
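The effect of mapping all user-defined names to one special token can be illustrated with a minimal sketch. It is not CCFinder's implementation: it assumes Python's standard tokenize module as a stand-in lexer, and the function name normalized_tokens and the three sample fragments are hypothetical. It shows that a renamed copy and a structurally similar but unrelated fragment both produce the same normalized token sequence, which is how uninteresting clones can be reported.

```python
import io
import keyword
import tokenize

# Token types that carry no program content and are skipped.
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}

def normalized_tokens(source: str) -> list[str]:
    """Tokenize `source` and replace every user-defined name with '$ID'."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("$ID")  # all identifiers collapse to one token
        else:
            tokens.append(tok.string)  # keywords, operators, literals kept
    return tokens

# Hypothetical fragments: a copy with renamed variables, and unrelated code
# that merely has the same statement structure.
original = "total = 0\nfor item in prices:\n    total = total + item\n"
renamed = "acc = 0\nfor x in values:\n    acc = acc + x\n"
unrelated = "flag = 0\nfor row in table:\n    size = width + depth\n"

print(normalized_tokens(original) == normalized_tokens(renamed))
print(normalized_tokens(original) == normalized_tokens(unrelated))
```

Because the normalization does not check whether names are renamed consistently, the third fragment matches as well; a stricter parameterized match would reject it.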


Figure 5. Example of Uninteresting Clones Detected by CCFinder [14]

D) Confidence

The results obtained show that the textual technique gives good confidence, because it detects exact matches rather than getting confused by language constructs, as the token-based technique does. This technique also gives details about the number of occurrences (instances) of each clone-id, as shown in Figure 6.

Figure 6. No. of Occurrences of each Clone-set

The textual approach also gives details about the renaming of variables and about fan-out (that is, the number of files among which a code clone is scattered), which is very useful for estimating the effort required to remove duplicated code. The token-based approach gives less confidence because it reports more false positives (as is the case with ccfinder in Figure 4). The tree-based approach leads to far better confidence because it ignores accidental matches. Tree-based techniques (syntactic approaches) use a parser to convert source programs into parse trees or abstract syntax trees (ASTs), which are then processed using tree matching to find clones. The tree-based technique is more accurate because it searches for clones syntactically, by comparing the syntax trees [14], and ignores accidental matches.
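The two figures reported by the textual approach, the number of instances of a clone and its fan-out, can be sketched as follows. This is an illustrative exact-match detector over fixed windows of lines, not the algorithm of DuDe or any other tool in this study; the function name exact_line_clones, the window size and the two sample files are assumptions made for the example.

```python
from collections import defaultdict

def exact_line_clones(files: dict[str, str], window: int = 3) -> list[dict]:
    """Report runs of `window` identical non-blank lines that occur more than once."""
    index = defaultdict(list)  # chunk of lines -> [(file name, start line), ...]
    for name, text in files.items():
        # Keep 1-based line numbers; ignore blank lines and surrounding whitespace.
        lines = [(no, ln.strip())
                 for no, ln in enumerate(text.splitlines(), 1) if ln.strip()]
        for start in range(len(lines) - window + 1):
            chunk = tuple(ln for _, ln in lines[start:start + window])
            index[chunk].append((name, lines[start][0]))
    report = []
    for chunk, places in index.items():
        if len(places) > 1:
            report.append({
                "lines": chunk,
                "instances": len(places),                      # occurrences of the clone
                "fan_out": len({name for name, _ in places}),  # files it is scattered over
                "locations": places,
            })
    return report

# Hypothetical input: two small files sharing three consecutive lines.
files = {
    "A.java": "int a = 1;\nint b = 2;\nint c = a + b;\nprint(c);\n",
    "B.java": "int x = 9;\nint a = 1;\nint b = 2;\nint c = a + b;\n",
}
for clone in exact_line_clones(files):
    print(clone["instances"], clone["fan_out"], clone["locations"])
```

A clone with many instances but a fan-out of one is confined to a single file and is typically cheaper to remove than one scattered over many files.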


E) Focus

In our case study we noticed that all the techniques were able to focus on the entire project at once, including all the classes and methods used in the whole project. But for a large number of lines of code (LOC), it won't be possible to focus on the whole project at once; swapping is needed in those cases.

Table 8. Clone Detection Run Time Duration

Case Study          Lines of Code    Clone Detection Run Duration
Netbeans Javadoc    SLOC 2852        DuDe: 11 secs; solidsdd: 4 secs; clonedr: [...] secs
Java Quizz          10               DuDe: 4 secs; solidsdd: 3 secs; clonedr: [...] secs
Jboss SP1-src       SLOC [...]       DuDe: [...] secs; solidsdd: 82 secs; clonedr: [...] secs

[Figure 7: chart of clone detection run duration (axis 0 to 4000) for DuDe, solidsdd and clonedr on Netbeans Javadoc, Java Quizz and Jboss SP1-src]

Figure 7. Clone Detection Speed by the Tools

Table 9. Observation on the Basis of Mentioned Criteria

Criteria                                        Relevance       Confidence    Focus           Suitability
Applicability of Detection Techniques chosen    All are same    Text based    All are same    Tree based

F) Scalable to Speed

In the three case studies, it is clear from Table 8 that the token-based technique is scalable to all kinds of projects, from a few lines of code (Java Quizz) to large-scale projects (Jboss SP1-src). The text-based technique takes more time to detect clones in these three cases. The speed of the tools used is shown graphically in Figure 7. Table 9 describes the observations on the chosen criteria for the respective techniques.

5.2. Comparison of Tools Used

To study the three techniques discussed in this paper, we worked with different automated tools that are based on different underlying techniques (as described in Table 10). Four tools are used: DuDe, solidsdd, CCFinder and clonedr.

Table 10. Tools and their Background Technique

Tool        Working Technique
DuDe        Text-based approach
solidsdd    Token-based approach
CCFinder    Token-based approach
clonedr     Tree-based approach

To compare the tools, we describe the properties of the clone detection tools according to the following criteria (Table 11 shows the details of the comparison):

Platform: the execution platform of the tool.
Special Environment: whether the tool requires a special environment for operating.
Availability: whether the tool is freely available, available as an evaluation version, or requires a license.
User Interface: whether the tool is graphically interactive or used from the command line.
Output: the kind of output supported by the tool. Some tools provide information textually, with the file name and the begin and end line numbers of the cloned fragments; some give the original source of the cloned fragments in some format; some show a scatter plot of the cloned code.
Clone Relation: how clones are reported: as clone pairs, clone classes or clone sets.
Types of clones: whether the tool detects type-1, type-2 or type-3 clones.

Table 11. Comparison of Tools

Criteria: Platform
  DuDe: has been run on Windows; no other information available
  CCFinder: runs on Windows, but also supports Linux
  solidsdd: platform independent
  clonedr: has been run on Windows

Criteria: Special Environment
  DuDe: available as a JAR file; JRE 1.5 required
  CCFinder: Python and JRE 1.5 required for its working
  solidsdd: no extra software required for running
  clonedr: no extra software required

Criteria: Availability
  DuDe: evaluation version available on request
  CCFinder: freely available for research
  solidsdd: a free evaluation license is available
  clonedr: freely available for research

Criteria: User Interface
  All these tools provide a graphical user interface.

Criteria: Output
  DuDe: shows results textually (its output is the most suitable and easily understandable)
  CCFinder: shows results textually, in a scatter plot, and also shows a scrap book
  solidsdd: shows results textually
  clonedr: provides results in the form of a web page (HTML page) in the system directory

Criteria: Type of Clone Detected
  DuDe: detects type-1, type-2 and type-3
  CCFinder: detects type-1 and type-2
  solidsdd: detects type-1, type-2 and type-3
  clonedr: detects type-1, type-2 and type-3

Criteria: Clone Relation
  DuDe: produces clone pairs
  CCFinder: produces clone sets
  solidsdd: produces clone pairs
  clonedr: produces clone sets

DuDe: Although DuDe [15] is text-based, it can combine small duplicated segments into larger ones by allowing gaps in the duplicated segments, and is thus able to detect near-miss clones. In our case study, too, DuDe was able to detect type-3 clones.

CCFinder: is token-based and detects type-1 and type-2 clones.

solidsdd: is also based on the token-based approach, but it detects type-1, type-2 and also type-3 clones. This tool shows the details of inserted and deleted lines in two copied fragments. Figure 8 shows the detail given by solidsdd for two files, the reference file ExternalJavadocExecutorBeanInfo.java and the duplicate file JavadocModule.java: line 27 is modified to line 28 in the same file, while lines 35 and 36 are deleted from the reference file and the rest of the clone is copied into the duplicate file (as shown in Figure 8). Therefore, solidsdd gives a better view of the source code.

clonedr: In clonedr, a compiler generator is used to generate an annotated parse tree (AST), and its subtrees are compared by characterization metrics based on a hash function. The source code of similar subtrees is then returned as clones. The hash function enables parameterized matching and the detection of gapped clones, especially if the gaps are within a line [14]. It detects type-1, type-2 and type-3 clones.
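The idea of bucketing similar subtrees by a hash can be sketched with Python's standard ast module. This is only an illustration under stated assumptions, not clonedr's algorithm: the signature function shape, the threshold min_nodes and the sample source are hypothetical, and the sketch finds only subtrees that are identical apart from names and literal values, not gapped clones.

```python
import ast
from collections import defaultdict

def shape(node: ast.AST) -> str:
    """Structural signature of a subtree: node types only, no names or literal values."""
    children = ", ".join(shape(child) for child in ast.iter_child_nodes(node))
    return f"{type(node).__name__}({children})"

def clone_sets(source: str, min_nodes: int = 12) -> list[list[int]]:
    """Group statements whose subtrees have the same shape; return their start lines."""
    buckets = defaultdict(list)  # signature (hashed as a dict key) -> start lines
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.stmt):
            continue
        size = sum(1 for _ in ast.walk(node))
        if size >= min_nodes:  # skip trivially small subtrees
            buckets[shape(node)].append(node.lineno)
    return [sorted(lines) for lines in buckets.values() if len(lines) > 1]

# Hypothetical sample: two functions that differ only in identifier names.
SAMPLE = """\
def total_price(items):
    result = 0
    for item in items:
        result = result + item.price
    return result

def total_weight(parcels):
    acc = 0
    for p in parcels:
        acc = acc + p.weight
    return acc
"""

print(clone_sets(SAMPLE))
```

The size threshold plays the role of a minimum clone length: without it, every pair of small statements with the same shape, such as two simple assignments, would be reported.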

Modifications (solid-sdd view)
  27 : desc.setDisplayName (NbBundle.getMessage (ExternalJavadocExecutorBeanInfo.class, "CTL_Javadoc_executor")); //NOI18N
  28 : desc.setShortDescription (NbBundle.getMessage (ExternalJavadocExecutorBeanInfo.class, "HINT_Javadoc_executor"))
  87 : bd.setValue ("global", Boolean.TRUE)
Deletions
  35 : if (Boolean.getBoolean ("netbeans.debug.exceptions")) //NOI18N
  36 : ie.printStackTrace ()

Figure 8. Output of solid-sdd

6. Conclusions and Future Work

In this paper, we have focused on clone detection techniques and tools, providing a review of both. Previous research shows that the token-based approach returns more false positives than the other techniques. The text-based and token-based tools had very similar recall values, but their precision values differed because they found different numbers of clone pairs. The tree-based tool had a higher recall value and a lower precision value than the tools of the other techniques. Token-based, tree-based and metric-based techniques are also helpful in combination with refactoring tools.

An attempt has been made to evaluate the performance of the text-based, tree-based and token-based approaches for detecting duplicate code. The respective techniques were examined using the DuDe, CCFinder, solidsdd and clonedr tools. It has been observed that the text-based approach is best suited for restructuring with little effort and for finding exact clones. The tree-based approach works effectively to find near-miss clones, and it ignores accidental matches. The token-based approach also detects exact clones as well as near-miss clones, but it strongly depends upon language constructs. In terms of scalability to speed, the token-based technique is suitable for small as well as large-scale projects. All the techniques provide file-level, class-level and clone-set details.
The actual evaluation of relevant code clones was done by comparing how the same cloned portion is reported by the three techniques, to have a parallel view. Among the tools used, DuDe is less suitable for large-scale projects, as it takes the most time to evaluate a large LOC. solidsdd is suitable for all kinds of applications, from small to large LOC, whereas clonedr also takes more time to evaluate large projects. The work can be extended by applying the metric-based approach and studying its performance in conjunction with the evaluated techniques. Refactoring components can also be determined on the basis of the metric-based approach.

References

[1] N. F. Schneidewind and H. Hoffman, An experiment in software error data collection and analysis, IEEE Transaction on Software Engineering, vol. 5, no. 3, (1979), pp.
[2] W. Yang, Identifying syntactic differences between two programs, Software Practice and Experience, vol. 21, no. 7, (1991), pp.
[3] J. H. Johnson, Identifying redundancy in source code using fingerprints, CASCON, IBM Press, (1993).
[4] J. Mayrand, C. Leblanc and E. M. Merlo, Experiment on the Automatic Detection of Function Clones in a Software System Using Metrics, Proceeding of IEEE Int'l Conf. on Software Maintenance (ICSM) 96, (1996), pp.
[5] D. Baxter, A. Yahin, L. Moura, M. Sant'Anna and L. Bier, Clone Detection Using Abstract Syntax Trees, in ICSM, (1998).
[6] S. Ducasse, M. Rieger and S. Demeyer, A Language Independent Approach for Detecting Duplicated Code, ICSM, (1999).
[7] T. Kamiya, S. Kusumoto and K. Inoue, A Token Based code clone detection tool-ccfinder and its empirical evaluation, Technical Report, (2000).
[8] T. Kamiya, S. Kusumoto and K. Inoue, CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code, IEEE, vol. 28, no. 7, (2002).
[9] The Source for Java Technology, (2002).
[10] C. Liu, C. Chen and J. Han, GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2006), pp.
[11] R. Koschke, R. Falke and P. Frenzel, Clone Detection Using Abstract Syntax Suffix Trees, Proceedings of the 13th Working Conference on Reverse Engineering, WCRE 2006, (2006), pp.
[12] C. K. Roy, J. R. Cordy and R. Koschke, Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach, Science of Computer Programming, vol. 74, no. 7, (2009), pp.
[13] F. Van Rysselberghe and S. Demeyer, Evaluating clone detection techniques from a refactoring perspective, Lab on Re-Engineering, University of Antwerp.
[14] C. K. Roy, J. R. Cordy and R. Koschke, Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach, School of Computing, Queen's University, Canada; University of Bremen, Germany.
[15] R. Wettel and R. Marinescu, Archeology of Code Duplication: Recovering Duplication Chains From Small Duplication Fragments, Proceedings of the 7th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC, (2005).
[16] D. Meyer, Analyzing the Robustness of Clone Detection Tools Regarding Code Obfuscation, University of Magdeburg School of Computer Science, (2012) October.
[17]
[18] ccfinder is available at http:
[19]
[20]
[21] Y. Ueda, T. Kamiya, S. Kusumoto and K. Inoue, Gemini: Maintenance Support Environment Based on Code Clone Analysis, Proceedings of the Eighth IEEE Symposium on Software Metrics, Ottawa, Canada, (2002).
[22] B. S. Baker, A Program for Identifying Duplicated Code, Proceedings Computing Science and Statistics: 24th Symp. Interface, vol. 24, (1992) March, pp.
[23] R. Falke, R. Koschke and P. Frenzel, Empirical Evaluation of Clone Detection Using Syntax Suffix Trees, Empirical Software Engineering, vol. 13, (2008), pp.
[24] L. Jiang, G. Misherghi, Z. Su and S. Glondu, DECKARD: Scalable and Accurate Tree-based Detection of Code Clones, in: Proceedings of the 29th International Conference on Software Engineering, ICSE 2007, (2007), pp.
[25] S. Lee and I. Jeong, SDD: High performance Code Clone Detection System for Large Scale Source Code, in: Proceedings of the Object Oriented Programming Systems Languages and Applications Companion to the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, OOPSLA Companion 2005, (2005), pp.
[26] C. K. Roy and J. R. Cordy, NICAD: Accurate Detection of Near-Miss Intentional Clones Using Flexible Pretty-Printing and Code Normalization, in: Proceedings of the 16th IEEE International Conference on Program Comprehension, ICPC 2008, (2008), pp.
[20] P. Bulychev and M. Minea, Duplicate Code Detection Using Anti-Unification, in: Spring Young Researchers Colloquium on Software Engineering, SYRCoSE 2008, (2008), 4 pp.
[27] M. F. Zibran and C. K. Roy, IDE-based Real-time Focused Search for Near-miss Clones, in: SAC'12, March 25-29, 2012, Riva del Garda, Italy.
[28] F. Calefato, F. Lanubile and T. Mallardo, Function Clone Detection in Web Applications: A Semiautomated Approach, Journal of Web Engineering, vol. 3, no. 1, pp.
[29] M. Asaduzzaman, Visualization and Analysis of Software Clones, A Thesis in the Department of Computer Science, University of Saskatchewan, Saskatoon, January.
[30] R. Wettel and R. Marinescu, Archeology of Code Duplication: Recovering Duplication Chains From Small Duplication Fragments, in: Proceedings of the 7th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC, (2005), p. 8.
[31] S. Ducasse, M. Rieger and S. Demeyer, A Language Independent Approach for Detecting Duplicated Code, in: Proceedings of the 15th International Conference on Software Maintenance, ICSM 1999, (1999), pp.
[32] Z. Li, S. Lu, S. Myagmar and Y. Zhou, CP-Miner: Finding Copy-Paste and Related Bugs in Large-Scale Software Code, IEEE Transactions on Software Engineering, vol. 32, no. 3, (2006), pp.
[33] Tool Clone Detective (part of ConQAT). URL Page Last accessed November
[34] Tool SimScan, URL Last accessed November
[35] P. Bulychev and M. Minea, Duplicate Code Detection Using Anti-Unification, in: Spring Young Researchers Colloquium on Software Engineering, SYRCoSE 2008, (2008), p. 4.
[36]
[37] S. Shafieian and Y. Zou, Comparison of Clone Detection Techniques, Technical Report, pp.
[38] K. Kontogiannis, R. DeMori, E. Merlo, M. Galler and M. Bernstein, Pattern Matching for Clone and Concept Detection, Journal of Automated Software Engineering, vol. 3, no. 1-2, (1996), pp.


Clone Detection using Textual and Metric Analysis to figure out all Types of Clones
Kodhai.E 1, Perumal.A 2, and Kanmani.S 3
1 SMVEC, Dept. of Information Technology, Puducherry, India
Email: kodhaiej@yahoo.co.in