The University of Saskatchewan Department of Computer Science. Technical Report #

Size: px
Start display at page:

Download "The University of Saskatchewan Department of Computer Science. Technical Report #"

Transcription

1 The University of Saskatchewan Department of Computer Science Technical Report #

2 The Road to Software Clone Management: ASurvey Minhaz F. Zibran Chanchal K. Roy {minhaz.zibran, University of Saskatchewan, Canada February 13, 2012

3 Contents 1 Introduction and Motivation 1 2 A Systematic Review on Clone Literature Threat to Validity Code Clone Clones beyond Source Code Clone Relationship Clone Granularity Which Level of Granularity Is Appropriate? Intentional and Accidental Clones Clone Detection Techniques Strengths and Weaknesses of Clone Detection Techniques Challenges in the Empirical Evaluation of Clone Detection Tools Clone Evolution Clone Genealogy Genealogy Extraction: Clone Change Patterns Need for Improvements Visualization of Clone Evolution Clone Management Clone Management Strategies Design Space for a Clone Management System Architectural Centrality Triggering of Clone Management Activity Scope of Clone Management Activity Clone Management Activities Integrated Clone Detection Clone Documentation Clone Tracking Incremental Clone Detection Clone Annotation Techniques for Reengineering/Refactoring of Clones Generics and Templates Design Level Approaches Design Patterns: Traits: Aspects: i

4 4.8.3 Synchronized Modification Consistent Renaming Refactoring Patterns Tool Support for Refactoring Patterns: Analysis and Identification of Clones for Refactoring Visualization of Distribution and Properties of Clones Analysis to Find Clone Based Reengineering Opportunity Clone Categorization Based On Reengineering Opportunity: Cost-benefit Analysis and Scheduling of Refactoring Verification of Clone Modification/Refactoring Industrial Adoption of Clone Management 52 6 Conclusion 53 Bibliography 54 ii

5 List of Figures 1 Yearly number of distinct authors contributing to clone research Categories of publications on software clone research in di erent years 5 3 Proportion of publications in each category over the period A clone genealogy with two lineages over versions v k through v k+3 [201] 18 5 Clone change patterns and types of genealogies [171] Clone management workflow Kapser and Godfrey [107] Taxonomy: clone categorization based on location and functionality iii

6 List of Tables 1 Categories of theses on software clone research Code Clone Detection Techniques (extended from [126]) Summary of techniques for clone genealogy extraction Summary of clone management support from integrated tools Summary of tool support for incremental clone detection Summary of clone visualization techniques (extended from Jiang et al. [94]) Balazinska et al. [16] Taxonomy: categories of function/method level clones based on (dis)similarity caused by di erent types of di erences 46 8 Koni-N Sapu [121] Taxonomy: applicability of di erent refactoring patterns on di erent categories of clones Schulze et al. [178] Taxonomy: applicability of object-oriented (first three) and aspect-oriented (last three) refactoring patterns to refactor categories of clones Torres [189] Taxonomy: clone categorization based on concept location Comparison of Code Clone Refactoring Schedulers iv

7 1 Introduction and Motivation Copying an existing code and pasting it in somewhere else followed by major or minor edits is a common practice that the developers adopt to increase productivity. Such a reuse mechanism typically results in duplicate or very similar code fragments residing in the code base. Those duplicate or near-duplicate code segments are commonly known as code clones. There are many reasons why the developers intentionally perform such code cloning. Obvious reasons include reuse of existing implementation without re-inventing the wheel. More comprehensive discussions on the reasons for code cloning can be found elsewhere [106, 161, 163]. Code clones may also appear in the code base without the awareness of the developers. Such unintentional/accidental clones may be introduced, for example, due to the use of certain design patterns, use of certain API s to accomplish similar programming tasks, code conventions imposed by the organization, and so on. The reuse mechanism by code cloning o ers some benefits. For instance, cloning of an existing code that is already known to be flawless, might save the developers from probable mistakes they might have made if they had to implement the same from the scratch. It also saves the time and e ort in devising the logic and typing the corresponding textual code. Code cloning may also help in decoupling classes or components and facilitate independent evolution of similar feature implementations. On the other end of the spectrum, code clones may also be detrimental in many cases. Obviously, redundant code may inflate the code base, and may increase resource requirements. This may be crucial for embedded systems and systems such as hand held devices, telecommunication switches, and small sensor systems. Moreover, cloning a code snippet that contains any unknown fault may result in propagation of that fault in all copies of the faulty fragment. From the maintenance perspective, a change in one code segment may necessitate consistent changes in all clones of that fragment. Any inconsistency may introduce bugs or vulnerabilities in the system. Fowler et al. [58] recognize code clones as a serious kind of code smell. However, during the software development process, duplication cannot be avoided at times. For example, duplication may be enforced by the limitation of the programming language in facilitating with necessary mechanism to implement an e cient generic solution of a problem at hand. Code generators may also generate duplicated code, that the developers do not want to modify. Previous research reports empirical evidences that a significant portion (generally 9%-17% [206]) of a typical software system consists of cloned code, and the proportion of code clones in the code base may be as low as 5% [163] and as high as even 50% [162]. Indeed, due to the negative impact of code clones in the maintenance e ort, one might want to remove code clones by active refactoring, wherever feasible. However, in reality, aggressive refactoring of code clones appears not to be a very good idea [39], and not all clones are really removable through refactoring. Due to the dual 1

8 role of code clones in the development and maintenance of software systems, as well as the pragmatic di culty in avoiding or removing those, researchers and practitioners have agreed that code clones should be detected and managed e ciently. Since the emergence of software clones as a research area, significant contributions over years made the field grow and become quite a matured area of research. Nonetheless, over the entire course of software clone research there have been notably three surveys. Koschke [125], in 2007, presented a brief summary of the important findings about di erent aspects of software clones including cause-e ect of cloning, clone avoidance, detection, and evolution. along with a set of open questions. In the same year, Roy and Cordy [163] also published another survey containing a thorough review on those same areas with specific focus on clone the detection tools and techniques. Recently, Pate et al. [159] published a systematic review on 30 studies on clone evolution only. In this paper, we present an extensive survey on code clone research with strong emphasis on clone management. This paper is organized as follows. In Section 2, we present a systematic review on a repository of 262 papers on software clone research published over 19 years. The review draws birds-eye view on the overall contributions and growth along di erent dimensions of software clone research. The remaining of this survey is the outcome of careful investigation of literature beyond the said repository, and through analysis in the light of our experience. Section 3 starts with the necessary background including the definition and types of clones. Section 3.3 describes the di erent granularities, at which code clones can be addressed. Section 3.4 distinguishes intentional and accidental clones. In Section 3.5, we describe the di erent clone detection techniques along with their strengths and weaknesses. Section 3.6 addresses clone evolution with emphasis on clone change patterns based on the genealogy based evolution model. Section 4 begins with the management of clones beyond detection. Section 4.1 characterizes di erent strategies for clone management. In Section 4.2, we briefly describe the design space for a clone management system. Section 4.3 starts discussion on the di erent clone management activities, including integrated clone detection (Section 4.4), clone documentation (Section 4.5), tracking (Section 4.6) and annotation (Section 4.7). Section 4.8 presents the techniques for clone removal or clone based reengineering. In Section 4.9, we describe the analyses for the identification of potential clones as candidates for refactoring/reengineering. The cost-benefit analysis and scheduling of clone refactoring is then presented in Section A discussion on the verification of clone refactoring is accommodated in Section Our understanding on the challenges for industrial adoption of clone management is presented in Section 5, and finally, Section 6 concludes the paper. 2

9 2 A Systematic Review on Clone Literature There has been more than a decade of research in the field of software clones. To understand the growth and trends in the di erent dimensions of clone research we carried out a quantitative review on the related publications. Robert Tiras at the University of Alabama at Birmingham has been maintaining a repository [181] of scholarly articles that make significant contributions in the area. Until today, the corpus consists of 264 scholarly articles published between 1994 and 2012 in di erent refereed venues including Ph.D., M.Sc., and Diploma theses. The repository organizes the publications by categorizing them based on their contributions in four major subareas of clone research. The categories are as follows: Analysis: This category contains publications that perform analysis on the various traits of software clones, their reasons, existence, e ects in software systems, as well as investigation of clone reengineering opportunities and implications. A majority of such publications report findings from qualitative or quantitative empirical studies. Detection: Publications in this category address techniques and tools for the detection of software clones. Tool Evaluation: This category comprises the publications that contribute to the quantitative or qualitative evaluation of the techniques and tools for clone detection. Management: Publications in this category address the issues, techniques and tools for the management of code clones beyond detection. Other than the publications that fall in the aforementioned four categories, the corpus also contains three publications categorized as Survey of Overall Research. After careful reading of those articles, we put them in the Analysis category. Moreover, the repository classified the published theses based on simply whether they were outcomes of Ph.D., M.Sc., and Diploma programs. We also went through them, and based on their research focus, we classified them (Table 1) into the above mentioned four categories. In our review, we include a total of 262 scholarly articles and theses published since 1994 until Since, it is now early 2012 and more contributions are expected to appear by the end of 2012, we deliberately excluded publications after 2011 for sanity. The repository also contains a list of tools, events, research groups, and links relevant to software clone research, which we excluded from the review. We collected the information by performing automated parsing (followed by manual verification) of the web (HTML) interface of the repository. 3

10 Ph.D. Thesis M.Sc. Thesis Diploma Thesis Table 1: Categories of theses on software clone research Title Author Year Category Representation, Analysis, and Refactoring Techniques R. Tairas 2010 Management to Support Code Clone Maintenance Scalable Detection of Similar Code: Techniques L. Jiang 2009 Detection and Applications Toward an Understanding of Software Code C. Kapser 2009 Management Cloning as a Development Practice Assessing the E ect of Source Code Characteristics A. Lozano 2009 Analysis on Changeability Detection and Analysis of Near-Miss Software C. Roy 2009 Detection Clones Code Clone Analysis Methods for E cient Software Y. Higo 2006 Analysis Maintenance E ective Clone Detection Without Language Barriers M. Rieger 2005 Detection Automated Duplicated-Code Detection and Procedure R. Komondoor 2003 Detection Extraction Improving Clone Detection for Models P. Pfaehler 2009 Detection Clone Detection Using Dependence Analysis and Y. Jia 2007 Detection Lexical Analysis Clone Detection Using Pictorial Similarity in Slice Y. Jafar 2007 Detection Traces Visualizing and Understanding Code Duplication Z. Jiang 2006 Analysis in Large Software Systems Incremental Clone Detection N. Göde 2008 Detection CPC: An Eclipse Framework for Automated Clone V. Weckerle 2008 Management Life Cycle Tracking and Update Anomaly Detection Semi Automatic Removal of Duplicated Code Y. Liu 2004 Management Automated Detection Of Code Duplication Clusters R. Wettel 2004 Detection A Scenario Based Approach for Refactoring Duplicated G. Koni-N Sapu 2001 Management Code in Object Oriented Systems 4

11 Figure 1: Yearly number of distinct authors contributing to clone research Analysis Detection Management Tool Evaluation Figure 2: Categories of publications on software clone research in di erent years

12 Figure 1 plots the number of distinct authors contributing to clone research in the years from 1994 through As the figure indicates, the clone research community has experienced a significant growth over the recent years. In the Figure 2, we present the number of publications appeared every year making contributions to each of the four sub-areas of clone research. As seen in the figure, early work on software clone research were dominated by the research on clone detection with some work on analysis. In the recent years, the work on clone analysis and detection has grown significantly while clone management has emerged and growing as a significant research topic. Despite the fast growth of the clone research community, the work on clone management remained much less compared to analysis and detection, which can be more clearly perceived from the Figure 3. This, in combination with the realized importance of research in clone management, point to the further need and potential for research in this sub-area. Figure 3: Proportion of publications in each category over the period It can also be noticed from both the Figure 2 and Figure 3 that over the entire span ( ) of software clone research very few work focused on the evaluation of clone detection techniques or tools, although more than 40 di erent clone detection tools have been produced realizing a wide variety of techniques [168]. Indeed, the detection of clones is a fundamental topic for software clone research, and the e ectiveness of clone management largely depends of clone detection. Therefore, more work on the evaluation of clone detection tools is necessary to inform scopes for further improvements. However, the task can be challenging due to a number of reasons, which we discuss in Section

13 2.1 Threat to Validity The corpus of publications used in our systematic review contains a significant number of scholarly articles published over 19 years. Still, a few publications might have escaped the repository. For example, two M.Sc. theses [5, 171] that were completed at the end of 2011, are yet to appear in the corpus. However, to preserve the reproducibility our systematic review, we did not include such few missing publications. In fact, the corpus is a result of consistent e ort for exhaustive preservation of publications, which significantly contributed to the software clones research. Hence, we believe that those few missing articles, if included, would not cause confounding variation in the findings of our review. Indeed, the remaining of this survey is based on relevant literature beyond the corpus used in the systematic review. 3 Code Clone Though, duplicate or similar code fragments are roughly known to be code clones, the definition of clone has remained more or less vague over the last decade. The vagueness is reflected in the definition given by Ira Baxter, Clones are segments of code that are similar according to some definition of similarity (Ira Baxter, 2002) [125]. Such a definition is dependent on how similarity is defined, and also raises question on how much of code can be regarded as a code segment. Clone research over the past decade somewhat addressed these issues, and the following categorizing definitions of code clone currently have been widely acceptable [125, 163, 161]. Type-1 Clone: Identical code fragments except for variations in white-spaces and comments are Type-1 clones. Type-2 Clone: : Structurally/syntactically identical fragments except for variations in the names of identifiers, literals, types, layout and comments are called Type-2 clones. Type-3 Clone: Code fragments that exhibit similarity as of Type-2 clones and also allow further di erences such as additions, deletions or modifications statements are known as Type-3 clones. Type-4 Clone: Code fragments that exhibit identical functional behaviour but implemented through very di erent syntactic structure are known as Type-4 clones. Indeed, a number of similarity based other taxonomies were also proposed by Mayrand et al. [148], Balazinska et al. [16], Bellon et al. [28, 127], and Davey et al. [42], and Kontogiannis [122]. 7

14 As described, the first three types of clones are defined based on the similarity in the program text, while Type-4 clone is defined based on semantic similarity. Thus, Type-4 clones are also called semantic clones. One the other hand, Type-1 clones are also known as exact clones, Type-2 clones are also sometimes called renamed clones, whereas, the Type-2 and Type-3 clones jointly are called near-miss clones [163]. The terms parameterized clone or p-match clone are often used in the community to refer to a subset of Type-2 clones, where there must be a bijective mapping between the identifiers of the two Type-2 clones [125]. Thus, renaming (arbitrary or systematic) of identifiers are allowed in Type-2 clones, whereas for parameterized clones, those renaming must be consistent/systemic [163]. For Type-3 clones, the deletion of statement from one code segment can be considered as an addition of the statement in the other. The di erence in the additional or changed statements in the Type-3 clones are often called the gaps [193], and thus Type-3 clones are sometimes also referred to as gapped clones [125, 163, 193]. While the definition of Type-1 and Type-2 clones are quite precise, the definition of Type-3 clones still remains vague [125]. The definition does not precisely indicate how much di erences in terms of addition, modification, or deletion of statements are allowed in code segments to be regarded as Type-3 clones. The practitioners commonly consider code segments as Type-3 clones when the di erence in the statements remain below a (dis)similarity threshold [11, 140, 164, 205]. However, a consensus on an appropriate value for such a threshold is yet to be established [125]. 3.1 Clones beyond Source Code Ongoing research also attempts to deal with clones in software artifacts other than the source code [99], such as clones in higher level code structure [21, 22] or high level concepts [146], clones in the models of formal model based development [46, 47], in UML domain models [180], UML sequence diagrams [142], in the graph based Matlab/Simulink models [160], and duplication in requirement specification documents [99, 100]. However, this paper focuses on the management of clones in the source code only. 3.2 Clone Relationship A clone relationship exists between two code segments, which are clones to each other according to specification of similarity as prescribed by the definitions stated above. Such a clone relationship is reflexive (i.e., if code segment A is a clone of B, then B is also a clone of A). Moreover, for Type-1 and Type-2 clones, the transitive relationship also exists (i.e., if code segment A is a clone of B, and B is a clone of C, then A is also a clone of C). However, such a transitive property may not hold for Type-3 clones [27]. The aforementioned definitions also imply that a subset relationship exists among the 8

15 Type-1, Type-2, and Type-3 clones. Mathematically, Type-i Type-j, for i 2 {1, 2} and j = i + 1 [119]. Two code segments that are clones to each other (i.e., having clone relationship between them) are called a clone pair. A clone-class or clone-group is a set of code segments such that any two of them are clone pairs. The term clone is used in the software community in the following two ways. clone as a noun refers to a code fragment that is, according to the aforementioned definitions, similar enough to one of more other code segments. For instance, our objective is to find all clones in the code-base. clone as a verb indicates the act of producing a code segment (e.g., by copypasting) that is, according to the aforementioned definitions, similar enough to one of more other code segments. For example, I cloned the function to reuse it in my context. 3.3 Clone Granularity The definitions of the all four types of clones are based on the notion of code segment, but how much of contiguous code can be considered as a code fragment is not made specific. Thus, contiguous portion of code at di erent levels of granularity have been used in the literature. As concerned with source code, the most commonly used granularities are at the level of the entire source file, class definition, method body, code block, and statements, which yield the the notion of code clones of the following five types: File clone: When two files are found to have contained similar enough source code, they are called file clones. Class clone: Two classes of an object-oriented code can be considered as class clones if they have identical or near identical code. Function clone: Two functions are considered as clones when the bodies of the functions consists of code that are similar enough. Block clone: When two blocks of code (marked with opening and closing braces or indentation, or the like) are similar enough, they are called block clones. Arbitrary statements clone: When two groups of statements at arbitrary regions of the source file are found to be similar enough, they are also regarded as clones (CCFinder detects such clones). 9

16 3.3.1 Which Level of Granularity Is Appropriate? According to the definition of clone, a pair of very small portion portion of code such as, two identical identifiers, two similar statements, functions or blocks each having only one statement can also be valid candidates for clones. However, those tiny code segments cannot be real clones of pragmatic significance. Thus, the practitioners typically disregard those code segments that are smaller than a given threshold. Again, due to the di erences in the varying contexts and techniques for clone detection, there has been no consensus in the community on such a threshold, though minimum 20 to 30 tokens [118, 172] or three to five lines of code [167, 206, 203, 205] is a common threshold used in practice. We believe that the chosen granularity of the code segments should exhibit some characteristics so that the detected clones at that granularity actually becomes useful from the maintenance perspective. In this regard, we propose the following desired characteristics as inspired by our experience and the criteria proposed by Giesecke [62]. Coverage: The set of all code segments should cover maximal behavioural aspects of the software. Significance: Each code segment should possess implementation of a significant functionality. Intelligibility: Each code segment should constitute su cient amount of code such that a developer can understand its purpose with little e ort. Reusability: The code segments should feature a high probability for informal reuse. A source file typically contains a large bulk of code (a Java source file may contain multiple classes, interfaces, and so on). Though, the set of all source files cover all behavioural aspects of the software, informal reuse at file level is unlikely in a software system. Classes in object-oriented systems also satisfy the coverage criterion, but they are also unlikely to be informally reused, as more elegant concepts such as inheritance and delegation are there for their formal reuse. Methods or blocks (a method body itself forms a block) cover almost all behavioural aspects, and those that are missed (e.g., library/package inclusion, declaration and initializations) can be neglected [62]. Methods and blocks contain implementation of significant functionality that can be understood with less e ort than to do for an entire source file or class. Methods isolate a functionality and so often do the blocks, and thus they are likely to be informally reused, as there is no easy rigorous way to reuse methods from another context [62]. Code segments at the arbitrary statement level satisfy the coverage criterion, but not the other three criteria. Single statement or short sequence of statements typically 10

17 don t cover a unit of significant functionality. The meaning of a statement in general is likely to be incomprehensible without considering its context, and the context of the host block or function. Moreover, sequences of import/include statements, declaration/initialization statements, or the sequences of statements spanning code boundaries such as boundaries of two functions, classes, or blocks do not exhibit significant potential for reuse, rather those should be ignored in many cases. Thus, we suggest that code segments at the level of functions or blocks can be the most suitable granularity for dealing with code clones, specially for maintenance. Giesecke [62] also proposed in favour of function as the most adequate level of clone granularity. Indeed, block level granularity also covers functions, though it does not distinguish a block that constitutes a function body from that which does not. Driven by the same understanding, Higo et al. developed CCShaper [79, 76] to post-process the clone detection result from CCFinder to extract block level clones as the potential candidates for refactoring. 3.4 Intentional and Accidental Clones Code clones in a software system can appear in two ways. First, the programmer copies an existing code, pastes it in another place, and thus reuses the implementation with or without further modifications. The resulting code may still remain similar to the original and form a clone-pair. Such clones created by the programmer s deliberate copy-paste-modification activities are known as intentional clones or copy-pasted clones. There are a number of reasons why programmers create such intentional clones. An obvious reason is to reuse existing implementation without re-inventing the wheel. Indeed, similar code segments may also appear in the system without the programmer s intention, and often the developer can even be unaware of the creation of such clones. For example, due to the developer s mental model, frequently used idioms are reproduced from memory instead of deliberate copy-paste [161]. Even di erent developers may also produce very similar code while solving a similar problem, or due to the dictation of certain APIs they use, or using the same design pattern [203]. Such similar code snippets that are not produced by deliberate copy-paste operations are known as unintentional clones or accidental clones. Practically, there are many reasons behind the creation of intentional and accidental code clones in a software system. Further details on the root causes of cloning can be found elsewhere [109, 110, 125, 163]. 3.5 Clone Detection Techniques Over more than a decade of code clone research a number of techniques have been devised for the detection of code clones and many clone detection tools have been 11

18 developed. Prominent techniques can be categorized as shown in the Table 2. The categorization is based on the type of information used in the analysis/comparison and the type of approaches used for the analysis. The first technique (Tracking Clipboard Operations) in the table is an action-trace based technique and the rest others are code similarity based techniques. In this section, we provide a brief summary of di erent clone detection techniques, and point out the strengths and weakness of those techniques in general. More detailed descriptions of those techniques can be found in the corresponding papers and elsewhere [163]. Tracking Clipboard Operations: This technique of clone detection is based on the assumption that programmers copy-paste activities are the primary reason for the creation of code clones. So, the technique simply tracks clipboard activities in the editor (inside IDEs such as Eclipse) when a programmer copies a code segment and reuses by pasting it. The copied and the pasted code segments are recorded as clone-pairs. Metrics Comparison: Metrics based techniques are usually used to detect function clones. The techniques are based on the assumption that similar code fragments should yield very similar values for di erent software metrics (e.g., cyclomatic complexity, fan-in, fan-out). Typically, for the code segments a set of metrics are gathered into vectors. The di erences in the vectors are calculated, where the close vectors (e.g., measured by Euclidean distance) indicate that their corresponding code fragments are clones. Texual Comparison: Text based techniques compare program text, typically line by line, with or without normalizing the text by renaming the identifiers, filtering out the comments and di erences in the layout. Token Based Comparison: The entire program is transformed into a stream of tokens (i.e., individual units/words of meaning) through lexical analysis. Then the token stream is scanned to find similar token subsequences, and the original code portions corresponding to those subsequences are reported as clones. Syntax Comparison: Syntax comparison based techniques are developed on the fact that similar code segments should also have similar syntactic structure. Thus, the program is parsed to produce syntax tree, where the similar subtrees indicate that their corresponding code segments are clones. PDG Based Comparison: For a given program, a set of PDGs (Program Dependency Graphs) [57] are produced based on the data and control dependencies among the statements of the program. The code segments corresponding to the isomorphic subgraphs are identified and reported as clones. Comparison of Low Level Form of Code: Instead of analyzing and comparing textual source code, the techniques analyze the lower level code (e.g., assembly code, Java bytecode) as obtained from the transformation done by the compiler. Other Techniques: Beside the aforementioned prominent techniques for clone 12

19 Table 2: Code Clone Detection Techniques (extended from [126]) Tracking Clipboard Operations copy-pasted code segments are recorded as clones [37, 45, 83, 197] Metrics Comparison comparing metrics for functions [123, 124, 132, 148] comparing metrics for web sites [48, 133] Textual Comparison hashing of strings per line, then textual comparison [95, 96] hashing of strings per line, then visual comparison using dotplots [52] latent semantic indexing for identifiers and comments [146] syntactic pretty-printing and normalization, then textual [164] comparison between lines syntactic pretty-printing and normalization, then comparison [205] of fingerprinted lines using su xtree syntactic pretty-printing and normalization, then hash based [190] comparison of functions/blocks Token-based (Lexical) Comparison su x trees for tokens per line [10, 11, 12] token normalizations, then su x tree/array based search [23, 105, 112] data mining for frequent token sequences [140] Syntax Comparison hashing of syntax trees and tree comparison [25, 198, 8] data mining for frequent syntax subtrees [194] serialization of syntax trees and su x-tree detection [55, 127, 138] derivation of syntax patterns and pattern matching [54] metrics for syntax trees and metric vector comparison [91, 154] PDG Based Comparison approximate search for similar subgraphs in PDGs [61, 78, 81, 120, 128] Comparison of Low Level Form of Code comparing Java bytecode [13, 175] comparison of compiled (assembler) code [43, 44, 170] 13

20 detection, other techniques, such as anti-unification [35, 138], formal methods [175], and combination of distinct techniques [60, 137] were also approached. Tracing of abstract memory states during the execution of the program was also attempted to detect semantic clones [116] Strengths and Weaknesses of Clone Detection Techniques Due to the orthogonal nature of the di erent approaches for clone detection, it is inappropriate rate one technique over another. Each of the techniques have their strengths and weaknesses. In this section, we briefly discuss the strengths and weaknesses of di erent clone detection techniques. More comprehensive qualitative evaluation of the existing clone detection techniques and tools can be found elsewhere [165, 168]. Clone detection using clipboard operations (copy-paste) has the benefit that clones are captured at the time of their creation. However, the technique poses a number of questions and limitations towards pragmatic design decisions to be made for realizing the technique in a tool: How much (granularity) of copy-pasted code should be recorded as clones? Should the copy-pasted code that spans beyond a syntactic block be considered as clone? Copy-pasted code may later be modified and become very di erent from original. Should the very dissimilar code segments still be treated as clones (as done in CloneScape [37]), or some similarity based decision has to be made? Since the tracking of copy-paste activity is coupled with a certain editor, the technique is vulnerable to changes in the source code outside the editor. The technique cannot deal with unintentional/accidental clones, or clones in legacy systems. The e ectiveness of metrics based techniques depends on the selection of a broad set of orthogonal significant metrics. Typically, metrics based techniques may be computationally expensive for the overhead of calculating the chosen metrics. Moreover, due to the fact that di erent code segments may, at times, have the same value for certain metrics, the metric based techniques may report many false positives. The advantage of simple textual comparison is that the technique, in general, is independent of programming languages and the program does not need to be complete of syntactically correct. However, the technique (without advanced normalization and filtering of di erences in layouts) may susceptible to even minor changes in the source code [126]. Due to the use of lexical analysis only, the token based techniques also can operate on incomplete or syntactically incorrect program. But, the token based 14

21 techniques may detect code clones that span beyond the boundaries of syntactic blocks. The techniques based on syntax comparison su er from the overhead of invoking a parser to generate parse tree. Due to the use of a parser, the technique is coupled with programming language syntax. The parser based syntax comparison techniques are very accurate in detecting clones that have similar syntactic structure. Still, such techniques are vulnerable to even subtle di erences in the nesting of code, so may fail to detect many potential clones. Earlier empirical evaluations [28, 27, 9] also suggest low recall of syntax comparison based clone detection techniques. PDG based techniques analyze the source code at a level higher than that of syntax comparison based techniques, and thus can capture the program semantics to some extent. PDG based techniques can detect clones consisting of non-contiguous code. However, producing PDGs is computationally expensive. In addition, finding isomorphic subgraphs from PDGs is an NP-hard problem [126]. Therefore, approximation algorithms are used, which do not aim for the optimum. The PDG based approach is tolerant to reordering of certain program elements, which is both a strength and weakness of the technique. The ordering of statements specify the execution path of a program written in traditional imperative language, but the PDG based techniques often disregard such order and reports clones composed of non-contiguous code segments, there remains quite good chance for false positives in detecting clones of practical significance. Clone detection in the compiled code (e.g., assembly code, binary executable, Java bytecode) requires the source code to be complete and syntactically correct. Moreover, compilation generates the lower level code that often normalizes syntactic variants of source code into a compact canonical representation free from comments and di erences in the layouts as in the source code. This, in principle, should make it easier to capture semantically equivalent constructs that might look di erent in the source code written in a high level language [43]. On the contrary, small di erence in the source code may produce very di erent sequences of bytes in the compiled code, and thus finding code clones based on byte code similarities may be even more di cult than finding clones in source code [13]. Moreover, clone detection based on the analysis of compiled code is typically performed at class level, because distinguishing the portion corresponding the functions, blocks or other level of granularity and mapping those back to locations in the higher level source code can be very challenging. The purpose of clone detection techniques is not limited to only clone management for improved software maintenance. The variable strengths of di erent clone detection techniques lead to their applications in solving diverse research problems such as, plagiarism detection [31, 74], program fault localization [26, 36, 89, 92, 103], malware detection [34, 196], finding crosscutting concern code [33], aspect mining [113, 32], 15

22 quality assessment of requirements specification [101], and more. Indeed, the choice of an appropriate clone detection technique largely depends on the context and the purpose [126] Challenges in the Empirical Evaluation of Clone Detection Tools Typically, the e ectiveness of clone detection is measured in terms of precision (p) and recall (r), p = C s \ C d, r = C s \ C d C d C s where C d denotes the set of clones that a particular clone detection tool identifies, and C s refers to the set of clones actually existing in the code base. But, how can one accurately find all the clones in a system? A reference corpus is necessary to serve as the set of all clones in a system. To obtain this, manual investigation can be a solution for small systems, but the time and e ort needed for such a subjective analysis is likely to be impractical for a fairly large system. However, the reference corpus needs to be large to be able to evaluate the scalability of the clone detector under consideration. Therefore, in practice [190, 205], one or more clone detectors that are already known to be e ective, are applied to detect clones from a system, and the union of all the sets of clones reported by those tools, is used as a reference corpus to examine the precision and recall of the clone detection tool to be evaluated. Using a similar idea, Bellon et al. [28] carried out a case study to compare the e ectiveness of six clone detectors (CCFinder, CloneDr, Dup, Duploc, Duplix and CLAN) representing di erent clone detection techniques. Bellon s study [28], which was performed about six years ago, is still the most comprehensive quantitative study on the comparative evaluation of clone detectors, and the reference benchmark produced from the study is the only notable benchmark data available to date [126]. A similar but smaller scale study was also conducted earlier by Bailey and Burd [9]. Bellon s benchmark data was constructed as the union of the sets of clones reported by the clone detectors subject to the study. Then the precision and recall for each tool was computed based on how many of the benchmark clones a particular tool detected. Thus, the result was skewed in favour of those tools that contributed more in the construction of the benchmark data. Moreover, only 2% clones of the reference set were manually verified. Hence, there is a possibility that the benchmark data might have included significant amount of false positives that could have been reported by those participating tools. The Need for Reference Benchmark: The above discussion suggests that, a reliable large set of benchmark data 1 is still necessary, and new tools should be evalu- 1 Similar to what available at used in comparative studies of algorithms for numerical linear algebra. 16

23 ated with the standard benchmark data before publication [125]. Both precision and recall should be reported, as precision or recall alone draws a partial picture only. Moreover, both the runtime complexity and memory e ciency should be reported, since these two criteria a ect the scalability of a certain technique. Theoretical runtime complexity only can be misleading, because high memory requirement may cause time-consuming page swapping at the operating system level, and increase the execution time in practice [169]. The creation of a reliable large benchmark data set with manually verified clones can be a mammoth task. The tedious work of manual investigation and verification can be accomplished by collaborative e orts from the community. The mutation framework [166] proposed by Roy and Cordy can also be used for the purpose by injecting a set of artificially created (and manually verified) clone fragments in various locations of a certain code base, and the set of synthetic clones can serve the purpose of a reference benchmark. Indeed, the vagueness in the definitions of clones may lead to disagreement in di erent human evaluators perception of clones, as found in the study of Walenstein et al. [195]. Therefore, the purpose of the benchmark data should be defined first, and the definition of similarity should be formalized accordingly, as well as the level of clone granularity. A purpose could be, for example, the evaluation of clone detection techniques for clone maintenance and removal. 3.6 Clone Evolution Software development and maintenance in practice follow a dynamic process. With the growth of the program source, code clones also experience evolution from version to version. Many studies have been conducted to date for understanding the overall evolution [4, 3, 143, 65, 206], stability of cloned code [14, 66, 82, 130, 131, 151], the relation of clone evolution with software faults [19, 69, 179], and other characteristics of clone evolution. While the high studies inform the characteristic and impact of code cloning, more lower level analyses that investigate the change patterns in the evolution of individual clone fragments can suggest techniques for optimizing clone management including refactoring and removal. However, there is a recent survey on clone evolution [159]. Hence, we keep this section brief with specific focus on the evolution of individual clone fragments and their change patterns Clone Genealogy Kim et al. [117, 118] first coined the term clone genealogy, which refers to a set of one of more lineage(s) originating from the same clone-group. A clone lineage is a sequence of clone-groups evolving over a series of versions (revision or release) of the software system. Thus, a clone genealogy (Figure 4) describes the evolution history of a particular clone-group over subsequent versions of the system. The extraction 17

24 of clone genealogies from a series of versions of a program has been identified as the fundamental task to study the evolution of individual clone-groups. Figure 4: A clone genealogy with two lineages over versions v k through v k+3 [201] Genealogy Extraction: To map clones across subsequent versions of a program (i.e. extraction of clone genealogies) mainly four di erent approaches have been found in the literature, as summarized in Table 3. While most of these approaches [19, 14, 118, 172, 7] focused on genealogies of Type-1 and Type-2 clones, gcad [171, 173] is the only Type-3 clone genealogy extractor to date released as a separate tool Clone Change Patterns Recent genealogy based studies [118, 172, 173] on clone change patterns characterized the evolution of a particular clone-group based on the following four categories of transitions between subsequent versions of the software system. Addition/Grow: One or more clone fragments is added to the clone-group in the next version. Deletion/Shrink: One or more clones disappeared from the clone-group in the next version. Consistent Change: All clone in the clone-group experience the same set of changes during transition to the next version. 18

25 Table 3: Summary of techniques for clone genealogy extraction Approach Strength Weakness Citation Separate clone detection from each version, and then similarity based heuristic mapping of clones in pairs of subsequent versions Clones detected from the first version are mapped to consecutive versions based on change logs obtained from source code repositories Clones are mapped during the incremental clone detection that used source code changes between revisions Separate clone detection from each version, functions are mapped across subsequent versions, then clones are mapped based on the mapped functions flexible quadratic runtime [70], susceptible to large change in clone faster than the above technique faster than the above two techniques improved runtime can miss the clones introduced after the first version [171] cannot operate on the clone detection results obtained from traditional non-incremental tools [171] susceptible to similar overloaded/overriden functions [14, 118, 172] [7, 19, 129, 187] [68, 155] [145, 171, 173] *A combination of the first and second approaches was also used in some studies [29] Inconsistent Change: During transition to the next version, at least one clone in the clone-group experiences changes that are inconsistent with changes in other member(s) of the clone-group. Same/No Change The clone-group moves to the next version without experiencing any change in the members. On the basis of the aforementioned categorization, clone genealogies are classified as follows: Static Genealogy: A genealogy where the clones in the clone-group do not experience any change over the series of subsequent version. Inconsistently Changed Genealogies: A genealogy where one or more clones in the clone-group experience inconsistent changes during transition between any two subsequent versions. Consistently Changed Genealogies: A genealogy where none of the clones in the clone-group experience inconsistent changes during transition to subsequent versions. Dead Genealogy: A genealogy that does not survive till the last version (i.e. before the last version the clone-group disappears). 19

26 Figure 5: Clone change patterns and types of genealogies [171] Alive Genealogy: A genealogy that survives at the last version of the software system. Aversano et al. [7] further caracterized inconsistent change pattern into two categories: Independent Evolution: In independent evolution, the clones of a clone-group, once changed inconsistently, evolve independently across versions. Thus, independent evolution may cause branching with multiple lineages in a genealogy. Late Propagation: In late propagation, one or more clones in a clone-group may experience inconsistent changes and may disappear from the clone-group in the next version, but at a later version those clones re-appear in the clone-group. Studies [14, 29] on clone change patterns revealed that inconsistent changes in clones sometimes caused program faults. Moreover, late propagation is reported to have even more significant correlation with software defects and thus concluded to be more risky than other clone genealogies [19]. Using the clone genealogy model, recently Saha [171] studied the change patterns in the evolution of clones in five open-source software systems. Major findings from the study report some common properties of frequently changed clone-groups, such as, cohesive clone-groups having small number of clones experienced more changes than 20

Lecture 25 Clone Detection CCFinder. EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Lecture 25 Clone Detection CCFinder. EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Lecture 25 Clone Detection CCFinder Today s Agenda (1) Recap of Polymetric Views Class Presentation Suchitra (advocate) Reza (skeptic) Today s Agenda (2) CCFinder, Kamiya et al. TSE 2002 Recap of Polymetric

More information

Automatic Identification of Important Clones for Refactoring and Tracking

Automatic Identification of Important Clones for Refactoring and Tracking Automatic Identification of Important Clones for Refactoring and Tracking Manishankar Mondal Chanchal K. Roy Kevin A. Schneider Department of Computer Science, University of Saskatchewan, Canada {mshankar.mondal,

More information

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price.

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price. Code Duplication New Proposal Dolores Zage, Wayne Zage Ball State University June 1, 2017 July 31, 2018 Long Term Goals The goal of this project is to enhance the identification of code duplication which

More information

Software Clone Detection. Kevin Tang Mar. 29, 2012

Software Clone Detection. Kevin Tang Mar. 29, 2012 Software Clone Detection Kevin Tang Mar. 29, 2012 Software Clone Detection Introduction Reasons for Code Duplication Drawbacks of Code Duplication Clone Definitions in the Literature Detection Techniques

More information

An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies

An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies Ripon K. Saha Chanchal K. Roy Kevin A. Schneider Department of Computer Science, University of Saskatchewan, Canada {ripon.saha,

More information

Searching for Configurations in Clone Evaluation A Replication Study

Searching for Configurations in Clone Evaluation A Replication Study Searching for Configurations in Clone Evaluation A Replication Study Chaiyong Ragkhitwetsagul 1, Matheus Paixao 1, Manal Adham 1 Saheed Busari 1, Jens Krinke 1 and John H. Drake 2 1 University College

More information

Token based clone detection using program slicing

Token based clone detection using program slicing Token based clone detection using program slicing Rajnish Kumar PEC University of Technology Rajnish_pawar90@yahoo.com Prof. Shilpa PEC University of Technology Shilpaverma.pec@gmail.com Abstract Software

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

Clone Detection using Textual and Metric Analysis to figure out all Types of Clones

Clone Detection using Textual and Metric Analysis to figure out all Types of Clones Detection using Textual and Metric Analysis to figure out all Types of s Kodhai.E 1, Perumal.A 2, and Kanmani.S 3 1 SMVEC, Dept. of Information Technology, Puducherry, India Email: kodhaiej@yahoo.co.in

More information

A Tree Kernel Based Approach for Clone Detection

A Tree Kernel Based Approach for Clone Detection A Tree Kernel Based Approach for Clone Detection Anna Corazza 1, Sergio Di Martino 1, Valerio Maggio 1, Giuseppe Scanniello 2 1) University of Naples Federico II 2) University of Basilicata Outline Background

More information

A Novel Technique for Retrieving Source Code Duplication

A Novel Technique for Retrieving Source Code Duplication A Novel Technique for Retrieving Source Code Duplication Yoshihisa Udagawa Computer Science Department, Faculty of Engineering Tokyo Polytechnic University Atsugi-city, Kanagawa, Japan udagawa@cs.t-kougei.ac.jp

More information

Detection of Non Continguous Clones in Software using Program Slicing

Detection of Non Continguous Clones in Software using Program Slicing Detection of Non Continguous Clones in Software using Program Slicing Er. Richa Grover 1 Er. Narender Rana 2 M.Tech in CSE 1 Astt. Proff. In C.S.E 2 GITM, Kurukshetra University, INDIA Abstract Code duplication

More information

ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools

ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools Jeffrey Svajlenko Chanchal K. Roy University of Saskatchewan, Canada {jeff.svajlenko, chanchal.roy}@usask.ca Slawomir

More information

Keywords Clone detection, metrics computation, hybrid approach, complexity, byte code

Keywords Clone detection, metrics computation, hybrid approach, complexity, byte code Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com An Emerging Approach

More information

Code Clone Detector: A Hybrid Approach on Java Byte Code

Code Clone Detector: A Hybrid Approach on Java Byte Code Code Clone Detector: A Hybrid Approach on Java Byte Code Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Software Engineering Submitted By

More information

Implementation Techniques

Implementation Techniques V Implementation Techniques 34 Efficient Evaluation of the Valid-Time Natural Join 35 Efficient Differential Timeslice Computation 36 R-Tree Based Indexing of Now-Relative Bitemporal Data 37 Light-Weight

More information

COMPARISON AND EVALUATION ON METRICS

COMPARISON AND EVALUATION ON METRICS COMPARISON AND EVALUATION ON METRICS BASED APPROACH FOR DETECTING CODE CLONE D. Gayathri Devi 1 1 Department of Computer Science, Karpagam University, Coimbatore, Tamilnadu dgayadevi@gmail.com Abstract

More information

Human Error Taxonomy

Human Error Taxonomy Human Error Taxonomy The Human Error Taxonomy (HET) provides a structure for requirement errors made during the software development process. The HET can be employed during software inspection to help

More information

A Measurement of Similarity to Identify Identical Code Clones

A Measurement of Similarity to Identify Identical Code Clones The International Arab Journal of Information Technology, Vol. 12, No. 6A, 2015 735 A Measurement of Similarity to Identify Identical Code Clones Mythili ShanmughaSundaram and Sarala Subramani Department

More information

Quantifying and Assessing the Merge of Cloned Web-Based System: An Exploratory Study

Quantifying and Assessing the Merge of Cloned Web-Based System: An Exploratory Study Quantifying and Assessing the Merge of Cloned Web-Based System: An Exploratory Study Jadson Santos Department of Informatics and Applied Mathematics Federal University of Rio Grande do Norte, UFRN Natal,

More information

Rearranging the Order of Program Statements for Code Clone Detection

Rearranging the Order of Program Statements for Code Clone Detection Rearranging the Order of Program Statements for Code Clone Detection Yusuke Sabi, Yoshiki Higo, Shinji Kusumoto Graduate School of Information Science and Technology, Osaka University, Japan Email: {y-sabi,higo,kusumoto@ist.osaka-u.ac.jp

More information

Enhancing Program Dependency Graph Based Clone Detection Using Approximate Subgraph Matching

Enhancing Program Dependency Graph Based Clone Detection Using Approximate Subgraph Matching Enhancing Program Dependency Graph Based Clone Detection Using Approximate Subgraph Matching A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE AWARD OF THE DEGREE OF MASTER OF

More information

Categorizing Migrations

Categorizing Migrations What to Migrate? Categorizing Migrations A version control repository contains two distinct types of data. The first type of data is the actual content of the directories and files themselves which are

More information

code pattern analysis of object-oriented programming languages

code pattern analysis of object-oriented programming languages code pattern analysis of object-oriented programming languages by Xubo Miao A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science Queen s

More information

Harmonization of usability measurements in ISO9126 software engineering standards

Harmonization of usability measurements in ISO9126 software engineering standards Harmonization of usability measurements in ISO9126 software engineering standards Laila Cheikhi, Alain Abran and Witold Suryn École de Technologie Supérieure, 1100 Notre-Dame Ouest, Montréal, Canada laila.cheikhi.1@ens.etsmtl.ca,

More information

Machine-Independent Optimizations

Machine-Independent Optimizations Chapter 9 Machine-Independent Optimizations High-level language constructs can introduce substantial run-time overhead if we naively translate each construct independently into machine code. This chapter

More information

Visualization of Clone Detection Results

Visualization of Clone Detection Results Visualization of Clone Detection Results Robert Tairas and Jeff Gray Department of Computer and Information Sciences University of Alabama at Birmingham Birmingham, AL 5294-1170 1-205-94-221 {tairasr,

More information

An Empirical Study on Clone Stability

An Empirical Study on Clone Stability An Empirical Study on Clone Stability Manishankar Mondal Chanchal K. Roy Kevin A. Schneider Department of Computer Science, University of Saskatchewan, Canada {mshankar.mondal, chanchal.roy, kevin.schneider}@usask.ca

More information

Part I: Preliminaries 24

Part I: Preliminaries 24 Contents Preface......................................... 15 Acknowledgements................................... 22 Part I: Preliminaries 24 1. Basics of Software Testing 25 1.1. Humans, errors, and testing.............................

More information

Keywords Code cloning, Clone detection, Software metrics, Potential clones, Clone pairs, Clone classes. Fig. 1 Code with clones

Keywords Code cloning, Clone detection, Software metrics, Potential clones, Clone pairs, Clone classes. Fig. 1 Code with clones Volume 4, Issue 4, April 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Detection of Potential

More information

A Survey of Software Clone Detection Techniques

A Survey of Software Clone Detection Techniques A Survey of Software Detection Techniques Abdullah Sheneamer Department of Computer Science University of Colorado at Colo. Springs, USA Colorado Springs, USA asheneam@uccs.edu Jugal Kalita Department

More information

Variability Implementation Techniques for Platforms and Services (Interim)

Variability Implementation Techniques for Platforms and Services (Interim) Engineering Virtual Domain-Specific Service Platforms Specific Targeted Research Project: FP7-ICT-2009-5 / 257483 Variability Implementation Techniques for Platforms and Services (Interim) Abstract Creating

More information

Refactoring Support Based on Code Clone Analysis

Refactoring Support Based on Code Clone Analysis Refactoring Support Based on Code Clone Analysis Yoshiki Higo 1,Toshihiro Kamiya 2, Shinji Kusumoto 1 and Katsuro Inoue 1 1 Graduate School of Information Science and Technology, Osaka University, Toyonaka,

More information

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology Preface The idea of improving software quality through reuse is not new. After all, if software works and is needed, just reuse it. What is new and evolving is the idea of relative validation through testing

More information

An Exploratory Study on Interface Similarities in Code Clones

An Exploratory Study on Interface Similarities in Code Clones 1 st WETSoDA, December 4, 2017 - Nanjing, China An Exploratory Study on Interface Similarities in Code Clones Md Rakib Hossain Misu, Abdus Satter, Kazi Sakib Institute of Information Technology University

More information

What do Compilers Produce?

What do Compilers Produce? What do Compilers Produce? Pure Machine Code Compilers may generate code for a particular machine, not assuming any operating system or library routines. This is pure code because it includes nothing beyond

More information

Basic Concepts of Reliability

Basic Concepts of Reliability Basic Concepts of Reliability Reliability is a broad concept. It is applied whenever we expect something to behave in a certain way. Reliability is one of the metrics that are used to measure quality.

More information

On Refactoring for Open Source Java Program

On Refactoring for Open Source Java Program On Refactoring for Open Source Java Program Yoshiki Higo 1,Toshihiro Kamiya 2, Shinji Kusumoto 1, Katsuro Inoue 1 and Yoshio Kataoka 3 1 Graduate School of Information Science and Technology, Osaka University

More information

Reverse Software Engineering Using UML tools Jalak Vora 1 Ravi Zala 2

Reverse Software Engineering Using UML tools Jalak Vora 1 Ravi Zala 2 IJSRD - International Journal for Scientific Research & Development Vol. 2, Issue 03, 2014 ISSN (online): 2321-0613 Reverse Software Engineering Using UML tools Jalak Vora 1 Ravi Zala 2 1, 2 Department

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Folding Repeated Instructions for Improving Token-based Code Clone Detection

Folding Repeated Instructions for Improving Token-based Code Clone Detection 2012 IEEE 12th International Working Conference on Source Code Analysis and Manipulation Folding Repeated Instructions for Improving Token-based Code Clone Detection Hiroaki Murakami, Keisuke Hotta, Yoshiki

More information

To Enhance Type 4 Clone Detection in Clone Testing Swati Sharma #1, Priyanka Mehta #2 1 M.Tech Scholar,

To Enhance Type 4 Clone Detection in Clone Testing Swati Sharma #1, Priyanka Mehta #2 1 M.Tech Scholar, To Enhance Type 4 Clone Detection in Clone Testing Swati Sharma #1, Priyanka Mehta #2 1 M.Tech Scholar, 2 Head of Department, Department of Computer Science & Engineering, Universal Institute of Engineering

More information

Recalling the definition of design as set of models let's consider the modeling of some real software.

Recalling the definition of design as set of models let's consider the modeling of some real software. Software Design and Architectures SE-2 / SE426 / CS446 / ECE426 Lecture 3 : Modeling Software Software uniquely combines abstract, purely mathematical stuff with physical representation. There are numerous

More information

A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique

A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique 1 Syed MohdFazalulHaque, 2 Dr. V Srikanth, 3 Dr. E. Sreenivasa Reddy 1 Maulana Azad National Urdu University, 2 Professor,

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Study and Analysis of Object-Oriented Languages using Hybrid Clone Detection Technique

Study and Analysis of Object-Oriented Languages using Hybrid Clone Detection Technique Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 6 (2017) pp. 1635-1649 Research India Publications http://www.ripublication.com Study and Analysis of Object-Oriented

More information

The Structure of a Syntax-Directed Compiler

The Structure of a Syntax-Directed Compiler Source Program (Character Stream) Scanner Tokens Parser Abstract Syntax Tree Type Checker (AST) Decorated AST Translator Intermediate Representation Symbol Tables Optimizer (IR) IR Code Generator Target

More information

Architectural Design. Architectural Design. Software Architecture. Architectural Models

Architectural Design. Architectural Design. Software Architecture. Architectural Models Architectural Design Architectural Design Chapter 6 Architectural Design: -the design the desig process for identifying: - the subsystems making up a system and - the relationships between the subsystems

More information

International Journal for Management Science And Technology (IJMST)

International Journal for Management Science And Technology (IJMST) Volume 4; Issue 03 Manuscript- 1 ISSN: 2320-8848 (Online) ISSN: 2321-0362 (Print) International Journal for Management Science And Technology (IJMST) GENERATION OF SOURCE CODE SUMMARY BY AUTOMATIC IDENTIFICATION

More information

HOW AND WHEN TO FLATTEN JAVA CLASSES?

HOW AND WHEN TO FLATTEN JAVA CLASSES? HOW AND WHEN TO FLATTEN JAVA CLASSES? Jehad Al Dallal Department of Information Science, P.O. Box 5969, Safat 13060, Kuwait ABSTRACT Improving modularity and reusability are two key objectives in object-oriented

More information

Refactoring Practice: How it is and How it Should be Supported

Refactoring Practice: How it is and How it Should be Supported Refactoring Practice: How it is and How it Should be Supported Zhenchang Xing and EleniStroulia Presented by: Sultan Almaghthawi 1 Outline Main Idea Related Works/Literature Alignment Overview of the Case

More information

Cross Language Higher Level Clone Detection- Between Two Different Object Oriented Programming Language Source Codes

Cross Language Higher Level Clone Detection- Between Two Different Object Oriented Programming Language Source Codes Cross Language Higher Level Clone Detection- Between Two Different Object Oriented Programming Language Source Codes 1 K. Vidhya, 2 N. Sumathi, 3 D. Ramya, 1, 2 Assistant Professor 3 PG Student, Dept.

More information

Chapter 6 Architectural Design

Chapter 6 Architectural Design Chapter 6 Architectural Design Chapter 6 Architectural Design Slide 1 Topics covered The WHAT and WHY of architectural design Architectural design decisions Architectural views/perspectives Architectural

More information

Compilers and Interpreters

Compilers and Interpreters Overview Roadmap Language Translators: Interpreters & Compilers Context of a compiler Phases of a compiler Compiler Construction tools Terminology How related to other CS Goals of a good compiler 1 Compilers

More information

Business Rules Extracted from Code

Business Rules Extracted from Code 1530 E. Dundee Rd., Suite 100 Palatine, Illinois 60074 United States of America Technical White Paper Version 2.2 1 Overview The purpose of this white paper is to describe the essential process for extracting

More information

Software Clone Detection and Refactoring

Software Clone Detection and Refactoring Software Clone Detection and Refactoring Francesca Arcelli Fontana *, Marco Zanoni *, Andrea Ranchetti * and Davide Ranchetti * * University of Milano-Bicocca, Viale Sarca, 336, 20126 Milano, Italy, {arcelli,marco.zanoni}@disco.unimib.it,

More information

Sub-clones: Considering the Part Rather than the Whole

Sub-clones: Considering the Part Rather than the Whole Sub-clones: Considering the Part Rather than the Whole Robert Tairas 1 and Jeff Gray 2 1 Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, AL 2 Department

More information

Chapter 4. Fundamental Concepts and Models

Chapter 4. Fundamental Concepts and Models Chapter 4. Fundamental Concepts and Models 4.1 Roles and Boundaries 4.2 Cloud Characteristics 4.3 Cloud Delivery Models 4.4 Cloud Deployment Models The upcoming sections cover introductory topic areas

More information

Impact of Dependency Graph in Software Testing

Impact of Dependency Graph in Software Testing Impact of Dependency Graph in Software Testing Pardeep Kaur 1, Er. Rupinder Singh 2 1 Computer Science Department, Chandigarh University, Gharuan, Punjab 2 Assistant Professor, Computer Science Department,

More information

challenges in domain-specific modeling raphaël mannadiar august 27, 2009

challenges in domain-specific modeling raphaël mannadiar august 27, 2009 challenges in domain-specific modeling raphaël mannadiar august 27, 2009 raphaël mannadiar challenges in domain-specific modeling 1/59 outline 1 introduction 2 approaches 3 debugging and simulation 4 differencing

More information

Automatic Mining of Functionally Equivalent Code Fragments via Random Testing. Lingxiao Jiang and Zhendong Su

Automatic Mining of Functionally Equivalent Code Fragments via Random Testing. Lingxiao Jiang and Zhendong Su Automatic Mining of Functionally Equivalent Code Fragments via Random Testing Lingxiao Jiang and Zhendong Su Cloning in Software Development How New Software Product Cloning in Software Development Search

More information

Dealing with Clones in Software : A Practical Approach from Detection towards Management

Dealing with Clones in Software : A Practical Approach from Detection towards Management Dealing with Clones in Software : A Practical Approach from Detection towards Management A Thesis Submitted to the College of Graduate Studies and Research in Partial Fulfillment of the Requirements for

More information

MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY

MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY Anna Corazza, Sergio Di Martino, Valerio Maggio Alessandro Moschitti, Andrea Passerini, Giuseppe Scanniello, Fabrizio Silverstri JIMSE 2012 August 28, 2012

More information

International Journal of Scientific & Engineering Research, Volume 8, Issue 2, February ISSN

International Journal of Scientific & Engineering Research, Volume 8, Issue 2, February ISSN International Journal of Scientific & Engineering Research, Volume 8, Issue 2, February-2017 164 DETECTION OF SOFTWARE REFACTORABILITY THROUGH SOFTWARE CLONES WITH DIFFRENT ALGORITHMS Ritika Rani 1,Pooja

More information

Existing Model Metrics and Relations to Model Quality

Existing Model Metrics and Relations to Model Quality Existing Model Metrics and Relations to Model Quality Parastoo Mohagheghi, Vegard Dehlen WoSQ 09 ICT 1 Background In SINTEF ICT, we do research on Model-Driven Engineering and develop methods and tools:

More information

Part 5. Verification and Validation

Part 5. Verification and Validation Software Engineering Part 5. Verification and Validation - Verification and Validation - Software Testing Ver. 1.7 This lecture note is based on materials from Ian Sommerville 2006. Anyone can use this

More information

An Object Oriented Runtime Complexity Metric based on Iterative Decision Points

An Object Oriented Runtime Complexity Metric based on Iterative Decision Points An Object Oriented Runtime Complexity Metric based on Iterative Amr F. Desouky 1, Letha H. Etzkorn 2 1 Computer Science Department, University of Alabama in Huntsville, Huntsville, AL, USA 2 Computer Science

More information

Semantics, Metadata and Identifying Master Data

Semantics, Metadata and Identifying Master Data Semantics, Metadata and Identifying Master Data A DataFlux White Paper Prepared by: David Loshin, President, Knowledge Integrity, Inc. Once you have determined that your organization can achieve the benefits

More information

Formats of Translated Programs

Formats of Translated Programs Formats of Translated Programs Compilers differ in the format of the target code they generate. Target formats may be categorized as assembly language, relocatable binary, or memory-image. Assembly Language

More information

Software Design Fundamentals. CSCE Lecture 11-09/27/2016

Software Design Fundamentals. CSCE Lecture 11-09/27/2016 Software Design Fundamentals CSCE 740 - Lecture 11-09/27/2016 Today s Goals Define design Introduce the design process Overview of design criteria What results in a good design? Gregory Gay CSCE 740 -

More information

Component-Based Software Engineering TIP

Component-Based Software Engineering TIP Component-Based Software Engineering TIP X LIU, School of Computing, Napier University This chapter will present a complete picture of how to develop software systems with components and system integration.

More information

Clone Detection Using Dependence. Analysis and Lexical Analysis. Final Report

Clone Detection Using Dependence. Analysis and Lexical Analysis. Final Report Clone Detection Using Dependence Analysis and Lexical Analysis Final Report Yue JIA 0636332 Supervised by Professor Mark Harman Department of Computer Science King s College London September 2007 Acknowledgments

More information

Identification of Crosscutting Concerns: A Survey

Identification of Crosscutting Concerns: A Survey Identification of Crosscutting Concerns: A Survey Arvinder Kaur University School of IT GGSIP, University, Delhi arvinder70@gmail.com Kalpana Johari CDAC, Noida kalpanajohari@cdacnoida.in Abstract Modularization

More information

Large-Scale Clone Detection and Benchmarking

Large-Scale Clone Detection and Benchmarking Large-Scale Clone Detection and Benchmarking A Thesis Submitted to the College of Graduate and Postdoctoral Studies in Partial Fulfillment of the Requirements for the degree of Doctor of Philosophy in

More information

Rochester Institute of Technology. Making personalized education scalable using Sequence Alignment Algorithm

Rochester Institute of Technology. Making personalized education scalable using Sequence Alignment Algorithm Rochester Institute of Technology Making personalized education scalable using Sequence Alignment Algorithm Submitted by: Lakhan Bhojwani Advisor: Dr. Carlos Rivero 1 1. Abstract There are many ways proposed

More information

The Analysis and Proposed Modifications to ISO/IEC Software Engineering Software Quality Requirements and Evaluation Quality Requirements

The Analysis and Proposed Modifications to ISO/IEC Software Engineering Software Quality Requirements and Evaluation Quality Requirements Journal of Software Engineering and Applications, 2016, 9, 112-127 Published Online April 2016 in SciRes. http://www.scirp.org/journal/jsea http://dx.doi.org/10.4236/jsea.2016.94010 The Analysis and Proposed

More information

Handout 9: Imperative Programs and State

Handout 9: Imperative Programs and State 06-02552 Princ. of Progr. Languages (and Extended ) The University of Birmingham Spring Semester 2016-17 School of Computer Science c Uday Reddy2016-17 Handout 9: Imperative Programs and State Imperative

More information

Definition of Information Systems

Definition of Information Systems Information Systems Modeling To provide a foundation for the discussions throughout this book, this chapter begins by defining what is actually meant by the term information system. The focus is on model-driven

More information

Coding and Unit Testing! The Coding Phase! Coding vs. Code! Coding! Overall Coding Language Trends!

Coding and Unit Testing! The Coding Phase! Coding vs. Code! Coding! Overall Coding Language Trends! Requirements Spec. Design Coding and Unit Testing Characteristics of System to be built must match required characteristics (high level) Architecture consistent views Software Engineering Computer Science

More information

Clone Detection Using Abstract Syntax Suffix Trees

Clone Detection Using Abstract Syntax Suffix Trees Clone Detection Using Abstract Syntax Suffix Trees Rainer Koschke, Raimar Falke, Pierre Frenzel University of Bremen, Germany http://www.informatik.uni-bremen.de/st/ {koschke,rfalke,saint}@informatik.uni-bremen.de

More information

Software Clone Detection Using Cosine Distance Similarity

Software Clone Detection Using Cosine Distance Similarity Software Clone Detection Using Cosine Distance Similarity A Dissertation SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE AWARD OF DEGREE OF MASTER OF TECHNOLOGY IN COMPUTER SCIENCE & ENGINEERING

More information

CERT C++ COMPLIANCE ENFORCEMENT

CERT C++ COMPLIANCE ENFORCEMENT CERT C++ COMPLIANCE ENFORCEMENT AUTOMATED SOURCE CODE ANALYSIS TO MAINTAIN COMPLIANCE SIMPLIFY AND STREAMLINE CERT C++ COMPLIANCE The CERT C++ compliance module reports on dataflow problems, software defects,

More information

Small Scale Detection

Small Scale Detection Software Reengineering SRe2LIC Duplication Lab Session February 2008 Code duplication is the top Bad Code Smell according to Kent Beck. In the lecture it has been said that duplication in general has many

More information

On the Robustness of Clone Detection to Code Obfuscation

On the Robustness of Clone Detection to Code Obfuscation On the Robustness of Clone Detection to Code Obfuscation Sandro Schulze TU Braunschweig Braunschweig, Germany sandro.schulze@tu-braunschweig.de Daniel Meyer University of Magdeburg Magdeburg, Germany Daniel3.Meyer@st.ovgu.de

More information

Aspect Refactoring Verifier

Aspect Refactoring Verifier Aspect Refactoring Verifier Charles Zhang and Julie Waterhouse Hans-Arno Jacobsen Centers for Advanced Studies Department of Electrical and IBM Toronto Lab Computer Engineering juliew@ca.ibm.com and Department

More information

Patent documents usecases with MyIntelliPatent. Alberto Ciaramella IntelliSemantic 25/11/2012

Patent documents usecases with MyIntelliPatent. Alberto Ciaramella IntelliSemantic 25/11/2012 Patent documents usecases with MyIntelliPatent Alberto Ciaramella IntelliSemantic 25/11/2012 Objectives and contents of this presentation This presentation: identifies and motivates the most significant

More information

CHAPTER 9 DESIGN ENGINEERING. Overview

CHAPTER 9 DESIGN ENGINEERING. Overview CHAPTER 9 DESIGN ENGINEERING Overview A software design is a meaningful engineering representation of some software product that is to be built. Designers must strive to acquire a repertoire of alternative

More information

Computation Independent Model (CIM): Platform Independent Model (PIM): Platform Specific Model (PSM): Implementation Specific Model (ISM):

Computation Independent Model (CIM): Platform Independent Model (PIM): Platform Specific Model (PSM): Implementation Specific Model (ISM): viii Preface The software industry has evolved to tackle new approaches aligned with the Internet, object-orientation, distributed components and new platforms. However, the majority of the large information

More information

1DL321: Kompilatorteknik I (Compiler Design 1) Introduction to Programming Language Design and to Compilation

1DL321: Kompilatorteknik I (Compiler Design 1) Introduction to Programming Language Design and to Compilation 1DL321: Kompilatorteknik I (Compiler Design 1) Introduction to Programming Language Design and to Compilation Administrivia Lecturer: Kostis Sagonas (kostis@it.uu.se) Course home page: http://www.it.uu.se/edu/course/homepage/komp/h18

More information

Taxonomy Dimensions of Complexity Metrics

Taxonomy Dimensions of Complexity Metrics 96 Int'l Conf. Software Eng. Research and Practice SERP'15 Taxonomy Dimensions of Complexity Metrics Bouchaib Falah 1, Kenneth Magel 2 1 Al Akhawayn University, Ifrane, Morocco, 2 North Dakota State University,

More information

Refactoring and Rearchitecturing

Refactoring and Rearchitecturing Refactoring and Rearchitecturing Overview Introduction Refactoring vs reachitecting Exploring the situation Legacy code Code written by others Code already written Not supported code Code without automated

More information

1DL321: Kompilatorteknik I (Compiler Design 1)

1DL321: Kompilatorteknik I (Compiler Design 1) Administrivia 1DL321: Kompilatorteknik I (Compiler Design 1) Introduction to Programming Language Design and to Compilation Lecturer: Kostis Sagonas (kostis@it.uu.se) Course home page: http://www.it.uu.se/edu/course/homepage/komp/ht16

More information

5/9/2014. Recall the design process. Lecture 1. Establishing the overall structureof a software system. Topics covered

5/9/2014. Recall the design process. Lecture 1. Establishing the overall structureof a software system. Topics covered Topics covered Chapter 6 Architectural Design Architectural design decisions Architectural views Architectural patterns Application architectures Lecture 1 1 2 Software architecture The design process

More information

TDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended.

TDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended. Previews of TDWI course books offer an opportunity to see the quality of our material and help you to select the courses that best fit your needs. The previews cannot be printed. TDWI strives to provide

More information

Ans 1-j)True, these diagrams show a set of classes, interfaces and collaborations and their relationships.

Ans 1-j)True, these diagrams show a set of classes, interfaces and collaborations and their relationships. Q 1) Attempt all the following questions: (a) Define the term cohesion in the context of object oriented design of systems? (b) Do you need to develop all the views of the system? Justify your answer?

More information

Capturing and Formalizing SAF Availability Management Framework Configuration Requirements

Capturing and Formalizing SAF Availability Management Framework Configuration Requirements Capturing and Formalizing SAF Availability Management Framework Configuration Requirements A. Gherbi, P. Salehi, F. Khendek and A. Hamou-Lhadj Electrical and Computer Engineering, Concordia University,

More information

Automatic Clone Recommendation for Refactoring Based on the Present and the Past

Automatic Clone Recommendation for Refactoring Based on the Present and the Past Automatic Clone Recommendation for Refactoring Based on the Present and the Past Ruru Yue, Zhe Gao, Na Meng Yingfei Xiong, Xiaoyin Wang J. David Morgenthaler Key Laboratory of High Confidence Software

More information

SourcererCC -- Scaling Code Clone Detection to Big-Code

SourcererCC -- Scaling Code Clone Detection to Big-Code SourcererCC -- Scaling Code Clone Detection to Big-Code What did this paper do? SourcererCC a token-based clone detector, that can detect both exact and near-miss clones from large inter project repositories

More information

Chapter 6 Architectural Design. Lecture 1. Chapter 6 Architectural design

Chapter 6 Architectural Design. Lecture 1. Chapter 6 Architectural design Chapter 6 Architectural Design Lecture 1 1 Topics covered ² Architectural design decisions ² Architectural views ² Architectural patterns ² Application architectures 2 Software architecture ² The design

More information

CnP: Towards an Environment for the Proactive Management of Copy-and-Paste Programming

CnP: Towards an Environment for the Proactive Management of Copy-and-Paste Programming CnP: Towards an Environment for the Proactive Management of Copy-and-Paste Programming Daqing Hou, Patricia Jablonski, and Ferosh Jacob Electrical and Computer Engineering, Clarkson University, Potsdam,

More information