Lecture 25 Clone Detection CCFinder. EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Today's Agenda (1) Recap of Polymetric Views. Class presentation: Suchitra (advocate), Reza (skeptic).

Today's Agenda (2) CCFinder, Kamiya et al., TSE 2002

Recap of Polymetric Views A polymetric view is a customizable software visualization enriched with software metrics. The tool targets initial understanding of a legacy system and can help programmers develop a high-level mental model. It is simple, powerful, scalable, and customizable; however, it requires some training to read the generated views.

Class Presentation Suchitra (advocate) Reza (skeptic)

CCFinder CCFinder: A multilinguistic token-based code clone detection system for large scale source code, Kamiya et al. TSE 2002

Definition of Code Clones There is no precise or consistent definition of what clones are. A common working definition: a code portion in source files that is identical or similar to another. Clones are often defined operationally, by what a particular clone detector reports.

When and Why do programmers create clones?

When and Why do programmers create clones? What we have is slightly different from what we want. When reusing code as a mental macro template. Due to programming language limitations. Legacy code is well-tested and often reliable. Management reasons: a team does not want to create a dependency on another team's code; a team does not support other teams' usage scenarios and customization. Automatic code generation.

Why is code cloning a problem during software evolution?

Why is code cloning a problem during software evolution? When a fault is found in one system, the fix may have to be propagated to its counterpart systems. When cloned systems require similar changes, all of them need to be modified consistently. A missed update to one of the clones can introduce a bug. Redundant development effort. Code plagiarism.

Research problem addressed by CCFinder How can we find clones written in popular programming languages in a fast and scalable way? Industrial strength: handle million-line systems within affordable computation time and memory. Can use heuristics for finding helpful clones. Robust to renaming and small edits. Limited use of language-dependent clone detection.

Approach Language-dependent parts: lexical analysis; rule-based source transformation. Language-independent parts: suffix-tree matching algorithm for matching token sequences.
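The split between the two parts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tokenizer regex and the function names are hypothetical, and the suffix tree is replaced by a much simpler (and far less scalable) comparison of neighboring suffixes in sorted order. Because only neighbors are compared, three or more copies of a fragment are reported as a chain of pairs rather than as every pair.

```python
import re

# Hypothetical tokenizer: identifiers, integers, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Language-dependent part (simplified): turn source text into tokens."""
    return TOKEN_RE.findall(source)


def common_prefix_len(a, b):
    """Number of leading tokens shared by two token lists."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def find_clone_pairs(tokens, min_len):
    """Language-independent part: report repeated token runs.

    Naive stand-in for suffix-tree matching: sort all suffixes and
    compare each suffix with its neighbor in sorted order.
    Returns a sorted list of (start_a, start_b, length) tuples.
    """
    order = sorted(range(len(tokens)), key=lambda i: tokens[i:])
    pairs = []
    for x, y in zip(order, order[1:]):
        a, b = min(x, y), max(x, y)
        length = min(common_prefix_len(tokens[a:], tokens[b:]), b - a)
        # Keep only matches that cannot be extended to the left.
        left_maximal = a == 0 or tokens[a - 1] != tokens[b - 1]
        if length >= min_len and left_maximal:
            pairs.append((a, b, length))
    return sorted(pairs)


tokens = tokenize("x = a + b ; y = 1 ; x = a + b ;")
print(find_clone_pairs(tokens, 4))  # [(0, 10, 6)]
```

The six tokens starting at position 0 reappear at position 10, so a single clone pair of length 6 is reported.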

Rule-based Transformations Remove package names. Supplement callees. Remove initialization lists. Separate class definitions. Remove accessibility keywords. Convert to compound block.
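As a small illustration of one rule, "remove accessibility keywords" can be written as a filter over a token list. This is a sketch, not CCFinder's implementation: the keyword set and the function name are assumptions made for the example.

```python
# Assumed keyword set for Java-like code; the real rule set may differ.
ACCESSIBILITY_KEYWORDS = {"public", "protected", "private"}


def remove_accessibility_keywords(tokens):
    """Drop accessibility keywords so they do not block a token match."""
    return [t for t in tokens if t not in ACCESSIBILITY_KEYWORDS]


a = remove_accessibility_keywords(["public", "int", "size", "(", ")"])
b = remove_accessibility_keywords(["private", "int", "size", "(", ")"])
print(a == b)  # True
```

After the rule is applied, two declarations that differ only in visibility produce the same token sequence.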

Parameter Replacement (text from Kamiya et al., TSE 2002) A clone relation is defined with the transformation rules and the parameter replacement described above. Other clone relations can be defined with different transformation rules or by neglecting the parameter replacement. [Figure from the paper: parameter replacement applied to sample code]
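Parameter replacement is what makes token matching robust to renaming: identifiers and constants are mapped to one placeholder before matching. A minimal sketch follows, assuming a placeholder token written `$p` and a deliberately partial keyword list; both are choices made for this example.

```python
import re

# Partial keyword list, assumed for illustration only.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "void", "class", "new"}
IDENT_OR_NUMBER = re.compile(r"[A-Za-z_]\w*|\d+")


def replace_parameters(tokens):
    """Replace every identifier or integer constant with the placeholder $p."""
    out = []
    for t in tokens:
        if IDENT_OR_NUMBER.fullmatch(t) and t not in KEYWORDS:
            out.append("$p")
        else:
            out.append(t)
    return out


left = replace_parameters(["total", "=", "total", "+", "price", ";"])
right = replace_parameters(["sum", "=", "sum", "+", "cost", ";"])
print(left)           # ['$p', '=', '$p', '+', '$p', ';']
print(left == right)  # True
```

The two statements differ only in names, so they become identical after replacement. Skipping this step gives a stricter clone relation in which renamed copies no longer match.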

Other minor contributions A scatter-plot visualization similar to Duploc's. Suggested metrics for clones.

Evaluation (1) Research questions RQ1: Is CCFinder scalable, and can it be applied to industry-size programs? e.g. two versions of OpenOffice, 10 million lines in total, 68 minutes; e.g. FreeBSD, NetBSD, and OpenBSD. RQ2: What is the impact of each transformation rule?

Evaluation (2) RQ3: Can CCFinder be used for investigating where and how similar code fragments are used among similar software systems such as FreeBSD, NetBSD, and Linux? A hypothesis: FreeBSD and NetBSD are more similar to each other than either is to Linux. Results: about 40% of source files in FreeBSD have clones with NetBSD, whereas less than 5% of source files in FreeBSD or NetBSD have clones with Linux.

Other Existing Clone Detection Techniques (1) String: Baker's Dup is a lexer and a line-based string matching tool. It removes white space and comments, replaces identifiers, concatenates all files, hashes each line for comparison, and extracts a set of pairs of longest matches using a suffix tree algorithm. Token: CCFinder transforms tokens using language-specific rules and performs a token-by-token comparison.
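The line-based idea can be sketched as follows. This is not Dup itself: it compares two texts instead of concatenating all files, skips identifier replacement, assumes `//` comments that never occur inside string literals, and uses a simple dynamic-programming scan in place of the suffix tree. All function names are hypothetical.

```python
import hashlib


def normalize_line(line):
    """Drop a trailing // comment and all white space."""
    code = line.split("//", 1)[0]
    return "".join(code.split())


def line_hashes(text):
    """Return (line number, hash) for every non-empty normalized line."""
    result = []
    for number, line in enumerate(text.splitlines(), start=1):
        norm = normalize_line(line)
        if norm:
            digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
            result.append((number, digest))
    return result


def longest_matching_run(text_a, text_b):
    """Return (length, first line in a, first line in b) of the longest
    run of consecutive matching lines; (0, 0, 0) if nothing matches."""
    a, b = line_hashes(text_a), line_hashes(text_b)
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1][1] == b[j - 1][1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending here
                if cur[j] > best[0]:
                    best = (cur[j], a[i - cur[j]][0], b[j - cur[j]][0])
        prev = cur
    return best


file_a = "int x = 0;  // counter\nx = x + 1;\nprint(x);\n"
file_b = "setup();\nint x=0;\nx = x+1;\nreturn;\n"
print(longest_matching_run(file_a, file_b))  # (2, 1, 2)
```

Lines 1 and 2 of the first text match lines 2 and 3 of the second once comments and white space are removed.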

Other Existing Clone Detection Techniques (2) AST: Baxter et al.'s CloneDR parses source code to build an abstract syntax tree and compares its subtrees by characterization metrics. Other AST-based detectors: Jiang et al. and Koschke et al. PDG: Komondoor and Horwitz's clone detector finds isomorphic PDG subgraphs using program slicing. Krinke uses k-length path matching to find similar PDG subgraphs. PDG-based clone detectors are robust to reordered statements, code insertion and deletion, intertwined code, and non-contiguous code.
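To give a feel for the AST-based family, the sketch below groups Python functions whose syntax trees have the same shape once names and constants are blanked out. It uses Python's standard `ast` module on Python source and exact matching of a dumped tree, which is an assumption of this example and not the algorithm of CloneDR or any other tool named above.

```python
import ast
from collections import defaultdict


def shape(func):
    """Structural fingerprint of a function: its AST with variable names,
    parameter names, and constant values blanked out."""
    for node in ast.walk(func):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, ast.Constant):
            node.value = 0
    # Dump parameters and body only, so the function's own name is ignored.
    return ast.dump(func.args) + "".join(ast.dump(s) for s in func.body)


def ast_clone_groups(source):
    """Return lists of function names that share the same AST shape."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            groups[shape(node)].append(node.name)
    return [names for names in groups.values() if len(names) > 1]


SAMPLE = '''
def area(w, h):
    return w * h

def cost(qty, price):
    return qty * price

def total(xs):
    return sum(xs)
'''
print(ast_clone_groups(SAMPLE))  # [['area', 'cost']]
```

Because the comparison works on tree structure, formatting and renaming do not matter, but a reordered or inserted statement changes the shape and breaks the match; that gap is what the PDG-based detectors aim to close.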

Other Existing Clone Detection Techniques (3) Metric-based Metric-based clone detectors compare various metrics called fingerprinting functions. They find clones at a particular syntactic granularity such as a class, a function, or a method because these fingerprints are often defined for a particular syntactic unit.
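A metric-based comparison can be sketched as computing a small tuple of counts per function body and treating equal tuples as clone candidates. The three metrics chosen here (token count, statement count, branch count), the keyword set, and the function name are assumptions made for the example, not the fingerprinting functions of any particular tool.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
BRANCH_KEYWORDS = {"if", "for", "while", "case"}  # assumed set


def fingerprint(body):
    """Return (token count, statement count, branch count) for one unit."""
    tokens = TOKEN_RE.findall(body)
    statements = tokens.count(";")
    branches = sum(1 for t in tokens if t in BRANCH_KEYWORDS)
    return (len(tokens), statements, branches)


f1 = "if (x > 0) { y = x; } return y;"
f2 = "if (n > 1) { m = n; } return m;"
f3 = "while (i < n) { s = s + i; i = i + 1; } return s;"

print(fingerprint(f1) == fingerprint(f2))  # True: clone candidates
print(fingerprint(f1) == fingerprint(f3))  # False
```

Equal fingerprints only suggest a clone. Unrelated functions can share the same counts, so a detector of this kind typically combines several metrics or confirms candidates with a closer comparison.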

My general thoughts on CCFinder CCFinder is a robust and scalable clone detector. As there is no consistent definition of code clones, finding X% of clones in one system does not mean very much. However, its case studies show that CCFinder can be applied to industrial-size programs and that it can be used for checking hypotheses about the origin of a system.

Revisiting this course's goal (1) I hope you had fun learning about state-of-the-art methods and tools in software evolution research. You have learned how to break down challenges in constructing and evolving software. You have learned how to cope with software engineering problems systematically. Now you probably know that building and evolving large-scale software systems is challenging, yet there are systematic solutions (tool support and techniques) out there.

Revisiting this course's goal (2) I hope you gained confidence in doing research. Why? I believe that research skills are important for both practitioners and researchers. I hope you gained perspective in identifying and formulating research questions. I hope you have learned how to identify open problems through a literature survey. I hope you are more comfortable reading research papers critically and evaluating research work. I hope you learned the importance of the evaluation component and how to evaluate research solutions.

Preview for Next Lecture We will continue with code duplication research. Empirical studies of code clone genealogies, Kim et al. FSE 2005

Announcement The peer review form is available on the blackboard. Please take your graded homework -- practical uses of software evolution research, part 1. Your grade review period ends on Apr 27th 11:50 PM.