SourcererCC -- Scaling Code Clone Detection to Big-Code

Similar documents
arxiv: v1 [cs.se] 20 Dec 2015

An Exploratory Study on Interface Similarities in Code Clones

A Technique to Detect Multi-grained Code Clones

Scalable Code Clone Detection and Search based on Adaptive Prefix Filtering

Large-Scale Clone Detection and Benchmarking

Rochester Institute of Technology. Making personalized education scalable using Sequence Alignment Algorithm

A Tree Kernel Based Approach for Clone Detection

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price.

A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique

COMPARISON AND EVALUATION ON METRICS

Clone Detection and Maintenance with AI Techniques. Na Meng Virginia Tech

Software Clone Detection. Kevin Tang Mar. 29, 2012

A Measurement of Similarity to Identify Identical Code Clones

Lecture 25 Clone Detection CCFinder. EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY

Dr. Sushil Garg Professor, Dept. of Computer Science & Applications, College City, India

Clone Detection using Textual and Metric Analysis to figure out all Types of Clones

Token based clone detection using program slicing

Detection of Non Continguous Clones in Software using Program Slicing

An Information Retrieval Approach for Source Code Plagiarism Detection

DCC / ICEx / UFMG. Software Code Clone. Eduardo Figueiredo.

1/30/18. Overview. Code Clones. Code Clone Categorization. Code Clones. Code Clone Categorization. Key Points of Code Clones

Enhancing Program Dependency Graph Based Clone Detection Using Approximate Subgraph Matching

Incremental Clone Detection and Elimination for Erlang Programs

Analytical Modeling of Parallel Programs

Code duplication in Software Systems: A Survey

UC Irvine UC Irvine Electronic Theses and Dissertations

SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds

Code Duplication. Harald Gall seal.ifi.uzh.ch/evolution

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Reusing Reused Code II. CODE SUGGESTION ARCHITECTURE. A. Overview

On Refactoring for Open Source Java Program

A Weighted Layered Approach for Code Clone Detection

Review on Managing RDF Graph Using MapReduce

Comparing Multiple Source Code Trees, version 3.1

Detection and Analysis of Software Clones

Tracking Frequent Items Dynamically: What s Hot and What s Not To appear in PODS 2003

Pointers. Chapter 8. Decision Procedures. An Algorithmic Point of View. Revision 1.0

Classifier Case Study: Viola-Jones Face Detector

Code Clone Detection on Specialized PDGs with Heuristics

Software product quality control Dr. Stefan Wagner Dr. Florian Deißenböck Technische Universität München

Quick Parser Development Using Modified Compilers and Generated Syntax Rules

Automatic Mining of Functionally Equivalent Code Fragments via Random Testing. Lingxiao Jiang and Zhendong Su

Automated Refinement Checking of Asynchronous Processes. Rajeev Alur. University of Pennsylvania

좋은 발표란 무엇인가? 정영범 서울대학교 5th ROSAEC Workshop 2011년 1월 6일 목요일

An Approach to Detect Clones in Class Diagram Based on Suffix Array

``System Maintenance: Please verify your details''

Towards the Code Clone Analysis in Heterogeneous Software Products

Lesson 06 Arrays. MIT 11053, Fundamentals of Programming By: S. Sabraz Nawaz Senior Lecturer in MIT Department of MIT FMC, SEUSL

System Maintenance: Please verify your details

1 Unit 8 'for' Loops

Accuracy Enhancement in Code Clone Detection Using Advance Normalization

ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools

Deriving Network Traffic Signatures via Large Graphs

MINING SOURCE CODE REPOSITORIES AT MASSIVE SCALE USING LANGUAGE MODELING COMP 5900 X ERIC TORUNSKI DEC 1, 2016

Cross Language Higher Level Clone Detection- Between Two Different Object Oriented Programming Language Source Codes

Refactoring Support Based on Code Clone Analysis

CloCom: Mining Existing Source Code for Automatic Comment Generation

Lexical Analysis. Introduction

Detection and Behavior Identification of Higher-Level Clones in Software

Management. Software Quality. Dr. Stefan Wagner Technische Universität München. Garching 28 May 2010

Getting the right module structure: finding and fixing problems in your projects!

Study and Analysis of Object-Oriented Languages using Hybrid Clone Detection Technique

Classification of Java Programs in SPARS-J. Kazuo Kobori, Tetsuo Yamamoto, Makoto Matsusita and Katsuro Inoue Osaka University

International Journal of Scientific & Engineering Research, Volume 8, Issue 2, February ISSN

CONTENTS: Array Usage Multi-Dimensional Arrays Reference Types. COMP-202 Unit 6: Arrays

Master of Computer Application (MCA) Semester III MC0071 Software Engineering 4 Credits

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

DCCD: An Efficient and Scalable Distributed Code Clone Detection Technique for Big Code

From Smith-Waterman to BLAST

2IS55 Software Evolution. Code duplication. Alexander Serebrenik

Full file at

Empath. A Modeling Language for Living Beings. White Paper. Jeremy Posner, Nalini Kartha, Sampada Sonalkar, William Mee

Semantic Clone Detection Using Machine Learning

Clone Detection in Matlab Simulink Models

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

Abstract Semantic Differencing for Numerical Programs

Efficient Systems. Micrel lab, DEIS, University of Bologna. Advisor

CSE 582 Autumn 2002 Exam 11/26/02

CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators)

Min-Hashing and Geometric min-hashing

Python The way of a program. Srinidhi H Asst Professor Dept of CSE, MSRIT

Topic: Duplicate Detection and Similarity Computing

Towards High-performance Flow-level level Packet Processing on Multi-core Network Processors

PHPKB API Reference Guide

CHAPTER 6 EFFICIENT TECHNIQUE TOWARDS THE AVOIDANCE OF REPLAY ATTACK USING LOW DISTORTION TRANSFORM

Edge and corner detection

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor

Formally Certified Satisfiability Solving

Chapter 4. Capturing the Requirements. 4th Edition. Shari L. Pfleeger Joanne M. Atlee

IDENTIFICATION OF PROMOTED ECLIPSE UNSTABLE INTERFACES USING CLONE DETECTION TECHNIQUE

Formal specification of semantics of UML 2.0 activity diagrams by using Graph Transformation Systems

The University Of Michigan. EECS402 Lecture 05. Andrew M. Morgan. Savitch Ch. 5 Arrays Multi-Dimensional Arrays. Consider This Program

Juggling the Jigsaw Towards Automated Problem Inference from Network Trouble Tickets

Rearranging the Order of Program Statements for Code Clone Detection

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Ctcompare: Comparing Multiple Code Trees for Similarity

A Serializability Violation Detector for Shared-Memory Server Programs

A Systematic Study of Automated Program Repair: Fixing 55 out of 105 Bugs for $8 Each

Design Code Clone Detection System uses Optimal and Intelligence Technique based on Software Engineering

Transcription:

SourcererCC -- Scaling Code Clone Detection to Big-Code

What did this paper do? SourcererCC a token-based clone detector, that can detect both exact and near-miss clones from large inter project repositories using a standard workstation.

Background - Code Clone Detection Code Clones -- separate fragments of code that are very similar. Why are there so many duplicated code? - Reusing code fragments by copying and pasting Why it need to be detected? - harmful in software maintenance and evolution

Type 1 -- white-space, layout, comments int[] s ; int i ; s = new int[5] ; for(i = 0 ; i < 5 ; i++){ s[i] = i ; } int[] s ; int i ; s = new int[5]; //initiation for (i = 0; i < 5; { s[i] = i ; } i++)

Type 2 -- identifier names, literal values int[] s ; int i ; s = new int[5] ; for(i = 0 ; i < 5 ; i++){ s[i] = i ; } int[] array ; int i ; array = new int[6] ; for(i = 0 ; i < 6 ; i++){ array[i] = i ; }

Type 3 (near-miss clones) -- statements, added, modified int[] s ; int i ; s = new int[5] ; for(i = 0 ; i < 5 ; i++){ s[i] = i ; } int[] s = new int[5] ; for(int i = 0 ; i < 5 ; i++){ s[i] = i ; }

Type 4 -- same functionality, different Syntax int[] s ; int i ; s = new int[5] ; int[] array = {0, 1, 2, 3, 4}; for(i = 0 ; i < 5 ; i++){ s[i] = i ; }

A token: programming language keywords, literals, and identifiers. DEFINITIONS Code Fragment: A continuous segment of source code Clone Pair: A pair of code fragments that are similar Clone Class: A set of code fragments that are similar Code Block: 1)A sequence of code statements within braces, 2) is represented as a bag-of-tokens

DEFINITIONS Degree of similarity -- The higher the value, the greater the similarity between the code blocks. Threshold -- if the Degree of similarity is greater than threshold, the code blocks are considered as code clones.

DEFINITIONS tokens project block int i ; tokens? int [ ] s ; s = new int[5] ; for(i = 0 ; i < 5 ; i++) { s[i] = i ; }

GOAL To find: B1 B2 θ max( B1, B2 ) B1 = {a, b, c, d, e} (5 tokens) B2 = {b, c, d, e, f} (5 tokens) B1 B2 = 4 If threshold θ = 0.8 θ max( B1, B2 ) = 0.8 * max(5, 5) = 4

Intuitive way B1 = {a, b, c, d, e} B2 = { b, c, d, e, f} O(n^2)

Heuristics Filter 1 B1 = {a, b, c, d, e} B2 = {b, c, d, e, f} with 5 tokens (t = 5) How many different tokens t i + 1 = 2 tokens match at least 1 token. Which ORDER is best? Similar to natural languages -the popularity of tokens threshold θ = 0.8 if clones : 0.8 5 = 4 tokens (i=4). Change the order means change the source code meaning.

Result

Ineffective case B1 = {a, b, c, d} (4 tokens) B2 = {b, c, d, e, f} (5 tokens) Let threshold θ = 0.8 False positive to be considered clones : 0.8 5 = 4 tokens (i=4) 2 tokens match at least 1 token

Heuristics Filter 2 B1 ={ a, b, c, d } (4 tokens) 2 B2 ={ b, c, d, e, f } (5 tokens) If clones --- 4 same tokens Filter 1: 1 same token Upper bound: 1 + min(2, 4) = 3 4

Result

Clone Detection Algorithm Two stages: (1) Partial Index Creation - creates indexes for tokens only in sub-blocks. (2) Clone Detection - a) establishes a map of candidates; b) Candidate Verification (check candidates that were not rejected in step a) Filter 1 is involved in (1) Filter 2 is involved in (2)

Detection of Near-miss (Type-3) clones Type-3: The fragments have statements added, modified and/or removed with respect to each other Semantic difference, but syntactic similarity at a token level to be detected as similar.

Evaluation -- Execution time Also can be executed for the IJaDataset (250MLOC) Can out of memory issue indicate the scalability?

Evaluation -- Recall (Percent of genuine clones that are reported)

Evaluation -- Recall using BigCloneBench

Evaluation -- Precision

Summary of its evaluation methodology Evaluated the scalability, execution time, recall and precision Compared 4 publicly available and state-of-the-art tools. Used two recent benchmarks: (1) a big benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones.

Related work: How is the research different from prior work? Rattan et al. --- found very few tools target scalability to very large repositories Liveri et al. --- partition the input for the large repositories, distribution of the executions over a large number of machines. Svajlenko et al. --- non-deterministic shuffling heuristic to reduce the number of tool execution, reduction in recall SourcererCC --- scale to large repositories on a single machine.

Related work: How is the research different from prior work? Ishihara et al. --- MD5 hashing to scale method-clone detection, fast execution time, cannot detect Type-3 clones Hummel et al. --- first to use an index-based approach, detect only Type-1 and Type-2 clones Others have scaled clone detection in domain-specific ways SourcererCC --- small index, detects Type-3 clones, generic method

What is learnt from the paper? Idea is simple, implementation may be simple too Why can it be published on ICSE?

Use Math Formula

Write Algorithm

Evaluation 1. Show impressive results 2. Cover the features we mentioned before

Discussion: Should we change the order of tokens? Change the order means change the meaning Same material can generate different code

Discussion: Is code clone totally bad? Learning is kind of code clone. We clone the knowledge from professors, from books Understand it and build up our own knowledge system

Question How can we partition the program to different blocks?