A Tree Kernel Based Approach for Clone Detection

Similar documents
MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price.

SourcererCC -- Scaling Code Clone Detection to Big-Code

Software Clone Detection. Kevin Tang Mar. 29, 2012

Examples of attributes: values of evaluated subtrees, type information, source file coordinates,

Clone Detection using Textual and Metric Analysis to figure out all Types of Clones

A Survey of Software Clone Detection Techniques

Syntax and Grammars 1 / 21

Token based clone detection using program slicing

Incremental Clone Detection and Elimination for Erlang Programs

Code duplication in Software Systems: A Survey

A Technique to Detect Multi-grained Code Clones

Folding Repeated Instructions for Improving Token-based Code Clone Detection

Lecture 25 Clone Detection CCFinder. EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Enhancing Program Dependency Graph Based Clone Detection Using Approximate Subgraph Matching

Towards the Code Clone Analysis in Heterogeneous Software Products

Semantic Clone Detection Using Machine Learning

DCC / ICEx / UFMG. Software Code Clone. Eduardo Figueiredo.

Detection and Behavior Identification of Higher-Level Clones in Software

Study and Analysis of Object-Oriented Languages using Hybrid Clone Detection Technique

ISSN: (PRINT) ISSN: (ONLINE)

An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies

LECTURE 3. Compiler Phases

Rearranging the Order of Program Statements for Code Clone Detection

TypeChef: Towards Correct Variability Analysis of Unpreprocessed C Code for Software Product Lines

Compiling clones: What happens?

9/5/17. The Design and Implementation of Programming Languages. Compilation. Interpretation. Compilation vs. Interpretation. Hybrid Implementation

COMPARISON AND EVALUATION ON METRICS

A Simple Syntax-Directed Translator

On the Robustness of Clone Detection to Code Obfuscation

Semantic Analysis. Lecture 9. February 7, 2018

Detection and Analysis of Software Clones

Rochester Institute of Technology. Making personalized education scalable using Sequence Alignment Algorithm

The role of semantic analysis in a compiler

Gapped Code Clone Detection with Lightweight Source Code Analysis

On Refactoring for Open Source Java Program

An Exploratory Study on Interface Similarities in Code Clones

International Journal of Scientific & Engineering Research, Volume 8, Issue 2, February ISSN

KClone: A Proposed Approach to Fast Precise Code Clone Detection

ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools

A programming language requires two major definitions A simple one pass compiler

Semantic Analysis. Outline. The role of semantic analysis in a compiler. Scope. Types. Where we are. The Compiler Front-End

Deckard: Scalable and Accurate Tree-based Detection of Code Clones. Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, Stephane Glondu

Detection of Non Continguous Clones in Software using Program Slicing

Semantic Analysis. Outline. The role of semantic analysis in a compiler. Scope. Types. Where we are. The Compiler so far

Introduction to Compiler Construction

Syntax-Directed Translation. Lecture 14

EVALUATION OF TOKEN BASED TOOLS ON THE BASIS OF CLONE METRICS

Software Clone Detection and Refactoring

Code Duplication++ Status Report Dolores Zage, Wayne Zage, Nathan White Ball State University November 2018

Clone Detection Using Dependence. Analysis and Lexical Analysis. Final Report

A simple syntax-directed

Design Code Clone Detection System uses Optimal and Intelligence Technique based on Software Engineering

Automatic Mining of Functionally Equivalent Code Fragments via Random Testing. Lingxiao Jiang and Zhendong Su

Software Clone Detection Using Cosine Distance Similarity

CA Compiler Construction

Classification Using Genetic Programming. Patrick Kellogg General Assembly Data Science Course (8/23/15-11/12/15)

CCFinderSW: Clone Detection Tool with Flexible Multilingual Tokenization

Code Duplication. Harald Gall seal.ifi.uzh.ch/evolution

Neural Detection of Semantic Code Clones via Tree-Based Convolution

A Measurement of Similarity to Identify Identical Code Clones

Introduction to Compiler Construction

Clone Detection Using Abstract Syntax Suffix Trees

CONVERTING CODE CLONES TO ASPECTS USING ALGORITHMIC APPROACH

Re-engineering Software Variants into Software Product Line

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Accuracy Enhancement in Code Clone Detection Using Advance Normalization

Introduction to Compiler Construction

Type Inference Systems. Type Judgments. Deriving a Type Judgment. Deriving a Judgment. Hypothetical Type Judgments CS412/CS413

Code Clone Detection on Specialized PDGs with Heuristics

To Enhance Type 4 Clone Detection in Clone Testing Swati Sharma #1, Priyanka Mehta #2 1 M.Tech Scholar,

Problematic Code Clones Identification using Multiple Detection Results

Lecture 12: Conditional Expressions and Local Binding

Chapter 4. Abstract Syntax

SolidSDD Duplicate Code Detector User Manual

Master Thesis. Type-3 Code Clone Detection Using The Smith-Waterman Algorithm

Detecting Self-Mutating Malware Using Control-Flow Graph Matching

The Compiler So Far. Lexical analysis Detects inputs with illegal tokens. Overview of Semantic Analysis

Page No 1 (Please look at the next page )

SEMANTIC ANALYSIS TYPES AND DECLARATIONS

ELEC 876: Software Reengineering

Abstract Syntax Tree Generation using Modified Grammar for Source Code Plagiarism Detection

A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique

Similar Code Detection and Elimination for Erlang Programs

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 2, Mar-Apr 2015

Design and Implementation of a Java to CodeML converter

Code Clone Detector: A Hybrid Approach on Java Byte Code

Compilers and Interpreters

Introduction. Compiler Design CSE Overview. 2 Syntax-Directed Translation. 3 Phases of Translation

Lecture 4: The Declarative Sequential Kernel Language. September 5th 2011

Lecture 21: Relational Programming II. November 15th, 2011

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

The Substitution Model

Extension of GCC with a fully manageable reverse engineering front end

Grammars and Parsing. Paul Klint. Grammars and Parsing

Parsing CSCI-400. Principles of Programming Languages.

Principles of Programming Languages. Lecture Outline

Web Application Testing in Fifteen Years of WSE

The University of Saskatchewan Department of Computer Science. Technical Report #

Transcription:

A Tree Kernel Based Approach for Clone Detection Anna Corazza 1, Sergio Di Martino 1, Valerio Maggio 1, Giuseppe Scanniello 2 1) University of Naples Federico II 2) University of Basilicata

Outline Background Clone detection definition State of the Art Techniques Taxonomy Our Abstract Syntax Tree based Proposal A Tree Kernel based approach for clone detection A preliminary evaluation

Code Clones 1 Two code fragments form a clone if they are similar enough according to a given measure of similarity (I.D. Baxter, 1998) 3. R. Tiarks, R. Koschke, and R. Falke, An assessment of type-3 clones as detected by state-of-the-art tools

Code Clones 1 Two code fragments form a clone if they are similar enough according to a given measure of similarity (I.D. Baxter, 1998) Similarity based on Program Text or on Semantics 3. R. Tiarks, R. Koschke, and R. Falke, An assessment of type-3 clones as detected by state-of-the-art tools

Code Clones 1 Two code fragments form a clone if they are similar enough according to a given measure of similarity (I.D. Baxter, 1998) Similarity based on Program Text or on Semantics Program Text can be further distinguished by their degree of similarity 1 Type 1 Clone: Exact Copy Type 2 Clone: Parameter Substituted Clone Type 3 Clone: Modified/Structure Substituted Clone 1. R. Tiarks, R. Koschke, and R. Falke, An assessment of type-3 clones as detected by state-of-the-art tools

State of the Art Techniques 2 Classified in terms of Program Text representation 2 String, token, syntax tree, control structures, metric vectors String/Token based Techniques Abstract Syntax Tree (AST) Techniques... 2. Roy, Cordy, Koschke Comparison and Evaluation of Clone Detection Tools and Technique 2009

State of the Art Techniques 2 String/Token based Techniques Abstract Syntax Tree (AST) Techniques... Combined Techniques (a.k.a. Hybrid) Combine different representations Combine different techniques Combine different sources of information Tree Kernel based approach (Our approach :)

The Proposed Approach

The Goal 3 Define an AST based technique able to detect up to Type 3 Clones

The Goal 3 Define an AST based technique able to detect up to Type 3 Clones The Key Ideas: Improve the amount of information carried by ASTs by adding (also) lexical information Define a proper measure to compute similarities among (sub)trees, exploiting such information

The Goal 3 Define an AST based technique able to detect up to Type 3 Clones The Key Ideas: Improve the amount of information carried by ASTs by adding (also) lexical information Define a proper measure to compute similarities among (sub)trees, exploiting such information As a measure we propose the use of a (Tree) Kernel Function

Kernels for Structured Data 4 Kernels are a class of functions with many appealing features: Are based on the idea that a complex object can be described in terms of its constituent parts Can be easily tailored to a specific domain There exist different classes of Kernels: String Kernels Graph Kernels Tree Kernels Applied to NLP Parse Trees (Collins and Duffy 2004)

Defining a new Tree Kernel 5 The definition of a new Tree Kernel requires the specification of: (1) A set of features to annotate nodes of compared trees

Defining a new Tree Kernel 5 The definition of a new Tree Kernel requires the specification of: (1) A set of features to annotate nodes of compared trees (2) A (primitive) Kernel Function to measure the similarity of each pair of nodes

Defining a new Tree Kernel 5 The definition of a new Tree Kernel requires the specification of: (1) A set of features to annotate nodes of compared trees (2) A (primitive) Kernel Function to measure the similarity of each pair of nodes (3) A proper Kernel Function to compare subparts of trees

(1) The defined features 6 We annotate each node of AST by 4 features:

(1) The defined features 6 We annotate each node of AST by 4 features: Instruction Class i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,...

(1) The defined features 6 We annotate each node of AST by 4 features: Instruction Class i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,... Instruction i.e. FOR, WHILE, IF, RETURN, CONTINUE,...

(1) The defined features 6 We annotate each node of AST by 4 features: Instruction Class i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,... Instruction i.e. FOR, WHILE, IF, RETURN, CONTINUE,... Context Instruction class of statement in which node is enclosed

(1) The defined features 6 We annotate each node of AST by 4 features: Instruction Class i.e. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,... Instruction i.e. FOR, WHILE, IF, RETURN, CONTINUE,... Context Instruction class of statement in which node is enclosed Lexemes Lexical information within the code

Context Feature 7 Rationale: two nodes are more similar if they appear in the same Instruction class for (int i=0; i<10; i++) x += i+2; while (i<10) x += i+2; if (i<10) x += i+2;

Context Feature 7 Rationale: two nodes are more similar if they appear in the same Instruction class for (int i=0; i<10; i++) x += i+2; while (i<10) x += i+2; if (i<10) x += i+2;

Context Feature 7 Rationale: two nodes are more similar if they appear in the same Instruction class for (int i=0; i<10; i++) x += i+2; while (i<10) x += i+2; if (i<10) x += i+2;

Context Feature 7 Rationale: two nodes are more similar if they appear in the same Instruction class for (int i=0; i<10; i++) x += i+2; while (i<10) x += i+2; if (i<10) x += i+2;

Context Feature 7 Rationale: two nodes are more similar if they appear in the same Instruction class for (int i=0; i<10; i++) x += i+2; while (i<10) x += i+2; if (i<10) x += i+2;

Lexemes Feature 8 For leaf nodes: It is the lexeme associated to the node For internal nodes: It is the set of lexemes that recursively comes from subtrees with minimum height

Lexemes Propagation 9 block while return < block y x 0 %= x y

Lexemes Propagation 9 block while return < block y y x x 0 0 %= x x y y

Lexemes Propagation 9 block while return x, 0 < block y y x x 0 0 %= x x y y

Lexemes Propagation 9 block while return x, 0 < block x, y y y x x 0 0 %= x, y x x y y

Lexemes Propagation 9 block x, 0, while while return y, return x, 0 < block x, y y y x x 0 0 %= x, y x x y y

Lexemes Propagation 9 block y, return x, 0, while while return y, return x, 0 < block x, y y y x x 0 0 %= x, y x x y y

(2) Applying features in a Kernel 10 We exploits these features to compute similarity among pairs of nodes, as follows: Instruction Class filters comparable nodes We compare only nodes with the same Instruction Class Instruction, Context and Lexemes are used to define a value of similarity between compared nodes

(Primitive) Kernel Function between nodes 1.0 If two nodes have the same values of features 0.8 If two nodes differ in lexemes (same instruction and context) 11 s(n1,n2)= 0.7 If two nodes share lexemes and are the same instruction 0.5 If two nodes share lexemes and are enclosed in the same context 0.25 If two nodes have at least one feature in common 0.0 no match

(3) Tree Kernel: Kernel on entire Tree Structures We apply nodes comparison recursively to compute similarity between subtrees 12 We aim to identify the maximum isomorphic tree/subtree

Overall Process 13 1. Preprocessing 2. Extraction 3. Match Detection 4. Aggregation

A Preliminary evaluation

Evaluation Description 14 We considered a small Java software system We choose to identify clones at method level We checked system against the presence of up to Type 3 clones Removed all detected clones through refactoring operations We manually and randomly injected a set of artificially created clones One set for each type of clones We applied our prototype and CloneDigger* to mutated systems We evaluated performances in terms of Precision, Recall and F1 *http://clonedigger.sourceforge.net/

Results (1) 15 Type 1 and Type 2 Clones: We were able to detect all clones without any false positive This was obtained also by CloneDigger Both tools expressed the potential of AST-based approaches

Results (2) 16 Type 3 clones: We classified results as true Type 3 clones according to different thresholds on similarity values We measured performance on different thresholds We get best results with threshold equals to 0.70

Conclusions and Future Works 17 Measure performance on real systems and projects Bellon's Benchmark Investigate best results with 0.7 as threshold Measure Time Performances Improve the scalability of the approach Avoid to compare all pairs Improve similarity computation Avoid manual weighting features Extend Supported Languages Now we support Java, C, Python

18 Thank you for listening. Questions?

Let's go to the backup slides

Evaluation on Different Thresholds

Exact Clone (Type 1): is an exact copy of consecutive code fragments without modifications (except for white spaces and comments)

Exact Clone (Type 1): is an exact copy of consecutive code fragments without modifications (except for white spaces and comments)

Parameter-substituted clone (Type 2): is a copy where only parameters (identifiers or literals) have been substituted

Parameter-substituted clone (Type 2): is a copy where only parameters (identifiers or literals) have been substituted

Structure-substituted clone (Type 3): is a copy where program structures have been substituted, added and/or deleted.

Structure-substituted clone (Type 3): is a copy where program structures have been substituted, added and/or deleted.