A Comparison of Code Similarity Analyzers C. Ragkhitwetsagul, J. Krinke, D. Clark

Similar documents
CloPlag. A Study of Effects of Code Obfuscation to Clone/Plagiarism Detection Tools. Jens Krinke, Chaiyong Ragkhitwetsagul, Albert Cabré Juan

CloPlag. A Study of Effects of Code Obfuscation to Code Similarity Detection Tools. Chaiyong Ragkhitwetsagul, Jens Krinke, Albert Cabré Juan

Research Note RN/17/04. A Comparison of Code Similarity Analysers

Code Duplication: A Measurable Technical Debt?

A Comparison of Code Similarity Analysers

Measuring Code Similarity in Large-scaled Code Corpora

Using Compilation/Decompilation to Enhance Clone Detection

Source Code Plagiarism Detection using Machine Learning

Searching for Configurations in Clone Evaluation A Replication Study

Plagiarism detection for Java: a tool comparison

On the Robustness of Clone Detection to Code Obfuscation

Large-Scale Clone Detection and Benchmarking

Instructor-Centric Source Code Plagiarism Detection and Plagiarism Corpus

Duplication de code: un défi pour l assurance qualité des logiciels?

Overview of SOCO Track on the Detection of SOurce COde Re-use

Clone Detection and Maintenance with AI Techniques. Na Meng Virginia Tech

Toxic Code Snippets on Stack Overflow

Compiling clones: What happens?

International Journal of Scientific & Engineering Research, Volume 8, Issue 2, February ISSN

A Technique to Detect Multi-grained Code Clones

A Framework for Evaluating Mobile App Repackaging Detection Algorithms

CS/COE 1501

ISSN: (PRINT) ISSN: (ONLINE)

An Information Retrieval Approach for Source Code Plagiarism Detection

An Approach to Source Code Plagiarism Detection Based on Abstract Implementation Structure Diagram

CS 112 Introduction to Computing II. Wayne Snyder Computer Science Department Boston University

Improving Plagiarism Detection. Tomas Votroubek

switch case Logic Syntax Basics Functionality Rules Nested switch switch case Comp Sci 1570 Introduction to C++

Cosc 241 Programming and Problem Solving Lecture 17 (30/4/18) Quicksort

MeCC: Memory Comparisonbased Clone Detector

Enhancing Source-Based Clone Detection Using Intermediate Representation

ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools

Extracting Code Clones for Refactoring Using Combinations of Clone Metrics

CS 112 Introduction to Computing II. Wayne Snyder Computer Science Department Boston University

An Exploratory Study on Interface Similarities in Code Clones

Software Clone Detection. Kevin Tang Mar. 29, 2012

Pertinence of Lexical and Structural Features for Plagiarism Detection in Source Code

An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies

Phase-based algorithms for file migration

DCCD: An Efficient and Scalable Distributed Code Clone Detection Technique for Big Code

Scalable Code Clone Detection and Search based on Adaptive Prefix Filtering

Ctcompare: Comparing Multiple Code Trees for Similarity

2IS55 Software Evolution. Code duplication. Alexander Serebrenik

Folding Repeated Instructions for Improving Token-based Code Clone Detection

COS 226 Midterm Fall 2007

Incremental Clone Detection and Elimination for Erlang Programs

Keywords Clone detection, metrics computation, hybrid approach, complexity, byte code

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 2, Mar-Apr 2015

Detection and Analysis of Software Clones

Lecture 1: Overview of Java

COMP 202 Recursion. CONTENTS: Recursion. COMP Recursion 1

2IMP25 Software Evolution. Code duplication. Alexander Serebrenik

OSSPolice - Identifying Open-Source License Violation and 1-day Security Risk at Large Scale

Lecture 4: MIPS Instruction Set

Java Archives Search Engine Using Byte Code as Information Source

Study and Analysis of Object-Oriented Languages using Hybrid Clone Detection Technique

Compiling and running OpenMP programs. C/C++: cc fopenmp o prog prog.c -lomp CC fopenmp o prog prog.c -lomp. Programming with OpenMP*

Automatic Identification of Important Clones for Refactoring and Tracking

Lecture 1 - Introduction (Class Notes)

ON AUTOMATICALLY DETECTING SIMILAR ANDROID APPS. By Michelle Dowling

Opera Web Browser Archive - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

A Tree Kernel Based Approach for Clone Detection

2IS55 Software Evolution. Code duplication. Alexander Serebrenik

Clone Detection using Textual and Metric Analysis to figure out all Types of Clones

CS Programming I: Programming Process

Lab5. Wooseok Kim

Rearranging the Order of Program Statements for Code Clone Detection

Programming by Delegation

Crawling. CS6200: Information Retrieval. Slides by: Jesse Anderton

Navigating the Guix Subsystems

Technical lossless / near lossless data compression

Efficiently Measuring an Accurate and Generalized Clone Detection Precision using Clone Clustering

Sub-clones: Considering the Part Rather than the Whole

Accuracy Enhancement in Code Clone Detection Using Advance Normalization

Static Pruning of Terms In Inverted Files

Ascenium: A Continuously Reconfigurable Architecture. Robert Mykland Founder/CTO August, 2005

INF 212 ANALYSIS OF PROG. LANGS PLUGINS. Instructors: Crista Lopes Copyright Instructors.

OFF-SITE LEARNING SCHEDULE

Lab5. Wooseok Kim

JSCTracker: A Tool and Algorithm for Semantic Method Clone Detection

Playing Cupid: The IDE as a Matchmaker for Plug-Ins

엄현상 (Eom, Hyeonsang) School of Computer Science and Engineering Seoul National University COPYRIGHTS 2017 EOM, HYEONSANG ALL RIGHTS RESERVED

Soot, a Tool for Analyzing and Transforming Java Bytecode

A Survey of Software Clone Detection Techniques

MeCC: Memory Comparison-based Clone Detector

Research Article An Empirical Study on the Impact of Duplicate Code

On the Stability of Software Clones: A Genealogy-Based Empirical Study

Big picture. Definitions. Internal sorting. Exchange sorts. Insertion sort Bubble sort Selection sort Comparison. Comp Sci 1575 Data Structures

Mining Revision Histories to Detect Cross-Language Clones without Intermediates

Lecture 2. COMP1406/1006 (the Java course) Fall M. Jason Hinek Carleton University

code pattern analysis of object-oriented programming languages

Cost of Your Programs

arxiv: v1 [cs.se] 8 Aug 2017

1/30/18. Overview. Code Clones. Code Clone Categorization. Code Clones. Code Clone Categorization. Key Points of Code Clones

Code Clone Detection on Specialized PDGs with Heuristics

Process Model Improvement for Source Code Plagiarism Detection in Student Programming Assignments

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price.

4.1, 4.2 Performance, with Sorting

Sorting. 4.2 Sorting and Searching. Sorting. Sorting. Insertion Sort. Sorting. Sorting problem. Rearrange N items in ascending order.

Gapped Code Clone Detection with Lightweight Source Code Analysis

Transcription:

A Comparison of Code Similarity Analyzers C. Ragkhitwetsagul, J. Krinke, D. Clark SCAM 16, EMSE (under reviewed) Photo: https://c1.staticflickr.com/1/316/31831180223_38db905f28_c.jpg 1

When source code is copied and modified, which code similarity detection techniques or tools get the most accurate results? 2

Bellon et al. (TSE 2007) Roy et al. (Sci Comp Prog. 2009) Hage et al. (CSERC 2010) Biegel et al. (MSR 11) 3

1 The selected tools are limited to only a subset of clone or plagiarism detectors (and their parameters). 2 The results are based on different data sets. 4

5 30 tools

Pervasive Modifications From: https://www.princeton.edu/pr/pub/integrity/pages/plagiarism/ /* ORIGINAL */ private static int partition (Comparable[] a, int lo, int hi) { int i = lo; int j = hi+1; Comparable v = a[lo]; while (true) { while (less(a[++i], v)) { if (i == hi) break; } while (less(v, a[--j])) { if (j == lo) break; } if (i >= j) break; exch(a, i, j); } exch(a, lo, j); return j; } /* PERVASIVELY MODIFIED CODE */ private static int partition (int[] bob, int left, int right){ int x = left; int y = right+1; for (;;) { while (less(bob[left],bob[--y])) if (y == left) break; while (less(bob[++x],bob[left])) if (x == right) break; if (x >= y) break; swap(bob, y, x); } swap(bob, y, left); return y; } SW Plagiarism clone evolution refactoring 6

7

pervasively modified code to be used in detection phase source obfuscator compiler bytecode obfuscator decompilers pervasively modified code original ARTIFICE javac ProGuard Krakatau BubbleSort.java EightQueens.java GuessWord.java TowerOfHanoi.java InfixConverter.java Kapreka_Tran.java MagicSquare.java RailRoadCar.java SLinkedList.java SqrtAlgorithm.java Procyon 8

Boiler-Plate Code Detection of SOurce COde re-use (SOCO). Flores E., Rosso P., Moreno L., Villatoro-Tello E. (2014) http://users.dsic.upv.es/grupos/nle/soco/ 9

Parameter Settings 10 Jonathan H. Ward (Wikipedia CC BY-SA 3.0)

11

Similarity Report orig orig no kraka tau orig no procy on orig pg kraka tau orig pg procy on no kraka tau no procy on pg kraka tau pg procy on Sqrt/ orig Sqrt/ Squr/ pg kraka tau Squr/ pg procy on InfConv/orig 100 55 36 63 32 43 34 60 31 43 20 20 14 17 InfConv/artifice 55 100 35 54 33 39 37 56 32 39 19 30 14 17 InfConv/orig_no_krakatau 36 35 100 38 60 26 80 35 59 26 13 14 28 17 InfConv/orig_no_procyon 63 54 38 100 34 58 37 80 34 58 21 20 15 21 InfConv/orig_pg_krakatau 32 33 60 34 100 33 61 33 82 33 17 17 29 20 InfConv/orig_pg_procyon 43 39 26 58 33 100 26 59 33 100 19 20 14 21 InfConv/artific_no_krakatau 34 37 80 37 61 26 100 36 59 26 14 14 28 17 InfConv/artifice_no_procyon 60 56 35 80 33 59 36 100 32 59 19 20 15 19 InfConv/artifice_pg_krakatau 31 32 59 34 82 33 59 32 100 33 15 16 28 17 InfConv/artifice_pg_procyon 43 39 26 58 33 100 26 59 33 100 19 20 14 21 Sqrt/orig 20 19 13 21 17 19 14 19 15 19 100 32 14 16 Sqrt/artifice 20 30 14 20 17 20 14 20 16 20 32 100 15 18 Square/artifice_pg_krakatau 14 14 28 15 29 14 28 15 28 14 14 15 100 32 Square/artifice_pg_procyon 17 17 17 21 20 21 17 19 17 21 16 18 32 100 12

Similarity Threshold = 50 orig orig no kraka tau orig no procy on orig pg kraka tau orig pg procy on no kraka tau no procy on pg kraka tau pg procy on Sqrt/ orig Sqrt/ Squr/ pg kraka tau Squr/ pg procy on InfConv/orig 100 55 36 63 32 43 34 60 31 43 20 20 14 17 InfConv/artifice 55 100 35 54 33 39 37 56 32 39 19 30 14 17 InfConv/orig_no_krakatau 36 35 100 38 60 26 80 35 59 26 13 14 28 17 InfConv/orig_no_procyon 63 54 38 100 34 58 37 80 34 58 21 20 15 21 InfConv/orig_pg_krakatau 32 33 60 34 100 33 61 33 82 33 17 17 29 20 InfConv/orig_pg_procyon 43 39 26 58 33 100 26 59 33 100 19 20 14 21 InfConv/artific_no_krakatau 34 37 80 37 61 26 100 36 59 26 14 14 28 17 InfConv/artifice_no_procyon 60 56 35 80 33 59 36 100 32 59 19 20 15 19 InfConv/artifice_pg_krakatau 31 32 59 34 82 33 59 32 100 33 15 16 28 17 InfConv/artifice_pg_procyon 43 39 26 58 33 100 26 59 33 100 19 20 14 21 Sqrt/orig 20 19 13 21 17 19 14 19 15 19 100 32 14 16 Sqrt/artifice 20 30 14 20 17 20 14 20 16 20 32 100 15 18 Square/artifice_pg_krakatau 14 14 28 15 29 14 28 15 28 14 14 15 100 32 Square/artifice_pg_procyon 17 17 17 21 20 21 17 19 17 21 16 18 32 100 13

Best Threshold 1.00 F-measure = 0.8282 0.75 F-measure 0.50 0.25 0.00 31 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 Threshold Value (T) 14

Optimal Configuration Best Param Settings Best Threshold Pervasive: 14,880,000 pairwise comparisons SOCO: 99,816,528 pairwise comparisons Icons made by Freepik from www.flaticon.com is licensed by Creative Commons BY 3.0 15

Clone det. Plag det. Comp. Others ccfx deckard iclones nicad simian jplag-java jplag-text plaggie sherlock simjava simtext 7zncd-BZip2 7zncd-Deflate 7zncd-Deflate2 7zncd-LZMA 7zncd-Deflate64 7zncd-PPMd bzip2ncd gzipncd icd ncd-bzlib ncd-zlib xz-ncd bsdiff diff difflib fuzzywuzzy jellyfish ngram cosine Pervasive Mod. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 F1

Clone det. Plag det. Comp. Others ccfx deckard iclones nicad simian jplag-java jplag-text plaggie sherlock simjava simtext 7zncd-BZip2 7zncd-Deflate 7zncd-Deflate2 7zncd-LZMA 7zncd-Deflate64 7zncd-PPMd bzip2ncd gzipncd icd ncd-bzlib ncd-zlib xz-ncd bsdiff diff difflib fuzzywuzzy jellyfish ngram cosine Boiler- Plate 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 F1

Highly specialised source code similarity detection techniques and tools can perform better than more general, compression & textual similarity measures. Interesting: difflib and fuzzywuzzy. Icons made by Freepik from www.flaticon.com is licensed by Creative Commons BY 3.0 18

Optimal Configurations CCFX s Precision vs. Recall Measure Value ccfx s params b t Precision 1.00 19 7, 8, 9 Recall 0.98 5 12 19

CCFX Optimal Config. 20

b = 5, t = 11, 12 b = 19, t = 7, 8, 9 21

Pervasive Mod. Boiler- Plate

The optimal configurations derived from one data set has a detrimental impact on the similarity detection results for another data set. Cbuckley, Jpowell on en.wikipedia Icons made by Freepik from www.flaticon.com is licensed by Creative Commons BY 3.0 23

Normalisation by Decompilation Pervasively modified code Normalisation Normalised code Decompile Compile 24

Clone det. Plag det. Comp. Others ccfx deckard iclones nicad simian jplag-java jplag-text plaggie sherlock simjava simtext 7zncd-BZip2 7zncd-Deflate 7zncd-Deflate2 7zncd-LZMA 7zncd-LZMA2 7zncd-PPMd bzip2ncd gzipncd icd ncd-bzlib ncd-zlib xz-ncd bsdiff diff py-difflib py-fuzzywuzzy py-jellyfish py-ngram py-sklearn 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 F1 F1 Orig. Dec.

Compilation and decompilation can be used as an effective normalisation method that greatly improves similarity detection on Java source code (with statistical significance) IWSC 17 Icons made by Freepik from www.flaticon.com is licensed by Creative Commons BY 3.0 26

Ranked Results Only Top k Results ccfx fuzzywuzzy ncd-bzlib bzip2ncd simian gzipncd ncd-zlib jplag-java difflib jplag-text simjava gzipncd ncd-zlib sherlock jplag-text 7zncd-PPMd xzncd 7zncd-Deflate64 7zncd-Deflate fuzzywuzzy 0.8 0.85 0.9 0.95 1 Mean Average Precision (MAP) Pervasive Mod. 0.8 0.85 0.9 0.95 1 Mean Average Precision (MAP) Boiler-Plate 27

Distribution of tool s F1 scores vs. pervasive mod. type Original O = original Obfuscator A = Artifice (source) Pg = ProGuard (bytecode) Decompiler K = Krakatau Pc = Procyon 28

F1 Score 0.1 0.4 0.4 0.6 0.6 0.8 0.8 1.0 Tool O A K Pc Pg K Pg Pc A K A Pc A Pg K A Pg Pc Original O = original Obfuscator A = Artifice (source) Pg = ProGuard (bytecode) Decompiler K = Krakatau Pc = Procyon ccfx deckard iclones nicad simian jplag-java jplag-text plaggie sherlock simjava simtext 7zncd-BZip2 7zncd-Deflate 7zncd-Deflate2 7zncd-LZMA 7zncd-LZMA2 7zncd-PPMd bzip2ncd gzipncd icd ncd-zlib ncd-bzlib xzncd bsdiff diff difflib fuzzywuzzy jellyfish ngram cosine

To Sum Up A Comparison of Code Similarity Analyzers 30 Research Note: http://www.cs.ucl.ac.uk/research/research_notes/ Website: http://crest.cs.ucl.ac.uk/resources/cloplag/