Automatic Mining of Functionally Equivalent Code Fragments via Random Testing. Lingxiao Jiang and Zhendong Su


Cloning in Software Development
Prior knowledge: specification, documentation, code base, test suites, bug database
How: search, copy, paste, modify, compose, reimplement
Result: new software product

Applications of Clone Detection
Refactoring; pattern mining; reuse; debugging; evolution study; plagiarism detection

A Spectrum of Clone Detection
Techniques ordered by increasing semantic awareness of clone detection:
String: Baker 1992 (parameterized string algorithm)
Token: Kamiya et al. 2002 (CCFinder); Li et al. 2004 (CP-Miner); Basit et al. 2007 (Repeated Tokens Finder)
Syntax tree: Baxter et al. 1998 (CloneDR); Wahler et al. 2004 (XML-based); Jiang et al. 2007 (Deckard)
Program dependence graph: Komondoor et al. 2000, 2001; Liu et al. 2006 (GPLAG); Gabel et al. 2008
Birthmark: Collberg et al. 1999 (software watermarking); Schuler et al. 2007 (dynamic birthmarking); Lim et al. 2008 (static birthmarking); Zhou et al. 2008 (combined approach)
Functionality (actual behavior): functional equivalence. How extensive is its existence?

Functional Equivalence
Definition: Code #1 and Code #2 are functionally equivalent if, given the same inputs, they produce the same outputs.
Applicability: arbitrary pieces of code; source and binary; from whole programs to whole functions to code fragments
Example: sorting algorithms (bubble, selection, merge, quick, heap)
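The definition can be illustrated with a small sketch in C. It compares the I/O behavior of two sorting routines on random arrays. The names `same_io_behavior`, `fragment_fn`, `sort_then_bump` and the fixed length `LEN` are assumptions made for this illustration and are not taken from the talk.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { LEN = 8 };                         /* fixed input size for this sketch */
typedef void (*fragment_fn)(int *, size_t);

/* Bubble sort: swap adjacent out-of-order elements until sorted. */
void bubble_sort(int *a, size_t n) {
    for (size_t i = 0; i + 1 < n; i++)
        for (size_t j = 0; j + 1 < n - i; j++)
            if (a[j] > a[j + 1]) {
                int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
            }
}

/* Selection sort: move the minimum of the unsorted suffix to the front. */
void selection_sort(int *a, size_t n) {
    for (size_t i = 0; i + 1 < n; i++) {
        size_t min = i;
        for (size_t j = i + 1; j < n; j++)
            if (a[j] < a[min]) min = j;
        int t = a[i]; a[i] = a[min]; a[min] = t;
    }
}

/* Deliberately NOT equivalent: sorts, then changes the first element. */
void sort_then_bump(int *a, size_t n) {
    bubble_sort(a, n);
    if (n > 0) a[0] += 1;
}

/* Returns 1 if f and g produce identical outputs on `trials` random
 * inputs, 0 as soon as one input separates them. */
int same_io_behavior(fragment_fn f, fragment_fn g, int trials, unsigned seed) {
    int x[LEN], y[LEN];
    assert(trials > 0);
    srand(seed);
    for (int t = 0; t < trials; t++) {
        for (size_t i = 0; i < LEN; i++) x[i] = rand() % 1000;
        memcpy(y, x, sizeof x);           /* both fragments see the same input */
        f(x, LEN);
        g(y, LEN);
        if (memcmp(x, y, sizeof x) != 0) return 0;
    }
    return 1;
}
```

Calling `same_io_behavior(bubble_sort, selection_sort, 10, 1)` treats the two sorts as equivalent, while `sort_then_bump` is separated from them by the first input. Agreement on random inputs is evidence of equivalence, not a proof.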

Previous Work on Program Equivalence
[Cousineau 1979; Raoult 1980; Zakharov 1987; Crole 1995; Pitts 2002; Bertran 2005; Matsumoto 2006; Siegel 2008; ...]
Many are based on formal semantics
They consider whole programs or functions only, not arbitrary code fragments
They check equivalence among given pieces of code, not scalable detection

Our Objectives and Challenges
Objective: detect functionally equivalent code fragments (Code 1, ..., Code i, ..., Code n) in a program
Challenges: large number of code fragments; unclear I/O interfaces
Objective: compare I/O behaviors directly by running each piece of code with random inputs
Challenge: huge number of code executions

Key 1: Semantic-Aware I/O Identification
Identify input and output variables based on data flows in the code:
Variables used before being defined are inputs
Variables defined but possibly not used afterwards are outputs
Example (code figure not preserved): input variables: i and data; output variable: data
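A minimal sketch of this data-flow rule for straight-line code follows. It assumes each statement has already been reduced to bitmasks of the variables it defines and uses. The types `stmt` and `io_sets`, the function `identify_io` and the swap fragment are hypothetical, chosen only so that the result matches the slide's example (inputs `i` and `data`, output `data`). Real fragments with branches and loops would need a proper data-flow analysis.

```c
#include <assert.h>

/* Hypothetical variable encoding: one bit per variable. */
enum { VAR_I = 1u << 0, VAR_DATA = 1u << 1, VAR_TMP = 1u << 2 };

typedef struct { unsigned defs; unsigned uses; } stmt;
typedef struct { unsigned inputs; unsigned outputs; } io_sets;

/* inputs:  variables read before any definition in the fragment.
 * outputs: variables whose latest definition is never read afterwards
 *          inside the fragment (a simplification of "defined but may
 *          not be used"). */
io_sets identify_io(const stmt *s, int n) {
    unsigned defined = 0, inputs = 0, pending = 0;
    for (int k = 0; k < n; k++) {
        inputs  |= s[k].uses & ~defined;  /* used before defined */
        pending &= ~s[k].uses;            /* these definitions were consumed */
        defined |= s[k].defs;
        pending |= s[k].defs;             /* newest definitions, not yet read */
    }
    io_sets r = { inputs, pending };
    return r;
}

/* Example fragment:
 *     tmp = data[i];
 *     data[i] = data[i + 1];
 *     data[i + 1] = tmp;
 * An array element access is treated as an access to the whole array. */
static const stmt swap_fragment[3] = {
    { VAR_TMP,  VAR_DATA | VAR_I },
    { VAR_DATA, VAR_DATA | VAR_I },
    { VAR_DATA, VAR_TMP  | VAR_I },
};
```

For `swap_fragment`, `identify_io` reports `i` and `data` as inputs and `data` as the only output. `tmp` is neither, because it is defined inside the fragment and read before the fragment ends.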

Key 2: Limit Number of Inputs
Schwartz-Zippel lemma: polynomial identities can be tested with few random values.
Let $D(x) = p_1(x) - p_2(x)$.
If $p_1(x) \equiv p_2(x)$, then $D(x) \equiv 0$.
If $p_1(x) \not\equiv p_2(x)$, then $D(x) = 0$ has at most a finite number $d$ of roots.
$\mathrm{Prob}(D(v) = 0)$ is bounded by $d/|S|$ for a value $v$ chosen uniformly at random from a finite domain $S$ of $x$.
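The idea behind the lemma can be sketched as a randomized identity test. The polynomials, the prime modulus `P` and the function `probably_identical` below are assumptions made for illustration. Arithmetic is done modulo a prime so that values stay in a finite domain and cannot overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define P 1000003u   /* prime modulus; values are taken from {0, ..., P-1} */

typedef uint64_t (*poly_fn)(uint64_t);

/* p1(x) = (x + 1)^2 */
uint64_t p_square(uint64_t x)   { x %= P; return (x + 1) * (x + 1) % P; }
/* p2(x) = x^2 + 2x + 1, the same polynomial written differently */
uint64_t p_expanded(uint64_t x) { x %= P; return (x * x + 2 * x + 1) % P; }
/* p3(x) = x^2 + 2x + 2, differs from p1 by the constant 1 */
uint64_t p_shifted(uint64_t x)  { x %= P; return (x * x + 2 * x + 2) % P; }

/* Returns 0 as soon as a random value separates f and g, otherwise 1.
 * If f and g are different polynomials whose difference has d roots,
 * one trial misses the difference with probability at most d/|S|.
 * rand() may cover only part of {0, ..., P-1}, which is enough for a sketch. */
int probably_identical(poly_fn f, poly_fn g, int trials, unsigned seed) {
    assert(trials > 0);
    srand(seed);
    for (int t = 0; t < trials; t++) {
        uint64_t v = (uint64_t)rand() % P;
        if (f(v) != g(v)) return 0;
    }
    return 1;
}
```

Here `p_square` and `p_expanded` agree on every value, so the test always reports them as identical. `p_shifted` differs from `p_square` everywhere (the difference is a nonzero constant with no roots), so a single trial is enough to separate them.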

EqMiner
Components: Source Code, Code Chopper, Code Transformer, Input Generator, Code Clustering, Code Filter
Steps: Fragment Extraction, Fragment Compilation, I/O Identification, Fragment Execution, Output Comparison
Output: Functionally Equivalent Code Clusters

Code Chopper
Sliding windows of various sizes over serialized statements
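A sliding-window chopper can be sketched as an enumeration of (start, length) pairs over a serialized statement list. The `window` type, the function `chop` and its parameters are hypothetical. The talk does not state which window sizes EqMiner uses.

```c
#include <assert.h>

typedef struct { int start; int len; } window;

/* Enumerates every window of min_len..max_len consecutive statements
 * over n_stmts serialized statements. Stores at most `cap` windows in
 * `out` and returns the total number of windows found. */
int chop(int n_stmts, int min_len, int max_len, window *out, int cap) {
    int count = 0;
    assert(min_len >= 1);
    for (int len = min_len; len <= max_len && len <= n_stmts; len++) {
        for (int start = 0; start + len <= n_stmts; start++) {
            if (count < cap) {
                out[count].start = start;
                out[count].len = len;
            }
            count++;
        }
    }
    return count;
}
```

For five statements and window sizes 2 and 3, this yields four windows of length 2 and three of length 3. Each window is one candidate code fragment.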

Code Transformer
Declare undeclared variables and labels
Define all used types
Remove assembly code
Replace goto and return statements
Replace function calls: replace each call with a random input variable; ignore side effects and only consider return values
Read inputs
Dump outputs
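As an illustration of these transformations, the sketch below wraps a made-up fragment so that it can be compiled and run on its own. The fragment, the struct names and the way a `return` is recorded are all assumptions. The talk does not show the code EqMiner actually generates.

```c
#include <assert.h>

/* Hypothetical original fragment, cut from the middle of a function:
 *     r = helper(a) + b;
 *     if (r < 0) return -1;
 *     total = total + r;
 */

/* Inputs: b and total are used before being defined in the fragment;
 * call_ret_0 stands in for the return value of helper(a). */
typedef struct { int b; int total; int call_ret_0; } frag_inputs;

/* Outputs: total, plus a record of whether the fragment hit `return`. */
typedef struct { int total; int returned; int return_value; } frag_outputs;

frag_outputs run_fragment(frag_inputs in) {
    int b = in.b, total = in.total;       /* read inputs */
    int r;                                /* declaration added for the fragment */
    frag_outputs out = { 0, 0, 0 };

    r = in.call_ret_0 + b;                /* call replaced by an input variable */
    if (r < 0) {                          /* `return -1;` replaced by a record */
        out.returned = 1;
        out.return_value = -1;
    } else {
        total = total + r;
    }
    out.total = total;                    /* dump outputs */
    return out;
}
```

With inputs `b = 2`, `total = 10` and `call_ret_0 = 40`, the wrapped fragment reports `total = 52`. A negative sum leaves `total` unchanged and records the early return instead.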

Input Generation
In order to share concrete input values among input variables for different code fragments, separate the generation into two phases:
1. Construct bounded memory pools filled with random primary values and pointers. E.g.,
   Primary value pool (bytes): 100, -78
   Pointer value pool (0/1): 1, 0
2. Initialize each variable with values from the pools. E.g., given
   typedef struct { int x, y; } X;
   and input variables X* x; int* y; the initialization is
   x = malloc(sizeof(*x)); x->x = 100; x->y = -78; y = 0;
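The two phases can be sketched as follows. The `pools` structure and the `init_*` helpers are hypothetical, and the primary pool holds `int` values rather than raw bytes to keep the sketch short. A pointer-pool value of 1 means "allocate and fill", 0 means "null pointer".

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int x, y; } X;

/* Phase 1: bounded pools of values, consumed in order (wrapping around). */
typedef struct {
    const int *primary; size_t n_primary; size_t next_primary;
    const int *pointer; size_t n_pointer; size_t next_pointer;
} pools;

static int take_primary(pools *p) {
    return p->primary[p->next_primary++ % p->n_primary];
}

static int take_pointer_flag(pools *p) {
    return p->pointer[p->next_pointer++ % p->n_pointer];
}

/* Phase 2: initialize variables from the pools. Both helpers return NULL
 * when the pointer pool says 0 (or if malloc fails). */
X *init_X_ptr(pools *p) {
    if (!take_pointer_flag(p)) return NULL;
    X *x = malloc(sizeof *x);
    if (x) {
        x->x = take_primary(p);
        x->y = take_primary(p);
    }
    return x;
}

int *init_int_ptr(pools *p) {
    if (!take_pointer_flag(p)) return NULL;
    int *v = malloc(sizeof *v);
    if (v) *v = take_primary(p);
    return v;
}
```

With the pools from the slide (primary values 100 and -78, pointer values 1 and 0), `init_X_ptr` yields a struct with `x = 100` and `y = -78`, and the following `init_int_ptr` yields a null pointer. Because the pools are fixed, another fragment with the same input types receives the same concrete values.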

Code Clustering
Eager partitioning of code fragments for a set of random inputs
I1: run f1, f2, f3, f4, f5, f6, f7, f8, f9, ..., fi, ..., fn on the first input; each fragment joins the cluster whose output it matches, or starts a new cluster:
   C1: f1, f5
   C2: f2, f3, f6
   C3: f4
   C4: f7
   ...
   Ck: fi, ..., fn
I2: repeat the same for each intermediate cluster:
   C11: f1
   C12: f5
   ...
   Ck1: fi, fj
   Ck2: fl, fp
   Ckx: fq, fn
Is: continue until only one code fragment is left in each cluster, or until a reasonable number s of inputs has been used
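The partitioning loop can be sketched in C under strong simplifications: every fragment is modeled as a function from one `int` to one `int`, and clusters are refined with a quadratic scan. The names `cluster_fragments`, `frag_fn`, `MAX_FRAGS` and the sample fragments are assumptions for this example, not EqMiner's implementation.

```c
#include <assert.h>

enum { MAX_FRAGS = 64 };
typedef int (*frag_fn)(int);

/* Sample fragments (hypothetical). */
int f_double(int x)  { return x + x; }
int f_times2(int x)  { return 2 * x; }
int f_square(int x)  { return x * x; }
int f_plus10(int x)  { return x + 10; }
int f_plus100(int x) { return x + 100; }

/* Assigns a cluster id to every fragment in cluster[0..n-1] and returns
 * the number of clusters. Two fragments stay together only if they
 * produced equal outputs on every input tried so far. */
int cluster_fragments(const frag_fn *frags, int n,
                      const int *inputs, int s, int *cluster) {
    assert(n >= 0 && n <= MAX_FRAGS);
    int count = n > 0 ? 1 : 0;
    for (int i = 0; i < n; i++) cluster[i] = 0;   /* start with one cluster */

    /* Stop after s inputs, or once every cluster is a singleton. */
    for (int k = 0; k < s && count < n; k++) {
        int out[MAX_FRAGS], next[MAX_FRAGS];
        count = 0;
        for (int i = 0; i < n; i++) {
            out[i] = frags[i](inputs[k]);
            next[i] = -1;
            for (int j = 0; j < i; j++)           /* same old cluster, same output? */
                if (cluster[j] == cluster[i] && out[j] == out[i]) {
                    next[i] = next[j];
                    break;
                }
            if (next[i] < 0) next[i] = count++;   /* otherwise open a new cluster */
        }
        for (int i = 0; i < n; i++) cluster[i] = next[i];
    }
    return count;
}
```

On the inputs 2, 0 and 5, the first input leaves `f_double`, `f_times2` and `f_square` in one cluster because all three return 4. The input 0 does not separate them either, and the input 5 splits off `f_square`, which gives four clusters in total. This shows why more than one random input is needed.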


Results on Sorting Algorithms
5 sorting algorithms with both recursive and non-recursive versions
~350 LoC, ~200 code fragments
s = 10
69 clone clusters reported; most are portions of the algorithms
4 non-recursive versions are in the same cluster

Results on the Linux Kernel
s = 10
>800K code fragments were separated into 32K non-trivial clusters
[Chart: number of clusters (log10 scale) by cluster size: 2, 3, 4, 5-10, 11-20, 21-100, 101-3842]
Additional 100 tests for 128 semi-randomly selected clusters: 3% of all of the code fragments became singletons
100 more tests: 0.5% additional

Differences from Syntactic Clones
[Chart: number of code fragments (log10 scale) per directory in the Linux kernel (arch, block, crypto, drivers, fs, init, ipc, kernel, lib, mm, net, security, sound), comparing functionally equivalent and syntactically equivalent fragments. Chart annotations: 56%, 92K fragments; 36%, 60K.]

Differences from Syntactic Clones
False positives
Function calls
Macro related + few outputs:
   if ( ALWAYS_FALSE ) { } else output = input;
   output = input;
Lexical differences:
   output = input + 10;
   output = input + 100;

   output = 0; if ( output < input ) {... output = input + 1; }
   output = 0; if ( output < input ) {... output = output + 1; }

Conclusion & Future Work
First scalable detection of functionally equivalent code based on random testing
Confirms the existence of many functional clones, which complement syntactic clones
Enables further studies on functional clone patterns
Explore utilities of functionally equivalent code

Thank you! Questions? jiangl@cs.ucdavis.edu