Detecting Code Similarity Using Patterns. K. Kontogiannis M. Galler R. DeMori. McGill University

McGill University, 3480 University St., Room 318, Montreal, Canada H3A 2A7

Abstract

A key issue in design recovery is to localize patterns of code that may implement a particular plan or algorithm. This paper describes a set of code-to-code and abstract-description-to-code matching techniques. The code-to-code matching uses dynamic programming techniques to localize similar code fragments and is targeted for large software systems (1 MLOC). Patterns are specified either as source code or as a sequence of abstract statements written in a concept language. Markov models are used to compute dissimilarity distances between an abstract description and a code fragment in terms of the probability that a given abstract statement can generate a given code fragment. The abstract-description-to-code matcher is under implementation and early experiments show it is a promising technique.

This work is funded by IBM Canada Ltd. Laboratory - Center for Advanced Studies (Toronto), National Research Council of Canada, and McGill University.

1 Introduction

One of the key objectives towards the design recovery of a large complex software system is the identification of common patterns in the code. These common patterns may reveal important operations on data types, identify system clusters, help reorganize the system in an object-oriented way, and suggest points where potential errors can be found.

In this approach, source code is parsed and represented as an annotated Abstract Syntax Tree (AST). Nodes in the AST represent language components (e.g. statements, expressions) and arcs represent relationships between these language components. For example, an IF-statement is represented as an AST node with three arcs pointing to the condition, the THEN-part, and the ELSE-part. During the analysis phase each node is annotated with control and data flow information given as a vector of software metrics [Buss94], as a set of data bindings with the rest of the system [Konto94], or as a set of keywords, variable names and data types. The REFINE environment (REFINE is a trademark of Reasoning Systems) is used to analyze and store the AST and its annotations. The annotations are computed in a compositional way from the leaves to the root of the AST.

We perform two types of plan localization. The first is based on localizing similar code fragments and we refer to it as code-to-code matching. It is used to perform code duplication detection in large C software systems. Code-to-code matching is achieved by using a dynamic programming (DP) based pattern matcher that calculates the best alignment between two code fragments at the statement level. The distance between the two code fragments is given as a summation of comparison values as well as of insertion and deletion costs corresponding to the insertions and deletions that have to be applied in order to achieve such an alignment.

The second, abstract-description-to-code matching, is a more ambitious objective. Our pattern matcher is driven by a Markov model that describes the similarity between an abstract pattern and a code fragment as a set of probabilities that a given abstract statement can generate a given source statement.

2 Code-To-Code Matching

While the general problem of determining if two functions are the same (i.e. produce the same outputs given the same inputs) is known to be undecidable, clone detection is a computable approximation problem in which a measure of structural similarity is used to determine if two fragments are derived from the same original source code. What makes clone detection difficult is that replicas will be somewhat modified and it is necessary to define a metric for "degree of similarity".
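As an illustration of a statement-level alignment whose distance is a sum of comparison, insertion, and deletion costs, the following is a minimal Python sketch. It uses the textbook edit-distance recurrence, which may differ in detail from the authors' matcher. The function names, the constant insertion and deletion costs, and the use of a Euclidean comparison on small metric vectors are all assumptions made for this example.

```python
from typing import Callable, Sequence

FeatureVector = Sequence[float]


def euclidean(x: FeatureVector, y: FeatureVector) -> float:
    """Comparison cost: Euclidean distance between two metric vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def alignment_distance(
    model: Sequence[FeatureVector],
    candidate: Sequence[FeatureVector],
    compare: Callable[[FeatureVector, FeatureVector], float] = euclidean,
    insert_cost: float = 1.0,  # assumed constant cost, for illustration only
    delete_cost: float = 1.0,  # assumed constant cost, for illustration only
) -> float:
    """Return the cost of the cheapest alignment of two statement sequences.

    Each sequence element is the feature vector of one statement.
    """
    rows, cols = len(model) + 1, len(candidate) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # Aligning against an empty sequence costs only deletions or insertions.
    for p in range(1, rows):
        dist[p][0] = dist[p - 1][0] + delete_cost
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + insert_cost
    for p in range(1, rows):
        for j in range(1, cols):
            dist[p][j] = min(
                dist[p - 1][j] + delete_cost,      # model statement unmatched
                dist[p][j - 1] + insert_cost,      # candidate statement unmatched
                dist[p - 1][j - 1] + compare(model[p - 1], candidate[j - 1]),
            )
    return dist[-1][-1]


# Hypothetical feature vectors (e.g. two metric values per statement).
model_fragment = [(1.0, 2.0), (3.0, 1.0), (0.0, 0.0)]
clone_with_gap = [(1.0, 2.0), (0.0, 0.0)]
```

With these hypothetical inputs, a fragment aligned with itself has distance zero, and removing one statement from the clone costs one deletion. A small distance therefore indicates a likely clone, while the threshold that separates clones from non-clones remains a tuning decision.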

The process starts by transforming the ASTs that correspond to the two code fragments into two sequences of expressions and statements. For example, an if-statement will be expanded into the sequence [condition-expression, THEN-part, ELSE-part]. The comparison at the statement level is based on a set of features that are calculated during the AST annotation phase. Program features used in the matching process (other features can be used too) include:

(a) software quality and complexity metrics (Fanout, D-Complexity, McCabe Cyclomatic Complexity, Albrecht's metric, and the Henry-Kafura metric)
(b) variables set/used
(c) files accessed (read/write)
(d) type and number of formal parameters passed by reference
(e) data bindings
(f) keywords and associated concepts

All the features can be calculated in a compositional way starting from the leaves and working upwards to the root of the AST. For calculating the distance between two code fragments the following DP function is used:

$$
D(E(1,p), E(1,j)) = \min \begin{cases}
\Delta(E(p-1,p), E(j-1,j)) + D(E(1,p-1), E(1,j-1)) \\
I(E(p-1,p), E(j-1,j)) + D(E(1,p-1), E(1,j-1)) \\
C(E(p-1,p), E(j-1,j)) + D(E(1,p-1), E(1,j-1))
\end{cases}
$$

where $E(i,j)$ is a feature vector between positions $i$ and $j$, $D(x,y)$ is the distance between the features $x$ of the code fragment and the features $y$ of the model or code fragment, $\Delta(x,y)$ is the cost of deleting code feature vectors represented by $x$ and $y$, $I(x,y)$ is the insertion cost, and $C(x,y)$ is the comparison cost.

The quality and the accuracy of the comparison cost depend on the program features selected and the formula used to compare these features. Currently we are using ratios of used/set variables, data types, and Euclidean distances on metric values.

This technique is used to localize a model in a large software system. Moreover, it is used to localize all code duplicates longer than n lines. As it is not always practical to have a complete plan base, the user can select at run time a code fragment to be used as a model (e.g. P). The matching process will localize all similar code fragments that may implement the same algorithm or plan as the model selected. Fig. 1 illustrates the matching process between fragment P (used as the model) and Q, representing a source code fragment. When different comparison criteria are used we obtain different distances between pairs of possible function clones, as shown in Fig. 2.

[Figure 1: The matching process between two code fragments, with the model P on one axis and the input program Q on the other. The pair P1, Q1 represents a match, while P4, Q4 represents a deletion.]

3 Abstract-Descriptions-to-Code Matching

The concept assignment problem [Bigger94] consists of assigning concepts described in a concept language to program fragments. Concept assignment can therefore be seen as a matching problem. In our approach concepts are represented as abstract-descriptions using a concept language. An abstract-description is parsed and a corresponding AST T_a is created. Similarly, source code is represented as an annotated AST T_c. Both T_a and T_c are transformed into a sequence of abstract and source code statements respectively, using transformation rules. We use REFINE to build and transform both ASTs.

The problems associated with matching concepts with code include:

- The choice of the conceptual language.

- The measure of similarity.
- The selection of a fragment in the code to be compared with the conceptual representation.

These problems are addressed in the following sections.

[Figure 2: Distances between function pairs using DP-based matching (x-axis: function pairs; y-axis: distance). The dashed line represents distances obtained using definitions and uses of variables. The continuous line represents distances obtained by using five different software metrics.]

4 Language for Abstract Representation

A number of research teams have identified the problem of code localization using query pattern languages [Paul94], [Muller92], [Church93]. In our approach we give weight to a stochastic pattern matcher that allows for partial and approximate matching. A concept language specifies in an abstract way sequences of design concepts. The concept language contains:

- abstract statements that correspond to one or more statement types in the source code language
- abstract statement descriptions that contain the feature vector data used for matching purposes
- wild characters for specifying partial matching (e.g. abstract statements that match zero or more code statements). The +-Statement represents a possible match with one or more code statements, while the *-Statement represents a match with zero or more actual code statements.
- typed variables for dynamic instantiations of variables between pattern fragments and code fragments
- operators for specifying the order of the matching process: an interleaving operator (‖) that indicates statements that can be interleaved (i.e. share no data dependencies), an operator that indicates alternative choices, and an operator that indicates sequencing

A typical example of a pattern code is given below:

    {
    Iterative-Statement(abstract-description
        uses : [?x : *list],
        keywords : [ "NULL" ])
      +-Statement abstract-description
        uses : [?y : string,..]
        keywords : [ "member" ]
      Assignment-Statement abstract-description
        uses : [?x,..],
        defines : [?x],
        keywords : [ "next" ]
    }

Here the query or pattern searches for an Iterative-Statement (e.g. a while, for, or do-while loop) that has in its condition an expression that uses variable ?x, where ?x is a pointer to the abstract type list (e.g. array, linked list), and the expression contains the keyword "NULL". The body of the Iterative-Statement contains a sequence of one or more statements (+-Statement) that uses at least variable ?y (which binds to the variable obj) and contains the keyword member, and an Assignment-Statement that uses at least variable ?x, defines variable ?x (which in this example binds to variable field), and contains the keyword next. A code fragment that matches the pattern is:

    while (field != NULL)
    {
      if (!strcmp(obj,origobj) ||
          (!strcmp(field->avaluetype,"member") && notinorig ) )
        if (strcmp(field->avalue,"method") != 0)
          INSERT_THE_FACT(o->ATTLIST[num].Aname,origObj,
                          field->avalue);
      field = field->nextvalue;
    }

5 Concept-to-Code Distance Calculation

Let T_c be the abstract syntax tree of the code fragment and T_a be the AST of the abstract representation. Using transformation rules we represent these two trees as two sequences of expressions and statements, C_a and C_c.

The process is guided by two categories of Markov models. The first Markov model is dynamic, as it is built at run time while parsing the abstract code description. Nodes represent abstract statements, and arcs represent allowable transitions and what is expected to be matched from the source code. The second Markov model category is static and represents the probability that a particular abstract statement generates a code statement. There are as many static Markov models as there are abstract statements in the abstract language.

The best alignment between C_a and C_c may be computed by the Viterbi dynamic programming algorithm. Currently we are experimenting with the following function:

$$
P(S_1, S_2, \ldots, S_k \mid A_{n,k}) = \max_{i \to n} \; P(S_1, S_2, \ldots, S_{k-1} \mid A_{i,k-1}) \cdot P(S_k \mid A_{n,k})
$$

where $P(S_k \mid A_{n,k})$ is interpreted as "the probability that code fragment $S_k$ is generated by abstract statement $A_n$ at frame $k$", and $i \to n$ indicates that $A_i \to A_n$ is a possible transition between abstract statements in the model. For example, given the run-time created model of Fig. 3, the function above will perform as follows:

$$
\begin{aligned}
P(S_1 \mid A_{1,1}) &= 1.0 \\
P(S_1, S_2 \mid A_{2,2}) &= P(S_1 \mid A_{1,1}) \cdot P(S_2 \mid A_{2,2}) \\
P(S_1, S_2 \mid A_{3,2}) &= P(S_1 \mid A_{1,1}) \cdot P(S_2 \mid A_{3,2}) \\
P(S_1, S_2, S_3 \mid A_{3,3}) &= \max \begin{cases}
P(S_1, S_2 \mid A_{2,2}) \cdot P(S_3 \mid A_{3,3}) \\
P(S_1, S_2 \mid A_{3,2}) \cdot P(S_3 \mid A_{3,3})
\end{cases} \\
P(S_1, S_2, S_3 \mid A_{2,3}) &= P(S_1, S_2 \mid A_{2,2}) \cdot P(S_3 \mid A_{2,3})
\end{aligned}
$$

[Figure 3: A dynamic model for the pattern A1 A2 A3.]

Note that at the second frame, two transitions have been consumed (two matches have been tried with entities in the source code) and the reachable active states are A2 or A3.

The distances between individual abstract statements and code fragments are given in terms of probabilities of the form $P(S_i \mid A_j)$, interpreted as the probability of abstract statement $A_j$ generating statement $S_i$. The static model is used to calculate the $P(S_i \mid A_j)$ probabilities. The dynamic model, on the other hand, is created at run time while parsing the abstract pattern (Section 3). With each transition we associate a list of probabilities based on the type of expression likely to be found next in the code for the plan that we consider. For example, in the Traversal of a linked list plan the while loop condition, which is an expression, most probably involves an inequality of the form (list-node-ptr != NULL), which contains an identifier reference and the keyword NULL.
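The recurrence of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' REFINE implementation: the function name `viterbi_align`, the transition table (read off the worked example for the pattern A1 A2 A3), and the emission probabilities are all hypothetical. As in the recurrence, each frame multiplies the best predecessor score by the probability that the active abstract statement generates the current code statement.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def viterbi_align(
    statements: Sequence[str],
    start: str,
    transitions: Dict[str, List[str]],
    emission: Callable[[str, str], float],
) -> Dict[str, Tuple[float, List[str]]]:
    """Map each abstract statement reachable at the last frame to its
    best score and the sequence of abstract statements that produced it."""
    if not statements:
        return {}
    # Frame 1: only the first abstract statement of the pattern is active.
    best = {start: (emission(statements[0], start), [start])}
    for stmt in statements[1:]:
        frame: Dict[str, Tuple[float, List[str]]] = {}
        for src, (prob, path) in best.items():
            for dst in transitions.get(src, []):
                candidate = prob * emission(stmt, dst)
                # Keep only the best-scoring predecessor for each state.
                if dst not in frame or candidate > frame[dst][0]:
                    frame[dst] = (candidate, path + [dst])
        best = frame
    return best


# Transitions assumed from the worked example: A1 -> A2 | A3,
# A2 -> A2 | A3, A3 -> A3.
TRANSITIONS = {"A1": ["A2", "A3"], "A2": ["A2", "A3"], "A3": ["A3"]}

# Hypothetical static probabilities P(S_i | A_j); real values would come
# from the static models.
EMIT = {
    ("S1", "A1"): 1.0,
    ("S2", "A2"): 0.5,
    ("S2", "A3"): 0.25,
    ("S3", "A2"): 0.125,
    ("S3", "A3"): 0.75,
}

result = viterbi_align(
    ["S1", "S2", "S3"], "A1", TRANSITIONS, lambda s, a: EMIT.get((s, a), 0.0)
)
```

With these made-up numbers, the best way to end in A3 at the third frame passes through A2 at the second frame, mirroring the max over the two candidate products in the worked example. Storing the path alongside the score is a convenience of this sketch; a larger implementation would more likely keep back-pointers per frame.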

Different models may exist for different plans. For example, the traversal of linked-list plan may have a higher probability attached to the is-an-inequality transition, as the programmer expects a pattern of the form (field != NULL). An example of a static model for the pattern-expression is given in Fig. 4, where $A_j$ is a Pattern-Expression abstract statement and $S_i$ is an expression in the source code.

[Figure 4: The static model for the expression-pattern. The Pattern-Expression state has four outgoing transitions, each with probability 0.25: is-an-inequality (to Pattern-Inequality), is-an-equality (to Pattern-Equality), is-an-id-ref (to Pattern-Id-Ref), and is-a-function-call (to Pattern-Fcn-Call). The remaining transitions, to the arguments (Arg1, Arg2, id-ref) and to the function name and function arguments, each have probability 1.0.]

The initial probabilities in the static model are provided by the user, who may either give a uniform distribution over all outgoing transitions from a given state or provide some subjectively estimated values. These values may come from the knowledge that a given plan is implemented in a specific way. For example, a traversal of linked-list plan is usually implemented with a while loop, so the Iterative abstract statement can be considered to generate a while statement with higher probability than a for statement. Once the system is used and results are evaluated, these probabilities can be adjusted to improve the performance.

Probabilities can be dynamically adapted to a specific software system using a cache memory method originally proposed (for a different application) in [Kuhn90]. A cache is used to maintain the counts for the most frequently recurring statement patterns in the code being examined. Static probabilities can be weighted with dynamically estimated ones as follows:

$$
P(S_i \mid A_j) = \lambda \cdot P_{cache}(S_i \mid A_j) + (1 - \lambda) \cdot P_{static}(S_i \mid A_j)
$$

In this formula $P_{cache}(S_i \mid A_j)$ represents the frequency with which $A_j$ generates $S_i$ in the code examined at run time, while $P_{static}(S_i \mid A_j)$ represents the a-priori probability of $A_j$ generating $S_i$ given in the static model. $\lambda$ is a weighting factor.

Finally, the selection of a code fragment to be matched with an abstract description is based on the following criteria:

a) the first source code statement can be generated by the first pattern statement
b) keywords and variable names
c) metric values (if provided)

Once a candidate list of code fragments has been chosen, the actual pattern matching takes place in order to select only those code fragments that best fit the pattern.

6 Conclusion

Pattern matching is the focal point of plan recognition. In this paper we have investigated two approaches for plan localization. The first localizes similar code fragments and is mostly used for detecting code duplication. It is based on a dynamic programming pattern matcher that computes the best alignment between two code fragments. We have experimented with different code features for comparing code statements and we are able to detect clones in large software systems (300 KLOC).

The second approach is based on an abstract pattern language that is used to describe programming plans. Markov models are used to guide the matching process, and a Viterbi algorithm computes the best match between an abstract plan description and a code fragment. Dissimilarity distances are given in terms of the probability that a given abstract statement can generate a particular code fragment. The system is implemented using the REFINE environment and supports plan localization in C programs.

References

[Bigger94] Biggerstaff, T., Mitbander, B., Webster, D., "Program Understanding and the Concept Assignment Problem", Communications of the ACM, May 1994, Vol. 37, No. 5, pp. 73-83.

[Buss94] Buss, E., et al., "Investigating Reverse Engineering Technologies for the CAS Program Understanding Project", IBM Systems Journal, Vol. 33, No. 3, 1994, pp. 477-500.

[Church93] Church, K., Helfman, I., "Dotplot: a program for exploring self-similarity in millions of lines of text and code", J. Computational and Graphical Statistics 2,2, June 1993, pp. 153-174.

[Kuhn90] Kuhn, R., DeMori, R., "A Cache-Based Natural Language Model for Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 6, June 1990, pp. 570-583.

[Konto94] Kontogiannis, K., DeMori, R., Bernstein, M., Merlo, E., "Localization of Design Concepts in Legacy Systems", In Proceedings of International Conference on Software Maintenance 1994, September 1994, Victoria, BC, Canada, pp. 414-423.

[Muller92] Muller, H., Corrie, B., Tilley, S., Spatial and Visual Representations of Software Structures, Tech. Rep. TR-74.086, IBM Canada Ltd., April 1992.

[Paul94] Paul, S., Prakash, A., "A Framework for Source Code Search Using Program Patterns", IEEE Transactions on Software Engineering, June 1994, Vol. 20, No. 6, pp. 463-475.