Clone Detection and Maintenance with AI Techniques. Na Meng Virginia Tech

Similar documents
CCLearner: A Deep Learning-Based Clone Detection Approach

Automatic Clone Recommendation for Refactoring Based on the Present and the Past

SourcererCC -- Scaling Code Clone Detection to Big-Code

Large-Scale Clone Detection and Benchmarking

An Exploratory Study on Interface Similarities in Code Clones

Software Clone Detection. Kevin Tang Mar. 29, 2012

Semantic Clone Detection Using Machine Learning

Dr. Sushil Garg Professor, Dept. of Computer Science & Applications, College City, India

Searching for Configurations in Clone Evaluation A Replication Study

COMPARISON AND EVALUATION ON METRICS

ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools

Efficiently Measuring an Accurate and Generalized Clone Detection Precision using Clone Clustering

Positive and Unlabeled Learning for Detecting Software Functional Clones with Adversarial Training

Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits

Deep Learning for Embedded Security Evaluation

On the Stability of Software Clones: A Genealogy-Based Empirical Study

Automatic Identification of Important Clones for Refactoring and Tracking

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price.

arxiv: v1 [cs.se] 20 Dec 2015

A Characterization Study of Repeated Bug Fixes

Cross Language Higher Level Clone Detection- Between Two Different Object Oriented Programming Language Source Codes

A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique

CS229 Final Project: Predicting Expected Response Times

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS

Semantic Analysis of Search-Autocomplete Manipulations

IDENTIFICATION OF PROMOTED ECLIPSE UNSTABLE INTERFACES USING CLONE DETECTION TECHNIQUE

좋은 발표란 무엇인가? 정영범 서울대학교 5th ROSAEC Workshop 2011년 1월 6일 목요일

A Tree Kernel Based Approach for Clone Detection

Object Detection on Self-Driving Cars in China. Lingyun Li

A Category-Theoretic Approach to Syntactic Software Merging

Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm

Keywords Clone detection, metrics computation, hybrid approach, complexity, byte code

Scalable Code Clone Detection and Search based on Adaptive Prefix Filtering

CS 231A Computer Vision (Fall 2011) Problem Set 4

Salman Ahmed.G* et al. /International Journal of Pharmacy & Technology

Ryen W. White, Matthew Richardson, Mikhail Bilenko Microsoft Research Allison Heath Rice University

CAP 6412 Advanced Computer Vision

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Rearranging the Order of Program Statements for Code Clone Detection

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning

MeCC: Memory Comparisonbased Clone Detector

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

More on Neural Networks. Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5.

Threshold-free Code Clone Detection for a Large-scale Heterogeneous Java Repository

Searching the Deep Web

Quantifying and Assessing the Merge of Cloned Web-Based System: An Exploratory Study

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Evolving the Software Engineering Toolbox CHRISTOPH REICHENBACH

UML is still inconsistent!

Chapter 8 The C 4.5*stat algorithm

Detecting Malicious Activity with DNS Backscatter Kensuke Fukuda John Heidemann Proc. of ACM IMC '15, pp , 2015.

Code duplication in Software Systems: A Survey

Study and Analysis of Object-Oriented Languages using Hybrid Clone Detection Technique

Neural Detection of Semantic Code Clones via Tree-Based Convolution

Multi-View 3D Object Detection Network for Autonomous Driving

Searching the Deep Web

Metadata Topic Harmonization and Semantic Search for Linked-Data-Driven Geoportals -- A Case Study Using ArcGIS Online

Distress Image Library for Precision and Bias of Fully Automated Pavement Cracking Survey

Investigating the Impact of Active Guidance on Design Inspection

BigDataBench: a Benchmark Suite for Big Data Application

Rapid Extraction and Updating Road Network from LIDAR Data

SENSOR SELECTION SCHEME IN TEMPERATURE WIRELESS SENSOR NETWORK

Representation Learning using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval

DCCD: An Efficient and Scalable Distributed Code Clone Detection Technique for Big Code

A Virtual Laboratory for Study of Algorithms

Chianti: A Tool for Change Impact Analysis of Java Programs

Fast or furious? - User analysis of SF Express Inc

Juggling the Jigsaw Towards Automated Problem Inference from Network Trouble Tickets

Toward a Standard Rule Language for Semantic Integration of the DoD Enterprise

Baishakhi Ray and Miryung Kim The University of Texas at Austin

GRIND Gene Regulation INference using Dual thresholding. Ir. W.P.A. Ligtenberg Eindhoven University of Technology

A Measurement of Similarity to Identify Identical Code Clones

Dealing with Clones in Software : A Practical Approach from Detection towards Management

1/30/18. Overview. Code Clones. Code Clone Categorization. Code Clones. Code Clone Categorization. Key Points of Code Clones

Maintainability Index Variation Among PHP, Java, and Python Open Source Projects

The Problem of Overfitting with Maximum Likelihood

How We Refactor, and How We Know It

An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques

Learning to Segment Object Candidates

Foundations of AI. 8. Satisfiability and Model Construction. Davis-Putnam, Phase Transitions, GSAT and GWSAT. Wolfram Burgard & Bernhard Nebel

Author Prediction for Turkish Texts

On the unreliability of bug severity data

Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning

Exploring the Influence of Feature Selection Techniques on Bug Report Prioritization

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

Comparative analysis of data mining methods for predicting credit default probabilities in a retail bank portfolio

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Compiling clones: What happens?

CERT C++ COMPLIANCE ENFORCEMENT

Empirical study of the dynamic behavior of JavaScript objects

Volume 6, Issue 12, December 2018 International Journal of Advance Research in Computer Science and Management Studies

Large Scale Data Analysis Using Deep Learning

International Journal of Scientific & Engineering Research, Volume 8, Issue 2, February ISSN

Domain-specific Concept-based Information Retrieval System

Adding Source Code Searching Capability to Yioop

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)

FaCoY A Code-to-Code Search Engine

ReDroid: Prioritizing Data Flows and Sinks for App Security Transformation

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach

Automatic Domain Partitioning for Multi-Domain Learning

Transcription:

Clone Detection and Maintenance with AI Techniques Na Meng Virginia Tech

Code Clones Developers copy and paste code to improve programming productivity Clone detections tools are needed to help bug fixes or refactor code 2

Existing Clone Detection Tools Textbased Code Clone Detection Graphbased Metricsbased Tokenbased Treebased 3

Problem Statement Each algorithm works better for certain kinds of clones, but worse for the others E.g., token-based, tree-based The algorithms do not prioritize clones based on their likelihoods of being refactored 4

Research Questions How can we automatically characterize the similarity between clones? How can we only report clones that are likely to be refactored by developers? 5

[ICSME 17]: Liuqing Li, He Feng, Wenjie Zhuang, Na Meng, Barbara Ryder CCLEARNER: A DEEP LEARNING-BASED CLONE DETECTION APPROACH 6

Methodology Code Clone Detection Problem Classification Problem 7

Approach Source code Method Extraction Method Pair Enumerator Clone Pairs Feature Extraction Non-Clone Pairs Training Deep Learning Classifier Testing Clone Pairs Non-Clone Pairs 8

Our Hypothesis Code clones are more likely to share certain kinds of tokens than other tokens Tokens likely to be shared Keywords, method names, Tokens less likely to be shared Variable names, literals, 9

Feature Extraction method A method B token_freq_list A [token_freq_cat A1,, token_freq_cat A8 ] Category Name Example Category Name Example Reserved words <if, 2> Type identifiers <URLConnection, 1> Operators <+=, 2> Method identifiers <openconnection, 1> Markers <;, 2> Qualified names <arr.length, 1> Literals <1.3, 2> Variable identifiers <conn, 2> 10

Feature Extraction method A method B token_freq_list A token_freq_list B [token_freq_cat A1,, token_freq_cat A8 ] [token_freq_cat B1,, token_freq_cat B8 ] [sim_score 1,..., sim_score 8 ] 11

Input Clones Training <[sim_score 1,..., sim_score 8 ], 1> Non-clones <[sim_score 1,..., sim_score 8 ], 0> Training Process DeepLearning4j * Output A well-trained classifier (.mdl) * Deeplearning4j, http://deeplearning4j.org/, accessed: 2017-09-04 12

Testing Input A codebase Output 2 nodes in DNN Predict the likelihood of clones and nonclones Challenges Time cost O(n 2 ) Solution Two filters 13

Evaluation Benchmark: BigCloneBench * 10 source code folders One database of ground truth Clone Type: T1, T2, VST3, ST3, MT3 and WT3/4 Data Set Construction Training Data (Folder #4) T1, T2, VST3 and ST3 clones Randomly choose a subset of false clone pairs Testing data (Other 9 folders) All source files * Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal K. Roy and Mohammad Mamun Mia, "Towards a Big Data Curated Benchmark of Inter- Project Code Clones", In Proceedings of the Early Research Achievements track of the 30th International Conference on Software Maintenance and Evolution (ICSME 2014), 5 pp., Victoria, Canada, September 2014. 14

Evaluation Recall R T1 ST 3 = # of retrieved true clones pairs of T1-ST3 # of known true clones pairs of T1-ST3 Precision P estimated = F score # of retrieved true clones pairs 385 detected clone pair samples F T1 ST 3 = 2 * P es?mated * R T1-ST3 P es?mated + R T1-ST3 15

Evaluation Results Recall (%) CCLearner SourcererCC NiCad Deckard T1 100 100 100 96 T2 98 97 85 82 VST3 98 92 98 78 ST3 89 67 77 78 Precision (%) F (%) CCLearner SourcererCC NiCad Deckard 93 98 68 71 CCLearner SourcererCC NiCad Deckard 93 88 76 77 16

Things We Learnt CCLearner achieves a better trade-off between precision and recall Perhaps deep learning or our unique feature sets play the role CCLearner s recall goes down as more variation exists between clone peers We may need more features to represent the semantic equivalence between clones instead of purely the syntactic similarity 17

[Under submission] Ruru Yue, Zhe Gao, Na Meng, Yingfei Xiong, Xiaoyin Wang CLONERECOMMENDER: MACHINE LEARNING-BASED CLONE RECOMMENDATION FOR REFACTORING 18

Motivation With clone detection, developers apply clone removal refactorings e.g., Extract Method and Form Template Method Each clone detection tool reports too many clones Some clones are more likely to be refactored than others E.g., repetitively updated code clones vs. inactive code clones 19

Our Hypotheses The clones refactored by developers and those not refactored should manifest certain differences Content, context, and evolution of each clone The textual similarity/difference and cochange relationships between clone peers Different developers make refactoring decisions in similar or predictable ways 20

Approach Refactored Clones Non-Refactored Clones Per-Clone Features (15) Feature Extrac+on Code Feature (9) History Features (6) Machine Learning Training Detected Clones Clone Detec+on Clone Group Features (16) LocaBon/Context Features (4) SyntacBc Difference Features (6) Co-Change Features (5) Group Size (1) Classifier Tes3ng A Project Repository Clones to Refactor Clones not to Refactor 21

Things We Learnt The refactoring decisions seem to be predictable in most cases Some refactorings seem to be unpredictable We can only predict refactorings based on code history and its current version Some developers make different refactoring decisions 22

Conclusion Our investigation on CCLearner and CloneRecommender demonstrates that AI can improve the efficiency of software development and reduce maintenance cost AI techniques cannot fully overtake coding tasks due to (1) the difficulty of reasoning semantics, and (2) lack of the domain knowledge to evolve software 23