Semantic Clone Detection Using Machine Learning

Abdullah Sheneamer, University of Colorado, Colorado Springs, CO, USA
Jugal Kalita, University of Colorado, Colorado Springs, CO, USA

Abstract

If two fragments of source code are identical or similar to each other, they are called code clones. Code clones introduce difficulties in software maintenance and cause bug propagation. In this paper, we present a machine learning framework to automatically detect clones in software, which is able to detect Type-3 clones and the most complicated kind of clones, Type-4 clones. Previously used traditional features are often weak in detecting semantic clones. The novel aspects of our approach are the extraction of features from abstract syntax trees (AST) and program dependency graphs (PDG), the representation of a pair of code fragments as a vector, and the use of classification algorithms. The key benefit of this approach is that our tool can find both syntactic and semantic clones. Our evaluation indicates that using our new AST and PDG features is a viable methodology, since they improve clone detection on IJaDataset 2.0.

Index Terms: Code clones, Software clones, Classifier algorithms, Abstract syntax trees (AST), Program dependence graphs (PDG).

I. INTRODUCTION

If two fragments of source code are identical or similar to each other, they are called code clones. Code clones introduce difficulties in software maintenance and lead to bug propagation. This paper discusses a novel way to detect semantic clones. Programs have well-defined syntax, which can be represented by Abstract Syntax Trees (ASTs). ASTs have been successfully used to detect code clones [2]. In addition, we hypothesize that discriminative semantic features based on the Program Dependency Graph (PDG) can be used with machine learning to detect semantic clones. In this paper, we extract new features from ASTs and PDGs for code fragments to improve the accuracy of code clone detection.
The advantage of PDG-based detection is that it can detect non-contiguous code clones, which other detection techniques handle poorly [3].

The contributions of this paper are the following. We use novel features of ASTs to detect syntactic clones and features of PDGs to detect semantic clones. To the best of our knowledge, we are the first to use such features. We represent a pair of method blocks as a vector in several ways to improve detection of source code clones. We use the extracted features to learn a model that detects syntactic and semantic clones using a large number of algorithms. We show that using our new features achieves higher performance for all classifiers we use.

The rest of the paper is organized as follows. Section II discusses background material. Related work is introduced in Section III. In Section IV, we describe our feature extraction approach. Our approach is described in Section V. Evaluation of our work is discussed in Section VI. Finally, the paper is concluded in Section VII.

II. BACKGROUND

A. Basic Definitions

Here, we provide definitions which we use throughout our paper.

Definition 1: Code Fragment. A code fragment (CF) is a part of the source code needed to run a program. It can contain functions, begin-end blocks or a sequence of statements.

Definition 2: Clone Pair. If a code fragment CF1 is similar to another code fragment CF2 syntactically or semantically, one is called a clone of the other.

B. Types of Clones

There are four types of clone relations between two code fragments, based on the nature of similarity in their text or meaning [10].

Type-1 (Exact clones): Two code fragments are exact copies of each other except for whitespace, blanks and comments.

Type-2 (Renamed): Two code fragments are similar except for names of variables, types, literals and functions.
Type-3 (Gapped clones): Two copied code fragments are similar, but with modifications such as added or removed statements, and the use of different identifiers, literals, types, whitespace, layouts and comments.

Type-4 (Semantic clones): Two code fragments are semantically similar, without being syntactically similar.

III. RELATED WORK

Most clone detection techniques using text-based [1], token-based [5], tree-based [6], or PDG-based [7] approaches are good at detecting Type-1 and Type-2 clones, but they miss some Type-3 clones and most Type-4 clones. Komondoor and Horwitz [11] use program slicing [26] to find isomorphic PDG subgraphs and code clones. The nodes in a PDG represent statements and predicates, and edges represent data and control dependencies. Higo and Kusumoto [7] propose a PDG-based incremental two-way slicing approach to detect clones, called Scorpio. Krutz and Shihab [8] propose a tool called Concolic Code Clone Detection (CCCD). CCCD combines concrete and symbolic values in order to traverse all possible paths of an application. It is able to detect the most complicated kind of clones, Type-4 clones. Wang et al. [27] propose an approach that assists in understanding the harmfulness of intended cloning operations using Bayesian Networks and a set of features such as history, code, and destination features. Sheneamer and Kalita [9] present a hybrid clone detection technique that first uses a coarse-grained technique to improve precision and then a fine-grained technique to improve recall. Their approach detects Type-1, Type-2 and Type-3 clones. Sajnani et al. [10] introduce SourcererCC for accurate token-based near-miss clone detection. They use an optimized partial index of tokens and filtering heuristics to achieve large-scale clone detection. However, their tool cannot detect semantic clones.

IV. FEATURE EXTRACTION

In metric-based clone detection techniques, a number of metrics or features are computed for each fragment of code, and similar fragments are found by comparing metric vectors instead of comparing code or ASTs directly. The 10 traditional software metrics used by different authors [12] are: 1) number of lines of code, 2) number of assignments, 3) number of selection statements, 4) number of iteration statements, 5) number of switch or case statements, 6) number of returns, 7) number of try statements, 8) number of variable declaration statements, 9) number of expression statements, and 10) number of type parameters. All of these features are computed and their values are stored in a database for all methods in a dataset of programs [12]. Pairs of similar methods are detected by comparing the metric values using machine learning algorithms.

Our approach extracts syntactic and semantic features of code fragments from the ASTs and PDGs of their methods.
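To make the metric-vector idea concrete, the following minimal sketch counts a handful of the traditional features for one code fragment. It is an illustration under stated assumptions, not the extraction pipeline used in this paper: it analyzes Python source with Python's standard ast module rather than Java source with JDT, it covers only six of the ten metrics, and the names NODE_TO_FEATURE, FEATURE_NAMES and metric_vector are hypothetical.

```python
import ast
from collections import Counter

# Hypothetical mapping from Python AST node classes to traditional metric names.
# (The paper works on Java; Python's ast module is used here only to illustrate.)
NODE_TO_FEATURE = {
    ast.Assign: "assignments",
    ast.AugAssign: "assignments",
    ast.If: "selection_statements",
    ast.For: "iteration_statements",
    ast.While: "iteration_statements",
    ast.Return: "returns",
    ast.Try: "try_statements",
}

# Fixed feature order, so every fragment yields a vector of the same shape.
FEATURE_NAMES = [
    "lines_of_code",
    "assignments",
    "selection_statements",
    "iteration_statements",
    "returns",
    "try_statements",
]


def metric_vector(source):
    """Return the metric vector of one code fragment as a list of counts."""
    counts = Counter()
    # Walk every node of the syntax tree and count the ones we care about.
    for node in ast.walk(ast.parse(source)):
        feature = NODE_TO_FEATURE.get(type(node))
        if feature is not None:
            counts[feature] += 1
    # Lines of code: non-blank source lines.
    counts["lines_of_code"] = sum(1 for line in source.splitlines() if line.strip())
    return [counts[name] for name in FEATURE_NAMES]


EXAMPLE = """
def total(values):
    result = 0
    for v in values:
        if v > 0:
            result += v
    return result
"""

print(metric_vector(EXAMPLE))  # prints [6, 2, 1, 1, 1, 0]
```

Two fragments can then be compared through their vectors alone, which is what makes metric-based detection cheap compared with matching code or trees directly.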
A pair of code fragments is represented as a feature vector, and a large collection of such pair vectors is used to train supervised learning classifiers to identify clone types. Some of the new features we extract from ASTs and PDGs are given in Table I.

A. AST and PDG

A Program Dependency Graph (PDG) represents control and data dependencies in a program. The nodes of a PDG represent the statements and conditions in a program. Control dependencies represent flow of control information. Data dependencies represent data flow information. A PDG is generated from the source code by semantics-aware techniques [14]. Semantic information contained in a program is captured by PDGs. Therefore, we hypothesize that using features from PDGs will be beneficial in detecting semantic clones.

An Abstract Syntax Tree (AST) is a tree representation of the abstract syntactic structure of the source code. Each node of the tree represents a construct occurring in the source code. All of the source code is parsed and converted into an abstract syntax tree or parse tree for extraction of features.

TABLE I
SOME AST AND PDG FEATURES

Source  Feature                               Description
AST     No. of Constructors                   Counts this ( [ Expression, Expression ] ) ; in method blocks.
AST     No. of Field Accesses                 Counts Expression. Identifier in method blocks.
AST     No. of Super Constructor Invocations  Counts [ Expression. ] super ( [ Expression, Expression ] ) ; in method blocks.
AST     No. of Super Method Invocations       Counts [ ClassName. ] super. Identifier ( [ Expression, Expression ] ) in method blocks.
PDG     Decl Assign                           DA if Assignment comes after Declaration, else NA.
PDG     Control Decl                          DC if Declaration comes after Control (e.g., i < count, for, while, if, switch), else NA.
PDG     Control Assign                        CA if Assignment comes after Control, else NA.
PDG     Expr Dec                              ED if Declaration comes after Expression, else NA.

V. OUR WORK

We use similarity metrics to help identify Type-4 clones automatically. The proposed method consists of the following steps.
Figure 1 illustrates the workflow of our approach.

Step 1. Perform lexical analysis and normalization. We transform and normalize all source files into special token sequences to detect not only identical clones but also similar ones.

Step 2. Detect method blocks. This step needs lexical analysis and syntactic analysis to detect every block in the given source files.

Step 3. Build pairwise method blocks. This step pairs each method block with another method block. We compare all blocks in pairs to judge, using a classification algorithm, whether or not the two method blocks are clones.

Step 4. Extract features for two method blocks. In this step, features are extracted from each code fragment using the Java Development Tools (JDT). These features improve the accuracy of Type-3 and Type-4 clone detection.

Step 5. Represent pair instances as one vector. In this step, we create pairs of instances from the original data. Two original instances are represented by feature vectors $X = (x_1, x_2, x_3, \ldots, x_n)$ and $Y = (y_1, y_2, y_3, \ldots, y_n)$. We make a pair instance $Z = (X, Y)$ and represent it as a vector using one of the following composition possibilities:

\[ Z = (x_1 y_1, x_2 y_2, x_3 y_3, \ldots, x_n y_n), \tag{1} \]

\[ Z = (x_1, x_2, x_3, \ldots, x_n, y_1, y_2, y_3, \ldots, y_n). \tag{2} \]

We perform normalization for Eq. 1 by dividing by the greatest value of each feature so that all feature values are between 0 and 1. The representation of a pair instance by Eq. 1 leads to lower performance than the representation by Eq. 2. However, implementing Eq. 2 straightforwardly has a drawback: the dimension of the feature space becomes 2n and the computational cost becomes high. We also have found that using Eq. 1 with our new features achieves higher performance than using Eq. 2. We show only results of using Eq. 1 due to lack of room in this paper.

Fig. 1. The overview of the proposed work.

Step 6. Detect similar blocks using a classification algorithm. After feature extraction, the pair instance is represented as a vector using either Eq. 1 or Eq. 2. We feed this data to a classifier. Our labeled data, which contains pair instances, has match and non-match as class labels. We train and test our classification models using fifteen machine learning algorithms. These include a recently published classification algorithm, Xgboost [4], and others such as Extra Trees [15] and Rotation Forest [16].

Xgboost is short for eXtreme Gradient Boosting. It defines an objective function that contains two parts: training loss and regularization [4]. Extra Trees [15] is a tree-based ensemble method for supervised classification and regression problems. Rotation Forest [16] is a classifier ensemble based on feature extraction. It randomly splits the feature set into k subsets, applies Principal Components Analysis (PCA) to each subset, and then trains a decision tree classifier on each subset. Random Forest creates multiple trees for classification. Random Committee [17] generates an ensemble of classifiers as well. In this paper, we train a Random Committee classifier with randomly created subsets, instead of a decision tree as in the Rotation Forest classifier, to produce better accuracy. Other algorithms we use include SVM [18], Linear Discriminant Analysis (LDA) [21], Instance Based Learner (IBK) [22], Lazy K* [23], Decision Trees, Naïve Bayes [19], Multilayer Perceptron (MLP) [20], Bagging [24], and LogitBoost [25].

We compare these classification algorithms in their ability to detect clones and show that supervised machine learning methods using our novel semantic features perform much better than those using just traditional features.
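The two pair compositions from Step 5 (Eqs. 1 and 2), together with the per-feature normalization described for Eq. 1, can be sketched as follows. This is a minimal illustration that assumes non-negative feature counts; the function names product_pair, concat_pair and normalize_columns are hypothetical and the sample numbers are made up.

```python
def product_pair(x, y):
    """Eq. 1: element-wise product of two feature vectors of equal length n."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    return [a * b for a, b in zip(x, y)]


def concat_pair(x, y):
    """Eq. 2: concatenation of the two feature vectors (length 2n)."""
    return list(x) + list(y)


def normalize_columns(rows):
    """Divide each feature by its greatest value so all values lie in [0, 1].

    Assumes non-negative features; a feature whose maximum is 0 stays 0.
    """
    maxima = [max(column) for column in zip(*rows)]
    return [
        [value / top if top else 0.0 for value, top in zip(row, maxima)]
        for row in rows
    ]


# Two made-up method blocks with n = 2 features each.
x, y = [2, 3], [4, 5]
print(product_pair(x, y))  # prints [8, 15]
print(concat_pair(x, y))   # prints [2, 3, 4, 5]

# Normalizing a small collection of product vectors column by column.
print(normalize_columns([[8, 15], [4, 30]]))  # prints [[1.0, 0.5], [0.5, 1.0]]
```

The product form keeps the dimension at n, while concatenation doubles it to 2n, which is the cost trade-off discussed in Step 5.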
VI. EVALUATION

Our primary goal is to improve clone detection accuracy for Type-3 and Type-4 clones using classification algorithms, and to compare with the state of the art. As our target we use BigCloneBench, a benchmark built on IJaDataset 2.0 [13], a large inter-project Java repository containing 25,000 open-source projects from SourceForge and Google Code. We use only a part of the dataset, as described below.

TABLE II
RESULTS USING PAIR INSTANCES VECTORS (EQ. 1)

Type of Clone   Precision   Recall   F-Measure
False           90.5%       86.0%    88.2%

Rotation Forest, Traditional Features (10)
VST3            85.1%       82.7%    83.8%
ST3             70.4%       62.6%    66.3%
MT3             56.8%       48.0%    52.0%
WT3/4           52.9%       79.8%    63.6%
False           85.6%       67.1%    75.2%

Rotation Forest, Syntactic Features (42)
VST3            91.8%       89.1%    90.4%
ST3             78.4%       80.6%    79.5%
MT3             69.5%       64.5%    66.9%
WT3/4           73.5%       82.2%    77.6%
False           90.7%       86.7%    88.6%

Rotation Forest, Syntactic + Semantic Features (70)
VST3            93.1%       93.9%    93.5%
ST3             82.2%       86.7%    84.4%
MT3             82.9%       79.3%    81.1%
WT3/4           91.3%       90.3%    90.8%
False           94.5%       93.6%    94.0%

Random Forest, Traditional Features (10)
VST3            85.3%       82.3%    83.8%
ST3             70.7%       62.0%    66.1%
MT3             56.8%       48.4%    52.3%
WT3/4           52.7%       80.2%    63.6%
False           85.4%       67.1%    75.1%

Random Forest, Syntactic Features (42)
VST3            92.3%       88.6%    90.4%
ST3             78.7%       80.6%    79.6%
MT3             69.8%       64.5%    67.1%
WT3/4           73.3%       82.9%    77.8%
False           90.4%       86.8%    88.6%

Random Forest, Syntactic + Semantic Features (70)
VST3            93.9%       93.1%    93.5%
ST3             82.0%       86.6%    84.3%
MT3             82.9%       79.1%    81.0%
WT3/4           91.3%       91.0%    91.1%
False           93.7%       93.9%    93.8%

Xgboost, Traditional Features (10)
VST3            87.2%       84.4%    85.8%
ST3             59.2%       71.5%    64.6%
MT3             43.4%       53.8%    48.1%
WT3/4           79.3%       50.6%    61.8%
False           66.9%       88.0%    76.0%

Xgboost, Syntactic Features (42)
VST3            86.9%       93.7%    90.2%
ST3             79.4%       78.4%    78.9%
MT3             63.1%       68.6%    65.7%
WT3/4           82.5%       72.6%    77.3%
False           88.3%       88.0%    88.1%

Xgboost, Syntactic + Semantic Features (70)
VST3            93.0%       94.4%    93.9%
ST3             87.5%       85.0%    86.2%
MT3             81.3%       81.7%    81.5%
WT3/4           90.2%       81.7%    85.7%
False           93.9%       93.9%    93.9%

A. Big IJaDataset

This dataset represents a real use case of clone detection.
This benchmark was built by mining IJaDataset for functions. The published version of the benchmark considers 44 target functionalities [14]. We run our programs on a standard workstation with an Intel(R) Core(TM) i CPU, 32 GB of memory and a 500 GB solid state drive.

B. Experimental Setup

For experiments, we consider all types of clones in BigCloneBench that are 6 lines or 50 tokens or greater, which is the standard minimum clone size for benchmarking [10].

Fig. 2. Results of accuracy for each classification algorithm using Pair Instances Vector (Eq. 1).

There is no agreement on when a clone is no longer syntactically similar, and it is also hard to separate the Type-3 and Type-4 clones in the IJaDataset [13]. In IJaDataset, the authors divided Type-3 and Type-4 clones into four classes based on their syntactical similarity [13], as follows: Very Strongly Type-3 (VST3) clones have syntactic similarity in [90%, 100%), Strongly Type-3 (ST3) clones in [70%, 90%), Moderately Type-3 (MT3) clones in [50%, 70%), and Weakly Type-3/Type-4 (WT3/4) clones in [0%, 50%). The IJaDataset computes syntactic similarity using Overlap, Cosine or Jaccard calculations to create candidate clones, and experts inspect the candidate clones before labeling them into the 4 categories.

C. Experiments

We randomly extract sample data of 4,000 pair instances for each of the VST3, ST3, MT3 and WT3/4 clone classes and for false positives from IJaDataset. We extract AST and PDG features from the source code. We extract 10 traditional features, 32 syntactic features and 28 semantic features from each method. Our baseline uses the 10 traditional features. We then train and test each classifier after adding the syntactic and semantic features. Then, we represent a pair instance as a vector by the two composition functions. We build clone detection models to compare the impact of the three sets of features: traditional features, syntactic features, and syntactic with semantic features. We conduct 15 sets of code clone detection experiments. Models of the classifiers are produced and tested using cross-validation with 10 folds, using Weka and R, where we ensure that the ratio between match and non-match classes is the same in each fold and the same as in the overall dataset. Results for only 3 sets of experiments are given in Table II.
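The stratified 10-fold protocol described above, in which every fold keeps the overall ratio of match to non-match pairs, can be illustrated with a short sketch. The experiments themselves were run with Weka and R; the pure-Python function stratified_folds below is a hypothetical stand-in that only shows the idea, and the label counts are made up.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split instance indices into k folds that preserve the class ratio.

    Indices of each class are shuffled and then dealt round-robin to the
    folds, so each fold receives (almost) the same number per class.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Made-up labels: 20 matching pairs and 40 non-matching pairs.
labels = ["match"] * 20 + ["non-match"] * 40
folds = stratified_folds(labels, k=10)

# Each fold serves once as the test set; the other nine form the training set.
for test_fold in folds:
    matches = sum(1 for i in test_fold if labels[i] == "match")
    print(len(test_fold), matches)  # prints "6 2" for every fold
```

Keeping the class ratio fixed per fold prevents a fold from containing, by chance, almost no clone pairs, which would make its precision and recall meaningless.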
Table II shows the precision, recall, and F-Measure of the code clone detection experiments, using pair instance vectors built by Eq. 1. The highest precision, recall and F-Measure among the three sets of features are shown in bold.

Figure 2 shows the comparison for all fifteen classifiers using product feature vectors. The Rotation Forest, Random Forest and Xgboost classifiers produce the best results among the classifiers. This figure also shows that each classifier's performance improves substantially as syntactic and semantic features are added. On average over all classification algorithms, using product composition, accuracy improves by 10.8% when syntactic features are added and by another 19.2% when semantic features are added. This supports our hypothesis that adding more complex features beyond those used traditionally is extremely helpful in code clone detection.

Figure 3 (footnote 2) shows a comparison of our results with the state-of-the-art detectors based on recall. Complete results for comparison are available for recall only. Our approach ranks first as a detector for MT3 and WT3/4 clones. NiCad is the best among the clone detectors for VST3 and ST3 clones, and our approach ranks second.

Results for precision and F-Measure for existing detectors are incomplete, since several prior papers do not report them properly. CCFinder has been found to be 60%-72% in precision and 0%-67% in F-Measure [13]. The precision and F-Measure of NiCad are between 80%-96% and 0%-98%, respectively. SourcererCC has a precision of 91% and an F-Measure between 0%-92%, as reported [1]. The highest precision of Deckard is 93%, and its F-Measure is between 2%-74%. Our method using Eq. 1 has a precision between 81%-93% and an F-Measure between 82%-94%, and our method using Eq. 2 has a very strong precision between 89%-96% and an F-Measure between 90%-95%.
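The F-Measure ranges quoted above follow from the standard definition F = 2PR / (P + R), the harmonic mean of precision P and recall R. As a worked check, the sketch below recomputes the range for CCFinder on VST3 clones, taking its reported precision of 60%-72% and the VST3 recall of 62% listed in Table III; the function name f_measure is our own.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CCFinder, VST3 clones: recall 62%, reported precision between 60% and 72%.
low = f_measure(0.60, 0.62)
high = f_measure(0.72, 0.62)
print(f"{low:.0%} to {high:.0%}")  # prints "61% to 67%"
```

Because only a precision range is published for some detectors, their F-Measure can likewise only be stated as a range.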
Precision and F-Measure results for various detectors are given in Table III. We conclude that our novel features improve both syntactic and semantic clone detection substantially. Table III clearly establishes our approach as being able to detect all classes of clones and to consistently produce the best results. In this table, precision values have been taken from published sources. Prior authors have not reported precision consistently, as we see in the table. We have computed the F-Measure ourselves as best we can. We present the best results in bold.

Footnote 2: Results of existing detectors are obtained from Sajnani et al. [10]. Note: the existing detectors use approximately between 1 million and 100 million lines of code, and our methodology uses approximately 95,519,780 lines of code (40,000 method blocks).

TABLE III
BIGCLONEBENCH RECALL, PRECISION AND F-MEASURE MEASUREMENTS. EXISTING DETECTORS' RESULTS ARE OBTAINED FROM SAJNANI ET AL. [13].

Tool                                 Type of Clone   Recall   Precision     F-Measure
SourcererCC                          VST3            93%      91%*          92%
SourcererCC                          ST3             61%      91%*          73%
SourcererCC                          MT3             5%       91%*          9%
CCFinder                             VST3            62%      60%-72%*      61%-67%
CCFinder                             ST3             15%      60%-72%*      24%-25%
CCFinder                             MT3             1%       60%-72%*      2%
Deckard                              VST3            62%      93%*          74%
Deckard                              ST3             31%      93%*          47%
Deckard                              MT3             12%      93%*          21%
Deckard                              WT3/4           1%       93%*          2%
iClones                              VST3            82%      Unreported    Unreported
iClones                              ST3             24%      Unreported    Unreported
iClones                              MT3             0%       Unreported    Unreported
iClones                              WT3/4           0%       Unreported    Unreported
NiCad                                VST3            100%     80%-96%*      89%-98%
NiCad                                ST3             95%      80%-96%*      87%-95%
NiCad                                MT3             1%       80%-96%*      2%
Our Method Using Eq. 1 (Xgboost)**   VST3            94%      93%           94%
Our Method Using Eq. 1 (Xgboost)**   ST3             85%      88%           86%
Our Method Using Eq. 1 (Xgboost)**   MT3             82%      81%           82%
Our Method Using Eq. 1 (Xgboost)**   WT3/4           82%      90%           86%
Our Method Using Eq. 2 (Xgboost)**   VST3            97%      96%           95%
Our Method Using Eq. 2 (Xgboost)**   ST3             90%      93%           91%
Our Method Using Eq. 2 (Xgboost)**   MT3             91%      89%           90%
Our Method Using Eq. 2 (Xgboost)**   WT3/4           94%      96%           95%

* As reported; a single precision figure or range is given per tool.
** With all features.

Fig. 3. BigCloneBench Recall Measurements. Existing detectors' results are obtained from Sajnani et al. [13].

VII. CONCLUSION

This paper proposes an efficient metrics-based approach for clone detection, which is able to detect all of the types of clones. The novel aspect of our proposed method is the extraction of features from ASTs and PDGs. We learn clone detection models using a number of classification algorithms. Our experiments demonstrate that using pairwise instances can improve detection of source code clones, and indicate that using our new features achieves higher performance for all classifiers we tested. In this paper, we demonstrate the following. We extract new syntactic and semantic features and represent pairs of method blocks as a vector using the product or the list composition given by Eqs. 1 and 2, and show that syntactic and semantic features with Eq. 1 significantly improve the accuracy of all the classifiers, by 19.2% on average. We conclude that Xgboost is an excellent classification algorithm for detection of Type-3 and Type-4 clones.
This is our first step in using machine learning for code clone detection. In the future, we plan to extend our work to further improve detection of Type-3 and Type-4 clones using unsupervised and supervised learning algorithms.

REFERENCES

[1] Roy, Chanchal K., and James R. Cordy. NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. Program Comprehension, ICPC. The 16th IEEE International Conf. on. IEEE.
[2] Baxter, Ira D., et al. Clone detection using abstract syntax trees. Software Maintenance, Proc., Int. Conf. on. IEEE.
[3] Rattan, D., R. Bhatia, and M. Singh. Software clone detection: A systematic review. Information and Soft. Tech. (2013).
[4] Chen, T., and C. Guestrin. Xgboost: A scalable tree boosting system. arXiv preprint (2016).
[5] Li, Z., et al. CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Trans. on Soft. Eng. (2006).
[6] Koschke, R., R. Falke, and P. Frenzel. Clone detection using abstract syntax suffix trees. Working Conf. on Reverse Engin. IEEE.
[7] Higo, Y., et al. Incremental code clone detection: A PDG-based approach. 18th Working Conf. on Reverse Eng. IEEE.
[8] Krutz, Daniel E., and E. Shihab. CCCD: Concolic code clone detection. WCRE.
[9] Sheneamer, A., and J. Kalita. Code clone detection using coarse and fine-grained hybrid approaches. IEEE 7th Int. Conf. on Intelligent Comp. and Info. Sys. (ICICIS). IEEE.
[10] Saini, V., et al. SourcererCC and SourcererCC-I: tools to detect clones in batch mode and during software development. Proc. of the 38th Int. Conf. on Soft. Eng. Companion. ACM.
[11] Komondoor, R., and S. Horwitz. Using slicing to identify duplication in source code. Int. Static Analysis Symposium. Springer Berlin Heidelberg.
[12] Kodhai, E., et al. Detection of type-1 and type-2 code clones using textual analysis and metrics. Recent Trends in Information, Telec. and Comp. (ITC), 2010 Int. Conf. on. IEEE.
[13] Ambient software evolution group: IJaDataset 2.0, January 2013.
[14] Keivanloo, I., F. Zhang, and Y. Zou. Threshold-free code clone detection for a large-scale heterogeneous Java repository. IEEE 22nd Int. Conf. on Soft. Analysis, Evol., and Reengin. (SANER). IEEE.
[15] Geurts, P., D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning 63.1 (2006).
[16] Rodriguez, J. Jos, Ludmila I. Kuncheva, and Carlos J. Alonso. Rotation forest: A new classifier ensemble method. IEEE Trans. on Pattern Analysis and Machine Intelligence (2006).
[17] Witten, Ian H., and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.
[18] Fan, R., et al. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9.Aug (2008).
[19] John, George H., and Pat Langley. Estimating continuous distributions in Bayesian classifiers. Proc. of the 11th Conf. on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.
[20] Ruck, Dennis W., et al. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Trans. on Neural Nets. 1.4 (1990).
[21] Morrison, Donald F. Multivariate statistical methods. 3. New York, NY. Mc (1990).
[22] Aha, D., Kibler H. Instance-based learning algorithms. Machine Learning (1990) 6:37-66.
[23] Cleary, John G., and Leonard E. Trigg. K*: An instance-based learner using an entropic distance measure. Proc. of the 12th Int. Conf. on Machine Learning.
[24] Breiman, L. Bagging predictors. Machine Learning 24.2 (1996).
[25] Friedman, J., T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics 28.2 (2000).
[26] Weiser, M. Program slicing. Proc. of the 5th Int. Conf. on Soft. Eng. IEEE Press.
[27] Wang, X., et al. Can I clone this piece of code here? Proc. of the 27th IEEE/ACM Int. Conf. on Automated Soft. Eng., 2012.
