Semantic Clone Detection Using Machine Learning


Abdullah Sheneamer
University of Colorado, Colorado Springs, CO, USA

Jugal Kalita
University of Colorado, Colorado Springs, CO, USA

Abstract

If two fragments of source code are identical or similar to each other, they are called code clones. Code clones introduce difficulties in software maintenance and cause bug propagation. In this paper, we present a machine learning framework to automatically detect clones in software. The framework is able to detect Type-3 clones and the most complicated kind of clones, Type-4 clones. Previously used traditional features are often weak in detecting semantic clones. The novel aspects of our approach are the extraction of features from abstract syntax trees (ASTs) and program dependency graphs (PDGs), the representation of a pair of code fragments as a vector, and the use of classification algorithms. The key benefit of this approach is that our tool can find both syntactic and semantic clones. Our evaluation indicates that using our new AST and PDG features is a viable methodology, since they improve clone detection on IJaDataset 2.0.

Index Terms: Code clones, Software clones, Classifier algorithms, Abstract syntax trees (AST), Program dependence graphs (PDG).

I. INTRODUCTION

If two fragments of source code are identical or similar to each other, they are called code clones. Code clones introduce difficulties in software maintenance and lead to bug propagation. This paper discusses a novel way to detect semantic clones. Programs have well-defined syntax, which can be represented by Abstract Syntax Trees (ASTs). ASTs have been successfully used to detect code clones [2]. In addition, we hypothesize that discriminative semantic features based on Program Dependency Graphs (PDGs) can be used with machine learning to detect semantic clones. In this paper, we extract new features from ASTs and PDGs for code fragments to improve the accuracy of code clone detection. The advantage of PDG-based detection is that it can detect non-contiguous code clones, which other detection techniques handle poorly [3].

The contributions of this paper are as follows. We use novel features of ASTs to detect syntactic clones and features of PDGs to detect semantic clones. To the best of our knowledge, we are the first to use such features. We represent a pair of method blocks as a vector in several ways to improve detection of source code clones. We use the extracted features to learn a model that detects syntactic and semantic clones using a large number of algorithms. We show that using our new features achieves higher performance for all classifiers we use.

The rest of the paper is organized as follows. Section II discusses background material. Related work is introduced in Section III. In Section IV, we describe our feature extraction approach. Our work is discussed in Section V. Evaluation of our work is discussed in Section VI. Finally, the paper is concluded in Section VII.

II. BACKGROUND

A. Basic Definitions

Here, we provide definitions which we use throughout our paper.

Definition 1: Code Fragment. A code fragment (CF) is a part of the source code needed to run a program. It can contain functions, begin-end blocks or a sequence of statements.

Definition 2: Clone Pair. If a code fragment CF1 is similar to another code fragment CF2 syntactically or semantically, one is called a clone of the other.

B. Types of Clones

There are four types of clone relations between two code fragments, based on the nature of the similarity in their text or meaning [10].

Type-1 (Exact clones): Two code fragments are exact copies of each other except for whitespace, blank lines and comments.

Type-2 (Renamed): Two code fragments are similar except for names of variables, types, literals and functions.
Type-3 (Gapped clones): Two copied code fragments are similar, but with modifications such as added or removed statements, and the use of different identifiers, literals, types, whitespace, layout and comments.

Type-4 (Semantic clones): Two code fragments are semantically similar, without being syntactically similar.

III. RELATED WORK

Most clone detection techniques using text-based [1], token-based [5], tree-based [6], or PDG-based [7] approaches are good at detecting Type-1 and Type-2 clones, but they miss some Type-3 clones and most Type-4 clones.

Komondoor and Horwitz [11] use program slicing [26] to find isomorphic PDG subgraphs and code clones. The nodes in a PDG represent statements and predicates, and the edges represent data and control dependencies. Higo and Kusumoto [7] propose a PDG-based incremental two-way slicing approach to detect clones, called Scorpio. Krutz and Shihab [8] propose a tool called Concolic Code Clone Detection (CCCD). CCCD combines concrete and symbolic values in order to traverse all possible paths of an application. It is able to detect the most complicated kind of clones, Type-4 clones. Wang et al. [27] propose an approach that assists in understanding the harmfulness of intended cloning operations using Bayesian Networks and a set of features such as history, code, and destination features. Sheneamer and Kalita [9] present a hybrid clone detection technique that first uses a coarse-grained technique to improve precision and then a fine-grained technique to improve recall. Their approach detects Type-1, Type-2 and Type-3 clones. Sajnani et al. [10] introduce SourcererCC for accurate token-based near-miss clone detection. They use an optimized partial index of tokens and filtering heuristics to achieve large-scale clone detection. However, their tool cannot detect semantic clones.

IV. FEATURE EXTRACTION

In metric-based clone detection techniques, a number of metrics or features are computed for each fragment of code, and similar fragments are found by comparing metric vectors instead of comparing code or ASTs directly. The 10 traditional software metrics used as features by different authors [12] are: 1) number of lines of code, 2) number of assignments, 3) number of selection statements, 4) number of iteration statements, 5) number of switch or case statements, 6) number of returns, 7) number of try statements, 8) number of variable declaration statements, 9) number of expression statements, and 10) number of type parameters. All of these features are computed and their values are stored in a database for all methods in a dataset of programs [12]. Pairs of similar methods are detected by comparing the metric values using machine learning algorithms.

Our approach extracts syntactic and semantic features of code fragments from the ASTs and PDGs of their methods.
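The idea of reducing a method to a vector of simple counts can be sketched as follows. This is only an illustration under stated assumptions: the function name `traditional_metrics`, the regular expressions, and the sample method are hypothetical, and only a handful of the ten metrics are approximated. A real extractor would walk a parsed AST rather than match keywords in text, so keywords inside comments or string literals would be miscounted here.

```python
import re

# Hypothetical keyword patterns approximating a few of the traditional metrics.
# These are text-level approximations, not the AST-based counts used in the paper.
PATTERNS = {
    "selection": re.compile(r"\bif\b"),
    "iteration": re.compile(r"\b(?:for|while)\b"),
    "switch_case": re.compile(r"\b(?:switch|case)\b"),
    "returns": re.compile(r"\breturn\b"),
    "try": re.compile(r"\btry\b"),
}


def traditional_metrics(method_source: str) -> dict:
    """Return a small metric dictionary for one Java method given as text."""
    non_blank = [line for line in method_source.splitlines() if line.strip()]
    metrics = {"lines_of_code": len(non_blank)}
    for name, pattern in PATTERNS.items():
        metrics[name] = len(pattern.findall(method_source))
    return metrics


# Illustrative input only; not taken from the dataset.
EXAMPLE = """int sum(int[] a) {
    int s = 0;
    for (int i = 0; i < a.length; i++) {
        if (a[i] > 0) s += a[i];
    }
    return s;
}"""

if __name__ == "__main__":
    print(traditional_metrics(EXAMPLE))
```

Two methods with close metric dictionaries become candidates for a clone pair; the classifier, not a fixed threshold, decides the final label.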
A pair of code fragments is represented as a feature vector, and a large collection of such pair vectors is used to train supervised learning classifiers to identify clone types. Some of the new features we extract from ASTs and PDGs are given in Table I.

A. AST and PDG

A Program Dependency Graph (PDG) represents control and data dependencies in a program. The nodes of a PDG represent the statements and conditions in a program. Control dependencies represent flow of control information. Data dependencies represent data flow information. A PDG is generated from the source code by semantics-aware techniques [14]. PDGs capture the semantic information contained in a program. Therefore, we hypothesize that using features from PDGs will be beneficial in detecting semantic clones.

An Abstract Syntax Tree (AST) is a tree representation of the abstract syntactic structure of the source code. Each node of the tree represents a construct occurring in the source code. All of the source code is parsed and converted into an abstract syntax tree or parse tree for the extraction of features.

TABLE I. SOME AST AND PDG FEATURES

| AST/PDG | Feature | Description |
|---|---|---|
| AST | No. of Constructors | Counts this ( [ Expression, Expression ] ) ; in method blocks. |
| AST | No. of Field Accesses | Counts Expression . Identifier in method blocks. |
| AST | No. of Super Constructor Invocations | Counts [ Expression . ] super ( [ Expression, Expression ] ) ; in method blocks. |
| AST | No. of Super Method Invocations | Counts [ ClassName . ] super . Identifier ( [ Expression, Expression ] ) in method blocks. |
| PDG | Decl Assign | DA if Assignment comes after Declaration, else NA. |
| PDG | Control Decl | DC if Declaration comes after Control (e.g. i < count, for, while, if, switch, etc.), else NA. |
| PDG | Control Assign | CA if Assignment comes after Control, else NA. |
| PDG | Expr Dec | ED if Declaration comes after Expression, else NA. |

V. OUR WORK

We use similarity metrics to help identify Type-4 clones automatically. The proposed method consists of the following steps.
Figure 1 illustrates the workflow of our approach.

Step 1. Perform lexical analysis and normalization. We transform and normalize all source files into special token sequences to detect not only identical clones but also similar ones.

Step 2. Detect method blocks. This step needs lexical analysis and syntactic analysis to detect every block in the given source files.

Step 3. Build pairwise method blocks. This step pairs each method block with another method block. We compare all blocks in pairs and use a classification algorithm to judge whether or not the two method blocks are clones.

Step 4. Extract features for two method blocks. In this step, features are extracted from each code fragment using the Java Development Tools (JDT). These features improve the accuracy of Type-3 and Type-4 clone detection.

Step 5. Represent pair instances as one vector. In this step, we create pairs of instances from the original data. Two original instances are represented by feature vectors $X = (x_1, x_2, x_3, \ldots, x_n)$ and $Y = (y_1, y_2, y_3, \ldots, y_n)$. We make a pair instance $Z = (X, Y)$ and represent it as a vector using one of the following composition possibilities:

$Z = (x_1 y_1, x_2 y_2, x_3 y_3, \ldots, x_n y_n)$, (1)

$Z = (x_1, x_2, x_3, \ldots, x_n, y_1, y_2, y_3, \ldots, y_n)$. (2)

We normalize Eq. 1 by dividing by the greatest value of each feature so that all feature values are between 0 and 1. The representation of a pair instance by Eq. 1 leads to lower performance than the representation of a pair instance by Eq. 2. However, implementing Eq. 2 straightforwardly causes the following problem: the dimension of the feature space becomes 2n and the computational cost becomes expensive. We also have found that using Eq. 1 with our new features achieves higher performance than using Eq. 2. We show only the results of using Eq. 1 due to lack of room in this paper.
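The two pair compositions and the max-based normalization from Step 5 can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, feature values are assumed to be non-negative counts, and a feature whose maximum is zero is mapped to 0.0 by assumption.

```python
from typing import List, Sequence


def compose_product(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Eq. 1: element-wise product of the two feature vectors (length n)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    return [a * b for a, b in zip(x, y)]


def compose_concat(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Eq. 2: concatenation of the two feature vectors (length 2n)."""
    return list(x) + list(y)


def normalize_by_max(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Divide every feature (column) by its largest value across all pairs.

    Assumes non-negative values; an all-zero column stays 0.0 (an assumption).
    """
    n = len(rows[0])
    maxima = [max(row[j] for row in rows) for j in range(n)]
    return [
        [row[j] / maxima[j] if maxima[j] else 0.0 for j in range(n)]
        for row in rows
    ]


if __name__ == "__main__":
    x, y = [2, 0, 3], [4, 5, 1]
    print(compose_product(x, y))   # length n
    print(compose_concat(x, y))    # length 2n
    print(normalize_by_max([[8, 0, 3], [4, 0, 6]]))
```

The product form keeps the dimension at n, which is the cost advantage the text attributes to Eq. 1, while the concatenated form doubles it.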

Fig. 1. Overview of the proposed work.

Step 6. Detect similar blocks using a classification algorithm. After feature extraction, the pair instance is represented as a vector using either Eq. 1 or Eq. 2. We feed this data to a classifier. Our labeled data, which contains pair instances, has match and non-match as class labels. We train and test our classification models using fifteen machine learning algorithms. These include a recently published classification algorithm, Xgboost [4], and others such as Extra Trees [15] and Rotation Forest [16]. Xgboost is short for eXtreme Gradient Boosting. Gradient boosting defines an objective function that contains two parts: training loss and regularization [4]. Extra Trees [15] is a tree-based ensemble method for supervised classification and regression problems. Rotation Forest [16] is a classifier ensemble based on feature extraction. It applies Principal Components Analysis (PCA) to each subset of a feature set, which is randomly split into k subsets, and then trains a decision tree classifier on each subset. Random Forest creates multiple trees for classification. Random Committee [17] generates an ensemble of classifiers as well. In this paper, we train a Random Committee classifier with randomly created subsets instead of a decision tree as in the Rotation Forest classifier, to produce better accuracy. Other algorithms we use include SVM [18], Linear Discriminant Analysis (LDA) [21], Instance Based Learner (IBK) [22], Lazy K* [23], Decision Trees, Naïve Bayes [19], Multilayer Perceptron (MLP) [20], Bagging [24], and LogitBoost [25]. We compare these classification algorithms in their ability to detect clones and show that supervised machine learning methods using our novel semantic features perform much better than those using just traditional features.

VI. EVALUATION

Our primary goal is to improve clone detection accuracy for Type-3 and Type-4 clones using classification algorithms and to compare with the state of the art. As our target we use BigCloneBench, a benchmark built on IJaDataset 2.0 [13], a large inter-project Java repository containing 25,000 open-source projects from SourceForge and Google Code. We use only a part of the dataset, as described below.

TABLE II. RESULTS USING PAIR INSTANCE VECTORS (EQ. 1)

| Algorithm | Features | Type of Clone | Precision | Recall | F-Measure |
|---|---|---|---|---|---|
| | | False | 90.5% | 86.0% | 88.2% |
| Rotation Forest | Traditional Features (10) | VST3 | 85.1% | 82.7% | 83.8% |
| | | ST3 | 70.4% | 62.6% | 66.3% |
| | | MT3 | 56.8% | 48.0% | 52.0% |
| | | WT3/4 | 52.9% | 79.8% | 63.6% |
| | | False | 85.6% | 67.1% | 75.2% |
| | Syntactic Features (42) | VST3 | 91.8% | 89.1% | 90.4% |
| | | ST3 | 78.4% | 80.6% | 79.5% |
| | | MT3 | 69.5% | 64.5% | 66.9% |
| | | WT3/4 | 73.5% | 82.2% | 77.6% |
| | | False | 90.7% | 86.7% | 88.6% |
| | Syntactic + Semantic Features (70) | VST3 | 93.1% | 93.9% | 93.5% |
| | | ST3 | 82.2% | 86.7% | 84.4% |
| | | MT3 | 82.9% | 79.3% | 81.1% |
| | | WT3/4 | 91.3% | 90.3% | 90.8% |
| | | False | 94.5% | 93.6% | 94.0% |
| Random Forest | Traditional Features (10) | VST3 | 85.3% | 82.3% | 83.8% |
| | | ST3 | 70.7% | 62.0% | 66.1% |
| | | MT3 | 56.8% | 48.4% | 52.3% |
| | | WT3/4 | 52.7% | 80.2% | 63.6% |
| | | False | 85.4% | 67.1% | 75.1% |
| | Syntactic Features (42) | VST3 | 92.3% | 88.6% | 90.4% |
| | | ST3 | 78.7% | 80.6% | 79.6% |
| | | MT3 | 69.8% | 64.5% | 67.1% |
| | | WT3/4 | 73.3% | 82.9% | 77.8% |
| | | False | 90.4% | 86.8% | 88.6% |
| | Syntactic + Semantic Features (70) | VST3 | 93.9% | 93.1% | 93.5% |
| | | ST3 | 82.0% | 86.6% | 84.3% |
| | | MT3 | 82.9% | 79.1% | 81.0% |
| | | WT3/4 | 91.3% | 91.0% | 91.1% |
| | | False | 93.7% | 93.9% | 93.8% |
| Xgboost | Traditional Features (10) | VST3 | 87.2% | 84.4% | 85.8% |
| | | ST3 | 59.2% | 71.5% | 64.6% |
| | | MT3 | 43.4% | 53.8% | 48.1% |
| | | WT3/4 | 79.3% | 50.6% | 61.8% |
| | | False | 66.9% | 88.0% | 76.0% |
| | Syntactic Features (42) | VST3 | 86.9% | 93.7% | 90.2% |
| | | ST3 | 79.4% | 78.4% | 78.9% |
| | | MT3 | 63.1% | 68.6% | 65.7% |
| | | WT3/4 | 82.5% | 72.6% | 77.3% |
| | | False | 88.3% | 88.0% | 88.1% |
| | Syntactic + Semantic Features (70) | VST3 | 93.0% | 94.4% | 93.9% |
| | | ST3 | 87.5% | 85.0% | 86.2% |
| | | MT3 | 81.3% | 81.7% | 81.5% |
| | | WT3/4 | 90.2% | 81.7% | 85.7% |
| | | False | 93.9% | 93.9% | 93.9% |

A. Big IJaDataset

This dataset represents a real use case of clone detection.
This benchmark was built by mining IJaDataset for functions. The published version of the benchmark considers 44 target functionalities [14]. We run our programs on a standard workstation with Intel(R) Core(TM) i CPU, 32 GB of memory and a 500 GB solid state drive.

B. Experimental Setup

For our experiments, we consider all clones in BigCloneBench that are 6 lines or 50 tokens or longer, which is the standard minimum clone size for benchmarking [10]. There is no agreement on when a clone is no longer syntactically similar, and it is also hard to separate the Type-3 and Type-4 clones in the IJaDataset [13]. In IJaDataset, the authors divided Type-3 and Type-4 clones into four classes based on their syntactic similarity [13], as follows: Very Strongly Type-3 (VST3) clones, which have syntactic similarity in [90%, 100%); Strongly Type-3 (ST3), with similarity in [70%, 90%); Moderately Type-3 (MT3), with similarity in [50%, 70%); and Weakly Type-3/Type-4 (WT3/4), with similarity in [0%, 50%). The IJaDataset computes syntactic similarity using Overlap, Cosine or Jaccard calculations to create candidate clones, and experts inspect the candidate clones before labeling them into the four categories.

Fig. 2. Accuracy of each classification algorithm using pair instance vectors (Eq. 1).

C. Experiments

We randomly extract sample data of 4,000 pair instances for each of the VST3, ST3, MT3 and WT3/4 clone classes and for false positives from IJaDataset. We extract AST and PDG features from the source code. We extract 10 traditional features, 32 syntactic features and 28 semantic features from each method. Our baseline uses the 10 traditional features. We train and test each classifier while adding the syntactic and semantic features. Then, we represent a pair instance as a vector using the two composition functions. We build clone detection models to compare the impact of the three sets of features: traditional features, syntactic features, and syntactic with semantic features. We conduct 15 sets of code clone detection experiments. Models of the classifiers are produced and tested using 10-fold cross-validation in Weka and R, where we ensure that the ratio between match and non-match classes is the same in each fold and the same as in the overall dataset. Results for only 3 sets of experiments are given in Table II.
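As a small illustration of the four similarity classes described in the experimental setup, the mapping from a syntactic similarity score to a class label might look like the sketch below. The function name `clone_class` is hypothetical. The intervals are assumed to be half-open, so a similarity of exactly 50% is assigned to MT3, and a similarity of 100% is treated as outside these four classes; both choices are assumptions made for this example.

```python
def clone_class(similarity: float) -> str:
    """Map a syntactic similarity in [0.0, 1.0) to a BigCloneBench-style class.

    Assumed half-open intervals:
      VST3  [0.90, 1.00)
      ST3   [0.70, 0.90)
      MT3   [0.50, 0.70)
      WT3/4 [0.00, 0.50)
    """
    if not 0.0 <= similarity < 1.0:
        raise ValueError("similarity must be in [0.0, 1.0)")
    if similarity >= 0.90:
        return "VST3"
    if similarity >= 0.70:
        return "ST3"
    if similarity >= 0.50:
        return "MT3"
    return "WT3/4"


if __name__ == "__main__":
    for s in (0.95, 0.75, 0.50, 0.10):
        print(s, clone_class(s))
```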
Table II shows the precision, recall, and F-Measure of the code clone detection experiments using pair instance vectors composed by Eq. 1. The highest precision, recall and F-Measure of the three sets of features are shown in bold. Figure 2 shows the comparison for all fifteen classifiers using product feature vectors. The Rotation Forest, Random Forest and Xgboost classifiers produce the best results among the classifiers. This figure also shows that each classifier's performance improves substantially as syntactic and semantic features are added. On average, over all classification algorithms and using product composition, accuracy improves by 10.8% when syntactic features are added and by another 19.2% when semantic features are added. This supports our hypothesis that adding more complex features beyond those used traditionally is extremely helpful in code clone detection.

Figure 3 shows a comparison of our results with state-of-the-art detectors based on recall (see footnote 2). Complete results for comparison are available for recall only. Our approach ranks first as a detector for MT3 and WT3/4 clones. NiCad is the best among the clone detectors for VST3 and ST3 clones, and our approach ranks second. Results for precision and F-Measure for existing detectors are incomplete since several prior papers do not report them properly. CCFinder has been reported to have a precision of 60%-72% and an F-Measure of 0%-67% [13]. The precision and F-Measure of NiCad are between 80%-96% and 0%-98%, respectively. SourcererCC has a precision of 91% and an F-Measure between 0%-92%, as reported [1]. The highest precision of Deckard is 93% and its F-Measure is between 2%-74%. Our method using Eq. 1 has a precision between 81%-93% and an F-Measure between 82%-94%, and our method using Eq. 2 has a very strong precision between 89%-96% and an F-Measure between 90%-95%.
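The F-Measure values quoted for existing detectors can be checked from precision and recall. The sketch below assumes the F-Measure is the usual balanced harmonic mean of precision and recall (F1); the paper does not spell out the formula, so this is an assumption. The function name is hypothetical, and the inputs are the 91% precision and 93% VST3 recall listed for SourcererCC.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-Measure (F1): harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # SourcererCC, VST3: precision 0.91 (as reported), recall 0.93
    print(round(f_measure(0.91, 0.93), 2))
```

With these inputs the value rounds to 0.92, which agrees with the 92% upper end of the F-Measure range quoted for SourcererCC.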
Precision and F-Measure results for various detectors are given in Table III. We conclude that our novel features improve both syntactic and semantic clone detection substantially. Table III shows that our approach is able to detect all classes of clones and consistently produces strong results. In this table, precision values have been taken from published sources. Prior authors have not reported precision consistently, as we see in the table. We have computed the F-Measure ourselves as best we can. We present the best results in bold.

Footnote 2: Results of existing detectors are obtained from Sajnani et al. [10]. Note: the existing detectors use approximately between 1 million and 100 million lines of code, and our methodology uses approximately 95,519,780 lines of code (40,000 method blocks).

TABLE III. BIGCLONEBENCH RECALL, PRECISION AND F-MEASURE MEASUREMENTS. EXISTING DETECTORS' RESULTS ARE OBTAINED FROM SAJNANI ET AL. [10].

| Tool | Type of Clone | Recall | Precision | F-Measure |
|---|---|---|---|---|
| SourcererCC | VST3 | 93% | 91% (as reported) | 92% |
| | ST3 | 61% | 91% (as reported) | 73% |
| | MT3 | 5% | 91% (as reported) | 9% |
| CCFinder | VST3 | 62% | 60%-72% (as reported) | 61%-67% |
| | ST3 | 15% | 60%-72% (as reported) | 24%-25% |
| | MT3 | 1% | 60%-72% (as reported) | 2% |
| Deckard | VST3 | 62% | 93% (as reported) | 74% |
| | ST3 | 31% | 93% (as reported) | 47% |
| | MT3 | 12% | 93% (as reported) | 21% |
| | WT3/4 | 1% | 93% (as reported) | 2% |
| iClones | VST3 | 82% | (Unreported) | (Unreported) |
| | ST3 | 24% | (Unreported) | (Unreported) |
| | MT3 | 0% | (Unreported) | (Unreported) |
| | WT3/4 | 0% | (Unreported) | (Unreported) |
| NiCad | VST3 | 100% | 80%-96% (as reported) | 89%-98% |
| | ST3 | 95% | 80%-96% (as reported) | 87%-95% |
| | MT3 | 1% | 80%-96% (as reported) | 2% |
| Our Method Using Eq. 1 (Xgboost Algorithm) with all features | VST3 | 94% | 93% | 94% |
| | ST3 | 85.0% | 88% | 86% |
| | MT3 | 82% | 81% | 82% |
| | WT3/4 | 82% | 90% | 86% |
| Our Method Using Eq. 2 (Xgboost Algorithm) with all features | VST3 | 97% | 96% | 95% |
| | ST3 | 90% | 93% | 91% |
| | MT3 | 91% | 89% | 90% |
| | WT3/4 | 94% | 96% | 95% |

Fig. 3. BigCloneBench recall measurements. Existing detectors' results are obtained from Sajnani et al. [10].

VII. CONCLUSION

This paper proposes an efficient metrics-based approach for clone detection, which is able to detect all types of clones. The novel aspect of our proposed method is the extraction of features from ASTs and PDGs. We learn clone detection models using a number of classification algorithms. Our experiments demonstrate that using pairwise instances can improve detection of source code clones, and indicate that using our new features achieves higher performance for all classifiers we tested. In this paper, we demonstrate the following. We extract new syntactic and semantic features and represent pairs of method blocks as a vector using either the product or the list composition given by Eqs. 1 and 2. We show that syntactic and semantic features composed by Eq. 1 significantly improve the average accuracy of all the classifiers by 19.2%. We conclude that Xgboost is an excellent classification algorithm for the detection of Type-3 and Type-4 clones.
This is our first step in using machine learning for code clone detection. In the future, we plan to extend our work to further improve the detection of Type-3 and Type-4 clones using unsupervised and supervised learning algorithms.

REFERENCES

[1] Roy, Chanchal K., and James R. Cordy. NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. Program Comprehension, ICPC. The 16th IEEE International Conf. on. IEEE,
[2] Baxter, Ira D., et al. Clone detection using abstract syntax trees. Software Maintenance, Proc., Int.l Conf. on. IEEE,
[3] Rattan, D., R. Bhatia, and M. Singh. Software clone detection: A systematic review. Information and Soft. Tech (2013):
[4] Chen, T., and C. Guestrin. Xgboost: A scalable tree boosting system. arxiv preprint arxiv: (2016).
[5] Li, Z., et al. CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Trans. on soft. Eng (2006):
[6] Koschke, R., R. Falke, and P. Frenzel. Clone detection using abstract syntax suffix trees. Working Conf. on Reverse Engin. IEEE,
[7] Higo, Y., et al. Incremental code clone detection: A pdg-based approach. 18th Working Conf. on Reverse Eng. IEEE,
[8] Krutz, Daniel E., and E. Shihab. CCCD: Concolic code clone detection. WCRE
[9] Sheneamer, A., and J. Kalita. Code clone detection using coarse and fine-grained hybrid approaches. IEEE 7th Int. Conf. on Intelligent Comp. and Info. Sys. (ICICIS). IEEE,
[10] Saini, V., et al. SourcererCC and SourcererCC-I: tools to detect clones in batch mode and during software development. Proc. of the 38th Int. Conf. on Soft. Eng. Companion. ACM,
[11] Komondoor, R., and S. Horwitz. Using slicing to identify duplication in source code. Int. Static Analysis Symposium. Springer Berlin Heidelberg,
[12] Kodhai, E., et al. Detection of type-1 and type-2 code clones using textual analysis and metrics. Recent Trends in Information, Telec. and Comp. (ITC), 2010 Int. Conf. on. IEEE,
[13] Ambient software evolution group: IJaDataset January 2013
[14] Keivanloo, I., F. Zhang, and Y. Zou. Threshold-free code clone detection for a large-scale heterogeneous Java repository. IEEE 22nd Int. Conf. on Soft. Analysis, Evol., and Reengin. (SANER). IEEE,
[15] Geurts, P., D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine learning 63.1 (2006):
[16] Rodriguez, Juan José, Ludmila I. Kuncheva, and Carlos J. Alonso. Rotation forest: A new classifier ensemble method. IEEE trans. on pattern analysis and machine intelligence (2006):
[17] Witten, Ian H., and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann,
[18] Fan, R., et al. LIBLINEAR: A library for large linear classification. Journal of machine learning research 9.Aug (2008):
[19] John, George H., and Pat Langley. Estimating continuous distributions in Bayesian classifiers. Proc. of the 11th conf. on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc.,
[20] Ruck, Dennis W., et al. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Trans. on Neural Nets. 1.4 (1990):
[21] Morrison, Donald F. Multivariate statistical methods. 3. New York, NY. Mc (1990).
[22] Aha, D., Kibler H. Instance-based learning algorithms. Machine learning (1990) 6:37-66
[23] Cleary, John G., and Leonard E. Trigg. K*: An instance-based learner using an entropic distance measure. Proc. of the 12th Int. Conf. on Machine learning. Vol
[24] Breiman, L. Bagging predictors. Machine learning 24.2 (1996):
[25] Friedman, J., T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The annals of statistics 28.2 (2000):
[26] Weiser, M. Program slicing. Proc. of the 5th int. conf. on Soft. eng. IEEE Press,
[27] Wang, X., et al. Can I clone this piece of code here?. Proc. of the 27th IEEE/ACM Int. Conf. on Automated Soft. Eng., 2012.


Towards the Code Clone Analysis in Heterogeneous Software Products Towards the Code Clone Analysis in Heterogeneous Software Products 11 TIJANA VISLAVSKI, ZORAN BUDIMAC AND GORDANA RAKIĆ, University of Novi Sad Code clones are parts of source code that were usually created

More information

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Md Nasim Adnan and Md Zahidul Islam Centre for Research in Complex Systems (CRiCS)

More information

DCC / ICEx / UFMG. Software Code Clone. Eduardo Figueiredo.

DCC / ICEx / UFMG. Software Code Clone. Eduardo Figueiredo. DCC / ICEx / UFMG Software Code Clone Eduardo Figueiredo http://www.dcc.ufmg.br/~figueiredo Code Clone Code Clone, also called Duplicated Code, is a well known code smell in software systems Code clones

More information

Lecture 25 Clone Detection CCFinder. EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Lecture 25 Clone Detection CCFinder. EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Lecture 25 Clone Detection CCFinder Today s Agenda (1) Recap of Polymetric Views Class Presentation Suchitra (advocate) Reza (skeptic) Today s Agenda (2) CCFinder, Kamiya et al. TSE 2002 Recap of Polymetric

More information

Cross Language Higher Level Clone Detection- Between Two Different Object Oriented Programming Language Source Codes

Cross Language Higher Level Clone Detection- Between Two Different Object Oriented Programming Language Source Codes Cross Language Higher Level Clone Detection- Between Two Different Object Oriented Programming Language Source Codes 1 K. Vidhya, 2 N. Sumathi, 3 D. Ramya, 1, 2 Assistant Professor 3 PG Student, Dept.

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

Code Clone Detection on Specialized PDGs with Heuristics

Code Clone Detection on Specialized PDGs with Heuristics 2011 15th European Conference on Software Maintenance and Reengineering Code Clone Detection on Specialized PDGs with Heuristics Yoshiki Higo Graduate School of Information Science and Technology Osaka

More information

Mapping Bug Reports to Relevant Files and Automated Bug Assigning to the Developer Alphy Jose*, Aby Abahai T ABSTRACT I.

Mapping Bug Reports to Relevant Files and Automated Bug Assigning to the Developer Alphy Jose*, Aby Abahai T ABSTRACT I. International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 1 ISSN : 2456-3307 Mapping Bug Reports to Relevant Files and Automated

More information

arxiv: v1 [cs.se] 20 Dec 2015

arxiv: v1 [cs.se] 20 Dec 2015 SourcererCC: Scaling Code Clone Detection to Big Code Hitesh Sajnani * Vaibhav Saini * Jeffrey Svajlenko Chanchal K. Roy Cristina V. Lopes * * School of Information and Computer Science, UC Irvine, USA

More information

A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique

A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique A Novel Ontology Metric Approach for Code Clone Detection Using FusionTechnique 1 Syed MohdFazalulHaque, 2 Dr. V Srikanth, 3 Dr. E. Sreenivasa Reddy 1 Maulana Azad National Urdu University, 2 Professor,

More information

Design Code Clone Detection System uses Optimal and Intelligence Technique based on Software Engineering

Design Code Clone Detection System uses Optimal and Intelligence Technique based on Software Engineering Volume 8, No. 5, May-June 2017 International Journal of Advanced Research in Computer Science RESEARCH PAPER Available Online at www.ijarcs.info ISSN No. 0976-5697 Design Code Clone Detection System uses

More information

DCCD: An Efficient and Scalable Distributed Code Clone Detection Technique for Big Code

DCCD: An Efficient and Scalable Distributed Code Clone Detection Technique for Big Code DCCD: An Efficient and Scalable Distributed Code Clone Detection Technique for Big Code Junaid Akram (Member, IEEE), Zhendong Shi, Majid Mumtaz and Luo Ping State Key Laboratory of Information Security,

More information

Study and Analysis of Object-Oriented Languages using Hybrid Clone Detection Technique

Study and Analysis of Object-Oriented Languages using Hybrid Clone Detection Technique Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 6 (2017) pp. 1635-1649 Research India Publications http://www.ripublication.com Study and Analysis of Object-Oriented

More information

ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools

ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools Jeffrey Svajlenko Chanchal K. Roy University of Saskatchewan, Canada {jeff.svajlenko, chanchal.roy}@usask.ca Slawomir

More information

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,

More information

1/30/18. Overview. Code Clones. Code Clone Categorization. Code Clones. Code Clone Categorization. Key Points of Code Clones

1/30/18. Overview. Code Clones. Code Clone Categorization. Code Clones. Code Clone Categorization. Key Points of Code Clones Overview Code Clones Definition and categories Clone detection Clone removal refactoring Spiros Mancoridis[1] Modified by Na Meng 2 Code Clones Code clone is a code fragment in source files that is identical

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai CS 8803 AIAD Prof Ling Liu Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai Under the supervision of Steve Webb Motivations and Objectives Spam, which was until

More information

Code Clone Detector: A Hybrid Approach on Java Byte Code

Code Clone Detector: A Hybrid Approach on Java Byte Code Code Clone Detector: A Hybrid Approach on Java Byte Code Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Software Engineering Submitted By

More information

Detection and Behavior Identification of Higher-Level Clones in Software

Detection and Behavior Identification of Higher-Level Clones in Software Detection and Behavior Identification of Higher-Level Clones in Software Swarupa S. Bongale, Prof. K. B. Manwade D. Y. Patil College of Engg. & Tech., Shivaji University Kolhapur, India Ashokrao Mane Group

More information

Searching for Configurations in Clone Evaluation A Replication Study

Searching for Configurations in Clone Evaluation A Replication Study Searching for Configurations in Clone Evaluation A Replication Study Chaiyong Ragkhitwetsagul 1, Matheus Paixao 1, Manal Adham 1 Saheed Busari 1, Jens Krinke 1 and John H. Drake 2 1 University College

More information

Scalable Code Clone Detection and Search based on Adaptive Prefix Filtering

Scalable Code Clone Detection and Search based on Adaptive Prefix Filtering Scalable Code Clone Detection and Search based on Adaptive Prefix Filtering Manziba Akanda Nishi a, Kostadin Damevski a a Department of Computer Science, Virginia Commonwealth University Abstract Code

More information

Keywords Clone detection, metrics computation, hybrid approach, complexity, byte code

Keywords Clone detection, metrics computation, hybrid approach, complexity, byte code Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com An Emerging Approach

More information

Folding Repeated Instructions for Improving Token-based Code Clone Detection

Folding Repeated Instructions for Improving Token-based Code Clone Detection 2012 IEEE 12th International Working Conference on Source Code Analysis and Manipulation Folding Repeated Instructions for Improving Token-based Code Clone Detection Hiroaki Murakami, Keisuke Hotta, Yoshiki

More information

Using Machine Learning to Identify Security Issues in Open-Source Libraries. Asankhaya Sharma Yaqin Zhou SourceClear

Using Machine Learning to Identify Security Issues in Open-Source Libraries. Asankhaya Sharma Yaqin Zhou SourceClear Using Machine Learning to Identify Security Issues in Open-Source Libraries Asankhaya Sharma Yaqin Zhou SourceClear Outline - Overview of problem space Unidentified security issues How Machine Learning

More information

Impact of Dependency Graph in Software Testing

Impact of Dependency Graph in Software Testing Impact of Dependency Graph in Software Testing Pardeep Kaur 1, Er. Rupinder Singh 2 1 Computer Science Department, Chandigarh University, Gharuan, Punjab 2 Assistant Professor, Computer Science Department,

More information

Accuracy Enhancement in Code Clone Detection Using Advance Normalization

Accuracy Enhancement in Code Clone Detection Using Advance Normalization Accuracy Enhancement in Code Clone Detection Using Advance Normalization 1 Ritesh V. Patil, 2 S. D. Joshi, 3 Digvijay A. Ajagekar, 4 Priyanka A. Shirke, 5 Vivek P. Talekar, 6 Shubham D. Bankar 1 Research

More information

Dr. Sushil Garg Professor, Dept. of Computer Science & Applications, College City, India

Dr. Sushil Garg Professor, Dept. of Computer Science & Applications, College City, India Volume 3, Issue 11, November 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Study of Different

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Predicting Rare Failure Events using Classification Trees on Large Scale Manufacturing Data with Complex Interactions

Predicting Rare Failure Events using Classification Trees on Large Scale Manufacturing Data with Complex Interactions 2016 IEEE International Conference on Big Data (Big Data) Predicting Rare Failure Events using Classification Trees on Large Scale Manufacturing Data with Complex Interactions Jeff Hebert, Texas Instruments

More information

Feature Selection and Classification for Small Gene Sets

Feature Selection and Classification for Small Gene Sets Feature Selection and Classification for Small Gene Sets Gregor Stiglic 1,2, Juan J. Rodriguez 3, and Peter Kokol 1,2 1 University of Maribor, Faculty of Health Sciences, Zitna ulica 15, 2000 Maribor,

More information

An Approach to Detect Clones in Class Diagram Based on Suffix Array

An Approach to Detect Clones in Class Diagram Based on Suffix Array An Approach to Detect Clones in Class Diagram Based on Suffix Array Amandeep Kaur, Computer Science and Engg. Department, BBSBEC Fatehgarh Sahib, Punjab, India. Manpreet Kaur, Computer Science and Engg.

More information

A Lazy Approach for Machine Learning Algorithms

A Lazy Approach for Machine Learning Algorithms A Lazy Approach for Machine Learning Algorithms Inés M. Galván, José M. Valls, Nicolas Lecomte and Pedro Isasi Abstract Most machine learning algorithms are eager methods in the sense that a model is generated

More information

On Refactoring for Open Source Java Program

On Refactoring for Open Source Java Program On Refactoring for Open Source Java Program Yoshiki Higo 1,Toshihiro Kamiya 2, Shinji Kusumoto 1, Katsuro Inoue 1 and Yoshio Kataoka 3 1 Graduate School of Information Science and Technology, Osaka University

More information

Mondrian Forests: Efficient Online Random Forests

Mondrian Forests: Efficient Online Random Forests Mondrian Forests: Efficient Online Random Forests Balaji Lakshminarayanan Joint work with Daniel M. Roy and Yee Whye Teh 1 Outline Background and Motivation Mondrian Forests Randomization mechanism Online

More information

A Novel Technique for Retrieving Source Code Duplication

A Novel Technique for Retrieving Source Code Duplication A Novel Technique for Retrieving Source Code Duplication Yoshihisa Udagawa Computer Science Department, Faculty of Engineering Tokyo Polytechnic University Atsugi-city, Kanagawa, Japan udagawa@cs.t-kougei.ac.jp

More information

COMPARISON OF SUBSAMPLING TECHNIQUES FOR RANDOM SUBSPACE ENSEMBLES

COMPARISON OF SUBSAMPLING TECHNIQUES FOR RANDOM SUBSPACE ENSEMBLES COMPARISON OF SUBSAMPLING TECHNIQUES FOR RANDOM SUBSPACE ENSEMBLES SANTHOSH PATHICAL 1, GURSEL SERPEN 1 1 Elecrical Engineering and Computer Science Department, University of Toledo, Toledo, OH, 43606,

More information

Neural Detection of Semantic Code Clones via Tree-Based Convolution

Neural Detection of Semantic Code Clones via Tree-Based Convolution Neural Detection of Semantic Code Clones via Tree-Based Convolution Hao Yu 1,2, Wing Lam 3, Long Chen 2, Ge Li 1,4 *, Tao Xie 1,3, and Qianxiang Wang 5 1 Key Laboratory of High Confidence Software Technologies

More information

SNS College of Technology, Coimbatore, India

SNS College of Technology, Coimbatore, India Support Vector Machine: An efficient classifier for Method Level Bug Prediction using Information Gain 1 M.Vaijayanthi and 2 M. Nithya, 1,2 Assistant Professor, Department of Computer Science and Engineering,

More information

Refactoring Support Based on Code Clone Analysis

Refactoring Support Based on Code Clone Analysis Refactoring Support Based on Code Clone Analysis Yoshiki Higo 1,Toshihiro Kamiya 2, Shinji Kusumoto 1 and Katsuro Inoue 1 1 Graduate School of Information Science and Technology, Osaka University, Toyonaka,

More information

International Journal of Scientific & Engineering Research, Volume 8, Issue 2, February ISSN

International Journal of Scientific & Engineering Research, Volume 8, Issue 2, February ISSN International Journal of Scientific & Engineering Research, Volume 8, Issue 2, February-2017 164 DETECTION OF SOFTWARE REFACTORABILITY THROUGH SOFTWARE CLONES WITH DIFFRENT ALGORITHMS Ritika Rani 1,Pooja

More information

International Journal of Advance Research in Engineering, Science & Technology

International Journal of Advance Research in Engineering, Science & Technology Impact Factor (SJIF): 4.542 International Journal of Advance Research in Engineering, Science & Technology e-issn: 2393-9877, p-issn: 2394-2444 Volume 4, Issue5,May-2017 Software Fault Detection using

More information

Code Duplication++ Status Report Dolores Zage, Wayne Zage, Nathan White Ball State University November 2018

Code Duplication++ Status Report Dolores Zage, Wayne Zage, Nathan White Ball State University November 2018 Code Duplication++ Status Report Dolores Zage, Wayne Zage, Nathan White Ball State University November 2018 Long Term Goals The goal of this project is to enhance the identification of code duplication

More information

Finding Extract Method Refactoring Opportunities by Analyzing Development History

Finding Extract Method Refactoring Opportunities by Analyzing Development History 2017 IEEE 41st Annual Computer Software and Applications Conference Finding Extract Refactoring Opportunities by Analyzing Development History Ayaka Imazato, Yoshiki Higo, Keisuke Hotta, and Shinji Kusumoto

More information

Software Clone Detection Using Cosine Distance Similarity

Software Clone Detection Using Cosine Distance Similarity Software Clone Detection Using Cosine Distance Similarity A Dissertation SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE AWARD OF DEGREE OF MASTER OF TECHNOLOGY IN COMPUTER SCIENCE & ENGINEERING

More information

An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies

An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies Ripon K. Saha Chanchal K. Roy Kevin A. Schneider Department of Computer Science, University of Saskatchewan, Canada {ripon.saha,

More information

Individualized Error Estimation for Classification and Regression Models

Individualized Error Estimation for Classification and Regression Models Individualized Error Estimation for Classification and Regression Models Krisztian Buza, Alexandros Nanopoulos, Lars Schmidt-Thieme Abstract Estimating the error of classification and regression models

More information

Tensor Sparse PCA and Face Recognition: A Novel Approach

Tensor Sparse PCA and Face Recognition: A Novel Approach Tensor Sparse PCA and Face Recognition: A Novel Approach Loc Tran Laboratoire CHArt EA4004 EPHE-PSL University, France tran0398@umn.edu Linh Tran Ho Chi Minh University of Technology, Vietnam linhtran.ut@gmail.com

More information

Master Thesis. Type-3 Code Clone Detection Using The Smith-Waterman Algorithm

Master Thesis. Type-3 Code Clone Detection Using The Smith-Waterman Algorithm Master Thesis Title Type-3 Code Clone Detection Using The Smith-Waterman Algorithm Supervisor Prof. Shinji KUSUMOTO by Hiroaki MURAKAMI February 5, 2013 Department of Computer Science Graduate School of

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Semi supervised clustering for Text Clustering

Semi supervised clustering for Text Clustering Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering

More information

Compiling clones: What happens?

Compiling clones: What happens? Compiling clones: What happens? Oleksii Kononenko, Cheng Zhang, and Michael W. Godfrey David R. Cheriton School of Computer Science University of Waterloo, Canada {okononen, c16zhang, migod}@uwaterloo.ca

More information

More Learning. Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA

More Learning. Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA More Learning Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA 1 Ensembles An ensemble is a set of classifiers whose combined results give the final decision. test feature vector

More information

OPTIMIZATION OF BAGGING CLASSIFIERS BASED ON SBCB ALGORITHM

OPTIMIZATION OF BAGGING CLASSIFIERS BASED ON SBCB ALGORITHM OPTIMIZATION OF BAGGING CLASSIFIERS BASED ON SBCB ALGORITHM XIAO-DONG ZENG, SAM CHAO, FAI WONG Faculty of Science and Technology, University of Macau, Macau, China E-MAIL: ma96506@umac.mo, lidiasc@umac.mo,

More information

Using Association Rules for Better Treatment of Missing Values

Using Association Rules for Better Treatment of Missing Values Using Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine Intelligence Group) National University

More information

Mondrian Forests: Efficient Online Random Forests

Mondrian Forests: Efficient Online Random Forests Mondrian Forests: Efficient Online Random Forests Balaji Lakshminarayanan (Gatsby Unit, UCL) Daniel M. Roy (Cambridge Toronto) Yee Whye Teh (Oxford) September 4, 2014 1 Outline Background and Motivation

More information

Correlation Based Feature Selection with Irrelevant Feature Removal

Correlation Based Feature Selection with Irrelevant Feature Removal Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Incremental Clone Detection and Elimination for Erlang Programs

Incremental Clone Detection and Elimination for Erlang Programs Incremental Clone Detection and Elimination for Erlang Programs Huiqing Li and Simon Thompson School of Computing, University of Kent, UK {H.Li, S.J.Thompson}@kent.ac.uk Abstract. A well-known bad code

More information

Supervised classification of law area in the legal domain

Supervised classification of law area in the legal domain AFSTUDEERPROJECT BSC KI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms

More information

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 1

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP (  1 Cluster Based Speed and Effective Feature Extraction for Efficient Search Engine Manjuparkavi A 1, Arokiamuthu M 2 1 PG Scholar, Computer Science, Dr. Pauls Engineering College, Villupuram, India 2 Assistant

More information

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov WEKA: Practical Machine Learning Tools and Techniques in Java Seminar A.I. Tools WS 2006/07 Rossen Dimov Overview Basic introduction to Machine Learning Weka Tool Conclusion Document classification Demo

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

CCFinderSW: Clone Detection Tool with Flexible Multilingual Tokenization

CCFinderSW: Clone Detection Tool with Flexible Multilingual Tokenization 2017 24th Asia-Pacific Software Engineering Conference CCFinderSW: Clone Detection Tool with Flexible Multilingual Tokenization Yuichi Semura, Norihiro Yoshida, Eunjong Choi and Katsuro Inoue Osaka University,

More information

The Role of Biomedical Dataset in Classification

The Role of Biomedical Dataset in Classification The Role of Biomedical Dataset in Classification Ajay Kumar Tanwani and Muddassar Farooq Next Generation Intelligent Networks Research Center (nexgin RC) National University of Computer & Emerging Sciences

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Code Similarity Detection by Program Dependence Graph

Code Similarity Detection by Program Dependence Graph 2016 International Conference on Computer Engineering and Information Systems (CEIS-16) Code Similarity Detection by Program Dependence Graph Zhen Zhang, Hai-Hua Yan, Xiao-Wei Zhang Dept. of Computer Science,

More information

Deckard: Scalable and Accurate Tree-based Detection of Code Clones. Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, Stephane Glondu

Deckard: Scalable and Accurate Tree-based Detection of Code Clones. Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, Stephane Glondu Deckard: Scalable and Accurate Tree-based Detection of Code Clones Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, Stephane Glondu The Problem Find similar code in large code bases, often referred to as

More information

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1 Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches

More information

Bagging and Boosting Algorithms for Support Vector Machine Classifiers

Bagging and Boosting Algorithms for Support Vector Machine Classifiers Bagging and Boosting Algorithms for Support Vector Machine Classifiers Noritaka SHIGEI and Hiromi MIYAJIMA Dept. of Electrical and Electronics Engineering, Kagoshima University 1-21-40, Korimoto, Kagoshima

More information

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER P.Radhabai Mrs.M.Priya Packialatha Dr.G.Geetha PG Student Assistant Professor Professor Dept of Computer Science and Engg Dept

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

Classification using Weka (Brain, Computation, and Neural Learning)

Classification using Weka (Brain, Computation, and Neural Learning) LOGO Classification using Weka (Brain, Computation, and Neural Learning) Jung-Woo Ha Agenda Classification General Concept Terminology Introduction to Weka Classification practice with Weka Problems: Pima

More information

Improving Imputation Accuracy in Ordinal Data Using Classification

Improving Imputation Accuracy in Ordinal Data Using Classification Improving Imputation Accuracy in Ordinal Data Using Classification Shafiq Alam 1, Gillian Dobbie, and XiaoBin Sun 1 Faculty of Business and IT, Whitireia Community Polytechnic, Auckland, New Zealand shafiq.alam@whitireia.ac.nz

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information