Automatic Clone Recommendation for Refactoring Based on the Present and the Past

Ruru Yue, Zhe Gao, Na Meng, Yingfei Xiong, Xiaoyin Wang, J. David Morgenthaler
Key Laboratory of High Confidence Software Technologies (Peking University), MoE; Institute of Software, EECS, Peking University, Beijing, China; Department of Computer Science, Virginia Tech, Blacksburg, VA 24060; Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249; Google, Mountain View, CA 94043

Abstract: When many clones are detected in software programs, not all clones are equally important to developers. To help developers refactor code and improve software quality, various tools were built to recommend clone-removal refactorings based on past and present information, such as the cohesion degree of individual clones or the co-evolution relations of clone peers. The existence of these tools inspired us to build an approach that considers as many factors as possible to more accurately recommend clones. This paper introduces CREC, a learning-based approach that recommends clones by extracting features from the current status and past history of software projects. Given a set of software repositories, CREC first automatically extracts the clone groups historically refactored (R-clones) and those not refactored (NR-clones) to construct the training set. CREC extracts 34 features to characterize the content and evolution behaviors of individual clones, as well as the spatial, syntactical, and co-change relations of clone peers. With these features, CREC trains a classifier that recommends clones for refactoring. We designed the largest feature set thus far for clone recommendation, and performed an evaluation on six large projects. The results show that our approach suggested refactorings with 83% and 76% F-scores in the within-project and cross-project settings. CREC significantly outperforms a similar state-of-the-art approach on our data set, with the latter achieving 70% and 50% F-scores. We also compared the effectiveness of different factors and different learning algorithms.

I. INTRODUCTION

Code clones (or duplicated code) present challenges to software maintenance. They may require developers to repetitively apply similar edits to multiple program locations. When developers fail to consistently update clones, they may incompletely fix a bug or introduce new bugs [24], [33], [39], [42]. To mitigate the problem, programmers sometimes apply clone-removal refactorings (e.g., Extract Method and Form Template Method [12]) to reduce code duplication and eliminate repetitive coding effort. However, as studied by existing work [15], [27], [51], not all clones need to be refactored. Many clones are not worthwhile or are difficult to refactor [27], and programmers often refactor only a small portion of all clones [51]. Thus, when a clone detection tool reports lots of clones in a program, it is difficult for the developer to go through the (usually long) list and locate the clones that should be refactored. A technique that automatically recommends only the important clones for refactoring would be valuable.

Researchers have recognized this problem, and various tools have been developed to recommend clones for refactoring. Given a project, most current tools recommend clones based on factors representing either the present status or the past evolution of the project.
For example, some tools suggest refactorings based on code smells present in a program, e.g., high cohesion and low coupling of clones [16], [19]. Other tools provide recommendations based on the past co-evolution relationship between clones [18], [36]. The success of these tools indicates that both the present and the past of programs are useful in recommending clones, and it naturally raises a question: can we build an approach that considers as many factors as possible to more accurately recommend clones?

To take multiple factors into account, we need a model that aggregates all factors properly. Machine learning provides a good way to build such a model. Existing work [51] showed that using machine learning to integrate factors could lead to good accuracy in recommending clones. However, machine learning requires a large set of labelled instances for training. In practice, we often do not have enough human effort to manually build such a training set, nor have we found any existing work that automatically builds one.

In this paper, we present CREC, a learning-based approach that automatically extracts refactored and non-refactored clone groups from software repositories, and trains models to recommend clones for refactoring. Specifically, as shown in Fig. 1, there are three phases in our approach. Given open source project repositories, CREC first models clone genealogies for each repository as prior work does [27] and identifies clone groups historically refactored (R-clones) and others not refactored (NR-clones) in Phase I. CREC then extracts 34 features to characterize both kinds of groups and to train a model in Phase II. Intuitively, developers may decide refactorings based on (1) the cost of conducting refactorings (e.g., the divergence degree among clones), and (2) any potential benefit (e.g., better software maintainability). To consider as many factors as possible, we thus include 17 per-clone features to characterize each clone's content and evolution history, and 17 clone-group features

to capture the relative location, syntactic difference, and co-change relationship between clone peers. The 34 features in total describe both the present status and past evolution of single clones and clone peers. Finally, in Phase III, to recommend clones given a project's repository, CREC performs clone genealogy analysis, extracts features from the clone genealogies, and uses the trained model for refactoring-decision prediction.

[Fig. 1: CREC has three phases: clone data preparation, training, and testing. Phase I detects refactored and non-refactored clones in the repositories of training projects. Phase II trains a classifier with machine learning based on the two types of clones. Phase III recommends clones for refactoring using the trained classifier.]

We evaluated CREC with R-clones and NR-clones extracted from six open source projects in the within-project and cross-project settings. Based on our dataset and extracted features, we also evaluated the effectiveness of different feature subsets and machine learning algorithms for clone refactoring recommendation. Our evaluation has several findings. (1) CREC effectively suggested clones for refactoring with 83% F-score in the within-project setting and 76% F-score in the cross-project setting. (2) CREC significantly outperforms a state-of-the-art learning-based approach [51], which has 70% F-score in the within-project setting and 50% F-score in the cross-project setting. (3) Our automatically extracted refactored clones achieve a precision of 92% on average. (4) Clone history features and co-change features are the most effective features, while tree-based machine learning algorithms are more effective than SMO and Naive Bayes.

In summary, we have made the following contributions:
- We developed CREC, a novel approach that (1) automates the extraction of R-clones from repositories in an accurate way, and (2) recommends clones for refactoring based on a comprehensive set of characteristic features.
- We designed 34 features (the largest feature set thus far) to characterize the content and evolution of individual clones, as well as the spatial, syntactical, and co-evolution relationship among clones.
- We conducted a systematic evaluation to assess (1) the precision of automatic R-clone extraction, (2) CREC's effectiveness of clone recommendation in comparison with a state-of-the-art approach [51], and (3) the tool's sensitivity to the usage of different feature sets and various machine learning algorithms.

II. APPROACH

CREC consists of three phases. In this section, we first discuss how CREC automatically extracts R-clones and NR-clones from software history (Section II-A). Then we explain our feature extraction to characterize clone groups (Section II-B). Finally, we expound on the training and testing phases (Sections II-C and II-D).

A. Clone Data Preparation

This phase contains two steps. First, CREC conducts clone genealogy analysis to identify clone groups and extract their evolution history (Section II-A1).
Second, based on the established clone genealogies, CREC implements several heuristics to decide whether or not a clone group was refactored by developers (Section II-A2).

1) Clone Genealogy Analysis: A clone genealogy describes how clones evolve over time. Kim et al. invented an approach to model clone genealogies based on software version history [27]. As the original implementation is not available, we reimplemented the approach with some improvements to multi-version clone detection.

[Fig. 2: An exemplar clone lineage]

Fig. 2 illustrates a clone lineage modeled by our analysis. Suppose there are three clone groups detected in a project's

different versions: G1 = {A, B, C, D} in version Vi, G2 = {A', B', C'} in Vi+1, and G3 = {A'', B''} in Vi+2. By comparing the clones' string similarity across versions, we learn that G1 and G2 have three similar but not identical matching clone members: (A, A'), (B, B'), and (C, C'); while G2 and G3 have two matching members: (A', A'') and (B', B''). Therefore, we correlate the three clone groups. Furthermore, we infer that G1 evolves to G2 because the majority of their members match, and that G2 evolves to G3 because two members seem to change consistently while the third member (C') remains unchanged. There are thus two main parts in our clone genealogy analysis: multi-version clone detection and cross-version clone linking.

Multi-Version Clone Detection. We utilized SourcererCC [44], a recent scalable and accurate near-miss clone detector, to find block-level clones. As with prior work [43], [44], we configured SourcererCC to report clones with at least 30 tokens or 6 lines of code to avoid trivial code duplication. We also set SourcererCC's minimum similarity threshold to 80% to reduce false alarms. Since the whole history may contain a huge number of commits, to efficiently establish clone genealogies we sampled versions in chronological order based on the amount of code revision between versions. Specifically, CREC samples the oldest version of a project by default, denoting it as S1. Then CREC enumerates subsequent versions until finding one that has at least 200 changed lines in comparison with S1, denoting it as S2. With S2 selected as another sample, CREC further traverses the follow-up versions until finding another version that differs from S2 by at least 200 lines of code. This process continues until CREC arrives at the latest version.

Cross-Version Clone Linking. We used an existing approach [52] to link the clone groups detected in different sampled versions, in order to restore the clone evolution history. This approach matches clones across versions based on code location and textual similarity. For example, suppose that two clones A and B are found in file F of version Vi, while one clone A' is found in the same file F of version Vi+1. Since A and A' are located in the same file, and A' has more content overlap with A than with B, we link A with A' across versions and conclude that A evolves to A'. With such clone linking, we can further link clone groups across versions based on the membership of linked clones.

2) Clone Labeling: To train a classifier for clone refactoring recommendation, we need to label clones as either refactored or not refactored by developers. Intuitively, if (1) certain clones have duplicated code removed and method invocations added, and (2) the newly invoked method has content similar to the removed code, we infer that developers might have applied Extract Method refactorings to these clones to remove duplication. Based on this intuition, we developed CREC to assess any given clone group G of version Vi with the following three criteria:
i. At least two clone instances in G have their code size reduced when evolving into clones of version Vi+1. We refer to this subset of clones as C = {c1, c2, ...} (C ⊆ G).
ii. Within C, at least two clone instances (from the same or different source files) add invocations to a method m() during the evolution. We refer to the corresponding clone subset as C' = {c'1, c'2, ...} (C' ⊆ C).
iii. Within C', at least two clone members have their removed code similar to m()'s body.
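Once the cross-version diff of each clone is available, the three criteria can be checked mechanically. Below is a minimal sketch of that decision logic; ClonePair and its accessors (sizeReduced, addedInvocations, removedCode), as well as the similarity and methodBody helpers, are hypothetical placeholders rather than CREC's published implementation:

```java
import java.util.*;

// Illustrative R-clone labeling check (Section II-A2); only the decision logic is shown.
class CloneLabeler {
    static final double SIM_THRESHOLD = 0.4;  // similarity between removed code and m()'s body

    /** Returns true if group G (evolving from Vi to Vi+1) should be labeled as an R-clone group. */
    boolean isRefactored(List<ClonePair> group) {
        // Criterion i: clones whose code size shrank between Vi and Vi+1.
        List<ClonePair> c = new ArrayList<>();
        for (ClonePair p : group) if (p.sizeReduced()) c.add(p);
        if (c.size() < 2) return false;

        // Criterion ii: within C, clones that newly invoke some common method m().
        Map<String, List<ClonePair>> byInvokedMethod = new HashMap<>();
        for (ClonePair p : c)
            for (String m : p.addedInvocations())
                byInvokedMethod.computeIfAbsent(m, k -> new ArrayList<>()).add(p);

        for (Map.Entry<String, List<ClonePair>> e : byInvokedMethod.entrySet()) {
            List<ClonePair> cPrime = e.getValue();
            if (cPrime.size() < 2) continue;
            // Criterion iii: at least two members' removed code resembles m()'s body.
            int similar = 0;
            for (ClonePair p : cPrime)
                if (similarity(p.removedCode(), methodBody(e.getKey())) >= SIM_THRESHOLD) similar++;
            if (similar >= 2) return true;
        }
        return false;
    }

    double similarity(String a, String b) { /* e.g., a token-based similarity measure */ return 0; }
    String methodBody(String methodId)   { /* resolve m() and return its body text */ return ""; }
}

interface ClonePair {               // one clone tracked across the two versions
    boolean sizeReduced();          // criterion i
    Set<String> addedInvocations(); // criterion ii
    String removedCode();           // criterion iii
}
```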
Here, we set the similarity threshold to 0.4, because the implementation of m() may extract the commonality of multiple clones while parameterizing their differences, and thus may become dissimilar to each original clone. This threshold setting is further discussed in Section III-C. When G satisfies all of the above criteria, CREC labels it as an R-clone; otherwise, it is labeled as an NR-clone.

B. Feature Extraction

With the ability to detect so many clones, determining which of them to refactor can be challenging. Prior research shows that developers base their refactoring decisions on the potential effort and benefits of conducting refactorings [28], [51]. Our intuition is that by extracting as many features as possible to reflect the present status and history of programs, we can capture the factors that may affect developers' refactoring decisions, and thus accurately suggest clones for refactoring. Specifically, we extract five categories of features: two categories modeling the current version and historical evolution of individual clones, and three categories reflecting the location, code difference, and co-change relationship among the clone peers of each group. Table I summarizes the 34 features of these 5 categories.

1) Clone Code Features: We believe that the benefits and effort of refactoring a clone are related to the code size, complexity, and the clone's dependence relationship with surrounding code. For instance, if a clone is large, complex, and independent of its surrounding context, the clone is likely to be refactored. Therefore, we defined 11 features (F1-F11) to characterize the content of individual clones.

2) Clone History Features: A project's evolution history may imply its future evolution. Prior work shows that many refactorings are driven by maintenance tasks such as feature extensions or bug fixes [45]. It is possible that developers refactor a clone because the code has recently been modified repetitively. To model a project's recent evolution history, we took the latest 1/10 of the project's sampled commits and extracted 6 features (F12-F17).

3) Relative Location Features among Clones: Clones located in the same or closely related program components (e.g., packages, files, or classes) share program context, and thus may be easier for developers to refactor. Consequently, we defined 6 features (F18-F23) to model the relative location information among clones, characterizing the potential refactoring effort.

4) Syntactic Difference Features among Clones: Intuitively, dissimilar clone instances are usually more difficult to refactor than similar ones, because developers have to resolve the differences before extracting common code as a new method. To estimate the potential refactoring effort, we defined 6 features (F24-F29).
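Several of the per-clone code features above can be computed directly from a clone's AST. As an illustration only (the paper does not spell out its implementation), F3 (cyclomatic complexity) could be derived with the Eclipse JDT parser that CREC already uses for identifier classification (Section II-B4), by counting branching constructs in the clone's statement block:

```java
import org.eclipse.jdt.core.dom.*;

// Illustrative computation of F3 for a clone given as Java source text.
class CyclomaticComplexity extends ASTVisitor {
    int decisionPoints = 0;

    static int of(String cloneSource) {
        ASTParser parser = ASTParser.newParser(AST.JLS8);
        parser.setKind(ASTParser.K_STATEMENTS);      // a clone is a block of statements
        parser.setSource(cloneSource.toCharArray());
        ASTNode block = parser.createAST(null);
        CyclomaticComplexity v = new CyclomaticComplexity();
        block.accept(v);
        return v.decisionPoints + 1;                 // V(G) = decision points + 1
    }

    @Override public boolean visit(IfStatement n)           { decisionPoints++; return true; }
    @Override public boolean visit(ForStatement n)          { decisionPoints++; return true; }
    @Override public boolean visit(EnhancedForStatement n)  { decisionPoints++; return true; }
    @Override public boolean visit(WhileStatement n)        { decisionPoints++; return true; }
    @Override public boolean visit(DoStatement n)           { decisionPoints++; return true; }
    @Override public boolean visit(SwitchCase n)            { decisionPoints++; return true; }
    @Override public boolean visit(CatchClause n)           { decisionPoints++; return true; }
    @Override public boolean visit(ConditionalExpression n) { decisionPoints++; return true; }
    @Override public boolean visit(InfixExpression n) {
        // && and || add decision points as well.
        if (n.getOperator() == InfixExpression.Operator.CONDITIONAL_AND
                || n.getOperator() == InfixExpression.Operator.CONDITIONAL_OR) decisionPoints++;
        return true;
    }
}
```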

4 Feature Category Clone Code (11) Clone History (6) Relative Locations among Clones (6) Syntactic Differences among Clones (6) Features TABLE I: The 34 features used in CREC Rationale F1: Lines of code in a clone. A larger clone is more likely to be refactored to significantly reduce code size. F2: Number of tokens. Similar to F1. F3: Cyclomatic complexity. The more complexity a clone has, the harder it is to maintain the clone, and thus the more probably developers will refactor. F4: Number of field accessed by a clone. The more fields (e.g., private fields) a clone accesses, the more this clone may depend on its surrounding program context. Consequently, refactoring this clone may require more effort to handle the contextual dependencies. F5: Percentage of method invocation statements among all statements in a clone. Similar to F4, this feature also measures the dependence a clone may have on its program context. F6: Percentage of code lines of a clone out of its container method. Similar to F4 and F5, this feature indicates the context information of a clone in its surrounding method and how dependent the clone may be on the context. F7: Percentage of arithmetic statements among all statements in a clone. Arithmetic operations (e.g., + and / ) may indicate intensive computation, and developers may want to refactor such code to reuse the computation. F8: Whether a complete control flow block is contained. If a clone contains an incomplete control flow, such as an else-if branch, it is unlikely that developers will refactor the code. F9: Whether a clone begins with a control statement. Beginning with a control statement (e.g., for-loop) may indicate separable program logic, which can be easily extracted as a new method. F10: Whether a clone follows a control statement. Similar to F9, a control statement may indicate separable logic. F11: Whether a clone is test code. Developers may intentionally create clones in test methods to test similar or relevant functionalities, so they may not want to remove the clones. F12: Percentage of file-existence commits among all checked If a file is recently created, developers may not refactor any of its code. historical commits. F13: Percentage of file-change commits among all checked If a file has been changed a lot, developers may want to refactor the code to commits. facilitate software modification. F14: Percentage of directory-change commits among all Similar to F13, this feature also models historical change activities. checked commits. F15: Percentage of recent file-change commits among the Compared with F13, this feature focuses on a shorter history, because a file s most recent 1/4 checked commits. most recent updates may influence its future more than the old ones. F16: Percentage of recent directory-change commits among Compared with F14, this feature also measures how actively a directory was the most recent 1/4 checked commits. modified, but characterizes a shorter history. F17: Percentage of a file s maintainers among all developers. The more developers modify a file, the more likely that the file is buggy [37]; and thus, developers may want to refactor such files. F18: Whether clone peers in a group are in the same directory It seems easier to refactor clones within the same directory, because clones share the directory as their common context. F19: Whether the clone peers are in the same file. Similar to F18, clones co-existence in the same file may also indicate less refactoring effort. 
F20: Whether the clone peers are in the same class hierarchy Similar to F18 and F19, if clones share their accesses to many fields and methods defined in the same hierarchy, such clones may be easier to refactor. F21: Whether the clone peers are in the same method Similar to F18-F20, clones within the same method possibly share many local F22: Distance of method names measures the minimal Levenshtein distance [32] (i.e., textual edit distance) among the names of methods containing clone peers. F23: Possibility of the clone peers from intentionally copied directories. variables, indicating less effort to apply refactorings. As developers define method names as functionality descriptors, similar method names usually indicate similar implementation, and thus may imply less refactoring effort. The more similar two directories layout are to each other, the more likely that developers intentionally created the cloned directories, and the less likely that clones between these directories are to be refactored. F24: Number of instances in a clone group The size of a clone group can affect developers refactoring decision. For instance, refactoring a larger group may lead to more benefit (e.g., reduce code size), while refactoring a smaller group may require for less effort. F25: Number of total diffs measures the number of differential multisets reported by MCIDiff [34]. F26: Percentage of partially-same diffs measures among all reported differential multisets, how many multisets are partially same. F27: Percentage of variable diffs measures among all differential multisets, how many multisets contain variable identifiers. F28: Percentage of method diffs measures among all differential multisets, how many multisets have method identifiers. F29: Percentage of type diffs measures among all differential multisets, what is the percentage of multisets containing type tokens. F30: Percentage of all-change commits among all checked commits. F31: Percentage of no-change commits among all checked commits. Co-Changes F32: Percentage of 1-clone-change commit among all checked among Clones commits. (5) F33: Percentage of 2-clone-change commits among all checked commits. F34: Percentage of 3-clone-change commits among all checked commits. More differences may demand for more refactoring effort. Partially-same multisets indicate that some instances in a clone group are identical to each other while other instances are different. Developers may easily refactor the clone subset consisting of identical clones without dealing with the difference(s) in other clone peers. If developers resolve each variable diff by renaming the corresponding variables consistently or declaring a parameter in the extracted method, this feature characterizes such effort. If developers resolve each method diff by creating a lambda expression or constructing template methods, this feature reflects the amount of such effort. Different from how variable and method differences are processed, type differences are possibly resolved by generalizing the types into a common super type. This feature reflects such potential refactoring effort. The more often all clones are changed together, the more possibly developers will refactor the code. Similar to F30, this feature also characterizes the evolution similarity among clone peers. Different from F30, this feature characterize the inconsistency evolution of clones. If clones divergently evolve, they are less likely to be refactored. 
Similar to F32, this feature also captures the maintenance inconsistency among clones. Similar to F32 and F33.
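Among the clone-group location features in Table I, F22 is the only one that needs a string metric: the minimal Levenshtein distance among the names of the methods containing the clone peers. A small self-contained sketch of that computation (the containing-method names are assumed to have been collected during feature extraction):

```java
import java.util.List;

class MethodNameDistance {
    /** F22: minimal pairwise Levenshtein distance among containing-method names. */
    static int minPairwiseDistance(List<String> methodNames) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < methodNames.size(); i++)
            for (int j = i + 1; j < methodNames.size(); j++)
                min = Math.min(min, levenshtein(methodNames.get(i), methodNames.get(j)));
        return min;
    }

    /** Classic dynamic-programming edit distance between two strings. */
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }
}
```

For example, levenshtein("drawChart", "drawPlot") evaluates to 4.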

These features model the differences among clones, such as the divergent identifier usage of variables, methods, and types. Specifically, to extract differences among clones, we used an off-the-shelf tool, MCIDiff [34]. Given a clone group G = {c1, c2, ..., cn}, MCIDiff tokenizes each clone, leverages the longest common subsequence (LCS) algorithm to match token sequences, and creates a list of matched and differential multisets of tokens. Intuitively, if a token commonly exists in all clones, it is put into a matched multiset; otherwise, it is put into a differential multiset. Furthermore, there is a special type of differential multiset, called "partially same", which records tokens that commonly exist in some (≥ 2) but not all clones of a group. Matched and partially-same multisets may imply zero refactoring effort, because developers do not need to resolve differences before refactoring the tokens' corresponding code.

Meng et al. showed that variable differences among clones were easier to resolve than method or type differences [36]. To more precisely characterize the potential refactoring effort, we also classified identifier tokens into three types: variables, methods, and types. As MCIDiff does not distinguish different types of tokens, we extended it to further analyze the syntactic meaning of identifiers. In more detail, our extension first parses the Abstract Syntax Tree (AST) of each clone using Eclipse ASTParser [1]. By implementing an ASTVisitor to traverse certain types of ASTNodes, such as FieldAccess, MethodInvocation, and TypeLiteral, we determine whether the token(s) used in these ASTNodes are names of variables, methods, or types. In this way, we further classified identifier tokens and labeled MCIDiff's differential multisets as variable, method, or type accordingly.

5) Co-Change Features among Clones: If multiple clone peers of the same group are frequently changed together, developers may want to refactor the code to reduce or eliminate future repetitive editing effort [35]. Thus, we defined 5 features (F30-F34) to reflect the co-change relationship among clones.

C. Training

To train a binary-class classifier for clone recommendation, we first extracted feature vectors for both R-clones and NR-clones in the training set. For each feature vector we then constructed one data point in the format <feature vector, label>, where the feature vector consists of the 34 features, and the label is set to 1 or 0 depending on whether the feature vector is derived from R-clones or NR-clones. We used Weka [2], a software library implementing a collection of machine learning algorithms, to train a classifier based on the training data. By default, we use AdaBoost [13], a state-of-the-art machine learning algorithm, because prior research shows that it can be successfully used in many applications [11]. Specifically, we configured AdaBoost to take a weak learning algorithm, decision stump [22], as its weak learner, and to call the weak learner 50 times.

D. Testing

Given the clone groups in the testing set, CREC first extracts features based on the current status and evolution history of these clones. Next, CREC feeds the features to the trained classifier to decide whether the clone groups should be refactored or not. If a group is predicted to be refactored with over 50% likelihood, CREC recommends the group for refactoring. When multiple groups are recommended, CREC further ranks them by the likelihood output by the classifier.
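The training and testing steps map directly onto Weka's Java API. The sketch below mirrors the configuration described above (AdaBoostM1 over DecisionStump, 50 iterations, 0.5 likelihood threshold); the ARFF file names and the assumption that the binary class attribute is the last attribute, with the "refactor" label as its second value, are illustrative choices rather than details taken from the paper:

```java
import java.io.FileReader;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instance;
import weka.core.Instances;

public class CrecModel {
    public static void main(String[] args) throws Exception {
        // Phase II: load labeled feature vectors (34 features + a binary class attribute).
        Instances train = new Instances(new FileReader("train.arff"));
        train.setClassIndex(train.numAttributes() - 1);

        // AdaBoost with a decision stump as the weak learner, invoked 50 times.
        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new DecisionStump());
        boost.setNumIterations(50);
        boost.buildClassifier(train);

        // Phase III: score each clone group in the testing set and recommend those
        // whose predicted refactoring likelihood exceeds 0.5.
        Instances test = new Instances(new FileReader("test.arff"));
        test.setClassIndex(test.numAttributes() - 1);
        for (int i = 0; i < test.numInstances(); i++) {
            Instance group = test.instance(i);
            // Index 1 is assumed to correspond to the "refactor" class value.
            double likelihood = boost.distributionForInstance(group)[1];
            if (likelihood > 0.5) {
                System.out.printf("recommend group %d (likelihood %.2f)%n", i, likelihood);
            }
        }
    }
}
```

Sorting the recommended groups by this likelihood reproduces the ranking step described above.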
III. EXPERIMENTS

In this section, we first introduce our data set for evaluation (Section III-A) and then describe our evaluation methodology (Section III-B). Based on the methodology, we estimated the precision of reporting refactored clone groups (Section III-C). Next, we evaluated the effectiveness of CREC and compared it with an existing clone recommendation approach (Section III-D). Finally, we identified important feature subsets (Section III-E) and suitable machine learning algorithms (Section III-F) for clone refactoring recommendation.

A. Data Set Construction

Our experiments used the repositories of six open source projects: Axis2, Eclipse.jdt.core, Elastic Search, JFreeChart, JRuby, and Lucene. We chose these projects for two reasons. First, they have long version histories, and most of them were also used by prior research [36], [46], [49]-[51]. Second, these projects belong to different application domains, which makes our evaluation results representative. Table II presents relevant information about these projects. Column Domain describes each project's application domain. Study Period is the evolution period we investigated. # of Commits is the total number of commits during the studied period. # of Clone Groups is the total number of clone groups detected in each project's sampled versions. # of R-clones is the number of refactored clone groups automatically extracted by CREC.

To evaluate CREC's effectiveness, we need both R-clones and NR-clones to construct the training and testing data sets. As shown in Table II, only a small portion of the detected clone groups are reported as R-clones. To create a balanced data set of positive and negative examples, we randomly chose a subset of clone groups from clone genealogies without R-clones; the number of NR-clones in this subset is equal to the number of reported R-clones. Each time we trained a classifier, we used a portion of this balanced data set as training data. For testing, we used the remaining portion of the data set, excluding the false positive R-clones identified by our manual inspection (see Section III-C). We did not manually inspect the NR-clones in the testing set, because an existing study [51] revealed that only a very small portion of clone groups (less than 1%) are actually refactored by developers. Therefore, the portion of false NR-clones (i.e., missed true R-clones) in the testing set cannot exceed 1%, which has only a marginal effect on our results.

B. Evaluation Methodology

CREC automatically extracts clone groups historically refactored (R-clones) from software repositories, and considers the remaining clone groups as not refactored (NR-clones).

TABLE II: Subject projects
Project | Domain | Study Period | # of Commits | # of Clone Groups | # of R-clones
Axis2 | core engine for web services | 09/02/ - /06/ | | |
Eclipse.jdt.core | integrated development environment | 06/05/ - /22/ | | |
Elastic Search | search engine | 02/08/ - /05/ | | |
JFreeChart | open-source framework for charts | 06/29/ - /24/ | | |
JRuby | an implementation of the Ruby language | 09/10/ - /04/ | | |
Lucene | information retrieval library | 09/11/ - /06/ | | |

Before using the extracted groups to construct the training set, we first assessed CREC's precision of R-clone extraction. Additionally, with the six projects' clone data, we conducted within-project and cross-project evaluations. In the within-project evaluation, we used ten-fold cross-validation to evaluate recommendation effectiveness. In the cross-project evaluation, we iteratively used five projects' R-clones and NR-clones as training data, and used the remaining project's clone data for testing. In both cases, the clone recommendation effectiveness is evaluated with three metrics: precision, recall, and F-score.

Precision (P) measures, among all the clone groups recommended for refactoring, how many were actually refactored by developers:

P = (# of recommended clone groups actually refactored) / (total # of recommended groups) × 100%.  (1)

Recall (R) measures, among all known refactored clone groups, how many are suggested by a refactoring recommendation approach:

R = (# of refactored clone groups suggested) / (total # of known refactored groups) × 100%.  (2)

F-score (F) is the harmonic mean of precision and recall. It combines P and R to measure the overall accuracy of refactoring recommendation:

F = (2 × P × R) / (P + R).  (3)

Precision, recall, and F-score all vary within [0%, 100%]. Higher values are desirable, because they indicate better accuracy of refactoring recommendation. As the within-project and cross-project evaluations were performed with six projects, we obtained six precision, recall, and F-score values respectively, and treated the averages over all six projects as the overall effectiveness measurement.

C. Precision of Reporting R-clones

CREC relies on a similarity threshold parameter lth to determine R-clones (see Section II-A2). With this threshold, CREC checks whether the implementation of a newly invoked method is sufficiently similar to a removed code snippet. In our experiment, we varied lth from 0.3 to 0.5 in increments of 0.1, and manually examined the reported R-clones in the six projects to estimate CREC's precision. In Table III, # of Reported shows the total number of R-clones reported by CREC, # of Confirmed is the number of reported R-clones manually confirmed, and Precision is the resulting precision. As lth increases, the total number of reported groups goes down while the precision goes up. To achieve a good balance between the number of reported refactored groups and the precision, we set lth to 0.4 by default: when lth = 0.4, the number of R-clones reported by CREC and the 92% precision are sufficient for constructing the training set.

TABLE III: CREC's precision of reporting R-clones
Threshold Value | # of Reported | # of Confirmed | Precision
0.3 | | |
0.4 | | | 92%
0.5 | | |

D. Effectiveness Comparison of Clone Recommendation Approaches

Wang et al. developed a machine-learning-based clone recommendation approach [51], which extracted 15 features for cloned snippets and leveraged C4.5 to train and test a classifier.
Among the 15 features, 10 represent the current status of individual clones, 1 reflects the evolution history of individual clones, 3 describe the spatial relationship between clones, and 1 reflects the syntactic difference between clones. Wang et al.'s feature set is much smaller than our 34-feature set. It fully overlaps with our feature category of clone content, and matches three out of our six spatial location-related features. Wang et al.'s feature set characterizes individual clones more comprehensively than the relationship between clone peers (11 features vs. 4 features), and emphasizes clones' present versions more than their previous versions (14 features vs. 1 feature). It does not include any feature to model the co-change relationship between clones.

In this experiment, we compared CREC with Wang et al.'s approach to see how CREC performs differently. Because neither the tool nor the data of Wang et al.'s paper is accessible, even after we contacted the authors, we reimplemented Wang et al.'s approach and used our data set for evaluation. As CREC differs from Wang et al.'s approach in terms of both the feature set and the machine learning (ML) algorithm, we also developed two variants of our approach: CREC_m, which uses CREC's ML algorithm (AdaBoost) but the feature set in Wang et al.'s work, and CREC_f, which uses CREC's feature set but the ML algorithm (C4.5) in Wang et al.'s work. Next, we evaluated the four clone recommendation approaches with the six projects' data in both the within-project and cross-project settings.

Note that due to the different data sets used for evaluation, the effectiveness of Wang et al.'s approach in this paper differs from the results reported in their paper.

Table IV presents the four approaches' effectiveness comparison for within-project prediction. On average, CREC achieved 82% precision, 86% recall, and 83% F-score, which were significantly higher than the 73% precision, 71% recall, and 70% F-score of Wang et al.'s approach. CREC improved the overall average F-score of Wang et al.'s approach (70%) by 19%. This implies that, compared with Wang et al.'s approach, CREC more effectively recommends clones for refactoring. To further analyze why CREC worked differently from Wang et al.'s approach, we also compared their approach with CREC_m and CREC_f. As shown in Table IV, the F-score of CREC_m was a little higher than that of Wang et al.'s approach, while CREC_f worked far more effectively than both of them. The comparison indicates that (1) both our feature set and AdaBoost positively contribute to CREC's better effectiveness over Wang et al.'s approach, and (2) our feature set improves Wang et al.'s approach more significantly than AdaBoost does.

Table V shows the different approaches' evaluation results for cross-project prediction. CREC achieved 77% precision, 77% recall, and 76% F-score on average. In comparison, Wang et al.'s approach acquired 53% precision, 50% recall, and 50% F-score. By replacing their feature set with ours, we significantly increased the precision to 72%. We also replaced their C4.5 algorithm with the AdaBoost algorithm used in CREC, and observed that the F-score increased from 50% to 53%. Correlating these observations with the above-mentioned observations for the within-project setting, we see that our 34-feature set always works much better than Wang et al.'s 15-feature set. Additionally, the replacement of C4.5 with AdaBoost brings a small improvement to the effectiveness of Wang et al.'s approach. Therefore, the rationale behind our feature set plays a crucially important role in CREC's significant improvement over the state-of-the-art approach.

Finding 1: CREC outperforms Wang et al.'s approach, mainly because it leverages 34 features to comprehensively characterize the present and past of code clones.

On average, CREC achieved 82% precision, 86% recall, and 83% F-score in the within-project setting, which were higher than the 77% precision, 77% recall, and 76% F-score in the cross-project setting. This observation is as expected, because the clones of one project may share certain project-specific commonality. By training and testing a classifier with clone data of the same project, CREC can learn some project-specific information, which allows it to suggest clones more effectively. CREC worked better for some projects than for others. For instance, CREC obtained the highest F-score (95%) for Lucene and the lowest F-score (66%) for Elastic Search in the within-project setting. This F-score range means that for the majority of cases, CREC's recommendations are meaningful to developers. It implies two insights. First, developers' refactoring decisions are generally predictable, which validates our hypothesis that different developers decide similarly what clones to refactor. Second, some developers may refactor clones more frequently than others and have personal refactoring preferences. Such personal variance can cause the prediction divergence among different test projects.
Finding 2: CREC suggests refactorings with 66-95% F-scores within projects and 54-93% F-scores across projects. It indicates that developers clone refactoring decisions are generally predictable, although the refactoring preference may vary with people and projects. E. Important Feature Subset Identification CREC leveraged 5 categories of features, including 11 features to characterize the content of individual clones, 6 features to reflect individual clones evolution history, 6 features to represent the spatial locations of clone peers, 6 features to capture any syntactic difference between clones, and 5 features to reflect the co-evolution relationship between clones. To explore how each group of features contribute to CREC s overall effectiveness, we also created five variant feature sets by removing one feature category from the original set at a time. We then executed CREC with each variant feature set in both the within-project and cross-project settings (see Fig. 3). In the within-project setting, CREC s all-feature set achieved 83% F-score. Interestingly, ExceptDiff obtained the same F-score, which means that the features capturing syntactic differences between clone peers were not unique to characterize R-clones. This category of features was inferable from other features, and thus removing this category did not influence CREC s effectiveness. ExceptCode and ExceptLocation obtained the highest F-score 84%. This may imply that the two feature subsets negatively affected CREC s effectiveness, probably because the code smell and the location information in clones were the least important factors that developers considered when refactoring code. ExceptHistory obtained the lowest F-score 77% and ExceptCoChange achieved the second lowest F-score 81%. This indicates that the evolution history of individual clones and the co-change relationship of clone peers provide good indicators for clone refactorings. There seems to be a strong positive correlation between developers refactoring decisions and clones evolution history. In the cross-project setting, CREC s all feature set achieved 76% F-score. ExceptDiff obtained exactly the same F-score, probably because the features capturing any syntactic differences between clones are inferable from other features. All other variant feature sets led to lower F-scores. It means that except the syntactic differences between clones, the other four categories of features positively contribute to CREC s prediction performance. Especially, ExceptHistory obtained an F-score much lower than the scores of other feature sets. This phenomenon coincides with our observations in the within-project setting, implying that the evolution history of

code clones are always positive indicators of clone refactorings. ExceptCoChange acquired the second lowest F-score, meaning that the co-evolution relationship of clone peers is also important for predicting developers' refactoring practice.

Finding 3: The clone history features of single clones and the co-change features among clone peers were the two most important feature subsets. It seems that developers relied more on clones' past than on their present to decide refactorings.

[TABLE IV: Effectiveness comparison of different clone recommendation approaches in the within-project setting. For each test project (Axis2, Eclipse.jdt.core, Elastic Search, JFreeChart, JRuby, Lucene) and on average, the table reports P (%), R (%), and F (%) of CREC (CREC's feature set + AdaBoost), CREC_m (Wang et al.'s feature set + AdaBoost), CREC_f (CREC's feature set + C4.5), and Wang et al.'s approach (Wang et al.'s feature set + C4.5).]

[TABLE V: Effectiveness comparison of different clone recommendation approaches in the cross-project setting, with the same structure as Table IV.]

[Fig. 3: CREC's average effectiveness (precision, recall, and F-score) when using different feature sets (All features, ExceptCode, ExceptHistory, ExceptLocation, ExceptDiff, ExceptCoChange) in the within-project and cross-project settings.]

F. Suitable Model Identification

To understand which machine learning (ML) algorithms are suitable for CREC, in addition to AdaBoost we also experimented with four other supervised learning algorithms: Random Forest [20], C4.5 [41], SMO [40], and Naive Bayes [3]. All these algorithms were executed with their default parameter settings and were evaluated on the six projects for within-project and cross-project predictions.

Fig. 4 presents our experiment results. The three bars for each ML algorithm in each chart illustrate the average precision, recall, and F-score over the six projects. In the within-project setting, Random Forest and C4.5 achieved the highest F-score (84%), and AdaBoost obtained the second best F-score (83%). The F-score of Naive Bayes was the lowest (64%), and SMO had the second lowest (78%). In the cross-project setting, AdaBoost's F-score (76%) was much higher than the F-scores of the other algorithms. Random Forest obtained the second highest F-score (61%), C4.5 the third highest (58%), while SMO and Naive Bayes had the lowest F-score (53%). Between the two settings, AdaBoost reliably performed well, while Naive Bayes was always the worst ML algorithm for clone refactoring recommendation.
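All five learners are available off the shelf in Weka, so the comparison amounts to swapping the classifier object in the training sketch shown earlier. The class names below are Weka's; the surrounding harness is again only illustrative:

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class ModelZoo {
    /** The five supervised learners compared in Section III-F, with default parameters. */
    static Classifier[] candidates() {
        return new Classifier[] {
            new AdaBoostM1(),    // CREC's default (boosted decision stumps)
            new RandomForest(),
            new J48(),           // Weka's C4.5 implementation
            new SMO(),
            new NaiveBayes()
        };
    }
}
```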

[Fig. 4: CREC's average precision, recall, and F-score when using different ML algorithms (AdaBoost, Random Forest, C4.5, SMO, Naive Bayes) in the within-project and cross-project settings.]

AdaBoost, Random Forest, and C4.5 are all tree-based ML algorithms, while SMO and Naive Bayes are not. The results in Fig. 4 imply that tree-based ML algorithms are more suitable for clone recommendation than the others.

Finding 4: AdaBoost reliably suggests clones for refactoring with high accuracy in both the within-project and cross-project settings, while Naive Bayes consistently worked poorly. Among the five experimented ML algorithms, tree-based algorithms stably worked better than the other algorithms.

IV. RELATED WORK

The related work includes clone evolution analysis, empirical studies on clone removals, clone recommendation for refactoring, and automatic clone removal refactoring.

A. Clone Evolution Analysis

Kim et al. initiated clone genealogy analysis and built a clone genealogy extractor [27]. They observed that extensive refactorings of short-lived clones may not be worthwhile, while some long-lived clones are not easily refactorable. Other researchers conducted similar clone analyses and observed controversial or complementary findings [4], [10], [15], [30], [47]. For instance, Aversano et al. found that most co-changed clones were consistently maintained [4], while Krinke observed that half of the changes to clones were inconsistent [30]. Göde et al. discovered that 88% of clones were never changed or changed only once during their lifetime, and only 15% of all changes to clones were unintentionally inconsistent [15]. Thummalapenta et al. automatically classified clone evolution into three patterns: consistent evolution, independent evolution, and late propagation [47]. Cai et al. analyzed long-lived clones, and reported that the number of developers who modified clones and the time since the last addition or removal of a clone to its group were highly correlated with the survival time of clones [10]. These studies showed that not every detected clone should be refactored, which motivated our research to recommend only the clones likely to be refactored.

B. Empirical Studies on Clone Removals

Various studies have been conducted to understand how developers remove clones [14]. Specifically, Göde observed that developers deliberately removed clones with method extraction refactorings, and the removed clones were mostly co-located in the same file [14]. Bazrafshan et al. extended Göde's study, and found that accidental removals of code clones occurred slightly more often than deliberate removals [8]. Similarly, Choi et al. also observed that most clones were removed with either Extract Method or Replace Method with Method refactorings. Silva et al. automatically detected applied refactorings in open source projects on GitHub, and sent questionnaires to developers to ask about the motivation behind their refactoring practice [45]. They found that refactoring activities were mainly driven by changes in the requirements, and that Extract Method was the most versatile refactoring operation, serving 11 different purposes. None of these studies automatically recommends clones for refactoring or ranks the recommendations.

C.
Clone Recommendation for Refactoring Based on code clones detected by various techniques [23], [26], [29], a variety of tools identify or rank refactoring opportunities [5], [7], [16], [18], [19], [48]. For instance, Balazinska et al. defined a clone classification scheme based on various types of differences among clones, and automated the classification to help developers assess refactoring opportunities for each clone group [5]. Balazinska et al. later built another tool that facilitated developers refactoring decisions by presenting the syntactic differences and common contextual dependencies among clones [7]. Higo et al. and Goto et al. ranked clones based on the coupling or cohesion metrics [16], [19]. Tsantalis et al. and Higo et al. further improved the ranking mechanism by prioritizing clones that had been repetitively changed in the past [18], [48]. Wang et al. extracted 15 features to characterize the content and history of individual clones, and the context and difference relationship among clones [51]. Replacing the feature set with CREC s boost the overall effectiveness of Wang et al. s approach. More importantly, our research shows for the first time that history-based features are more important than smell-

10 based features (i.e., clone content, location, or differences) when suggesting relevant clones for refactoring. D. Automatic Clone Removal Refactoring A number of techniques automatically remove clones by applying refactorings [6], [9], [17], [21], [25], [31], [36], [46], [49], [50]. These tools mainly extract a method by factorizing the common part and parameterizing any differences among clones. For example, Tairas et al. built an Eclipse plugin called CeDAR, which unifies clone detection, analysis and refactoring [46]. Juillerat et al. and Balazinska et al. defined extra mechanisms to mask any difference between clones [6], [25]. Several researchers conducted program dependency analysis to further automate Extract Method refactorings for near-miss clones clones that contain divergent program syntactic structures or different statements [9], [21], [31], [31]. Specifically, Hotta et al. safely moved out the distinct statements standing between duplicated code to uniformly extract common code [21]. Krishnan et al. identified the maximum common subgraph between two clones Program Dependency Graphs (PDGs) to minimally introduce parameters to the extracted method [31]. Bian et al. created extra control branches to specially handle distinct statements [9]. Meng et al. detected consistently updated clones in history and automated refactorings to remove duplicated code and to reduce repetitive coding effort [36]. Nguyen et al. leveraged historical software changes to provide API recommendations for developers [38]. Tsantalis et al. automatically detected any differences between clones, and further assessed whether it was safe to parameterize those differences without changing program behaviors [49]. Tsantalis et al. later built another tool that utilized lambda expressions to refactor clones [50]. As mentioned in prior work [36], the above-mentioned research mainly focused on the automated refactoring feasbility instead of desirablity. In contrast, our research investigates the refactoring desirability. By extracting a variety of features that reflect the potential cost and benefits of refactorings to apply, we rely on machine learning to model the implicit complex interaction between features based on clones already refactored or not refactored by developers. In this way, the trained classifier can simulate the human mental process of evaluating refactoring desirability, and further suggest clones that are likely to be refactored by developers. V. THREATS TO VALIDITY We evaluated our approach using the clone data mined from six mature open source projects that have long version history. Our observations may not generalize well to other projects, but the trained classifier can be applied to other projects repositories. We did not manually inspect the non-refactored clones in the testing set, meaning that there may exist some false nonrefactored clones. Please note that the false non-refactored clones in the testing set can only be caused by the true refactored clones that were missed by our automatic extraction process. However, according to existing study [51], only a very small portion of clone groups (less than 1%) are actually refactored by developers. Therefore, the portion of false nonrefactored clone groups (i.e., missing true refactored clones) in the testing set cannot exceed 1%, which would have only marginal effect on our results. By default, CREC recommends clones for refactoring when the trained classifier predicts a refactoring likelihood greater than or equal to 50%. 
If a given project s version history does not have any commit that changes any clone, it is probable that none of our history-based features are helpful. Consequently, the predicted likelihood can be less than 50%, and CREC does not recommend any clone for refactoring. To at least acquire a ranked list of clones for refactoring in such scenarios, users can tune our likelihood threshold parameter to a smaller value, such as 30%. VI. CONCLUSION This paper presents CREC, a learning-based approach that suggests clones for Extract Method refactoring, no matter whether the clones are located in the same or different files. CREC is inspired by prior work, but improves over the stateof-the-art approach by predicting refactorings based on desirability instead of refactorabililty. Although it is challenging to mimic developers mental process of refactoring decisions, CREC (1) conducts static analysis to characterize the present status and past evolution of software, and (2) nicely combines static analysis with machine learning by training a classifier with the analyzed characteristics. Our evaluation evidences the effectiveness of our approach. More importantly, we observed that history-based features work more effectively than those features extracted from the current version of software. It indicates a very interesting insight: when refactoring software, developers consider more on the past evolution than the current version. By inspecting the scenarios where CREC cannot predict well, we found that the refactoring decisions seemed to be based on new software requirements or feature additions. In the future, we plan to propose approaches to better handle such scenarios. We also plan to conduct user studies with developers to obtain their opinions on the suggested clones for refactoring. Compared with prior work, another significant improvement CREC achieves is its automatic extraction of refactored clones from version history. With our tool and data publicly available ( other researchers can access the data, reuse CREC to collect more data as needed, and further investigate refactoring recommendation and automation. ACKNOWLEDGMENT We thank anonymous reviewers for their thorough and valuable comments on our earlier version of the paper. This work is sponsored by the National Key Research and Development Program under Grant No. 2017YFB , the National Natural Science Foundation of China under Grant No and , NSF Grant No. CCF , ONR Grant No. N , NSF Grant No. CNS and DHS Grant No. DHS-14-ST

Welfare Navigation Using Genetic Algorithm

Welfare Navigation Using Genetic Algorithm Welfare Navigation Using Genetic Algorithm David Erukhimovich and Yoel Zeldes Hebrew University of Jerusalem AI course final project Abstract Using standard navigation algorithms and applications (such

More information

Allstate Insurance Claims Severity: A Machine Learning Approach

Allstate Insurance Claims Severity: A Machine Learning Approach Allstate Insurance Claims Severity: A Machine Learning Approach Rajeeva Gaur SUNet ID: rajeevag Jeff Pickelman SUNet ID: pattern Hongyi Wang SUNet ID: hongyiw I. INTRODUCTION The insurance industry has

More information

Searching for Configurations in Clone Evaluation A Replication Study

Searching for Configurations in Clone Evaluation A Replication Study Searching for Configurations in Clone Evaluation A Replication Study Chaiyong Ragkhitwetsagul 1, Matheus Paixao 1, Manal Adham 1 Saheed Busari 1, Jens Krinke 1 and John H. Drake 2 1 University College

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies

An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies Ripon K. Saha Chanchal K. Roy Kevin A. Schneider Department of Computer Science, University of Saskatchewan, Canada {ripon.saha,

More information

Quantifying and Assessing the Merge of Cloned Web-Based System: An Exploratory Study

Quantifying and Assessing the Merge of Cloned Web-Based System: An Exploratory Study Quantifying and Assessing the Merge of Cloned Web-Based System: An Exploratory Study Jadson Santos Department of Informatics and Applied Mathematics Federal University of Rio Grande do Norte, UFRN Natal,

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on to remove this watermark.

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on   to remove this watermark. 119 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 120 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 5.1. INTRODUCTION Association rule mining, one of the most important and well researched

More information

Empirical Study on Impact of Developer Collaboration on Source Code

Empirical Study on Impact of Developer Collaboration on Source Code Empirical Study on Impact of Developer Collaboration on Source Code Akshay Chopra University of Waterloo Waterloo, Ontario a22chopr@uwaterloo.ca Parul Verma University of Waterloo Waterloo, Ontario p7verma@uwaterloo.ca

More information

Identifying Important Communications

Identifying Important Communications Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our

More information

code pattern analysis of object-oriented programming languages

code pattern analysis of object-oriented programming languages code pattern analysis of object-oriented programming languages by Xubo Miao A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science Queen s

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

QTEP: Quality-Aware Test Case Prioritization

QTEP: Quality-Aware Test Case Prioritization QTEP: Quality-Aware Test Case Prioritization {song.wang,jc.nam,lintan}@uwaterloo.ca Electrical and Computer Engineering, University of Waterloo Waterloo, ON, Canada ABSTRACT Test case prioritization (TCP)

More information

Ensemble Learning. Another approach is to leverage the algorithms we have via ensemble methods

Ensemble Learning. Another approach is to leverage the algorithms we have via ensemble methods Ensemble Learning Ensemble Learning So far we have seen learning algorithms that take a training set and output a classifier What if we want more accuracy than current algorithms afford? Develop new learning

More information

Lecture 25 Clone Detection CCFinder. EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim

Lecture 25 Clone Detection CCFinder. EE 382V Spring 2009 Software Evolution - Instructor Miryung Kim Lecture 25 Clone Detection CCFinder Today s Agenda (1) Recap of Polymetric Views Class Presentation Suchitra (advocate) Reza (skeptic) Today s Agenda (2) CCFinder, Kamiya et al. TSE 2002 Recap of Polymetric

More information

Intrusion Detection Using Data Mining Technique (Classification)

Intrusion Detection Using Data Mining Technique (Classification) Intrusion Detection Using Data Mining Technique (Classification) Dr.D.Aruna Kumari Phd 1 N.Tejeswani 2 G.Sravani 3 R.Phani Krishna 4 1 Associative professor, K L University,Guntur(dt), 2 B.Tech(1V/1V),ECM,

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Noise-based Feature Perturbation as a Selection Method for Microarray Data Noise-based Feature Perturbation as a Selection Method for Microarray Data Li Chen 1, Dmitry B. Goldgof 1, Lawrence O. Hall 1, and Steven A. Eschrich 2 1 Department of Computer Science and Engineering

More information

3 Virtual attribute subsetting

3 Virtual attribute subsetting 3 Virtual attribute subsetting Portions of this chapter were previously presented at the 19 th Australian Joint Conference on Artificial Intelligence (Horton et al., 2006). Virtual attribute subsetting

More information

Graph Structure Over Time

Graph Structure Over Time Graph Structure Over Time Observing how time alters the structure of the IEEE data set Priti Kumar Computer Science Rensselaer Polytechnic Institute Troy, NY Kumarp3@rpi.edu Abstract This paper examines

More information

Creating a Classifier for a Focused Web Crawler

Creating a Classifier for a Focused Web Crawler Creating a Classifier for a Focused Web Crawler Nathan Moeller December 16, 2015 1 Abstract With the increasing size of the web, it can be hard to find high quality content with traditional search engines.

More information

A Virtual Laboratory for Study of Algorithms

A Virtual Laboratory for Study of Algorithms A Virtual Laboratory for Study of Algorithms Thomas E. O'Neil and Scott Kerlin Computer Science Department University of North Dakota Grand Forks, ND 58202-9015 oneil@cs.und.edu Abstract Empirical studies

More information

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price.

The goal of this project is to enhance the identification of code duplication which can result in high cost reductions for a minimal price. Code Duplication New Proposal Dolores Zage, Wayne Zage Ball State University June 1, 2017 July 31, 2018 Long Term Goals The goal of this project is to enhance the identification of code duplication which

More information

Ranking Algorithms For Digital Forensic String Search Hits

Ranking Algorithms For Digital Forensic String Search Hits DIGITAL FORENSIC RESEARCH CONFERENCE Ranking Algorithms For Digital Forensic String Search Hits By Nicole Beebe and Lishu Liu Presented At The Digital Forensic Research Conference DFRWS 2014 USA Denver,

More information

ARiSA First Contact Analysis

ARiSA First Contact Analysis ARiSA First Contact Analysis Applied Research In System Analysis - ARiSA You cannot control what you cannot measure Tom DeMarco Software Grail Version 1.3 Programming Language Java 1.4 Date 2 nd of December,

More information

A Model of Machine Learning Based on User Preference of Attributes

A Model of Machine Learning Based on User Preference of Attributes 1 A Model of Machine Learning Based on User Preference of Attributes Yiyu Yao 1, Yan Zhao 1, Jue Wang 2 and Suqing Han 2 1 Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500

More information

How Often and What StackOverflow Posts Do Developers Reference in Their GitHub Projects?

How Often and What StackOverflow Posts Do Developers Reference in Their GitHub Projects? How Often and What StackOverflow Posts Do Developers Reference in Their GitHub Projects? Saraj Singh Manes School of Computer Science Carleton University Ottawa, Canada sarajmanes@cmail.carleton.ca Olga

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Filtering Bug Reports for Fix-Time Analysis

Filtering Bug Reports for Fix-Time Analysis Filtering Bug Reports for Fix-Time Analysis Ahmed Lamkanfi, Serge Demeyer LORE - Lab On Reengineering University of Antwerp, Belgium Abstract Several studies have experimented with data mining algorithms

More information

Semantics, Metadata and Identifying Master Data

Semantics, Metadata and Identifying Master Data Semantics, Metadata and Identifying Master Data A DataFlux White Paper Prepared by: David Loshin, President, Knowledge Integrity, Inc. Once you have determined that your organization can achieve the benefits

More information

Evolutionary Linkage Creation between Information Sources in P2P Networks

Evolutionary Linkage Creation between Information Sources in P2P Networks Noname manuscript No. (will be inserted by the editor) Evolutionary Linkage Creation between Information Sources in P2P Networks Kei Ohnishi Mario Köppen Kaori Yoshida Received: date / Accepted: date Abstract

More information

Chapter 8 The C 4.5*stat algorithm

Chapter 8 The C 4.5*stat algorithm 109 The C 4.5*stat algorithm This chapter explains a new algorithm namely C 4.5*stat for numeric data sets. It is a variant of the C 4.5 algorithm and it uses variance instead of information gain for the

More information

Classifying Bug Reports to Bugs and Other Requests Using Topic Modeling

Classifying Bug Reports to Bugs and Other Requests Using Topic Modeling Classifying Bug Reports to Bugs and Other Requests Using Topic Modeling Natthakul Pingclasai Department of Computer Engineering Kasetsart University Bangkok, Thailand Email: b5310547207@ku.ac.th Hideaki

More information

Evaluation of Business Rules Maintenance in Enterprise Information Systems

Evaluation of Business Rules Maintenance in Enterprise Information Systems POSTER 2015, PRAGUE MAY 14 1 Evaluation of Business Rules Maintenance in Enterprise Information Systems Karel CEMUS Dept. of Computer Science, Czech Technical University, Technická 2, 166 27 Praha, Czech

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before

More information

CSI5387: Data Mining Project

CSI5387: Data Mining Project CSI5387: Data Mining Project Terri Oda April 14, 2008 1 Introduction Web pages have become more like applications that documents. Not only do they provide dynamic content, they also allow users to play

More information

Categorizing Migrations

Categorizing Migrations What to Migrate? Categorizing Migrations A version control repository contains two distinct types of data. The first type of data is the actual content of the directories and files themselves which are

More information

Supplementary Materials for DVQA: Understanding Data Visualizations via Question Answering

Supplementary Materials for DVQA: Understanding Data Visualizations via Question Answering Supplementary Materials for DVQA: Understanding Data Visualizations via Question Answering Kushal Kafle 1, Brian Price 2 Scott Cohen 2 Christopher Kanan 1 1 Rochester Institute of Technology 2 Adobe Research

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

Songtao Hei. Thesis under the direction of Professor Christopher Jules White

Songtao Hei. Thesis under the direction of Professor Christopher Jules White COMPUTER SCIENCE A Decision Tree Based Approach to Filter Candidates for Software Engineering Jobs Using GitHub Data Songtao Hei Thesis under the direction of Professor Christopher Jules White A challenge

More information

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction International Journal of Computer Trends and Technology (IJCTT) volume 7 number 3 Jan 2014 Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction A. Shanthini 1,

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

Creating a Lattix Dependency Model The Process

Creating a Lattix Dependency Model The Process Creating a Lattix Dependency Model The Process Whitepaper January 2005 Copyright 2005-7 Lattix, Inc. All rights reserved The Lattix Dependency Model The Lattix LDM solution employs a unique and powerful

More information

Oracle9i Data Mining. Data Sheet August 2002

Oracle9i Data Mining. Data Sheet August 2002 Oracle9i Data Mining Data Sheet August 2002 Oracle9i Data Mining enables companies to build integrated business intelligence applications. Using data mining functionality embedded in the Oracle9i Database,

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

An Empirical Study on Clone Stability

An Empirical Study on Clone Stability An Empirical Study on Clone Stability Manishankar Mondal Chanchal K. Roy Kevin A. Schneider Department of Computer Science, University of Saskatchewan, Canada {mshankar.mondal, chanchal.roy, kevin.schneider}@usask.ca

More information

Segmenting Lesions in Multiple Sclerosis Patients James Chen, Jason Su

Segmenting Lesions in Multiple Sclerosis Patients James Chen, Jason Su Segmenting Lesions in Multiple Sclerosis Patients James Chen, Jason Su Radiologists and researchers spend countless hours tediously segmenting white matter lesions to diagnose and study brain diseases.

More information

F. Aiolli - Sistemi Informativi 2006/2007

F. Aiolli - Sistemi Informativi 2006/2007 Text Categorization Text categorization (TC - aka text classification) is the task of buiding text classifiers, i.e. sofware systems that classify documents from a domain D into a given, fixed set C =

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Cross Language Higher Level Clone Detection- Between Two Different Object Oriented Programming Language Source Codes

Cross Language Higher Level Clone Detection- Between Two Different Object Oriented Programming Language Source Codes Cross Language Higher Level Clone Detection- Between Two Different Object Oriented Programming Language Source Codes 1 K. Vidhya, 2 N. Sumathi, 3 D. Ramya, 1, 2 Assistant Professor 3 PG Student, Dept.

More information

DIAL: A Distributed Adaptive-Learning Routing Method in VDTNs

DIAL: A Distributed Adaptive-Learning Routing Method in VDTNs : A Distributed Adaptive-Learning Routing Method in VDTNs Bo Wu, Haiying Shen and Kang Chen Department of Electrical and Computer Engineering Clemson University, Clemson, South Carolina 29634 {bwu2, shenh,

More information

Random Search Report An objective look at random search performance for 4 problem sets

Random Search Report An objective look at random search performance for 4 problem sets Random Search Report An objective look at random search performance for 4 problem sets Dudon Wai Georgia Institute of Technology CS 7641: Machine Learning Atlanta, GA dwai3@gatech.edu Abstract: This report

More information

Better Bioinformatics Through Usability Analysis

Better Bioinformatics Through Usability Analysis Better Bioinformatics Through Usability Analysis Supplementary Information Davide Bolchini, Anthony Finkelstein, Vito Perrone and Sylvia Nagl Contacts: davide.bolchini@lu.unisi.ch Abstract With this supplementary

More information

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

Supplementary text S6 Comparison studies on simulated data

Supplementary text S6 Comparison studies on simulated data Supplementary text S Comparison studies on simulated data Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath Corresponding author: shorvath@mednet.ucla.edu Overview In this document we illustrate

More information

Plagiarism detection for Java: a tool comparison

Plagiarism detection for Java: a tool comparison Plagiarism detection for Java: a tool comparison Jurriaan Hage e-mail: jur@cs.uu.nl homepage: http://www.cs.uu.nl/people/jur/ Joint work with Peter Rademaker and Nikè van Vugt. Department of Information

More information

A Comparison of Error Metrics for Learning Model Parameters in Bayesian Knowledge Tracing

A Comparison of Error Metrics for Learning Model Parameters in Bayesian Knowledge Tracing A Comparison of Error Metrics for Learning Model Parameters in Bayesian Knowledge Tracing Asif Dhanani Seung Yeon Lee Phitchaya Phothilimthana Zachary Pardos Electrical Engineering and Computer Sciences

More information

PRC Coordination of Protection Systems for Performance During Faults

PRC Coordination of Protection Systems for Performance During Faults PRC-027-1 Coordination of Protection Systems for Performance During Faults A. Introduction 1. Title: Coordination of Protection Systems for Performance During Faults 2. Number: PRC-027-1 3. Purpose: To

More information

ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools

ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools Jeffrey Svajlenko Chanchal K. Roy University of Saskatchewan, Canada {jeff.svajlenko, chanchal.roy}@usask.ca Slawomir

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Characterizing Smartphone Usage Patterns from Millions of Android Users

Characterizing Smartphone Usage Patterns from Millions of Android Users Characterizing Smartphone Usage Patterns from Millions of Android Users Huoran Li, Xuan Lu, Xuanzhe Liu Peking University Tao Xie UIUC Kaigui Bian Peking University Felix Xiaozhu Lin Purdue University

More information

DRVerify: The Verification of Physical Verification

DRVerify: The Verification of Physical Verification DRVerify: The Verification of Physical Verification Sage Design Automation, Inc. Santa Clara, California, USA Who checks the checker? DRC (design rule check) is the most fundamental physical verification

More information

Combining PGMs and Discriminative Models for Upper Body Pose Detection

Combining PGMs and Discriminative Models for Upper Body Pose Detection Combining PGMs and Discriminative Models for Upper Body Pose Detection Gedas Bertasius May 30, 2014 1 Introduction In this project, I utilized probabilistic graphical models together with discriminative

More information

Using Machine Learning to Identify Security Issues in Open-Source Libraries. Asankhaya Sharma Yaqin Zhou SourceClear

Using Machine Learning to Identify Security Issues in Open-Source Libraries. Asankhaya Sharma Yaqin Zhou SourceClear Using Machine Learning to Identify Security Issues in Open-Source Libraries Asankhaya Sharma Yaqin Zhou SourceClear Outline - Overview of problem space Unidentified security issues How Machine Learning

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Evaluating the Usefulness of Sentiment Information for Focused Crawlers Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,

More information

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science 310 Million + Current Domain Names 11 Billion+ Historical Domain Profiles 5 Million+ New Domain Profiles Daily

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Deep Learning With Noise

Deep Learning With Noise Deep Learning With Noise Yixin Luo Computer Science Department Carnegie Mellon University yixinluo@cs.cmu.edu Fan Yang Department of Mathematical Sciences Carnegie Mellon University fanyang1@andrew.cmu.edu

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Automated Tagging for Online Q&A Forums

Automated Tagging for Online Q&A Forums 1 Automated Tagging for Online Q&A Forums Rajat Sharma, Nitin Kalra, Gautam Nagpal University of California, San Diego, La Jolla, CA 92093, USA {ras043, nikalra, gnagpal}@ucsd.edu Abstract Hashtags created

More information

Columbia University. Electrical Engineering Department. Fall 1999

Columbia University. Electrical Engineering Department. Fall 1999 Columbia University Electrical Engineering Department Fall 1999 Report of the Project: Knowledge Based Semantic Segmentation Using Evolutionary Programming Professor: Shih-Fu Chang Student: Manuel J. Reyes.

More information

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs Managing Open Bug Repositories through Bug Report Prioritization Using SVMs Jaweria Kanwal Quaid-i-Azam University, Islamabad kjaweria09@yahoo.com Onaiza Maqbool Quaid-i-Azam University, Islamabad onaiza@qau.edu.pk

More information

Integration With the Business Modeler

Integration With the Business Modeler Decision Framework, J. Duggan Research Note 11 September 2003 Evaluating OOA&D Functionality Criteria Looking at nine criteria will help you evaluate the functionality of object-oriented analysis and design

More information

How We Refactor, and How We Know It

How We Refactor, and How We Know It Emerson Murphy-Hill, Chris Parnin, Andrew P. Black How We Refactor, and How We Know It Urs Fässler 30.03.2010 Urs Fässler () How We Refactor, and How We Know It 30.03.2010 1 / 14 Refactoring Definition

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information