Towards scaling up induction of second-order decision tables

R. Hewett and J. Leuchner
Institute for Human and Machine Cognition, University of West Florida, USA

Abstract

One of the fundamental challenges for data mining is to enable inductive learning algorithms to operate on very large databases. Ensemble learning techniques such as bagging have been applied successfully to improve the accuracy of classification models by generating multiple models, from replicate training sets, and aggregating them to form a composite model. In this paper, we adapt the bagging approach for scaling up and also study the effects of data partitioning, sampling, and aggregation techniques for mining very large databases. Our recent work developed SORCER, a learning system that induces a near minimal rule set from a data set represented as a second-order decision table (a database relation in which rows have sets of atomic values as components). Despite its simplicity, experiments show that SORCER is competitive with other, state-of-the-art induction systems. Here we apply SORCER using two instance subset selection procedures (random partitioning and sampling with replacement) and two aggregation procedures (majority voting and selecting the model that performs best on a validation set). We experiment with the GIS data set, from the UCI KDD Repository, which contains 581,012 instances of 30x30 meter cells with 54 attributes for classifying forest cover types. Performance results are reported, including results from mining the entire training data set using different compression algorithms in SORCER and published results from neural net and decision tree learners.

1 Introduction

The development of inductive learning algorithms that scale up to very large data sets is a fundamental problem in data mining applications. Scalability raises the issue of whether an algorithm can be efficient while building the best possible model from a very large data set.

To machine learning researchers, very large usually means a data set containing at least 100,000 examples and 25 problem variables. For the KDD (knowledge discovery and data mining) community, data sizes of 100 megabytes (or about one million examples) are considered very large [8]. Although very large data sets can be dealt with by sampling, larger training sets often produce more accurate models, especially with noisy data or data sets with many special cases [11]. Efficiency and accuracy are commonly used for evaluating the effectiveness of scaling up techniques, particularly for classification algorithms. However, data mining also recognizes the importance of the ease with which the resulting models can be interpreted. In fact, it is not uncommon to run a state-of-the-art algorithm over a large data set for several hours and then discard much of the output in order to obtain less accurate but more comprehensible results [10].

Our data mining research into the use of comprehensible models for abstraction of regularities from data has produced SORCER (Second-Order Relation Compression for Extraction of Rules) [7], a learning system that induces classification rules from data sets represented as second-order decision tables. Based on the theoretical framework presented in [9], second-order decision tables are database relations in which tuples (rows) have sets of atomic values as components (entries). Using sets of values, interpreted as disjunctions, provides compact representations that facilitate efficient management and enhance comprehensibility. SORCER's induction algorithm can be viewed as decision table compression, in which a table representing training data is transformed into a shorter table of more general rules by merging rows in ways that preserve consistency with the original data. SORCER attempts to generate classifiers with a minimum number of rows; this bias toward fewer rows further facilitates comprehensibility. Despite its simplicity, experiments show that SORCER is competitive with popular state-of-the-art systems [7].

Ensemble learning techniques such as bagging [3] have been applied successfully to improve the accuracy of classification by generating multiple models, from replicate training sets, and aggregating them to form a composite model. However, the resulting composite models can be quite large and complex. In this paper, we adapt the bagging approach for scaling up SORCER and study the effects of data partitioning, sampling, and aggregation techniques for mining very large databases. We choose bagging over boosting (another ensemble learning method) because bagging can be implemented to process subsets concurrently, thus increasing efficiency. We apply SORCER using two instance subset selection procedures (random partitioning and sampling with replacement) and two aggregation procedures (majority voting and selecting the model that performs best on a validation set). Unlike other bagging-like approaches, here a composite model, represented by second-order decision tables in SORCER, can be compressed into a single shorter table. Reducing the size of the model is one way to improve comprehensibility. We describe our experimentation with GIS (Geographic Information System) data, obtained from the UCI KDD Repository [1], which contains 581,012 instances of 30x30 meter cells with 54 attributes for classifying forest cover types.

Our experiments use a version of SORCER whose code has not been optimized for efficiency. Our objective is to investigate data partitioning techniques for scaling up second-order decision table induction; this paper reports on that preliminary effort. For completeness, we give a brief overview of SORCER in Section 2. Section 3 describes our methodology. Experiments and results are given in Section 4. Section 5 discusses related work and conclusions.

2 Second-order decision table induction system, SORCER

2.1 Definitions and terminology

We use the terms table (relation) and row (tuple or rule) to refer to the second-order structures which we now define. Rows are mappings defined on a set of attributes (problem variables) such that the image of an attribute A, denoted r(A), is a subset of A's domain (the values which it may assume). A table is a set of rows. The scheme of a row or table is the set of attributes on which it is defined. The partial ordering "covers" on the set of all rows (over a fixed scheme) is component-wise set inclusion, i.e., row s is covered by row r if s(A) ⊆ r(A) for each attribute A. The meet and join of a pair of rules are their component-wise intersection and union, respectively. Flat rows are those whose components are either singletons or empty. (Empty components represent missing information, i.e., unknown values.) The flat extension of table R is the table consisting of all flat rows covered by at least one row in R. A table S is said to subsume relation R if the flat extension of R is a subset of the flat extension of S. Two relations are equivalent if each subsumes the other. A transformation that transforms table R into table S is equivalence-preserving if R is equivalent to S.

A decision table represents a function assigning classifications to conditions and has a scheme consisting of condition attributes and a classification attribute. The classification of a condition c (a row whose classification entry is empty) by decision table T, denoted T(c), is the union of the classifications of all rows of T that cover the condition. A simple condition is a condition with singleton values for all condition attributes. A decision table is consistent if it associates at most one classification with any simple condition. A decision table is complete if it classifies (gives a nonempty value to) all simple conditions. The transformation of a table R into a table S is consistency-preserving if (1) every simple condition classified by R is given the same classification(s) by S, and (2) for any simple condition c not classified by R, |S(c)| ≤ 1.

2.2 Basic algorithm

The basic induction algorithm in Figure 1 starts with a flat table of training data and, by repeated transformation, produces a general table (covering more conditions) subsuming the original. At each step, the table is an approximation to the unknown target function. The transformations correspond to a search, through a hypothesis space of second-order tables, for a suitable approximating function.
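To make the definitions in Section 2.1 concrete, the sketch below (our illustration, not part of SORCER) represents rows as Python mappings from attributes to value sets and implements covers, join, and classification by a table; the attribute names and values are hypothetical.

    # Illustrative sketch only: rows map attributes to sets of values (disjunctions);
    # an empty set would denote missing/unknown information.

    def covers(r, s, attributes):
        """Row s is covered by row r if s(A) is a subset of r(A) for every attribute A."""
        return all(s[a] <= r[a] for a in attributes)

    def join(r, s, attributes):
        """Component-wise union of a pair of rows."""
        return {a: r[a] | s[a] for a in attributes}

    def classify(table, condition, cond_attrs, class_attr):
        """T(c): the union of the classifications of all rows whose condition part covers c."""
        result = set()
        for row in table:
            if all(condition[a] <= row[a] for a in cond_attrs):
                result |= row[class_attr]
        return result

    # Hypothetical toy table with two condition attributes and one classification attribute.
    cond_attrs = ["Slope", "Soil"]
    table = [
        {"Slope": {"low", "medium"}, "Soil": {"sandy"}, "Cover": {"pine"}},
        {"Slope": {"high"}, "Soil": {"rocky"}, "Cover": {"spruce"}},
    ]
    simple_condition = {"Slope": {"low"}, "Soil": {"sandy"}}
    print(covers(table[0], simple_condition, cond_attrs))          # True
    print(classify(table, simple_condition, cond_attrs, "Cover"))  # {'pine'}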

Input: a decision table T.
Output: a decision table R such that R is consistent with T and the size of R is minimal or near minimal within cost constraints.
(1) Apply equivalence-preserving transformations, guided by heuristics, subject to cost constraints.
(2) Infer additional rules or additional attribute values for components of individual rules.
(3) Repeat Steps (1) and (2) until neither changes the relation.
(4) Apply consistency-preserving transformations, guided by heuristics, subject to cost constraints.
(5) Go to Step (1). Stop when no further transformation has occurred within the cost constraints.

Figure 1: Basic induction.

Equivalence-preserving transformations may include delete redundant rules (remove rows subsumed by other rows of the table) and merge joinable (replace a pair of rows agreeing on all attributes except one by their join). An example of a consistency-preserving transformation is merge consistent: merge a pair of rules whose join does not introduce inconsistency. Such a pair is said to be consistently joinable, and their merge may add new conditions, generalizing the table, without creating inconsistency. SORCER provides another type of transformation, inclusion of statistically determined rules. An example is add high probability rows (p), where p specifies a minimum accuracy. Currently, SORCER only considers rules with one condition attribute. For example, if p = 0.90, the rule (A = a) => (Class = 0) is added to the table if (Class = 0) holds for at least 90% of the training examples in which (A = a). Statistically determined rules may fail to preserve consistency, and, currently, SORCER only applies them to flat tables.

Conceptually, equivalence-preserving transformations can be used for data compaction and to identify meaningful clusters of values, both of which aid comprehensibility. Consistency-preserving transformations can generalize a table to cover more conditions, which may also simplify classification rules. Inclusion of statistically determined rules allows the creation of simple rules (i.e., based on fewer condition attributes) with a specified level of accuracy. The time complexity of SORCER's induction depends on the transformations applied. For example, merge joinable is O(kn^2) and merging consistently joinable pairs to a fixed point is O(kn^3), where n is the table length and k is the sum of the attributes' domain sizes. Details are in [7]. Since many decision problems involving second-order tables (e.g., determining whether a table covers a row) are NP-hard [7], resource constraints (e.g., a bound on the number of iterations) may be applied to operations likely to be prohibitively expensive. Heuristics based on domain knowledge, such as ranking attributes by discriminatory power, could help select appropriate operations or rows. The rule set produced by the algorithm may not be complete. For conditions not covered by the model, a rule is selected heuristically to provide a classification. The heuristics include a preference for rules that (1) cover the query on more attributes, (2) cover fewer conditions, and (3) give the most common classification appearing in the table. More details of SORCER are in [7].
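As a rough illustration of the two merge transformations (not SORCER's actual code), the sketch below joins a pair of rows that differ on at most one attribute (merge joinable) and joins a pair only when the result keeps a single class and conflicts with no other row of the current table, a simplification of merge consistent, which SORCER checks against the original training data. Attribute names are hypothetical.

    # Rough sketch of the merge transformations (illustrative only, not SORCER's code).
    # A row is a dict mapping attributes to value sets; "Class" is the classification attribute.

    COND_ATTRS = ["Slope", "Soil"]   # hypothetical condition attributes
    CLASS = "Class"
    ATTRS = COND_ATTRS + [CLASS]

    def join(r, s):
        """Component-wise union of two rows."""
        return {a: r[a] | s[a] for a in ATTRS}

    def joinable(r, s, _table):
        """Equivalence-preserving merge: the pair differs on at most one attribute."""
        return sum(r[a] != s[a] for a in ATTRS) <= 1

    def overlaps(r, s):
        """Two rows cover a common simple condition iff every condition component intersects."""
        return all(r[a] & s[a] for a in COND_ATTRS)

    def consistently_joinable(r, s, table):
        """Simplified merge-consistent test: the join keeps a single class and conflicts
        with no other row of the current table."""
        if r[CLASS] != s[CLASS] or len(r[CLASS]) != 1:
            return False
        j = join(r, s)
        return all(t[CLASS] == j[CLASS] or not overlaps(j, t)
                   for t in table if t is not r and t is not s)

    def merge_to_fixed_point(table, can_merge):
        """Repeatedly replace one mergeable pair of rows by its join until none remains."""
        changed = True
        while changed:
            changed = False
            for i in range(len(table)):
                for k in range(i + 1, len(table)):
                    r, s = table[i], table[k]
                    if can_merge(r, s, table):
                        table = [t for t in table if t is not r and t is not s] + [join(r, s)]
                        changed = True
                        break
                if changed:
                    break
        return table

    # e.g., compress a flat table by merging joinable pairs first, then consistently
    # joinable pairs (the order used by algorithm A2 in Section 3.2):
    # compressed = merge_to_fixed_point(merge_to_fixed_point(flat_table, joinable),
    #                                   consistently_joinable)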

3 Methodology

The issue in scaling up is often not speed, per se, but the size of the data set that can be handled. Scaling up learning algorithms involves finding techniques to make impractical algorithms practical. Many approaches have been proposed for scaling up inductive algorithms, including designing fast algorithms and data partitioning. The first approach either develops efficient algorithms or increases the efficiency of existing algorithms. The data partitioning approach uses a divide-and-conquer strategy to deal with huge data sets, applying the algorithm to one or more subsets of the data and possibly combining the results. Consequently, an algorithm with time complexity worse than linear in the number of examples may be made linear, with the constant term dependent on the size of the subsets [5]. A survey by Provost and Kolluri [11] provides a comprehensive description of a variety of scaling up techniques. For this paper, we employ data partitioning. We next describe our scaling up approach and the compression algorithms applied in our experiments.

3.1 Scaling up methods

Techniques for data partitioning can be categorized along several dimensions based on how data subsets are (1) separated (e.g., by instances or by features), (2) selected (e.g., sampling, partitioning), (3) trained and processed (e.g., concurrently, sequentially as incremental batch learning, or by model-guided instance selection), and (4) how the resulting models are produced (e.g., by combining predictions) [11].

Figure 2: A conceptual view of a data partitioning approach.

Figure 2 gives a general model of a data partitioning approach. A selection procedure selects one or more subsets Ti (i = 1, ..., k) of a large training data set. Each subset Ti is used as a training set for a learning algorithm A to produce a classification model Ci. An aggregation procedure then uses the results from the classifiers Ci to produce a final classifier, C. One advantage of this model is that it provides independent multi-subset learning; thus, each learning process can be run concurrently. Specific methods for scaling up vary by at least three factors: the learning algorithm, the selection procedure, and the aggregation procedure, as shown in the oval shapes of Figure 2. In general, different learning algorithms can be used to build each classifier. For this paper, we focus on SORCER's basic induction algorithm, which varies depending on the transformations applied (as discussed in Section 3.2). We use two instance subset selection procedures (random partitioning and sampling with replacement) and two aggregation procedures (majority voting and selecting the model that performs best on a validation set). Majority voting combines a set of classifiers by taking the union of the rules in the classifiers and resolving inconsistencies by eliminating rules with less frequent classifications. The term "combine", when applied to classifiers, refers to this majority voting aggregation procedure. We describe the four specific combinations of these procedures used in our experiments below.

Method 1: Randomly partition the training set into subsets, obtain a classifier from each subset, and select the classifier that performs best (highest accuracy) on a validation set.

Method 2: Randomly partition the training set into subsets, obtain a classifier from each subset, and combine all classifiers into a final classifier.

Method 3: Randomly sample the training set without replacement and greedily cover it: combine the classifier obtained from the current sample (Ci) with the current best combined classifier obtained from previous samples (C*), as long as the new combined classifier of Ci and C* (Ccomb) performs better than C* on a validation set. Each time a new classifier is combined into C*, update C* to Ccomb. The final classifier is C*.

Method 4: Same as Method 3, except that m classifiers (each trained from a different sample) are considered for combining with the current best combined classifier at a time, instead of one at a time as in Method 3.

Method 2 is most similar to a bagging approach, except that in bagging, subsets are randomly sampled with replacement from the training data set. As shown in Figure 2, each of these methods applies the same learning algorithm to each training subset. We conduct four experiments with three compression algorithms. Each of the first three experiments applies one of the compression algorithms to every training subset. In the final experiment, for each training subset, all three compression algorithms are applied to produce three classifiers, and the classifier with the highest accuracy on a validation set is selected. We refer to these experiments as experiments with algorithms A1, A2, A3, and Best, respectively. The next section describes the compression algorithms A1-A3.
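As a rough sketch (ours, not the authors' implementation), Method 2 with rule-level majority voting might look as follows; `induce` stands in for any SORCER compression run over one subset, rules are dicts of value sets as in the earlier sketches, and the conflict-resolution step (keep the rules whose classification is most frequent among the rules they overlap) is our reading of the majority voting description above.

    # Sketch of Method 2: random partitioning plus rule-level majority voting
    # (our simplified reading, not the authors' code).
    import random
    from collections import Counter

    def random_partition(data, k):
        """Shuffle the training instances and split them into k roughly equal subsets."""
        data = list(data)
        random.shuffle(data)
        return [data[i::k] for i in range(k)]

    def majority_vote_combine(classifiers, cond_attrs, class_attr):
        """Take the union of all rules; among rules whose condition parts overlap,
        keep only those carrying the most frequent classification."""
        rules = [rule for classifier in classifiers for rule in classifier]
        kept = []
        for r in rules:
            group = [s for s in rules if all(r[a] & s[a] for a in cond_attrs)]
            votes = Counter(frozenset(s[class_attr]) for s in group)
            if frozenset(r[class_attr]) == votes.most_common(1)[0][0]:
                kept.append(r)
        return kept

    def method2(train, k, induce, cond_attrs, class_attr):
        """Partition the training set, induce one classifier per subset, and combine them."""
        classifiers = [induce(subset) for subset in random_partition(train, k)]
        return majority_vote_combine(classifiers, cond_attrs, class_attr)

Because each call to `induce` sees only its own subset, the k inductions are independent and can run concurrently, which is the efficiency advantage of this model noted above.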

3.2 Compression algorithms

The transformations described in Section 2.2 can be applied in various combinations to create different compression algorithms. The algorithms used in our experiments are summarized in Figure 3.

Alg.  Transformations and Operations
A1    Merge consistent
A2    Merge joinable; Merge consistent
A3    Add high probability rows (p); Merge consistent

Figure 3: Three compression algorithms.

These algorithms are representative of induction using basic transformations, which can be specified easily in SORCER by generating script files of SORCER commands. A1 merges pairs of consistently joinable rows until no more consistent joining is possible; it is the simplest compression with generalization. (Applying only equivalence-preserving transformations, such as merge joinable, gives a classifier that simply remembers all seen cases.) A2 first merges joinable pairs, until no more such joins are possible, and then merges pairs of consistently joinable rows until no more consistent joining is possible. By applying merge joinable before merge consistent, A2 attempts to give priority to generalization according to the structure of the knowledge partially formed by equivalence-preserving transformation of the training data. A3 adds statistically determined rules whose accuracy exceeds a specified threshold before applying the transformations used in A1; for the experiments in this paper, we used the threshold p = 0.9. Since the order of the training data may affect the result of compression, we had SORCER shuffle the training data before applying each compression algorithm.

4 Experiments and results

The GIS data, obtained from the UCI KDD Repository [1], contains 581,012 instances of 30x30 meter cells with 54 attributes for classifying forest cover types. There are 44 binary attributes, ten attributes with continuous values, and seven classes of forest cover types. The class frequencies vary from classes with occurrences of 48.7% and 36.5% down to 0.5% of the data. We randomly selected 181,012 data instances for testing, 395,000 for training, and 5,000 for validation. A random sample of size 15,800 from the training set is used by SORCER to discretize the continuous attributes, and the boundaries obtained are used for discretizing the rest of the data.
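The paper does not state which discretization method SORCER uses, so the following sketch simply assumes equal-frequency binning to illustrate the step: cut points are computed from the sample and then reused to discretize every other instance.

    # Hedged illustration of sample-based discretization (equal-frequency binning is an
    # assumption; the paper does not specify SORCER's method).
    from bisect import bisect_right

    def equal_frequency_boundaries(sample_values, n_bins):
        """Cut points that split the sorted sample into n_bins roughly equal groups."""
        ordered = sorted(sample_values)
        step = len(ordered) / n_bins
        return [ordered[int(i * step)] for i in range(1, n_bins)]

    def discretize(value, boundaries):
        """Map a continuous value to the index of the interval it falls into."""
        return bisect_right(boundaries, value)

    # Boundaries learned on a (hypothetical) sample, then applied to unseen values.
    sample = [2.0, 3.5, 5.1, 7.3, 8.8, 9.9, 11.2, 14.0]
    cuts = equal_frequency_boundaries(sample, n_bins=4)
    print(cuts, [discretize(v, cuts) for v in [1.0, 6.0, 20.0]])  # [5.1, 8.8, 11.2] [0, 1, 3]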

To obtain a consistent classifier, SORCER resolves inconsistencies by retaining the instances that occur more frequently. We ran experiments on a Pentium III, 500 MHz PC with 256 Mb of memory. We first observe a learning curve for the data set by running SORCER on samples with sizes varying from 100 instances to the entire training data set. Accuracies are for the testing data set. Figure 4 shows the training time for each sample size using algorithms A1-A3 and the average learning curve (over three runs of each algorithm). Most of the training time is used to resolve inconsistency, as shown in Figure 4.

Figure 4: Training time and accuracy obtained from different training set sizes.

A training set of size 15.8 K gives an average accuracy of 70.2%, and from that size on, accuracy no longer improves. The right of Figure 4 shows that the three algorithms produce, from the entire training data set, classifiers of similar accuracy with slightly different training times. Based on this result, we decide to use 25 partitioned subsets and 25 random samples of size 15.8 K for each method described in Section 3. Figure 5 summarizes the results obtained from each method and algorithm.

Figure 5: Results from data partitioning approaches.

Loading data and classifier aggregation each took a few seconds, and since this time is essentially the same for all the algorithms, we exclude it from the training time in Col 2. Col 3 shows the time SORCER took to transform the final classifiers of the size (table length) shown in Col 1 into equivalent classifiers of the size shown in Col 5. Classifiers in Method 1 cannot be compressed further since they are not combined classifiers. We compare the total time (i.e., Cols 2 and 3) of each algorithm with the time spent on training with the entire training data set (as in Figure 4, except that for Best we use the average training time of all algorithms) and show the percent reduction of training time in Col 4. As shown in Col 6, the accuracies obtained are at worst 1% and at best 0.1% lower than the accuracy obtained from the entire training set.

There is no large difference between the accuracies obtained from the different algorithms. However, Method 4 (with m = 5) seems to give slightly higher accuracy than the others, while Method 3 was fastest with only slightly lower accuracy. In general, there is no great loss in accuracy from using a data partitioning technique, but there is a large decrease in training time: 91.5% on average for the fastest method.

5 Related work and conclusions

Sampling is a common technique for scaling up classification algorithms to large data sets [4, 11]. However, the question of how large a training sample should be to achieve optimal accuracy (the highest achievable with the entire data) is not obvious. Recent work on progressive sampling (PS) [12] provides an efficient search for a suitable sample size. Though we do not explicitly study the effect of PS, we applied its concept to observe SORCER's learning curve and select a sample size for our experiments. Several data partitioning techniques have been proposed for scaling up [11]. Work in ensemble learning has shown that combining the output of a set of classifiers that are independently trained from random data samples can greatly improve accuracy [3]. Like other ensemble learning methods, bagging has been studied in the context of accuracy improvement; here we use the bagging concept to study scaling up. Unlike ordinary bagging, we use SORCER to further compress a combined classifier into a smaller model for enhanced comprehensibility. Other experiments on the GIS data set have been published. Blackard [2] reported 70% accuracy obtained using neural net back propagation and 58% accuracy using linear discriminant analysis. Gu et al. [6] propose an efficient technique to find a good starting sample size for PS. Using a decision tree learner, C5.0 (the improved version of C4.5 [13]), accuracies of 73% and 75.8% were obtained on an initial sample size and on the entire training data of 400K instances, respectively.

However, these results used different experimental settings and thus give only a rough idea of where SORCER's performance stands. Our experiments show that the tradeoff between increased accuracy, using larger training sets, and time efficiency, using smaller training sets, is an important consideration for scaling up learning algorithms. We view this work as a first step toward scaling up techniques for SORCER. Many improvements are possible, including program optimization within SORCER and more efficient ways to deal with inconsistent data. We also plan to investigate feature subset selection for scaling up.

References

[1] Bay, S.D., The UCI KDD Archive, http://kdd.ics.uci.edu.
[2] Blackard, J.A., Comparison of Neural Networks and Discriminant Analysis in Predicting Forest Cover Types. Ph.D. dissertation, Department of Forest Sciences, Colorado State University, Fort Collins, Colorado, 1998.
[3] Breiman, L., Bagging predictors. Machine Learning, 24(2), pp. 123-140, 1996.
[4] Catlett, J., Megainduction: A test flight. Proceedings of the 8th International Workshop on Machine Learning, Morgan Kaufmann, 1991.
[5] Domingos, P., Efficient specific-to-general rule induction. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA, AAAI Press, 1996.
[6] Gu, B., B. Liu, F. Hu and H. Liu, Efficiently determine the starting sample size for progressive sampling. Proceedings of the 12th European Conference on Machine Learning, Freiburg, Germany, 2001.
[7] Hewett, R. and J. Leuchner, The Power of Second-Order Decision Tables. Proceedings of the 2nd SIAM International Conference on Data Mining, 2002.
[8] Huber, P., From large to huge: a statistician's reaction to KDD and DM. Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA, AAAI Press, 1997.
[9] Leuchner, J. and R. Hewett, A Formal Framework for Large Decision Tables. Proceedings of the Conference on Knowledge Retrieval, Use and Storage for Efficiency.
[10] Oates, T. and D. Jensen, Large data sets lead to overly complex models: an explanation and a solution. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, 1998.
[11] Provost, F. and V. Kolluri, A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2), pp. 131-169, 1999.
[12] Provost, F., D. Jensen and T. Oates, Efficient progressive sampling. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, AAAI/MIT Press, 1999.
[13] Quinlan, J., C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann, 1993.
