
An Empirical Comparison of Ensemble Methods Based on Classification Trees

Mounir Hamza and Denis Larocque
Department of Quantitative Methods, HEC Montreal, Canada
June 2005

Outline
- Introduction
- Classification Trees
- Ensemble Methods
- Simulation
- Results
- Conclusion

Introduction

Supervised classification:
- Y: target variable (categorical for this talk)
- X = (X_1, ..., X_p): vector of predictors

Goal: based on a sample of (X, Y), develop a model to predict future Y values given X.

Examples:
- Discriminant analysis
- Logistic regression
- Neural networks
- Classification trees
- Support vector machines

In this talk:
- Empirical comparison of several ensemble methods based on classification trees
- Comparison with the results of Lim, Loh and Shih (Machine Learning, 2000)

Classification trees

Idea: create a partition of the predictor space into a set of rectangles and assign a class to each of them.

CART (Breiman, L., Friedman, J., Olshen, R. and Stone, C., 1984) works with the notion of node impurity. If Y is binary (0, 1), the Gini impurity at a given node N is

    i(N) = P_N(0) P_N(1)

where P_N(i) is the proportion of observations of node N such that Y = i. This function attains its minimum, 0, when the node is pure (i.e. contains only 0's or only 1's). It attains its maximum, 1/4, when there are as many 0's as 1's in the node.
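
For concreteness, a minimal Python sketch of this impurity measure (the function name and toy labels are illustrative, not part of the talk):

```python
import numpy as np

def gini_impurity(y):
    """Gini impurity i(N) = P_N(0) * P_N(1) of a binary node.

    y: array of 0/1 labels for the observations falling in node N.
    """
    p1 = np.asarray(y).mean()   # proportion of 1's in the node, P_N(1)
    return p1 * (1.0 - p1)      # equals P_N(0) * P_N(1)

# A pure node has impurity 0; a balanced node attains the maximum 1/4.
print(gini_impurity([0, 0, 0, 0]))   # 0.0
print(gini_impurity([0, 1, 0, 1]))   # 0.25
```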

At a given node N, the algorithm seeks the best split. A split is a separation of the observations of the node into two parts with respect to a rule defined by one of the predictor variables:

- If X is continuous or ordinal: X < c
- If X is nominal: X ∈ {c_1, ..., c_k}

The algorithm seeks the best split among all possible splits. For a given split, we can calculate the decrease in impurity

    Δi(N) = i(N) - P_L i(N_L) - P_R i(N_R)

where N_L and N_R are the left and right nodes created by the split, and where P_L and P_R are the proportions of observations of node N that end up in the left and right nodes. The best split is the one that maximizes the decrease in impurity.
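
A small sketch of this search for a single continuous predictor, with the Gini impurity redefined inline so the snippet is self-contained (the helper names and the toy example are ours, not CART's):

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary node: P_N(0) * P_N(1)."""
    p1 = np.mean(y)
    return p1 * (1.0 - p1)

def best_split(x, y):
    """Cut-point c maximizing the decrease in impurity
    Δi(N) = i(N) - P_L i(N_L) - P_R i(N_R)
    over splits of the form X < c for one continuous predictor x."""
    x, y = np.asarray(x), np.asarray(y)
    parent = gini(y)
    best_c, best_gain = None, -np.inf
    for c in np.unique(x)[1:]:                # candidate cut-points
        left, right = y[x < c], y[x >= c]
        p_left = len(left) / len(y)
        gain = parent - p_left * gini(left) - (1 - p_left) * gini(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

# The best cut-point separates the two groups exactly: c = 10, Δi = 0.25.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```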

In the end, we get a tree G(x) which, for a given x, returns an estimate of P(Y = 1 | x). This estimate is simply the proportion of 1's in the terminal node that is reached when we let x fall down the tree. We can use a cut-point to classify an observation; for example, we could assign class 1 if G(x) > 1/2 and class 0 otherwise.

Advantages of classification trees:
- Easy to understand (if there are not too many terminal nodes)
- Can handle missing data
- Nonparametric

Disadvantages:
- Not always the best in terms of performance
- Unstable

The trees are considered unstable in the sense that, if we slightly modify the data, the tree constructed with the modified data can be very different from the one constructed with the original observations.
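
As an illustration of this cut-point rule with a generic tree implementation (scikit-learn here, purely as a stand-in for the CART software used in the study; the loaded data set is just a convenient binary example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)              # a binary toy problem
tree = DecisionTreeClassifier(min_samples_leaf=20).fit(X, y)

# G(x): estimated P(Y = 1 | x) = proportion of 1's in the terminal node of x
p_hat = tree.predict_proba(X)[:, 1]

# Classify with the cut-point 1/2
y_hat = (p_hat > 0.5).astype(int)
```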

Ensemble methods

"Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions" (Dietterich, 2000).

Examples:
- Bagging
- Boosting
- Random forests
- Randomization

Bagging

Bagging: bootstrap aggregating, Breiman (1996).

Idea: generate bootstrap samples, construct a tree with each of them, and combine the results.

Typical algorithm:
For b = 1, ..., B:
1) Choose n observations at random and with replacement from the original data.
2) Construct a tree with these observations.

To classify an observation, either
i) use the majority class among the B trees, or,
ii) since a tree returns an estimate of P(Y = k | X), average those estimates over the B trees and assign the class with the highest estimated probability.
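
A minimal sketch of this algorithm, assuming NumPy arrays and scikit-learn trees as the base learner (the experiments in the talk used CART trees; the function names are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=100, seed=0):
    """Fit B trees, each on a bootstrap sample (n draws with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # bootstrap sample indices
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Average the trees' estimates of P(Y = k | X) and pick the largest.
    (Assumes every bootstrap sample contained all the classes.)"""
    proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
    return proba.argmax(axis=1)
```

With B = 100 this mirrors the "Bagging with 100 trees" setting used later in the simulations.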

Boosting (arcing)

Freund and Schapire (1996): AdaBoost.M1.

Idea: construct a tree with the original data, give more weight to the data points that were misclassified, and construct a new tree with the weighted points. Modify the weights again, and so on. In the end, combine the results.

Breiman (1998): arcing (Adaptively Resample and Combine).
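
As a usage illustration only: scikit-learn's AdaBoostClassifier implements a multiclass generalization of AdaBoost over shallow trees, not the arc-x4 variant used in the experiments; X_train and y_train below are placeholders.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 100 boosting rounds, each fitting a shallow tree on reweighted data.
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=100)
# boost.fit(X_train, y_train); boost.predict(X_test)
```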

Algorithm arc-x4, Breiman (1998):

Start with the weights p_i^(1) = 1/n, i = 1, ..., n.

For k = 1, ..., M:
1) By sampling from the original data with replacement and with weights (p_1^(k), ..., p_n^(k)), obtain a sample of size n.
2) Construct a tree, G_k, with this sample.
3) Let m_i be the number of times that the i-th observation of the original data is misclassified by G_1, ..., G_k.
4) Update the weights:

    p_i^(k+1) = (1 + m_i)^4 / Σ_j (1 + m_j)^4

To classify an observation, use the majority class among the M trees.
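
A rough sketch of arc-x4 as described above, assuming NumPy arrays, integer class labels and scikit-learn trees (the names and details are ours, not Breiman's implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def arc_x4_fit(X, y, M=100, seed=0):
    """arc-x4: resample with weights proportional to (1 + m_i)^4, where m_i
    counts how often observation i has been misclassified so far."""
    rng = np.random.default_rng(seed)
    n = len(y)
    p = np.full(n, 1.0 / n)                     # initial weights p_i = 1/n
    m = np.zeros(n)                             # misclassification counts
    trees = []
    for _ in range(M):
        idx = rng.choice(n, size=n, replace=True, p=p)   # weighted resampling
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        trees.append(tree)
        m += (tree.predict(X) != y)             # update the counts m_i
        w = (1.0 + m) ** 4
        p = w / w.sum()                         # p_i proportional to (1 + m_i)^4
    return trees

def arc_x4_predict(trees, X):
    """Majority vote among the M trees (integer labels assumed)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```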

Random forests

Breiman (2001). A random forest is a collection of trees {G(x, Θ_k), k = 1, ...} where Θ_1, Θ_2, ... are training sets that are independent and identically distributed.

Note: bagging is an example of a random forest.

Important result: we cannot over-fit when the number of trees increases.

Two algorithms are used in Breiman (2001). Here is the description of one of them. It combines bagging with a particular way of constructing the trees.

The basic algorithm is the same as for bagging:
For b = 1, ..., B:
1) Choose n observations at random and with replacement from the original data.
2) Construct a tree with these observations.

The difference is that we inject randomness into the tree construction algorithm itself. At a given node, we select at random a subset of predictors (of size F) and let the algorithm find the best split among the selected predictors only. The selected predictors can be different at each node.
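
In scikit-learn terms (shown only as an illustration; the study itself ran random forests in R), the size F of the random subset corresponds to the max_features, or mtry, parameter; X_train and y_train are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier

# 100 bootstrap trees; at each node, a random subset of predictors of size F
# (controlled by max_features) is considered when searching for the best split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
# rf.fit(X_train, y_train); rf.predict_proba(X_test)
```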

Simulation

Data sets:
- Wisconsin Breast Cancer (BCW)
- BUPA Liver Disorders (BLD)
- Boston Housing (BOS)
- Contraceptive Method Choice (CMC)
- StatLog DNA (DNA)
- StatLog Heart Disease (HEA)
- LED Display (LED)
- PIMA Indian Diabetes (PID)
- Image Segmentation (SEG)
- Attitude toward Smoking Restrictions (SMO)
- Thyroid Disease (THY)
- StatLog Vehicle Silhouette (VEH)
- Congressional Voting Records (VOT)
- Waveform (WAV)

All of the data sets can be obtained from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/mlsummary.html), with the exception of SMO, which can be obtained from http://lib.stat.cmu.edu/datasets/csb/.

Table 1: Characteristics of the data sets

Data set   Size   Classes   Numerical predictors   Categorical predictors   Total predictors
BCW         683      2               9                        0                     9
BLD         345      2               6                        0                     6
BOS         506      3              12                        1                    13
CMC        1473      3               2                        7                     9
DNA        3186      3               0                       60                    60
HEA         270      2               7                        6                    13
LED        6000     10               0                        7                     7
PID         532      2               7                        0                     7
SEG        2310      7              19                        0                    19
SMO        2855      3               3                        5                     8
THY        7200      3               6                       15                    21
VEH         846      4              18                        0                    18
VOT         435      3               0                       16                    16
WAV        3600      3              21                        0                    21

Methods:
- Single tree with pruning (CART)
- Bagging with 100 trees (CART)
- Arcing (arc-x4) with 100 and 250 trees (CART)
- Random forest with 100 trees (R); the number of predictors chosen at random at a given node is log2(M + 1), where M is the total number of predictors

Three split criteria: Gini, entropy and twoing.
With and without linear combinations of predictors.

Comparison criterion: classification error estimated with 10-fold cross-validation.

For another part of the study, noise was added to the data.
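
A hedged sketch of this evaluation setup with present-day tools (scikit-learn in place of the original R and CART software; log2(M + 1) is rounded down to an integer here):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rf_cv_error(X, y, n_trees=100, seed=0):
    """10-fold cross-validated classification error of a random forest with
    F = floor(log2(M + 1)) predictors tried at each node."""
    M = X.shape[1]
    F = max(1, int(np.log2(M + 1)))
    rf = RandomForestClassifier(n_estimators=n_trees, max_features=F,
                                random_state=seed)
    accuracy = cross_val_score(rf, X, y, cv=10, scoring="accuracy")
    return 1.0 - accuracy.mean()
```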

Lim, Loh and Shih (2000) compared several classification methods on the same data sets.

Trees: CART, Splus tree, C4.5, FACT, QUEST, IND, OC1, LMDT, CAL5, T1.

Statistical algorithms: linear and quadratic discriminant analysis, nearest neighbor, logistic discriminant analysis, flexible discriminant analysis, penalized discriminant analysis, mixture discriminant analysis, POLYCLASS.

Neural networks: learning vector quantization (Splus), radial basis function.

Results

Table 4: Overall ranking of the methods according to the mean rank and according to the mean classification error

Sorted according to the mean rank        Sorted according to the mean classification error
Method           Mean rank               Method           Mean classification error
RF                 2.38                  RF                 17.79
BO(100)-NLC-E      3.38                  BA-NLC-E           18.61
BO(250)-NLC-E      4.17                  BO(250)-NLC-G      18.64
BO(250)-NLC-G      4.56                  BO(100)-NLC-E      18.72
BO(100)-NLC-G      4.91                  BO(250)-NLC-E      18.76
BA-NLC-E           5.20                  BO(100)-NLC-G      18.87
BA-NLC-G           5.67                  BA-NLC-G           18.89
ST-NLC-E           7.64                  ST-NLC-E           22.71
ST-NLC-G           7.92                  ST-NLC-G           23.25

Notation: ST = single tree, BA = bagging (100 trees), BO(k) = arcing with k trees, RF = random forest; NLC/WLC = without/with linear combinations of predictors; G = Gini, E = entropy.

Table 5: Comparison with the results of Lim, Loh and Shih (2000)

Data set   Best error (Lim et al.)   Best method (Lim et al.)          Best error (this study)   Best method (this study)   Percent difference
BCW              2.78                Neural networks                          2.64               RF                              -5.20
BLD             27.9                 Classification tree (OC1)               25.80               RF                              -7.54
BOS             22.5                 Neural networks                         19.96               BO(250)-WLC-G                  -11.29
CMC             43.4                 POLYCLASS                               47.39               RF                               9.19
DNA              4.72                POLYCLASS                                3.23               RF                             -31.51
HEA             14.1                 Linear discriminant analysis            17.78               RF                              26.08
LED             27.1                 Linear discriminant analysis            26.27               RF                              -3.07
PID             22.1                 Linear discriminant analysis            22.93               RF                               3.77
SEG              2.21                Nearest neighbor                         1.60               BO(100)-WLC-E                  -27.52
SMO             30.5                 Linear discriminant analysis            34.05               RF                              11.62
THY              0.642               Classification tree (CART)               0.28               BO(100)-WLC-E                  -56.73
VEH             14.5                 Quadratic discriminant analysis         23.40               BO(250)-WLC-G                   61.41
VOT              4.32                Flexible discriminant analysis           3.91               RF                              -9.54
WAV             15.1                 Neural networks                         14.72               RF                              -2.50

Mean            16.56                                                        17.42                                               -3.06
Median          14.8                                                         18.87                                               -4.14
Minimum          0.642                                                        0.28
Maximum         43.4                                                         47.39

Table 6: Percent increase in classification error between the original data and the data with noise

Data set   ST-NLC-G   BA-NLC-G   BO(100)-NLC-G   BO(250)-NLC-G       RF
BCW           54.84     113.04         104.55           91.30     72.22
BLD           18.18      16.98          20.00           29.17     28.09
BOS           10.85      -1.69          15.53           10.89      1.87
CMC           -6.23       2.28           2.18            1.37      3.30
DNA            2.73      41.91          53.38           38.69     43.69
HEA           40.68      17.31           1.79           15.38     -2.08
LED           -0.68      -0.87          -4.21           -3.99      1.65
PID           19.23      11.81           8.40            6.15      5.74
SEG          120.00     148.00         174.36          128.26    115.56
SMO            8.78       3.07           4.25            5.00      5.56
THY         2156.82     176.00         333.33          292.00     23.08
VEH           -0.76       1.82           2.51            3.03     -5.26
VOT           -9.52      72.73          60.87           68.18      0
WAV            1.32       2.57           5.08            1.97      4.53

Mean         172.59      43.21          55.86           49.10     21.28
Median         9.82      14.40          11.97           13.14      5.04
Minimum       -9.52      -1.69          -4.21           -3.99     -5.26
Maximum     2156.82     176.00         333.33          292.00    115.56

Table 7: Classification error (in %) for random forests with F variables and with the optimal number of variables

Data set   Error with F variables   Value of F   Error with the optimal number of variables   Optimal number of variables   Percent increase relative to the optimum
BCW               2.64                   3                       2.34                                    1                              12.50
BLD              25.80                   2                      25.80                                    2                               0
BOS              21.15                   3                      19.17                                    5                              10.31
CMC              47.39                   2                      45.96                                    3                               3.10
DNA               3.23                   5                       3.11                                    9                               4.04
HEA              17.78                   3                      15.19                                    4                              17.07
LED              26.27                   3                      26.02                                    2                               0.96
PID              22.93                   3                      22.56                                    1                               1.67
SEG               1.95                   4                       1.69                                    8                              15.38
SMO              34.05                   3                      30.47                                    1                              11.72
THY               0.36                   4                       0.25                                   12                              44.44
VEH              26.95                   4                      24.23                                   12                              11.22
VOT               3.91                   4                       3.45                                    5                              13.33
WAV              14.72                   4                      14.72                                    4                               0

Mean             17.79                   3.36                   16.78                                    4.93                           10.41
Median           19.46                   3                      17.18                                    4                              10.76
Minimum           0.36                   2                       0.25                                    1                               0
Maximum          47.39                   5                      45.96                                   12                              44.44
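
The "optimal number of variables" column could be reproduced today with a simple grid search over the number of predictors tried at each node; a sketch under that assumption (not the procedure actually used in the study):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def best_mtry(X, y, n_trees=100):
    """Number of predictors per node minimizing the 10-fold CV error."""
    grid = GridSearchCV(
        RandomForestClassifier(n_estimators=n_trees, random_state=0),
        param_grid={"max_features": list(range(1, X.shape[1] + 1))},
        cv=10, scoring="accuracy")
    grid.fit(X, y)
    return grid.best_params_["max_features"], 1.0 - grid.best_score_
```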

Conclusion

In this study:
1) Random forests were significantly better than bagging, boosting and a single tree.
2) Their error rate was smaller than the best one obtained by the methods used in Lim, Loh and Shih (2000) for 9 out of 14 data sets.
3) Random forests were more robust to noise than the other methods.

Consequently, the random forest is a very good "off-the-shelf" classification method:
- Easy to use
- No model and no parameters to select, except for the number of predictors chosen at random at each node
- Relatively robust to noise