Classification/Regression Trees and Random Forests

Classification/Regression Trees and Random Forests Fabio G. Cozman - fgcozman@usp.br November 6, 2018

Classification tree Consider a binary class variable Y and features X_1, ..., X_n. Decide Ŷ after a series of splits. This process can be drawn as a tree. [Figure: a tree with root Color (red/green); the red branch leads to Size (small → Y = 0, large → Y = 1), the green branch leads to Shape (square → Y = 1, round → Y = 0).]

Usual drawing: upside down, with the root at the top. [Figure: root Color; the red branch leads to Size (small → Y = 0, large → Y = 1); the green branch leads to Shape (square → Y = 1, round → Y = 0).]

Possibly continuous features [Figure: a tree that splits on Color (red/green), then on Size (small/large) on each branch; one of the resulting branches splits further on the continuous feature Length (< 3 vs. ≥ 3), and the leaves carry Y = 0 or Y = 1.]

Splits are parallel to axes [Figure: a tree with splits X2 < 0.65, X1 < 0.5, X1 < 0.9, X2 < 0.25, X2 < 0.55 and leaves labeled B or R, shown next to the corresponding axis-parallel partition of the (X1, X2) plane.]

Some advantages 1 Trees are easy to draw and to understand; they can be translated to rules. 2 Trees are more flexible than linear models (such as logistic regression) in capturing boundaries between classes. 3 Trees can handle continuous and categorical features.

Growing a tree Finding the best tree is computationally hard. Basic and popular idea: select the best split, then split and repeat. Keep doing this until some stopping criterion is met.

The C4.5 algorithm Algorithm developed by Quinlan. Successor to (also popular) ID3. Based on entropy and related ideas.

Possible splits On values of a categorical feature. On thresholds that separate the observations with respect to a continuous feature.

Entropy and information gain Entropy of a set of observations D: $I(D) = -\sum_{k=1}^{K} \frac{|D_k|}{|D|} \log_2 \frac{|D_k|}{|D|}$, where $D_k$ is the set of observations such that Y = k. Information gain of a partition of D into datasets $D_1, \dots, D_m$: $I(D) - \sum_{j=1}^{m} \frac{|D_j|}{|D|} I(D_j)$.

Gain ratio Information gain is used in ID3. Gain ratio of a partition of D into datasets $D_1, \dots, D_m$: $\dfrac{I(D) - \sum_{j=1}^{m} \frac{|D_j|}{|D|} I(D_j)}{-\sum_{j=1}^{m} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}}$.
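
These three quantities are easy to compute directly; below is a minimal Python sketch (the function names and the plain-list data representation are my own choices, not part of C4.5).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """I(D): entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, blocks):
    """I(D) minus the weighted entropies of the blocks D_1, ..., D_m."""
    n = len(labels)
    return entropy(labels) - sum(len(b) / n * entropy(b) for b in blocks)

def gain_ratio(labels, blocks):
    """Information gain divided by the entropy of the split itself."""
    n = len(labels)
    split_info = -sum((len(b) / n) * log2(len(b) / n) for b in blocks)
    return information_gain(labels, blocks) / split_info if split_info > 0 else 0.0
```

Here `blocks` is the partition of the labels induced by a candidate split; for instance, `gain_ratio(['A', 'A', 'B', 'B'], [['A', 'A'], ['B', 'B']])` returns 1.0.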

Algorithm
1 Input: a training dataset.
2 If the dataset is empty, stop (actually, stop if its size is "small"); if all labels in the dataset are identical, return a leaf with this label.
3 For each possible split, compute the gain ratio of the resulting partition of the training dataset.
4 Greedy step: select the split with the largest gain ratio.
5 The split corresponds to a node; each element of the partition of the training dataset is an edge to a possible node. Recursive step: call the algorithm with each such element.
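
The recursion above can be sketched in a few lines of Python for purely categorical features; everything here (the dict-based data representation, the `grow` function, the `min_size` threshold standing in for "small") is illustrative, not Quinlan's implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, blocks):
    n = len(labels)
    gain = entropy(labels) - sum(len(b) / n * entropy(b) for b in blocks)
    split_info = -sum((len(b) / n) * log2(len(b) / n) for b in blocks)
    return gain / split_info if split_info > 0 else 0.0

def grow(rows, labels, features, min_size=2):
    """rows: list of dicts mapping feature name -> categorical value."""
    # Steps 1-2: stop on small or pure datasets, returning a leaf (majority label).
    if len(rows) <= min_size or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]
    # Step 3: partition the row indices by each feature's values and score the partitions.
    def blocks_for(f):
        return [[i for i, r in enumerate(rows) if r[f] == v]
                for v in sorted({r[f] for r in rows})]
    scored = [(gain_ratio(labels, [[labels[i] for i in b] for b in blocks_for(f)]), f)
              for f in features]
    best_score, best = max(scored)
    if best_score <= 0:                      # no split helps: return a leaf
        return Counter(labels).most_common(1)[0][0]
    # Steps 4-5: greedy choice of split, then one recursive call per block.
    return {"split": best,
            "children": {rows[b[0]][best]: grow([rows[i] for i in b],
                                                [labels[i] for i in b],
                                                features, min_size)
                         for b in blocks_for(best)}}
```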

Example: training dataset (columns X1, X2, ..., Xn and class Y; only X1 and Y are shown):

      X1   Y
d1    a    A
d2    a    A
d3    c    C
d4    b    A
d5    b    A
d6    c    C
d7    c    C
d8    b    B
d9    b    A
d10   c    A
d11   c    C
d12   a    A
d13   b    B

Entropy: I(D) = 1.419556. Split on X1: I(D_a) = 0, I(D_b) = 0.971, I(D_c) = 0.722. Information gain: 0.7685.
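
A quick numeric check of these numbers, as a self-contained sketch (the label and X1 strings below are transcribed from the table); it reproduces the entropy and, up to rounding, the information gain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

Y  = list("AACAACCBAACAB")   # class labels of d1 ... d13
X1 = list("aacbbccbbccab")   # values of X1 for d1 ... d13

blocks = [[y for x, y in zip(X1, Y) if x == v] for v in "abc"]
print(round(entropy(Y), 6))                                   # 1.419556
print([round(entropy(b), 3) for b in blocks])                 # [0.0, 0.971, 0.722]
print(entropy(Y) - sum(len(b) / len(Y) * entropy(b)
                       for b in blocks))                      # about 0.768
```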

Additional techniques C4.5 discounts gains when data are missing. C4.5 does some pruning of the classification tree after it is built (check whether splits can be removed). In practice there are many pruning techniques. There is also a C5.0 (commercial), but it is rarely used.

A classification tree [Figure: a large classification tree with splits on Thal, Ca, Slope, Oldpeak, Chol, MaxHR, RestBP, ChestPain, Sex, Age and RestECG and leaves labeled Yes/No; a smaller pruned tree; and a plot of training, cross-validation, and test error against tree size.]

CART Classification And Regression Trees: very similar to C4.5. Focus on binary splits (some categorical values may be grouped in the process). Splits are evaluated using the Gini index.

Gini index Suppose we have a region containing some observations collected in a dataset D. The Gini index of this region is $G = \sum_{k=1}^{K} \frac{|D_k|}{|D|}\left(1 - \frac{|D_k|}{|D|}\right)$. The Gini index is a measure of purity (small if the class proportions are each close either to zero or to one).
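
As a small Python sketch (the function name is mine):

```python
from collections import Counter

def gini(labels):
    """G = sum over classes k of p_k * (1 - p_k), from a list of labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini(["A"] * 10))              # 0.0  (pure region)
print(gini(["A"] * 5 + ["B"] * 5))   # 0.5  (maximally mixed two-class region)
```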

Evaluating a split Consider a node with dataset D (with Gini index g), and a split into two nodes. Suppose the first node has dataset $D_1$ and Gini index $g_1$, and the second node has dataset $D_2$ and Gini index $g_2$. The split is evaluated as $g - \frac{|D_1|}{|D|} g_1 - \frac{|D_2|}{|D|} g_2$.
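
A sketch of this criterion (self-contained; `gini_decrease` and the toy labels are illustrative):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """g - (|D1|/|D|) g1 - (|D2|/|D|) g2 for a candidate binary split."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

# A mixed node split into two pure children gives the largest possible decrease.
parent = ["A"] * 5 + ["B"] * 5
print(gini_decrease(parent, parent[:5], parent[5:]))   # 0.5
```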

Many extensions Tests may involve combinations of features. Construction of the tree may be optimal or at least approximately optimal. Missing data may be handled with care.

CART: Regression trees Suppose Y is a continuous variable. The idea is the same: recursively split on features using some criterion. New: each leaf contains a set of observations that can be averaged (or perhaps a linear regression is run locally, etc.).

Splitting The idea is to minimize the RSS: $\sum_{d_j \in D} (y_j - \hat{y}_j)^2$. To do so, we must find the feature $X_i$ and threshold $\eta$ that minimize $\sum_{d_j \in D: X_i < \eta} (y_j - \hat{y}_j)^2 + \sum_{d_j \in D: X_i \geq \eta} (y_j - \hat{y}_j)^2$. Fact: for each feature it is easy to find the corresponding $\eta$.
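
A brute-force sketch of this search for a single continuous feature (the function names are mine); on each side of the threshold, ŷ is taken to be the mean of the responses on that side, which is what makes the per-feature search easy.

```python
def split_rss(xs, ys, eta):
    """RSS of the two-leaf model obtained by thresholding one feature at eta."""
    total = 0.0
    for side in ([y for x, y in zip(xs, ys) if x < eta],
                 [y for x, y in zip(xs, ys) if x >= eta]):
        if side:
            mean = sum(side) / len(side)
            total += sum((y - mean) ** 2 for y in side)
    return total

def best_threshold(xs, ys):
    """Only midpoints between consecutive distinct feature values need to be tried."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if not candidates:                      # the feature is constant on this node
        return None, float("inf")
    eta = min(candidates, key=lambda c: split_rss(xs, ys, c))
    return eta, split_rss(xs, ys, eta)

# To split a node, run best_threshold on every feature and keep the
# (feature, eta) pair with the smallest RSS.
```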

Pruning Pruning cuts the depth of the tree. A leaf may then contain a set of observations; usually classification is obtained by voting. The goal is to prevent overfitting:
1 Consider all possible node removals.
2 For each one, compute $\sum_{d_j \in D} (y_j - \hat{y}_j)^2 + \alpha |T|$, where $|T|$ is the number of leaves of the tree and $\alpha$ is a parameter (obtained by cross-validation).
3 Select node removals to minimize this latter quantity.
4 Select $\alpha$ and run the whole process with this $\alpha$.
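
In practice the candidate prunings and the α values are usually produced by a library; the sketch below assumes scikit-learn's cost-complexity pruning (a tooling choice of mine, not something the slides prescribe) and selects α by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, purely for illustration.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Candidate alpha values for the cost-complexity criterion RSS + alpha * |T|.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Choose the alpha whose pruned tree cross-validates best, then refit with it.
scores = [(cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                           X, y, cv=5).mean(), a)
          for a in path.ccp_alphas]
best_alpha = max(scores)[1]
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```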

A regression tree [Figure: an unpruned regression tree with splits on Years, RBI, Hits, Putouts, Walks and Runs, and leaf predictions ranging from 4.622 to 7.289.]

A regression tree, after pruning [Figure: the split Years < 4.5 leads to the leaf 5.11; otherwise the split Hits < 117.5 leads to the leaves 6.00 and 6.74.]

Some disadvantages of trees Recall: 1 Trees are easy to understand. 2 Trees can be translated to rules. 3 Trees are more flexible than logistic (basically linear) regression. 4 Trees can handle continuous and categorical features. But: 1 Trees are not very accurate. 2 Trees are not robust; they are very sensitive to the training dataset (variance is quite high).

Bagging trees and random forests A single tree is nice, but not very accurate and very sensitive. How about learning a set of trees and averaging the results? This idea leads to bagging trees and random forests.

Bagging trees If you have several estimates of a quantity, averaging them reduces variance. Idea: to produce several estimates, resample the training dataset (inspired by a technique called the bootstrap; hence the name "bootstrap aggregation").

Bagged regression tree Sample with replacement M datasets. For each one, produce a regression tree. To produce ŷ, start by generating the M values ŷ_j (one per tree), then take ŷ = (1/M) Σ_j ŷ_j.
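
A sketch of this procedure, assuming scikit-learn's DecisionTreeRegressor as the base learner (the slides do not prescribe any particular implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_regression_trees(X, y, M=100, seed=0):
    """Fit M trees, each on a bootstrap sample of the rows drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X_new):
    """Average the M per-tree predictions."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```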

Bagged classification tree Sample with replacement M datasets. For each one, produce a classification tree. To produce ŷ, start by generating the M values ŷ_j (one per tree), then take a vote.
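
For classification, only the aggregation step changes; a minimal majority-vote sketch (again assuming scikit-learn trees):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagged_classification_trees(X, y, M=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    return [DecisionTreeClassifier().fit(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(M))]

def predict_vote(trees, X_new):
    """Majority vote over the M per-tree predictions, one test point at a time."""
    votes = np.array([t.predict(X_new) for t in trees])    # shape (M, n_new)
    return [Counter(votes[:, j]).most_common(1)[0][0] for j in range(votes.shape[1])]
```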

Important Bagging is a general idea that can be used with all classifiers and regressors!

Digression: MSE without cross-validation! When we have many datasets, each observation is used in some resampled datasets, and ignored in others. For each observation, produce an average over all trees that were learned without the observation. Then compute the error for this observation. For large M, the sum of squared errors is a good estimate of the MSE (!!).
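
A self-contained sketch of this out-of-bag estimate (the synthetic data and the number of trees are illustrative): each observation is predicted only by the trees whose bootstrap sample happened to leave it out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
M, n = 200, len(y)

trees, in_bag = [], []
for _ in range(M):
    idx = rng.integers(0, n, size=n)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    in_bag.append(set(idx.tolist()))

# For each observation, average only the trees that never saw it, then square the error.
squared_errors = []
for i in range(n):
    preds = [t.predict(X[i:i + 1])[0] for t, bag in zip(trees, in_bag) if i not in bag]
    if preds:
        squared_errors.append((y[i] - np.mean(preds)) ** 2)

print("Out-of-bag estimate of the MSE:", np.mean(squared_errors))
```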

Random forests Suppose we produce a bagged ensemble of M trees. But suppose that, when we grow each tree, at each split we randomly select a subset of m features and consider only those for splitting. The result is a random forest.
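
In scikit-learn this per-split feature subsampling is the max_features argument of RandomForestClassifier; the sketch below (with an illustrative synthetic dataset) uses the common choice m = √p.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)

# M = 300 trees; at each split only m = sqrt(p) randomly chosen features are considered.
forest = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```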

Intuition By randomizing the splits, the trees become less correlated with one another. Remember: they are all built with the same data. By reducing correlation, the variance of the average is reduced. This may seem silly and bizarre, but it works quite well.

Gene expression: 15 labels, 500 features [Figure: test classification error (roughly 0.2 to 0.5) against the number of trees (0 to 500), for m = p, m = p/2, and m = √p.] Single tree: error rate about 45%.

A note Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.