Lecture 20: Bagging, Random Forests, Boosting

Reading: Chapter 8
STATS 202: Data mining and analysis
November 13, 2017

Classification and regression trees, in a nutshell
Grow the tree by recursively splitting the samples in a leaf R_i according to a rule X_j > s, choosing (R_i, X_j, s) to maximize the drop in RSS. This is a greedy algorithm.
Create a sequence of subtrees T_0, T_1, ..., T_m using a pruning algorithm.
Select the best tree T_i (or the best α) by cross-validation.
Why might it be better to choose α rather than the tree T_i by cross-validation?
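
As a concrete illustration (not part of the slides), here is a minimal sketch of this grow/prune/select workflow using the R package tree, assuming a data frame Heart with a categorical response AHD; the data and variable names are placeholders.

library(tree)
fit <- tree(AHD ~ ., data = Heart)          # grow greedily: each split maximizes the drop in deviance/RSS
cv  <- cv.tree(fit, FUN = prune.misclass)   # cross-validate over the pruning sequence (indexed by alpha)
best <- cv$size[which.min(cv$dev)]          # subtree size with the smallest CV error
pruned <- prune.misclass(fit, best = best)  # select the corresponding subtree
plot(pruned); text(pruned, pretty = 0)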

Example: the Heart dataset. How do we deal with categorical predictors?
[Figure: a classification tree fit to the Heart data, with splits on predictors such as Thal, Ca, ChestPain, Slope, Oldpeak, Chol, MaxHR, RestBP, RestECG, Sex, and Age, and Yes/No predictions at the leaves.]

Categorical predictors
If there are only 2 categories, the split is obvious: we don't have to choose a splitting point s as we do for a numerical variable.
If there are more than 2 categories:
Order the categories according to the average of the response, e.g. ChestPain:a > ChestPain:c > ChestPain:b.
Treat the predictor as numerical with this ordering, and choose a splitting point s.
One can show that this is the optimal way of partitioning.
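
A small sketch of the ordering trick (not from the slides; it assumes the Heart data with AHD coded "Yes"/"No", and the column names are illustrative): compute the average response per category, then recode the factor as a numeric variable with that ordering.

avg <- tapply(as.numeric(Heart$AHD == "Yes"), Heart$ChestPain, mean)  # mean response per category
ord <- names(sort(avg))                                               # categories ordered by mean response
x   <- as.numeric(factor(Heart$ChestPain, levels = ord))              # recode with this ordering
# A numerical split x > s now corresponds to grouping low- vs. high-response categories.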

Missing data
Suppose we want to assign every sample to a leaf R_i despite the missing data.
When choosing a new split on a variable X_j (growing the tree), only consider the samples for which X_j is observed.
In addition to the best split, record a second-best split using a different variable, a third-best, and so on.
To propagate a sample down the tree, if the variable needed for a decision is missing, fall back on the second-best decision, or the third-best, and so on.
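
This is essentially what CART-style surrogate splits do in practice; a minimal sketch with the rpart package (the slides do not mention rpart, and the data names are placeholders):

library(rpart)
fit <- rpart(AHD ~ ., data = Heart,
             control = rpart.control(usesurrogate = 2, maxsurrogate = 5))
summary(fit)   # lists each node's primary split together with its surrogate splits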

Bagging
Bagging = Bootstrap Aggregating.
In the bootstrap, we replicate our dataset by sampling with replacement:
Original dataset: x = c(x1, x2, ..., x100)
Bootstrap samples: boot1 = sample(x, 100, replace = TRUE), ..., bootB = sample(x, 100, replace = TRUE).
We used these samples to get the standard error of a parameter estimate:
$$\widehat{\mathrm{SE}}(\hat\beta_1) \approx \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\Big(\hat\beta_1^{(b)} - \frac{1}{B}\sum_{k=1}^{B}\hat\beta_1^{(k)}\Big)^2}$$
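
For concreteness, a small simulated example (not from the slides) of the bootstrap standard error of a regression coefficient:

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
B <- 1000
beta1 <- numeric(B)
for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)        # a bootstrap sample of the rows
  beta1[b] <- coef(lm(y[idx] ~ x[idx]))[2]   # refit and record beta_1
}
sd(beta1)                                    # bootstrap estimate of SE(beta_1)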

Bagging
In bagging, we average the predictions of a model fit to many bootstrap samples.
Example: bagging the Lasso. Let $\hat y_{L,b}$ be the prediction of the Lasso applied to the $b$th bootstrap sample. The bagging prediction is
$$\hat y_{\mathrm{boot}} = \frac{1}{B}\sum_{b=1}^{B} \hat y_{L,b}.$$
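
A minimal sketch of bagging the Lasso with glmnet (not from the slides; X, y, and Xtest are placeholder design matrices and response):

library(glmnet)
B <- 100
preds <- matrix(NA, nrow(Xtest), B)
for (b in 1:B) {
  idx <- sample(nrow(X), nrow(X), replace = TRUE)     # bootstrap sample
  fit <- cv.glmnet(X[idx, ], y[idx], alpha = 1)       # Lasso, lambda chosen by CV
  preds[, b] <- predict(fit, newx = Xtest, s = "lambda.min")
}
yhat_boot <- rowMeans(preds)                          # average the B predictions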

When does bagging make sense?
When a regression method or a classifier has a tendency to overfit, bagging reduces the variance of the prediction.
When n is large, the empirical distribution is similar to the true distribution of the samples, so bootstrap samples behave like independent realizations of the data.
Bagging then amounts to averaging the fits from many independent datasets, which would reduce the variance by a factor of 1/B.
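
A toy check of the 1/B claim, under the idealized assumption that the B fits really are independent (a sketch, not from the slides; for actual bootstrap samples the fits are only approximately independent):

set.seed(1)
B <- 25
one_fit  <- replicate(2000, mean(rnorm(50)))                       # a single "fit"
avg_fits <- replicate(2000, mean(replicate(B, mean(rnorm(50)))))   # average of B independent fits
var(avg_fits) / var(one_fit)                                       # close to 1/B = 0.04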

Bagging decision trees
Disadvantage: every time we fit a decision tree to a bootstrap sample, we get a different tree $T_b$, so we lose interpretability.
Variable importance: for each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we split on that predictor in $T_b$. Average this total over the bootstrap estimates $T_1, \dots, T_B$.
[Figure: variable importance plot for the Heart data; Thal, Ca, and ChestPain have the highest importance, while Fbs, RestECG, and ExAng have the lowest.]
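
A sketch of bagging trees and the variable importance plot with the randomForest package (not from the slides; setting mtry to the number of predictors makes a random forest equivalent to bagging, and the data names are placeholders):

library(randomForest)
p   <- ncol(Heart) - 1
bag <- randomForest(AHD ~ ., data = Heart, mtry = p, ntree = 300, importance = TRUE)
importance(bag)    # total decrease in Gini index (or accuracy) per predictor
varImpPlot(bag)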

Out-of-bag (OOB) error
To estimate the test error of a bagging estimate, we could use cross-validation. But each time we draw a bootstrap sample, we only use about 63% of the observations. Idea: use the remaining observations as a test set.
OOB error:
For each sample $x_i$, find the predictions $\hat y_i^{b}$ for all bootstrap samples $b$ which do not contain $x_i$. There should be around $0.37B$ of them. Average these predictions to obtain $\hat y_i^{\mathrm{oob}}$.
Compute the error $(y_i - \hat y_i^{\mathrm{oob}})^2$.
Average these errors over all observations $i = 1, \dots, n$.
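
randomForest reports this quantity directly; a sketch for a regression forest, assuming a data frame train with a numeric response y (placeholder names, not from the slides):

library(randomForest)
bag <- randomForest(y ~ ., data = train, mtry = ncol(train) - 1, ntree = 300)
oob_pred <- predict(bag)                       # predict() without newdata returns the OOB predictions
mean((train$y - oob_pred)^2, na.rm = TRUE)     # OOB estimate of the test MSE
tail(bag$mse, 1)                               # essentially the same quantity, as stored by the package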

Out-of-bag (OOB) error
[Figure: test and OOB error versus the number of trees, for bagging and random forests on the Heart data.]
The test error decreases as we increase B (the dashed line is the error of a single decision tree).

Random forests
Bagging has a problem: the trees produced from different bootstrap samples can be very similar.
Random forests:
We fit a decision tree to each bootstrap sample.
When growing the tree, we select a random subset of m < p predictors to consider at each split.
This leads to very different (decorrelated) trees across samples.
Finally, we average the predictions of the trees.
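
The only change relative to bagging is mtry, the number m of predictors considered at each split; a sketch comparing OOB errors (not from the slides; placeholder names, with AHD assumed to be a factor so the forest does classification):

library(randomForest)
p   <- ncol(Heart) - 1
bag <- randomForest(AHD ~ ., data = Heart, mtry = p, ntree = 300)
rf  <- randomForest(AHD ~ ., data = Heart, mtry = floor(sqrt(p)), ntree = 300)
c(bagging = tail(bag$err.rate[, "OOB"], 1),    # OOB misclassification rates
  forest  = tail(rf$err.rate[, "OOB"], 1))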

Random forests vs. bagging
[Figure: same plot as on the previous slide, comparing test and OOB error for bagging and random forests as the number of trees grows.]

Random forests, choosing m
[Figure: test classification error versus the number of trees for m = p, m = p/2, and m = √p.]
The optimal m is usually around √p, but it can be treated as a tuning parameter.
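
Treating m as a tuning parameter amounts to comparing errors over a grid of mtry values; a sketch (placeholder names as above, not from the slides):

library(randomForest)
p <- ncol(Heart) - 1
for (m in c(p, floor(p / 2), floor(sqrt(p)))) {
  fit <- randomForest(AHD ~ ., data = Heart, mtry = m, ntree = 500)
  cat("mtry =", m, " OOB error =", tail(fit$err.rate[, "OOB"], 1), "\n")
}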

Boosting
1. Set $\hat f(x) = 0$ and $r_i = y_i$ for $i = 1, \dots, n$.
2. For $b = 1, \dots, B$, iterate:
   2.1 Fit a decision tree $\hat f^b$ with $d$ splits to the responses $r_1, \dots, r_n$.
   2.2 Update the prediction: $\hat f(x) \leftarrow \hat f(x) + \lambda \hat f^b(x)$.
   2.3 Update the residuals: $r_i \leftarrow r_i - \lambda \hat f^b(x_i)$.
3. Output the final model:
$$\hat f(x) = \sum_{b=1}^{B} \lambda \hat f^b(x).$$
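
Written out in code, the loop above looks roughly like the following sketch (not from the slides; a regression setting, with rpart's maxdepth used as a stand-in for "d splits" and train / y as placeholder names):

library(rpart)
boost_fit <- function(train, B = 1000, d = 1, lambda = 0.01) {
  r <- train$y                                            # start with r_i = y_i
  trees <- vector("list", B)
  for (b in 1:B) {
    dat <- train; dat$y <- r
    trees[[b]] <- rpart(y ~ ., data = dat,
                        control = rpart.control(maxdepth = d))
    r <- r - lambda * predict(trees[[b]], train)          # shrink and update the residuals
  }
  trees
}
boost_predict <- function(trees, newdata, lambda = 0.01) {
  lambda * rowSums(sapply(trees, predict, newdata = newdata))   # sum over b of lambda * f_b(x)
}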

Boosting, intuitively
Boosting learns slowly: we first fit the samples that are easiest to predict, then slowly down-weight these cases and move on to the harder samples.

Boosting vs. random forests
[Figure: test classification error versus the number of trees for boosting with depth d = 1, boosting with d = 2, and a random forest with m = √p.]
The shrinkage parameter is λ = 0.01 in each case. We can tune the model by cross-validation over λ, d, and B.
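
A sketch of this tuning with the gbm package (not from the slides; assumes a 0/1 response y in data frames train and test, all names placeholders):

library(gbm)
fit <- gbm(y ~ ., data = train, distribution = "bernoulli",
           n.trees = 5000, interaction.depth = 1, shrinkage = 0.01,
           cv.folds = 5)
best_B <- gbm.perf(fit, method = "cv")                      # CV-selected number of trees B
phat <- predict(fit, newdata = test, n.trees = best_B, type = "response")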