Introduction to Classification & Regression Trees

ISLR Chapter 8
November 8, 2017

Classification and Regression Trees
- Carseats data from the ISLR package
- Binary outcome High: 1 if Sales > 8, 0 otherwise
- Fit a classification tree model to Price and Income
- Pick a predictor X_j and a cutpoint s that split the data into the regions {X_j < s} and {X_j > s} so as to minimize the deviance (or the SSE for regression); this split forms the root node of the tree
- Continue splitting/partitioning the data until a stopping criterion is reached (a node is split only while it contains more than 10 observations and its within-node deviance exceeds 0.01 times the deviance of the root node)
- The prediction in a terminal node is the mean (regression) or the proportion of successes (classification) of the data in that node
- The output is a decision tree
- The regression or classification function is nonlinear in the predictors
- Captures interactions
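To make the splitting rule concrete, here is a minimal sketch (not the tree package's actual implementation) of a greedy search for a single split on one numeric predictor, minimizing SSE for a regression outcome. The function best.split and the example call on Price and Sales are purely illustrative; the example assumes the Carseats data loaded on the next slide.

# Greedy search for one split: try each candidate cutpoint s and keep the one with smallest SSE
best.split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2        # candidate cutpoints between observed values
  sse <- sapply(cuts, function(s) {
    left <- y[x <= s]; right <- y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]                              # cutpoint minimizing the total SSE
}
# e.g. best.split(Carseats$Price, Carseats$Sales)   # best single cutpoint on Price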

Carseat Example

library(tree)
library(ISLR)     # Carseats data
library(dplyr)    # mutate()
data(Carseats)
Carseats = mutate(Carseats, High = factor(ifelse(Carseats$Sales <= 8, "No", "Yes")))
tree.carseats = tree(High ~ Price + Income, data = Carseats)

Carseat Example

plot(tree.carseats)
text(tree.carseats)

[Plot of the fitted tree: a root split at Price < 92.5 (left branch predicts Yes), followed by splits at Price < 142, Income < 60.5, and Income < 62.5]

Partition

partition.tree(tree.carseats)
points(Carseats$Price, Carseats$Income, col = Carseats$High)

[Partition of the (Price, Income) predictor space induced by the tree, with the observations overlaid and colored by High, and regions labeled Yes/No]

Splits

tree.carseats

## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
##
##  1) root 400 541.50 No ( 0.5900 0.4100 )
##    2) Price < 92.5 62 66.24 Yes ( 0.2258 0.7742 ) *
##    3) Price > 92.5 338 434.80 No ( 0.6568 0.3432 )
##      6) Price < 142 287 382.10 No ( 0.6167 0.3833 )
##       12) Income < 60.5 113 128.70 No ( 0.7434 0.2566 ) *
##       13) Income > 60.5 174 240.40 No ( 0.5345 0.4655 ) *
##      7) Price > 142 51 36.95 No ( 0.8824 0.1176 )
##       14) Income < 62.5 19 0.00 No ( 1.0000 0.0000 ) *
##       15) Income > 62.5 32 30.88 No ( 0.8125 0.1875 ) *

Summary

summary(tree.carseats)

## Classification tree:
## tree(formula = High ~ Price + Income, data = Carseats)
## Number of terminal nodes:  5
## Residual mean deviance:  1.18 = 466.2 / 395
## Misclassification error rate: 0.325 = 130 / 400

All Variables

tree.carseats = tree(High ~ . - Sales, data = Carseats)
summary(tree.carseats)

## Classification tree:
## tree(formula = High ~ . - Sales, data = Carseats)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population"
## [6] "Advertising" "Age"         "US"
## Number of terminal nodes:  27
## Residual mean deviance:  0.4575 = 170.7 / 373
## Misclassification error rate: 0.09 = 36 / 400

Overfitting?

Tree

[Plot of the full tree with 27 terminal nodes; splits involve ShelveLoc, Price, US, Income, CompPrice, Population, Advertising, and Age]

Classification Error

set.seed(2)
train = sample(1:nrow(Carseats), 200)
Carseats.test = Carseats[-train, ]
tree.carseats = tree(High ~ . - Sales, data = Carseats, subset = train)
tree.pred = predict(tree.carseats, Carseats.test, type = "class")
table(tree.pred, Carseats.test$High)

##
## tree.pred  No Yes
##       No   86  27
##       Yes  30  57

(30 + 27) / 200  # classification error

## [1] 0.285

Cost-Complexity Pruning

1. Grow a large tree on the training data, stopping when each terminal node has fewer than some minimum number of observations.
2. The prediction for region m is the class c that maximizes π̂_mc (the proportion of class-c observations in region m).
3. Snip off the least important splits via cost-complexity pruning to obtain a sequence of best subtrees indexed by a cost parameter k, minimizing N_miss/N + k|T|, the misclassification error penalized by the number of terminal nodes |T| (see the sketch after this list).
4. Using K-fold cross-validation, compute the average cost-complexity for each k.
5. Pick the subtree with the smallest penalized error.
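As a rough numerical illustration of the penalized quantity in step 3, assuming the training split and tree.carseats from the earlier slides: this sketch only computes the trade-off for one candidate subtree, it is not how prune.misclass searches over subtrees, and k = 0.01 is an arbitrary illustrative value.

sub <- prune.misclass(tree.carseats, best = 9)             # one candidate subtree
pred <- predict(sub, Carseats[train, ], type = "class")    # training-set predictions
miss.rate <- mean(pred != Carseats$High[train])            # N_miss / N
n.leaves <- sum(sub$frame$var == "<leaf>")                 # |T|, number of terminal nodes
k <- 0.01                                                  # illustrative cost parameter
miss.rate + k * n.leaves                                   # penalized misclassification error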

Pruning via Cross Validation

set.seed(2)
cv.carseats = cv.tree(tree.carseats, FUN = prune.misclass)

[Plots of the cross-validated deviance (cv.carseats$dev) against tree size (cv.carseats$size) and against the cost parameter (cv.carseats$k)]

Pruned

prune.carseats = prune.misclass(tree.carseats, best = 9)

[Plot of the pruned tree with 9 terminal nodes; splits on ShelveLoc, Price, Advertising, Age, and CompPrice]

Misclassification after Selection

tree.pred = predict(prune.carseats, Carseats.test, type = "class")
table(tree.pred, Carseats.test$High)

##
## tree.pred  No Yes
##       No   94  24
##       Yes  22  60

(94 + 60) / 200  # classified correctly

## [1] 0.77

Tree with Random Split of Data

[Plot of a tree fit to one random half of the data; splits involve Price, CompPrice, ShelveLoc, Population, Age, Income, and Advertising]

Tree with another Random Split of Data

[Plot of a tree fit to a different random half of the data; the splits differ noticeably from the previous tree]
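The two trees above come from fitting the same model to different random halves of the data. A minimal sketch of how such a comparison might be produced (the seed and the particular 50/50 split are illustrative, not the ones used for the plots above):

set.seed(42)                                                # illustrative seed
half <- sample(1:nrow(Carseats), nrow(Carseats) / 2)        # one random half
other <- setdiff(1:nrow(Carseats), half)                    # the other half
tree.a <- tree(High ~ . - Sales, data = Carseats, subset = half)
tree.b <- tree(High ~ . - Sales, data = Carseats, subset = other)
par(mfrow = c(1, 2))                                        # plot the two trees side by side
plot(tree.a); text(tree.a, pretty = 0)
plot(tree.b); text(tree.b, pretty = 0)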

Bagging: Bootstrap Aggregation
- Splitting the data into random halves and fitting a tree model to each half may lead to very different predictions (high variability).
- Reduce variability by averaging over multiple training sets!
- We do not have access to multiple training sets, so we create them via bootstrap samples (samples of size n drawn with replacement).
- Generate B bootstrap samples of observations from the single training data set.
- Calculate the predictions for the b-th sample, f̂_b(x).
- The bagging (bootstrap aggregation) estimate is f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂_b(x).
- Trees are grown deep, so there is little bias (although one could prune).
- Variance is reduced by averaging many trees across the bootstrap samples (see the sketch below).
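A minimal sketch of the bagging procedure done by hand, assuming the Carseats data and the tree package from the earlier slides; B = 100, the seed, and the choice of the regression target Sales are illustrative (in practice one would typically use the randomForest package with mtry set to the number of predictors).

# Bagging by hand: fit a deep tree to each of B bootstrap samples and average the predictions
library(tree)
set.seed(1)                                               # illustrative seed
B <- 100                                                  # number of bootstrap samples
n <- nrow(Carseats)
preds <- matrix(NA, nrow = n, ncol = B)
for (b in 1:B) {
  boot <- sample(1:n, n, replace = TRUE)                  # bootstrap sample of size n
  fit <- tree(Sales ~ . - High, data = Carseats[boot, ])  # unpruned tree, low bias
  preds[, b] <- predict(fit, newdata = Carseats)          # f̂_b(x) for every observation
}
fhat.bag <- rowMeans(preds)                               # bagged estimate: average over B trees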

Next Class
- Trees are simple to understand, but not as competitive as other supervised learning approaches for prediction/classification.
- Ways to improve trees by combining multiple trees in an ensemble:
  - Bagging
  - Random Forests
  - Boosting
  - BART (Bayesian Additive Regression Trees)
- Combining trees yields improved prediction accuracy, but at some loss of interpretability.