Introduction to Classification & Regression Trees


1 Introduction to Classification & Regression Trees
  ISLR Chapter 8
  November 8, 2017


10 Classification and Regression Trees
  Carseats data from the ISLR package
  Binary outcome High: 1 if Sales > 8, otherwise 0
  Fit a classification tree model to Price and Income
  Pick a predictor X_j and a cutpoint s to split the data into X_j ≤ s and X_j > s so as to minimize the deviance (or SSE for regression); this split defines the root node of the tree
  Continue splitting/partitioning the data until a stopping criterion is reached (a node is split further only while the number of observations in it is > 10 and its within-node deviance is > 0.01 × the deviance of the root node)
  Prediction is the mean (regression) or the proportion of successes (classification) of the data in each terminal node
  Output is a decision tree
  The regression or classification function is nonlinear in the predictors
  Captures interactions
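The split search described on this slide can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the helper names `node_deviance` and `best_split` and the toy data are hypothetical, and the `tree` package's own implementation is not shown here.

```r
# Binomial deviance of a node with 0/1 outcomes y:
# -2 * log-likelihood evaluated at the node's success proportion
node_deviance <- function(y) {
  n <- length(y)
  if (n == 0) return(0)
  p <- mean(y)
  if (p == 0 || p == 1) return(0)   # pure node: 0 * log(0) is treated as 0
  -2 * n * (p * log(p) + (1 - p) * log(1 - p))
}

# Try every predictor and every midpoint between sorted unique values,
# keeping the split X_j <= s / X_j > s with the smallest total deviance
best_split <- function(X, y) {
  best <- list(var = NA, cut = NA, deviance = Inf)
  for (j in names(X)) {
    v <- sort(unique(X[[j]]))
    if (length(v) < 2) next
    cuts <- (head(v, -1) + tail(v, -1)) / 2
    for (s in cuts) {
      left <- X[[j]] <= s
      d <- node_deviance(y[left]) + node_deviance(y[!left])
      if (d < best$deviance) best <- list(var = j, cut = s, deviance = d)
    }
  }
  best
}

# Toy data (hypothetical, not Carseats): the outcome is 1 exactly when price is low
toy <- data.frame(price  = c(80, 85, 90, 120, 130, 140),
                  income = c(50, 90, 60, 70, 40, 95))
high <- c(1, 1, 1, 0, 0, 0)
root_split <- best_split(toy, high)
print(root_split)
```

Applying the same search recursively to the two resulting subsets, until a stopping rule is hit, would grow the rest of the tree.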

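11 Carseat Example
    library(ISLR)    # provides the Carseats data
    library(dplyr)   # provides mutate()
    library(tree)
    data(Carseats)
    # High is "Yes" when Sales > 8, otherwise "No"
    Carseats = mutate(Carseats, High = factor(ifelse(Carseats$Sales <= 8, "No", "Yes")))
    treecarseats = tree(High ~ Price + Income, data = Carseats)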

12 Carseat Example
    plot(treecarseats)
    text(treecarseats)
  [Figure: fitted classification tree with splits on Price and Income; terminal nodes labeled No or Yes]

13 Partition
    partition.tree(treecarseats)
    points(Carseats$Price, Carseats$Income, col = Carseats$High)
  [Figure: partition of the (Price, Income) plane induced by the tree, with observations colored by High]

14 Splits
    treecarseats
    ## node), split, n, deviance, yval, (yprob)
    ##       * denotes terminal node
    ##
    ##  1) root ( )
    ##    2) Price < Yes ( ) *
    ##    3) Price > ( )
    ##      6) Price < ( )
    ##       12) Income < ( )
    ##       13) Income > ( )
    ##      7) Price > ( )
    ##       14) Income < ( ) *
    ##       15) Income > ( ) *

15 Summary
    summary(treecarseats)
    ##
    ## Classification tree:
    ## tree(formula = High ~ Price + Income, data = Carseats)
    ## Number of terminal nodes:  5
    ## Residual mean deviance:  1.18 = 466.2 / 395
    ## Misclassification error rate: 0.325 = 130 / 400
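The denominators in this summary can be checked by hand: the residual mean deviance divides the total deviance by the number of observations minus the number of terminal nodes (400 - 5 = 395), while the misclassification rate divides by all 400 observations. A small arithmetic sketch, with variable names chosen only for illustration:

```r
n <- 400             # observations, from the "/ 400" in the summary
leaves <- 5          # number of terminal nodes reported by summary()
total_dev <- 466.2   # total residual deviance reported by summary()

# Residual mean deviance: deviance per residual degree of freedom
resid_mean_dev <- total_dev / (n - leaves)

# Misclassification error rate: misclassified observations over all observations
misclass_rate <- 130 / n

print(round(resid_mean_dev, 2))
print(misclass_rate)
```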

16 All Variables
    # Use every predictor except Sales, which defines High
    treecarseats = tree(High ~ . - Sales, data = Carseats)
    summary(treecarseats)
    ##
    ## Classification tree:
    ## tree(formula = High ~ . - Sales, data = Carseats)
    ## Variables actually used in tree construction:
    ## [1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population"
    ## [6] "Advertising" "Age"         "US"
    ## Number of terminal nodes:  27
    ## Residual mean deviance:   = 170.7 / 373
    ## Misclassification error rate: 0.09 = 36 / 400
  Overfitting?

17 Tree
  [Figure: full classification tree with 27 terminal nodes; splits on ShelveLoc, Price, US, Income, CompPrice, Population, Advertising and Age]

18 Classification Error
    set.seed(2)
    train = sample(1:nrow(Carseats), 200)    # 200 training rows
    Carseatstest = Carseats[-train, ]        # remaining 200 rows for testing
    treecarseats = tree(High ~ . - Sales, data = Carseats, subset = train)
    treepred = predict(treecarseats, Carseatstest, type = "class")
    table(treepred, Carseatstest$High)
    ##
    ## treepred No Yes
    ##      No
    ##      Yes
    ( ) /200 # classification error
    ## [1] 0.285
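To make the step from the confusion table to the error rate explicit, here is a minimal base R sketch. The `truth` and `pred` vectors are hypothetical stand-ins, not the Carseats results, whose table entries are not reproduced above.

```r
# Hypothetical true labels and predictions (illustration only)
truth <- factor(c("No", "No", "No", "Yes", "Yes", "No", "Yes", "No"),
                levels = c("No", "Yes"))
pred  <- factor(c("No", "Yes", "No", "Yes", "No", "No", "Yes", "No"),
                levels = c("No", "Yes"))

# Rows are predictions, columns are true classes, as in table(treepred, ...)
confusion <- table(pred, truth)

# Correct classifications sit on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
error_rate <- 1 - accuracy

print(confusion)
print(error_rate)
```

The classification error is the off-diagonal total divided by the number of test observations, which is the same as `mean(pred != truth)`.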


24 Cost-Complexity Pruning
  1 Grow a large tree on the training data, stopping when each terminal node has fewer than some minimum number of observations
  2 Prediction for region m is the class c that maximizes $\hat{\pi}_{mc}$, the estimated proportion of class c in region m
  3 Snip off the least important splits via cost-complexity pruning to obtain a sequence of best subtrees indexed by the cost parameter k, using the criterion $\frac{N_{\text{miss}}}{N} + k\,|T|$: misclassification error penalized by the number of terminal nodes |T|
  4 Using K-fold cross-validation, compute the average cost-complexity for each k
  5 Pick the subtree with the smallest penalized error
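The effect of the cost parameter k in step 3 can be seen with a small numeric sketch. The subtree sizes and misclassification counts below are hypothetical, and the cross-validation of step 4 is omitted; the point is only how the penalty shifts the choice toward smaller trees.

```r
# Hypothetical nested subtrees: number of terminal nodes |T| and
# number of misclassified training observations N_miss for each
sizes  <- c(1, 2, 4, 8, 16)
n_miss <- c(80, 60, 45, 40, 38)
N <- 200

# Penalized error N_miss / N + k * |T| for every subtree at a given k
penalized_error <- function(k) n_miss / N + k * sizes

# Size of the subtree with the smallest penalized error
best_size <- function(k) sizes[which.min(penalized_error(k))]

print(best_size(0))      # no penalty: the largest tree wins
print(best_size(0.01))
print(best_size(0.05))
```

With k = 0 the largest tree always has the smallest error; as k grows, progressively smaller subtrees minimize the criterion.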

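25 Pruning via Cross Validation
    set.seed(2)
    # Cross-validate the pruning sequence, using misclassification as the criterion
    cvcarseats = cv.tree(treecarseats, FUN = prune.misclass)
  [Figure: cvcarseats$dev plotted against cvcarseats$size and against cvcarseats$k]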

26 Pruned
    # Keep the best subtree with 9 terminal nodes
    prunecarseats = prune.misclass(treecarseats, best = 9)
  [Figure: pruned tree with 9 terminal nodes; splits on ShelveLoc, Price, Advertising, Age and CompPrice]

27 Misclassification after Selection
    treepred = predict(prunecarseats, Carseatstest, type = "class")
    table(treepred, Carseatstest$High)
    ##
    ## treepred No Yes
    ##      No
    ##      Yes
    (94 + 60)/200 # classified correctly
    ## [1] 0.77

28 Tree with Random Split of Data
  [Figure: classification tree fit to one random split of the data; the root split is on Price]

29 Tree with another Random Split of Data
  [Figure: classification tree fit to another random split of the data; the root split is on ShelveLoc]


37 Bagging: Bootstrap Aggregation
  Splitting the data into random partitions and fitting a tree model to each half may lead to very different predictions (high variability)
  Reduce variability by averaging over multiple training sets!
  We do not have access to multiple training sets, so create them via bootstrap samples (samples of size n drawn with replacement)
  Generate B bootstrap samples of observations from the single training data set
  Calculate predictions for the bth sample, $\hat{f}_b(x)$
  Bagging (Bootstrap Aggregation) estimate is $\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)$
  Trees are grown deep, so they have little bias (although one could prune)
  Reduce variance by averaging many trees across the bootstrap samples
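The bagging recipe above can be sketched in base R for a regression problem with one predictor. Everything here is an assumption made for illustration: `grow_tree`, `predict_tree` and `bag_predict` are hypothetical helpers, the tree grower is a bare-bones SSE splitter rather than the `tree` package, and the data are simulated.

```r
# Grow a regression tree on one numeric predictor by recursive
# SSE-minimizing splits; nodes with fewer than 2 * min_size points become leaves
grow_tree <- function(x, y, min_size = 5) {
  v <- sort(unique(x))
  if (length(y) < 2 * min_size || length(v) < 2) {
    return(list(leaf = TRUE, value = mean(y)))
  }
  cuts <- (head(v, -1) + tail(v, -1)) / 2
  sse <- sapply(cuts, function(s) {
    l <- y[x <= s]
    r <- y[x > s]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  s <- cuts[which.min(sse)]
  left <- x <= s
  list(leaf = FALSE, cut = s,
       left  = grow_tree(x[left],  y[left],  min_size),
       right = grow_tree(x[!left], y[!left], min_size))
}

# Walk down the tree to the leaf containing x0 and return its mean
predict_tree <- function(node, x0) {
  while (!node$leaf) node <- if (x0 <= node$cut) node$left else node$right
  node$value
}

# Bagged prediction at x0: average the predictions of B trees,
# each fit to a bootstrap sample of size n drawn with replacement
bag_predict <- function(x, y, x0, B = 50) {
  n <- length(y)
  preds <- replicate(B, {
    idx <- sample(n, n, replace = TRUE)
    predict_tree(grow_tree(x[idx], y[idx]), x0)
  })
  mean(preds)   # (1/B) * sum over b of f_b(x0)
}

# Simulated data (hypothetical): noisy sine curve
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)
bagged <- bag_predict(x, y, x0 = 5)
print(bagged)
```

For classification, the same loop would typically average predicted class proportions or take a majority vote over the B trees instead of averaging numeric predictions.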


43 Next Class
  Trees are simple to understand, but often not competitive with other supervised learning approaches for prediction/classification
  Ways to improve trees by combining multiple trees in an ensemble:
    Bagging
    Random Forests
    Boosting
    BART (Bayesian Additive Regression Trees)
  Combining trees will yield improved prediction accuracy, but with a loss of interpretability
