Tree-based methods for classification and regression


Tree-based methods for classification and regression
Ryan Tibshirani
Data Mining: 36-462/36-662, April 11 2013
Optional reading: ISL 8.1, ESL 9.2

Tree-based methods

Tree-based methods for predicting y from a feature vector x ∈ R^p divide up the feature space into rectangles, and then fit a very simple model in each rectangle. This works both when y is discrete and when it is continuous, i.e., both for classification and regression.

Rectangles can be achieved by making successive binary splits on the predictor variables X_1, ..., X_p. I.e., we choose a variable X_j, j = 1, ..., p, and divide up the feature space according to

  X_j <= c and X_j > c

Then we proceed on each half.

For simplicity, consider classification first (regression later). If a half is pure, meaning that it mostly contains points from one class, then we don't need to continue splitting; otherwise, we continue splitting.

Example: simple classification tree

Example: n = 500 points in p = 2 dimensions, falling into classes 0 and 1, as marked by colors.

[Scatterplot of the 500 points over (x1, x2), colored by class.]

Does dividing up the feature space into rectangles look like it would work here?

[Fitted classification tree with splits x.2 < 0.111, x.1 >= 0.4028, x.2 >= 0.4993, x.1 < 0.5998, x.2 < 0.598, leaf class counts (e.g., 0: 60/0, 1: 0/71), and the corresponding rectangular partition of the (x1, x2) plane.]

Example: HCV treatment flow chart

(From http://hcv.org.nz/wordpress/?tag=treatment-flow-chart)

Classification trees

Classification trees are popular because they are interpretable, and maybe also because they mimic the way (some) decisions are made.

Let (x_i, y_i), i = 1, ..., n be the training data, where y_i ∈ {1, ..., K} are the class labels, and x_i ∈ R^p measure the p predictor variables. The classification tree can be thought of as defining m regions (rectangles) R_1, ..., R_m, each corresponding to a leaf of the tree.

We assign each R_j a class label c_j ∈ {1, ..., K}. We then classify a new point x ∈ R^p by

  f̂_tree(x) = Σ_{j=1}^m c_j 1{x ∈ R_j} = c_j such that x ∈ R_j

Finding out which region a given point x belongs to is easy, since the regions R_j are defined by a tree: we just scan down the tree. Otherwise, it would be a lot harder (we would need to look at each region).
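For concreteness, here is a minimal R sketch (not from the lecture; the representation of the regions and the function name are invented for illustration) of what prediction would look like without the tree structure: a linear scan that checks each rectangle R_j in turn.

```r
# A point x is classified by finding the rectangle that contains it and
# returning that rectangle's label c_j.
predict_by_scan <- function(x, regions, labels) {
  # regions: list of p x 2 matrices giving (lower, upper) limits per variable
  # labels:  vector of class labels c_j, one per region
  for (j in seq_along(regions)) {
    R <- regions[[j]]
    if (all(x >= R[, 1] & x <= R[, 2])) return(labels[j])
  }
  NA  # should not happen if the regions form a partition of the feature space
}
```

With a tree, an observation instead follows a single root-to-leaf path, so for a reasonably balanced tree the lookup takes roughly log2(m) comparisons rather than up to m rectangle checks.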

Example: regions defined by a tree

(From ESL page 306)

Example: regions not defined by a tree

(From ESL page 306)

Predicted class probabilities

With classification trees, we can get not only the predicted classes for new points but also the predicted class probabilities.

Note that each region R_j contains some subset of the training data (x_i, y_i), i = 1, ..., n, say, n_j points. The predicted class c_j is just the most commonly occurring class among these points. Further, for each class k = 1, ..., K, we can estimate the probability that the class label is k given that the feature vector lies in region R_j, P(C = k | X ∈ R_j), by

  p̂_k(R_j) = (1/n_j) Σ_{x_i ∈ R_j} 1{y_i = k}

the proportion of points in the region that are of class k. We can now express the predicted class as

  c_j = argmax_{k=1,...,K} p̂_k(R_j)
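As a minimal sketch (not from the lecture; the function and argument names are invented for illustration), the estimate p̂_k(R_j) and the resulting predicted class can be computed directly from the labels of the training points falling in R_j:

```r
# y_region: labels y_i of the n_j training points whose x_i lie in region R_j
# K:        total number of classes
class_probs <- function(y_region, K) {
  n_j <- length(y_region)
  sapply(1:K, function(k) sum(y_region == k) / n_j)   # p-hat_k(R_j), k = 1..K
}

p_hat <- class_probs(y_region = c(1, 1, 2, 1, 3), K = 3)  # 0.6, 0.2, 0.2
c_j   <- which.max(p_hat)                                 # predicted class: argmax over k
```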

Trees provide a good balance

Model   Assumptions?   Estimated probabilities?   Interpretable?   Flexible?   Predicts well?
LDA     Yes            Yes                        Yes              No          Depends on X
LR      Yes            Yes                        Yes              No          Depends on X
k-NN    No             No                         No               Yes         If properly tuned
Trees   No             Yes                        Yes              Somewhat    ?

How to build trees?

There are two main issues to consider in building a tree:
1. How to choose the splits?
2. How big to grow the tree?

Think first about varying the depth of the tree... which is more complex, a big tree or a small tree? What tradeoff is at play here? How might we eventually consider choosing the depth?

Now for a fixed depth, consider choosing the splits. If the tree has depth d (and is balanced), then it has 2^d nodes. At each node we could choose any of the p variables for the split, so the number of possibilities is p^(2^d). This is huge even for moderate d (e.g., with p = 10 and d = 5 it is already 10^32)! And we haven't even counted the actual split points themselves.

The CART algorithm

The CART algorithm[1] chooses the splits in a top-down fashion: it first chooses the variable to split on at the root, then the variables at the second level, etc. At each stage we choose the split to achieve the biggest drop in misclassification error; this is called a greedy strategy.

In terms of tree depth, the strategy is to grow a large tree and then prune at the end. Why grow a large tree and prune, instead of just stopping at some point? Because any stopping rule may be short-sighted, in that a split may look bad but may lead to a good split below it.

[Partition of the (x1, x2) plane after successive splits.]

[1] Breiman et al. (1984), Classification and Regression Trees

Recall that in a region R_m, the proportion of points in class k is

  p̂_k(R_m) = (1/n_m) Σ_{x_i ∈ R_m} 1{y_i = k}

The CART algorithm begins by considering splitting on variable j and split point s, and defines the regions

  R_1 = {X ∈ R^p : X_j <= s},  R_2 = {X ∈ R^p : X_j > s}

We then greedily choose j, s by minimizing the misclassification error

  argmin_{j,s} ( [1 - p̂_{c_1}(R_1)] + [1 - p̂_{c_2}(R_2)] )

Here c_1 = argmax_{k=1,...,K} p̂_k(R_1) is the most common class in R_1, and c_2 = argmax_{k=1,...,K} p̂_k(R_2) is the most common class in R_2.
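Here is a minimal R sketch of this greedy step (not the CART implementation itself; names are invented for illustration), scoring every variable j and candidate split point s by the misclassification error above:

```r
# X: n x p matrix of predictors within the current region; y: class labels
best_split <- function(X, y) {
  best <- list(err = Inf, j = NA, s = NA)
  for (j in 1:ncol(X)) {
    for (s in sort(unique(X[, j]))) {     # only the observed values need checking
      left  <- y[X[, j] <= s]
      right <- y[X[, j] >  s]
      if (length(left) == 0 || length(right) == 0) next
      # [1 - p-hat_c1(R1)] + [1 - p-hat_c2(R2)]
      err <- (1 - max(table(left))  / length(left)) +
             (1 - max(table(right)) / length(right))
      if (err < best$err) best <- list(err = err, j = j, s = s)
    }
  }
  best
}
```

Actual CART implementations typically weight the two terms by the number of points falling on each side of the split; the unweighted sum above simply mirrors the criterion as written here.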

Having done this, we now repeat the process within each of the newly defined regions R_1, R_2. That is, we again consider splitting on all variables j and split points s, within each of R_1 and R_2, this time greedily choosing the pair that provides the biggest improvement in misclassification error.

How do we find the best split s? Aren't there infinitely many to consider? No: to split a region R_m on a variable j, we really only need to consider n_m splits (or n_m - 1 splits), one at (or between) each observed value of X_j within the region.

[Two panels showing the partition of the (x1, x2) plane before and after a further split.]

Continuing on in this manner, we will get a big tree T_0. Its leaves define regions R_1, ..., R_m. We then prune this tree, meaning that we collapse some of its leaves into the parent nodes.

For any tree T, let |T| denote its number of leaves. We define

  C_α(T) = Σ_{j=1}^{|T|} [1 - p̂_{c_j}(R_j)] + α|T|

We seek the subtree T ⊆ T_0 that minimizes C_α(T). It turns out that this can be done by pruning the weakest leaf one at a time. Note that α is a tuning parameter, and a larger α yields a smaller tree. CART picks α by 5- or 10-fold cross-validation.
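As a small illustration (not from the lecture; the helper function is invented), the criterion C_α(T) is just the summed leaf misclassification rates plus α times the number of leaves:

```r
# miscl: vector of leaf misclassification rates 1 - p-hat_cj(R_j), one per leaf
cost_complexity <- function(miscl, alpha) {
  sum(miscl) + alpha * length(miscl)   # length(miscl) = |T|, the number of leaves
}

cost_complexity(miscl = c(0.05, 0.10, 0.00, 0.20), alpha = 0.1)  # 0.35 + 0.4 = 0.75
```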

Example: simple classification tree

Example: n = 500, p = 2, and K = 2. We ran CART:

[Fitted tree with splits x.2 < 0.111, x.1 >= 0.4028, x.2 >= 0.4993, x.1 < 0.5998, x.2 < 0.598, and the corresponding rectangular partition of the (x1, x2) plane.]

To use CART in R, you can use either of the functions rpart or tree, in the packages of those same names. When you call rpart, cross-validation is performed automatically; when you call tree, you must then call cv.tree for cross-validation.
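The following is a minimal usage sketch on assumed simulated data (the data-generating step and object names are invented for illustration; rpart, printcp, prune, tree, cv.tree, and prune.misclass are the actual functions from the rpart and tree packages):

```r
library(rpart)
library(tree)

set.seed(1)
n  <- 500
x1 <- runif(n); x2 <- runif(n)
y  <- factor(as.numeric(x2 > 0.5 & x1 < 0.6))   # assumed class boundary
dat <- data.frame(x1, x2, y)

# rpart: cross-validation runs automatically; inspect it and prune
fit_rpart <- rpart(y ~ x1 + x2, data = dat, method = "class")
printcp(fit_rpart)                                   # CV error per complexity value
best_cp    <- fit_rpart$cptable[which.min(fit_rpart$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit_rpart, cp = best_cp)

# tree: cross-validation must be requested separately via cv.tree
fit_tree <- tree(y ~ x1 + x2, data = dat)
cv_out   <- cv.tree(fit_tree, FUN = prune.misclass)
fit_tree_pruned <- prune.misclass(fit_tree, best = cv_out$size[which.min(cv_out$dev)])
```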

Example: spam data

Example: n = 4601 emails, of which 1813 are considered spam. For each email we have p = 58 attributes. The first 54 measure the frequencies of 54 key words or characters (e.g., "free", "need", "$"). The last 3 measure the average length of uninterrupted sequences of capitals; the length of the longest uninterrupted sequence of capitals; and the sum of lengths of uninterrupted sequences of capitals. (Data from ESL section 9.2.5)

An aside: how would we possibly get thousands of emails labeled as spam or not? This is great! Every time you label an email as spam, Gmail has more training data.

Cross-validation error curve for the spam data (from ESL page 314):

Tree of size 17, chosen by cross-validation (from ESL page 315):

Other impurity measures

We used misclassification error as a measure of the impurity of region R_j,

  1 - p̂_{c_j}(R_j)

But there are other useful measures too: the Gini index,

  Σ_{k=1}^K p̂_k(R_j) [1 - p̂_k(R_j)],

and the cross-entropy or deviance,

  - Σ_{k=1}^K p̂_k(R_j) log p̂_k(R_j).

Using these measures instead of misclassification error is sometimes preferable because they are more sensitive to changes in class probabilities. Overall, they are all pretty similar (Homework 6).
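A minimal sketch (invented helper, not from the lecture) computing all three impurity measures from the vector of class proportions in a region:

```r
# p_hat: vector (p-hat_1(R_j), ..., p-hat_K(R_j)), summing to 1
impurity <- function(p_hat) {
  c(misclassification = 1 - max(p_hat),
    gini              = sum(p_hat * (1 - p_hat)),
    deviance          = -sum(ifelse(p_hat > 0, p_hat * log(p_hat), 0)))
}

impurity(c(0.7, 0.2, 0.1))   # fairly pure region
impurity(c(0.4, 0.3, 0.3))   # more mixed region: all three measures increase
```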

Regression trees

Suppose that now we want to predict a continuous outcome instead of a class label. Essentially, everything follows as before, but now we just fit a constant inside each rectangle.

The estimated regression function has the form

  f̂_tree(x) = Σ_{j=1}^m c_j 1{x ∈ R_j} = c_j such that x ∈ R_j

just as it did with classification. The quantities c_j are no longer predicted classes; instead they are real numbers. How would we choose these? Simple: just take the average response of all of the points in the region,

  c_j = (1/n_j) Σ_{x_i ∈ R_j} y_i

The main difference in building the tree is that we use squared error loss instead of misclassification error (or the Gini index or deviance) to decide which region to split. Also, with squared error loss, choosing c_j as above is optimal.
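A minimal sketch of the regression analogue (not from the lecture; names invented for illustration): within a region, the fitted constant is the mean response, and a candidate split is scored by the squared error loss it leaves behind.

```r
# x: one predictor restricted to the current region; y: corresponding responses
best_split_sse <- function(x, y) {
  best <- list(sse = Inf, s = NA)
  for (s in sort(unique(x))) {
    left <- y[x <= s]; right <- y[x > s]
    if (length(left) == 0 || length(right) == 0) next
    # each side is fit by its mean (the optimal constant under squared error)
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse) best <- list(sse = sse, s = s)
  }
  best
}
```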

How well do trees predict?

Trees seem to have a lot of things going in their favor. So how is their predictive ability? Unfortunately, the answer is not great. Of course, at a high level, the prediction error is governed by bias and variance, which in turn have some relationship with the size of the tree (number of nodes): a larger tree means smaller bias and higher variance, and a smaller tree means larger bias and smaller variance.

But trees generally suffer from high variance because they are quite unstable: a small change in the observed data can lead to a dramatically different sequence of splits, and hence a different prediction. This instability comes from their hierarchical nature; once a split is made, it is permanent and can never be unmade further down in the tree.

We'll learn that some variations of trees have much better predictive abilities. However, their prediction rules aren't as transparent.

Recap: trees for classification and regression

In this lecture, we learned about trees for classification and regression. Using trees, we divide the feature space up into rectangles by making successive splits on different variables, and then within each rectangle (leaf of the tree), the predictive task is greatly simplified. I.e., in classification, we just predict the most commonly occurring class, and in regression, we just take the average response value of the points in the region.

The space of possible trees is huge, but we can fit a good tree using a greedy strategy, as is done by the CART algorithm. CART also grows a large tree, and then prunes back at the end, choosing how much to prune by cross-validation.

Trees are model-free and easy to interpret, but generally speaking, they aren't very powerful in terms of predictive ability. Next time we'll learn some procedures that use trees to make excellent prediction engines (though in a way we lose interpretability).

Next time: bagging

Fitting small trees on bootstrapped data sets, and averaging predictions at the end, can greatly reduce the prediction error (from ESL page 285):